## Free datasets

There are 4 datasets:
- Backblaze dataset (https://www.kaggle.com/datasets/thedevastator/hard-drive-reliability-data-set; newer data is available at https://www.backblaze.com/b2/hard-drive-test-data.html)
- University of California dataset (available in 2006 at http://cmrr.ucsd.edu/smart; now found via the Wayback Machine at https://web.archive.org/web/20100611213812/http://cmrr.ucsd.edu/people/hughes/smart/dataset/harddrive1.zip)
- Quantum Corporation dataset -- ??
- Baidu dataset (https://www.kaggle.com/datasets/drtycoon/baidu-hdds-dataset-2017 can probably be used, but it has 14 columns, while the dataset described in the papers has 23 columns)

## Sets of features used in the papers

### Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application (University of California dataset)

Attributes in the set of 25 are: GList1, PList, Servo1, Servo2, Servo3, Servo5, ReadError1, ReadError2, ReadError3, FlyHeight5, FlyHeight6, FlyHeight7, FlyHeight8, FlyHeight9, FlyHeight10, FlyHeight11, FlyHeight12, ReadError18, ReadError19, Servo7, Servo8, ReadError20, GList2, GList3, Servo10.

Single attribute tests using rank-sum were run on all 25 attributes selected in Section 3.3 with 15 samples per pattern. Of these 25, only 8 attributes (Figure 9) were able to detect failures at sufficiently low false alarm rates: ReadError1, ReadError2, ReadError3, ReadError18, ReadError19, Servo7, GList3 and Servo10. Confirming the observations of the feature selection process, ReadError18 was the best attribute, with 27.6% detection at 0.06% false alarms.

Using combinations of attributes in the rank-sum test can lead to improved results over single-attribute classifiers (Figure 11). The best single attributes from Figure 9 were ReadError1, ReadError3, ReadError18 and ReadError19. Using these four attributes and 15 samples per pattern, the rank-sum test detected 28.1% of the failures, with no measured false alarms. Higher detection rates (52.8%) can be had if more false alarms are allowed (0.7%). 
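As a rough illustration of the single-attribute rank-sum idea quoted above, the sketch below compares a 15-sample window from a drive under test against a reference set of good-drive values using `scipy.stats.ranksums`. The data is synthetic, and `ALPHA` and `raises_alarm` are hypothetical names chosen for this example. The paper's exact procedure (reference set construction, tie handling, combining several attributes) is not reproduced here.

```python
import numpy as np
from scipy.stats import ranksums

# Synthetic stand-ins (assumption): one attribute measured on known-good
# drives, plus two 15-sample windows from drives under test.
reference = np.linspace(0.0, 1.0, 50)
healthy_window = np.linspace(0.02, 0.98, 15)
failing_window = np.linspace(2.0, 3.0, 15)

ALPHA = 0.001  # hypothetical threshold; a lower value means fewer false alarms


def raises_alarm(window, reference, alpha=ALPHA):
    """Flag the window if its values rank significantly higher than the reference."""
    statistic, pvalue = ranksums(window, reference)
    # A positive statistic means the window tends to hold larger values.
    return bool(statistic > 0 and pvalue < alpha)


print(raises_alarm(healthy_window, reference))  # False
print(raises_alarm(failing_window, reference))  # True
```

Lowering `ALPHA` trades detection rate for fewer false alarms, which mirrors the trade-off described in the quoted results.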


### Bayesian Approaches to Failure prediction for Disk Drives (Quantum dataset)

| Abbreviation | Description |
|---|---|
| RET | read error rate |
| SUT | spinup time |
| CSS | start-stop count |
| GDC | grown defects count |
| SKE | seek errors count |
| POH | power-on hours |
| RRT | calibration retries |
| PCC | power cycles count |
| RSE | read soft errors count |
| DMC | CRC errors count |
| OSS | offline surface scan |


### Health Monitoring of Hard Disk Drive Based on Mahalanobis Distance


It is found that 60% of drive failures are mechanical, often resulting from the gradual degradation of the drive's performance. The key areas include:
- Head disk interface (HDI, including head and disk, also known as air bearing): cracked head, broken head, head contamination, bad connection to the electronics module; disk scratches, defects, bad servo pattern, flying height variation and modulation.
- Head stack assembly: off-track, deformation.
- Motors/bearings: motor failure, worn bearing, excessive run-out, no spin.
- Electronic module: circuit/chip failure, bad connection to drive or bus.


Typical SMART characteristics are:
- Head flying height -- the distance between the read/write head and the platter. Fly height variation can leave the media insufficiently magnetized, so the data is not readable. Physical bumps or shocks during reading or writing make the head vibrate strongly, which can induce read/write failures.
- Data throughput performance -- general throughput performance of the hard disk. Indicates problems with the motor, servo or bearings.
- Spin up time -- average time (in milliseconds or seconds) of spindle spin-up, from zero RPM (revolutions per minute) to fully operational. A low value means the hard disk takes too long to reach a fully operational state.
- Re-allocated sector count -- the number of sectors that the hard drive marked as reallocated after an error. A growing count is generally considered a bad sign and can result in hard drive failure.
- Seek error rate -- rate of positioning errors of the read/write heads. Indicates problems with the servo or head. High temperature can also cause this problem.
- Seek time performance -- the average performance of seek operations of the hard disk’s magnetic heads.
- Spin retry count -- retry count of spin start attempts. Indicates problems with the motor, bearings or power supply.
- Drive calibration retries count -- number of attempts to calibrate a drive. Indicates problems with the motor, bearings or power supply.



Here is the full [list](https://www.hdsentinel.com/smart/smartattr.php) of SMART parameter descriptions.

Some conclusions about the parameters:
- the head disk interface is the dominant contributor to HDD reliability
- wear-out and overstress of the magnetic head and disk, and resonance of the head assembly, are categorized as potential failure mechanisms with high risk
- failure modes of the spindle motor and control board have low priority.
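To make the Mahalanobis-distance idea behind this paper concrete, here is a minimal sketch: estimate the mean and covariance from a healthy baseline, then score a new observation by its distance from that baseline. The three attributes, their values, and the `mahalanobis` helper are assumptions for illustration only. The paper's preprocessing and alarm thresholds are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic baseline (assumption): 200 healthy observations of 3 SMART attributes.
healthy = rng.normal(loc=[100.0, 50.0, 10.0], scale=[5.0, 2.0, 1.0], size=(200, 3))

mean = healthy.mean(axis=0)
cov = np.cov(healthy, rowvar=False)
cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a near-singular covariance


def mahalanobis(x, mean, cov_inv):
    """Distance of observation x from the healthy baseline, in covariance units."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(max(diff @ cov_inv @ diff, 0.0)))


typical = mean                                  # an observation at the baseline centre
degraded = mean + np.array([25.0, 10.0, 5.0])   # each attribute about 5 std. dev. off

print(mahalanobis(typical, mean, cov_inv))   # 0.0
print(mahalanobis(degraded, mean, cov_inv))  # much larger than for typical
```

A drive whose distance keeps growing over time would be the candidate for an alarm; where to set that threshold is left open here.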

### HMM, HSMM

We then ran our HMM and HSMM predictors and found that four attributes provided good failure detection, namely ReadError18, Servo2, Servo10, and FlyHeight7.

### Autoencoders (Backblaze dataset)

SMART attributes used in the experiments:

| SMART ID | Attribute Name |
| --- | --- |
| 1 | Read Error Rate |
| 3 | Spin-Up Time |
| 4 | Start/Stop Count |
| 5 | Reallocated Sectors Count |
| 7 | Seek Error Rate |
| 9 | Power-On Hours |
| 10 | Spin Retry Count |
| 12 | Power Cycle Count |
| 183 | SATA Downshift Error Count |
| 184 | End-to-End error / IOEDC |
| 187 | Reported Uncorrectable Errors |
| 188 | Command Timeout |
| 189 | High Fly Writes |
| 190 | Temperature Difference |
| 191 | G-sense Error Rate |
| 192 | Unsafe Shutdown Count |
| 193 | Load Cycle Count |
| 194 | Temperature |
| 197 | Current Pending Sector Count |
| 198 | Uncorrectable Sector Count |
| 199 | UltraDMA CRC Error Count |
| 240 | Head Flying Hours |
| 241 | Total LBAs Written |
| 242 | Total LBAs Read |
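The IDs in this table can be mapped to Backblaze column names, which follow the `smart_<id>_raw` / `smart_<id>_normalized` pattern visible in the CSV header shown later in these notes. The helper below is a small sketch; `SMART_IDS` and `smart_columns` are hypothetical names, and whether the paper used raw or normalized values is not stated here, so `kind="raw"` is only an assumed default.

```python
# SMART IDs copied from the table above.
SMART_IDS = [1, 3, 4, 5, 7, 9, 10, 12, 183, 184, 187, 188, 189, 190, 191,
             192, 193, 194, 197, 198, 199, 240, 241, 242]


def smart_columns(ids, kind="raw"):
    """Build Backblaze-style column names; kind is 'raw' or 'normalized'."""
    if kind not in ("raw", "normalized"):
        raise ValueError("kind must be 'raw' or 'normalized'")
    return [f"smart_{i}_{kind}" for i in ids]


columns = smart_columns(SMART_IDS)
print(len(columns))   # 24
print(columns[:3])    # ['smart_1_raw', 'smart_3_raw', 'smart_4_raw']
```

With a loaded DataFrame, something like `df_bb[[c for c in columns if c in df_bb.columns]]` would keep only the attributes that are actually present in a given file.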



For the PCA method, we performed the transformation and selected the eigenvectors that preserve 90% of the variance, resulting in 8 features out of the 24 described in Table I. For the autoencoder, a neural network was trained with hidden layers of size (15-8-15) and an output layer of size 24 (the number of input dimensions), using the ReLU activation function and backpropagation with L2 regularization.
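The PCA step described above (keep enough components to preserve 90% of the variance) can be sketched with plain NumPy. This is an illustration on synthetic 24-dimensional data, not the paper's code: `pca_keep_variance` is a hypothetical helper, and whether the authors standardized the features beforehand is not stated, so no scaling is applied here. The autoencoder is not sketched.

```python
import numpy as np


def pca_keep_variance(X, target=0.90):
    """Project X onto the fewest principal components reaching the target variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # First index where the cumulative explained variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), target)) + 1
    k = min(k, len(s))
    return Xc @ Vt[:k].T, explained[:k]


# Synthetic data (assumption): 24 features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 24))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 24))

Z, explained = pca_keep_variance(X)
print(Z.shape, explained.sum())
```

On real SMART data the number of retained components depends on the data itself; the 8 components quoted above are the paper's result, not something this sketch reproduces.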


## Datasets EDAs

### University of California dataset analysis

In [4]:
from scipy.io import arff

import pandas as pd

data = arff.loadarff('./uc/harddrive1.arff')
df_uc = pd.DataFrame(data[0])

df_uc.head()

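| | serial | Frame | Hours | HoursBeforeFail | Temp1 | Temp2 | Temp3 | Temp4 | FlyHeight1 | FlyHeight2 | ... | ReadError18 | ReadError19 | Servo7 | Servo8 | ReadError20 | GList2 | GList3 | Servo9 | Servo10 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b'100001' | 0.0 | 0.0 | 2.216 | 10.0 | 0.0 | 0.0 | 10.0 | 7962.0 | 8986.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'1' |
| 1 | b'100001' | 1.0 | 0.0 | 2.216 | 12.0 | 0.0 | 0.0 | 12.0 | 7972.0 | 8991.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'1' |
| 2 | b'100001' | 2.0 | 0.016 | 2.2 | 11.0 | 0.0 | 10.0 | 11.0 | 7949.0 | 8981.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 128.0 | b'1' |
| 3 | b'100001' | 6.0 | 0.05 | 2.166 | 9.0 | 0.0 | 0.0 | 11.0 | 7955.0 | 8982.0 | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 576.0 | b'1' |
| 4 | b'100001' | 7.0 | 0.083 | 2.133 | 7.0 | 0.0 | 0.0 | 9.0 | 7964.0 | 8984.0 | ... | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 576.0 | b'1' |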
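The `b'100001'` and `b'1'` values in the preview are byte strings, which is how `scipy.io.arff.loadarff` returns nominal and string attributes. A small sketch for decoding them into regular strings follows; `decode_bytes_columns` is a hypothetical helper, and the sample frame only mimics a few columns of the real data.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df with bytes values in object columns decoded to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Tiny stand-in for df_uc (assumption): same style of values as the preview above.
sample = pd.DataFrame({
    "serial": [b"100001", b"100001"],
    "Frame": [0.0, 1.0],
    "class": [b"1", b"1"],
})

clean = decode_bytes_columns(sample)
print(clean["serial"].tolist())  # ['100001', '100001']
```

Applied to the real data this would be `df_uc = decode_bytes_columns(df_uc)`, after which `class` could be cast to an integer label if needed.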


### Backblaze dataset analysis

In [7]:
import pandas as pd

# load one dataset file (with data for 1 month)
df_bb = pd.read_csv('./backblaze/2017-01-01.csv')
df_bb.head()

# TODO: probably need to load all dataset files and concatenate them into one DataFrame

| | index | date | serial_number | model | capacity_bytes | failure | smart_1_normalized | smart_1_raw | smart_2_normalized | smart_2_raw | ... | smart_250_normalized | smart_250_raw | smart_251_normalized | smart_251_raw | smart_252_normalized | smart_252_raw | smart_254_normalized | smart_254_raw | smart_255_normalized | smart_255_raw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2017-01-01 | MJ0351YNG9Z0XA | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 | 100 | 0 | 135.0 | 108.0 | ... | | | | | | | | | | |
| 1 | 1 | 2017-01-01 | MJ0351YNG9WJSA | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 | 100 | 0 | 136.0 | 104.0 | ... | | | | | | | | | | |
| 2 | 2 | 2017-01-01 | PL1321LAG34XWH | Hitachi HDS5C4040ALE630 | 4000787030016 | 0 | 100 | 0 | 134.0 | 101.0 | ... | | | | | | | | | | |
| 3 | 3 | 2017-01-01 | MJ0351YNGABYAA | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 | 100 | 0 | 136.0 | 104.0 | ... | | | | | | | | | | |
| 4 | 4 | 2017-01-01 | Z305B2QN | ST4000DM000 | 4000787030016 | 0 | 113 | 58173272 | | | ... | | | | | | | | | | |
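For the open item of loading every file and concatenating them into one DataFrame, a possible approach is sketched below. `load_backblaze` is a hypothetical helper, and the demo writes two tiny stand-in CSV files to a temporary directory rather than reading the real `./backblaze/` folder.

```python
import glob
import os
import tempfile

import pandas as pd


def load_backblaze(directory, pattern="*.csv"):
    """Read every CSV matching pattern in directory and stack them row-wise."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern} in {directory}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on made-up files (assumption): two one-row CSVs with a few Backblaze-like columns.
with tempfile.TemporaryDirectory() as tmp:
    for day in ("2017-01-01", "2017-01-02"):
        pd.DataFrame(
            {"date": [day], "serial_number": ["EXAMPLE"], "failure": [0]}
        ).to_csv(os.path.join(tmp, f"{day}.csv"), index=False)
    combined = load_backblaze(tmp)

print(combined.shape)  # (2, 3)
```

On the real data this would be called as `load_backblaze('./backblaze')`. The full dataset may not fit comfortably in memory, in which case passing `usecols=` to `pd.read_csv` to keep only the needed SMART columns is one option.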
