## Prediction of hard disk failures using S.M.A.R.T attributes

### What do SMART attributes mean?
S.M.A.R.T. stands for **S**elf-Monitoring, **A**nalysis and **R**eporting **T**echnology. SMART is a system for monitoring storage media such as hard disks or SSDs and detecting errors in them early. All current hard drives and SSDs have SMART functionality.

### Dataset
Backblaze owns and operates multiple data centers that have thousands of hard disk drives. They regularly share data about the performance of these drives as well as other insights from their datacenters. <br>
<img src="https://i.pcmag.com/imagery/reviews/000jV9xQkF3oIkYuhgZursX-6..1569480108.png" alt="MarineGEO circle logo" style="height: 200px; width:250px;"/>

The data for this project was collected in Jan-Feb 2017 (training set) and Nov-Dec 2017 (test set). <br>
The dataset can be downloaded [here](https://www.backblaze.com/b2/hard-drive-test-data.html).

### Feature Selection
Hard drives report SMART stats that monitor indicators of drive status and reliability. Out of over 250 SMART stats, we aim to choose 5-7 stats that predict hard drive failure.<br>
Backblaze suggested using the raw values of S.M.A.R.T. statistics 5, 187, 188, 197 and 198 for this kind of analysis.<br>
(Ref.: Andy Klein, "What SMART Stats Tell Us About Hard Drives", October 6, 2016: https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/ .)
Similar suggestions were made in research papers in this domain.<br>

| Attribute  | Description                   |
|------------|-------------------------------|
| SMART 5    | Reallocated Sectors Count     |
| SMART 187  | Reported Uncorrectable Errors |
| SMART 188  | Command Timeout               |
| SMART 197  | Current Pending Sector Count  |
| SMART 198  | Uncorrectable Sector Count    |

Import all modules.

In [1]:
import numpy as np
import pandas as pd

In [2]:
features = ['smart_5_raw', 'smart_187_raw', 'smart_188_raw',
            'smart_197_raw', 'smart_198_raw', 'failure']
train_data = pd.read_csv("jan_feb_backblaze_train.csv").reindex(columns=features)
test_data = pd.read_csv("nov_dec_backblaze_test.csv").reindex(columns=features)

In [3]:
train_data.head()

Unnamed: 0,smart_5_raw,smart_187_raw,smart_188_raw,smart_197_raw,smart_198_raw,failure
0,0,,,0,,0
1,2,,,0,,0
2,0,,,0,,0
3,0,,,0,,0
4,0,0.0,,0,,0


In [4]:
test_data.head()

Unnamed: 0,smart_5_raw,smart_187_raw,smart_188_raw,smart_197_raw,smart_198_raw,failure
0,0.0,2.0,0.0,0.0,0.0,1.0
1,102.0,,,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,5.0,0.0,64.0,64.0,1.0
4,0.0,,,1.0,0.0,1.0


### Preprocessing of Data

In the data printed above, there appear to be missing values in smart_187_raw, among other columns.<br>
The dataset contains failed and working drives in a ratio of approximately 1:10.

In [5]:
print(train_data.smart_187_raw.value_counts(dropna=False))
print(test_data.smart_187_raw.value_counts(dropna=False))

0.0     1357
NaN      798
1.0       14
2.0        9
4.0        4
12.0       4
7.0        3
6.0        3
3.0        3
18.0       2
8.0        2
36.0       2
31.0       1
24.0       1
30.0       1
21.0       1
9.0        1
39.0       1
Name: smart_187_raw, dtype: int64
0.0     1349
NaN      825
2.0       13
1.0        8
6.0        7
12.0       4
3.0        4
4.0        3
5.0        1
33.0       1
78.0       1
7.0        1
27.0       1
11.0       1
18.0       1
24.0       1
8.0        1
17.0       1
9.0        1
Name: smart_187_raw, dtype: int64


### Managing missing values

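There are approximately 800 missing values in smart_187_raw in each of the train and test sets (798 and 825 respectively). We replace all missing values with -1 as a sentinel.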

In [6]:
train_data = train_data.fillna(value=-1)
test_data = test_data.fillna(value=-1)
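
Before filling, it can help to count the missing values per column and to see how a sentinel changes the column statistics that a Gaussian model relies on. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not Backblaze data); only the column names follow the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the layout of two SMART columns.
toy = pd.DataFrame({
    "smart_5_raw": [0, 2, 0, 0],
    "smart_187_raw": [np.nan, np.nan, 0.0, 4.0],
})

# Count missing values per column before filling.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fill with the -1 sentinel, as in the cell above.
filled = toy.fillna(value=-1)

# The sentinel shifts the column mean (and variance), which feeds
# directly into the per-class Gaussian estimated by the classifier.
mean_ignoring_missing = toy["smart_187_raw"].mean()   # NaN skipped: (0 + 4) / 2
mean_after_fill = filled["smart_187_raw"].mean()      # (-1 - 1 + 0 + 4) / 4
print(mean_ignoring_missing, mean_after_fill)
```

A sentinel keeps every row usable, at the cost of distorting the estimated mean and variance of the affected columns. Whether that trade-off is acceptable depends on how informative "value not reported" is for the drive models in question.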

In [7]:
train_data.head()

Unnamed: 0,smart_5_raw,smart_187_raw,smart_188_raw,smart_197_raw,smart_198_raw,failure
0,0,-1.0,-1.0,0,-1.0,0
1,2,-1.0,-1.0,0,-1.0,0
2,0,-1.0,-1.0,0,-1.0,0
3,0,-1.0,-1.0,0,-1.0,0
4,0,0.0,-1.0,0,-1.0,0


In [8]:
test_data.head()

Unnamed: 0,smart_5_raw,smart_187_raw,smart_188_raw,smart_197_raw,smart_198_raw,failure
0,0.0,2.0,0.0,0.0,0.0,1.0
1,102.0,-1.0,-1.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,5.0,0.0,64.0,64.0,1.0
4,0.0,-1.0,-1.0,1.0,0.0,1.0


### Training and Testing Gaussian Naive Bayes

### Why Gaussian Naive Bayes?
[Naive Bayes](https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/) is a classifier based on Bayes' theorem. It "naively" assumes that the features are conditionally independent given the class. <br>
<img src="https://miro.medium.com/max/640/1*7lg_uLm8_1fYGjxPbTrQFQ.png" alt="MarineGEO circle logo" style="height: 100px; width:280px;"/>
As SMART attributes are relatively independent of each other, naive Bayes is a reasonable choice for classifying hard drive failures.
As the features from Backblaze are numeric, we assume that each one follows a Gaussian distribution within each class.
Gaussian naive Bayes computes the probability of belonging to each class (1 or 0, i.e. failed or working drive) based on the probability density function:
![alt](https://wikimedia.org/api/rest_v1/media/math/render/svg/acae0ab7740006874d2c7fd77eb5de61db3586c5)
The probability distribution of v given a class c is then calculated by:
![alt](https://wikimedia.org/api/rest_v1/media/math/render/svg/12ac511145223037a1378689333fe04c621845d4)

![alt](https://www.researchgate.net/profile/Yune_Lee/publication/255695722/figure/fig1/AS:297967207632900@1448052327024/Figure-1-Illustration-of-how-a-Gaussian-Naive-Bayes-GNB-classifier-works-For-each.png)
<p>For each class c, the likelihood p(x | c) depends on how far x lies from the class mean, measured in units of that class's standard deviation (the z-score). Roughly speaking, Gaussian Naive Bayes assigns point x to the class whose mean it is closest to in this sense (the lowest absolute z-score); strictly, it picks the class with the highest posterior probability, which also takes the class priors and each class's variance into account.</p>
<p>In the picture above, point X has a smaller z-score (distance) relative to the mean of class A than to the mean of class B. Therefore it is classified as belonging to class A.</p>
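
To make the decision rule concrete, here is a minimal hand-rolled sketch for a single feature and two classes. The sample values and the helper names (`gaussian_pdf`, `fit_class`, `predict`) are hypothetical and chosen only for illustration; with several features, the per-feature densities would be multiplied together under the independence assumption.

```python
import math


def gaussian_pdf(v, mean, var):
    """Density of a normal distribution with the given mean and variance at v."""
    return math.exp(-((v - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_class(values):
    """Return (mean, variance) of one feature for one class."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, var


# Hypothetical values of one feature for each class (not Backblaze data).
working = [0.0, 1.0, 0.0, 2.0, 1.0, 2.0]  # class 0
failed = [8.0, 12.0, 10.0]                # class 1

params = {0: fit_class(working), 1: fit_class(failed)}
total = len(working) + len(failed)
priors = {0: len(working) / total, 1: len(failed) / total}


def predict(v):
    """Pick the class with the largest prior * likelihood (unnormalized posterior)."""
    scores = {c: priors[c] * gaussian_pdf(v, *params[c]) for c in params}
    return max(scores, key=scores.get)


print(predict(1.5))  # lies close to the mean of class 0
print(predict(9.0))  # lies close to the mean of class 1
```

The normalizing constant of Bayes' theorem is the same for every class, so comparing `prior * likelihood` is enough to find the most probable class.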

In [9]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

train_ds = train_data.drop(['failure'], axis=1)
train_target = train_data['failure']

test_ds = test_data.drop(['failure'], axis=1)
test_target = test_data['failure']

In [10]:
gnb = gnb.fit(train_ds, train_target)
y_prediction = gnb.predict(test_ds)

# Failed drives (label 1) that the model predicted as working (label 0), i.e. false negatives
no_of_failed_predictions = int(((test_target == 1) & (y_prediction == 0)).sum())
total = test_ds.shape[0]
incorrect_predictions = (test_target != y_prediction).sum()
print("Number of incorrect predictions: %d out of total %d points" % (incorrect_predictions, total))
print("Num missed failed hard drive predictions: %s; %s%% out of total" % (no_of_failed_predictions, 100.0 * no_of_failed_predictions / total))
print("Percent accuracy: %s%%" % (100.0 * gnb.score(test_ds, test_target)))

Number of incorrect predictions: 127 out of total 2224 points
Num missed failed hard drive predictions: 116; 5.215827338129497% out of total
Percent accuracy: 94.28956834532374%


### Inference from GNB prediction
Our model achieved about 94.3% accuracy, which is impressive for such a simple algorithm. <br>
However, the model missed 116 failed drives (false negatives, i.e. failed drives predicted as working). That is 5.22% of all test points and 116 of the 127 incorrect predictions, which is a high rate considering the scale of today's data centers.

For example, if a data center has 10,000 hard drives and the same rate applied, roughly 520 failing drives would go undetected, which would cause serious problems.
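
Accuracy alone hides which kind of mistake the model makes. A confusion matrix separates missed failures (false negatives) from false alarms (false positives). The sketch below uses small hypothetical label lists and a hypothetical helper `confusion_counts`; it is not computed on the Backblaze test set.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives/negatives for binary labels (1 = failed, 0 = working)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # failed drive, correctly flagged
        elif a == 0 and p == 1:
            fp += 1  # working drive, flagged as failed (false alarm)
        elif a == 0 and p == 0:
            tn += 1  # working drive, correctly left alone
        else:
            fn += 1  # failed drive, missed by the model
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Hypothetical labels for eight drives.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
recall = counts["tp"] / (counts["tp"] + counts["fn"])     # share of failed drives caught
precision = counts["tp"] / (counts["tp"] + counts["fp"])  # share of alarms that were real

print(counts)
print(accuracy, recall, precision)
```

In this toy example the accuracy looks moderate while the recall shows that most failed drives were missed, which is the kind of gap that the accuracy figure above can conceal.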

In [11]:
# Feature order: smart_5_raw, smart_187_raw, smart_188_raw, smart_197_raw, smart_198_raw
new_input = np.array([2, -1, -1, 0, -1]).reshape(1, -1)
y_pred = gnb.predict(new_input)
print(y_pred)

[0]


In [12]:
import pickle

In [13]:
with open("model.pkl", 'wb') as f_out:
    pickle.dump(gnb, f_out)

In [14]:
with open('model.pkl', 'rb') as f_in:
    model = pickle.load(f_in)
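
A quick way to confirm that a saved model survives the round trip is to compare its predictions before and after reloading. The sketch below is self-contained and uses a hypothetical toy model and in-memory bytes instead of `model.pkl`, so that it does not touch the files above.

```python
import pickle

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical, well-separated toy data (not Backblaze data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [50.0, 40.0], [60.0, 45.0], [55.0, 50.0]])
y = np.array([0, 0, 0, 1, 1, 1])

toy_model = GaussianNB().fit(X, y)

# Serialize to bytes and restore, mirroring pickle.dump / pickle.load on a file.
payload = pickle.dumps(toy_model)
restored = pickle.loads(payload)

# The restored model should give identical predictions on the same inputs.
samples = np.array([[0.5, 0.5], [58.0, 44.0]])
same = np.array_equal(toy_model.predict(samples), restored.predict(samples))
print(same)
```

Pickle files can execute arbitrary code when loaded, so only load models from sources you trust, and keep the scikit-learn version used for loading close to the one used for saving.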