<h1>Prediction of hard drive failure using SMART stats</h1>

The purpose of this study is to predict, from a set of SMART stats, which hard drives are going to fail, and to compare the predictive model against the actual data collected by the Backblaze data center in 2017. This could give disk manufacturers an opportunity to address the risk factors that adversely affect hard drive reliability. Data centers could potentially use the predictive model to replace hard drives and submit purchase orders for new equipment in a timely manner.

<h2>Overview of the hard drive data</h2>

<p style="margin-bottom: 0px;">The raw hard drive data set is obtained from https://www.backblaze.com/b2/hard-drive-test-data.html. The daily snapshot of one drive is one record or row of data. All of the drive snapshots for a given day are collected into a file consisting of a row for each active hard drive. The format of this file is a "csv" (Comma Separated Values) file. Each day this file is named in the format YYYY-MM-DD.csv, for example, 2017-04-10.csv. Each csv file has a header which includes the following columns:</p>
<TABLE BORDER="3">
    <TR style="border-bottom: 1px solid black; background-color: #fff;">
        <TD COLSPAN="2">
            <H3 style="text-align: center; font-size: 20px; padding: 15px;"><BR/>Header Columns</H3>
        </TD>
    </TR>
    <TR style="text-align: center;">
        <TH style="text-align: left; font-size: 14px;"><strong>date</strong></TH>
        <TH style="text-align: left; font-size: 14px;">the date of the file in yyyy-mm-dd format</TH>
    </TR>
    <TR>
        <TH style="text-align: left; font-size: 14px;"><strong>serial_number</strong></TH>
        <TH style="text-align: left; font-size: 14px;">the manufacturer-assigned serial number of the drive</TH>
    </TR>
    <TR>
        <TH style="text-align: left; font-size: 14px;"><strong>model</strong></TH>
        <TH style="text-align: left; font-size: 14px;">the manufacturer-assigned model of the drive</TH>
    </TR>
    <TR>
        <TH style="text-align: left; font-size: 14px;"><strong>capacity_bytes</strong></TH>
        <TH style="text-align: left; font-size: 14px;">the drive capacity in bytes</TH>
    </TR>
    <TR>
        <TH style="text-align: left; font-size: 14px;"><strong>failure</strong></TH>
        <TH style="text-align: left; font-size: 14px;">contains a “0” if the drive is OK and “1” if this is the last day the drive was operational before failing</TH>
    </TR>
    <TR>
        <TH style="text-align: left; font-size: 14px;"><strong>smart_n_normalized</strong></TH>
        <TH style="text-align: left; font-size: 14px;">normalized value of SMART attribute n (n is the attribute number), as reported by the given drive</TH>
    </TR>
    <TR>
        <TH style="text-align: left; font-size: 14px;"><strong>smart_n_raw</strong></TH>
        <TH style="text-align: left; font-size: 14px;">raw value of SMART attribute n (n is the attribute number), as reported by the given drive</TH>
    </TR>
</TABLE>
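
Because `failure` is 1 only on a drive's last operational day, each failed drive should contribute exactly one such row. Counting failures therefore reduces to summing that column. A minimal sketch on a tiny hand-made frame (the serial numbers and values are made up for illustration and are not taken from the Backblaze data):

```python
import pandas as pd

# Toy daily snapshots; values are illustrative only.
toy = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"],
    "serial_number": ["A1", "B2", "A1", "B2"],
    "failure": [0, 0, 0, 1],
})

# One row with failure == 1 per failed drive, so the sum is the failure count.
n_failures = int(toy["failure"].sum())

# The rows flagged with 1 give the serial number and the last operational day.
failed = toy.loc[toy["failure"] == 1, ["serial_number", "date"]]
failed_serials = failed["serial_number"].tolist()
print(n_failures, failed_serials)
```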

The SMART attributes (columns 6-95) are defined using the parameters described in https://en.wikipedia.org/wiki/S.M.A.R.T.
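
The SMART column names follow the `smart_n_normalized` / `smart_n_raw` pattern from the table above, so the attribute number and the kind of value can be recovered from the name alone. The sketch below uses only the standard library; the helper name `parse_smart_column` is hypothetical.

```python
import re

# Column-name pattern from the header table: smart_<n>_normalized / smart_<n>_raw.
SMART_COL = re.compile(r"^smart_(\d+)_(normalized|raw)$")

def parse_smart_column(name):
    """Return (attribute_number, kind) for a SMART column, or None otherwise."""
    match = SMART_COL.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

# A few header names as they appear in the csv files.
columns = ["date", "serial_number", "failure", "smart_5_raw", "smart_9_normalized"]
smart = {c: parse_smart_column(c) for c in columns if parse_smart_column(c) is not None}
print(smart)
```

The attribute number can then be looked up in the Wikipedia table to find out what each column measures.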

<h2>Data Wrangling</h2>
<h3>Import necessary packages and csv files for 2017</h3>

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('2017-07-01.csv')
#inspect the first csv file
print(df.head())

         date   serial_number                    model  capacity_bytes  \
0  2017-07-01  MJ0351YNG9Z0XA  Hitachi HDS5C3030ALA630   3000592982016   
1  2017-07-01  MJ0351YNG9WJSA  Hitachi HDS5C3030ALA630   3000592982016   
2  2017-07-01  PL1321LAG34XWH  Hitachi HDS5C4040ALE630   4000787030016   
3  2017-07-01  MJ0351YNGABYAA  Hitachi HDS5C3030ALA630   3000592982016   
4  2017-07-01  PL2331LAHDBJPJ     HGST HMS5C4040BLE640   4000787030016   

   failure  smart_1_normalized  smart_1_raw  smart_2_normalized  smart_2_raw  \
0        0                 100            0               135.0        107.0   
1        0                 100            0               136.0        104.0   
2        0                 100            0               134.0        101.0   
3        0                 100            0               136.0        104.0   
4        0                 100            0               133.0        104.0   

   smart_3_normalized      ...        smart_250_normalized  smart_250_raw 

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85092 entries, 0 to 85091
Data columns (total 95 columns):
date                    85092 non-null object
serial_number           85092 non-null object
model                   85092 non-null object
capacity_bytes          85092 non-null int64
failure                 85092 non-null int64
smart_1_normalized      85092 non-null int64
smart_1_raw             85092 non-null int64
smart_2_normalized      30660 non-null float64
smart_2_raw             30660 non-null float64
smart_3_normalized      85092 non-null int64
smart_3_raw             85092 non-null int64
smart_4_normalized      85092 non-null int64
smart_4_raw             85092 non-null int64
smart_5_normalized      85092 non-null int64
smart_5_raw             85092 non-null int64
smart_7_normalized      85092 non-null int64
smart_7_raw             85092 non-null int64
smart_8_normalized      30660 non-null float64
smart_8_raw             30660 non-null float64
smart_9_normalized      8
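
The `info()` output shows that not every drive reports every attribute: `smart_2_normalized`, for instance, is non-null in 30660 of 85092 rows, or roughly 36%. One way to get the same picture for all columns at once is to compute the fraction of missing values per column. The sketch below runs on a toy frame; the 40% cut-off is an arbitrary assumption, not a value used in this study.

```python
import numpy as np
import pandas as pd

# Toy frame: one fully populated column, one with gaps.
toy = pd.DataFrame({
    "smart_1_raw": [0, 0, 0, 0],
    "smart_2_raw": [107.0, np.nan, np.nan, 104.0],
})

# Fraction of missing values per column, largest first.
missing = toy.isna().mean().sort_values(ascending=False)
print(missing)

# Hypothetical rule: keep columns that are at most 40% missing.
keep = missing[missing <= 0.4].index.tolist()
print(keep)
```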

In [3]:
import glob
import os
print(os.path.realpath('2017-07-01.csv'))

/Users/valentina/CapstoneProject/2017-07-01.csv


In [4]:
path = r'/Users/valentina/CapstoneProject/data_Q3_2017'
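#sort the file names so that the daily files are read in date order
filenames = sorted(glob.glob(path + "/*.csv"))
arr = []
for element in filenames:
    #one data frame per daily csv file
    df2 = pd.read_csv(element, index_col=None, header=0)
    arr.append(df2)
#stack the daily data frames into one data frame for the quarter
frame = pd.concat(arr, ignore_index=True)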
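
The same read-and-concatenate loop is needed for every quarter, so it can be wrapped in a small function. The sketch below is one possible version, not the code used in this notebook: the name `load_daily_csvs` is hypothetical, and the optional `usecols` argument (passed straight to `pd.read_csv`) is there because loading only the columns of interest can reduce memory use considerably. It is demonstrated on two throwaway files rather than on the real data.

```python
import glob
import os
import tempfile

import pandas as pd

def load_daily_csvs(directory, usecols=None):
    """Concatenate every *.csv in `directory`, in file-name (i.e. date) order."""
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    frames = [pd.read_csv(p, usecols=usecols) for p in paths]
    # Note: pd.concat raises ValueError if the directory holds no csv files.
    return pd.concat(frames, ignore_index=True)

# Demo on two tiny files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for day in ["2017-07-02", "2017-07-01"]:
        pd.DataFrame(
            {"date": [day], "serial_number": ["X"], "failure": [0]}
        ).to_csv(os.path.join(tmp, day + ".csv"), index=False)
    demo = load_daily_csvs(tmp, usecols=["date", "failure"])

print(demo)
```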

In [5]:
print(frame.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7919748 entries, 0 to 7919747
Data columns (total 95 columns):
date                    object
serial_number           object
model                   object
capacity_bytes          int64
failure                 int64
smart_1_normalized      float64
smart_1_raw             float64
smart_2_normalized      float64
smart_2_raw             float64
smart_3_normalized      float64
smart_3_raw             float64
smart_4_normalized      float64
smart_4_raw             float64
smart_5_normalized      float64
smart_5_raw             float64
smart_7_normalized      float64
smart_7_raw             float64
smart_8_normalized      float64
smart_8_raw             float64
smart_9_normalized      float64
smart_9_raw             float64
smart_10_normalized     float64
smart_10_raw            float64
smart_11_normalized     float64
smart_11_raw            float64
smart_12_normalized     float64
smart_12_raw            float64
smart_13_normalized     float6
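
Columns such as `smart_1_normalized` were `int64` in the single-day frame but are `float64` here. A likely explanation, although the output above does not confirm it, is that at least one daily file has missing values in those columns: NumPy integers cannot represent NaN, so pandas falls back to floats. The toy example below reproduces the effect and shows pandas' nullable `Int64` dtype as one possible alternative.

```python
import numpy as np
import pandas as pd

day1 = pd.DataFrame({"smart_1_raw": [0, 5]})         # int64, no gaps
day2 = pd.DataFrame({"smart_1_raw": [3.0, np.nan]})  # a gap forces float64

combined = pd.concat([day1, day2], ignore_index=True)
print(combined["smart_1_raw"].dtype)

# Optional: the nullable integer dtype keeps whole numbers and missing values.
nullable = combined["smart_1_raw"].astype("Int64")
print(nullable.dtype)
```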

In [6]:
#add the data frame for '2017-07-01.csv' 
frames = [df, frame]
drive_frame_q3_2017 = pd.concat(frames,ignore_index=True)
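
Each row is meant to describe one drive on one day, so a (`date`, `serial_number`) pair should appear only once. When a single day's frame is concatenated with a whole directory of files, a quick duplicate check guards against that day being loaded twice. Whether that happens with the real directories is not shown here, so the sketch below uses toy frames in which the overlap is deliberate.

```python
import pandas as pd

# Toy frames: the first day appears in both.
single_day = pd.DataFrame(
    {"date": ["2017-07-01"], "serial_number": ["X"], "failure": [0]}
)
quarter = pd.DataFrame(
    {"date": ["2017-07-01", "2017-07-02"], "serial_number": ["X", "X"], "failure": [0, 0]}
)
merged = pd.concat([single_day, quarter], ignore_index=True)

# Count repeated (date, serial_number) pairs, then keep the first of each.
key = ["date", "serial_number"]
n_dupes = int(merged.duplicated(subset=key).sum())
deduped = merged.drop_duplicates(subset=key).reset_index(drop=True)
print(n_dupes, len(deduped))
```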

In [7]:
#inspect the first 5 members of the cumulative data frame
drive_frame_q3_2017.head()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
0,2017-07-01,MJ0351YNG9Z0XA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,135.0,107.0,127.0,...,,,,,,,,,,
1,2017-07-01,MJ0351YNG9WJSA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,136.0,104.0,126.0,...,,,,,,,,,,
2,2017-07-01,PL1321LAG34XWH,Hitachi HDS5C4040ALE630,4000787030016,0,100.0,0.0,134.0,101.0,130.0,...,,,,,,,,,,
3,2017-07-01,MJ0351YNGABYAA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,136.0,104.0,137.0,...,,,,,,,,,,
4,2017-07-01,PL2331LAHDBJPJ,HGST HMS5C4040BLE640,4000787030016,0,100.0,0.0,133.0,104.0,143.0,...,,,,,,,,,,


In [8]:
#read and concatenate the csv files from the first quarter of 2017
path = r'/Users/valentina/CapstoneProject/data_Q1_2017'
#sort the file names so that the daily files are read in date order
filenames = sorted(glob.glob(path + "/*.csv"))
arr = []
for element in filenames:
    df2 = pd.read_csv(element, index_col=None, header=0)
    arr.append(df2)
drive_frame_q1_2017 = pd.concat(arr,ignore_index=True)
#inspect the first 5 elements of the data frame 
drive_frame_q1_2017.head()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
0,2017-01-01,MJ0351YNG9Z0XA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,135.0,108.0,127.0,...,,,,,,,,,,
1,2017-01-01,MJ0351YNG9WJSA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,136.0,104.0,126.0,...,,,,,,,,,,
2,2017-01-01,PL1321LAG34XWH,Hitachi HDS5C4040ALE630,4000787030016,0,100.0,0.0,134.0,101.0,130.0,...,,,,,,,,,,
3,2017-01-01,MJ0351YNGABYAA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,136.0,104.0,137.0,...,,,,,,,,,,
4,2017-01-01,Z305B2QN,ST4000DM000,4000787030016,0,113.0,58173272.0,,,91.0,...,,,,,,,,,,


In [9]:
#read and concatenate the csv files from the second quarter of 2017
path = r'/Users/valentina/CapstoneProject/data_Q2_2017'
#sort the file names so that the daily files are read in date order
filenames = sorted(glob.glob(path + "/*.csv"))
arr = []
for element in filenames:
    df2 = pd.read_csv(element, index_col=None, header=0)
    arr.append(df2)
drive_frame_q2_2017 = pd.concat(arr,ignore_index=True)
#inspect the first 5 elements of the data frame 
drive_frame_q2_2017.head()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
0,2017-04-01,MJ0351YNG9Z0XA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,135.0,108.0,127.0,...,,,,,,,,,,
1,2017-04-01,MJ0351YNG9WJSA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,136.0,104.0,126.0,...,,,,,,,,,,
2,2017-04-01,PL1321LAG34XWH,Hitachi HDS5C4040ALE630,4000787030016,0,100.0,0.0,134.0,101.0,130.0,...,,,,,,,,,,
3,2017-04-01,MJ0351YNGABYAA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,136.0,104.0,137.0,...,,,,,,,,,,
4,2017-04-01,PL2331LAHDBJPJ,HGST HMS5C4040BLE640,4000787030016,0,100.0,0.0,133.0,104.0,143.0,...,,,,,,,,,,


In [10]:
#concatenate the Q1-Q3 data frames to obtain one data frame for 2017 (no Q4 data is loaded here)
quarters = [drive_frame_q1_2017, drive_frame_q2_2017, drive_frame_q3_2017]
drives_2017 = pd.concat(quarters,ignore_index=True)

#inspect the master data frame for drive stats in 2017
drives_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22205574 entries, 0 to 22205573
Data columns (total 95 columns):
date                    object
serial_number           object
model                   object
capacity_bytes          object
failure                 object
smart_1_normalized      float64
smart_1_raw             float64
smart_2_normalized      float64
smart_2_raw             float64
smart_3_normalized      float64
smart_3_raw             float64
smart_4_normalized      float64
smart_4_raw             float64
smart_5_normalized      float64
smart_5_raw             float64
smart_7_normalized      float64
smart_7_raw             float64
smart_8_normalized      float64
smart_8_raw             float64
smart_9_normalized      float64
smart_9_raw             float64
smart_10_normalized     float64
smart_10_raw            float64
smart_11_normalized     float64
smart_11_raw            float64
smart_12_normalized     float64
smart_12_raw            float64
smart_13_normalized     fl
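
In the combined frame, `capacity_bytes` and `failure` show up as `object` rather than `int64`. The output does not say why; one common cause is that the values arrive as a mix of numbers and strings across the quarterly frames. Assuming that is the case here, `pd.to_numeric` can restore a numeric dtype. The toy frame below recreates the symptom on purpose.

```python
import pandas as pd

# Mixed int/str values reproduce the 'object' dtype.
toy = pd.DataFrame({
    "capacity_bytes": [3000592982016, "4000787030016"],
    "failure": [0, "1"],
})
dtypes_before = toy.dtypes.astype(str).to_dict()
print(dtypes_before)

for col in ["capacity_bytes", "failure"]:
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    toy[col] = pd.to_numeric(toy[col], errors="coerce")
print(toy.dtypes)
```

On the real data it would be worth checking how many values were coerced to NaN before relying on the converted columns.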

In [11]:
#inspect the first 5 elements of the master data frame 
drives_2017.head()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
0,2017-01-01,MJ0351YNG9Z0XA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,135.0,108.0,127.0,...,,,,,,,,,,
1,2017-01-01,MJ0351YNG9WJSA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,136.0,104.0,126.0,...,,,,,,,,,,
2,2017-01-01,PL1321LAG34XWH,Hitachi HDS5C4040ALE630,4000787030016,0,100.0,0.0,134.0,101.0,130.0,...,,,,,,,,,,
3,2017-01-01,MJ0351YNGABYAA,Hitachi HDS5C3030ALA630,3000592982016,0,100.0,0.0,136.0,104.0,137.0,...,,,,,,,,,,
4,2017-01-01,Z305B2QN,ST4000DM000,4000787030016,0,113.0,58173272.0,,,91.0,...,,,,,,,,,,


In [12]:
drives_2017.tail()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
22205569,2017-09-30,PL1331LAHD1AWH,HGST HMS5C4040BLE640,4000787030016,0,100.0,0.0,134.0,100.0,100.0,...,,,,,,,,,,
22205570,2017-09-30,ZA10MCEQ,ST8000DM002,8001563222016,0,80.0,91943880.0,,,96.0,...,,,,,,,,,,
22205571,2017-09-30,PL1331LAHD0AHH,HGST HMS5C4040BLE640,4000787030016,0,100.0,0.0,134.0,100.0,100.0,...,,,,,,,,,,
22205572,2017-09-30,PL1331LAHD1T5H,HGST HMS5C4040BLE640,4000787030016,0,100.0,0.0,134.0,101.0,148.0,...,,,,,,,,,,
22205573,2017-09-30,Z30271GD,ST4000DM000,4000787030016,0,118.0,201027560.0,,,91.0,...,,,,,,,,,,


In [None]:
#planned per-drive record layout (placeholder values)
[{'serial_number': 'asdfasd', 'model': 'asdfasdf', 'log': ['day', 'smart1', 'smart2', 'smart3']}]
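
The layout sketched in the last cell, one record per drive with a day-by-day log of SMART values, could be built from the flat data frame with a `groupby` on the drive's identity. This is only one reading of that sketch: the toy data, the choice of SMART columns, and the exact shape of each log entry (`[date, smart values...]`) are assumptions.

```python
import pandas as pd

# Toy snapshots for two drives; values are illustrative only.
toy = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-01-01"],
    "serial_number": ["A1", "A1", "B2"],
    "model": ["M-1", "M-1", "M-2"],
    "smart_1_raw": [0.0, 1.0, 0.0],
    "smart_5_raw": [0.0, 0.0, 8.0],
})
smart_cols = ["smart_1_raw", "smart_5_raw"]  # hypothetical selection

records = []
# Sort by date first so that each drive's log is in chronological order.
for (serial, model), group in toy.sort_values("date").groupby(["serial_number", "model"]):
    log = group[["date"] + smart_cols].values.tolist()
    records.append({"serial_number": serial, "model": model, "log": log})

print(records[0])
```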