# Which hard drives are the most reliable? 

In [1]:
# data wrangling
import pandas as pd
import numpy as np

# py files
import acquire
import prepare

## Project Plan

Goal
- Determine which hard drives are the most reliable by classifying and predicting hard drive failure rates.

Hypotheses
- h1 
- h2


## Acquire

The raw data is available on Backblaze.com. This analysis uses the hard drive data from 2016, 2017, 2018, and 2019. The files were downloaded to a local computer, unzipped, renamed to the format "data_Qx_201x", and placed in a folder titled "data".

The `acquire.acquire_agg_data` function reads in the data, aggregates it, and returns a pandas dataframe:
- Using Spark, a dataframe was created from each directory of csv files. The dataframes were concatenated on their common columns. This gave a dataframe with 95 columns and 121,390,247 rows.
- Backblaze identified 5 SMART stats that predict hard drive failure (SMART 5, 187, 188, 197, 198). The max value of each of these stats, along with SMART 9 (power-on hours, used later for drive age), was extracted and the dataframe was aggregated by serial number. This reduced the dataframe to 10 columns and 169,073 rows.
- The Spark dataframe is converted to pandas.
- The pandas dataframe is saved as "hard_drives_smart_5.csv" for future use.

The csv is linked in the README and can be downloaded. If "hard_drives_smart_5.csv" is in the working directory, `acquire.acquire_agg_data` will read from the csv instead of re-creating the dataframe.
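The aggregation and caching steps described above can be sketched roughly as follows. This is not the project's `acquire` module: it swaps Spark for plain pandas so it runs on a small sample, and the function names `aggregate_by_serial` and `load_or_build`, the `daily_loader` callable, and the choice to group on serial number together with model and capacity are all assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd

# SMART stats kept in the aggregated frame (column names as they appear in df.info()).
SMART_COLS = [
    "smart_9_raw", "smart_5_raw", "smart_187_raw",
    "smart_188_raw", "smart_197_raw", "smart_198_raw",
]


def aggregate_by_serial(daily: pd.DataFrame) -> pd.DataFrame:
    """Collapse daily snapshots to one row per drive, keeping each stat's max."""
    agg_cols = ["failure"] + [c for c in SMART_COLS if c in daily.columns]
    # Assumption: group on serial number plus model and capacity.
    grouped = (
        daily.groupby(["serial_number", "model", "capacity_bytes"])[agg_cols]
        .max()
        .reset_index()
    )
    # Mirror the Spark-style "max(column)" names seen in the notebook output.
    return grouped.rename(columns={c: f"max({c})" for c in agg_cols})


def load_or_build(daily_loader, cache_path="hard_drives_smart_5.csv"):
    """Read the cached csv if it exists; otherwise aggregate and save it."""
    cache = Path(cache_path)
    if cache.exists():
        return pd.read_csv(cache)
    df = aggregate_by_serial(daily_loader())  # daily_loader is a hypothetical callable
    df.to_csv(cache, index=False)
    return df
```

Checking for the csv first means the expensive pass over the raw files only has to happen once per machine.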

In [2]:
df = acquire.acquire_agg_data()

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169073 entries, 0 to 169072
Data columns (total 10 columns):
serial_number         169072 non-null object
model                 169073 non-null object
capacity_bytes        169073 non-null int64
max(failure)          169073 non-null int64
max(smart_9_raw)      161975 non-null float64
max(smart_5_raw)      161851 non-null float64
max(smart_187_raw)    104189 non-null float64
max(smart_188_raw)    104179 non-null float64
max(smart_197_raw)    161841 non-null float64
max(smart_198_raw)    161841 non-null float64
dtypes: float64(6), int64(2), object(2)
memory usage: 12.9+ MB


In [4]:
df.head()

| | serial_number | model | capacity_bytes | max(failure) | max(smart_9_raw) | max(smart_5_raw) | max(smart_187_raw) | max(smart_188_raw) | max(smart_197_raw) | max(smart_198_raw) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PL1311LAG1SJAA | Hitachi HDS5C4040ALE630 | 4000787030016 | 0 | 43819.0 | 0.0 | NaN | NaN | 0.0 | 0.0 |
| 1 | Z305KB36 | ST4000DM000 | 4000787030016 | 0 | 31045.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | MJ0351YNG9MZXA | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 | 41668.0 | 0.0 | NaN | NaN | 0.0 | 0.0 |
| 3 | ZA11NHSN | ST8000DM002 | 8001563222016 | 0 | 26284.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | MJ1311YNG2ZSEA | Hitachi HDS5C3030ALA630 | 3000592982016 | 0 | 47994.0 | 0.0 | NaN | NaN | 0.0 | 0.0 |


In [5]:
df.describe()

| | capacity_bytes | max(failure) | max(smart_9_raw) | max(smart_5_raw) | max(smart_187_raw) | max(smart_188_raw) | max(smart_197_raw) | max(smart_198_raw) |
|---|---|---|---|---|---|---|---|---|
| count | 169073.0 | 169073.0 | 161975.0 | 161851.0 | 104189.0 | 104179.0 | 161841.0 | 161841.0 |
| mean | 6829480000000.0 | 0.035085 | 23858.714839 | 69.851802 | 5.99619 | 326482300.0 | 6.26594 | 5.913261 |
| std | 3981103000000.0 | 0.183996 | 13357.230448 | 1393.236993 | 541.364663 | 30146610000.0 | 452.148242 | 447.550251 |
| min | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4000787000000.0 | 0.0 | 13727.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4000787000000.0 | 0.0 | 22932.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 12000140000000.0 | 0.0 | 34866.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 14000520000000.0 | 1.0 | 90477.0 | 65392.0 | 65535.0 | 8933668000000.0 | 142616.0 | 142616.0 |


In [6]:
df.isnull().sum()

serial_number             1
model                     0
capacity_bytes            0
max(failure)              0
max(smart_9_raw)       7098
max(smart_5_raw)       7222
max(smart_187_raw)    64884
max(smart_188_raw)    64894
max(smart_197_raw)     7232
max(smart_198_raw)     7232
dtype: int64

## Prepare

The `prepare.prepare` function reads in the dataframe and applies the following changes:
- Converts capacity column from bytes to gigabytes.
- Converts max(smart_9_raw) from hours to years.
- Creates a new column for manufacturer.
- Renames all columns appropriately.
- Reorders columns for understandability. 

In [7]:
df = prepare.prepare(df)
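
For readers without the `prepare` module at hand, the transformations listed above might look something like the sketch below. It is not the project's code. The divisor of 10^9 bytes per gigabyte and the 365-day year are assumptions (both are consistent with the values shown in the `df.head()` output after preparation), and `manufacturer_from_model` is a hypothetical helper that only covers the two naming patterns visible in this notebook: Seagate models starting with "ST" and models that carry the brand as a leading word.

```python
import pandas as pd

HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year
BYTES_PER_GB = 1e9         # assumption: decimal gigabytes

# Column names taken from the prepared dataframe shown in this notebook.
RENAMES = {
    "max(failure)": "failure",
    "max(smart_5_raw)": "reallocated_sectors_count",
    "max(smart_187_raw)": "reported_uncorrectable_errors",
    "max(smart_188_raw)": "command_timeout",
    "max(smart_197_raw)": "current_pending_sector_count",
    "max(smart_198_raw)": "uncorrectable_sector_count",
}

ORDERED = [
    "serial_number", "manufacturer", "model", "capacity_gigabytes", "failure",
    "drive_age_in_years", "reallocated_sectors_count",
    "reported_uncorrectable_errors", "command_timeout",
    "current_pending_sector_count", "uncorrectable_sector_count",
]


def manufacturer_from_model(model: str) -> str:
    """Hypothetical helper: infer the manufacturer from the model string."""
    if model.startswith("ST"):
        return "Seagate"
    parts = model.split()
    # Models such as "Hitachi HDS5C4040ALE630" carry the brand as the first word.
    return parts[0] if len(parts) > 1 else "Unknown"


def prepare_sketch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.rename(columns=RENAMES).copy()
    out["capacity_gigabytes"] = (out.pop("capacity_bytes") / BYTES_PER_GB).round(1)
    out["drive_age_in_years"] = (out.pop("max(smart_9_raw)") / HOURS_PER_YEAR).round(1)
    out["manufacturer"] = out["model"].map(manufacturer_from_model)
    return out[ORDERED]
```

A real mapping would need a rule for every brand in the dataset; the "Unknown" fallback here only makes unhandled models easy to spot.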

The `prepare.unique` function reads in the dataframe and removes duplicated serial numbers that were created during aggregation.

In [8]:
df = prepare.unique(df)
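
The notebook does not show how `prepare.unique` decides which duplicate row to keep, so the following is only one plausible approach. It assumes that, for a repeated serial number, the row with a recorded failure should win, with the greater drive age as the tie-breaker; `drop_duplicate_serials` is a hypothetical name.

```python
import pandas as pd


def drop_duplicate_serials(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per serial number (assumed rule: failed first, then oldest)."""
    ranked = df.sort_values(["failure", "drive_age_in_years"], ascending=False)
    deduped = ranked.drop_duplicates(subset="serial_number", keep="first")
    # Restore the original row order and renumber the index.
    return deduped.sort_index().reset_index(drop=True)
```

Comparing `df["serial_number"].is_unique` before and after the call is a quick way to confirm the step did its job.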

In [9]:
df.head()

| | serial_number | manufacturer | model | capacity_gigabytes | failure | drive_age_in_years | reallocated_sectors_count | reported_uncorrectable_errors | command_timeout | current_pending_sector_count | uncorrectable_sector_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PL1311LAG1SJAA | Hitachi | Hitachi HDS5C4040ALE630 | 4000.8 | 0 | 5.0 | 0.0 | NaN | NaN | 0.0 | 0.0 |
| 1 | Z305KB36 | Seagate | ST4000DM000 | 4000.8 | 0 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | MJ0351YNG9MZXA | Hitachi | Hitachi HDS5C3030ALA630 | 3000.6 | 0 | 4.8 | 0.0 | NaN | NaN | 0.0 | 0.0 |
| 3 | ZA11NHSN | Seagate | ST8000DM002 | 8001.6 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | MJ1311YNG2ZSEA | Hitachi | Hitachi HDS5C3030ALA630 | 3000.6 | 0 | 5.5 | 0.0 | NaN | NaN | 0.0 | 0.0 |


## Explore

Questions to answer:
- What does our data look like?
    - How many different models?
    - How many different manufacturers?
    - What are the different capacity sizes? 
    - How many hard drives are there for each manufacturer? Model?
- Does the data we obtained make sense? 
- Are there any observations that need to be dropped (why, how many)?
- Determine how to treat null values.
- Determine early failure cutoff by analyzing data.
- Does the average age of drive vary by model number?
- Which SMART attributes correlate most strongly with failure? 
- Which features or combination of features correlate with early failure?
- Do failure rates vary by model?
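
As a starting point for the last question, a simple per-group failure rate could be computed along these lines. The function name `failure_rate_by` and the `min_drives` threshold are illustrative assumptions, and the rate here is just the share of drives in each group that ever failed; it does not account for how long each drive was in service.

```python
import pandas as pd


def failure_rate_by(df: pd.DataFrame, column: str, min_drives: int = 1) -> pd.DataFrame:
    """Count drives and failures per group and rank groups by failure share."""
    summary = df.groupby(column)["failure"].agg(drives="count", failures="sum")
    summary["failure_rate"] = summary["failures"] / summary["drives"]
    # Optionally ignore groups too small to give a meaningful rate.
    summary = summary[summary["drives"] >= min_drives]
    return summary.sort_values("failure_rate", ascending=False)
```

Calling `failure_rate_by(df, "model", min_drives=100)` or `failure_rate_by(df, "manufacturer")` on the prepared dataframe would give a first look at how failures are distributed, with the threshold chosen by the analyst.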