## Objective
The objective of this project is to analyze the hard drive dataset collected for XYZ Corporation. We will build a model to predict when a hard drive will fail, based on features drawn from the drives' SMART (Self-Monitoring, Analysis and Reporting Technology) attribute data, whose attributes are numbered up to 255.

Using the predictive model will help reduce XYZ Corporation's IT hardware costs by lowering the costs associated with unexpected hard drive failures, such as work delayed by downtime and the strain placed on internal processes when drives must be replaced quickly.

## Importing Data
The first step is to import our source data into our environment so we can begin the analysis. To do this, we will import the collected dataset CSV files from our local machine.

To import our dataset we will use the **Pandas** Python library. **Pandas** is a handy open-source data analysis library.

Since our source datasets are located on our local machine, we will also leverage the **glob** Python library to help with local file system operations.

In [24]:
import glob         # Load glob libraries
import pandas as pd # Load the Pandas libraries with alias 'pd'

# Path to dataset file directory within workspace
path = "./data_Q1_2018/"

# glob.glob(path + '*.csv') - returns List[str]
# pd.read_csv(f) - returns pd.DataFrame()
# for f in glob.glob() - returns a List[DataFrames]
# pd.concat() - returns one pd.DataFrame()
df = pd.concat([pd.read_csv(f, encoding='latin1') for f in glob.glob(path + '*.csv')])

# Check to see all data files were loaded - we know that there are 90 csv files once the 
# Zip file is extracted, so the Out[] should equal 90 if all csv files were properly loaded.
df.date.unique().shape

(90,)
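The loading step above can also be wrapped in a small helper. The following is a minimal sketch, not the notebook's own code: the function name `load_daily_csvs` is hypothetical, and it assumes that a fresh row index is wanted (the combined DataFrame above keeps each file's own index, which is why later output reads "0 to 100349" for almost 9 million rows).

```python
import glob
import os

import pandas as pd


def load_daily_csvs(directory, pattern="*.csv", encoding="latin1"):
    """Read every CSV matching `pattern` in `directory` into one DataFrame.

    Returns the combined DataFrame and the number of files that were read.
    (Hypothetical helper, shown for illustration only.)
    """
    # Sort so the files are read in a stable, reproducible order
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {directory}")
    frames = [pd.read_csv(f, encoding=encoding) for f in files]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True), len(files)
```

With the directory used above, a call would look like `df, n_files = load_daily_csvs("./data_Q1_2018/")`, and `n_files` can then be compared directly with the expected 90.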

## Data Exploration & Cleansing
Now that the CSV data files have been loaded into **Pandas** and combined into one DataFrame, we can begin our data exploration.

We'll start by checking out the first 5 rows (default) of the DataFrame to get a quick look at the dataset.

In [13]:
# Preview the first 5 lines of the loaded data 
df.head()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
0,2018-02-04,Z305B2QN,ST4000DM000,4000787030016,0,118.0,179547432.0,,,91.0,...,,,,,,,,,,
1,2018-02-04,PL1331LAHG1S4H,HGST HMS5C4040ALE640,4000787030016,0,100.0,0.0,134.0,102.0,100.0,...,,,,,,,,,,
2,2018-02-04,ZA16NQJR,ST8000NM0055,8001563222016,0,78.0,67614504.0,,,94.0,...,,,,,,,,,,
3,2018-02-04,ZA18CEBT,ST8000NM0055,8001563222016,0,100.0,1580176.0,,,96.0,...,,,,,,,,,,
4,2018-02-04,ZA18CEBS,ST8000NM0055,8001563222016,0,82.0,161864784.0,,,97.0,...,,,,,,,,,,


In [17]:
# Get the row and column count of the DataFrame to illustrate shape of the dataset.
df.shape

(8949492, 105)

### Analytic Approach
Now that we have a better idea of the data within the DataFrame, we need to make three distinctions to choose the right path forward for our analysis.

1. Is this going to be supervised or unsupervised learning?   
We have a defined dependent (y) variable, 'failure', so **Supervised Learning** can be performed: the computer learns from our clearly labeled dataset.


2. Is this a classification or regression problem?   
One way to identify the type of problem is to look at the y variable and see whether it is discrete or continuous, and whether it is categorical or quantitative. Since 'failure' is discrete and categorical, this is a **Classification** problem. More specifically, it is a **Binary Classification** problem, because each hard drive on a given day has either failed or not.


3. Is this a prediction or inference problem?   
The business use case is to create a model that can be leveraged to estimate when a hard drive will fail; therefore this is a **Prediction** problem. We want the model to estimate a y ('failure') value given a variety of features.

### Feature Selection
Since this dataset has a large number of columns (105), training and testing on all of them would take a long time and could lead to overfitting. Next, we need to determine which features are likely to be important to our target variable, to reduce the computational workload and improve performance.

One way is to review a more concise summary of the DataFrame. The **info()** method in the **Pandas** library allows us to do just that by printing information about a DataFrame, including the index dtype and column dtypes, non-null values and memory usage.

In [11]:
# View information about the DataFrame
# (null_counts was renamed to show_counts in pandas 1.2 and removed in pandas 2.0)
df.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8949492 entries, 0 to 100349
Data columns (total 105 columns):
date                    8949492 non-null object
serial_number           8949492 non-null object
model                   8949492 non-null object
capacity_bytes          8949492 non-null int64
failure                 8949492 non-null int64
smart_1_normalized      8948907 non-null float64
smart_1_raw             8948907 non-null float64
smart_2_normalized      2235026 non-null float64
smart_2_raw             2235026 non-null float64
smart_3_normalized      8948907 non-null float64
smart_3_raw             8948907 non-null float64
smart_4_normalized      8948907 non-null float64
smart_4_raw             8948907 non-null float64
smart_5_normalized      8949141 non-null float64
smart_5_raw             8949141 non-null float64
smart_7_normalized      8948907 non-null float64
smart_7_raw             8948907 non-null float64
smart_8_normalized      2235026 non-null float64
smart_8_raw 

When reviewing the concise summary of the DataFrame information, we should focus on the non-null and data type values for each feature, as well as the DataFrame memory usage. 

As we can see, the DataFrame is using over 7GB of memory. We must take into account any resource limitations in our environment and select only the most relevant features, while avoiding features whose non-null count is low compared with the overall observation (record) count of 8,949,492.
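The non-null comparison described above can be automated. This is a minimal sketch under stated assumptions: the helper name `columns_by_coverage` and the threshold are illustrative, and the demo frame is made-up data that mimics a column like smart_2_raw, which is populated for only about 25% of records in the summary above.

```python
import pandas as pd


def columns_by_coverage(frame, min_fraction=0.99):
    """Return the columns whose share of non-null values is at least `min_fraction`."""
    coverage = frame.notna().mean()  # per-column fraction of non-null rows
    return coverage[coverage >= min_fraction].index.tolist()


# Made-up stand-in for the real DataFrame
demo = pd.DataFrame({
    "smart_1_raw": [1.0, 2.0, 3.0, 4.0],
    "smart_2_raw": [None, None, None, 5.0],  # only 25% populated
    "failure": [0, 0, 1, 0],
})

kept = columns_by_coverage(demo, min_fraction=0.9)
print(kept)
```

The threshold is a judgment call: a stricter value keeps fewer columns, and the right cut-off depends on how much missing data the later modeling steps can tolerate.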

Using this approach, we narrowed down our features of interest to 23: 1, 2, 3, 4, 5, 7, 9, 10, 12, 187, 188, 189, 190, 191, 192, 193, 194, 197, 198, 199, 240, 241, 242.

However, this can still be considered a large feature set in environments with resource limitations. If you fall into this category, we recommend narrowing the feature set even further. Doing so requires a little online research to understand what the SMART stats represent and what others with IT operations expertise suggest.

The outcome of our research was to keep the 4 descriptive observation features and the y variable (failure), and to consider 15 SMART metrics. The 15 SMART metrics were determined as follows:

First, we included the [5 SMART metrics Backblaze](https://www.backblaze.com/blog/hard-drive-smart-stats/) uses in its analysis: smart_5 (Reallocated Sector Count), smart_187 (Reported Uncorrected Errors), smart_188 (Command Timeout), smart_197 (Current Pending Sector Count), smart_198 (Offline Uncorrectable).

After reviewing the [definitions](https://www.backblaze.com/blog-smart-stats-2014-8.html) of the remaining metrics, the following basic metrics also seemed worth checking:
smart_9 (Power On Hours), smart_193 (Load Cycle Count), smart_194 (Temperature Celsius), smart_241 (Total Logical Block Addresses Written), smart_242 (Total Logical Block Addresses Read).

Initially we had thought to include the five attributes newly added in 2018: smart_177 (Wear Range Delta), smart_179 (Used Reserve Block Count Total), smart_181 (Program Fail Count Total), smart_182 (Erase Fail Count), smart_235 (Good Block Count and System Free Block Count). However, for this analysis the data is too sparse to aid our predictions, so we ultimately decided to leave them out for the time being.

Finally, we decided that including both the raw and normalized values for each SMART metric was redundant, so we chose the 'raw', non-transformed values. If normalized values are required later in our analysis, we can normalize the raw values ourselves.
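Because the dataset names its columns `smart_<number>_raw`, the candidate column list can be built from the attribute numbers instead of being typed by hand. A small sketch, assuming the ten retained metrics named above; the variable names are illustrative only:

```python
# SMART attribute numbers named in the text above:
# Backblaze's five metrics plus the five basic metrics
smart_ids = [5, 187, 188, 197, 198, 9, 193, 194, 241, 242]

descriptive = ["date", "serial_number", "model", "capacity_bytes"]
target = ["failure"]

# Keep only the raw value of each metric, in ascending attribute order
raw_columns = [f"smart_{i}_raw" for i in sorted(smart_ids)]

candidate_columns = descriptive + target + raw_columns
print(len(candidate_columns))
```

Building the list this way makes it easy to add or drop a metric by editing a single number.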

In [51]:
# List of final features selected
df_selected_features = ['date', 'serial_number', 'model','capacity_bytes',
                        'failure', 'smart_5_raw', 'smart_9_raw',
                        'smart_193_raw','smart_194_raw','smart_197_raw', 
                        'smart_198_raw'
                        ]

# Candidate metrics currently left out of the selection:
# 'smart_187_raw', 'smart_188_raw', 'smart_241_raw','smart_242_raw'

# Remove all columns we no longer need for our analysis
df = df[df_selected_features]

# Check proper columns were removed
df.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8949492 entries, 0 to 100349
Data columns (total 11 columns):
date              8949492 non-null datetime64[ns]
serial_number     8949492 non-null object
model             8949492 non-null object
capacity_bytes    8949492 non-null int64
failure           8949492 non-null int64
smart_5_raw       8949141 non-null float64
smart_9_raw       8949141 non-null float64
smart_193_raw     8881919 non-null float64
smart_194_raw     8948907 non-null float64
smart_197_raw     8948907 non-null float64
smart_198_raw     8948907 non-null float64
dtypes: datetime64[ns](1), float64(6), int64(2), object(2)
memory usage: 819.4+ MB
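If the memory usage reported above is still too high for a given environment, narrower dtypes are one further option. The following is a hedged sketch on made-up data, not a step the analysis above performs: it assumes 'model' has few distinct values (53 models are counted later in this notebook) and that 'failure' only ever holds 0 or 1.

```python
import pandas as pd

# Made-up stand-in for the real DataFrame
demo = pd.DataFrame({
    "model": ["ST4000DM000", "ST8000NM0055"] * 500,
    "failure": [0, 1] * 500,
})
before = demo.memory_usage(deep=True).sum()

slim = demo.copy()
# Repeated strings are stored once each when the column is categorical
slim["model"] = slim["model"].astype("category")
# A 0/1 flag fits comfortably in a single byte
slim["failure"] = slim["failure"].astype("int8")
after = slim.memory_usage(deep=True).sum()

print(before, after)
```

Passing `deep=True` makes pandas measure the actual string contents, which is what the `+` in "819.4+ MB" leaves out.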


### Exploratory Data Analysis (EDA)
Now that we have narrowed our dataset to just the features we feel are important for our analysis, it is time to start learning about and understanding our dataset. During this step, we will note data points that need to be cleansed or corrected.

In [52]:
# Convert 'date' column from object to date data type.
df['date'] = pd.to_datetime(df['date'])

# Check data types properly converted
df.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8949492 entries, 0 to 100349
Data columns (total 11 columns):
date              8949492 non-null datetime64[ns]
serial_number     8949492 non-null object
model             8949492 non-null object
capacity_bytes    8949492 non-null int64
failure           8949492 non-null int64
smart_5_raw       8949141 non-null float64
smart_9_raw       8949141 non-null float64
smart_193_raw     8881919 non-null float64
smart_194_raw     8948907 non-null float64
smart_197_raw     8948907 non-null float64
smart_198_raw     8948907 non-null float64
dtypes: datetime64[ns](1), float64(6), int64(2), object(2)
memory usage: 819.4+ MB


In [53]:
# Preview our refined DataFrame
df.head()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_5_raw,smart_9_raw,smart_193_raw,smart_194_raw,smart_197_raw,smart_198_raw
0,2018-02-04,Z305B2QN,ST4000DM000,4000787030016,0,0.0,18766.0,34066.0,22.0,0.0,0.0
1,2018-02-04,PL1331LAHG1S4H,HGST HMS5C4040ALE640,4000787030016,0,0.0,8840.0,93.0,33.0,0.0,0.0
2,2018-02-04,ZA16NQJR,ST8000NM0055,8001563222016,0,0.0,6725.0,3467.0,35.0,0.0,0.0
3,2018-02-04,ZA18CEBT,ST8000NM0055,8001563222016,0,0.0,3702.0,2111.0,39.0,0.0,0.0
4,2018-02-04,ZA18CEBS,ST8000NM0055,8001563222016,0,0.0,3701.0,1953.0,37.0,0.0,0.0


In [54]:
# Get the number of distinct hard drives and hard drive models
num_hhd = df['serial_number'].nunique()
num_hhd_models = df['model'].nunique()

print('There are {0} hard drives and {1} different hard drive models.'.format(num_hhd, num_hhd_models))

# Replace Null Values with 0

There are 105004 hard drives and 53 different hard drive models.


In [55]:
# Get serial numbers of hard drives that recorded a failure
failed_hhd = df.loc[df.failure == 1, 'serial_number']

# Create DataFrame of all records belonging to the failed hard drives.
df_failed_hhd = df.loc[df["serial_number"].isin(failed_hhd)]

print('There were {0} hard drives that failed, failed dataframe shape {1}'.format(len(failed_hhd), df_failed_hhd.shape))

There were 336 hard drives that failed, failed dataframe shape (12836, 11)
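These counts show how imbalanced the two classes are. The arithmetic below is a quick sketch that assumes the 336 figure equals the number of records with failure == 1 and uses the record count from `df.shape`:

```python
n_rows = 8_949_492  # total drive-day records, from df.shape
n_failures = 336    # failure count printed above (assumed: rows with failure == 1)

failure_rate = n_failures / n_rows
print(f"failure rate: {failure_rate:.6%}")

# Accuracy of a trivial model that always predicts "no failure"
trivial_accuracy = 1 - failure_rate
print(f"always-healthy accuracy: {trivial_accuracy:.4%}")
```

With a positive class this rare, plain accuracy says very little about a model, so metrics such as precision and recall on the failure class are likely to be more informative when the models are evaluated.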


In [58]:
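from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Target variable and the SMART features used as predictors
y = df.failure
x = df[['smart_5_raw', 'smart_9_raw',
        'smart_193_raw','smart_194_raw','smart_197_raw', 
        'smart_198_raw']]

# Scale each feature to the [0, 1] range.
# Note: x still contains null values at this point (see the non-null counts above)
cor_x = MinMaxScaler().fit_transform(x)

# Hold out 20% of the records for testing
x_train, x_test, y_train, y_test = train_test_split(cor_x, y, test_size=0.2)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)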

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
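The error above comes from the null values still present in the SMART columns. One possible way forward is sketched below on made-up data. It rests on several assumptions that are not confirmed by the analysis so far: that a missing reading can be treated as 0 (as hinted by the earlier "Replace Null Values with 0" note), that the split should be stratified on the rare failure class, and that the scaler should be fitted on the training rows only so that no information from the test rows leaks into it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Small synthetic stand-in for the real DataFrame (values are made up)
demo = pd.DataFrame({
    "failure": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0] * 2,
    "smart_5_raw": [0.0, np.nan, 0.0, 8.0, 16.0, 0.0, 0.0, 24.0, np.nan, 0.0] * 2,
    "smart_194_raw": [22.0, 33.0, np.nan, 35.0, 39.0, 37.0, 30.0, 41.0, 28.0, 25.0] * 2,
})

feature_cols = ["smart_5_raw", "smart_194_raw"]
x = demo[feature_cols].fillna(0)  # assumption: a missing reading is treated as 0
y = demo["failure"]

# Split first, keeping the failure ratio similar in both parts
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = MinMaxScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.shape, y_train.shape)
print(x_test_scaled.shape, y_test.shape)
```

Filling with 0 is only one choice. Dropping the affected rows, or imputing a per-model median, may suit some attributes better (a temperature of 0, for example, is not a neutral value), so the fill strategy is worth revisiting for each column.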

## Baseline Modeling

## Secondary Modeling

## Results

## Conclusion