# State Farm Classification Coding Exercise

## Part 2 - Test Data Set Exploratory Data Analysis and Feature Engineering

### A. Import Libraries and Test Data Set, and Check for Missing Values

** Import numpy and pandas. **

In [None]:
import numpy as np
import pandas as pd

** Import data visualization libraries and set %matplotlib inline. **

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

** Import the Exercise 2 test data set from its comma-separated values (CSV) file into a Pandas dataframe. **

In [None]:
test = pd.read_csv('../State_Farm/Data/exercise_02_test.csv', sep=',')

** Create copy of test dataframe for exploratory data analysis and feature engineering. **

In [None]:
test1 = test.copy()

** View first five rows of test dataframe. **

In [None]:
test.head()

** Obtain number of rows and columns in test dataframe. **

In [None]:
test1.shape

** Check for presence of missing values for all features. **

In [None]:
test1.isnull().sum().sort_values(ascending=False)

### B. Explore and Engineer Numerical Features 

** Identify which test data set features are categorical. **

In [None]:
test1.select_dtypes(exclude=['int64', 'float']).columns

** Check that the data types for all numerical features are float64. **

In [None]:
num_features = test1.columns.difference(['x34', 'x35', 'x41', 'x45', 'x68', 'x93'])

In [None]:
test1[num_features].info()

** View histograms of numerical features to inspect their distributions. **

In [None]:
test1[num_features].hist(figsize=(20,16));

* The histograms suggest that all the numerical features are approximately normally distributed. Each feature has between 1 and 6 missing values out of the 10,000 rows in the test data set. Because so few values are missing and the distributions are symmetric, I decided to impute the missing values with each feature's mean.

** Impute missing values in numerical features with mean. **

In [None]:
test2 = test1.fillna(test1[num_features].mean())
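
The imputation step relies on `fillna` matching a Series of column means to the dataframe's columns by label, so only the numerical columns are filled. A minimal sketch on a toy dataframe illustrates this; the column names `a`, `b`, and `c` are hypothetical and not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Toy frame: two numeric columns and one string column, each with a missing value.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [10.0, 20.0, np.nan],
    'c': ['x', None, 'z'],
})

# Means of the numeric columns only; the result is a Series indexed by column name.
means = toy[['a', 'b']].mean()

# fillna pairs each Series entry with the column of the same label,
# so column 'c' keeps its missing value.
filled = toy.fillna(means)
print(filled)
```

Because the string column is left untouched, the categorical features can be handled separately later.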

** Check numerical features for any missing values. **

In [None]:
test2[num_features].isnull().sum().sort_values(ascending=False)

** View histograms of imputed numerical features to check whether the mean imputation distorted their distributions. **

In [None]:
test2[num_features].hist(figsize=(20,16));

* The histograms for all the numerical features show that their distributions remain approximately normal after imputing the missing values with the mean.
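
A numerical comparison can complement the visual check. The sketch below uses simulated data, not the exercise data, and assumes six missing values in 10,000 rows to mirror the worst case described earlier. Mean imputation leaves the mean unchanged and shrinks the standard deviation only slightly.

```python
import numpy as np
import pandas as pd

# Simulated, roughly normal feature with six missing values (illustrative only).
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=10_000))
s.iloc[:6] = np.nan

imputed = s.fillna(s.mean())

# Side-by-side summary statistics before and after imputation.
summary = pd.DataFrame({'before': s.describe(), 'after': imputed.describe()})
print(summary.loc[['count', 'mean', 'std']])
```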

### C. Explore and Engineer Categorical Features

** Check categorical feature data types. **

In [None]:
cat_features1 = ['x34', 'x35', 'x41', 'x45', 'x68', 'x93']

In [None]:
test2[cat_features1].info()

** View summary statistics for categorical features. **

In [None]:
test2.describe(include=['object'])

** Convert currency and percent string features (x41 and x45) to float data type. **

In [None]:
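# regex=False strips the literal characters; read as a regular expression, '$' would match the end of the string instead
test2['x41_flt'] = test2['x41'].str.replace('$', '', regex=False).astype(float)
test2['x45_pct'] = test2['x45'].str.replace('%', '', regex=False).astype(float)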
test3 = test2.drop(['x41', 'x45'], axis=1)
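
A minimal sketch of the string-to-float conversion on toy data follows. The column names `price` and `rate` and their values are hypothetical. Passing `regex=False` to `str.replace` makes the `$` a literal character; interpreted as a regular expression it would be the end-of-string anchor, and the dollar sign would stay in place.

```python
import numpy as np
import pandas as pd

# Toy currency and percent strings, each with one missing value.
raw = pd.DataFrame({
    'price': ['$1.50', '$-20.25', np.nan],
    'rate': ['0.01%', '-0.02%', np.nan],
})

# Strip the literal symbols, then cast to float; missing values stay NaN.
price = raw['price'].str.replace('$', '', regex=False).astype(float)
rate = raw['rate'].str.replace('%', '', regex=False).astype(float)
print(price.tolist(), rate.tolist())
```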

** Check the number of missing values for the numerical x41 and x45 features. **

In [None]:
test3[['x41_flt', 'x45_pct']].isnull().sum()

** View histograms of numerical x41 and x45 features to inspect their distributions. **

In [None]:
test3[['x41_flt', 'x45_pct']].hist();

* The numerical x41 and x45 features appear approximately normally distributed, and each has 2 missing values out of the 10,000 rows in the test data set. Given these conditions, I again decided to impute the missing values with the mean of the feature.

** Impute missing values in numerical x41 and x45 features with mean. **

In [None]:
test4 = test3.fillna(test3[['x41_flt', 'x45_pct']].mean())

** View histograms of imputed numerical x41 and x45 features to check whether the mean imputation distorted their distributions. **

In [None]:
test4[['x41_flt', 'x45_pct']].hist();

* The histograms for the numerical x41 and x45 features show that their distributions remain approximately normal after imputing the missing values with the mean.

** Check for features that still have missing values. **

In [None]:
test4.isnull().sum().sort_values(ascending=False).head()

** Identify remaining categorical features. **

In [None]:
test4.select_dtypes(exclude=['int64', 'float']).columns

** View bar plots for categorical features of x34, x35, x68, and x93. **

In [None]:
test4.x34.value_counts().plot(kind='bar');

In [None]:
test4.x35.value_counts().plot(kind='bar');

In [None]:
test4.x68.value_counts().plot(kind='bar');

In [None]:
test4.x93.value_counts().plot(kind='bar');

* The missing values for the categorical features x34, x35, x68, and x93 are truly blank, and imputing them sensibly would require more domain knowledge than is available here. Going forward, I will assign these missing values to their own "missing" category.

** Replace all categorical feature missing values with their own missing category. **

In [None]:
test4['x34'] = test4.x34.fillna('No_Car_Make')
test4['x68'] = test4.x68.fillna('No_Month')
test4['x93'] = test4.x93.fillna('No_Continent')

** Check that all categorical features have zero missing values. **

In [None]:
test4[['x34', 'x35', 'x68', 'x93']].isnull().sum().sort_values(ascending=False)

** Obtain value counts for each x34 category. **

In [None]:
test4.x34.value_counts()

** Clean x34 feature car make names and obtain value counts again. **

In [None]:
test4['x34'] = test4.x34.map({'volkswagon':'Volkswagen', 'Toyota':'Toyota', 'bmw':'BMW', 'Honda':'Honda', 'tesla':'Tesla',
                              'chrystler':'Chrysler', 'nissan':'Nissan', 'ford':'Ford', 'mercades':'Mercedes',
                              'chevrolet':'Chevrolet', 'No_Car_Make':'No_Car_Make'})
test4.x34.value_counts()
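
One caveat with `Series.map` and a dictionary is that any value absent from the dictionary becomes `NaN`. The sketch below, on a toy series with a hypothetical `'audi'` entry, shows one way to spot such values and one possible fallback. It is an illustration, not a step the notebook performs.

```python
import pandas as pd

# Toy data: 'audi' is deliberately absent from the mapping.
makes = pd.Series(['bmw', 'Toyota', 'audi'])
mapping = {'bmw': 'BMW', 'Toyota': 'Toyota'}

mapped = makes.map(mapping)

# Values present before mapping but missing afterwards were not covered by the dictionary.
unmapped = makes[mapped.isnull() & makes.notnull()].unique()
print(unmapped)

# One option: fall back to the original spelling for uncovered values.
cleaned = mapped.fillna(makes)
print(cleaned.tolist())
```

Running a check like this before creating dummy features helps ensure that a spelling variant not seen earlier does not silently turn into a missing value.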

** Create x34 dummy features with Volkswagen as the reference category and add them to the test dataframe. **

In [None]:
x34_dummies = pd.get_dummies(test4.x34).drop('Volkswagen', axis=1)
test5 = pd.concat([test4, x34_dummies], axis=1)
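
Dropping one dummy column makes that category the reference level: a row with zeros in every remaining column belongs to the dropped category. If a category never occurs in the test set, its dummy column will be absent, so the sketch below also aligns the result to a fixed column list. The `expected_cols` list is a hypothetical stand-in for the dummy columns produced from the training set.

```python
import pandas as pd

# Toy car makes; 'Honda' does not occur in this sample.
makes = pd.Series(['Volkswagen', 'Toyota', 'BMW', 'Volkswagen'])

# One-hot encode and drop the reference category.
dummies = pd.get_dummies(makes).drop('Volkswagen', axis=1)

# Hypothetical column list taken from the training set's dummy features.
expected_cols = ['BMW', 'Honda', 'Toyota']

# Add any missing columns as all zeros and enforce a consistent column order.
aligned = dummies.reindex(columns=expected_cols, fill_value=0).astype(int)
print(aligned)
```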

** Obtain value counts for each x35 category. **

In [None]:
test5.x35.value_counts()

** Clean x35 feature weekday names and obtain value counts again. **

In [None]:
test5['x35'] = test5.x35.map({'wed':'Wednesday', 'thurday':'Thursday', 'wednesday':'Wednesday', 'thur':'Thursday', 
                              'tuesday':'Tuesday', 'friday':'Friday', 'monday':'Monday', 'fri':'Friday'})
test5.x35.value_counts()

** Create x35 dummy features with Wednesday as the reference category and add them to the test dataframe. **

In [None]:
x35_dummies = pd.get_dummies(test5.x35).drop('Wednesday', axis=1)
test6 = pd.concat([test5, x35_dummies], axis=1)

** Obtain value counts for each x68 category. **

In [None]:
test6.x68.value_counts()

** Clean x68 feature month names and obtain value counts again. **

In [None]:
test6['x68'] = test6.x68.map({'July':'July', 'Jun':'June', 'Aug':'August', 'May':'May', 'sept.':'September', 'Apr':'April', 
                              'Oct':'October', 'Mar':'March', 'Nov':'November', 'Feb':'February', 'Dev':'December', 
                              'January':'January', 'No_Month':'No_Month'})
test6.x68.value_counts()

** Create x68 dummy features with July as the reference category and add them to the test dataframe. **

In [None]:
x68_dummies = pd.get_dummies(test6.x68).drop('July', axis=1)
test7 = pd.concat([test6, x68_dummies], axis=1)

** Obtain value counts for each x93 category. **

In [None]:
test7.x93.value_counts()

** Clean x93 feature continent names and obtain value counts again. **

In [None]:
test7['x93'] = test7.x93.map({'asia':'Asia', 'america':'America', 'euorpe':'Europe', 'No_Continent':'No_Continent'})
test7.x93.value_counts()

** Create x93 dummy features with Asia as the reference category and add them to the test dataframe. **

In [None]:
x93_dummies = pd.get_dummies(test7.x93).drop('Asia', axis=1)
test8 = pd.concat([test7, x93_dummies], axis=1)

### D. Finalize and Export Cleaned Test Data Set

** Drop categorical features from test dataframe. **

In [None]:
test_cleaned = test8.drop(['x34', 'x35', 'x68', 'x93'], axis=1)

** Obtain number of rows and columns in test dataframe with engineered and cleaned features. **

In [None]:
test_cleaned.shape

** Check for any remaining missing values in test dataframe with engineered and cleaned features. **

In [None]:
test_cleaned.isnull().sum().sort_values(ascending=False)

** Export test dataframe with engineered and cleaned features to CSV file. **

In [None]:
test_cleaned.to_csv('../State_Farm/Data/test_cleaned.csv', sep=',', index=False)

** Save test dataframe with engineered and cleaned features to pickle file for models to make predictions on. **

In [None]:
test_cleaned.to_pickle('../State_Farm/Data/test_cleaned.pickle')
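
A quick round-trip check can confirm that both exports are readable. The sketch below writes a toy dataframe to a temporary directory rather than to the project's data folder; the file names and columns are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned test dataframe.
df = pd.DataFrame({'x0': [0.5, 1.5], 'Tesla': [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy.csv')
    pkl_path = os.path.join(tmp, 'toy.pickle')

    # Write both formats, then read them back.
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# The pickle restores the dataframe exactly, including dtypes.
print(from_pkl.equals(df))
```

The pickle preserves column dtypes as they were, whereas `read_csv` infers them again from the text, which is one reason to keep both files.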