## This notebook was created as part of the pre-assessment application process for State Farm.

### Applicant: Moutaz Elias

Data_loader.ipynb loads the test and training data, runs the exploratory analysis, cleans the data (one-hot encoding, feature extraction, and filling NaN values), and splits the training data into train and validation sets.
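
The cleaning and splitting steps listed for Data_loader.ipynb can be sketched roughly as follows. This is only an illustration on a made-up frame: the column names `x_num`, `x_cat`, and `y`, the median fill, and the 25% validation split are assumptions, not the choices actually made in Data_loader.ipynb.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the real columns are not shown in this notebook.
df = pd.DataFrame({
    "x_num": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, np.nan, 8.0],
    "x_cat": ["a", "b", "a", None, "b", "a", "b", "a"],
    "y": [0, 1, 0, 1, 0, 1, 0, 1],
})

features = df.drop(columns="y")

# Fill missing numeric values with the column median (an assumed strategy).
features["x_num"] = features["x_num"].fillna(features["x_num"].median())

# One-hot encode the categorical column; dummy_na=True adds an indicator
# column for missing categories instead of silently dropping them.
features = pd.get_dummies(features, columns=["x_cat"], dummy_na=True)

# Stratified split so both classes appear in the train and validation sets.
x_tr, x_va, y_tr, y_va = train_test_split(
    features, df["y"], test_size=0.25, stratify=df["y"], random_state=0
)
```

In a real workflow the fill value would normally be computed on the training split only, so that no information from the validation rows leaks into training.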

Modules.ipynb builds and trains the models and optimizes their hyperparameters. Logistic regression and SVM/random forest (depending on available compute) are the models of choice.

Deploy.ipynb deploys the models on the train and test data sets, producing probabilities as outputs for the test data set and various evaluation metrics for the train data set.

Data has been extracted from: https://drive.google.com/drive/folders/1J7N62rO2E-cC_H-ymPskgMINGtmPT_rg

# Modules.ipynb

In [1]:
#importing relevant libraries
import numpy as np
import scipy as sp
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import pickle as pkl
import os

In [2]:
# Turn off warnings and set other Jupyter-specific options
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [3]:
# Load the train and validation sets; the with-blocks close each file after reading
with open("x_train.pickle", "rb") as pickle_in:
    x_train = pkl.load(pickle_in)

with open("y_train.pickle", "rb") as pickle_in:
    y_train = pkl.load(pickle_in)

with open("x_val.pickle", "rb") as pickle_in:
    x_val = pkl.load(pickle_in)

with open("y_val.pickle", "rb") as pickle_in:
    y_val = pkl.load(pickle_in)

evaluationg_metrics='roc_auc'

## Logistic regression model

In [4]:
# Instantiate the logistic regression model
Log_model=LogisticRegression(verbose=True)

In [5]:
# The exercise asks for feature selection.
# Recursive feature elimination with cross-validation (RFECV) in sklearn automates this.
Log_fr= RFECV(Log_model,scoring = evaluationg_metrics,n_jobs = -1,cv = 5,step = 2)
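
To see what RFECV produces once fitted, here is a minimal sketch on synthetic data. The data set from `make_classification` and the `max_iter=1000` setting are assumptions for illustration only; the `scoring`, `cv`, and `step` values mirror the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for x_train / y_train (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)

# Drop 2 features per round and keep the feature count with the best CV ROC AUC.
selector = RFECV(LogisticRegression(max_iter=1000), scoring="roc_auc", cv=5, step=2)
selector.fit(X_demo, y_demo)

print("Features kept:", selector.n_features_)
print("Selection mask:", selector.support_)

# transform() keeps only the selected columns.
X_reduced = selector.transform(X_demo)
```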

In [6]:
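# Hyperparameter tuning via grid search
# Regularization penalty space. Note: 'l1' is not supported by the default
# lbfgs solver, so those candidates fail to fit unless a solver such as
# liblinear or saga is set on the estimator.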
penalty = ['l1', 'l2']

# Create regularization hyperparameter space
C = np.logspace(0, 4, 3)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

CV_log_fr=GridSearchCV(Log_model,param_grid=hyperparameters,scoring = evaluationg_metrics,cv = 5)
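
Since the default `lbfgs` solver does not support the `l1` penalty, one way to search over both penalties is to pick a solver that handles both. The sketch below uses `solver="liblinear"` on synthetic data; the solver choice and the data are assumptions and not what was run in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# liblinear supports both 'l1' and 'l2', so every grid point can be fitted.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": np.logspace(0, 4, 3), "penalty": ["l1", "l2"]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
# One mean CV score per (C, penalty) combination; none should be NaN.
print(grid.cv_results_["mean_test_score"])
```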

In [7]:
%%time

#creating and training pipeline
pipeline1=Pipeline([('feature_sele',Log_fr),('clf_cv',CV_log_fr)])

pipeline1.fit(x_train, y_train)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s finished

CPU times: user 55.8 s, sys: 2.16 s, total: 57.9 s
Wall time: 2min 23s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s finished


Pipeline(steps=[('feature_sele',
                 RFECV(cv=5, estimator=LogisticRegression(verbose=True),
                       n_jobs=-1, scoring='roc_auc', step=2)),
                ('clf_cv',
                 GridSearchCV(cv=5, estimator=LogisticRegression(verbose=True),
                              param_grid={'C': array([1.e+00, 1.e+02, 1.e+04]),
                                          'penalty': ['l1', 'l2']},
                              scoring='roc_auc'))])

In [8]:
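# Predict probabilities on the validation set
# Note: the final pipeline step is a GridSearchCV with scoring='roc_auc', so
# pipeline1.score() returns ROC AUC rather than accuracy. The value printed
# as 'Accuracy' below is therefore the same AUC.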
log_reg_prob = pipeline1.predict_proba(x_val)
print('Accuracy: ', pipeline1.score(x_val, y_val))
print('AUC: ', roc_auc_score(y_val, log_reg_prob[:,1]))

Accuracy:  0.9062835792176034
AUC:  0.9062835792176034
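
The two printed numbers are identical because `score()` on a `GridSearchCV` uses the scorer passed as `scoring`. A small sketch on synthetic data (an assumption, as are the `C` values) shows the difference between that score and a true accuracy computed with `accuracy_score`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the train/validation sets (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_tr, y_tr)

# score() applies the roc_auc scorer, not accuracy.
score_value = search.score(X_va, y_va)
auc_value = roc_auc_score(y_va, search.predict_proba(X_va)[:, 1])
# Accuracy has to be computed explicitly from the predicted labels.
acc_value = accuracy_score(y_va, search.predict(X_va))

print("score():  ", score_value)
print("ROC AUC:  ", auc_value)
print("Accuracy: ", acc_value)
```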


In [9]:
# Save the fitted pipeline

with open("logModel.pickle", "wb") as pickle_out:
    pkl.dump(pipeline1, pickle_out)
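
For reference, a saved model can be read back in another notebook and used for prediction. The sketch below round-trips a small stand-in model through a temporary file; the file name `demo_model.pickle` and the synthetic data are hypothetical.

```python
import os
import pickle as pkl
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model on synthetic data (assumption).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pickle")

with open(path, "wb") as f:
    pkl.dump(model, f)

with open(path, "rb") as f:
    restored = pkl.load(f)

# The restored model should reproduce the original probabilities.
same = np.allclose(restored.predict_proba(X_demo), model.predict_proba(X_demo))
print("Round trip OK:", same)
```

Pickled sklearn models are only reliably loadable with the same sklearn version that wrote them, so the loading notebook should run in the same environment.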

## SVM model

In [10]:
# Try an SVC model first; if it does not finish by the morning, fall back to a random forest.
SVM_Model=SVC(probability=True)

In [11]:
#hyperparameter tuning can be done by grid search
kernel = ['rbf']

# Create regularization hyperparameter space
C = np.logspace(0, 4, 3)

# Two gamma values
gamma=[1e-3,1e-4]

# Create hyperparameter options
hyperparameters = dict(kernel=kernel, C=C, gamma=gamma)

CV_svm_fr=GridSearchCV(SVM_Model,param_grid=hyperparameters,n_jobs = -1,scoring = evaluationg_metrics,cv = 5)

In [12]:
%%time

#Training model
CV_svm_fr.fit(x_train, y_train)

CPU times: user 12min 3s, sys: 5.12 s, total: 12min 8s
Wall time: 2h 51min 18s


GridSearchCV(cv=5, estimator=SVC(probability=True), n_jobs=-1,
             param_grid={'C': array([1.e+00, 1.e+02, 1.e+04]),
                         'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
             scoring='roc_auc')

In [13]:
%%time

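# Predict probabilities on the validation set
# Note: CV_svm_fr.score() uses scoring='roc_auc', so the value printed as
# 'Accuracy' below is a ROC AUC score, not classification accuracy.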
svm_reg_prob = CV_svm_fr.predict_proba(x_val)
print('Accuracy: ', CV_svm_fr.score(x_val, y_val))
print('AUC: ', roc_auc_score(y_val, svm_reg_prob[:,1]))

Accuracy:  0.9896557620220235
AUC:  0.989654203171773
CPU times: user 10.5 s, sys: 71.9 ms, total: 10.5 s
Wall time: 10.6 s


In [14]:
# Save the fitted grid search

with open("SVM.pickle", "wb") as pickle_out:
    pkl.dump(CV_svm_fr, pickle_out)
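
The random forest fallback mentioned at the start of this section was not needed, since the SVC search finished. Had it been, it could have looked roughly like the sketch below. The parameter grid and the synthetic data are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical small grid; a real search would cover more values.
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="roc_auc",
    cv=3,
)
rf_grid.fit(X_demo, y_demo)

# Probability of the positive class, analogous to svm_reg_prob[:, 1].
rf_prob = rf_grid.predict_proba(X_demo)[:, 1]
print(rf_grid.best_params_)
```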

# End of file