## KSMC Project
Jonathan Armitage

The data set presented is from a Kaggle competition, [link](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management). The company providing the data is BNP Paribas Cardif, an insurer looking to use quantitative methods to improve the claims process as it relates to servicing its customers.

The systems in place that were used for this project:

* Python distribution through Continuum Analytics (Anaconda)
* Apache Spark
    + PySpark with Jupyter Notebook

### Discussion:
This data set needed a great deal of preprocessing because of the missing data and the sheer number of categories that were found. I dealt with this by splitting the process into two separate data cleaning procedures: clean the categorical features, then clean the numeric features. Partitioning it this way allowed for more concise, simplified code, since some methods are suited to cleaning categorical data (e.g. counting values) while others are better for numerical data.

Prediction of the test set was done by way of a logistic regression utilizing an 'l1' penalty to induce sparsity. The 'l1' penalty serves as a feature selection mechanism, given that it shrinks some (noninformative) coefficients to zero. Per the Kaggle competition website, participants' classification models were judged on the basis of log loss (cross entropy). Therefore, I scored my model using that metric as well.
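
To illustrate the feature selection effect of the 'l1' penalty, here is a minimal sketch on synthetic data. The data set, the value `C=0.05`, and the variable names are assumptions made for the illustration and are not taken from the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 20 features, only 4 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=4,
                                     n_redundant=0, random_state=0)

# liblinear supports the 'l1' penalty; a small C means strong regularization
clf_l1 = LogisticRegression(penalty='l1', C=0.05, solver='liblinear').fit(X_demo, y_demo)
clf_l2 = LogisticRegression(penalty='l2', C=0.05, solver='liblinear').fit(X_demo, y_demo)

# count coefficients that were set exactly to zero under each penalty
n_zero_l1 = int(np.sum(clf_l1.coef_ == 0))
n_zero_l2 = int(np.sum(clf_l2.coef_ == 0))
print('zero coefficients with l1:', n_zero_l1)
print('zero coefficients with l2:', n_zero_l2)
```

With the same regularization strength, the 'l1' fit should drop more coefficients than the 'l2' fit, which only shrinks them.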

### Ideas I Would Have Liked to Try:
1. Ability to create / find a better distribution of the categorical features
2. Utilize the pipeline feature to tune hyperparameters of ensemble models, as well as the logistic regression I used.
3. Better use of cross validation (for generalization error, and interpolation methods)
4. Include more visualizations of the data

Import necessary modules to enable analysis.

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import log_loss, auc, classification_report, confusion_matrix, roc_auc_score, f1_score, label_ranking_loss
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import validation_curve, learning_curve
import matplotlib.pyplot as plt
from time import process_time
%matplotlib inline
%config InlineBackend.figure_formats = {'svg',}

In [3]:
# import raw data
trainBNP = pd.read_csv('train.csv')
testBNP = pd.read_csv('test.csv')

In [4]:
# lots of missing values
    # pctNaN = fraction of all elements that are NaN

tElements_train = np.sum( trainBNP.count(numeric_only = False).values )
tElements_Numeric_train = np.sum( trainBNP.count(numeric_only = True).values )
tElements_NaN_train = tElements_train - tElements_Numeric_train
pctNaN = tElements_NaN_train / tElements_train

In [42]:
print('Elements Missing (percentage): %.3f' % (pctNaN))

Elements Missing (percentage): 0.202
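
As a cross-check, missing cells can also be counted directly with `isnull()`. Note that `count(numeric_only=True)` only restricts the count to numeric columns, so it is worth comparing the figure above against a direct count. The sketch below uses a small made-up dataframe (`toy` is a hypothetical name), not the competition data.

```python
import numpy as np
import pandas as pd

# toy frame (assumption): 3 columns x 4 rows, 4 missing cells in total
toy = pd.DataFrame({'v1': [1.0, np.nan, 3.0, 4.0],
                    'v2': ['A', None, 'B', 'A'],
                    'v3': [np.nan, np.nan, 0.5, 0.7]})

n_missing = int(toy.isnull().sum().sum())   # NaN/None cells over all columns
pct_missing = n_missing / toy.size          # toy.size = rows * columns
print('Elements Missing (fraction): %.3f' % pct_missing)
```

The same two lines applied to `trainBNP` would give the share of missing cells across both categorical and numeric columns.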


Given that a large part of the data set is missing, approximately 20%, this function will be used to create a dataframe of only the categorical features. This will ease the data cleaning process.

In [6]:
def catVar_func(df_train):
    
    """df_train = trainBNP"""
    """create list of categorical features -- map them to dataframe"""

    catVar_train = []
    for k in df_train.columns:
        if df_train[k].dtype == 'O':
            catVar_train.append(k)

    train_catVar = pd.DataFrame(data = df_train, columns = catVar_train)  
    
    """imputing missing values to most frequent category -- this allows for a one-to-one transformation that 
    does not cause issues when predicting onto the test set"""
    
    for y in train_catVar.columns:
        # assign back instead of a chained inplace fill, which may not modify the frame
        train_catVar[y] = train_catVar[y].fillna(train_catVar[y].value_counts().idxmax())
    
    train_catVar_wTgt = pd.concat([df_train['target'], train_catVar], axis = 1)
    
    return train_catVar, train_catVar_wTgt, catVar_train

In [7]:
train_catVar, train_catVar_wTgt, catVar_train = catVar_func(df_train=trainBNP)
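
The most-frequent-category imputation can be sketched on a toy frame as follows. The frames `toy_cat` and `toy_test` and their values are made up for illustration. Reusing the training modes on the test frame is one possible design choice, not necessarily what this notebook does.

```python
import pandas as pd

# toy categorical training frame (assumption) with missing entries
toy_cat = pd.DataFrame({'c1': ['A', 'A', None, 'B'],
                        'c2': [None, 'X', 'Y', 'Y']})

# most frequent level of each column (first row of mode() if there are ties)
modes = toy_cat.mode().iloc[0]

# fillna with a Series fills each column with its own value
toy_filled = toy_cat.fillna(modes)

# the same modes can be reused on another frame with the same columns
toy_test = pd.DataFrame({'c1': [None, 'B'], 'c2': ['X', None]})
toy_test_filled = toy_test.fillna(modes)
print(toy_filled)
print(toy_test_filled)
```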

I used another function for the categorical features in order to filter them relative to a threshold stated in terms of the number of categories within each categorical feature.

In [8]:
def catGroups(df_train_cat, catVar_train, thresH, minFac, maxFac):
    
    """df_cat = trainBNP_catvar_wTgt"""
    
    """this function is meant to parse the categorical features by first grouping 
    them together, calculating the number of categories within each feature and 
    benchmark each feature against the target: the absolute number of categories, 
    and the conditional expectation E[Tgt | catVar_i], benchmarked against the unconditional 
    expectation of the outcome, E[Tgt]"""
   
    meanTgt_train = []
    levl_train = []

    for t in catVar_train:
        meanTgt_train.append(np.mean(df_train_cat.groupby(t)['target'].aggregate(np.mean)))
        levl_train.append(len(df_train_cat[t].value_counts()))
    
    """new dataframe constructed to house the categorical features"""
    
    dfCatVar = pd.DataFrame(columns = ['catTgt_Mean', 'numbCat'], index = catVar_train)
    dfCatVar['catTgt_Mean'] = meanTgt_train
    dfCatVar['numbCat'] = levl_train
    dfCatVar['baseFactor'] = dfCatVar['catTgt_Mean'] / df_train_cat['target'].mean()
    
    """create binary factors based on researcher-chosen thresholds (number of categories [catThr],
    and relative decrease/increase versus the unconditional target mean [baseFactor <= minFac or baseFactor >= maxFac])"""

    catThr = thresH
    dfCatVar['inclCat_c1'] = np.where((dfCatVar['numbCat'] <= catThr), 1, 0)
    dfCatVar['inclCat_c2'] = np.where((dfCatVar['baseFactor'] <= minFac) | (dfCatVar['baseFactor'] >= maxFac), 1, 0)
    dfCatVar['incl_ovl'] = dfCatVar['inclCat_c1'] + dfCatVar['inclCat_c2']
    
    """grabs the features that met both the category threshold and the base factor threshold,
    and creates a dataframe of dummy variables"""
    
    catVar_nextStep = []
    for z in dfCatVar.index:
        if dfCatVar.loc[z, 'incl_ovl'] > 1:
            catVar_nextStep.append(z)
    
    dfTrain_catDum = pd.get_dummies(df_train_cat[catVar_nextStep])
    
    return dfTrain_catDum, catVar_nextStep

In [9]:
dfTrain_catDum, catVar_nextStep = catGroups(df_train_cat=train_catVar_wTgt, catVar_train=catVar_train,
                                            thresH=200, minFac=0.99, maxFac=1.02)
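
To make the benchmark concrete, the sketch below computes the same quantities for a single made-up feature: the per-level conditional means, their unweighted average, and the ratio to the unconditional target mean. The frame `toy` and the column `v_cat` are hypothetical.

```python
import pandas as pd

# toy data (assumption): one categorical feature with 3 levels
toy = pd.DataFrame({'target': [1, 1, 0, 1, 0, 0, 1, 1],
                    'v_cat':  ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']})

base_rate = toy['target'].mean()                      # E[Tgt] = 5/8
level_means = toy.groupby('v_cat')['target'].mean()   # E[Tgt | v_cat]: A=2/3, B=1/2, C=2/3
cat_tgt_mean = level_means.mean()                     # unweighted average over levels = 11/18
base_factor = cat_tgt_mean / base_rate                # ratio compared against minFac / maxFac
num_levels = toy['v_cat'].nunique()                   # compared against thresH

print('baseFactor: %.4f, levels: %d' % (base_factor, num_levels))
```

With `minFac=0.99` and `maxFac=1.02` as in the call above, this toy feature would pass the second criterion because its ratio falls below `minFac`.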

As stated above, categorical features were split off from numeric features into their own dataframe. The same thing is occurring within this function, except that an interpolation method ('intMethod') is specified as a parameter of the function. Interpolation was necessary because of the vast amount of missing data.

In [10]:
def numVar_func(df_train, intMethod):

    """df = trainBNP"""
    """intMethod = 'linear', 'quadratic', 'nearest' """
    """generates columns that are not categorical and appends them to a list. Those columns are then placed into
    a dataframe where an interpolation mechanism is administered"""
    
    # create list of numerical features

    numVar = []
    for a in df_train.columns[2:len(df_train.columns)]:
        if df_train[a].dtype != 'O':
            numVar.append(a)
        
    trainBNP_numVar = pd.DataFrame(data = df_train, columns = numVar)

    dfTrain_numVar = trainBNP_numVar.interpolate(method = intMethod)
    
    return dfTrain_numVar, numVar, intMethod

In [11]:
dfTrain_numVar, numVar, intMethod = numVar_func(df_train = trainBNP, intMethod='nearest')
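
A small sketch of what `interpolate` does may help here: it fills each column from the neighbouring rows, so the result depends on row order. The example uses `method='linear'` on made-up values. The median fill shown next to it is an alternative added for comparison and is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# toy numeric frame (assumption) with gaps
toy_num = pd.DataFrame({'v1': [1.0, np.nan, np.nan, 4.0],
                        'v2': [10.0, 20.0, np.nan, 40.0]})

# interpolate() works down each column, using the rows above and below a gap
interp = toy_num.interpolate(method='linear')

# alternative (not the notebook's method): fill with each column's median
median_filled = toy_num.fillna(toy_num.median())

print(interp)
print(median_filled)
```

If the rows are independent records, a column-wise statistic such as the median does not depend on how the rows happen to be ordered.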

To predict onto the test set, the test set needs to mirror the training set in terms of the categorical variables selected.

In [12]:
def testSet(df_test, numVar, catVar_nextStep, intMethod = 'nearest'):
    
    test_catVar = pd.DataFrame(data = df_test, columns = catVar_nextStep) 
    
    # impute missing categories with the most frequent category of each feature
    for n in test_catVar.columns:
        test_catVar[n] = test_catVar[n].fillna(test_catVar[n].value_counts().idxmax())
    
    #test_catVar.fillna(value = 'NA', inplace = True)
    dfTest_catDum = pd.get_dummies(test_catVar[catVar_nextStep])
    
    test_numVar = pd.DataFrame(data = df_test, columns = numVar)
    dfTest_numVar = test_numVar.interpolate(method = intMethod)
    
    X_test = np.concatenate((dfTest_catDum, dfTest_numVar), axis = 1)
    
    return X_test, dfTest_catDum, dfTest_numVar

In [13]:
X_test, dfTest_catDum, dfTest_numVar = testSet(df_test=testBNP, numVar=numVar,
                                               catVar_nextStep=catVar_nextStep, intMethod=intMethod)

Due to differences in the categorical variables in the train set and test set, i.e. a categorical feature in one set may have a different number of categories relative to the other set, the dimensionality of the sets needs to be amended so that they equal each other. Thus, the dimensionality of the set with the most categorical features after dummy variable transformation must be shrunk to equal the dimensionality of the set with the fewest categorical features.

In [14]:
def dimChange(df_train_cat, df_test_cat):
    
    """df_train_cat = dfTrain_catDum, df_test_cat = dfTest_catDum"""
    
    """due to differences in the categorical variables in the train set and test set, i.e. a categorical feature
    in one set may have a different number of categories relative to the other set, the dimensionality of
    the set with the most categorical features after dummy variable transformation must be shrunk to equal the
    dimensionality of the set with the fewest categorical features."""
    
    if df_train_cat.shape[1] > df_test_cat.shape[1]:
        dfTrain_catDum2 = pd.DataFrame(df_train_cat, columns = df_test_cat.columns)
        dfTrain_catDum2.fillna(value = 'NA', inplace = True)
        dfTrain_catDum3 = pd.get_dummies(dfTrain_catDum2[df_test_cat.columns])                      
    else:
        # nothing to shrink on the training side; avoids returning an undefined name
        dfTrain_catDum3 = df_train_cat
    
    return dfTrain_catDum3

In [15]:
dfTrain_catDum3 = dimChange(df_train_cat = dfTrain_catDum, df_test_cat = dfTest_catDum)
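
The column alignment described above can also be sketched with plain pandas indexing. This is an illustrative alternative on made-up data, not the implementation used in `dimChange`; all names below are hypothetical.

```python
import pandas as pd

# toy frames (assumption): level 'C' only in train, level 'D' only in test
train_cat = pd.DataFrame({'v': ['A', 'B', 'C', 'A']})
test_cat = pd.DataFrame({'v': ['A', 'B', 'D']})

train_dum = pd.get_dummies(train_cat)   # columns: v_A, v_B, v_C
test_dum = pd.get_dummies(test_cat)     # columns: v_A, v_B, v_D

# option 1: shrink both sets to the dummy columns they share
shared = sorted(set(train_dum.columns) & set(test_dum.columns))
train_aligned = train_dum[shared]
test_aligned = test_dum[shared]

# option 2: force the test set onto the training columns, filling unseen levels with 0
test_like_train = test_dum.reindex(columns=train_dum.columns, fill_value=0)

print(list(train_aligned.columns), list(test_like_train.columns))
```

Option 1 matches the "shrink to the smaller set" idea in the text, while option 2 keeps every training column so that a model fitted on the training dummies can be applied unchanged.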

Creation of the training set occurs within this function.

In [16]:
def trainSet(df_train, df_train_cat, df_train_numVar, test_size = 0.20):
    
    """df = trainBNP, df_train_cat = dfTrain_catDum3, df_train_numVar = dfTrain_numVar, test_size defaults to 0.20 """
    """function to prepare data for models"""

    X = np.concatenate((df_train_cat, df_train_numVar), axis = 1)
    y = np.asarray(df_train['target'])
    
    X_train, X_test_fT, y_train, y_test_fT = train_test_split(X, y, test_size = test_size, random_state = 1)
    
    return X_train, y_train, X_test_fT, y_test_fT

In [17]:
X_train, y_train, X_test_fT, y_test_fT = trainSet(df_train=trainBNP, df_train_cat=dfTrain_catDum3, 
                                                  df_train_numVar=dfTrain_numVar)

Logistic regression is used to model the response variable with an 'l1' regularization penalty.

In [36]:
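# liblinear supports the 'l1' penalty (it is the solver shown in the output below)
clfLR = LogisticRegression(penalty='l1', solver='liblinear')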
clfLR.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Predictions are made on the sets derived from the trainSet function, and on the Kaggle test set.

In [37]:
probPred_train = clfLR.predict_proba(X_test_fT)
probPred_train2 = clfLR.predict_proba(X_train)
probPred_test = clfLR.predict_proba(X_test)

Log loss is calculated for the training set and the held-out set.

In [38]:
ll_test_fT = log_loss(y_true = y_test_fT, y_pred = probPred_train) 
ll_trainSet = log_loss(y_true = y_train, y_pred = probPred_train2)

In [39]:
print('Log Loss Train Set: %.3f' % (ll_trainSet))
print('Log Loss held Out Set: %.3f' % (ll_test_fT))


Log Loss Train Set: 0.485
Log Loss held Out Set: 0.481
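
For reference, binary log loss is the negative mean log-likelihood of the true labels under the predicted probabilities. The sketch below computes it by hand on made-up predictions and compares it with `sklearn.metrics.log_loss`. The arrays are illustrative assumptions, not model output.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.6, 0.4])   # predicted P(y = 1), made-up values

# -(1/N) * sum( y*log(p) + (1 - y)*log(1 - p) )
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
from_sklearn = log_loss(y_true, p_hat)

print('manual: %.4f, sklearn: %.4f' % (manual, from_sklearn))
```

Lower is better, and confident wrong predictions are penalized heavily because of the logarithm.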


Below is where I was going to begin using the pipeline module in sklearn to tune different models.

In [None]:
pca = PCA()
clf = LogisticRegression(penalty='l1', solver='liblinear')
# scale the matrix X to have 0 mean and unit variance, then fit a logistic
# regression model (the PCA step is defined but not yet part of the pipeline)
n_comps = np.arange(1, 50, 10)
Cst = [0.001, 0.1, 1.0, 10.0]
#plts = ['l1', 'l2']

#params = dict(pca__n_components = n_comps,
#            clf__C = Cst,
#          clf__penalty = plts)

params = dict(clf__C = Cst)
             
             
pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('clf', clf)])
cv_count = 10


In [None]:
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=3, n_jobs=1)

In [None]:
est_m1 = GridSearchCV(pipe_lr, param_grid = params, cv = cv_count)

In [None]:
est_m1.fit(X_train, y_train)

In [None]:
est_m1.best_params_

In [None]:
est_m1.score(X_train, y_train)

In [None]:
# score on the labelled held-out split; the Kaggle test set has no labels
est_m1.score(X_test_fT, y_test_fT)

In [None]:
est_m1.grid_scores_
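
Since the competition metric is log loss, the grid search could select `C` on that metric rather than on the default accuracy. The following is a minimal sketch on synthetic data, assuming scikit-learn 0.18 or later (where `GridSearchCV` lives in `sklearn.model_selection` and the `'neg_log_loss'` scorer is available). The data and names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', LogisticRegression(penalty='l1', solver='liblinear'))])

# 'neg_log_loss' is the negated log loss, so a higher score means a lower loss
grid = GridSearchCV(pipe, param_grid={'clf__C': [0.001, 0.1, 1.0, 10.0]},
                    scoring='neg_log_loss', cv=3)
grid.fit(X_demo, y_demo)

best_C = grid.best_params_['clf__C']
best_log_loss = -grid.best_score_   # flip the sign back to a plain log loss
print('best C:', best_C, 'cv log loss: %.3f' % best_log_loss)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling.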