### CD Purchase - Customer Classification

This project provides an initial analysis of the provided bank-client dataset. Of the available files, "bank-additional-full.csv" is used for the analysis and to arrive at the best predictive model.

Ground Truth: The sales team needs a model that identifies the customers who buy the CD. The focus is therefore on the label "yes" ("1" here), with the aim of maximizing the F1 score on this label.

#### Importing Dependencies and Libraries

In [198]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier,AdaBoostClassifier, VotingClassifier
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import confusion_matrix
%matplotlib inline

#### Data Loading and Analysis

In [148]:
dataset = pd.read_csv("bank-additional/bank-additional-full.csv", delimiter = ";")
dataset.head() # View the first few records of the dataset - most of the variables are categorical

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [149]:
# Rows containing at least one null value
nan_rows = dataset[dataset.isnull().any(axis=1)]
nan_rows # there are no rows with null values

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y


In [150]:
# Missing values are sometimes logged as special characters such as "?"
dataset[dataset['education'] == "?"] # This was checked on every column for a few common special characters

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
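
The per-column check described above can also be run in one pass. The following is a minimal sketch, not the procedure actually used in this notebook: the helper name `count_placeholders`, the token set, and the toy frame are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical placeholder tokens; extend the set to match the data at hand.
PLACEHOLDERS = {"?", "-", "", "NA", "N/A"}

def count_placeholders(df, tokens=PLACEHOLDERS):
    """Count, per non-numeric column, the cells that hold a placeholder token."""
    text_cols = df.select_dtypes(exclude="number")
    counts = text_cols.apply(lambda col: col.str.strip().isin(tokens).sum())
    return counts[counts > 0]  # keep only columns with at least one hit

# Toy frame standing in for the bank data (assumed values)
toy = pd.DataFrame({
    "education": ["basic.4y", "?", "high.school"],
    "job": ["admin.", "services", "NA"],
    "age": [56, 57, 37],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

An empty result would correspond to the empty tables shown above, i.e. no placeholder tokens found in any column.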


#### Handling Duplicate Entries

In [151]:
df_unique = dataset.drop_duplicates() # dropping duplicate entries
print (len(df_unique))
print (len(dataset))

41176
41188


In [152]:
df_duplicates = dataset[dataset.duplicated()]
print (len(df_duplicates))


12


#### Feature Engineering

Job, marital, education, default, housing, loan, contact, etc.: categorical variables, converted to numeric codes  
Duration, campaign, pdays, previous: numeric variables

In [153]:
cleanup = {
    "job": {"admin.": 1, "blue-collar": 2, "technician": 3, "services": 4, "management": 5, "retired": 6,
            "entrepreneur": 7, "self-employed": 8, "housemaid": 9, "unemployed": 10, "student": 11, "unknown": 0},
    "marital": {"married": 1, "single": 2, "divorced": 3, "unknown": 0},
    "education": {"university.degree": 1, "high.school": 2, "basic.9y": 3, "professional.course": 4, "basic.4y": 5,
                  "basic.6y": 6, "unknown": 0, "illiterate": 7},
    "default": {"no": 1, "unknown": 0, "yes": 2},
    "housing": {"yes": 1, "no": 2, "unknown": 0},
    "loan": {"no": 2, "yes": 1, "unknown": 0},
    # NOTE: both contact values map to 1, which makes this column constant
    # (hence its zero feature importance later on); distinct codes such as
    # {"cellular": 1, "telephone": 2} would preserve the information.
    "contact": {"cellular": 1, "telephone": 1},
    "month": {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6, "jul": 7, "aug": 8, "sep": 9, "oct": 10,
              "nov": 11, "dec": 12},
    "day_of_week": {"mon": 1, "tue": 2, "wed": 3, "thu": 4, "fri": 5},
    "poutcome": {"nonexistent": 0, "failure": 1, "success": 2},
    "y": {"no": 0, "yes": 1},
}
# Reassign instead of using inplace=True, which raises SettingWithCopyWarning on this frame
df_unique = df_unique.replace(cleanup)
df_unique.head()


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,9,1,5,1,2,2,1,5,1,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
1,57,4,1,2,0,2,2,1,5,1,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
2,37,4,1,2,1,1,2,1,5,1,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
3,40,1,1,6,1,2,2,1,5,1,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
4,56,4,1,2,1,2,1,1,5,1,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
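
Integer codes like the ones above impose an arbitrary order on nominal categories (for example, "single" = 2 sits between "married" = 1 and "divorced" = 3). Tree-based models usually cope with this, but linear models may not. As an alternative that this notebook does not use, a one-hot encoding can be sketched with `pd.get_dummies`; the toy frame below is an assumption for illustration.

```python
import pandas as pd

# Toy frame standing in for the bank data (assumed values)
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "age": [56, 37, 40, 57],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["marital"], dtype=int)
print(encoded)
```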


In [154]:
# Binary Labels ratio
df_unique.y.value_counts()

0    36537
1     4639
Name: y, dtype: int64

#### Data Preparation

In [155]:
# Split into feature columns and the label column
df_X = df_unique.drop(columns="y")
# scikit-learn expects the labels as a 1-D array of shape (n_samples,)
df_Y = df_unique["y"].astype(int).values
df_Y.shape

(41176,)

#### Feature Selection

Of the many feature selection techniques available, I have chosen tree-based feature selection, in which a tree-based classifier ranks the features and those whose importance exceeds a threshold are kept. The default threshold is the mean of the feature importances.

In [188]:
fs = ExtraTreesClassifier()
fs = fs.fit(df_X, df_Y)
fs_model = SelectFromModel(fs, prefit = True)
df_X_new = fs_model.transform(df_X) # NumPy array containing only the selected features


In [189]:
importances = fs.feature_importances_
importances

array([ 0.09372457,  0.060395  ,  0.03271347,  0.0563294 ,  0.00969296,
        0.02969112,  0.02010927,  0.        ,  0.01834122,  0.04995141,
        0.27101174,  0.05952224,  0.04966201,  0.01006202,  0.03797496,
        0.01623596,  0.0192206 ,  0.02923299,  0.1177001 ,  0.01842897])

In [190]:
indices = np.argsort(importances)[::-1]
indices

array([10, 18,  0,  1, 11,  3,  9, 12, 14,  2,  5, 17,  6, 16, 19,  8, 15,
       13,  4,  7], dtype=int64)

In [191]:
df_X_new.shape

(41176, 6)

Features Selected = ['duration', 'euribor3m', 'age', 'job', 'campaign', 'education']
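
The names of the selected features can be read off the selector instead of being transcribed from the index array. The sketch below illustrates this with `SelectFromModel.get_support()` on synthetic data; the column names `signal`, `noise_a`, `noise_b` and the fixed `random_state` are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: one informative column and two pure-noise columns (assumed)
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y = (X["signal"] > 0).astype(int)

# With a tree ensemble, the default threshold is the mean feature importance
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)

# Boolean mask over the columns -> names of the features that were kept
selected = X.columns[selector.get_support()].tolist()
print(selected, selector.threshold_)
```

Applied to `df_X`, the same mask would give the selected column names directly.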

#### Model Computation

Several widely used classification algorithms are compared on this dataset. Once the algorithm with the best baseline performance has been identified, grid search is applied to it to tune its hyperparameters and build the final model.  
Note: Given the size of the dataset, advanced neural network techniques are not required.

In [194]:
models = [('lr', LogisticRegressionCV()),
          ('rf', RandomForestClassifier()),
          ('nb', BernoulliNB()), 
          ('adb', AdaBoostClassifier()),
          ('gb', GradientBoostingClassifier())]
models.append(('eclf', VotingClassifier(estimators = [models[i] for i in [0,1,2,3]], voting = 'soft')))

In [200]:
# Train test split - with the ratio 67 : 33
X_train, X_test, y_train, y_test = train_test_split(df_X_new, df_Y, test_size = 0.33, random_state = 42)


In [201]:
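# Standardizing the data: fit the scaler on the training set only
ss = StandardScaler()
X_train = ss.fit_transform(X_train)

# Reuse the training statistics; refitting on the test set would leak information
X_test = ss.transform(X_test)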
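
One way to make sure the scaler only ever sees training data, both for the final test evaluation and inside each cross-validation fold, is to wrap it in a pipeline together with the model. The sketch below uses synthetic data from `make_classification` and a plain `LogisticRegression`; both are assumptions chosen to keep the example small, not the models compared in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the bank data (assumed)
X, y = make_classification(n_samples=600, n_features=6, weights=[0.9, 0.1],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          random_state=42, stratify=y)

# The scaler is refitted on the training part of every fold automatically
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="accuracy")

# Fit on the full training split, then score on the untouched test split
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(cv_scores.mean(), test_accuracy)
```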

In [202]:
for name,model in models:
    cv = cross_val_score(model, X_train, y_train, cv =10, scoring = 'accuracy')
    clf = model.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print ('{0}\t{1:<1}\t{2:<4}\t{3:<4}'.format("MODEL","MEAN CV", "MIN CV", "MAX CV"))
    print ('{0}\t{1:<1}\t{2:<4}\t{3:<4}'.format(name, round(cv.mean(), 4), round(cv.min(), 4), round(cv.max(), 4)))
    conf = confusion_matrix(y_test, y_pred)
    print (conf)
    # confusion_matrix puts true labels on rows and predicted labels on columns,
    # so with labels (0, 1) the layout is [[TN, FP], [FN, TP]]
    TN, FP, FN, TP = conf.ravel()
    # Recall on label 1: share of actual CD buyers that the model identifies
    print ("Recall (label 1): ", TP/(TP+FN))
    

MODEL	MEAN CV	MIN CV	MAX CV
lr	0.8974	0.8905	0.9004
[[11776   266]
 [ 1148   399]]
Recall (label 1):  0.257918552036
MODEL	MEAN CV	MIN CV	MAX CV
rf	0.9034	0.8985	0.9123
[[11586   456]
 [  867   680]]
Recall (label 1):  0.43956043956
MODEL	MEAN CV	MIN CV	MAX CV
nb	0.8842	0.879	0.892
[[11850   192]
 [ 1403   144]]
Recall (label 1):  0.093083387201
MODEL	MEAN CV	MIN CV	MAX CV
adb	0.9027	0.897	0.9097
[[11672   370]
 [  976   571]]
Recall (label 1):  0.369101486749
MODEL	MEAN CV	MIN CV	MAX CV
gb	0.913	0.9069	0.9221
[[11515   527]
 [  718   829]]
Recall (label 1):  0.535875888817
MODEL	MEAN CV	MIN CV	MAX CV
eclf	0.9017	0.8974	0.9076
[[11785   257]
 [ 1081   466]]
Recall (label 1):  0.301228183581


#### Model Selection and Metrics

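The results above show that the "GradientBoostingClassifier" performs best: it has the highest test accuracy and the highest number of true positives (829 correctly predicted buyers), which is what we need. Although its mean cross-validation accuracy is around 91.3% (92.2% on the best fold), the research problem requires predicting as many "label 1" cases correctly as possible. Therefore, of all the available metrics, the one that matters here is the ratio printed after each confusion matrix above: the recall (sensitivity) on label 1, i.e. the fraction of actual buyers that the model identifies. Specificity, by contrast, is TN / (TN + FP), the fraction of non-buyers correctly rejected.

Recall = TP / (TP + FN)

Model Selected = GradientBoostingClassifier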
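
As a worked check, the metrics can be recomputed from the GradientBoostingClassifier confusion matrix reported above. This sketch assumes scikit-learn's layout for labels (0, 1), where rows are true labels and columns are predicted labels; precision and F1 are added here only for context and were not reported in the original run.

```python
import numpy as np

# Confusion matrix of the gradient boosting model, copied from the results above
conf = np.array([[11515, 527],
                 [718, 829]])

# Layout for labels (0, 1): [[TN, FP], [FN, TP]]
tn, fp, fn, tp = conf.ravel()

recall = tp / (tp + fn)        # share of actual buyers identified
specificity = tn / (tn + fp)   # share of non-buyers correctly rejected
precision = tp / (tp + fp)     # share of predicted buyers who actually buy
f1 = 2 * precision * recall / (precision + recall)

print(recall, specificity, precision, f1)
```

The recall computed this way matches the 0.5359 printed for the `gb` model.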

In [203]:
# let us define the parameter grid
param_grid = {    'n_estimators': [50, 100, 200],
                  'max_depth':range(5,16,2), 
                  'min_samples_split':range(1000,2100,200), 
                  'min_samples_leaf':range(30,71,10),
                   'learning_rate': [0.1, 0.2, 0.01]               
}
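
Before launching the search it helps to estimate its cost. The sketch below counts the candidate combinations with `ParameterGrid`, assuming the `min_samples_split` range of 1000 to 2000 and a 5-fold split (the `GridSearchCV` default in recent scikit-learn releases; older releases used 3 folds).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": list(range(5, 16, 2)),
    "min_samples_split": list(range(1000, 2100, 200)),
    "min_samples_leaf": list(range(30, 71, 10)),
    "learning_rate": [0.1, 0.2, 0.01],
}

# Number of hyperparameter combinations the exhaustive search would try
n_candidates = len(ParameterGrid(param_grid))  # 3 * 6 * 6 * 5 * 3 = 1620

cv_folds = 5  # assumption: default number of folds
total_fits = n_candidates * cv_folds
print(n_candidates, total_fits)
```

With this many model fits, a narrower grid or a randomized search such as `RandomizedSearchCV` may be a more practical starting point.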

In [205]:
grid_clf = GridSearchCV(GradientBoostingClassifier(),param_grid, scoring='roc_auc')

In [None]:
grid_clf.fit(X_train, y_train)

In [None]:
print("Best estimator found by grid search:")

grid_clf.best_estimator_

In [None]:
y_pred = grid_clf.predict(X_test)
conf = confusion_matrix(y_test, y_pred)
print (conf)
# confusion_matrix layout with labels (0, 1) is [[TN, FP], [FN, TP]]
TN, FP, FN, TP = conf.ravel()
# Recall on label 1: share of actual CD buyers that the model identifies
print ("Recall (label 1): ", TP/(TP+FN))