# CHURN - Classification Analysis

## Overview

- [Description](#Description)  
- [Data Description](#Data-Description)
- [Data Preparation](#Data-Preparation)
- [Classification](#Classification)
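    - [DECISION TREE](#DECISION-TREE)
    - [RANDOM FOREST](#RANDOM-FOREST)
    - [NAIVE BAYES](#NAIVE-BAYES)
    - [K-NEAREST NEIGHBORS](#K-NEAREST-NEIGHBORS)
    - [LOGISTIC REGRESSION](#LOGISTIC-REGRESSION)
    - [SUPPORT VECTOR MACHINE](#SUPPORT-VECTOR-MACHINE)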

## Description

Our objective is **churn prediction**: given a customer's profile, predict whether that customer will leave the company.

## Data Description

Columns:
- **RowNumber** (int > 0). Not needed as a feature
- **CustomerId** (int > 0). Not needed as a feature
- **Surname** (string). Not needed as a feature
- **CreditScore** (int). Numerical feature
- **Geography** (string). Categorical feature
- **Gender** (string). Categorical feature
- **Age** (int > 0). Numerical feature
- **Tenure** (int >= 0). Numerical feature
- **Balance** (float). Numerical feature
- **NumOfProducts** (int > 0). Numerical feature
- **HasCrCard** (0/1). Binary feature
- **IsActiveMember** (0/1). Binary feature
- **EstimatedSalary** (float). Numerical feature
- **Exited** (0/1). Target
    - exited (1): the customer left the company
    - not exited (0): the customer stayed with the company

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('data/Churn_Modelling.csv')

In [2]:
df.head()

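| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |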


In [3]:
df.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

## Data Preparation

- Checking for missing data (see [Missing Data](../../00 Data Preparation/01_Missing Data.ipynb))
- Feature scaling (see [Feature Scaling](../../00 Data Preparation/03_Feature_Scaling.ipynb)), which the K-Nearest Neighbors, Logistic Regression and Support Vector Machine models below rely on
- One-hot encoding for categorical data (see [Categorical Data](../../00 Data Preparation/02_Categorical Data.ipynb))

In [4]:
df.describe()

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


In [5]:
df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [6]:
# Values for categorical data
for column in df.columns:
    if (df[column].dtype) == 'object':
        print(column)
        print('----------------')
        print(df[column].value_counts())
        print('\n')

Surname
----------------
Smith          32
Martin         29
Scott          29
Walker         28
Brown          26
Shih           25
Yeh            25
Genovese       25
Wright         24
Maclean        24
White          23
Fanucci        23
Ma             23
Wilson         23
Wang           22
Lu             22
Moore          22
Johnson        22
Chu            22
Sun            21
Thompson       21
McGregor       21
Mai            21
Hughes         20
Miller         20
Mitchell       20
Watson         20
Lo             20
Kerr           20
Shen           20
               ..
Frater          1
Landry          1
Joshua          1
Williford       1
Hysell          1
MacDonnell      1
Cribb           1
Truscott        1
Weigel          1
Pepper          1
Newland         1
Howey           1
Wieck           1
Voss            1
Espinosa        1
Szabados        1
Corran          1
Kibble          1
Drake           1
Abrego          1
Bellew          1
Rozier          1
McEncroe        1
Wil

In [7]:
# isolating the target
y = df[['Exited']]
X = df.drop(labels=['Exited'], axis=1)

In [8]:
y.head()

| | Exited |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |


In [9]:
y['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64
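
The counts above show that only about one customer in five churned, so any accuracy reported later is best read against a trivial baseline. The following is a minimal sketch that uses only the two counts printed above; `majority_baseline_accuracy` is a hypothetical helper name, and the baseline on the held-out test split may differ slightly from this whole-dataset figure.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 7963, 1: 2037}

def majority_baseline_accuracy(counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

baseline = majority_baseline_accuracy(class_counts)
print('Majority-class baseline accuracy: {0:.4f}'.format(baseline))
```

A model that always answers "not exited" would already be right for 79.63% of the rows, which is the figure a useful classifier has to beat.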

In [10]:
X.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 |


In [11]:
# dropping the columns that are not needed as features: RowNumber, CustomerId, Surname
columns_to_drop = ['RowNumber', 'CustomerId', 'Surname']
X.drop(labels=columns_to_drop, axis=1, inplace=True)
X.head()

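| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 |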


In [12]:
# processing categorical data
categorical_features = ['Geography', 'Gender']
X = pd.get_dummies(X, columns=categorical_features)
X.head()

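| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 | 1 | 0 |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 | 1 | 0 |
| 2 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 | 1 | 0 |
| 3 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 1 | 0 | 0 | 1 | 0 |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 1 | 1 | 0 |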


In [13]:
# avoiding the dummy variable trap: drop one indicator column per categorical feature
dummy_variables = ['Geography_Spain', 'Gender_Male']
X.drop(labels=dummy_variables, axis=1, inplace=True)
X.head()

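| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_France | Geography_Germany | Gender_Female |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 1 |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 |
| 2 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 1 |
| 3 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 1 | 0 | 1 |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 1 |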
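
Dropping one indicator column per categorical feature loses no information, because the dropped column can always be rebuilt from the remaining ones. The sketch below illustrates this with the `Geography` values of the five rows shown above; the `one_hot` helper is a hypothetical stand-in written for this illustration, not the pandas implementation.

```python
# Geography values of the five rows displayed above.
geography = ['France', 'Spain', 'France', 'France', 'Spain']
levels = ['France', 'Germany', 'Spain']

def one_hot(values, levels):
    """Return one 0/1 row per value, with one column per level."""
    return [[1 if value == level else 0 for level in levels] for value in values]

encoded = one_hot(geography, levels)

# The indicator columns of a row always sum to 1, so the last column can be
# rebuilt from the others: Spain = 1 - France - Germany.
rebuilt_spain = [1 - row[0] - row[1] for row in encoded]
print(rebuilt_spain == [row[2] for row in encoded])
```

Keeping all three columns would make them perfectly collinear, which mainly matters for linear models such as logistic regression.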


In [14]:
from sklearn.preprocessing import StandardScaler
sd = StandardScaler(with_mean=True, with_std=True)

# for illustration only: the models that need scaling refit the scaler on the training split
Xscaled = sd.fit_transform(X)
pd.DataFrame(Xscaled, columns=X.columns).describe()

| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_France | Geography_Germany | Gender_Female |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | -4.870326e-16 | 2.484679e-16 | -1.400324e-16 | -5.978551e-17 | -8.652634e-16 | -2.676082e-16 | 2.164047e-16 | -1.580958e-17 | 7.723266e-16 | 5.961232e-16 | 2.414668e-15 |
| std | 1.00005 | 1.00005 | 1.00005 | 1.00005 | 1.00005 | 1.00005 | 1.00005 | 1.00005 | 1.00005 | 1.00005 | 1.00005 |
| min | -3.109504 | -1.994969 | -1.733315 | -1.225848 | -0.9115835 | -1.547768 | -1.03067 | -1.740268 | -1.002804 | -0.5787359 | -0.9124191 |
| 25% | -0.6883586 | -0.6600185 | -0.6959818 | -1.225848 | -0.9115835 | -1.547768 | -1.03067 | -0.8535935 | -1.002804 | -0.5787359 | -0.9124191 |
| 50% | 0.01522218 | -0.1832505 | -0.004425957 | 0.3319639 | -0.9115835 | 0.6460917 | 0.9702426 | 0.001802807 | 0.9972039 | -0.5787359 | -0.9124191 |
| 75% | 0.6981094 | 0.4842246 | 0.6871299 | 0.8199205 | 0.8077366 | 0.6460917 | 0.9702426 | 0.8572431 | 0.9972039 | 1.727904 | 1.095988 |
| max | 2.063884 | 5.061197 | 1.724464 | 2.795323 | 4.246377 | 0.6460917 | 0.9702426 | 1.7372 | 0.9972039 | 1.727904 | 1.095988 |
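
The `std` row above reads 1.00005 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The sketch below reproduces that factor with the standard library only; `standardize` is a hypothetical helper, and the toy column holds the first five `CreditScore` values shown earlier.

```python
import math
import statistics

def standardize(values):
    """Z-score a list using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# First five CreditScore values from the head() output, used as a toy column.
z = standardize([619, 608, 502, 699, 850])

# After scaling, the population std is 1; the sample std is larger by
# a factor of sqrt(n / (n - 1)), here with n = 10000 rows.
n = 10000
ratio = math.sqrt(n / (n - 1))
print('{0:.5f}'.format(ratio))
```

With 10,000 rows the factor is 1.00005, which matches the value in the table.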


## Classification

In [15]:
# creating a dictionary for recording the results from the different models
results = {}
results['name'] = []
results['accuracy'] = []
results['model'] = []

In [16]:
# utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top+1):
        candidates = np.flatnonzero(results['rank_test_score']==i)
        for candidate in candidates:
            print('Model with rank: {0}'.format(i))
            print('Mean validation score: {0:.3f} (std: {1:.3f})'.format(results['mean_test_score'][candidate], results['std_test_score'][candidate]))
            print('Parameters: {0}'.format(results['params'][candidate]))
            print('')

### DECISION TREE

See [Decision Tree](../../02 Classification/05 Decision Tree.ipynb) for a reference

In [17]:
# splitting the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [18]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='gini')

# we are not pruning the tree (beware of overfitting!)
tree.fit(X_train, y_train.values.ravel())
print('Accuracy: {0}'.format(tree.score(X_test, y_test)))

Accuracy: 0.8005
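
The tree above chooses its splits with `criterion='gini'`. As a rough illustration of what that criterion measures, the sketch below computes the Gini impurity of a node from its class counts. `gini_impurity` is a hypothetical helper, and the counts are those of the full dataset, so the root node of the fitted tree (built on the training split only) will have a slightly different value.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Class counts of the full dataset, from the value_counts() output.
root_impurity = gini_impurity([7963, 2037])
print('{0:.4f}'.format(root_impurity))
```

A pure node has impurity 0 and a perfectly mixed binary node has 0.5; each split is chosen to reduce the weighted impurity of the children as much as possible.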


In [19]:
from sklearn.tree import export_graphviz
export_graphviz(tree, out_file='tree.dot', feature_names=X.columns)

In [20]:
!dot -Tpng tree.dot -o tree.png

In [21]:
!cp tree.png ./images/

In [22]:
# pick a random integer with 1 in 2 billion chance of getting the same
# integer twice
import random
__counter__ = random.randint(0, int(2e9))  # randint needs integer bounds

# now use IPython's rich display to display the html image with the
# new argument
from IPython.display import HTML, display
display(HTML('<img src="./images/tree.png?%d">'% __counter__))

As the previous image shows, the unpruned tree is **too complex**, which suggests overfitting.

Let's try to prune the tree using the parameters:
- max_depth
- min_samples_split
- min_samples_leaf

Let's see a **tree of maximum depth of 3**.

In [23]:
tree = DecisionTreeClassifier(criterion='gini', max_depth=3)
tree.fit(X_train, y_train.values.ravel())
print('Accuracy: {0}'.format(tree.score(X_test, y_test)))

Accuracy: 0.8425


In [24]:
export_graphviz(tree, out_file='tree2.dot', feature_names=X.columns)

In [25]:
!dot -Tpng tree2.dot -o tree2.png

In [26]:
!cp tree2.png ./images/

In [27]:
# pick a random integer with 1 in 2 billion chance of getting the same
# integer twice
import random
__counter__ = random.randint(0, int(2e9))  # randint needs integer bounds

# now use IPython's rich display to display the html image with the
# new argument
from IPython.display import HTML, display
display(HTML('<img src="./images/tree2.png?%d">'% __counter__))

We are going to use **GridSearchCV** for tuning the main parameters.

In [28]:
from time import time
from sklearn.model_selection import GridSearchCV

tree = DecisionTreeClassifier(criterion='gini')
parameters = {'max_depth':[2, 3, 4, 5],
              'min_samples_split': [50, 100, 200],
              'min_samples_leaf': [50, 100, 200],
              'max_features': [None, 'sqrt', 'log2']}  # 'sqrt' is what the older alias 'auto' selected for classifiers
grid_search = GridSearchCV(estimator=tree, param_grid=parameters, cv=10, n_jobs=4, verbose=1)

In [29]:
start = time()
grid_search.fit(X_train, y_train.values.ravel())
end = time()

Fitting 10 folds for each of 108 candidates, totalling 1080 fits


[Parallel(n_jobs=4)]: Done 144 tasks      | elapsed:    2.0s
[Parallel(n_jobs=4)]: Done 744 tasks      | elapsed:   10.4s
[Parallel(n_jobs=4)]: Done 1080 out of 1080 | elapsed:   15.4s finished
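
The figures in the log follow directly from the size of the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. A minimal sketch of that arithmetic, assuming only the number of values listed per parameter in the grid above:

```python
from itertools import product

# Number of values listed for each parameter in the grid above.
options_per_parameter = {'max_depth': 4,
                         'min_samples_split': 3,
                         'min_samples_leaf': 3,
                         'max_features': 3}

# Every candidate picks one option index per parameter.
candidates = list(product(*(range(n) for n in options_per_parameter.values())))
n_folds = 10  # cv=10
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```

This gives 108 candidates and 1080 fits. Adding one more value to any parameter multiplies the cost, which is worth keeping in mind before widening a grid.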


In [30]:
print("GridSearchCV took: {0:.3f} for {1} candidate parameter settings".format(end-start, len(grid_search.cv_results_['params'])))

GridSearchCV took: 15.987 for 108 candidate parameter settings


In [31]:
report(grid_search.cv_results_)

Model with rank: 1
Mean validation score: 0.844 (std: 0.006)
Parameters: {'max_features': None, 'min_samples_split': 50, 'min_samples_leaf': 50, 'max_depth': 4}

Model with rank: 1
Mean validation score: 0.844 (std: 0.006)
Parameters: {'max_features': None, 'min_samples_split': 100, 'min_samples_leaf': 50, 'max_depth': 4}

Model with rank: 1
Mean validation score: 0.844 (std: 0.006)
Parameters: {'max_features': None, 'min_samples_split': 200, 'min_samples_leaf': 50, 'max_depth': 4}
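
The three top-ranked settings differ only in `min_samples_split`. A plausible reason is sketched below, assuming that a node can be split only if it holds at least `min_samples_split` samples and each child keeps at least `min_samples_leaf` samples. `can_split` is a hypothetical helper that expresses this necessary condition, not scikit-learn code.

```python
def can_split(n_samples, min_samples_split, min_samples_leaf):
    """True if a node of this size may be split under both constraints.

    A split needs at least min_samples_split samples in the node and must
    leave at least min_samples_leaf samples in each of the two children.
    """
    return n_samples >= max(min_samples_split, 2 * min_samples_leaf)

node_sizes = range(0, 1001)
same_50_100 = all(can_split(n, 50, 50) == can_split(n, 100, 50) for n in node_sizes)
same_100_200 = all(can_split(n, 100, 50) == can_split(n, 200, 50) for n in node_sizes)
print(same_50_100, same_100_200)
```

With `min_samples_leaf=50`, the values 50 and 100 for `min_samples_split` impose the same constraint, so those two candidates build the same tree. The value 200 is stricter in general, so its identical score presumably depends on the data, for example if no node between 100 and 199 samples would have been split at depth 4 anyway.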



In [32]:
grid_search.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [33]:
grid_search.best_params_           

{'max_depth': 4,
 'max_features': None,
 'min_samples_leaf': 50,
 'min_samples_split': 50}

In [34]:
grid_search.best_score_

0.84387500000000004

In [35]:
grid_search.cv_results_

{'mean_fit_time': array([ 0.03900943,  0.0470526 ,  0.0414434 ,  0.03883791,  0.05781031,
         0.05300636,  0.04122488,  0.03983669,  0.0395853 ,  0.03405576,
         0.02667046,  0.02473652,  0.03242893,  0.03449347,  0.03944132,
         0.02525618,  0.03683283,  0.04658325,  0.0304107 ,  0.02461839,
         0.0415103 ,  0.04344049,  0.0301477 ,  0.03096168,  0.03341646,
         0.02769814,  0.0253783 ,  0.03603423,  0.0476568 ,  0.04090619,
         0.04136724,  0.07308688,  0.11962352,  0.12357213,  0.09451692,
         0.0554909 ,  0.03857465,  0.04007862,  0.03740962,  0.02973528,
         0.02804227,  0.04313092,  0.04130831,  0.04184315,  0.05514965,
         0.05063498,  0.03727553,  0.04619949,  0.03690031,  0.03175452,
         0.02946742,  0.02316217,  0.02446034,  0.03139207,  0.04381821,
         0.04898028,  0.05727623,  0.05021262,  0.0390538 ,  0.0648705 ,
         0.04372902,  0.03803382,  0.03824878,  0.02457693,  0.03882861,
         0.02715993,  0.0219394 , 

In [36]:
grid_search.score(X_test, y_test)

0.85499999999999998

In [37]:
export_graphviz(grid_search.best_estimator_, out_file='tree3.dot', feature_names=X.columns)

In [38]:
!dot -Tpng tree3.dot -o tree3.png

In [39]:
!cp tree3.png ./images/

In [40]:
# pick a random integer with 1 in 2 billion chance of getting the same
# integer twice
import random
__counter__ = random.randint(0, int(2e9))  # randint needs integer bounds

# now use IPython's rich display to display the html image with the
# new argument
from IPython.display import HTML, display
display(HTML('<img src="./images/tree3.png?%d">'% __counter__))

In [41]:
# picking up the best model
results['name'].append('Decision Tree')
results['accuracy'].append(grid_search.score(X_test, y_test))
results['model'].append(grid_search.best_estimator_)
print(results)

{'accuracy': [0.85499999999999998], 'model': [DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')], 'name': ['Decision Tree']}


### RANDOM FOREST

See [Random Forest](../../02 Classification/06 Random Forest.ipynb) for a reference

In [42]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=10, criterion='gini')

forest.fit(X_train, y_train.values.ravel())
print('Accuracy: {0}'.format(forest.score(X_test, y_test)))

Accuracy: 0.8605


In [43]:
from sklearn.model_selection import GridSearchCV

forest = RandomForestClassifier(criterion='gini')
parameters = {'n_estimators':[10, 50, 100, 500],
              'min_samples_split': [50, 100, 200],
              'min_samples_leaf': [50, 100, 200],
              'max_features': [None, 'sqrt', 'log2']}  # 'sqrt' is what the older alias 'auto' selected for classifiers
grid_search = GridSearchCV(estimator=forest, param_grid=parameters, cv=5, n_jobs=4, verbose=1)

In [44]:
start = time()
grid_search.fit(X_train, y_train.values.ravel())
end = time()

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  1.1min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  3.9min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  6.7min
[Parallel(n_jobs=4)]: Done 540 out of 540 | elapsed:  7.8min finished


In [45]:
print("GridSearchCV took: {0:.3f} for {1} candidate parameter settings".format(end-start, len(grid_search.cv_results_['params'])))

GridSearchCV took: 472.657 for 108 candidate parameter settings


In [46]:
report(grid_search.cv_results_)

Model with rank: 1
Mean validation score: 0.852 (std: 0.002)
Parameters: {'max_features': None, 'min_samples_split': 50, 'min_samples_leaf': 50, 'n_estimators': 100}

Model with rank: 2
Mean validation score: 0.852 (std: 0.004)
Parameters: {'max_features': None, 'min_samples_split': 50, 'min_samples_leaf': 50, 'n_estimators': 10}

Model with rank: 3
Mean validation score: 0.852 (std: 0.001)
Parameters: {'max_features': None, 'min_samples_split': 100, 'min_samples_leaf': 50, 'n_estimators': 100}



In [47]:
grid_search.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [48]:
grid_search.best_params_           

{'max_features': None,
 'min_samples_leaf': 50,
 'min_samples_split': 50,
 'n_estimators': 100}

In [49]:
grid_search.best_score_

0.85237499999999999

In [50]:
grid_search.cv_results_

{'mean_fit_time': array([  0.37836123,   1.83253479,   3.83214307,  20.81370621,
          0.54170794,   2.20702586,   3.51421499,  19.07090082,
          0.32253494,   1.64624443,   2.81454763,  15.41304541,
          0.27338018,   1.36267052,   2.82097001,  12.75091324,
          0.34659863,   1.43396726,   2.54571857,  12.93085375,
          0.27605681,   1.23437219,   2.46966448,  13.26657424,
          0.22810984,   1.12388301,   2.44729252,  10.81843996,
          0.19813814,   1.0678916 ,   2.59069529,  10.97963719,
          0.212886  ,   1.13268795,   2.16718922,  10.7315464 ,
          0.17023396,   0.96640477,   1.26615558,   7.26301947,
          0.20392499,   0.65465508,   1.4419292 ,   7.31755581,
          0.1287272 ,   0.81032343,   1.23927183,   9.00525751,
          0.21945977,   0.64906282,   1.30137563,   6.60566564,
          0.14732947,   0.58001194,   1.41069036,   6.55499601,
          0.18860936,   0.76224699,   1.11277323,   6.64104342,
          0.15805154,  

In [51]:
grid_search.score(X_test, y_test)

0.85399999999999998
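
The tuned forest scores 0.854 on the test set, against 0.855 for the tuned decision tree. To put that gap in perspective, the sketch below converts both accuracies into counts of correctly classified rows, assuming the 80/20 split of the 10,000 rows made earlier; `correct_predictions` is a hypothetical helper.

```python
n_samples = 10000
test_size = 0.2
n_test = int(n_samples * test_size)  # 2000 rows in the test set

def correct_predictions(accuracy, n_rows):
    """Number of correctly classified rows implied by an accuracy value."""
    return round(accuracy * n_rows)

tree_correct = correct_predictions(0.855, n_test)
forest_correct = correct_predictions(0.854, n_test)
print(tree_correct, forest_correct, tree_correct - forest_correct)
```

The two models differ by only two test rows (1710 versus 1708 correct), so this single split gives no real basis for preferring one over the other.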

In [52]:
# picking up the best model
results['name'].append('Random Forest')
results['accuracy'].append(grid_search.score(X_test, y_test))
results['model'].append(grid_search.best_estimator_)
print(results)

{'accuracy': [0.85499999999999998, 0.85399999999999998], 'model': [DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'), RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)], 'name': ['Decision Tree', 'Random Forest']}


### NAIVE BAYES

See [Naive Bayes](../../02 Classification/00 Naive Bayes.ipynb) for a reference

In [53]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X_train, y_train.values.ravel())
print('Accuracy: {0}'.format(gnb.score(X_test, y_test)))

Accuracy: 0.7845
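
For intuition about what `GaussianNB` estimates, here is a minimal pure-Python sketch of Gaussian Naive Bayes: one prior per class, plus a mean and variance per class and feature, combined under the assumption that features are independent within a class. The function names and the toy data are hypothetical, and the sketch is not the scikit-learn implementation.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and variances."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        # A small constant keeps the variance strictly positive.
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + 1e-9
                     for j in range(n_features)]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(prior, means, variances):
        log_likelihood = sum(-0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
                             for xj, mu, var in zip(x, means, variances))
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(*model[label]))

# Hypothetical toy data: [age, number of products], 1 = exited.
X_toy = [[30, 2], [35, 2], [40, 1], [55, 1], [60, 1], [58, 3]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(toy_model, [32, 2]), predict_gaussian_nb(toy_model, [57, 1]))
```

The independence and normality assumptions are strong for this dataset (several features are binary indicators), which may partly explain why this model trails the tree-based ones here.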


In [54]:
# picking up the model
results['name'].append('Naive Bayes')
results['accuracy'].append(gnb.score(X_test, y_test))
results['model'].append(gnb)
print(results)

{'accuracy': [0.85499999999999998, 0.85399999999999998, 0.78449999999999998], 'model': [DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'), RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False), GaussianNB(priors=None)], 'name': ['Decision Tree', 'Random Forest', 'Naive Bayes']}


### K-NEAREST NEIGHBORS

See [K-Nearest Neighbors](../../02 Classification/01 K-Nearest Neighbor.ipynb) for a reference

In [55]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, 
                           weights='uniform', 
                           algorithm='auto', 
                           p=2, 
                           metric='minkowski')


from sklearn.preprocessing import StandardScaler
sd = StandardScaler(with_mean=True, with_std=True)

X_train_scaled = sd.fit_transform(X_train)
X_test_scaled = sd.transform(X_test)

knn.fit(X_train_scaled, y_train.values.ravel())
print('Accuracy: {0}'.format(knn.score(X_test_scaled, y_test)))

Accuracy: 0.8275


In [56]:
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier()
parameters = {'n_neighbors':[5, 10, 15, 25, 50],
              'weights': ['uniform', 'distance'],
              'p': [1, 2]}
grid_search = GridSearchCV(estimator=knn, param_grid=parameters, cv=10, n_jobs=4, verbose=1)
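
The `p` values in this grid select the order of the Minkowski distance used to find neighbors: `p=1` is the Manhattan distance and `p=2` the Euclidean distance. A minimal sketch of the formula, with a hypothetical helper name and a hypothetical pair of feature vectors:

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Hypothetical pair of (already scaled) feature vectors.
a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski_distance(a, b, p=1)
euclidean = minkowski_distance(a, b, p=2)
print(manhattan, euclidean)
```

Because both distances add up per-feature differences, a feature with a large numeric range such as `Balance` would dominate the result without scaling, which is why the scaled training data is used for this model.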

In [57]:
start = time()
grid_search.fit(X_train_scaled, y_train.values.ravel())
end = time()

Fitting 10 folds for each of 20 candidates, totalling 200 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   31.1s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  3.4min
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed:  3.5min finished


In [58]:
print("GridSearchCV took: {0:.3f} for {1} candidate parameter settings".format(end-start, len(grid_search.cv_results_['params'])))

GridSearchCV took: 212.289 for 20 candidate parameter settings


In [59]:
report(grid_search.cv_results_)

Model with rank: 1
Mean validation score: 0.835 (std: 0.008)
Parameters: {'p': 1, 'weights': 'distance', 'n_neighbors': 15}

Model with rank: 2
Mean validation score: 0.833 (std: 0.009)
Parameters: {'p': 1, 'weights': 'uniform', 'n_neighbors': 15}

Model with rank: 3
Mean validation score: 0.832 (std: 0.008)
Parameters: {'p': 1, 'weights': 'distance', 'n_neighbors': 25}



In [60]:
grid_search.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=15, p=1,
           weights='distance')

In [61]:
grid_search.best_params_

{'n_neighbors': 15, 'p': 1, 'weights': 'distance'}

In [62]:
grid_search.best_score_

0.83487500000000003

In [63]:
grid_search.cv_results_

{'mean_fit_time': array([ 0.05086727,  0.0419997 ,  0.03285263,  0.04512339,  0.05818503,
         0.04527781,  0.04679232,  0.03984451,  0.03789189,  0.03792551,
         0.05351527,  0.03993073,  0.03633139,  0.0472043 ,  0.03821759,
         0.03892107,  0.03820686,  0.04492691,  0.04367583,  0.0425982 ]),
 'mean_score_time': array([ 0.35612414,  0.33260415,  0.19217491,  0.24356945,  0.5428345 ,
         0.47227101,  0.37734118,  0.2948252 ,  0.4721184 ,  0.45362642,
         0.36580591,  0.34375339,  0.46563206,  0.55936539,  0.41294732,
         0.49973485,  0.53094659,  0.6643291 ,  0.65524659,  0.53041511]),
 'mean_test_score': array([ 0.82975 ,  0.828625,  0.826125,  0.826   ,  0.830375,  0.830875,
         0.827625,  0.830625,  0.83275 ,  0.834875,  0.83075 ,  0.830875,
         0.831625,  0.832   ,  0.828625,  0.830875,  0.824125,  0.827   ,
         0.821875,  0.826625]),
 'mean_train_score': array([ 0.87331943,  1.        ,  0.87095839,  1.        ,  0.84783328,
         1

In [64]:
grid_search.score(X_test_scaled, y_test)

0.84299999999999997

In [65]:
# picking up the best model
results['name'].append('K-Nearest Neighbor')
results['accuracy'].append(grid_search.score(X_test_scaled, y_test))
results['model'].append(grid_search.best_estimator_)
print(results)

{'accuracy': [0.85499999999999998, 0.85399999999999998, 0.78449999999999998, 0.84299999999999997], 'model': [DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'), RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False), GaussianNB(priors=None), KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=15, p=1,
           weights='distance')], '
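
The printed dictionary is hard to read once it holds several models. As a small convenience sketch, the test accuracies recorded so far can be ranked side by side; the values below are copied from the outputs above rather than read from the live `results` object.

```python
# Test accuracies recorded for the first four models.
names = ['Decision Tree', 'Random Forest', 'Naive Bayes', 'K-Nearest Neighbor']
accuracies = [0.855, 0.854, 0.7845, 0.843]

# Sort the (name, accuracy) pairs from best to worst.
ranking = sorted(zip(names, accuracies), key=lambda pair: pair[1], reverse=True)
for name, accuracy in ranking:
    print('{0:<20s} {1:.4f}'.format(name, accuracy))
```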

### LOGISTIC REGRESSION

See [Logistic Regression](../../02 Classification/03 Logistic Regression.ipynb) for a reference

In [78]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l2', 
                        C=1.0, 
                        max_iter=100)

lr.fit(X_train_scaled, y_train.values.ravel())
print('Accuracy: {0}'.format(lr.score(X_test_scaled, y_test)))

Accuracy: 0.811


In [79]:
from sklearn.model_selection import GridSearchCV

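# liblinear is named explicitly because it supports both the 'l1' and 'l2' penalties searched below
lr = LogisticRegression(solver='liblinear', n_jobs=-1)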
parameters = {'penalty': ['l1', 'l2'],
              'C':[0.0001, 0.001, 0.01, 0.1, 1.0],
              'max_iter': [100, 500, 1000, 5000]}
grid_search = GridSearchCV(estimator=lr, param_grid=parameters, cv=5, n_jobs=4, verbose=1)
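
The parameter `C` in this grid is the inverse of the regularization strength: the smaller `C` is, the more the penalty term dominates the fit. The sketch below evaluates an L2-penalized logistic objective of the form described in the scikit-learn user guide, assuming labels coded as -1/+1. The function name and the toy data are hypothetical.

```python
import math

def l2_logistic_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum(log(1 + exp(-y_i * (w . x_i + b))))."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    loss = sum(math.log1p(math.exp(-yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)))
               for xi, yi in zip(X, y))
    return penalty + C * loss

# Hypothetical one-feature toy data with labels in {-1, +1}.
X_toy = [[1.0], [-1.0]]
y_toy = [1, -1]
weights = [2.0]
for C in (0.01, 1.0):
    print(C, round(l2_logistic_objective(weights, 0.0, X_toy, y_toy, C), 4))
```

With `C=0.01` the data-fit term is almost invisible next to the penalty, so the optimizer is pushed towards small weights; with `C=1.0` the fit term carries more weight.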

In [80]:
start = time()
grid_search.fit(X_train_scaled, y_train.values.ravel())
end = time()

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=4)]: Done 144 tasks      | elapsed:    1.9s
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed:    2.4s finished


In [81]:
print("GridSearchCV took: {0:.3f} for {1} candidate parameter settings".format(end-start, len(grid_search.cv_results_['params'])))

GridSearchCV took: 2.634 for 40 candidate parameter settings


In [82]:
report(grid_search.cv_results_)

Model with rank: 1
Mean validation score: 0.811 (std: 0.003)
Parameters: {'C': 0.01, 'max_iter': 100, 'penalty': 'l2'}

Model with rank: 1
Mean validation score: 0.811 (std: 0.003)
Parameters: {'C': 0.01, 'max_iter': 500, 'penalty': 'l2'}

Model with rank: 1
Mean validation score: 0.811 (std: 0.003)
Parameters: {'C': 0.01, 'max_iter': 1000, 'penalty': 'l2'}

Model with rank: 1
Mean validation score: 0.811 (std: 0.003)
Parameters: {'C': 0.01, 'max_iter': 5000, 'penalty': 'l2'}



In [83]:
grid_search.best_estimator_

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [84]:
grid_search.best_params_

{'C': 0.01, 'max_iter': 100, 'penalty': 'l2'}

In [85]:
grid_search.best_score_

0.81100000000000005

In [86]:
grid_search.cv_results_

{'mean_fit_time': array([ 0.02449355,  0.04040217,  0.01875882,  0.0360744 ,  0.03131461,
         0.02710223,  0.02113676,  0.03163681,  0.0308918 ,  0.05436735,
         0.01847949,  0.04643126,  0.02896719,  0.0294466 ,  0.02649341,
         0.04024696,  0.03673635,  0.03736582,  0.0247602 ,  0.02889805,
         0.02210979,  0.02661123,  0.02166781,  0.02459679,  0.027385  ,
         0.02946796,  0.02739797,  0.03072186,  0.03560762,  0.02398219,
         0.0275166 ,  0.0254015 ,  0.02902341,  0.02816858,  0.03278503,
         0.03124166,  0.03914623,  0.03393803,  0.03720565,  0.02217226]),
 'mean_score_time': array([ 0.01154914,  0.00528307,  0.0150176 ,  0.01120725,  0.01180558,
         0.00942297,  0.01009021,  0.01407657,  0.00884042,  0.01124239,
         0.0172523 ,  0.01333017,  0.00694346,  0.00900702,  0.00926204,
         0.00594296,  0.00330625,  0.00148363,  0.00469103,  0.00195837,
         0.00194621,  0.00105724,  0.00157056,  0.00159898,  0.00137963,
         0.00

In [87]:
grid_search.score(X_test_scaled, y_test)

0.8115

In [89]:
# picking up the best model
results['name'].append('Logistic Regression')
results['accuracy'].append(grid_search.score(X_test_scaled, y_test))
results['model'].append(grid_search.best_estimator_)
print(results)

{'accuracy': [0.85499999999999998, 0.85399999999999998, 0.78449999999999998, 0.84299999999999997, 0.20300000000000001, 0.8115], 'model': [DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'), RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=50,
            min_samples_split=50, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False), GaussianNB(priors=None), KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=15, p=1,
     

### SUPPORT VECTOR MACHINE

See [Support Vector Machine](../../02 Classification/04 Support Vector Machine.ipynb) for a reference

In [93]:
from sklearn.svm import SVC
svc = SVC(C=1.0, 
          kernel='rbf',
          degree=3, 
          gamma='auto',
          coef0=0.0)

svc.fit(X_train_scaled, y_train.values.ravel())
print('Accuracy: {0}'.format(svc.score(X_test_scaled, y_test)))

Accuracy: 0.8635


In [95]:
from sklearn.model_selection import GridSearchCV

svc = SVC(cache_size=1000)
parameters = {'C':[0.001, 0.01, 0.1, 1.0],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'degree': [3, 5],
              'gamma': ['auto', 0.1, 0.01],
              'coef0': [0.0, 0.1, 1.0],
              'class_weight':[None, 'balanced']}

grid_search = GridSearchCV(estimator=svc, param_grid=parameters, cv=5, n_jobs=4, verbose=1)
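
This grid is large, and part of it is redundant: `degree` is used only by the polynomial kernel, `coef0` only by the polynomial and sigmoid kernels, and `gamma` by every kernel except the linear one. The sketch below counts how many settings are actually distinct, using the parameter values from the grid above; `effective_setting` is a hypothetical helper written for this illustration.

```python
from itertools import product

C_values = [0.001, 0.01, 0.1, 1.0]
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
degrees = [3, 5]
gammas = ['auto', 0.1, 0.01]
coef0s = [0.0, 0.1, 1.0]
class_weights = [None, 'balanced']

full_grid = list(product(C_values, kernels, degrees, gammas, coef0s, class_weights))

def effective_setting(C, kernel, degree, gamma, coef0, class_weight):
    """Blank out the parameters that the given kernel ignores."""
    if kernel != 'poly':
        degree = None            # degree is used only by the polynomial kernel
    if kernel == 'linear':
        gamma = None             # gamma is used by rbf, poly and sigmoid
    if kernel not in ('poly', 'sigmoid'):
        coef0 = None             # coef0 is used by poly and sigmoid
    return (C, kernel, degree, gamma, coef0, class_weight)

distinct = {effective_setting(*setting) for setting in full_grid}
print(len(full_grid), len(distinct))
```

Only 248 of the 576 candidates are distinct. `GridSearchCV` also accepts a list of dictionaries as `param_grid`, so one possible way to avoid the repeated fits is to give each kernel its own dictionary containing only the parameters it uses.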

In [None]:
start = time()
grid_search.fit(X_train_scaled, y_train.values.ravel())
end = time()

Fitting 5 folds for each of 576 candidates, totalling 2880 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   42.4s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  2.7min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  8.3min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 22.6min
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed: 33.1min
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed: 42.1min
[Parallel(n_jobs=4)]: Done 2442 tasks      | elapsed: 56.9min
