In [30]:
%reload_kedro

2020-06-08 15:19:11,064 - root - INFO - ** Kedro project Immunization Drop-outs
2020-06-08 15:19:11,065 - root - INFO - Defined global variable `context` and `catalog`


In [31]:
dfm = catalog.load("model_table")

2020-06-08 15:19:11,069 - kedro.io.data_catalog - INFO - Loading data from `model_table` (CSVDataSet)...


## Baseline Classifier

A baseline tells us whether the models we develop are actually useful. The simplest one is a classifier that always predicts the majority class.

The majority class (`high`) makes up 60.9% of the dataset, so a majority-class classifier would reach an accuracy of 60.9%.
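
As a minimal sketch of that baseline, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` always predicts the majority class. The labels below are synthetic stand-ins (609 majority rows out of 1000, chosen only to mirror the 60.9% share), not the project's data, and the names `X_demo`, `y_demo` and `baseline` are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in labels: 609 majority-class rows and 391 minority-class rows.
y_demo = np.array([0] * 609 + [1] * 391)
# The dummy classifier ignores the features, so a placeholder column is enough.
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent label seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_accuracy)
```

The accuracy equals the share of the majority class, which is the number any real model has to beat.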

In [32]:
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import ShuffleSplit
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Random Forest

In [33]:
features_train = catalog.load("X_train")
labels_train = catalog.load("y_train")
features_test = catalog.load("X_test")
labels_test = catalog.load("y_test")

2020-06-08 15:19:11,102 - kedro.io.data_catalog - INFO - Loading data from `X_train` (PickleDataSet)...
2020-06-08 15:19:11,105 - kedro.io.data_catalog - INFO - Loading data from `y_train` (PickleDataSet)...
2020-06-08 15:19:11,106 - kedro.io.data_catalog - INFO - Loading data from `X_test` (PickleDataSet)...
2020-06-08 15:19:11,107 - kedro.io.data_catalog - INFO - Loading data from `y_test` (PickleDataSet)...


In [34]:
print(features_train.shape)
print(features_test.shape)

(36540, 7)
(9135, 7)


## Cross-Validation for Hyperparameter tuning

In [35]:
rf_0 = RandomForestClassifier(random_state = 8)
print('Parameters currently in use:\n')
pprint(rf_0.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 8,
 'verbose': 0,
 'warm_start': False}


I'll tune the following hyperparameters:
* n_estimators = number of trees in the forest
* max_features = maximum number of features considered when splitting a node
* max_depth = maximum number of levels in each decision tree
* min_samples_split = minimum number of samples a node must contain before it can be split
* min_samples_leaf = minimum number of samples allowed in a leaf node
* bootstrap = whether each tree is trained on a bootstrap sample (drawn with replacement) or on the whole training set
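
To make these definitions concrete, here is a small sketch on synthetic data (generated with `make_classification`, not the project's tables) that fits a forest with a few of these hyperparameters fixed and then checks the fitted trees against them. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 7 features, the same width as the training matrix here.
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=8)

forest = RandomForestClassifier(
    n_estimators=20,      # 20 trees in the forest
    max_depth=3,          # no tree may grow deeper than 3 levels
    min_samples_leaf=4,   # every leaf must keep at least 4 samples
    bootstrap=False,      # each tree sees the whole training set
    random_state=8,
)
forest.fit(X_demo, y_demo)

n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
# Leaves are the nodes without a left child (children_left == -1).
smallest_leaf = min(
    tree.tree_.n_node_samples[tree.tree_.children_left == -1].min()
    for tree in forest.estimators_
)
print(n_trees, deepest, smallest_leaf)
```

With `bootstrap=False` the per-node sample counts refer to the original rows, which keeps the leaf-size check easy to read.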

### Randomized Search Cross Validation

Define the grid:

In [36]:
# n_estimators
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 100, num = 5)]

# max_features ('auto' and 'sqrt' are equivalent for RandomForestClassifier
# in this scikit-learn version)
max_features = ['auto', 'sqrt']

# max_depth
max_depth = [int(x) for x in np.linspace(10, 50, num = 5)]
max_depth.append(None)

# min_samples_split
min_samples_split = [2, 5, 10]

# min_samples_leaf
min_samples_leaf = [1, 2, 4]

# bootstrap
bootstrap = [True, False]

# n_jobs
n_jobs = [4]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'n_jobs': n_jobs}

In [37]:
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [20, 40, 60, 80, 100],
 'n_jobs': [4]}
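
To get a feel for how large this search space is, the sketch below rebuilds the same grid and counts its combinations with `ParameterGrid`. It then compares that count with the 50 candidates a randomized search with `n_iter=50` would sample. Nothing is fitted here; the grid values are only enumerated.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the grid printed above.
random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=20, stop=100, num=5)],
    "max_features": ["auto", "sqrt"],
    "max_depth": [int(x) for x in np.linspace(10, 50, num=5)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
    "n_jobs": [4],
}

# ParameterGrid enumerates the Cartesian product of all value lists.
n_combinations = len(ParameterGrid(random_grid))
n_iter = 50
sampled_share = n_iter / n_combinations
print(n_combinations, round(sampled_share, 3))
```

The grid holds 1080 combinations, so 50 iterations cover under 5% of it. That is the trade-off a randomized search makes in exchange for a short run time.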


Perform the Random Search:

In [38]:
# Create the base model to tune
rfc = RandomForestClassifier(random_state=8)

# Definition of the random search
random_search = RandomizedSearchCV(estimator=rfc,
                                   param_distributions=random_grid,
                                   n_iter=50,
                                   scoring='accuracy',
                                   cv=3,
                                   verbose=1,
                                   random_state=8,
                                   n_jobs=4)

# Fit the random search model
random_search.fit(features_train, labels_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   10.5s
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed:   43.6s finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [39]:
print("The best hyperparameters from Random Search are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

The best hyperparameters from Random Search are:
{'n_jobs': 4, 'n_estimators': 80, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 10, 'bootstrap': False}

The mean accuracy of a model with these hyperparameters is:
0.8727422003284072
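
Beyond `best_params_` and `best_score_`, a fitted search also exposes `cv_results_`, which holds the scores of every sampled candidate. The sketch below shows one way to rank them with pandas. It uses a small synthetic dataset and a reduced, made-up parameter space so that it runs in a few seconds; `demo_search` is an illustrative name, not an object from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=8)

demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=8),
    param_distributions={
        "n_estimators": [10, 20],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    scoring="accuracy",
    cv=3,
    random_state=8,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate; keep the columns needed for a ranking.
results = pd.DataFrame(demo_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranked = results[columns].sort_values("rank_test_score")
print(ranked.to_string(index=False))
```

Looking at `std_test_score` next to the mean helps judge whether the gap between the top candidates is larger than the fold-to-fold noise.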


In [51]:
type(random_search)

sklearn.model_selection._search.RandomizedSearchCV

### Grid Search Cross Validation

In [42]:
# Parameter grid for the exhaustive search. bootstrap=False follows the random
# search result; max_depth and n_estimators are set above the values the random
# search selected (max_depth=10, n_estimators=80).
bootstrap = [False]
max_depth = [30, 40, 50]
max_features = ['sqrt']
min_samples_leaf = [1, 2, 4]
min_samples_split = [5, 10, 15]
n_estimators = [800]
n_jobs = [4]

param_grid = {
    'bootstrap': bootstrap,
    'max_depth': max_depth,
    'max_features': max_features,
    'min_samples_leaf': min_samples_leaf,
    'min_samples_split': min_samples_split,
    'n_estimators': n_estimators,
    'n_jobs': n_jobs
}

# Create a base model
rfc = RandomForestClassifier(random_state=8)

# Manually create the CV splits so that a random_state can be fixed
# (GridSearchCV itself doesn't have that argument)
cv_sets = ShuffleSplit(n_splits = 3, test_size = .33, random_state = 8)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rfc,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=cv_sets,
                           verbose=1,
                           n_jobs=4)

# Fit the grid search to the data
grid_search.fit(features_train, labels_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  3.2min
[Parallel(n_jobs=4)]: Done  81 out of  81 | elapsed:  6.0min finished


GridSearchCV(cv=ShuffleSplit(n_splits=3, random_state=8, test_size=0.33, train_size=None),
             error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_sampl...
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=8,
                                  

In [43]:
print("The best hyperparameters from Grid Search are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

The best hyperparameters from Grid Search are:
{'bootstrap': False, 'max_depth': 40, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 15, 'n_estimators': 800, 'n_jobs': 4}

The mean accuracy of a model with these hyperparameters is:
0.8643613345495756
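
Note that this score and the random search score were computed on different splits: the random search used `cv=3` (stratified 3-fold for a classifier), while the grid search used `ShuffleSplit`, so the two numbers are not strictly comparable. The sketch below illustrates what fixing `random_state` on `ShuffleSplit` buys: the same held-out indices on every run. It uses 30 placeholder rows, and `held_out_indices` is an illustrative helper name.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 30 placeholder rows; only the number of rows matters for the splitter.
X_demo = np.arange(30).reshape(-1, 1)

def held_out_indices(random_state):
    """Return the held-out indices of each of the 3 shuffled splits."""
    splitter = ShuffleSplit(n_splits=3, test_size=0.33, random_state=random_state)
    return [sorted(test.tolist()) for _, test in splitter.split(X_demo)]

first_run = held_out_indices(random_state=8)
second_run = held_out_indices(random_state=8)

# Same seed, same splits: the search becomes reproducible.
same_splits = first_run == second_run
split_sizes = [len(indices) for indices in first_run]
print(same_splits, split_sizes)
```

Unlike K-fold, the held-out sets of `ShuffleSplit` are drawn independently, so a row can appear in more than one of them.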


In [44]:
# best model 
best_rfc = grid_search.best_estimator_

In [45]:
best_rfc

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=40, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=15,
                       min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=4,
                       oob_score=False, random_state=8, verbose=0,
                       warm_start=False)

Let's fit it and see how it performs. The two searches above are first wrapped into reusable functions:

In [None]:
def parameter_tuning_randomized_search(features_train, labels_train):
    """
    Tune a RandomForestClassifier with a randomized search
    (50 sampled candidates, 3-fold cross-validation).

    Returns the best estimator found by the search.
    """
    n_estimators = [int(x) for x in np.linspace(start = 20, stop = 100, num = 5)]
    max_features = ['auto', 'sqrt']
    max_depth = [int(x) for x in np.linspace(10, 50, num = 5)]
    max_depth.append(None)
    min_samples_split = [2, 5, 10]
    min_samples_leaf = [1, 2, 4]
    bootstrap = [True, False]
    n_jobs = [4]

    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap,
                   'n_jobs': n_jobs}

    # Create the base model to tune
    rfc = RandomForestClassifier(random_state=8)
    # Definition of the random search
    random_search = RandomizedSearchCV(estimator=rfc,
                                       param_distributions=random_grid,
                                       n_iter=50,
                                       scoring='accuracy',
                                       cv=3,
                                       verbose=1,
                                       random_state=8,
                                       n_jobs=4)
    # Fit the random search model
    random_search.fit(features_train, labels_train)

    print("The best hyperparameters from Random Search are:")
    print(random_search.best_params_)
    print("")
    print("The mean accuracy of a model with these hyperparameters is:")
    print(random_search.best_score_)

    # Best model
    best_rfc = random_search.best_estimator_

    return best_rfc

In [None]:
def parameter_tuning_grid(features_train, labels_train):
    """
    Tune a RandomForestClassifier with an exhaustive grid search
    (27 candidates, 3 shuffled splits with a fixed random_state).

    Returns the best estimator found by the search.
    """
    bootstrap = [False]
    max_depth = [30, 40, 50]
    max_features = ['sqrt']
    min_samples_leaf = [1, 2, 4]
    min_samples_split = [5, 10, 15]
    n_estimators = [800]
    n_jobs = [4]

    param_grid = {
        'bootstrap': bootstrap,
        'max_depth': max_depth,
        'max_features': max_features,
        'min_samples_leaf': min_samples_leaf,
        'min_samples_split': min_samples_split,
        'n_estimators': n_estimators,
        'n_jobs': n_jobs
    }

    # Create a base model
    rfc = RandomForestClassifier(random_state=8)
    # Manually create the splits in CV
    cv_sets = ShuffleSplit(n_splits = 3, test_size = .33, random_state = 8)
    # Instantiate the grid search model
    grid_search = GridSearchCV(estimator=rfc,
                               param_grid=param_grid,
                               scoring='accuracy',
                               cv=cv_sets,
                               verbose=1,
                               n_jobs=4)

    # Fit the grid search to the data
    grid_search.fit(features_train, labels_train)
    print("The best hyperparameters from Grid Search are:")
    print(grid_search.best_params_)
    print("")
    print("The mean accuracy of a model with these hyperparameters is:")
    print(grid_search.best_score_)

    # Best model
    best_rfc = grid_search.best_estimator_

    return best_rfc

## Model fit and performance

In [46]:
best_rfc.fit(features_train, labels_train)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=40, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=15,
                       min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=4,
                       oob_score=False, random_state=8, verbose=0,
                       warm_start=False)
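
A side note on this step: with the default `refit=True`, `GridSearchCV` already refits `best_estimator_` on the full training data at the end of the search, so calling `fit` again is harmless but not required. The sketch below checks this on synthetic data with a deliberately tiny, made-up grid; `demo_grid` and `best_model` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_is_fitted

X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=8)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=8),
    param_grid={"max_depth": [3, 5]},
    scoring="accuracy",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

best_model = demo_grid.best_estimator_
# Raises NotFittedError if the estimator had not been fitted by the search.
check_is_fitted(best_model)
preds_before = best_model.predict(X_demo)

# Fitting again on the same data with the same integer seed rebuilds the same forest.
best_model.fit(X_demo, y_demo)
preds_after = best_model.predict(X_demo)

identical = bool((preds_before == preds_after).all())
print(identical)
```

Keeping the explicit `fit` call can still be a reasonable choice for readability, since it makes the training step visible in the notebook.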

In [47]:
rfc_pred = best_rfc.predict(features_test)

In [53]:
rfc_pred

array([1, 1, 0, ..., 1, 0, 1])

In [54]:
features_test

array([[ 28.,   1.,   0., ...,   3.,   3.,   3.],
       [278.,   1.,   1., ...,   2.,   3.,   0.],
       [257.,   1.,   0., ...,   1.,   1.,   6.],
       ...,
       [251.,   1.,   0., ...,   3.,   4.,   0.],
       [307.,   1.,   1., ...,   1.,   2.,   0.],
       [288.,   1.,   0., ...,   1.,   3.,   0.]])

For performance analysis, I will use the confusion matrix, the classification report and the accuracy on both training and test data:

#### Training accuracy

In [48]:
print("The training accuracy is: ")
print(accuracy_score(labels_train, best_rfc.predict(features_train)))

The training accuracy is: 
0.8904761904761904


#### Test accuracy

In [49]:
print("The test accuracy is: ")
print(accuracy_score(labels_test, rfc_pred))

The test accuracy is: 
0.8561576354679803


#### Classification report

In [50]:
print("Classification report")
print(classification_report(labels_test,rfc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.88      0.88      0.88      5576
           1       0.81      0.82      0.82      3559

    accuracy                           0.86      9135
   macro avg       0.85      0.85      0.85      9135
weighted avg       0.86      0.86      0.86      9135
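
The precision and recall columns in this report can be derived directly from the confusion matrix. The sketch below works through that arithmetic on ten hand-made labels (not the model's predictions) and checks the result against scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Ten hand-made labels: six of class 0 and four of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_1 = tp / (tp + fp)   # of all rows predicted as 1, how many are 1
recall_1 = tp / (tp + fn)      # of all rows that are 1, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision_1, precision_score(y_true, y_pred))
print(recall_1, recall_score(y_true, y_pred))
print(accuracy)
```

Read this way, the report above says the model is somewhat weaker on class 1 than on class 0 in both precision and recall.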



In [52]:
def model_fit_and_performance(best_rfc, features_train, labels_train, features_test, labels_test):
    """
    Fit the model and print the performance analysis: the accuracy on both
    training and test data, and the classification report on the test data.
    """
    best_rfc.fit(features_train, labels_train)
    rfc_pred = best_rfc.predict(features_test)

    print("The training accuracy is: ")
    print(accuracy_score(labels_train, best_rfc.predict(features_train)))
    print("The test accuracy is: ")
    print(accuracy_score(labels_test, rfc_pred))
    print("Classification report")
    print(classification_report(labels_test, rfc_pred))

#### Confusion matrix