# Capstone 2 Modeling

In this notebook we perform the modeling step of the data science method. The goal of this step is to develop a final model that effectively predicts each patient's class: 'positive' (1) or 'negative' (0). In the previous step we built two models: LogisticRegression (our baseline model) and RandomForestClassifier. We assessed the performance of each model first without tuning and then with hyperparameter tuning, using GridSearchCV for LogisticRegression and RandomizedSearchCV for RandomForest, and compared the results using classification reports. Both models performed better with hyperparameter tuning, and RandomForest showed noticeably better precision, recall and f1-score than LogisticRegression. We also saved our train_test_split and both models as pickle files.

In this step we will build two more models, GradientBoostingClassifier and AdaBoostClassifier, tune their hyperparameters to improve their performance, and compare their results with those of the two models built during the preprocessing and training data development step.

We will use standard metrics to assess the performance of our classification models: accuracy, precision, recall and f1-score.
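As a quick reminder of how these four metrics relate to each other, here is a minimal sketch that computes them from the cells of a binary confusion matrix. The helper name `classification_metrics` and the counts passed to it are illustrative assumptions, not output from this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much really is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that really is positive, how much we found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a 104-row test set (illustration only).
metrics = classification_metrics(tp=59, fp=1, fn=5, tn=39)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

`classification_report` from scikit-learn reports the same quantities per class, plus macro and weighted averages.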

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, RandomizedSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
import os
import pickle

Now let's load our encoded dataset, the train_test_split and the two models, RandomForest and LogisticRegression, that we built in the preprocessing and training data development step.

In [2]:
df = pd.read_csv('../EDA/Diabetes_EDA.csv')

In [3]:
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,0,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1
1,58,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1
2,41,0,1,0,0,1,1,0,0,1,0,1,0,1,1,0,1
3,45,0,0,0,1,1,1,1,0,1,0,1,0,0,0,0,1
4,60,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1


In [4]:
#loading train_test_split
with open('../Splits/Train_Test_Split.pkl', 'rb') as tts_pickle:
    tts = pickle.load(tts_pickle)
#loading RandomForestClassifier
with open('../Models/Diabetes_RandFor.pkl', 'rb') as rfc_pickle:
    rfc = pickle.load(rfc_pickle)
#loading LogisticRegression
with open('../Models/Diabetes_LogReg.pkl', 'rb') as lr_pickle:
    lr = pickle.load(lr_pickle)

In [5]:
X_train, X_test, y_train, y_test = tts

In [6]:
rfc

RandomForestClassifier(max_depth=340, max_features='sqrt', n_estimators=600,
                       oob_score=True)

In [7]:
lr

GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]})

In [8]:
rfc_ypred = rfc.predict(X_test)
lr_ypred = lr.predict(X_test)

Let's see classification reports for each of these models.

In [9]:
print('====== RANDOM FOREST CLASSIFICATION REPORT =====')
print(classification_report(y_test, rfc_ypred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99        40
           1       1.00      0.98      0.99        64

    accuracy                           0.99       104
   macro avg       0.99      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104



In [10]:
print('====== LOGISTIC REGRESSION CLASSIFICATION REPORT =====')
print(classification_report(y_test, lr_ypred))

              precision    recall  f1-score   support

           0       0.89      0.97      0.93        40
           1       0.98      0.92      0.95        64

    accuracy                           0.94       104
   macro avg       0.93      0.95      0.94       104
weighted avg       0.95      0.94      0.94       104



As we know from the previous step, our RandomForestClassifier performed better than the LogisticRegression model (0.99 vs. 0.94 accuracy on the test set). Now we will build two more models, starting with GradientBoostingClassifier.

Let's initialize our model, fit it to the training set and see how it performs. We will not use any hyperparameter tuning now.

In [11]:
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train, y_train)
gbc_ypred_train = gbc.predict(X_train)

In [12]:
print('===== GBC TRAINING SET ACCURACY SCORE =====')
print(accuracy_score(y_train, gbc_ypred_train))
print('===== GBC TRAINING SET CLASSIFICATION REPORT =====')
print(classification_report(y_train, gbc_ypred_train))

===== GBC TRAINING SET ACCURACY SCORE =====
1.0
===== GBC TRAINING SET CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       160
           1       1.00      1.00      1.00       256

    accuracy                           1.00       416
   macro avg       1.00      1.00      1.00       416
weighted avg       1.00      1.00      1.00       416



We have perfect results on our training set. Now let's see how the model performs on the test set.

In [13]:
gbc_ypred_test = gbc.predict(X_test)

In [14]:
print('===== GBC TEST SET ACCURACY SCORE =====')
print(accuracy_score(y_test, gbc_ypred_test))
print('===== GBC TEST SET CLASSIFICATION REPORT =====')
print(classification_report(y_test, gbc_ypred_test))

===== GBC TEST SET ACCURACY SCORE =====
0.9903846153846154
===== GBC TEST SET CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        40
           1       1.00      0.98      0.99        64

    accuracy                           0.99       104
   macro avg       0.99      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104



The results are slightly worse, but there is no significant gap between the train and test results. We can conclude that even the default model, with no hyperparameter tuning, generalizes to new data and shows great results. However, we will still tune the hyperparameters for the sake of comparing model performance.
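Because the test split has only 104 rows, a single train/test comparison can be noisy. One way to double-check generalization is k-fold cross-validation. The sketch below is only an illustration on synthetic data from `make_classification` (a stand-in, not the diabetes dataset); `X_demo`, `y_demo` and the reduced `n_estimators=50` are assumptions chosen to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as our data (520 rows, 16 features).
X_demo, y_demo = make_classification(
    n_samples=520, n_features=16, n_informative=8, random_state=42
)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print("mean +/- std:", scores.mean().round(3), scores.std().round(3))
```

With the real data, the same call on `X_train` and `y_train` would give a spread of scores rather than a single number.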

In [15]:
gbc.get_params().keys()

dict_keys(['ccp_alpha', 'criterion', 'init', 'learning_rate', 'loss', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_iter_no_change', 'presort', 'random_state', 'subsample', 'tol', 'validation_fraction', 'verbose', 'warm_start'])

As we can see, GradientBoostingClassifier's parameters are very similar to RandomForestClassifier's parameters. For our goal we are mostly interested in n_estimators, learning_rate, max_features and max_depth.

In [16]:
#number of trees
n_estimators = [int(i) for i in np.linspace(200, 2000, 10)]

#learning rates
learning_rate = [0.05, 0.1, 0.25, 0.5, 0.75, 1]

#number of features for each split
max_features = ['auto', 'sqrt']

#maximal depth
max_depth = [int(i) for i in np.linspace(100, 500, 11)]

#parameters grid
param_grid = {'n_estimators':n_estimators, 'learning_rate':learning_rate, 'max_features':max_features, 'max_depth':max_depth}

In [17]:
gbc_rand = RandomizedSearchCV(estimator=gbc, param_distributions=param_grid, n_iter=100, cv=5, random_state=42, n_jobs=-1)
gbc_rand.fit(X_train, y_train)

RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(random_state=42),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'learning_rate': [0.05, 0.1, 0.25, 0.5,
                                                          0.75, 1],
                                        'max_depth': [100, 140, 180, 220, 260,
                                                      300, 340, 380, 420, 460,
                                                      500],
                                        'max_features': ['auto', 'sqrt'],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42)

In [18]:
print(gbc_rand.best_params_)

{'n_estimators': 200, 'max_features': 'sqrt', 'max_depth': 500, 'learning_rate': 0.05}


We obtained the best parameters for our GradientBoostingClassifier. Now let's plug those parameters into our model and assess its performance.

In [19]:
gbc_params = GradientBoostingClassifier(n_estimators=200, learning_rate = 0.05, max_features='sqrt', max_depth = 500, random_state = 42)
gbc_params.fit(X_train, y_train)
gbc_params_ypred = gbc_params.predict(X_test)
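As an alternative to retyping the best parameters by hand, a fitted RandomizedSearchCV keeps a copy of the winning model, refitted on the whole training set, in `best_estimator_` (when `refit=True`, which is the default). A minimal sketch on synthetic data follows; `small_grid`, `X_demo` and the other names are hypothetical and deliberately small so the search runs fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=16, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# A deliberately tiny search space, for illustration only.
small_grid = {
    "n_estimators": [20, 40],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=small_grid,
    n_iter=4,
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

# Already refitted on all of X_tr with the best parameters found.
best_model = search.best_estimator_
print(search.best_params_)
print("test accuracy:", best_model.score(X_te, y_te))
```

Using `best_estimator_` avoids copy-paste mistakes when the search is rerun and the best parameters change.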

In [20]:
print('===== GBC_PARAMS TEST SET ACCURACY SCORE =====')
print(accuracy_score(y_test, gbc_params_ypred))
print('===== GBC_PARAMS TEST SET CLASSIFICATION REPORT =====')
print(classification_report(y_test, gbc_params_ypred))

===== GBC_PARAMS TEST SET ACCURACY SCORE =====
1.0
===== GBC_PARAMS TEST SET CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        40
           1       1.00      1.00      1.00        64

    accuracy                           1.00       104
   macro avg       1.00      1.00      1.00       104
weighted avg       1.00      1.00      1.00       104



From the classification report we can see that tuning the model's parameters improved its performance, and we got perfect scores on the test set. This model showed the best results among all the models we have used so far.

Next, we're going to build an AdaBoost model and see how it performs. Again, we will first skip hyperparameter tuning and check how the model performs on the train and test splits.

In [21]:
abc = AdaBoostClassifier(random_state=42)
abc.fit(X_train, y_train)
abc_ypred_train = abc.predict(X_train)
print('===== ABC TRAINING SET ACCURACY SCORE =====')
print(accuracy_score(y_train, abc_ypred_train))
print('===== ABC TRAINING SET CLASSIFICATION REPORT =====')
print(classification_report(y_train, abc_ypred_train))

===== ABC TRAINING SET ACCURACY SCORE =====
0.9471153846153846
===== ABC TRAINING SET CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       0.92      0.95      0.93       160
           1       0.97      0.95      0.96       256

    accuracy                           0.95       416
   macro avg       0.94      0.95      0.94       416
weighted avg       0.95      0.95      0.95       416



AdaBoostClassifier showed good results on the training set. Let's see how it works on the test set.

In [22]:
abc_ypred_test = abc.predict(X_test)
print('===== ABC TEST SET ACCURACY SCORE =====')
print(accuracy_score(y_test, abc_ypred_test))
print('===== ABC TEST SET CLASSIFICATION REPORT =====')
print(classification_report(y_test, abc_ypred_test))

===== ABC TEST SET ACCURACY SCORE =====
0.9519230769230769
===== ABC TEST SET CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       0.89      1.00      0.94        40
           1       1.00      0.92      0.96        64

    accuracy                           0.95       104
   macro avg       0.94      0.96      0.95       104
weighted avg       0.96      0.95      0.95       104



Although the accuracy score improved slightly, some of the results got worse: precision for class 0 dropped from 0.92 to 0.89 and recall for class 1 from 0.95 to 0.92. Overall, there are no significant gaps in model performance between the train and test splits.

Now we will tune the hyperparameters of AdaBoostClassifier and check whether this improves model performance.

In [23]:
abc.get_params().keys()

dict_keys(['algorithm', 'base_estimator', 'learning_rate', 'n_estimators', 'random_state'])

In [24]:
param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "learning_rate": [0.05, 0.1, 0.25, 0.5, 0.75, 1],
              "n_estimators": [int(i) for i in np.linspace(200, 2000, 10)]
             }


weak_c = DecisionTreeClassifier(random_state = 42, max_features = "auto", class_weight = "balanced", max_depth = 3)

ABC = AdaBoostClassifier(base_estimator = weak_c)
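The `base_estimator__criterion` style keys in the grid above use scikit-learn's double-underscore convention: `<parameter>__<nested parameter>` reaches into the estimator stored under `<parameter>`. The sketch below shows this with `get_params` and `set_params`. One assumption to note: the weak learner keyword is `base_estimator` in the scikit-learn version used in this notebook, while newer releases name it `estimator`, so the hypothetical `weak_key` variable picks whichever one the installed version accepts.

```python
import inspect

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Pick the keyword the installed scikit-learn accepts for the weak learner.
init_params = inspect.signature(AdaBoostClassifier.__init__).parameters
weak_key = "estimator" if "estimator" in init_params else "base_estimator"

booster = AdaBoostClassifier(**{weak_key: DecisionTreeClassifier(max_depth=3)})

# Nested parameters of the weak learner appear as "<weak_key>__<name>".
nested = sorted(k for k in booster.get_params() if k.startswith(weak_key + "__"))
print(nested[:5])

# Setting a nested parameter changes the wrapped decision tree itself.
booster.set_params(**{weak_key + "__criterion": "entropy"})
print(getattr(booster, weak_key).criterion)
```

A search object such as RandomizedSearchCV applies each sampled combination through the same `set_params` mechanism.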

In [28]:
abc_rand = RandomizedSearchCV(estimator=ABC, param_distributions=param_grid, n_iter=40, cv=5, scoring='roc_auc', random_state=42)
abc_rand.fit(X_train, y_train)
print(abc_rand.best_params_)

{'n_estimators': 600, 'learning_rate': 0.5, 'base_estimator__splitter': 'best', 'base_estimator__criterion': 'entropy'}


Let's plug those parameters into the model.

In [29]:
base_est = DecisionTreeClassifier(random_state=42, max_features='auto', class_weight='balanced', max_depth=3, splitter='best', criterion='entropy')
abc_params = AdaBoostClassifier(base_estimator=base_est, n_estimators=600, learning_rate=0.5, random_state=42)
abc_params.fit(X_train, y_train)
abc_params_ypred = abc_params.predict(X_test)

In [30]:
print('===== ABC_PARAMS TEST SET ACCURACY SCORE =====')
print(accuracy_score(y_test, abc_params_ypred))
print('===== ABC_PARAMS TEST SET CLASSIFICATION REPORT =====')
print(classification_report(y_test, abc_params_ypred))

===== ABC_PARAMS TEST SET ACCURACY SCORE =====
1.0
===== ABC_PARAMS TEST SET CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        40
           1       1.00      1.00      1.00        64

    accuracy                           1.00       104
   macro avg       1.00      1.00      1.00       104
weighted avg       1.00      1.00      1.00       104



We can see that the tuned AdaBoostClassifier shows better results than the untuned one. Again, just as with GradientBoostingClassifier, hyperparameter tuning allowed us to achieve a perfect score.

Among all the models that we built, GradientBoostingClassifier and AdaBoostClassifier showed the best results: 100% accuracy on the test set after hyperparameter tuning. Those two are the best models we've built, and it would make sense to use one of them for our problem.
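To make the comparison easier to scan, the test-set accuracies reported in this notebook can be collected in one place. The sketch below simply ranks the numbers printed above (the LogisticRegression and RandomForest values are the rounded figures from their classification reports); the dictionary name and labels are our own.

```python
# Test-set accuracies as printed earlier in this notebook.
test_accuracy = {
    "LogisticRegression (tuned)": 0.94,
    "RandomForest (tuned)": 0.99,
    "GradientBoosting (default)": 0.9904,
    "GradientBoosting (tuned)": 1.0,
    "AdaBoost (default)": 0.9519,
    "AdaBoost (tuned)": 1.0,
}

# Rank models from best to worst test accuracy.
ranked = sorted(test_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<28} {acc:.4f}")
```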

FEATURE SELECTION 