# Capstone 2 Modeling

In this notebook we perform the modeling step of the data science method. The goal of this step is to develop a final model that effectively predicts each patient's class - 'positive' (1) or 'negative' (0). In the previous step we built two models - LogisticRegression (our baseline model) and RandomForestClassifier. We assessed the performance of each model first without tuning and then with hyperparameter tuning, using GridSearchCV for LogisticRegression and RandomizedSearchCV for RandomForest, and compared the results using the classification report. Both models performed better with hyperparameter tuning, and RandomForest showed significantly better precision, recall and f1-score than LogisticRegression. We also saved our train_test_split and both models as pickle files.

In this step we will build three more models - GradientBoostingClassifier, AdaBoostClassifier and a support vector machine (SVC) - tune their hyperparameters to enhance their performance, and compare their results with those of the two models built during the preprocessing and training data development step.

We will use standard metrics to assess our classification models' performance - accuracy, precision, recall and f1-score.
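As a reminder of what these four metrics measure, here is a minimal sketch that computes them by hand from true and predicted labels. The helper name `binary_metrics` and the toy labels are invented for illustration; the notebook itself relies on scikit-learn's `accuracy_score` and `classification_report`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels, invented for illustration (not taken from the diabetes dataset).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)
```

Calling the helper with `positive=0` gives the class-0 precision and recall, which is how the per-class rows of a classification report can be read.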

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, RandomizedSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
import os
import pickle

Now let's load our encoded dataset, the train/test split and the two models - RandomForest and LogisticRegression - that we built during the preprocessing and training data development step.

In [2]:
df = pd.read_csv('../EDA/Diabetes_EDA.csv')

In [3]:
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,0,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1
1,58,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1
2,41,0,1,0,0,1,1,0,0,1,0,1,0,1,1,0,1
3,45,0,0,0,1,1,1,1,0,1,0,1,0,0,0,0,1
4,60,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1


In [4]:
#loading train_test_split
with open('../Splits/Train_Test_Split.pkl', 'rb') as tts_pickle:
    tts = pickle.load(tts_pickle)
#loading RandomForestClassifier
with open('../Models/Diabetes_RandFor.pkl', 'rb') as rfc_pickle:
    rfc = pickle.load(rfc_pickle)
#loading LogisticRegression
with open('../Models/Diabetes_LogReg.pkl', 'rb') as lr_pickle:
    lr = pickle.load(lr_pickle)

In [5]:
X_train, X_test, y_train, y_test = tts

In [6]:
#scaling
scaler = StandardScaler()
X_train_norm = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_norm = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

In [7]:
rfc

RandomForestClassifier(max_depth=340, max_features='sqrt', n_estimators=600,
                       oob_score=True)

In [8]:
lr

GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]})

In [9]:
rfc_ypred = rfc.predict(X_test_norm)
lr_ypred = lr.predict(X_test_norm)

Let's see classification reports for each of these models.

In [10]:
print('====== RANDOM FOREST CLASSIFICATION REPORT =====')
print(classification_report(y_test, rfc_ypred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97        40
           1       0.98      0.98      0.98        64

    accuracy                           0.98       104
   macro avg       0.98      0.98      0.98       104
weighted avg       0.98      0.98      0.98       104



In [11]:
print('====== LOGISTIC REGRESSION CLASSIFICATION REPORT =====')
print(classification_report(y_test, lr_ypred))

              precision    recall  f1-score   support

           0       0.71      1.00      0.83        40
           1       1.00      0.75      0.86        64

    accuracy                           0.85       104
   macro avg       0.86      0.88      0.85       104
weighted avg       0.89      0.85      0.85       104



As we know from the previous step, our RandomForestClassifier showed significantly better results than LogisticRegression model. Now we will build two more models. We will start with GradientBoostingClassifier.

Let's initialize our model, fit it to the training set and see how it performs. We will not use any hyperparameter tuning now.

In [12]:
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train_norm, y_train)
gbc_ypred = gbc.predict(X_test_norm)

In [13]:
print('===== GBC ACCURACY SCORE =====')
print(accuracy_score(y_test, gbc_ypred))
print('===== GBC CLASSIFICATION REPORT =====')
print(classification_report(y_test, gbc_ypred))

===== GBC ACCURACY SCORE =====
0.9903846153846154
===== GBC CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        40
           1       1.00      0.98      0.99        64

    accuracy                           0.99       104
   macro avg       0.99      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104



We can conclude that even the default model with no hyperparameter tuning generalizes well to new data and shows great results. However, we will still tune its hyperparameters for the sake of comparing the models' performance.

In [14]:
gbc.get_params().keys()

dict_keys(['ccp_alpha', 'criterion', 'init', 'learning_rate', 'loss', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_iter_no_change', 'presort', 'random_state', 'subsample', 'tol', 'validation_fraction', 'verbose', 'warm_start'])

As we can see, GradientBoostingClassifier's parameters are very similar to RandomForestClassifier parameters. For our goal we are mostly interested in n_estimators, learning_rate, max_features and max_depth.

In [15]:
#number of trees
n_estimators = [int(i) for i in np.linspace(200, 2000, 10)]

#learning rates
learning_rate = [0.05, 0.1, 0.25, 0.5, 0.75, 1]

#number of features for each split
max_features = ['auto', 'sqrt']

#maximal depth
max_depth = [int(i) for i in np.linspace(100, 500, 11)]

#parameters grid
param_grid = {'n_estimators':n_estimators, 'learning_rate':learning_rate, 'max_features':max_features, 'max_depth':max_depth}
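To get a feel for the size of this search space, the sketch below counts the combinations in the grid and compares them with the budget of the randomized search in this notebook (`n_iter=100`, `cv=5`). The variable names are illustrative only.

```python
from math import prod

# Sizes of the candidate lists defined in the cell above.
grid_sizes = {
    "n_estimators": 10,   # np.linspace(200, 2000, 10)
    "learning_rate": 6,   # [0.05, 0.1, 0.25, 0.5, 0.75, 1]
    "max_features": 2,    # ['auto', 'sqrt']
    "max_depth": 11,      # np.linspace(100, 500, 11)
}
n_combinations = prod(grid_sizes.values())   # size of the full grid

n_iter, cv = 100, 5                  # settings of the randomized search
random_fits = n_iter * cv            # model fits done by the randomized search
full_grid_fits = n_combinations * cv # fits an exhaustive grid search would need
coverage = n_iter / n_combinations   # share of the grid that gets sampled

print(n_combinations, random_fits, full_grid_fits, round(coverage, 3))
# prints: 1320 500 6600 0.076
```

The randomized search therefore tries under 8% of the grid, which is the usual trade-off: far fewer fits, at the cost of possibly missing the best combination.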

In [16]:
# gbc_rand = RandomizedSearchCV(estimator=gbc, param_distributions=param_grid, n_iter=100, cv=5, random_state=42, n_jobs=-1)
# gbc_rand.fit(X_train_norm, y_train)

In [17]:
# print(gbc_rand.best_params_)

We obtained the best parameters for our GradientBoostingClassifier. Now let's plug those parameters into our model and assess its performance.

In [18]:
gbc_params = GradientBoostingClassifier(n_estimators=200, learning_rate = 0.05, max_features='sqrt', max_depth = 500, random_state = 42)
gbc_params.fit(X_train_norm, y_train)
gbc_params_ypred = gbc_params.predict(X_test_norm)
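Since the search cells are commented out, the values above are typed in by hand. One option, sketched here as an assumption rather than something this notebook does, is to save the best parameters to a small JSON file once and reload them later. The file name `gbc_best_params.json` is hypothetical, and a temporary directory is used so the sketch runs on its own.

```python
import json
import os
import tempfile

# Values copied from the GradientBoostingClassifier constructor above.
best_params = {
    "n_estimators": 200,
    "learning_rate": 0.05,
    "max_features": "sqrt",
    "max_depth": 500,
}

# Hypothetical location for the saved parameters.
params_path = os.path.join(tempfile.mkdtemp(), "gbc_best_params.json")

# Save once, right after the search finishes ...
with open(params_path, "w") as fh:
    json.dump(best_params, fh, indent=2)

# ... and reload in later runs instead of repeating the search.
with open(params_path) as fh:
    restored_params = json.load(fh)

# The restored dict could then be unpacked into the model, e.g.
# GradientBoostingClassifier(**restored_params, random_state=42).
print(restored_params)
```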

In [19]:
print('===== GBC_PARAMS TEST SET ACCURACY SCORE =====')
print(accuracy_score(y_test, gbc_params_ypred))
print('===== GBC_PARAMS TEST SET CLASSIFICATION REPORT =====')
print(classification_report(y_test, gbc_params_ypred))

===== GBC_PARAMS TEST SET ACCURACY SCORE =====
0.9903846153846154
===== GBC_PARAMS TEST SET CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        40
           1       0.98      1.00      0.99        64

    accuracy                           0.99       104
   macro avg       0.99      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104



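From the classification report we can see that tuning the model's parameters did not have any significant effect on the scores: accuracy is unchanged at 0.99, and only the single misclassified sample differs. The default model misses one positive case, while the tuned model labels one negative case as positive.

Next, we're going to build an AdaBoost model and see how it performs. Again, we will first fit it without hyperparameter tuning and check how it performs on the test split.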

In [20]:
abc = AdaBoostClassifier(random_state=42)
abc.fit(X_train_norm, y_train)
abc_ypred = abc.predict(X_test_norm)
print('===== ABC ACCURACY SCORE =====')
print(accuracy_score(y_test, abc_ypred))
print('===== ABC CLASSIFICATION REPORT =====')
print(classification_report(y_test, abc_ypred))

===== ABC ACCURACY SCORE =====
0.9519230769230769
===== ABC CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       0.89      1.00      0.94        40
           1       1.00      0.92      0.96        64

    accuracy                           0.95       104
   macro avg       0.94      0.96      0.95       104
weighted avg       0.96      0.95      0.95       104



Now we will tune the hyperparameters of AdaBoostClassifier and check whether this improves the model's performance.

In [21]:
abc.get_params().keys()

dict_keys(['algorithm', 'base_estimator', 'learning_rate', 'n_estimators', 'random_state'])

In [22]:
param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "learning_rate": [0.05, 0.1, 0.25, 0.5, 0.75, 1],
              "n_estimators": [int(i) for i in np.linspace(200, 2000, 10)]
             }


weak_c = DecisionTreeClassifier(random_state = 42, max_features = "auto", class_weight = "balanced", max_depth = 3)

ABC = AdaBoostClassifier(base_estimator = weak_c)

In [23]:
# abc_rand = RandomizedSearchCV(estimator=ABC, param_distributions=param_grid, n_iter=40, cv=5, scoring='roc_auc', random_state=42)
# abc_rand.fit(X_train_norm, y_train)
# print(abc_rand.best_params_)

Let's plug those parameters into the model.

In [24]:
base_est = DecisionTreeClassifier(random_state=42, max_features='auto', class_weight='balanced', max_depth=3, splitter='best', criterion='gini')
abc_params = AdaBoostClassifier(base_estimator=base_est, n_estimators=200, learning_rate=0.1, random_state=42)
abc_params.fit(X_train_norm, y_train)
abc_params_ypred = abc_params.predict(X_test_norm)

In [25]:
print('===== ABC_PARAMS ACCURACY SCORE =====')
print(accuracy_score(y_test, abc_params_ypred))
print('===== ABC_PARAMS CLASSIFICATION REPORT =====')
print(classification_report(y_test, abc_params_ypred))

===== ABC_PARAMS ACCURACY SCORE =====
1.0
===== ABC_PARAMS CLASSIFICATION REPORT =====
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        40
           1       1.00      1.00      1.00        64

    accuracy                           1.00       104
   macro avg       1.00      1.00      1.00       104
weighted avg       1.00      1.00      1.00       104



We can see that hyperparameter tuning of AdaBoostClassifier allowed us to achieve a perfect score on the test set.

Among all the models we have built, AdaBoostClassifier showed the highest results - 100% on the test set after hyperparameter tuning. This is the best model we've built so far.
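With only 104 test samples, it may help to look at how much uncertainty sits behind these accuracy figures. The sketch below computes a Wilson score interval, a common approximate 95% interval for a proportion. It is an added illustration rather than part of the original analysis, and the helper name `wilson_interval` is made up.

```python
from math import sqrt

def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half = z * sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# 104 test samples, as shown in the support column of the reports above.
perfect = wilson_interval(104, 104)   # tuned AdaBoost: 104 of 104 correct
one_miss = wilson_interval(103, 104)  # accuracy 0.9904 is 103 of 104 correct
print(perfect, one_miss)
```

The two intervals overlap, so on this test set alone a perfect score and a single error are hard to tell apart statistically.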

Now, let's build our third model - a support vector machine. For this problem we will use the Radial Basis Function (RBF) kernel.

In [26]:
svmclf = svm.SVC(kernel='rbf')
svmclf.fit(X_train_norm, y_train)
svmclf_ypred = svmclf.predict(X_test_norm)

In [27]:
print('=====SVMCLF ACCURACY SCORE=====')
print(accuracy_score(y_test, svmclf_ypred))
print('=====SVMCLF CLASSIFICATION REPORT=====')
print(classification_report(y_test, svmclf_ypred))

=====SVMCLF ACCURACY SCORE=====
0.9807692307692307
=====SVMCLF CLASSIFICATION REPORT=====
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        40
           1       0.98      0.98      0.98        64

    accuracy                           0.98       104
   macro avg       0.98      0.98      0.98       104
weighted avg       0.98      0.98      0.98       104



From the classification report we can tell that the support vector machine with an RBF kernel and no hyperparameter tuning shows results that are just slightly worse than those of GradientBoostingClassifier (0.98 vs 0.99 accuracy). However, it is still inferior to AdaBoostClassifier. Now let's find the best parameters for our SVM model.

In [28]:
svmclf.get_params().keys()

dict_keys(['C', 'break_ties', 'cache_size', 'class_weight', 'coef0', 'decision_function_shape', 'degree', 'gamma', 'kernel', 'max_iter', 'probability', 'random_state', 'shrinking', 'tol', 'verbose'])

For our SVM with an RBF kernel we are mostly interested in two parameters (apart from 'kernel' itself, of course) - 'C' and 'gamma'.

In [29]:
# C_param = np.logspace(-2, 10, 13)
# gamma_param = np.logspace(-9, 3, 13)
# grid_param = {'C':C_param, 'gamma':gamma_param}
# svmclf_grid = GridSearchCV(svmclf, param_grid=grid_param, cv=5)
# svmclf_grid.fit(X_train_norm, y_train)

In [30]:
# print(svmclf_grid.best_params_)

Now let's plug those parameters into the model.

In [31]:
svmclf_params = svm.SVC(kernel='rbf', C=10.0, gamma=0.1)
svmclf_params.fit(X_train_norm, y_train)
svmclf_params_ypred = svmclf_params.predict(X_test_norm)
print('=====SVMCLF_PARAMS ACCURACY SCORE=====')
print(accuracy_score(y_test, svmclf_params_ypred))
print('=====SVMCLF_PARAMS CLASSIFICATION REPORT=====')
print(classification_report(y_test, svmclf_params_ypred))

=====SVMCLF_PARAMS ACCURACY SCORE=====
0.9807692307692307
=====SVMCLF_PARAMS CLASSIFICATION REPORT=====
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        40
           1       0.98      0.98      0.98        64

    accuracy                           0.98       104
   macro avg       0.98      0.98      0.98       104
weighted avg       0.98      0.98      0.98       104



We can see that hyperparameter tuning did not change the model's performance on the test set. This does not mean the tuned values coincide with the defaults: scikit-learn's SVC defaults to C=1.0 and gamma='scale', while the tuned model uses C=10.0 and gamma=0.1. A more likely explanation is that both settings classify this 104-sample test set equally well.

Now, let's put our models' results into one table to better understand how they performed relative to each other. We will only use the tuned models, since tuning either improved or preserved each model's test scores.
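To see roughly what the default `gamma='scale'` amounts to here, the formula given in the scikit-learn documentation, `1 / (n_features * X.var())`, can be evaluated by hand. The feature count (16) and the variance (1.0) below are assumptions: 16 is the number of columns from Age through Obesity in `df.head()`, and a variance near 1 is what StandardScaler would typically produce on the training data.

```python
def scale_gamma(n_features, feature_variance):
    """Documented value of gamma='scale': 1 / (n_features * X.var())."""
    return 1.0 / (n_features * feature_variance)

# Assumed inputs; in the notebook, X_train_norm.shape[1] and
# X_train_norm.values.var() would give the real numbers.
default_gamma = scale_gamma(n_features=16, feature_variance=1.0)
tuned_gamma = 0.1   # value passed to svm.SVC in the tuned model

print(default_gamma, tuned_gamma)
# prints: 0.0625 0.1
```

Under these assumptions the default gamma is about 0.06. That is the same order of magnitude as the tuned 0.1 but not identical, and C differs by a factor of ten (1.0 vs 10.0).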

| MODEL | ACCURACY | PRECISION (1) | PRECISION (0) | RECALL (1) | RECALL (0) | F1-SCORE (1) | F1-SCORE (0) |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| LogisticRegression | 0.85 | 1.00 | 0.71 | 0.75 | 1.00 | 0.86 | 0.83 |
| RandomForest | 0.98 | 0.98 | 0.97 | 0.98 | 0.97 | 0.98 | 0.97 |
| GradientBoosting | 0.99 | 0.98 | 1.00 | 1.00 | 0.97 | 0.99 | 0.99 |
| AdaBoost | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Support Vector Machine | 0.98 | 0.98 | 0.97 | 0.98 | 0.97 | 0.98 | 0.97 |
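A quick consistency check for the accuracy column: overall accuracy equals the support-weighted mean of the per-class recalls. The sketch below applies this to the recalls printed in the classification reports, with supports of 64 (class 1) and 40 (class 0). The helper name is illustrative.

```python
def accuracy_from_recalls(recall_pos, recall_neg, n_pos=64, n_neg=40):
    """Accuracy as the support-weighted mean of per-class recall."""
    return (recall_pos * n_pos + recall_neg * n_neg) / (n_pos + n_neg)

# (recall for class 1, recall for class 0), as printed in the reports above.
reported_recalls = {
    "LogisticRegression": (0.75, 1.00),
    "RandomForest": (0.98, 0.97),
    "GradientBoosting": (1.00, 0.97),
    "AdaBoost": (1.00, 1.00),
    "Support Vector Machine": (0.98, 0.97),
}
implied_accuracy = {
    name: round(accuracy_from_recalls(*recalls), 2)
    for name, recalls in reported_recalls.items()
}
print(implied_accuracy)
```

Because the recalls are themselves rounded to two decimals, the implied accuracy is only approximate, but it should agree with each reported accuracy to within about 0.01.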

#### As we can see from the table above, AdaBoostClassifier showed the best - in fact, perfect - scores. In our future work we will use this model.
##### It is important to point out that most of our models showed very good results. Model tuning aside, one thing might have contributed to those results - the dataset might have been artificially composed in a way that makes it easier to build models with high accuracy (in the generic meaning of this word) scores.

Now let's revisit some findings from the EDA. We saved the correlation dataframes as pickle files; let's open them. Here are the top 5 feature-to-target and feature-to-feature correlations in our dataset.
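One way to probe whether a high score depends on a particular train/test split is k-fold cross-validation on the training data. The sketch below uses synthetic stand-in data from `make_classification` so that it runs on its own. In the notebook, `X_train_norm` and `y_train` would take its place, and any of the classifiers above could be substituted for the default AdaBoostClassifier used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=16,
                                     random_state=42)

model = AdaBoostClassifier(n_estimators=50, random_state=42)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; the spread hints at how stable the score is.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=folds,
                            scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

If the fold scores on the real data were all close to the test score, that would support the result; a wide spread would suggest the single split was a lucky one.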

In [32]:
# raw strings keep the backslashes in the Windows paths from being read as escape sequences
with open(r'D:\Tutorials\SDST\My Projects\Capstone2\Correlations\Feature-to-Target.pkl', 'rb') as f_to_t_pickle:
    f_to_t = pickle.load(f_to_t_pickle)
with open(r'D:\Tutorials\SDST\My Projects\Capstone2\Correlations\Feature-to-Feature.pkl', 'rb') as f_to_f_pickle:
    f_to_f = pickle.load(f_to_f_pickle)

EOFError: Ran out of input
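`EOFError: Ran out of input` is what `pickle.load` raises when the file has no bytes to read, so a likely cause (an assumption, since the files themselves are not shown) is that one of the pickles is empty, for example because an earlier write was interrupted. A minimal sketch of a guarded loader follows. The helper `load_pickle` and the demo file names are hypothetical.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Load a pickle, failing early with a clearer message for empty files."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty; re-create it before loading")
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Demonstration with throwaway files (not the project's real paths).
demo_dir = tempfile.mkdtemp()

empty_path = os.path.join(demo_dir, "empty.pkl")
open(empty_path, "wb").close()          # zero-byte file, like a failed save

good_path = os.path.join(demo_dir, "good.pkl")
with open(good_path, "wb") as fh:
    pickle.dump({"example": 1}, fh)     # placeholder payload

try:
    load_pickle(empty_path)
except ValueError as err:
    message = str(err)                  # explains which file is empty

restored = load_pickle(good_path)
print(message)
print(restored)
```

If the check reports an empty file, re-running the EDA cell that writes the correlation dataframes should regenerate it.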