# Preprocessing and Training Data Development

In this notebook we perform preprocessing and training data development for the project. The goal of this step is to standardize the dataset used for model development. It typically includes encoding categorical variables, scaling numeric features and splitting the dataset into train and test sets. Since the data was already encoded in the previous step (EDA), we don't have to encode it again. Most of our features are binary; the only numeric feature is 'Age', so we will scale the features with StandardScaler after the split.

We saw in the previous steps that our dataset consists mostly of binary features. We also determined that the problem we're trying to solve is a classification problem - classifying patients as 'positive' (has diabetes) or 'negative' (does not have diabetes). In this step we will build and train a baseline model - in this case a Logistic Regression model, which is a common choice for binary classification. We will also fit a second model - Random Forest - in order to compare it with the Logistic Regression model and see which of them performs better.

In [1]:
#let's import all necessary libraries
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import sklearn.model_selection
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, KFold, cross_val_score, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os
import pickle

In [2]:
#raw string so that the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'D:\Tutorials\SDST\My Projects\Capstone2\EDA\Diabetes_EDA.csv')

In [3]:
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,0,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1
1,58,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1
2,41,0,1,0,0,1,1,0,0,1,0,1,0,1,1,0,1
3,45,0,0,0,1,1,1,1,0,1,0,1,0,0,0,0,1
4,60,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1


From our EDA we know that there are 320 positive cases (61.5%) and 200 negative cases (38.5%) in the dataset.
We are going to perform a train/test split, with 80% of our data in the train set and 20% in the test set. Let's check the shape of our dataset and the expected sizes of the train and test sets.

In [4]:
df.shape

(520, 17)

In [5]:
len(df) * 0.8, len(df)*0.2

(416.0, 104.0)

We will have 416 observations in our train set and 104 observations in our test set.

Now let's create two variables - X and y. X will be a matrix of 520 rows and 16 columns (all columns that represent a patient's characteristics) and y is going to be our target variable - a vector with a shape of (520,) (representing 'positive' or 'negative' class).

In [6]:
X = df.drop('class', axis=1)
y = df['class']

In [7]:
print(X.shape)
print(y.shape)

(520, 16)
(520,)


Now we will perform the split using scikit-learn's train_test_split function. We will use the 'stratify' parameter in order to keep the same proportions of 'positive' and 'negative' cases as in our dataset before the split. Since we have one feature which is non-binary - 'Age' - we will scale the features with StandardScaler so that they are on a comparable scale (zero mean and unit variance). The scaler is fit on the train set only and then applied to the test set, so no information from the test set leaks into training. We will also save our splits as a pickle file.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [9]:
#saving the split created above (instead of calling train_test_split a second time)
splitpath = '../Splits'
os.makedirs(splitpath, exist_ok=True)
split_path = os.path.join(splitpath, 'Train_Test_Split.pkl')
if not os.path.exists(split_path):
    #'f' avoids shadowing the scipy alias 'sp' imported above
    with open(split_path, 'wb') as f:
        pickle.dump([X_train, X_test, y_train, y_test], f)

Let's check if we got our train/test proportions right.

In [10]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(416, 16)
(416,)
(104, 16)
(104,)


The train set has 416 observations and the test set has 104 observations, just as we expected. Let's look at the proportions of positive vs negative cases in the train and test sets.

In [11]:
print(y_train.value_counts())
print((256/416) * 100)
print((160/416) * 100)

1    256
0    160
Name: class, dtype: int64
61.53846153846154
38.46153846153847


In [12]:
print(y_test.value_counts())
print((64/104) * 100)
print((40/104) * 100)

1    64
0    40
Name: class, dtype: int64
61.53846153846154
38.46153846153847


In [13]:
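#Scaling: fit the scaler on the train set only, then apply the same transformation to the test set
scaler = StandardScaler()
#index=... keeps the original row labels so the scaled frames stay aligned with y_train/y_test
X_train_norm = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_norm = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)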
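
Because only 'Age' is non-binary, one alternative would be to scale just that column and pass the binary indicators through untouched. The sketch below is not what this notebook does; it uses scikit-learn's `ColumnTransformer` on a small made-up frame (the names `toy_train`, `toy_test` and `age_scaler`, and all the values, are assumptions for illustration).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with the same kinds of columns as the real dataset
toy_train = pd.DataFrame({'Age': [40, 58, 41, 45, 60, 35],
                          'Polyuria': [0, 0, 1, 0, 1, 1],
                          'Obesity': [1, 0, 0, 0, 1, 0]})
toy_test = pd.DataFrame({'Age': [50, 30], 'Polyuria': [1, 0], 'Obesity': [0, 1]})

# Scale only 'Age'; remainder='passthrough' keeps the binary columns as they are
age_scaler = ColumnTransformer([('age', StandardScaler(), ['Age'])],
                               remainder='passthrough')

# Fit on the training rows only, then reuse the fitted statistics on the test rows
toy_train_scaled = age_scaler.fit_transform(toy_train)
toy_test_scaled = age_scaler.transform(toy_test)

print(toy_train_scaled[:, 0])   # scaled Age (first output column)
print(toy_train_scaled[:, 1:])  # binary columns, unchanged
```

The transformed columns come first in the output array, followed by the passed-through ones, so the column order differs from the input frame.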

We can see that in both train and test splits we have 61.5% of positive cases and 38.5% of negative cases, just as it was in our original dataset before we performed the split.
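
The percentages above were computed from hard-coded counts. A minimal sketch of the same check without hard-coding, using `value_counts(normalize=True)`, is shown here on made-up labels that only mimic the 320/200 class balance (`labels`, `features` and the `row_id` column are hypothetical stand-ins, not the real data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same class balance as the dataset (320 positive, 200 negative)
labels = pd.Series([1] * 320 + [0] * 200, name='class')
features = pd.DataFrame({'row_id': range(len(labels))})  # placeholder feature

_, _, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

# normalize=True returns class shares instead of counts
train_share = lab_train.value_counts(normalize=True)
test_share = lab_test.value_counts(normalize=True)
print(train_share)
print(test_share)
```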

Now let's define our baseline model. We will use logistic regression for our classification problem. First, we will not tune its hyperparameters and see what the results will be. Then, if needed, we will perform regularization parameter tuning and compare results of those two models.

In [14]:
clf_noparams = LogisticRegression()
clf_noparams.fit(X_train_norm, y_train)
clf_noparams_train_ypred = clf_noparams.predict(X_train_norm)
clf_noparams_test_ypred = clf_noparams.predict(X_test_norm)
print('TRAIN SPLIT ACCURACY: ', accuracy_score(y_train, clf_noparams_train_ypred))
print('TEST SPLIT ACCURACY: ', accuracy_score(y_test, clf_noparams_test_ypred))

TRAIN SPLIT ACCURACY:  0.9423076923076923
TEST SPLIT ACCURACY:  0.9423076923076923


In [15]:
print("=== TRAIN SPLIT Classification Report ===")
print(classification_report(y_train, clf_noparams_train_ypred))
print("=== TEST SPLIT Classification Report ===")
print(classification_report(y_test, clf_noparams_test_ypred))

=== TRAIN SPLIT Classification Report ===
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       160
           1       0.95      0.95      0.95       256

    accuracy                           0.94       416
   macro avg       0.94      0.94      0.94       416
weighted avg       0.94      0.94      0.94       416

=== TEST SPLIT Classification Report ===
              precision    recall  f1-score   support

           0       0.89      0.97      0.93        40
           1       0.98      0.92      0.95        64

    accuracy                           0.94       104
   macro avg       0.93      0.95      0.94       104
weighted avg       0.95      0.94      0.94       104



Comparing the train and test splits, we notice that the model showed the same accuracy on both. We also do not see any significant gaps between train and test precision, recall and f1-score. Therefore we can assume that our Logistic Regression model shows good results and does not over-/underfit, even without hyperparameter tuning. However, we will still explore how the model performs if we tune its regularization parameter.

Let's create a list with C parameter values and use GridSearchCV in order to find the best C.

In [16]:
clf_params = LogisticRegression()
C_params = [0.001, 0.01, 0.1, 1, 10, 100]

Let's find what regularization parameter is the best for our model.

In [17]:
print(clf_params.get_params().keys())
#the list of candidate values can be passed to the grid directly
grid_params = {'C': C_params}

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])


In [18]:
clf_grid = GridSearchCV(clf_params, param_grid = grid_params, cv=5, n_jobs=-1)
clf_grid.fit(X_train_norm, y_train)
clf_grid.best_params_

{'C': 1}

After performing GridSearchCV we can conclude that the best regularization parameter ('C') for our classifier among the values we tried is C=1, which is also the default value for this parameter.
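
`best_params_` reports only the winning value. To see how close the other candidates were, the full cross-validation table can be inspected through `cv_results_`. The sketch below runs the same kind of search on synthetic data from `make_classification` (an assumption; it is not the diabetes dataset, so the scores and the winning C will differ).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]},
                         cv=5)
demo_grid.fit(X_demo, y_demo)

# One row per candidate C: mean and spread of the 5 validation-fold accuracies
cv_table = pd.DataFrame(demo_grid.cv_results_)[
    ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(cv_table)
print(demo_grid.best_params_, demo_grid.best_score_)
```

If several values of C have nearly the same mean score, the choice among them matters little for this model.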

In [19]:
clf_params = LogisticRegression(C=1.0, random_state=42)
clf_params.fit(X_train_norm, y_train)
clfparams_ypred_train = clf_params.predict(X_train_norm)
print(accuracy_score(clfparams_ypred_train, y_train))

0.9423076923076923


The accuracy score for our model performance on the training set is 0.94.

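Now let's look at the predicted probabilities for each class. In each row the first column is the probability of class 0 ('Negative') and the second is the probability of class 1 ('Positive').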

In [20]:
print(clf_params.predict_proba(X_train_norm)[:5])

[[2.04590655e-02 9.79540935e-01]
 [9.73672053e-03 9.90263279e-01]
 [1.34293557e-03 9.98657064e-01]
 [4.94047431e-04 9.99505953e-01]
 [9.26693923e-01 7.33060769e-02]]
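
To make the link between these probabilities and the predicted labels explicit, here is a minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo` and `demo_clf` are illustrative names). For a binary LogisticRegression, `predict` agrees with a 0.5 cut-off on the positive-class probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = LogisticRegression().fit(X_demo, y_demo)

proba = demo_clf.predict_proba(X_demo)

# Columns follow demo_clf.classes_, so column 1 holds P(class == 1)
print(demo_clf.classes_)

# Each row sums to 1
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# predict() matches a 0.5 cut-off on the positive-class probability
manual_pred = (proba[:, 1] > 0.5).astype(int)
matches_predict = (manual_pred == demo_clf.predict(X_demo)).all()
print(rows_sum_to_one, matches_predict)
```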


We will use the classification report on our train and test sets in order to determine whether our model overfits or underfits.

In [21]:
print("=== Classification Report ===")
print(classification_report(y_train, clfparams_ypred_train))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       160
           1       0.95      0.95      0.95       256

    accuracy                           0.94       416
   macro avg       0.94      0.94      0.94       416
weighted avg       0.94      0.94      0.94       416



From this classification report we can infer that in our train data: 1) Among the people the model classified as 1 ('Positive'), 95% actually were positive (precision). 2) Among the people the model classified as 0 ('Negative'), 93% actually were negative (precision). 3) Among the 256 people who actually were 1 ('Positive'), 95% were classified correctly (recall). 4) Among the 160 people who actually were 0 ('Negative'), 93% were classified correctly (recall).
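
The difference between precision and recall is easiest to see from the confusion matrix. The sketch below uses ten made-up labels (`y_true` and `y_hat` are hypothetical, chosen only to keep the arithmetic small) and checks the hand-computed values against scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, chosen only to make the arithmetic easy to follow
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_hat = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]

# For binary labels, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_pos = tp / (tp + fp)  # of those predicted positive, share truly positive
recall_pos = tp / (tp + fn)     # of those truly positive, share predicted positive

print(tn, fp, fn, tp)           # 2 2 1 5
print(precision_pos, precision_score(y_true, y_hat))
print(recall_pos, recall_score(y_true, y_hat))
```

Here precision is 5/7 and recall is 5/6: the denominators differ because precision counts predicted positives while recall counts actual positives (the 'support' column in the report).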

Overall, the performance of our classifier on training set seems to be good. Let us now check it on the test set.

In [22]:
clfparams_ypred_test = clf_params.predict(X_test_norm)
print(accuracy_score(clfparams_ypred_test, y_test))

0.9423076923076923


The accuracy score for our model performance on the test set is 0.94, which is the same as on the training set.

In [23]:
print(clf_params.predict_proba(X_test_norm)[:5])

[[0.71951361 0.28048639]
 [0.02096393 0.97903607]
 [0.00290846 0.99709154]
 [0.68692988 0.31307012]
 [0.95044974 0.04955026]]


In [24]:
print("=== Classification Report ===")
print(classification_report(y_test, clfparams_ypred_test))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.89      0.97      0.93        40
           1       0.98      0.92      0.95        64

    accuracy                           0.94       104
   macro avg       0.93      0.95      0.94       104
weighted avg       0.95      0.94      0.94       104



From this classification report we can infer that in our test data: 1) Among the people the model classified as 1 ('Positive'), 98% actually were positive (against 95% on the train data). 2) Among the people the model classified as 0 ('Negative'), 89% actually were negative (against 93% on the train data). 3) Among the 64 people who actually were 1 ('Positive'), 92% were classified correctly (against 95% on the train data). 4) Among the 40 people who actually were 0 ('Negative'), 97% were classified correctly (against 93% on the train data).

Overall, there are no significant gaps between model performance on train and test sets. From this we can conclude that our Logistic Regression model does not overfit or underfit and is able to generalize on new data.
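
The comparison above rests on a single split with 104 test rows. As a complementary check, k-fold cross-validation gives a spread of scores rather than one number. The sketch below is an addition, not part of the original analysis: it runs on synthetic data (an assumption) and wraps the scaler and the classifier in a pipeline so that the scaler is refit inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the same shape as our train set (an assumption)
X_demo, y_demo = make_classification(n_samples=416, n_features=16, random_state=0)

# The pipeline fits the scaler on each training fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipe, X_demo, y_demo, cv=folds, scoring='accuracy')
print(scores)
print('mean:', scores.mean(), 'std:', scores.std())
```

A small standard deviation across folds would support the claim that the model generalizes; a large one would suggest the single-split result depends on which rows landed in the test set.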

Now we will build another model - random forest - in order to compare its performance with that of the logistic regression model we used before.

In [25]:
#initializing the model
rfc = RandomForestClassifier(oob_score=True)

First we will fit the model without tuning its parameters and see how it's going to perform.

In [26]:
rfc.fit(X_train_norm, y_train)
rfcy_pred_np = rfc.predict(X_test_norm)
print('ACCURACY SCORE: ', accuracy_score(rfcy_pred_np, y_test))
print('Out-of-bag SCORE: ', rfc.oob_score_)

ACCURACY SCORE:  0.9903846153846154
Out-of-bag SCORE:  0.9735576923076923
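
The out-of-bag (OOB) score printed above is an accuracy estimate computed on the training data: each tree is grown on a bootstrap sample, and every training row is scored only by the trees that did not see it. A minimal sketch on synthetic data (an assumption; `demo_rfc` and the other names are illustrative, and `random_state` is fixed here only to make the sketch repeatable):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=16, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not contain that row
demo_rfc = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
demo_rfc.fit(X_tr, y_tr)

oob_acc = demo_rfc.oob_score_
test_acc = demo_rfc.score(X_te, y_te)
print('OOB accuracy: ', oob_acc)
print('Test accuracy:', test_acc)
```

Because the OOB estimate never touches the test set, it can be compared with the test accuracy as a second, independent view of generalization.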


In [27]:
print("=== Classification Report ===")
print(classification_report(y_test, rfcy_pred_np))

=== Classification Report ===
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        40
           1       0.98      1.00      0.99        64

    accuracy                           0.99       104
   macro avg       0.99      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104



We have an accuracy score of 0.99, which is pretty good for a model with no parameters tuned. Now we will use RandomizedSearchCV to find the best parameters for this model. We are particularly interested in 'n_estimators', 'max_features' and 'max_depth'.

In [28]:
#list of parameters
rfc.get_params().keys()

dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

In [29]:
#number of trees
n_estimators = [int(i) for i in np.linspace(200, 2000, 10)]

#number of features for each split
#note: for classifiers 'auto' means the same as 'sqrt'; it was removed in scikit-learn 1.3
max_features = ['auto', 'sqrt']

#maximal depth
max_depth = [int(i) for i in np.linspace(100, 500, 11)]

#random grid
random_grid = {'n_estimators':n_estimators, 'max_features':max_features, 'max_depth':max_depth}

In [30]:
#randomized search
rfc_random = RandomizedSearchCV(estimator=rfc, param_distributions=random_grid, n_iter=100, cv=5, random_state=42, n_jobs=-1)

#fitting the model
rfc_random.fit(X_train_norm, y_train)

print(rfc_random.best_params_)

{'n_estimators': 600, 'max_features': 'sqrt', 'max_depth': 340}


RandomizedSearchCV showed that the best parameters for our RandomForestClassifier are 'n_estimators':600, 'max_features':'sqrt' and 'max_depth':340. Now let's plug those parameters into the model and see if it improves its performance.
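
Note that the grid above has 10 x 2 x 11 = 220 combinations and `n_iter=100` evaluates only 100 of them, so the 'best' parameters are the best among the sampled candidates. The sketch below shows the same mechanism at a smaller scale on synthetic data (an assumption; `demo_space` and `demo_search` are illustrative names and the values are deliberately small so it runs quickly).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 3 x 2 x 3 = 18 possible combinations
demo_space = {'n_estimators': [25, 50, 100],
              'max_features': ['sqrt', 'log2'],
              'max_depth': [3, 5, None]}

# n_iter=6 samples 6 of the 18 combinations; random_state fixes which ones
demo_search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                                 param_distributions=demo_space,
                                 n_iter=6, cv=3, random_state=42)
demo_search.fit(X_demo, y_demo)

n_tried = len(demo_search.cv_results_['params'])
print(demo_search.best_params_)
print(n_tried)
```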

In [31]:
rfc_params = RandomForestClassifier(n_estimators=600, max_features='sqrt', max_depth=340, oob_score=True)
rfc_params.fit(X_train_norm, y_train)
rfcy_pred_params = rfc_params.predict(X_test_norm)
print('ACCURACY SCORE: ', accuracy_score(rfcy_pred_params, y_test))
print('Out-of-bag SCORE: ', rfc_params.oob_score_)

ACCURACY SCORE:  0.9807692307692307
Out-of-bag SCORE:  0.9735576923076923


The accuracy score of the model with tuned parameters is 0.98 (against 0.99 without tuning). The OOB score is 0.97, the same as without tuning.
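
With only 104 test rows, each misclassified observation moves accuracy by about one percentage point. A quick worked check, using the two accuracy values printed above, converts them back into error counts:

```python
n_test = 104
acc_untuned = 0.9903846153846154
acc_tuned = 0.9807692307692307

# Number of misclassified test rows implied by each accuracy
errors_untuned = round(n_test * (1 - acc_untuned))
errors_tuned = round(n_test * (1 - acc_tuned))

print(errors_untuned, errors_tuned)  # 1 2
```

So the two forests differ by a single test observation.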

In [32]:
print("=== RANDOM FOREST TEST SET Classification Report ===")
print(classification_report(y_test, rfcy_pred_params))

=== RANDOM FOREST TEST SET Classification Report ===
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        40
           1       0.98      0.98      0.98        64

    accuracy                           0.98       104
   macro avg       0.98      0.98      0.98       104
weighted avg       0.98      0.98      0.98       104



In [33]:
print("=== LOGISTIC REGRESSION TEST SET Classification Report ===")
print(classification_report(y_test, clfparams_ypred_test))

=== LOGISTIC REGRESSION TEST SET Classification Report ===
              precision    recall  f1-score   support

           0       0.89      0.97      0.93        40
           1       0.98      0.92      0.95        64

    accuracy                           0.94       104
   macro avg       0.93      0.95      0.94       104
weighted avg       0.95      0.94      0.94       104



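We can conclude from the classification reports that tuning did not improve our RandomForestClassifier on the test set: accuracy is 0.98 with the tuned parameters against 0.99 with the defaults, a difference of a single test observation, and the OOB score is unchanged. Because the forests were fitted without a fixed random_state, a difference this small may change between runs. Both random forest models nevertheless show better results than our baseline LogisticRegression model. The tuned forest matches or exceeds it on every metric we used: precision, recall, f1-score and accuracy.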

# Conclusion

We have conducted preprocessing and training data development. In this notebook we built two models - LogisticRegression and RandomForestClassifier - and compared their performance. We stratified our train and test splits so they have the same proportion of 'positive' and 'negative' cases. After splitting the data we built a LogisticRegression model and evaluated its performance. The model showed no significant gaps between train and test sets, indicating there was no over-/underfitting. Then we built a RandomForestClassifier model and evaluated its performance on the test set. This model outperformed our baseline model, with precision, recall and f1-score at least as high for each class and higher accuracy, although tuning its parameters did not improve on the default settings.

In [34]:
#saving the dataset
datapath = 'D://Tutorials/SDST/My Projects/Capstone2/Preprocessing and Training'
if not os.path.exists(datapath):
    os.mkdir(datapath)
datapath_Prep_Train = os.path.join(datapath, 'Diabetes_Prep_Train.csv')
if not os.path.exists(datapath_Prep_Train):
    df.to_csv(datapath_Prep_Train, index=False)

modelpath = '../Models'
if not os.path.exists(modelpath):
    os.mkdir(modelpath)
#saving random forest model
diabetes_rfc_path = os.path.join(modelpath, 'Diabetes_RandFor.pkl')
if not os.path.exists(diabetes_rfc_path):
    with open(diabetes_rfc_path, 'wb') as f:
        pickle.dump(rfc_params, f)
#saving logistic regression model
diabetes_lr_path = os.path.join(modelpath, 'Diabetes_LogReg.pkl')
if not os.path.exists(diabetes_lr_path):
    with open(diabetes_lr_path, 'wb') as f:
        pickle.dump(clf_params, f)
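
A saved model is only useful if it can be restored and gives the same predictions. The sketch below shows a pickle round trip on a small synthetic model in a temporary directory (an assumption; `demo_model` and `demo_model.pkl` are illustrative names, not the files written above). Since the models in this notebook were trained on scaled features, the fitted `scaler` would also be needed to prepare new data before calling `predict`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in model (an assumption; not the models saved above)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_model.pkl')
    # Write the fitted model to disk ...
    with open(demo_path, 'wb') as f:
        pickle.dump(demo_model, f)
    # ... and read it back
    with open(demo_path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should be loaded with the same scikit-learn version that wrote them, and only from sources you trust.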