<img src="../images/cads-logo.png" style="height: 100px;" align=left> 
<img src="../images/sklearn-logo.png" style="height: 100px;" align=right>

# Supervised Machine Learning

# Table of Contents

- [Thinking about Model Validation](#Thinking-about-Model-Validation)
- [Cross Validation](#Cross-Validation)
- [Model validation the wrong way](#Model-validation-the-wrong-way)
    - [Question: Can you guess the result of the following cell?](#Question:-Can-you-guess-the-result-of-the-following-cell?)
- [Model validation the right way: Holdout sets](#Model-validation-the-right-way:-Holdout-sets)
- [Model validation via cross-validation](#Model-validation-via-cross-validation)
- [Grid Search](#Grid-Search)

# Thinking about Model Validation

In principle, model validation is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to some of the training data and comparing the prediction to the known value.

The following sections first show a naive approach to model validation and why it
fails, before exploring the use of holdout sets and cross-validation for more robust
model evaluation.

# Cross Validation

## Model validation the wrong way

Let's demonstrate the naive approach to validation using the Iris data, which we saw in the previous section. We will start by loading the data:

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

print('Shape of X:', X.shape)
print('Shape of y:', y.shape)

Shape of X: (150, 4)
Shape of y: (150,)


Next we choose a model and hyperparameters. Here we'll use a *k*-neighbors classifier with ``n_neighbors=1``.
This is a very simple and intuitive model that says "the label of an unknown point is the same as the label of its closest training point":

<img src='../images/KNN.png'>

In [4]:
from sklearn.neighbors import KNeighborsClassifier
model= KNeighborsClassifier(n_neighbors=1)

Then we train the model, and use it to predict labels for data we already know:

In [5]:
model.fit(X, y)
y_model = model.predict(X)

Finally, we compute the fraction of correctly labeled points:

### Question: Can you guess the result of the following cell?

In [6]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)

1.0

We see an accuracy score of 1.0, which indicates that 100% of points were correctly labeled by our model!
But is this truly measuring the expected accuracy? Have we really come upon a model that we expect to be correct 100% of the time?

As you may have gathered, the answer is no.
In fact, this approach contains a fundamental flaw: *it trains and evaluates the model on the same data*.
Furthermore, the nearest neighbor model is an *instance-based* estimator that simply stores the training data, and predicts labels by comparing new data to these stored points: except in contrived cases, it will get 100% accuracy *every time!*
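
To make the flaw concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not part of the Iris example): the labels are drawn at random and carry no information about the features, yet a 1-nearest-neighbor model still scores perfectly on the data it was trained on. The names `X_rand`, `y_rand`, `train_acc`, and `holdout_acc` are introduced here only for this sketch.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption): random features and labels unrelated to them
rng = np.random.default_rng(0)
X_rand = rng.normal(size=(200, 5))
y_rand = rng.integers(0, 3, size=200)  # three classes

knn = KNeighborsClassifier(n_neighbors=1)

# Train and evaluate on the same rows: each point is its own nearest neighbor
knn.fit(X_rand, y_rand)
train_acc = accuracy_score(y_rand, knn.predict(X_rand))

# Train on one half, evaluate on the other half
Xa, Xb, ya, yb = train_test_split(X_rand, y_rand, test_size=0.5, random_state=0)
knn.fit(Xa, ya)
holdout_acc = accuracy_score(yb, knn.predict(Xb))

print("Accuracy on training data:", train_acc)
print("Accuracy on held-out data:", holdout_acc)
```

The training accuracy is perfect even though there is nothing to learn, while the held-out accuracy should sit near chance level for three classes.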

## Model validation the right way: Holdout sets

So what can be done?
A better sense of a model's performance can be found using what's known as a *holdout set*: that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model performance.
This splitting can be done using the ``train_test_split`` utility in Scikit-Learn:

In [None]:
print(X.shape)
print(y.shape)

In [7]:
from sklearn.model_selection import train_test_split

# split the data with 50% in each set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.5)

# fit the model on the first set of data
model.fit(X_train, y_train)

# evaluate the model on the second set of data
y_model = model.predict(X_test)

accuracy_score(y_test, y_model)

0.9066666666666666

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

We see here a more reasonable result: the nearest-neighbor classifier is about 91% accurate on this hold-out set.
The hold-out set is similar to unknown data, because the model has not "seen" it before.
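
A single holdout score depends on which rows happen to land in the test half. The sketch below, which assumes five arbitrarily chosen seeds and reloads Iris under the illustrative names `X_iris` and `y_iris`, repeats the 50/50 split to show how much the estimate can move from one split to the next.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):  # seeds chosen arbitrarily for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_iris, y_iris, train_size=0.5, random_state=seed
    )
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    holdout_scores.append(accuracy_score(y_te, clf.predict(X_te)))

print(holdout_scores)
```

The spread between the smallest and largest value gives a rough feel for the uncertainty in any one holdout estimate.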

## Model validation via cross-validation

One disadvantage of using a holdout set for model validation is that part of our data is withheld from model training.
In the above case, half the dataset does not contribute to the training of the model!
This is not optimal, and can cause problems – especially if the initial set of training data is small.

One way to address this is to use *cross-validation*; that is, to do a sequence of fits where each subset of the data is used both as a training set and as a validation set. Visually, it might look something like this:
<img src= "../images/2-fold-CV.png" style="height: 500px;">


Here we do two validation trials, alternately using each half of the data as a holdout set.
Using the split data from before, we could implement it like this:

**Question:**
Write the code that computes the two accuracy scores depicted in the image above.

In [8]:
# solution
y1_model = model.fit(X_train, y_train).predict(X_test)
y2_model = model.fit(X_test, y_test).predict(X_train)
accuracy_score(y_test, y1_model), accuracy_score(y_train, y2_model)

(0.9066666666666666, 0.96)

What comes out are two accuracy scores, which we could combine (by, say, taking the mean) to get a better measure of the global model performance.
This particular form of cross-validation is a *two-fold cross-validation*—that is, one in which we have split the data into two sets and used each in turn as a validation set.

We could expand on this idea to use even more trials, and more folds in the data—for example, here is a visual depiction of five-fold cross-validation:

<img src='../images/CV.png'/>

Here we split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data.
This would be rather tedious to do by hand, and so we can use Scikit-Learn's ``cross_val_score`` convenience routine to do it succinctly:

In [13]:
from sklearn.model_selection import cross_val_score

model= KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(model, X, y, cv=2)  
scores

array([0.96, 0.92])

In [14]:
print("Accuracy: {}".format(scores.mean()))

Accuracy: 0.94


By default, the score computed at each CV iteration comes from the estimator's `score` method. This can be changed with the `scoring` parameter; see the documentation for [all possible values of the scoring parameter](https://scikit-learn.org/stable/modules/model_evaluation.html).
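
Besides the predefined string names, `scoring` also accepts a scorer object built with `make_scorer`. The following is a small sketch under an assumed choice of metric (macro-averaged F2, which weights recall above precision); the names `f2_macro` and `scores_f2` are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Wrap a metric function and its keyword arguments into a scorer object
f2_macro = make_scorer(fbeta_score, beta=2, average="macro")

scores_f2 = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X_iris, y_iris, cv=2, scoring=f2_macro
)
print(scores_f2, scores_f2.mean())
```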


In [19]:
scores_f1 = cross_val_score(model, X, y, cv=2, scoring='f1_macro')   
scores_f1, np.mean(scores_f1)

(array([0.95998399, 0.91987179]), 0.9399278942346169)

Repeating the validation across different subsets of the data gives us an even better idea of the performance of the algorithm.
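
To see what `cross_val_score` does on our behalf, the loop can be written out by hand. This sketch relies on the documented behavior that, for a classifier and an integer `cv`, Scikit-Learn splits the data with `StratifiedKFold` (without shuffling); the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_iris, y_iris):
    fold_model = clone(knn)  # fresh, unfitted copy for every fold
    fold_model.fit(X_iris[train_idx], y_iris[train_idx])
    # score() returns accuracy for classifiers
    manual_scores.append(fold_model.score(X_iris[test_idx], y_iris[test_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(knn, X_iris, y_iris, cv=5)
print(manual_scores)
print(auto_scores)
```

Both arrays should agree fold by fold, since the convenience routine performs the same split, fit, and score steps.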

Scikit-Learn implements a number of cross-validation schemes that are useful in particular situations; these are implemented via iterators in the ``model_selection`` module.
For example, we might wish to go to the extreme case in which our number of folds is equal to the number of data points: that is, we train on all points but one in each trial.
This type of cross-validation is known as *leave-one-out* cross validation, and can be used as follows:

In [16]:
from sklearn.model_selection import LeaveOneOut

model= KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
scores

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Because we have 150 samples, the leave-one-out cross-validation yields scores for 150 trials, and each score indicates either a successful (1.0) or an unsuccessful (0.0) prediction.
Taking the mean of these gives an estimate of the accuracy (the error rate is one minus this value):

In [17]:
scores.mean()

0.96

Other cross-validation schemes can be used similarly.
For a description of what is available in Scikit-Learn, use IPython to explore the ``sklearn.model_selection`` submodule, or take a look at Scikit-Learn's online [cross-validation documentation](http://scikit-learn.org/stable/modules/cross_validation.html).
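
As a brief illustration of passing other splitters through the `cv` argument, the sketch below uses a shuffled `StratifiedKFold` and also checks that leave-one-out is the same as `KFold` with one fold per sample. The seed and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Shuffled, stratified 5-fold: class proportions are preserved in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf_scores = cross_val_score(knn, X_iris, y_iris, cv=skf)

# Leave-one-out is K-fold with as many folds as samples
loo_scores = cross_val_score(knn, X_iris, y_iris, cv=LeaveOneOut())
kfold_n_scores = cross_val_score(knn, X_iris, y_iris, cv=KFold(n_splits=len(X_iris)))

print(skf_scores.mean(), loo_scores.mean(), kfold_n_scores.mean())
```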

**Exercise:**
Try to classify Iris data using KNN for n_neighbors=4. Use 5-fold cross validation and use Accuracy, Precision, Recall, F1-score as evaluation metrics.

In [18]:
# MC

knn_5= KNeighborsClassifier(n_neighbors=4)

A_scores=cross_val_score(knn_5, X, y, cv=5)
A_mean=np.mean(A_scores)
print('Mean of Accuracy: {}'.format(A_mean))

P_scores=cross_val_score(knn_5, X, y, cv=5, scoring='precision_macro')
P_mean=np.mean(P_scores)
print('Mean of Precision: {}'.format(P_mean))

R_scores=cross_val_score(knn_5, X, y, cv=5, scoring='recall_macro')
R_mean=np.mean(R_scores)
print('Mean of Recall: {}'.format(R_mean))

F1_scores=cross_val_score(knn_5, X, y, cv=5, scoring='f1_macro')
F1_mean=np.mean(F1_scores)
print('Mean of F1_scores: {}'.format(F1_mean))

Mean of Accuracy: 0.9733333333333334
Mean of Precision: 0.9757575757575758
Mean of Recall: 0.9733333333333334
Mean of F1_scores: 0.9732664995822891
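
The solution above runs the cross-validation four times, once per metric. A possible alternative, sketched here with illustrative variable names, is `cross_validate`, which accepts a list of scoring names and fits each fold only once.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

metrics = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
cv_results = cross_validate(
    KNeighborsClassifier(n_neighbors=4), X_iris, y_iris, cv=5, scoring=metrics
)

# Results are stored under "test_<metric name>", one value per fold
for name in metrics:
    print("Mean of {}: {:.4f}".format(name, cv_results["test_" + name].mean()))
```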


### Exercise
Try to classify 'indian_liver_patient.csv' data using KNN for n_neighbors=5 and Logistic Regression. Use 3-fold cross validation.

This data set contains 416 liver patient records and 167 non liver patient records collected from North East of Andhra Pradesh, India. The "Dataset" column is a class label used to divide groups into liver patient (liver disease) or not (no disease). 

In [22]:
liver = pd.read_csv("../Data/indian_liver_patient.csv")
liver.head(2)

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,0
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,0


In [23]:
liver.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  583 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


In [21]:
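# Class counts. Given the description above (416 liver patients, 167 non-patients),
# label 0 appears to mark liver patients and label 1 non-patients.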
liver.Dataset.value_counts()

0    416
1    167
Name: Dataset, dtype: int64

In [24]:
X = liver.drop('Dataset', axis=1)
y = liver.Dataset

In [26]:
X_dum=pd.get_dummies(X, drop_first=True)
X_dum

Unnamed: 0,Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Gender_Male
0,65,0.7,0.1,187,16,18,6.8,3.3,0.90,0
1,62,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,1.0,0.4,182,14,20,6.8,3.4,1.00,1
4,72,3.9,2.0,195,27,59,7.3,2.4,0.40,1
...,...,...,...,...,...,...,...,...,...,...
578,60,0.5,0.1,500,20,34,5.9,1.6,0.37,1
579,40,0.6,0.1,98,35,31,6.0,3.2,1.10,1
580,52,0.8,0.2,245,48,49,6.4,3.2,1.00,1
581,31,1.3,0.5,184,29,32,6.8,3.4,1.00,1


In [28]:
### KNN

from sklearn.preprocessing import MinMaxScaler
x_scaled = MinMaxScaler().fit_transform(X_dum)


knn_5= KNeighborsClassifier(n_neighbors=5)

A_scores=cross_val_score(knn_5, x_scaled, y, cv=3, scoring='precision_macro')
A_mean=np.mean(A_scores)
print('Mean of precision: {}'.format(A_mean))

Mean of precision: 0.5797898665311592


In [29]:
### Logistic Regression
from sklearn.linear_model import LogisticRegression

lr_5= LogisticRegression()

A_scores=cross_val_score(lr_5, x_scaled, y, cv=3, scoring='precision_macro')
A_mean=np.mean(A_scores)
print('Mean of precision: {}'.format(A_mean))

Mean of precision: 0.7758208345692875
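
Both cells above scale the full feature matrix once, before cross-validation, so the scaler's minimum and maximum are computed from rows that later serve as validation folds. A common way to avoid this, sketched here on Scikit-Learn's built-in breast cancer data (a stand-in, since the liver CSV may not be available), is to put the scaler and the classifier into one pipeline so that scaling is refit inside every fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# The scaler is fit on the training part of each fold only
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

pipe_scores = cross_val_score(scaled_knn, X_bc, y_bc, cv=3, scoring="precision_macro")
print("Mean of precision: {}".format(pipe_scores.mean()))
```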


# Grid Search
Grid search is a hyperparameter tuning procedure: it evaluates a model at every combination of values in a user-specified grid and keeps the combination with the best cross-validated score. Scikit-Learn automates this with the ``GridSearchCV`` meta-estimator in the ``model_selection`` module.

Here is an example of using grid search to find the optimal KNN model. We start by loading the breast cancer data that ships with Scikit-Learn's ``datasets`` module:

In [31]:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

X = cancer.data
y= cancer.target
X.shape

(569, 30)

In [34]:
np.sum(y)

357

In [37]:
X_train, X_test, y_train, y_test=train_test_split(X,y,random_state=42, train_size=.80,stratify=y)

In [39]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': np.arange(1, 20),
              'p': [1,2],
              'weights': ['uniform','distance']}

grid = GridSearchCV(KNeighborsClassifier(), 
                    param_grid, scoring='precision_macro',
                    cv=3)

In [None]:
# What is the total number of models that grid search builds
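
One way to check the answer is to enumerate the grid with `ParameterGrid`. This is a small sketch that repeats the grid defined above; `n_candidates`, `n_folds`, and `n_cv_fits` are names introduced only for the example.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {'n_neighbors': np.arange(1, 20),
              'p': [1, 2],
              'weights': ['uniform', 'distance']}

n_candidates = len(ParameterGrid(param_grid))  # 19 * 2 * 2 combinations
n_folds = 3                                    # cv=3 in the grid search above
n_cv_fits = n_candidates * n_folds

print(n_candidates, n_cv_fits)  # → 76 228
```

With the default `refit=True`, one more fit follows on the whole training set using the best combination.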


Notice that like a normal estimator, this has not yet been applied to any data.
Calling the ``fit()`` method will fit the model at each grid point, keeping track of the scores along the way:

In [40]:
grid.fit(X_train, y_train)

Now that this is fit, we can ask for the best parameters as follows:

In [41]:
grid.best_params_

{'n_neighbors': 7, 'p': 1, 'weights': 'uniform'}

In [42]:
grid

GridSearchCV(cv=3, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19]),
                         'p': [1, 2], 'weights': ['uniform', 'distance']},
             scoring='precision_macro')
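
Beyond `best_params_`, a fitted grid search keeps the score of every candidate in `cv_results_`. The sketch below assumes a deliberately reduced grid (for speed) and illustrative names such as `small_grid`; it loads the results into a DataFrame and sorts them by rank.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, train_size=0.8, stratify=y_bc, random_state=42
)

small_grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 5, 9], "weights": ["uniform", "distance"]},  # reduced grid
    scoring="precision_macro",
    cv=3,
)
small_grid.fit(X_tr, y_tr)

# One row per parameter combination, with mean and std of the fold scores
results = pd.DataFrame(small_grid.cv_results_)
cols = ["param_n_neighbors", "param_weights",
        "mean_test_score", "std_test_score", "rank_test_score"]
top = results.sort_values("rank_test_score")[cols].head()
print(top)
```

Looking at the standard deviation next to the mean helps judge whether the top candidates differ meaningfully or only by noise.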

Finally, if we wish, we can take the best model and evaluate it on the held-out test data:

In [51]:
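# best_estimator_ has already been refit on all of X_train (refit=True by default),
# so the explicit fit below is redundant but harmless
model = grid.best_estimator_
model.fit(X_train, y_train)
y_pred = model.predict(X_test)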

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, classification_report
print('classification_report:\n',classification_report(y_test, y_pred))
confusion_matrix(y_test, y_pred)

classification_report:
               precision    recall  f1-score   support

           0       0.95      0.90      0.93        42
           1       0.95      0.97      0.96        72

    accuracy                           0.95       114
   macro avg       0.95      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



array([[38,  4],
       [ 2, 70]])

The grid search provides many more options, including the ability to specify a custom scoring function, to parallelize the computations, to do randomized searches, and more.
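
As one example of these options, here is a hedged sketch of a randomized search. It samples a fixed number of candidates from distributions instead of trying every combination; the choice of `n_iter=10`, the seed, and the variable names are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, train_size=0.8, stratify=y_bc, random_state=42
)

param_distributions = {
    "n_neighbors": randint(1, 20),  # integers from 1 to 19
    "p": [1, 2],                    # lists are sampled uniformly
    "weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=10,                 # number of sampled candidates
    scoring="precision_macro",
    cv=3,
    n_jobs=1,                  # -1 would use all available cores
    random_state=0,
)
search.fit(X_tr, y_tr)
print(search.best_params_)
```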

## Exercise:

Load the cancer dataset and choose the best classification algorithm with the best hyperparameters.

- Define X and y

- To simplify, remove missing values

- Split the data into train and test sets

- Use 5 fold cross validation and grid search on train data

- Choose appropriate validation metric

- Set grid parameters for each classification algorithm

- Build the best models for each classification algorithm according to the best estimator (best hyperparameters) given by the grid search

- Compare the performance of the algorithms with the best hyperparameters on the test data according to confusion matrix, recall, precision, F1, and AUC metrics

In [53]:
df=pd.read_csv('../data/breast_cancer_wisconsin.csv')
df

Unnamed: 0,Id,Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1.0,3,1,1,0
1,1002945,5,4,4,5,7,10.0,3,2,1,0
2,1015425,3,1,1,1,2,2.0,3,1,1,0
3,1016277,6,8,8,1,3,4.0,3,7,1,0
4,1017023,4,1,1,3,2,1.0,3,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3,1,1,1,3,2.0,1,1,1,0
695,841769,2,1,1,1,2,1.0,1,1,1,0
696,888820,5,10,10,3,7,3.0,8,10,2,1
697,897471,4,8,6,4,3,4.0,10,6,1,1


In [77]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

In [78]:
# MC
df.head()

Unnamed: 0,Id,Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1.0,3,1,1,0
1,1002945,5,4,4,5,7,10.0,3,2,1,0
2,1015425,3,1,1,1,2,2.0,3,1,1,0
3,1016277,6,8,8,1,3,4.0,3,7,1,0
4,1017023,4,1,1,3,2,1.0,3,1,1,0


In [79]:
# MC
print(df.isna().sum())
df = df.dropna()

Id                 0
Cl.thickness       0
Cell.size          0
Cell.shape         0
Marg.adhesion      0
Epith.c.size       0
Bare.nuclei        0
Bl.cromatin        0
Normal.nucleoli    0
Mitoses            0
Class              0
dtype: int64


In [80]:
# MC
X = df.drop(['Id','Class'], axis=1)
y = df.Class
X.head()

Unnamed: 0,Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses
0,5,1,1,1,2,1.0,3,1,1
1,5,4,4,5,7,10.0,3,2,1
2,3,1,1,1,2,2.0,3,1,1
3,6,8,8,1,3,4.0,3,7,1
4,4,1,1,3,2,1.0,3,1,1


In [81]:
#MC
df.Class.value_counts()

0    444
1    239
Name: Class, dtype: int64

In [82]:
# MC
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [83]:
# MC
X.describe()

Unnamed: 0,Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses
count,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0
mean,4.442167,3.150805,3.215227,2.830161,3.234261,3.544656,3.445095,2.869693,1.603221
std,2.820761,3.065145,2.988581,2.864562,2.223085,3.643857,2.449697,3.052666,1.732674
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0
50%,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0
75%,6.0,5.0,5.0,4.0,4.0,6.0,5.0,4.0,1.0
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0


In [84]:
# MC
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=.8,
                                                    random_state=7)

**Knn**

In [85]:
# MC
knn_param_grid = {'n_neighbors': np.arange(1, 30),
                  'p': [1,2],
                 'weights':['uniform','distance']}
                   

knn_grid = GridSearchCV(KNeighborsClassifier(), 
                        knn_param_grid, cv=5, scoring = 'roc_auc', return_train_score=True)
knn_grid.fit(X_train, y_train)
print('Knn Best parameters', knn_grid.best_params_)
knn_model = knn_grid.best_estimator_
knn_model.fit(X_train, y_train)
print('Knn best score = ',knn_grid.best_score_ )

knn_pred = knn_model.predict(X_test)
print('Knn best model confusion matrix on test data  \n',confusion_matrix(y_test, knn_pred)  )
print('*********************************************')
print('Knn best model Precision  on test data = {:.2f}'.format(precision_score(y_test, knn_pred)))
print('Knn best model Recall  on test data = {:.2f}'.format(recall_score(y_test, knn_pred)))
print('Knn best model F1 on test data = {:.2f}'.format(f1_score(y_test, knn_pred)))
print('Knn best model Accuracy  on test data = {:.2f}'.format(accuracy_score(y_test, knn_pred)))
print('*********************************************')

Knn Best parameters {'n_neighbors': 7, 'p': 2, 'weights': 'distance'}
Knn best score =  0.9940848871908916
Knn best model confusion matrix on test data  
 [[87  2]
 [ 0 48]]
*********************************************
Knn best model Precision  on test data = 0.96
Knn best model Recall  on test data = 1.00
Knn best model F1 on test data = 0.98
Knn best model Accuracy  on test data = 0.99
*********************************************


**Decision Tree**

In [87]:
# MC
dt_param_grid = {'max_depth': np.arange(1, 10)}

dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=1), 
                        dt_param_grid, cv=5, scoring = 'roc_auc', return_train_score=True)
dt_grid.fit(X_train, y_train)
print('DecisionTree Best parameters', dt_grid.best_params_)
dt_model = dt_grid.best_estimator_
print('DecisionTree best score = ',dt_grid.best_score_ )

DecisionTree Best parameters {'max_depth': 3}
DecisionTree best score =  0.957172454429682


**Logistic Regression**

In [88]:
# MC
logr_param_grid = {'C': [0.001, 0.004, 0.01, 0.1, 0.4, 1, 10, 50, 100],
                   'penalty' : ['l1', 'l2'],
                   'fit_intercept': [True, False]}

logr_grid = GridSearchCV(LogisticRegression(solver='liblinear'), 
                        logr_param_grid, cv=5, scoring = 'roc_auc', return_train_score=True)
logr_grid.fit(X_train, y_train)
print('LogisticRegression Best parameters', logr_grid.best_params_)
logr_model = logr_grid.best_estimator_
print('LogisticRegression best score = ',logr_grid.best_score_ )

LogisticRegression Best parameters {'C': 100, 'fit_intercept': True, 'penalty': 'l2'}
LogisticRegression best score =  0.9956530003231265


**Support Vector Machine (SVM)**

In [89]:
# MC
svc_param_grid = {'C': [0.001, 0.004, 0.01, 0.1, 0.4, 1, 10, 50, 100],
                  'kernel' : ['linear', 'rbf'],
                  'gamma': [0.001, 0.004, 0.01, 0.1, 0.4, 1, 10, 50, 100]}

svc_grid = GridSearchCV(SVC(), 
                        svc_param_grid, cv=5, scoring = 'roc_auc', return_train_score=True)
svc_grid.fit(X_train, y_train)
print('SVC Best parameters', svc_grid.best_params_)
svc_model = svc_grid.best_estimator_
print('SVC best score = ',svc_grid.best_score_ )

SVC Best parameters {'C': 0.001, 'degree': 1, 'gamma': 0.001, 'kernel': 'linear'}
SVC best score =  0.996082568284199


**Pipeline: Polynomial Logistic Regression**

In [90]:
#MC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), 
    MinMaxScaler(), 
    LogisticRegression())

poly_pipeline

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
                ('minmaxscaler', MinMaxScaler()),
                ('logisticregression', LogisticRegression())])
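
When a pipeline is tuned with grid search, each parameter is addressed as `<step name>__<parameter name>`, where `make_pipeline` names the steps after their lowercased class names. A short sketch for discovering the valid keys (the names `pipe`, `step_names`, and `tunable` are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    MinMaxScaler(),
    LogisticRegression())

# Step names generated by make_pipeline
step_names = list(pipe.named_steps)

# Every key containing "__" can be used in a GridSearchCV parameter grid
tunable = sorted(name for name in pipe.get_params() if "__" in name)

print(step_names)
print(tunable)
```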

In [91]:
#MC
pip_param_grid = {'polynomialfeatures__degree':[1,2,3],
                  'logisticregression__C': [0.001, 0.004, 0.01, 0.1, 0.4, 1, 10, 50, 100],
                   'logisticregression__penalty' : ['l1', 'l2'],
                   'logisticregression__solver': ['liblinear'],
                   'logisticregression__fit_intercept': [True, False]}
pip_grid = GridSearchCV(poly_pipeline, 
                        pip_param_grid, cv=5, scoring = 'roc_auc', return_train_score=True)
pip_grid.fit(X_train, y_train)
print('LogisticRegression Polynomial Best parameters', pip_grid.best_params_)
pip_model = pip_grid.best_estimator_
print('LogisticRegression Polynomial best score = ',pip_grid.best_score_ )

LogisticRegression Polynomial Best parameters {'logisticregression__C': 1, 'logisticregression__fit_intercept': True, 'logisticregression__penalty': 'l1', 'logisticregression__solver': 'liblinear', 'polynomialfeatures__degree': 3}
LogisticRegression Polynomial best score =  0.9963087567238791


**Compare the best models on the test data**

In [92]:
# MC
from sklearn.metrics import confusion_matrix

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [93]:
# MC
knn_pred = knn_model.predict(X_test)
print('Knn best model confusion matrix on test data  \n',confusion_matrix(y_test, knn_pred)  )
print('*********************************************')
print('Knn best model Precision  on test data = {:.2f}'.format(precision_score(y_test, knn_pred)))
print('Knn best model Recall  on test data = {:.2f}'.format(recall_score(y_test, knn_pred)))
print('Knn best model F1 on test data = {:.2f}'.format(f1_score(y_test, knn_pred)))
print('Knn best model Accuracy  on test data = {:.2f}'.format(accuracy_score(y_test, knn_pred)))
print('*********************************************')

Knn best model confusion matrix on test data  
 [[87  2]
 [ 0 48]]
*********************************************
Knn best model Precision  on test data = 0.96
Knn best model Recall  on test data = 1.00
Knn best model F1 on test data = 0.98
Knn best model Accuracy  on test data = 0.99
*********************************************


In [94]:
# MC
dt_pred = dt_model.predict(X_test)
print('Decision Tree best model confusion matrix on test data  \n',confusion_matrix(y_test, dt_pred)  )
print('*********************************************')
print('Decision Tree best model Precision  on test data = {:.2f}'.format(precision_score(y_test, dt_pred)))
print('Decision Tree best model Recall on test data = {:.2f}'.format(recall_score(y_test, dt_pred)))
print('Decision Tree best model F1 on test data  = {:.2f}'.format(f1_score(y_test, dt_pred)))
print('Decision Tree best model Accuracy  on test data = {:.2f}'.format(accuracy_score(y_test, dt_pred)))
print('*********************************************')

Decision Tree best model confusion matrix on test data  
 [[85  4]
 [ 2 46]]
*********************************************
Decision Tree best model Precision  on test data = 0.92
Decision Tree best model Recall on test data = 0.96
Decision Tree best model F1 on test data  = 0.94
Decision Tree best model Accuracy  on test data = 0.96
*********************************************


In [96]:
# MC
logr_pred = logr_model.predict(X_test)
print('Logistic Regression best model confusion matrix on test data  \n',confusion_matrix(y_test, logr_pred)  )
print('*********************************************')
print('Logistic Regression best model Precision  on test data = {:.2f}'.format(precision_score(y_test, logr_pred)))
print('Logistic Regression best model Recall on test data = {:.2f}'.format(recall_score(y_test, logr_pred)))
print('Logistic Regression best model F1 on test data  = {:.2f}'.format(f1_score(y_test, logr_pred)))
print('Logistic Regression best model Accuracy  on test data = {:.2f}'.format(accuracy_score(y_test, logr_pred)))
print('*********************************************')

Logistic Regression best model confusion matrix on test data  
 [[87  2]
 [ 0 48]]
*********************************************
Logistic Regression best model Precision  on test data = 0.96
Logistic Regression best model Recall on test data = 1.00
Logistic Regression best model F1 on test data  = 0.98
Logistic Regression best model Accuracy  on test data = 0.99
*********************************************


In [97]:
# MC
svc_pred = svc_model.predict(X_test)
print('SVC best model confusion matrix on test data  \n',confusion_matrix(y_test, svc_pred)  )
print('*********************************************')
print('SVC best model Precision  on test data = {:.2f}'.format(precision_score(y_test, svc_pred)))
print('SVC best model Recall on test data = {:.2f}'.format(recall_score(y_test, svc_pred)))
print('SVC best model F1 on test data  = {:.2f}'.format(f1_score(y_test, svc_pred)))
print('SVC best model Accuracy  on test data = {:.2f}'.format(accuracy_score(y_test, svc_pred)))
print('*********************************************')

SVC best model confusion matrix on test data  
 [[87  2]
 [ 1 47]]
*********************************************
SVC best model Precision  on test data = 0.96
SVC best model Recall on test data = 0.98
SVC best model F1 on test data  = 0.97
SVC best model Accuracy  on test data = 0.98
*********************************************


In [98]:
# MC
pip_pred = pip_model.predict(X_test)
print('pip best model confusion matrix on test data  \n',confusion_matrix(y_test, pip_pred)  )
print('*********************************************')
print('pip best model Precision  on test data = {:.2f}'.format(precision_score(y_test, pip_pred)))
print('pip best model Recall on test data = {:.2f}'.format(recall_score(y_test, pip_pred)))
print('pip best model F1 on test data  = {:.2f}'.format(f1_score(y_test, pip_pred)))
print('pip best model Accuracy  on test data = {:.2f}'.format(accuracy_score(y_test, pip_pred)))
print('*********************************************')

pip best model confusion matrix on test data  
 [[87  2]
 [ 0 48]]
*********************************************
pip best model Precision  on test data = 0.96
pip best model Recall on test data = 1.00
pip best model F1 on test data  = 0.98
pip best model Accuracy  on test data = 0.99
*********************************************
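
The exercise also asks for AUC on the test data, which the cells above do not compute, and the five evaluation cells repeat the same code. A minimal sketch of one possible helper follows; `evaluate` is a hypothetical function, and the models and built-in dataset are stand-ins chosen so the example runs on its own. AUC needs continuous scores, so the helper uses class-1 probabilities when the model offers them and falls back to `decision_function` otherwise (as for `SVC` without `probability=True`).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_eval, y_eval):
    """Hypothetical helper: test metrics for a fitted binary classifier."""
    y_hat = model.predict(X_eval)
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_eval)[:, 1]
    else:
        y_score = model.decision_function(X_eval)
    return {
        "confusion_matrix": confusion_matrix(y_eval, y_hat),
        "precision": precision_score(y_eval, y_hat),
        "recall": recall_score(y_eval, y_hat),
        "f1": f1_score(y_eval, y_hat),
        "accuracy": accuracy_score(y_eval, y_hat),
        "auc": roc_auc_score(y_eval, y_score),
    }


# Stand-in data and models (assumptions, not the tuned models from above)
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, train_size=0.8, stratify=y_bc, random_state=7
)
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=1),
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "SVC": SVC(),
}

reports = {}
for name, clf in models.items():
    reports[name] = evaluate(clf.fit(X_tr, y_tr), X_te, y_te)
    print(name, "AUC on test data = {:.2f}".format(reports[name]["auc"]))
```

In the notebook itself, the same helper could be called with `knn_model`, `dt_model`, `logr_model`, `svc_model`, and `pip_model` together with the existing `X_test` and `y_test`.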
