# Exercise 4

For this exercise, use the admission dataset: https://stats.idre.ucla.edu/stat/data/binary.csv. The dataset contains three predictor variables (gre, gpa and rank) and one binary response variable, admit.

a) List all tunable hyperparameters.

b) Select the best model by searching over a range of hyperparameters based on cross validation score using an Exhaustive Search.

# Solution

## GridSearch

The conventional way of performing **Hyperparameter Optimisation** has been a grid search (aka parameter sweep). It is an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a validation set.

Scikit-learn's GridSearchCV performs an exhaustive search over specified parameter values for an estimator. It implements “fit” and “score” methods, among others. The parameters of the estimator are optimised by cross-validated grid search over a parameter grid.
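The idea can be sketched without any library: enumerate every combination in the grid, score each one, and keep the best. The sketch below is illustrative only. `toy_score` is a hypothetical stand-in for a cross-validated score, and the grid values are made up for the example.

```python
from itertools import product


def toy_score(params):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    # Peaks at alpha = 1.0 and gives a small bonus to the 'b' option.
    bonus = 0.1 if params["kind"] == "b" else 0.0
    return -(params["alpha"] - 1.0) ** 2 + bonus


def grid_search(grid, score_fn):
    """Evaluate every combination in `grid` and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"alpha": [0.1, 1.0, 10.0], "kind": ["a", "b"]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the number of values per hyperparameter, which is why grids are usually kept small.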

In [8]:
# Import libraries
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv('https://stats.idre.ucla.edu/stat/data/binary.csv')
df

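| | admit | gre | gpa | rank |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.00 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |
| ... | ... | ... | ... | ... |
| 395 | 0 | 620 | 4.00 | 2 |
| 396 | 0 | 560 | 3.04 | 3 |
| 397 | 0 | 460 | 2.63 | 2 |
| 398 | 0 | 700 | 3.65 | 2 |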


In [3]:
features = df.drop('admit', axis=1) 
target = df['admit']

features

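| | gre | gpa | rank |
|---|---|---|---|
| 0 | 380 | 3.61 | 3 |
| 1 | 660 | 3.67 | 3 |
| 2 | 800 | 4.00 | 1 |
| 3 | 640 | 3.19 | 4 |
| 4 | 520 | 2.93 | 4 |
| ... | ... | ... | ... |
| 395 | 620 | 4.00 | 2 |
| 396 | 560 | 3.04 | 3 |
| 397 | 460 | 2.63 | 2 |
| 398 | 700 | 3.65 | 2 |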


In [4]:
# convert to np.array
X = features.values 
y = target.values

In [9]:
# Scale and fit the model 
pipe = Pipeline([("scaler", StandardScaler()), 
                 ("logistic", LogisticRegression(solver='liblinear'))]) 

pipe.fit(X, y)

## a) List of all tunable hyperparameters

In [10]:
# get model params 
pipe.get_params()

{'memory': None,
 'steps': [('scaler', StandardScaler()),
  ('logistic', LogisticRegression(solver='liblinear'))],
 'verbose': False,
 'scaler': StandardScaler(),
 'logistic': LogisticRegression(solver='liblinear'),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'logistic__C': 1.0,
 'logistic__class_weight': None,
 'logistic__dual': False,
 'logistic__fit_intercept': True,
 'logistic__intercept_scaling': 1,
 'logistic__l1_ratio': None,
 'logistic__max_iter': 100,
 'logistic__multi_class': 'auto',
 'logistic__n_jobs': None,
 'logistic__penalty': 'l2',
 'logistic__random_state': None,
 'logistic__solver': 'liblinear',
 'logistic__tol': 0.0001,
 'logistic__verbose': 0,
 'logistic__warm_start': False}
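
Each key in this dictionary follows scikit-learn's `<step>__<parameter>` naming convention for nested estimators, which is also the form a parameter grid for the pipeline must use. As a small illustration in plain Python, using a hand-copied subset of the keys above, the names can be grouped by pipeline step:

```python
# A hand-copied subset of the keys returned by pipe.get_params() above.
param_names = [
    "scaler__with_mean",
    "scaler__with_std",
    "logistic__C",
    "logistic__penalty",
    "logistic__solver",
]

# Group the parameter names by the pipeline step they belong to.
by_step = {}
for name in param_names:
    step, _, param = name.partition("__")
    by_step.setdefault(step, []).append(param)

for step, params in by_step.items():
    print(f"{step}: {params}")
```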

## b) Select the best model by searching over a range of hyperparameters based on cross validation score using an Exhaustive Search.

In [11]:
# penalty hyperparameter values
penalty = ['l1', 'l2']

# regularization hyperparameter (inverse of regularization strength)
C = np.linspace(0.01, 10, 10)

# combine into one dict, keyed by <step>__<parameter>
param_grid = dict(logistic__C=C, logistic__penalty=penalty)
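
To make the size of this grid concrete, the following plain-Python sketch rebuilds the same ten `C` values without NumPy and counts the combinations with the two penalties. The rounding to two decimals is only for display and is not part of the original cell.

```python
from itertools import product

# Ten evenly spaced values from 0.01 to 10, mirroring np.linspace(0.01, 10, 10).
start, stop, num = 0.01, 10.0, 10
step = (stop - start) / (num - 1)
c_values = [round(start + i * step, 2) for i in range(num)]
penalties = ["l1", "l2"]

# Every (C, penalty) pair is one candidate model.
candidates = list(product(c_values, penalties))
print(c_values)
print(len(candidates))
```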

In [12]:
%%time
# create a grid search with cv=5 
gridsearch = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=1) 

# fit grid search 
best_model = gridsearch.fit(X, y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
CPU times: total: 93.8 ms
Wall time: 2.42 s
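
The log line reflects 20 parameter combinations, each fitted on 5 train/validation splits. A minimal sketch of how 5-fold splitting partitions the row indices is given below. It assumes 400 rows and plain contiguous folds. For a classifier, `GridSearchCV` with an integer `cv` uses stratified folds, so the actual assignment of rows to folds differs from this sketch.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=400, n_splits=5))
for train, test in folds:
    print(len(train), len(test))
```

With 80 validation rows per fold, every per-fold accuracy is a multiple of 1/80 = 0.0125, which is why values such as 0.6875 (55/80) recur in the cross-validation results.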


In [13]:
# best model parameters 
best_model.best_params_

{'logistic__C': 2.23, 'logistic__penalty': 'l2'}

In [14]:
# best score 
best_model.best_score_

0.7075000000000001

In [15]:
# coefficients of the untuned baseline pipeline (default C=1.0), for comparison
pipe['logistic'].coef_

array([[ 0.26139396,  0.29067213, -0.51864053]])

In [16]:
best_model.best_estimator_.named_steps

{'scaler': StandardScaler(),
 'logistic': LogisticRegression(C=2.23, solver='liblinear')}

In [17]:
best_model.best_estimator_.named_steps['logistic'].coef_

array([[ 0.26317473,  0.29320696, -0.52387746]])

In [18]:
best_model.best_estimator_.named_steps['logistic'].intercept_

array([-0.85285979])

In [19]:
# parameters of the GridSearchCV object itself; the estimator__ entries are the untuned defaults
best_model.get_params()

{'cv': 5,
 'error_score': nan,
 'estimator__memory': None,
 'estimator__steps': [('scaler', StandardScaler()),
  ('logistic', LogisticRegression(solver='liblinear'))],
 'estimator__verbose': False,
 'estimator__scaler': StandardScaler(),
 'estimator__logistic': LogisticRegression(solver='liblinear'),
 'estimator__scaler__copy': True,
 'estimator__scaler__with_mean': True,
 'estimator__scaler__with_std': True,
 'estimator__logistic__C': 1.0,
 'estimator__logistic__class_weight': None,
 'estimator__logistic__dual': False,
 'estimator__logistic__fit_intercept': True,
 'estimator__logistic__intercept_scaling': 1,
 'estimator__logistic__l1_ratio': None,
 'estimator__logistic__max_iter': 100,
 'estimator__logistic__multi_class': 'auto',
 'estimator__logistic__n_jobs': None,
 'estimator__logistic__penalty': 'l2',
 'estimator__logistic__random_state': None,
 'estimator__logistic__solver': 'liblinear',
 'estimator__logistic__tol': 0.0001,
 'estimator__logistic__verbose': 0,
 'estimator__logisti

In [20]:
# cross validation results 
df1 = pd.DataFrame(best_model.cv_results_) 

df1

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logistic__C | param_logistic__penalty | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004 | 0.001096 | 0.002 | 0.000633 | 0.01 | l1 | {'logistic__C': 0.01, 'logistic__penalty': 'l1'} | 0.6875 | 0.6875 | 0.6875 | 0.675 | 0.675 | 0.6825 | 0.006124 | 20 |
| 1 | 0.003604 | 0.00102 | 0.001201 | 0.000399 | 0.01 | l2 | {'logistic__C': 0.01, 'logistic__penalty': 'l2'} | 0.7125 | 0.75 | 0.7 | 0.6875 | 0.6875 | 0.7075 | 0.023184 | 14 |
| 2 | 0.003003 | 0.000634 | 0.001199 | 0.0004 | 1.12 | l1 | {'logistic__C': 1.12, 'logistic__penalty': 'l1'} | 0.7125 | 0.7375 | 0.7 | 0.6875 | 0.6875 | 0.705 | 0.018708 | 16 |
| 3 | 0.0034 | 0.000491 | 0.0012 | 0.000399 | 1.12 | l2 | {'logistic__C': 1.12, 'logistic__penalty': 'l2'} | 0.7125 | 0.7375 | 0.7 | 0.6875 | 0.6875 | 0.705 | 0.018708 | 16 |
| 4 | 0.002401 | 0.000492 | 0.001001 | 3e-06 | 2.23 | l1 | {'logistic__C': 2.23, 'logistic__penalty': 'l1'} | 0.7125 | 0.75 | 0.7 | 0.6875 | 0.6875 | 0.7075 | 0.023184 | 14 |
| 5 | 0.002601 | 0.000489 | 0.0006 | 0.00049 | 2.23 | l2 | {'logistic__C': 2.23, 'logistic__penalty': 'l2'} | 0.7125 | 0.7375 | 0.7 | 0.6875 | 0.7 | 0.7075 | 0.016956 | 1 |
| 6 | 0.0022 | 0.000399 | 0.000799 | 0.0004 | 3.34 | l1 | {'logistic__C': 3.34, 'logistic__penalty': 'l1'} | 0.7125 | 0.7375 | 0.7 | 0.6875 | 0.6875 | 0.705 | 0.018708 | 16 |
| 7 | 0.002398 | 0.000801 | 0.0012 | 0.000401 | 3.34 | l2 | {'logistic__C': 3.34, 'logistic__penalty': 'l2'} | 0.7125 | 0.7375 | 0.7 | 0.6875 | 0.7 | 0.7075 | 0.016956 | 1 |
| 8 | 0.002201 | 0.0004 | 0.001 | 3e-06 | 4.45 | l1 | {'logistic__C': 4.45, 'logistic__penalty': 'l1'} | 0.7125 | 0.7375 | 0.7 | 0.6875 | 0.6875 | 0.705 | 0.018708 | 16 |
| 9 | 0.002598 | 0.000489 | 0.0008 | 0.0004 | 4.45 | l2 | {'logistic__C': 4.45, 'logistic__penalty': 'l2'} | 0.7125 | 0.7375 | 0.7 | 0.6875 | 0.7 | 0.7075 | 0.016956 | 1 |


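The grid combines 10 values of C with 2 penalty values, giving 10 x 2 = 20 candidate models. Each candidate is fitted on 5 cross-validation folds, for 20 x 5 = 100 fits in total. Based on the cross-validation results above, we choose the model ranked number one. Several candidates tie for first place here (for example C = 2.23, 3.34 and 4.45 with the l2 penalty), and `best_params_` reports the first of them.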
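
The `mean_test_score` and `std_test_score` columns are the mean and the population standard deviation of the five per-fold scores. As a quick check in plain Python, using the split scores copied by hand from row 5 of the results table:

```python
from statistics import fmean, pstdev

# Per-fold accuracies for C=2.23, penalty='l2' (row 5 of the results table).
split_scores = [0.7125, 0.7375, 0.7, 0.6875, 0.7]

mean_score = fmean(split_scores)
std_score = pstdev(split_scores)  # population standard deviation (divides by n)

print(f"mean = {mean_score:.4f}, std = {std_score:.6f}")
```

The values agree with the 0.7075 and 0.016956 reported for that row.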

In [21]:
# Model Params 
print(f"Best Penalty: {best_model.best_params_['logistic__penalty']}") 
print(f"Best C: {best_model.best_params_['logistic__C']}") 
print(f"Best Score: {best_model.best_score_:.4f}")

Best Penalty: l2
Best C: 2.23
Best Score: 0.7075

