# Hyperparameter Tuning

After evaluating model performance, we can optimize our models.

Recall that we had to choose a value for alpha in ridge and lasso regression before fitting the model. Likewise, before fitting and predicting with KNN, we choose n_neighbors.

Parameters that we specify before fitting a model, like alpha and n_neighbors, are called **hyperparameters**. A fundamental step in building a successful model is therefore choosing good hyperparameters. We can try many different values, fit a model for each of them, see how well each performs, and choose the best values. This is called hyperparameter tuning.

When comparing hyperparameter values, we use cross-validation so that we do not overfit the hyperparameters to the test set. We still split the data first, but we run cross-validation only on the training set. The test set is withheld and used only to evaluate the final tuned model.
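The workflow described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes synthetic data from `make_regression`, a `Ridge` model, and three arbitrary alpha values chosen only for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; not the diabetes dataset used below)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Hold out a test set that plays no part in choosing hyperparameters
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Compare candidate alphas using cross-validation on the training set only
cv_means = {}
for alpha in [0.1, 1.0, 10.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=kf)
    cv_means[alpha] = scores.mean()
    print("alpha={}: mean CV R^2 = {:.3f}".format(alpha, cv_means[alpha]))

best_alpha = max(cv_means, key=cv_means.get)

# Use the test set once, to evaluate the model refit with the chosen alpha
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
print("Chosen alpha: {}, test R^2: {:.3f}".format(best_alpha, test_r2))
```

The test set is only touched in the last two lines, after the hyperparameter has already been selected.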

## GridSearchCV

One approach to hyperparameter tuning is called grid search, where we choose a grid of possible hyperparameter values to try. We then perform k-fold cross-validation for each combination of hyperparameters and choose the combination that performed best.
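To make the procedure concrete, here is a hand-rolled sketch of what a grid search does. It assumes synthetic data from `make_classification` and a small, arbitrary KNN grid over `n_neighbors` and `weights`; in practice `GridSearchCV` performs this loop for you.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Every combination of the listed values: 3 x 2 = 6 candidates
grid = ParameterGrid({"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]})
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Run k-fold cross-validation once per candidate
results = []
for params in grid:
    scores = cross_val_score(KNeighborsClassifier(**params), X, y, cv=kf)
    results.append((scores.mean(), params))

# Keep the candidate with the highest mean cross-validated accuracy
best_score, best_params = max(results, key=lambda r: r[0])

# Total model fits = number of candidates x number of folds
n_fits = len(grid) * kf.get_n_splits()
print("Best params: {}, mean CV accuracy: {:.3f}".format(best_params, best_score))
print("Model fits performed:", n_fits)
```

The fit count grows multiplicatively with each hyperparameter added to the grid, which is why large grids become expensive.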

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, RandomizedSearchCV
from sklearn.linear_model import Lasso, LogisticRegression

In [2]:
diabetes_df = pd.read_csv('Data/diabetes_clean.csv')
print(diabetes_df.shape)
diabetes_df.head()

(768, 9)


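|   | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |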


In [3]:
X = diabetes_df.drop('glucose', axis=1).values  # features: every column except glucose
y = diabetes_df['glucose'].values  # regression target: glucose

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
lasso = Lasso()

param_grid = {'alpha': np.linspace(0.00001, 1, 20)}
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)

lasso_cv.fit(X_train, y_train)
print("Tuned lasso parameters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))

Tuned lasso parameters: {'alpha': 1e-05}
Tuned lasso score: 0.33506359955850107
Tuned lasso score: 0.33506359955850107


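Unfortunately, the best model only reaches a mean cross-validated R-squared of about 0.34, which shows that using the optimal hyperparameters does not guarantee a high-performing model! Note also that the selected alpha is the smallest value in the grid, so regularization is contributing almost nothing here.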
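One way to see whether tuning made much difference is to look at the score of every candidate, not just the winner. The sketch below assumes synthetic data from `make_regression` rather than the diabetes data, so its numbers will not match the output above; `cv_results_` is the attribute a fitted `GridSearchCV` exposes for this.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=8, noise=20.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(Lasso(), {"alpha": np.linspace(0.00001, 1, 20)}, cv=kf)
search.fit(X, y)

# One row per candidate alpha, with its mean and spread across the folds
results = pd.DataFrame(search.cv_results_)[
    ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score").head())

# A small gap between best and worst means alpha barely matters on this grid
spread = results["mean_test_score"].max() - results["mean_test_score"].min()
print("Score spread across the grid: {:.4f}".format(spread))
```

If the spread is tiny compared with `std_test_score`, the limiting factor is more likely the model or the features than the hyperparameter value.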

## RandomizedSearchCV

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space. In that case, you can use RandomizedSearchCV, which evaluates only a fixed number of hyperparameter settings (set by n_iter, 10 by default), sampled from the lists or probability distributions you specify.
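As a hedged illustration of sampling from a distribution rather than a fixed list, the sketch below draws `C` from a log-uniform distribution. The data are synthetic, and the range for `C`, the value of `n_iter`, and `max_iter=1000` are assumptions chosen for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_distributions = {
    "C": loguniform(1e-3, 1e2),          # continuous: sampled on a log scale
    "class_weight": [None, "balanced"],  # list: sampled uniformly
}

kf = KFold(n_splits=5, shuffle=True, random_state=0)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=15,        # number of sampled settings to evaluate
    cv=kf,
    random_state=0,   # makes the sampled settings reproducible
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print("Candidates evaluated:", n_candidates)
print("Best params:", search.best_params_)
print("Best mean CV accuracy: {:.3f}".format(search.best_score_))
```

The cost is fixed by `n_iter` times the number of folds, regardless of how wide the search space is. Setting `random_state` on the search makes the sampled candidates repeatable between runs.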

In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
X = diabetes_df.drop('diabetes', axis=1).values
y = diabetes_df['diabetes'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
logreg = LogisticRegression()  # note: in recent scikit-learn versions the default lbfgs solver supports only "l2", so "l1" candidates fail to fit and are scored as NaN (the warnings are hidden by the filter above); solver="liblinear" supports both

params = {"penalty": ["l1", "l2"],  # regularization type: "l1" (lasso-style) or "l2" (ridge-style)
         "tol": np.linspace(0.0001, 1.0, 50),  # tolerance for the solver's stopping criterion
         "C": np.linspace(0.1, 1.0, 50),  # inverse of regularization strength: smaller C means stronger regularization
         "class_weight": ["balanced", {0: 0.8, 1: 0.2}]}  # "balanced" sets class weights automatically; a dict sets them manually

logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)
logreg_cv.fit(X_train, y_train)

print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))


Tuned Logistic Regression Parameters: {'tol': 0.16334897959183672, 'penalty': 'l2', 'class_weight': 'balanced', 'C': 0.3020408163265306}
Tuned Logistic Regression Best Accuracy Score: 0.7633783316026306


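Even without exhaustively trying every combination of hyperparameters, the model reaches a mean cross-validated accuracy of about 76%! Note that best_score_ is computed on the cross-validation folds of the training set, not on the test set.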
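To obtain an actual test-set figure, the tuned search object has to be scored on the withheld data. The sketch below shows the two numbers side by side. It assumes synthetic data and tunes only `C`, so it is an illustration of the pattern, not a reproduction of the result above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": np.linspace(0.1, 1.0, 50)},
    n_iter=10,
    cv=kf,
    random_state=42,
)
search.fit(X_train, y_train)

# Mean accuracy over the cross-validation folds of the training set
cv_accuracy = search.best_score_

# Accuracy of the refit best model on data never seen during tuning
test_accuracy = search.score(X_test, y_test)

print("Mean CV accuracy: {:.3f}".format(cv_accuracy))
print("Test accuracy:    {:.3f}".format(test_accuracy))
```

By default the search refits the best setting on the full training set, so `search.score` and `search.predict` use that refit model directly.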