# Lab 6: Introducing Hyperparameter Tuning

Objectives:
- To gain hands-on experience tuning hyperparameters
- To implement concepts related to hyperparameter tuning

## Lab

### Load Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

### Load and Prepare Data

We're going to use the Iris dataset.

In [2]:
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Let's use a decision tree again! (Don't forget it yet.)

In [3]:
# Initialize a Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

### Parameter Tuning Using GridSearchCV

In [4]:
# Define parameter grid for tuning
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Apply GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get best parameters
print("Best Parameters from GridSearchCV:", grid_search.best_params_)


Best Parameters from GridSearchCV: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2}
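To get a feel for how much work the grid search above does, the number of candidate combinations can be counted with `ParameterGrid`. This is a minimal sketch that rebuilds the same `param_grid`; the names `n_candidates`, `n_folds`, and `n_fits` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the lab cell above
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination is one candidate: 2 * 4 * 3 * 3
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (cv=5 in the lab);
# this count excludes the final refit on the full training set
n_folds = 5
n_fits = n_candidates * n_folds

print(f"Candidates: {n_candidates}, cross-validation fits: {n_fits}")
# Candidates: 72, cross-validation fits: 360
```

Adding one more value to any list multiplies the total, which is why grids grow quickly as more hyperparameters are tuned.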


### Parameter Tuning Using RandomizedSearchCV

In [5]:
from scipy.stats import randint

# Define parameter distributions for tuning
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

# Apply RandomizedSearchCV
random_search = RandomizedSearchCV(dt, param_dist, n_iter=10, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)

# Get best parameters
print("Best Parameters from RandomizedSearchCV:", random_search.best_params_)

Best Parameters from RandomizedSearchCV: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 3, 'min_samples_split': 3}
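To see which candidates a randomized search could draw, the same distributions can be passed to `ParameterSampler`. This is an illustrative sketch: `sampled` is a name introduced here, and the draws are not claimed to be the exact candidates `RandomizedSearchCV` evaluated in the cell above. Note that `randint(low, high)` from SciPy excludes `high`, so `randint(2, 20)` yields integers from 2 to 19.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same distributions as in the lab cell above
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': randint(2, 20),  # integers 2..19 (upper bound excluded)
    'min_samples_leaf': randint(1, 10),   # integers 1..9 (upper bound excluded)
}

# Draw 10 candidate settings, mirroring n_iter=10
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

for candidate in sampled:
    print(candidate)
```

Only these sampled settings are cross-validated, so the cost is fixed by `n_iter` regardless of how wide the distributions are.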


### Evaluate the Best Model

In [6]:
# Retrieve the Decision Tree that GridSearchCV already refit on the full training set with the best parameters
best_dt = grid_search.best_estimator_
y_pred = best_dt.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy with Best GridSearchCV Model: {accuracy:.4f}")


Test Accuracy with Best GridSearchCV Model: 1.0000
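A single test accuracy says little about how close the other candidates were. As a sketch, the fitted search's `cv_results_` can be loaded into a DataFrame to compare candidates by mean cross-validation accuracy. The block below repeats the Iris setup so it runs on its own; `results` and `cols` are names introduced here, and `n_jobs` is left at its default.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# One row per candidate, with its mean and spread across the 5 folds
results = pd.DataFrame(grid_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results.sort_values('rank_test_score')[cols].head())

# Cross-validation score of the winner vs. accuracy on the held-out test set
print(f"Mean CV accuracy of best candidate: {grid_search.best_score_:.4f}")
print(f"Held-out test accuracy: {grid_search.score(X_test, y_test):.4f}")
```

Several candidates often tie or differ by less than one standard deviation, in which case the reported "best" parameters are one of several comparable choices.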


___

## Tasks:

Use the Wine dataset :)

In [11]:
from sklearn.datasets import load_wine

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

In [17]:
print(X.shape)
print(y.shape)

(178, 13)
(178,)


**Task 1: Train-Test Split**

- Split the Wine dataset into training and testing sets using an 80-20 split.
- What is the role of the train-test split in evaluating a machine learning model?

In [38]:
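# Write your code here
# Note: the task asks for an 80-20 split (test_size=0.2); this cell uses test_size=0.25,
# i.e. a 75-25 split, and the outputs below were produced with this setting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)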

We use `X_train` and `y_train` to fit the model's learnable parameters, and `X_test` and `y_test` to evaluate the fitted model on data it has not seen during training, which estimates how well it generalizes.
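For reference, an 80-20 split as the task describes could look like the sketch below. The `stratify=y` argument is an addition not used in this notebook; it keeps the class proportions similar in both parts. The names `X_tr`, `X_te`, `y_tr`, and `y_te` are chosen here so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# 80-20 split; stratify=y preserves the class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)  # (142, 13) (36, 13)

# Class proportions should be close in the training and test parts
print("train:", np.round(np.bincount(y_tr) / len(y_tr), 3))
print("test: ", np.round(np.bincount(y_te) / len(y_te), 3))
```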

**Task 2: Cross-Validation**

- Implement cross-validation using GridSearchCV and RandomizedSearchCV.
- How does cross-validation help in providing a more reliable estimate of the model's performance?
- Discuss how cross-validation improves the results and prevents overfitting compared to a simple train-test split.

In [49]:
# Write your code here
dt = DecisionTreeClassifier(random_state=42)
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Apply GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train,y_train)
print(grid_search.best_params_)
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

# Apply RandomizedSearchCV
random_search = RandomizedSearchCV(dt, param_dist, n_iter=10, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
random_search.fit(X_train,y_train)
# Baseline: fit the untuned tree with scikit-learn's default hyperparameters
dt.fit(X_train,y_train)
print(dt.get_params())

best_grid = grid_search.best_estimator_
grid_pred = best_grid.predict(X_test)
best_random = random_search.best_estimator_
random_pred = best_random.predict(X_test)

pred = dt.predict(X_test)
print(f"acc of best_grid : {accuracy_score(y_test,grid_pred)}")
print(f"acc of best_random : {accuracy_score(y_test,random_pred)}")
print(f"acc of og : {accuracy_score(y_test,pred)}")

{'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': 42, 'splitter': 'best'}
acc of best_grid : 0.9555555555555556
acc of best_random : 0.9333333333333333
acc of og : 0.9555555555555556


- Cross-validation trains and validates the model on several different splits (folds) of the training data, so the score it reports is an average over multiple validation sets rather than the result of one particular split.
- With a single `train_test_split`, the estimate depends on one particular split, and hyperparameters chosen from it may fit patterns that occur only in that split. Cross-validation evaluates every candidate on multiple folds, so the selected hyperparameters are more likely to generalize.

> The answers above are theoretical, and the results here do not show a clear difference. This could be due to other factors, such as the small size of the dataset and the variance of the data.
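The fold-to-fold variation that cross-validation averages over can be inspected directly with `cross_val_score`. This is a minimal sketch on the full Wine dataset with an untuned tree; `scores` is a name introduced here, and the exact values depend on the scikit-learn version.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# One accuracy value per fold; each fold serves once as the validation set
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X, y, cv=5, scoring='accuracy'
)

print("Per-fold accuracy:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

If the per-fold values differ noticeably, a single train-test split could have landed on any of them, which is the instability the mean smooths out.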

**Task 3: Hyperparameter Tuning**

- Use GridSearchCV to find the best hyperparameters by exhaustively searching through the parameter grid.
- Use RandomizedSearchCV to sample a fixed number of combinations of hyperparameters.
- Compare the results from both methods in terms of accuracy and computation time.
- Discuss the trade-offs between GridSearchCV and RandomizedSearchCV.

In [66]:
# Write your code here
import time

dt = DecisionTreeClassifier(random_state=42)
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Apply GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
%timeit grid_search.fit(X_train,y_train)


param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}
random_search = RandomizedSearchCV(dt, param_dist, n_iter=10, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
%timeit random_search.fit(X_train,y_train)

208 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
39 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


`GridSearchCV` takes more time because the search space is large and it evaluates every point in it: 2 × 4 × 3 × 3 = 72 combinations, each with 5-fold cross-validation. As the `%timeit` results show (about 208 ms versus 39 ms), `RandomizedSearchCV` takes much less time because it evaluates only the number of sampled points specified by `n_iter` (10 here).

The trade-off is that `RandomizedSearchCV` uses less time but finds only the best of the sampled points, which may or may not be the best in the whole search space. `GridSearchCV` is guaranteed to find the combination with the highest cross-validation score, but only among the combinations listed in the grid.
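Outside a notebook, where `%timeit` is not available, the same comparison can be sketched with `time.perf_counter`. The block below is self-contained and makes a few assumptions: it times a single run of each search, leaves `n_jobs` at its default, and stores the results in a hypothetical `summary` dictionary. Timings vary by machine, so none are shown.

```python
import time

from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
}

searches = {
    'grid': GridSearchCV(
        DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy'
    ),
    'random': RandomizedSearchCV(
        DecisionTreeClassifier(random_state=42), param_dist, n_iter=10, cv=5,
        scoring='accuracy', random_state=42
    ),
}

summary = {}
for name, search in searches.items():
    start = time.perf_counter()
    search.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    # Number of candidates tried, best mean CV accuracy, wall-clock seconds
    summary[name] = (len(search.cv_results_['params']), search.best_score_, elapsed)
    print(f"{name}: {summary[name][0]} candidates, "
          f"best CV accuracy {summary[name][1]:.4f}, {elapsed:.2f} s")
```

Comparing `best_score_` alongside the time shows what the extra candidates in the grid bought in cross-validation accuracy, if anything.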

**Task 4: Model Evaluation**

- Using the best model from GridSearchCV or RandomizedSearchCV, make predictions on the test set and calculate the accuracy.
- Compare the performance of the tuned model with a baseline Decision Tree model (without hyperparameter tuning).
- How does hyperparameter tuning affect the accuracy of the Decision Tree model?

In [67]:
dt.fit(X_train,y_train)
best_grid = grid_search.best_estimator_
grid_pred = best_grid.predict(X_test)
best_random = random_search.best_estimator_
random_pred = best_random.predict(X_test)
pred = dt.predict(X_test)
print(f"acc of best_grid : {accuracy_score(y_test,grid_pred)}")
print(f"acc of best_random : {accuracy_score(y_test,random_pred)}")
print(f"acc of og : {accuracy_score(y_test,pred)}")

print(grid_search.best_params_)
print(random_search.best_params_)
print(dt.get_params())

print("randomly change hyperparam")
dt.set_params(min_samples_leaf=2)
dt.fit(X_train,y_train)
print(f"randomly change param : {accuracy_score(y_test,dt.predict(X_test))}")

acc of best_grid : 0.9555555555555556
acc of best_random : 0.9333333333333333
acc of og : 0.9555555555555556
{'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
{'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 5, 'min_samples_split': 5}
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': 42, 'splitter': 'best'}
randomly change hyperparam
randomly change param : 0.9777777777777777


As you can see, `GridSearchCV` and the baseline have the highest test accuracy of the three (0.9556). The baseline may still be more prone to overfitting on unseen data, since there are no restrictions on the decision tree's growth.

`RandomizedSearchCV`, however, might not get the best result: because it samples the search space randomly, it may or may not find the best combination, and in this case it does not (0.9333).

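Tuning hyperparameters directly affects the result: I changed only `min_samples_leaf` (from 1 to 2) and the test accuracy changed from 0.9556 to 0.9778. With 45 test samples, that difference corresponds to a single prediction, so it should be read with caution. This is why we need to tune the hyperparameters and choose the ones that fit the use case.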
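Because one test split gives a noisy comparison, a steadier way to check whether a hyperparameter change helps is to compare configurations with repeated cross-validation on the training data. The sketch below uses `RepeatedStratifiedKFold`, which this notebook does not use; the choice of 10 repeats and the names `candidates` and `mean_scores` are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import (
    RepeatedStratifiedKFold, cross_val_score, train_test_split
)
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 5 folds repeated 10 times with different shuffles = 50 scores per model
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

candidates = {
    'baseline': DecisionTreeClassifier(random_state=42),
    'min_samples_leaf=2': DecisionTreeClassifier(min_samples_leaf=2, random_state=42),
}

mean_scores = {}
for name, model in candidates.items():
    # Only the training data is used, so the test set stays untouched
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
    mean_scores[name] = scores.mean()
    print(f"{name}: mean {scores.mean():.4f}, std {scores.std():.4f}")
```

If the two means differ by less than the spread across folds, the change seen on the test set may be noise rather than a real improvement.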