# Hyperparameter Tuning and Cross-Validation #

The purpose of this notebook is to introduce various means of hyperparameter tuning and cross-validation.

### Hyperparameter Tuning ###

Hyperparameter tuning is the process of finding the optimal configuration for a machine learning model. It involves testing different values of hyperparameters for a given ML algorithm and selecting the combination that maximizes performance.

There are different techniques for hyperparameter tuning, many of which are built into machine learning libraries like scikit-learn. Some common techniques are covered in this notebook:
* Grid Search - Exhaustively evaluates every combination in a user-specified grid of hyperparameter values.
* Randomized Search - A faster alternative to Grid Search that evaluates a fixed number of randomly sampled combinations.
* Bayesian Optimization - Uses a probabilistic model of the results so far to choose which hyperparameters to try next, so it can find good values in fewer evaluations.

### Cross-Validation ###

Cross-validation is a technique to detect and reduce overfitting, and it is especially important for hyperparameter tuning. The basic idea is to split the training data into many different training and validation subsets and to average the model's performance across them.

There are different techniques for cross-validation, many of which are built into machine learning libraries like scikit-learn. Some common techniques are covered in this notebook:
* Leave-P-Out (see also Leave-One-Out) - Holds out every possible subset of `p` samples for validation, one subset per iteration.
* Stratified K-Fold - Ensures that each fold preserves the overall class proportions, which matters when some classification labels are common and others are rare.
* Shuffle-Split - Randomly partitions data into multiple train-test splits.

By using cross-validation, we gain confidence that our chosen hyperparameters generalize well to unseen data, improving the model's robustness.
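
To make the three splitters concrete, here is a minimal sketch on a made-up toy dataset (8 samples, 6 of one class and 2 of the other). The data, the number of folds, and the `random_state` are assumptions chosen only for illustration; the splitter classes are the ones imported in this notebook.

```python
import numpy as np
from sklearn.model_selection import LeavePOut, ShuffleSplit, StratifiedKFold

# Toy data, made up for illustration: 6 samples of class 0 and 2 of class 1
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

splitters = {
    "Leave-P-Out (p=2)": LeavePOut(p=2),
    "Stratified K-Fold (2 folds)": StratifiedKFold(n_splits=2),
    "Shuffle-Split (3 splits)": ShuffleSplit(n_splits=3, test_size=0.25, random_state=0),
}

split_counts = {}
for name, splitter in splitters.items():
    splits = list(splitter.split(X_toy, y_toy))
    split_counts[name] = len(splits)
    train_idx, val_idx = splits[0]
    print(f"{name}: {len(splits)} splits")
    print(f"  first validation indices: {val_idx}, labels: {y_toy[val_idx]}")
```

Leave-P-Out produces one split for every pair of samples, so its split count grows very quickly with the size of the data, while Stratified K-Fold places one of the two rare-class samples in each validation fold.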

In [None]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import (
    train_test_split, GridSearchCV, RandomizedSearchCV, 
    LeavePOut, StratifiedKFold, ShuffleSplit)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# pip install scikit-optimize
from skopt import BayesSearchCV

## Wine Dataset ##

We are going to reuse the wine dataset. It's not so much that I particularly enjoy this dataset, but it's a nice small size for this lesson and all of the values are numeric. Notice that I skip over any sort of normalization or standardization. It might be possible to improve scores by taking these sorts of things into account.

In [None]:
from sklearn.datasets import load_wine
import pandas as pd

# Load the wine dataset
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target

print(df.shape)
df.head()

### Validation Sets ###

We need to set aside a portion of the data as our final ***test set***. Even though cross-validation cycles through the data, carving out a new ***validation set*** on each pass, it is still important to hold back a portion of the data for final testing. In this way, we actually have three sets of data. The first two come from the 'training' data and the last one comes from the 'test' data.
```
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
```

#### X_train, y_train ####
This data is used for cross-validation. It will be split over and over again to create different sets of training and validation data.
* Training set: the true training data, a different subset of X_train and y_train for each cross-validation set
* Validation set: intermediate "test" data, a different subset of X_train and y_train for each cross-validation set

#### X_test, y_test ####
* Test set: this data is the true test data; it is held aside from the very beginning for the final score. We will not use this test data for any of the cross-validation models because that could lead to data leakage.

In [None]:
# Train-test split
X = df.drop(columns=["target"])
y = df["target"]
# A fixed random_state makes the split (and the results below) reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=432)
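
The three sets described above can be seen by taking a single cross-validation fold by hand. This is only a sketch: the `random_state=0` values and the choice of a 5-fold `StratifiedKFold` are assumptions for illustration, and the data is reloaded so the block runs on its own.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, train_test_split

X_all, y_all = load_wine(return_X_y=True)

# Step 1: hold the test set aside (random_state=0 is an assumption for reproducibility)
X_tr_full, X_te, y_tr_full, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0)

# Step 2: one cross-validation fold divides the training data again
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_idx, val_idx = next(cv.split(X_tr_full, y_tr_full))

print("test set (held out):     ", len(X_te))
print("training part of fold:   ", len(train_idx))
print("validation part of fold: ", len(val_idx))
```

Every fold reuses the same rows of the training data in a different arrangement, but none of the folds ever sees the held-out test rows.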

## Two New ML Algorithms ##

We are going to try two new machine learning classification algorithms, Support Vector Machines (SVM) and Gradient Boost. Let's check in with ChatGPT for a description of these algorithms:

### Support Vector Machines ###

"Support Vector Machines (SVM) are supervised learning algorithms used for classification and regression tasks. SVM works by finding the optimal hyperplane that best separates different classes in a dataset. It maximizes the margin (distance between the closest data points, called support vectors) to ensure better generalization. When the data is not linearly separable, SVM uses the kernel trick (e.g., polynomial or RBF kernel) to transform data into a higher-dimensional space where a linear boundary can be applied. SVMs are particularly effective in high-dimensional spaces and small datasets, but they can be computationally expensive for large datasets."

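As a rough illustration of the kernel trick mentioned in the quote, the sketch below compares a linear and an RBF kernel on data that is not linearly separable. The concentric-circles dataset from `make_circles` and all of its settings are assumptions chosen for this demo; it is not part of the wine analysis.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (assumed for illustration): one class forms a ring around the other
X_c, y_c = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(
    X_c, y_c, test_size=0.3, random_state=0)

kernel_scores = {}
for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel).fit(X_c_train, y_c_train)
    kernel_scores[kernel] = clf.score(X_c_test, y_c_test)
    n_sv = int(clf.n_support_.sum())  # support vectors summed over both classes
    print(f"{kernel:>6}: accuracy {kernel_scores[kernel]:.2f}, {n_sv} support vectors")
```

No straight line can separate a ring from its center, so the linear kernel should score far below the RBF kernel here.
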
### Gradient Boost ###

"Gradient Boosting is a machine learning method that builds a strong model by combining many weak models (usually small decision trees). It works step by step, where each new tree tries to fix the mistakes made by the previous trees. Instead of treating all mistakes equally, Gradient Boosting focuses more on errors that were hardest to correct. By doing this repeatedly, the model improves over time. This method is very powerful for complex, non-linear problems, but it needs careful tuning to avoid overfitting (memorizing the training data too much). It is commonly used in applications like fraud detection, ranking systems, and predicting customer behavior."

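The "each new tree tries to fix the mistakes made by the previous trees" idea can be sketched by hand. This is a simplified illustration, not the `GradientBoostingClassifier` implementation: it uses regression with squared error (where the mistakes are simply the residuals), and the synthetic data, learning rate, tree depth, and number of rounds are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumed for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction  # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # small corrective step
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The training error keeps shrinking as trees are added, which is also why the method can overfit if the number of trees, their depth, or the learning rate is set too aggressively.
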
In [None]:
models = {
    "SVM": SVC(random_state=432),
    "Gradient Boost": GradientBoostingClassifier(random_state=432)
}

### Hyperparameter Tuning ###

Recall that hyperparameter tuning means testing different hyperparameter values for a given ML algorithm and keeping the combination that maximizes performance. We have already met several examples.

For example, in clustering, the K-means algorithm requires the user to choose the number of clusters. We used WSSE elbow plots to find the optimum value for `k`. Similarly, the DBSCAN algorithm depends on `min_samples` (minimum number of neighbors) and `eps` (neighborhood size). We combined the Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index metrics to optimize these parameters.

In regression learning, we might choose a polynomial model that requires us to specify options like the polynomial degree and learning rate.

And finally, with classification algorithms like K-Nearest Neighbors, the user is able to choose between various `metric` values (e.g., `"euclidean"` or `"manhattan"`) and a voting strategy using the `weights` parameter.

In each of these cases, selecting the optimal hyperparameter values can significantly impact model performance.
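
As a small sketch of what tuning looks like when done by hand, the loop below tries every pairing of `metric` and `weights` for K-Nearest Neighbors on the wine data and keeps the best cross-validated accuracy. The choice of `n_neighbors=5` and of 5 folds is an assumption for illustration.

```python
from itertools import product

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_wine, y_wine = load_wine(return_X_y=True)

knn_results = {}
for metric, weights in product(["euclidean", "manhattan"], ["uniform", "distance"]):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric, weights=weights)
    # Mean accuracy over 5 folds for this hyperparameter combination
    knn_results[(metric, weights)] = cross_val_score(knn, X_wine, y_wine, cv=5).mean()

for combo, score in knn_results.items():
    print(f"{combo}: {score:.3f}")

best_combo = max(knn_results, key=knn_results.get)
print("best combination:", best_combo)
```

The search classes used later in this notebook automate exactly this kind of loop.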

In [None]:
# We will compare Gradient Boost and Support Vector Machine, two new ML algorithms
param_grids = {
    'Gradient Boost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf']
    }
}
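
To get a feel for how much work these grids imply, the sketch below counts the candidate combinations with `ParameterGrid` and multiplies by the number of cross-validation splits. The split counts (10 and 5) are taken from the cross-validation settings used later in this notebook; the grids are copied here so the block runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the cell above, repeated so this block is self-contained
grids = {
    "Gradient Boost": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1, 0.2],
        "max_depth": [3, 5, 7],
    },
    "SVM": {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
}

# Splits per candidate: 10 for Shuffle-Split, 5 for Stratified K-Fold
splits_per_cv = {"Shuffle-Split": 10, "Stratified K-Fold": 5}

n_candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
for name, count in n_candidates.items():
    for cv_name, n_splits in splits_per_cv.items():
        print(f"{name} with {cv_name}: "
              f"{count} candidates x {n_splits} splits = {count * n_splits} fits")
```

A full grid search fits every one of these candidates on every split, whereas a randomized or Bayesian search with a fixed `n_iter` fits only that many candidates per split.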

### Cross-Validation ###

Recall that cross-validation is a technique to reduce overfitting, and it is especially important for hyperparameter tuning.

Think back to how we performed hyperparameter tuning. We basically tried a bunch of different parameter values and then found which combination gave the highest accuracy score. One of the potential problems with this approach is that it is prone to overfitting. The best parameters are only "best" for the chosen train-test split, because a model's score varies with the data it happens to be trained and tested on. A different split might have led us to choose different hyperparameters.

Cross-validation mitigates this by repeatedly training and evaluating the model on different train-validation splits of the training data. The final evaluation metric is averaged over these splits, providing a more reliable estimate of the model's true performance.
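
The effect of the split can be seen directly. In this sketch, the same SVM is scored on five different single train-test splits and then with 5-fold cross-validation. The seeds, the fold count, and the `kernel="linear", C=1` settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X_w, y_w = load_wine(return_X_y=True)
svm = SVC(kernel="linear", C=1)

# A single train-test split gives one number that depends on the random split
single_split_scores = []
for seed in range(5):
    X_a, X_b, y_a, y_b = train_test_split(
        X_w, y_w, test_size=0.2, stratify=y_w, random_state=seed)
    single_split_scores.append(svm.fit(X_a, y_a).score(X_b, y_b))
print("single splits:", np.round(single_split_scores, 3))

# Cross-validation reports one score per fold, which we then average
cv_scores = cross_val_score(svm, X_w, y_w, cv=5)
print("per-fold:", np.round(cv_scores, 3))
print(f"mean: {cv_scores.mean():.3f}, std: {cv_scores.std():.3f}")
```

The spread across folds is worth reporting alongside the mean, since it shows how sensitive the score is to the particular split.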


In [None]:
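# Cross-validation techniques to compare. Leave-P-Out is left commented out:
# with p=2 it creates one split per pair of training samples, which is very slow.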
cv_methods = {
    "Shuffle-Split": ShuffleSplit(n_splits=10, test_size=0.2, random_state=123),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    #"Leave-P-Out": LeavePOut(p=2)
}
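
To see how quickly Leave-P-Out grows, the sketch below asks the splitter how many splits it would generate. It assumes 142 training samples, which is what an 80/20 split of the 178 wine samples leaves; the placeholder array only supplies the sample count.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

n_train = 142  # assumed: training rows left after an 80/20 split of 178 samples
X_placeholder = np.zeros((n_train, 1))  # only the number of rows matters here

lpo_splits = LeavePOut(p=2).get_n_splits(X_placeholder)
print(f"Leave-2-Out on {n_train} samples: {lpo_splits} splits")
print(f"same as 'n choose 2': {comb(n_train, 2)}")
```

Each of those splits would be fitted once per hyperparameter candidate, compared with 5 or 10 splits for the other two methods.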

## Evaluating the Models ##

The final step is to create and validate (test) all of the models, keeping track of the best performers.

We're going to find that the wine dataset is relatively small and each class is fairly easy to predict, so the accuracy of each cross-validation method will be roughly the same, though there may be some small difference in the values. We had to keep the dataset simple in order for the cross-validation techniques to finish in a reasonable time on my little laptop. Had we more processing power, we could have analyzed a larger and more complex dataset, where the differences would be more pronounced.

The key thing to notice here is the difference in execution time!

In [None]:

# Perform hyperparameter tuning with different cross-validation methods
for model_name, model in models.items():
    print(f"Training {model_name}...")
    
    for cv_name, cv_method in cv_methods.items():
        print(f"  Using {cv_name} cross-validation")
        
        start_time = time.time()
        grid_search = GridSearchCV(
            model, param_grids[model_name], 
            cv=cv_method, scoring='accuracy', n_jobs=-1)
        grid_search.fit(X_train, y_train)
        y_pred = grid_search.best_estimator_.predict(X_test)
        end_time = time.time()
        print(f"    Grid Search  ({end_time - start_time:5.2f}s): {accuracy_score(y_test, y_pred):.2f} accuracy with {grid_search.best_params_}")
        
        start_time = time.time()
        random_search = RandomizedSearchCV(
            model, param_grids[model_name], 
            cv=cv_method, scoring='accuracy', 
            n_iter=5, random_state=17, n_jobs=-1)
        random_search.fit(X_train, y_train)
        y_pred = random_search.best_estimator_.predict(X_test)
        end_time = time.time()
        print(f"    Rnd Search   ({end_time - start_time:5.2f}s): {accuracy_score(y_test, y_pred):.2f} accuracy with {random_search.best_params_}")

        start_time = time.time()
        bayes_search = BayesSearchCV(
            model, param_grids[model_name], 
            cv=cv_method, scoring='accuracy', 
            n_iter=10, random_state=17, n_jobs=-1)
        bayes_search.fit(X_train, y_train)
        y_pred = bayes_search.best_estimator_.predict(X_test)
        end_time = time.time()
        print(f"    Bayes Search ({end_time - start_time:5.2f}s): {accuracy_score(y_test, y_pred):.2f} accuracy with {bayes_search.best_params_}")
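
Each fitted search object also stores its cross-validated accuracy in `best_score_`, which is computed from the training data only. The sketch below runs just the SVM grid with a stratified 5-fold splitter and an assumed `random_state=0` for the train-test split, and shows how candidates could be compared on that score so that the test set is touched only once, at the very end.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X_d, y_d = load_wine(return_X_y=True)
X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
    X_d, y_d, test_size=0.2, stratify=y_d, random_state=0)  # seed is an assumption

search = GridSearchCV(
    SVC(random_state=432),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    scoring="accuracy")
search.fit(X_d_train, y_d_train)

print("best params:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")  # training data only
print(f"time to refit best model: {search.refit_time_:.3f}s")

# The held-out test set is used once, after the choice has been made
print(f"test accuracy: {search.score(X_d_test, y_d_test):.3f}")
```

Comparing searches on `best_score_` keeps the model choice independent of the test data, in line with the data-leakage concern raised earlier.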

### Analyzing Results ###

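We can dig deeper into the results by comparing the accuracy of each candidate using the `cv_results_` attribute. There are several different metrics available, mostly related to accuracy score and execution time. We will look into the accuracy scores. Note that `random_search` and `grid_search` here refer to the last searches run in the loop above (Gradient Boost with Stratified K-Fold).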

In [None]:
print(random_search.cv_results_['params'])
print(random_search.cv_results_['mean_test_score'])
print(random_search.cv_results_['rank_test_score'])

In [None]:
combined = zip(random_search.cv_results_['params'], random_search.cv_results_['mean_test_score'])
combined = sorted(list(combined), key=lambda x: x[1], reverse=True)
for param, score in combined:
    print(f"Accuracy {100*score:.3f}%: {param}")

In [None]:
combined = zip(grid_search.cv_results_['params'], grid_search.cv_results_['mean_test_score'])
combined = sorted(list(combined), key=lambda x: x[1], reverse=True)
for param, score in combined:
    print(f"Accuracy {100*score:.3f}%: {param}")
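
Since `cv_results_` is a dictionary of equal-length arrays, it can also be loaded into a pandas DataFrame for easier sorting and inspection. This is a sketch on a small, separately fitted SVM search (the grid and `cv=5` are assumptions for illustration); the column names follow scikit-learn's `param_<name>` convention.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_r, y_r = load_wine(return_X_y=True)
svm_search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
svm_search.fit(X_r, y_r)

results_df = pd.DataFrame(svm_search.cv_results_)
columns = ["param_C", "param_kernel", "mean_test_score",
           "std_test_score", "mean_fit_time", "rank_test_score"]

# One row per candidate, best-ranked first
print(results_df[columns].sort_values("rank_test_score").to_string(index=False))
```

The `std_test_score` and `mean_fit_time` columns make it easy to weigh accuracy against stability and cost.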

## Final Testing ##

Now that we have selected the best model, it's time to make predictions on our real test data.

In [None]:
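# Hyperparameters copied by hand from the search results above;
# adjust them if your run selects different values
model = SVC(random_state=432, kernel='linear', C=1)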
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Classification Report for Support Vector Machine on Wine Dataset")
print("Predicting originating vineyard based on chemical composition of wine\n")
print(classification_report(y_test, y_pred))