# Hyperparameter Tuning and Cross-Validation #

The purpose of this notebook is to introduce various means of hyperparameter tuning and cross-validation.

### Hyperparameter Tuning ###

Hyperparameter tuning is the process of finding the optimal configuration for a machine learning model. It involves testing different values of hyperparameters for a given ML algorithm and selecting the combination that maximizes performance.

There are different techniques for hyperparameter tuning, many of which are built into machine learning modules like SKLearn. Some common techniques are covered in this notebook:
* Grid Search - Exhaustively evaluates all possible hyperparameter combinations.
* Randomized Search - A faster version of Grid Search that samples random combinations of hyperparameters.
* Bayesian Optimization - Uses probabilistic models to find optimal hyperparameters more efficiently.
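
To make the difference between the first two techniques concrete, here is a minimal sketch using only the standard library. The `search_space` dictionary is a hypothetical example whose values mirror the Gradient Boost grid used later in this notebook, and the budget of 5 random candidates is an arbitrary choice for illustration.

```python
import itertools
import random

# Hypothetical search space (values mirror the Gradient Boost grid used later)
search_space = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
}

# Grid search: every combination (the Cartesian product of all value lists)
names = list(search_space)
grid_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*search_space.values())
]

# Randomized search: a fixed budget of combinations drawn at random
rng = random.Random(17)
random_candidates = rng.sample(grid_candidates, k=5)

print(f"Grid search evaluates {len(grid_candidates)} candidates")
print(f"Randomized search evaluates {len(random_candidates)} candidates")
```

Each candidate is trained once per cross-validation split, so with 5 splits the full grid costs 27 x 5 = 135 model fits, while the random sample costs 5 x 5 = 25.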

### Cross-Validation ###

Cross-validation is a technique for detecting and reducing overfitting, and it is especially important for hyperparameter tuning. The basic idea is to split the training data many different ways into training and validation subsets and to evaluate the model's average performance across those splits.

There are different techniques for cross-validation, many of which are built into machine learning modules like SKLearn. Some common techniques are covered in this notebook:
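* Leave-P-Out (see also Leave-One-Out) - Holds out a different set of `p` samples for validation in each iteration until every possible set has been used.
* Stratified K-Fold - Ensures that each fold preserves the overall class proportions, which matters when some classification labels are common and others are rare.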
* Shuffle-Split - Randomly partitions data into multiple train-test splits.

By using cross-validation, we ensure that our chosen hyperparameters generalize well to unseen data, improving the model's robustness.
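
The three splitters behave quite differently, which a small experiment can show. This is a rough sketch on made-up toy data (`X_toy`, `y_toy` are illustrative names, not part of the notebook's dataset): 10 samples with 8 of one class and 2 of the other.

```python
import numpy as np
from sklearn.model_selection import LeavePOut, ShuffleSplit, StratifiedKFold

# Toy data (made up for illustration): 8 samples of class 0, 2 of class 1
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

# Leave-P-Out: one split for every possible set of p validation samples
n_lpo = LeavePOut(p=2).get_n_splits(X_toy)
print("Leave-2-Out splits:", n_lpo)

# Stratified K-Fold: each validation fold keeps the class proportions
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=123)
rare_per_fold = [
    int((y_toy[val_idx] == 1).sum())
    for _, val_idx in skf.split(X_toy, y_toy)
]
print("Rare-class samples per validation fold:", rare_per_fold)

# Shuffle-Split: independent random splits of a fixed size
ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=123)
val_sizes = [len(val_idx) for _, val_idx in ss.split(X_toy)]
print("Shuffle-Split validation sizes:", val_sizes)
```

Note how quickly Leave-P-Out grows: the number of splits is "n choose p", so it becomes impractical for anything beyond small datasets.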

In [1]:
import time
import numpy as np                                                 # type: ignore
import pandas as pd                                                # type: ignore
import matplotlib.pyplot as plt                                    # type: ignore
from sklearn.model_selection import (                              # type: ignore
    train_test_split, GridSearchCV, RandomizedSearchCV, 
    LeavePOut, StratifiedKFold, ShuffleSplit)
from sklearn.ensemble import GradientBoostingClassifier            # type: ignore
from sklearn.svm import SVC                                        # type: ignore
from sklearn.metrics import accuracy_score, classification_report  # type: ignore

# pip install scikit-optimize
from skopt import BayesSearchCV                                    # type: ignore

## Forest Cover Type Dataset ##

The Forest Cover Type Dataset is another common dataset for multi-class classification. The goal is to predict the type of forest cover based on environmental factors such as elevation, soil type, and other climate-related features.

It seems that I favor environmental datasets. We had earthquake and river flow for CSC 314. Now we are considering geysers, mushrooms, and forest cover in CSC 432. I guess this comes from my love for the outdoors and the appreciation of God's creative beauty. My sabbatical is coming up and my preliminary idea is to create a large skiing dataset that can be used for clustering, regression, classification, and (of course) athletic analysis.

Our data originally comes from the U.S. Forest Service and the U.S. Geological Survey (USGS). This is the third or fourth dataset that we have used from USGS. The data is officially hosted on the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/31/covertype), although it can also be found on Kaggle and other similar websites.

### Classes ###

It is a large dataset of more than 500,000 samples, each with 54 features, categorized into 7 different classes:
* Spruce/Fir (label #1, and one of my personal favorites),
* Lodgepole Pine (#2),
* Ponderosa Pine (#3),
* Cottonwood/Willow (#4),
* Aspen (label #5, another favorite),
* Douglas-Fir (#6), and
* Krummholz (#7).

### Features ###

Here is a description of the 54 features. Notice that many of them are dummy variables for the soil type.

|Feature Name|Description|
|------------|-----------|
|Elevation|Altitude in meters|
|Aspect|Compass direction the slope faces (0-360 degrees)|
|Slope|Steepness in degrees|
|Horizontal Distance to Hydrology|Distance to nearest surface water (m)|
|Vertical Distance to Hydrology|Elevation difference from nearest surface water (m)|
|Horizontal Distance to Roadways|Distance to nearest road (m)|
|Hillshade at 9am, Noon, 3pm|Sunlight levels at different times of the day|
|Horizontal Distance to Fire Points|Distance to nearest wildfire ignition point (m)|
|Wilderness Area (**4 columns**)|One-hot encoding of 4 protected areas|
|Soil Type (**40 columns**)|One-hot encoding of 40 soil categories|

The dataset is highly imbalanced, with the majority of samples being Lodgepole Pine and Spruce/Fir. On a somewhat related note, we have an official Champion Lodgepole Pine in Big Bear. It has the best combination of height, circumference, and canopy spread in all of California and is 2nd in the US only to an Oregon tree.

In [None]:
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz"
column_names = (
    [ "Elevation", "Aspect", "Slope",
      "Horizontal_Distance_To_Hydrology",
      "Vertical_Distance_To_Hydrology", 
      "Horizontal_Distance_To_Roadways", 
      "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", 
      "Horizontal_Distance_To_Fire_Points"]
      + [f"Wilderness_Area_{i}" for i in range(4)]
      + [f"Soil_Type_{i}" for i in range(40)] 
      + ["Cover_Type"])

df = pd.read_csv(url, header=None, names=column_names)
print(df.shape)
df.head()

In [None]:
# Evaluate for class imbalance
df['Cover_Type'].value_counts()

In [None]:
# Check for missing values
df.isna().sum().sort_values(ascending=False).head()

In [2]:
from sklearn.datasets import load_wine
import pandas as pd

# Load the wine dataset
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target

print(df.shape)
df.head()

(178, 14)


| |alcohol|malic_acid|ash|alcalinity_of_ash|magnesium|total_phenols|flavanoids|nonflavanoid_phenols|proanthocyanins|color_intensity|hue|od280/od315_of_diluted_wines|proline|target|
|-|-------|----------|---|-----------------|---------|-------------|----------|--------------------|---------------|---------------|---|----------------------------|-------|------|
|0|14.23|1.71|2.43|15.6|127.0|2.8|3.06|0.28|2.29|5.64|1.04|3.92|1065.0|0|
|1|13.2|1.78|2.14|11.2|100.0|2.65|2.76|0.26|1.28|4.38|1.05|3.4|1050.0|0|
|2|13.16|2.36|2.67|18.6|101.0|2.8|3.24|0.3|2.81|5.68|1.03|3.17|1185.0|0|
|3|14.37|1.95|2.5|16.8|113.0|3.85|3.49|0.24|2.18|7.8|0.86|3.45|1480.0|0|
|4|13.24|2.59|2.87|21.0|118.0|2.8|2.69|0.39|1.82|4.32|1.04|2.93|735.0|0|


### Validation Sets ###

Even though cross-validation repeatedly carves a ***validation set*** out of the training data, it is still important to set aside a separate portion of the data for final testing. In this way, we actually have three sets of data. The first two come from the 'training' data and the last one comes from the 'test' data.
```
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
```

#### X_train, y_train ####
This data is used for the cross-validation. It will be split over and over again to create different training and validation subsets.
* Training set: the true training data, a different subset of X_train and y_train for each cross-validation split
* Validation set: intermediate "test" data, a different subset of X_train and y_train for each cross-validation split

#### X_test, y_test ####
* Test set: this data is the true test data; it is held aside from the very beginning for the final score. We will not use this test data for any of the cross-validation models because that could lead to data leakage.
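
The three sets described above can be sketched as follows. This is an illustrative example, assuming the bundled wine dataset and hypothetical variable names (`X_demo`, `X_tr`, `X_te`, and so on) chosen so they do not clash with the notebook's own variables.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, train_test_split

X_demo, y_demo = load_wine(return_X_y=True)

# Step 1: hold out the final test set before any tuning happens
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Step 2: cross-validation only ever splits the training portion
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [
    (len(train_idx), len(val_idx))
    for train_idx, val_idx in skf.split(X_tr, y_tr)
]

print(f"Held-out test set: {len(X_te)} samples (never seen during tuning)")
for i, (n_train, n_val) in enumerate(fold_sizes, start=1):
    print(f"Fold {i}: {n_train} training samples, {n_val} validation samples")
```

Every fold's training and validation subsets add up to the size of `X_tr`; the rows in `X_te` never appear in any of them.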

In [None]:
# Split features and labels
X = df.drop(columns=["Cover_Type"])
y = df["Cover_Type"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [3]:
X = df.drop(columns=["target"])
y = df["target"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

## Two New ML Algorithms ##

We are going to try two new machine learning classification algorithms, Support Vector Machines (SVM) and Gradient Boost. Let's check in with ChatGPT for a description of these algorithms:

### Support Vector Machines ###

"Support Vector Machines (SVM) are supervised learning algorithms used for classification and regression tasks. SVM works by finding the optimal hyperplane that best separates different classes in a dataset. It maximizes the margin (distance between the closest data points, called support vectors) to ensure better generalization. When the data is not linearly separable, SVM uses the kernel trick (e.g., polynomial or RBF kernel) to transform data into a higher-dimensional space where a linear boundary can be applied. SVMs are particularly effective in high-dimensional spaces and small datasets, but they can be computationally expensive for large datasets."
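
To see what "support vectors" means in practice, here is a tiny sketch on made-up 2D points (`X_toy` and `y_toy` are invented for illustration). After fitting a linear SVM, scikit-learn exposes the points that define the margin through the `support_vectors_` attribute.

```python
import numpy as np
from sklearn.svm import SVC

# Four made-up points on a line: two per class, with a gap in between
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

svm_demo = SVC(kernel="linear", C=1.0)
svm_demo.fit(X_toy, y_toy)

# Only the points closest to the boundary are kept as support vectors
print("Support vectors:")
print(svm_demo.support_vectors_)

# New points are classified by which side of the hyperplane they fall on
toy_predictions = svm_demo.predict([[0.5, 0.5], [3.5, 3.5]])
print("Predictions:", toy_predictions)
```

The two outer points play no role in placing the boundary; removing them would leave the fitted model unchanged.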

### Gradient Boost ###

"Gradient Boosting is a machine learning method that builds a strong model by combining many weak models (usually small decision trees). It works step by step, where each new tree tries to fix the mistakes made by the previous trees. Instead of treating all mistakes equally, Gradient Boosting focuses more on errors that were hardest to correct. By doing this repeatedly, the model improves over time. This method is very powerful for complex, non-linear problems, but it needs careful tuning to avoid overfitting (memorizing the training data too much). It is commonly used in applications like fraud detection, ranking systems, and predicting customer behavior."
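
The step-by-step nature of boosting can be observed with `staged_predict`, which yields the ensemble's predictions after each boosting stage. The sketch below is only illustrative: it assumes the bundled wine dataset, an arbitrary choice of 20 stages, and hypothetical variable names.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

gb_demo = GradientBoostingClassifier(n_estimators=20, random_state=432)
gb_demo.fit(X_tr, y_tr)

# Accuracy of the partial ensemble after each boosting stage
# (for multi-class problems, each stage adds one small tree per class)
stage_accuracy = [
    accuracy_score(y_te, y_stage)
    for y_stage in gb_demo.staged_predict(X_te)
]

for stage in (1, 5, 10, 20):
    print(f"After {stage:2d} stages: {stage_accuracy[stage - 1]:.2f} accuracy")
```

On an easy dataset like this one the accuracy may level off after only a few stages, which is one reason `n_estimators` is worth tuning.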

In [9]:
models = {
    "SVM": SVC(random_state=432),
    "Gradient Boost": GradientBoostingClassifier(random_state=432)
}

### Hyperparameter Tuning ###

Hyperparameter tuning is the process of finding the optimal configuration for a machine learning model. It involves testing different values of hyperparameters for a given ML algorithm and selecting the combination that maximizes performance.

For example, in clustering, the K-means algorithm requires the user to choose the number of clusters. We used WSSE elbow plots to find the optimum value for `k`. Similarly, the DBSCAN algorithm depends on `min_samples` (minimum number of neighbors) and `eps` (neighborhood size). We combined the Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index metrics to optimize these parameters.

In regression learning, we might choose a polynomial model that requires us to specify options like the polynomial degree and learning rate.

And finally, with classification algorithms like K-Nearest Neighbors, the user is able to choose between various `metric` values (e.g., `"euclidean"` or `"manhattan"`) and a voting strategy using the `weights` parameter.

In each of these cases, selecting the optimal hyperparameter values can significantly impact model performance.

In [6]:
# We will compare Gradient Boost and Support Vector Machine, two new ML algorithms
param_grids = {
    'Gradient Boost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf']
    }
}

### Cross-Validation ###

Cross-validation is a technique for detecting and reducing overfitting, and it is especially important for hyperparameter tuning. The basic idea is to split the training data many different ways into training and validation subsets and to evaluate the model's average performance across those splits.

Think back to how we performed hyperparameter tuning. We basically tried a bunch of different parameter values and then found which combination gave the highest accuracy score. One of the potential problems with this approach is that it is prone to overfitting. The best parameters are only "best" for the chosen train-test split. A different set of training data might have led us to choose different hyperparameters. This happens because the model's performance may vary depending on the data split, leading to inconsistent hyperparameter selection.

Cross-validation mitigates this by repeatedly training and evaluating the model on different train-validation splits of the training data. The final evaluation metric is averaged over these splits, providing a more reliable estimate of the model's true performance.
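
A quick experiment illustrates the point. The sketch below, which assumes the bundled wine dataset and a linear SVM with `C=0.1` purely as an example, scores the same model on five different single splits and then on a 5-fold cross-validation. No particular numbers are claimed here; the thing to look at is how much the single-split scores move around compared with the averaged score.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split)
from sklearn.svm import SVC

X_demo, y_demo = load_wine(return_X_y=True)

# Same model, five different single train-test splits
single_scores = []
for seed in range(5):
    X_a, X_b, y_a, y_b = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed)
    model = SVC(kernel="linear", C=0.1)
    single_scores.append(model.fit(X_a, y_a).score(X_b, y_b))

# Same model, scored once per fold and then averaged
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
cv_scores = cross_val_score(
    SVC(kernel="linear", C=0.1), X_demo, y_demo, cv=cv, scoring="accuracy")

print("Single-split accuracies:", [f"{s:.3f}" for s in single_scores])
print(f"Cross-validated accuracy: {cv_scores.mean():.3f} "
      f"(std {cv_scores.std():.3f})")
```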


In [7]:
# Two cross-validation techniques (Leave-P-Out is commented out below)
cv_methods = {
    "Shuffle-Split": ShuffleSplit(n_splits=10, test_size=0.2, random_state=123),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    #"Leave-P-Out": LeavePOut(p=2)
}

## Evaluating the Models ##

The final step is to create and validate (test) all of the models, keeping track of the best performers.

We're going to find that the wine dataset is relatively small and each class is fairly easy to predict, so the accuracy of each cross-validation method will be roughly the same, though there may be some small difference in the values. We had to keep the dataset simple in order for the cross-validation techniques to finish in a reasonable time on my little laptop. Had we more processing power, we could have analyzed a larger and more complex dataset, where the differences would be more pronounced.

The key thing to notice here is the difference in execution time!

In [8]:

# Perform hyperparameter tuning with different cross-validation methods
for model_name, model in models.items():
    print(f"Training {model_name}...")
    
    for cv_name, cv_method in cv_methods.items():
        print(f"  Using {cv_name} cross-validation")
        
        start_time = time.time()
        grid_search = GridSearchCV(
            model, param_grids[model_name], 
            cv=cv_method, scoring='accuracy', n_jobs=-1)
        grid_search.fit(X_train, y_train)
        y_pred = grid_search.best_estimator_.predict(X_test)
        end_time = time.time()
        print(f"    Grid Search  ({end_time - start_time:5.2f}s): {accuracy_score(y_test, y_pred):.2f} accuracy with {grid_search.best_params_}")
        
        start_time = time.time()
        random_search = RandomizedSearchCV(
            model, param_grids[model_name], 
            cv=cv_method, scoring='accuracy', 
            n_iter=5, random_state=17, n_jobs=-1)
        random_search.fit(X_train, y_train)
        y_pred = random_search.best_estimator_.predict(X_test)
        end_time = time.time()
        print(f"    Rnd Search   ({end_time - start_time:5.2f}s): {accuracy_score(y_test, y_pred):.2f} accuracy with {random_search.best_params_}")

        start_time = time.time()
        bayes_search = BayesSearchCV(
            model, param_grids[model_name], 
            cv=cv_method, scoring='accuracy', 
            n_iter=10, random_state=17, n_jobs=-1)
        bayes_search.fit(X_train, y_train)
        y_pred = bayes_search.best_estimator_.predict(X_test)
        end_time = time.time()
        print(f"    Bayes Search ({end_time - start_time:5.2f}s): {accuracy_score(y_test, y_pred):.2f} accuracy with {bayes_search.best_params_}")

Training Gradient Boost...
  Using Shuffle-Split cross-validation
    Grid Search  (41.47s): 0.94 accuracy with {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
    Rnd Search   ( 3.45s): 0.97 accuracy with {'n_estimators': 100, 'max_depth': 7, 'learning_rate': 0.2}
    Bayes Search ( 9.45s): 0.94 accuracy with OrderedDict({'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200})
  Using Stratified K-Fold cross-validation
    Grid Search  ( 7.54s): 0.94 accuracy with {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
    Rnd Search   ( 2.04s): 0.94 accuracy with {'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.01}
    Bayes Search ( 6.48s): 0.94 accuracy with OrderedDict({'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100})
Training SVM...
  Using Shuffle-Split cross-validation
    Grid Search  ( 0.77s): 0.97 accuracy with {'C': 0.1, 'kernel': 'linear'}
    Rnd Search   ( 0.65s): 0.97 accuracy with {'kernel': 'linear', 'C': 0.1}
    Bayes Search ( 

### Analyzing Results ###

We can dig deeper into the results by comparing the accuracy of each hyperparameter combination using the `cv_results_` attribute. There are several different metrics available, mostly related to accuracy score and execution time. We will look into the accuracy scores.

In [None]:
print(random_search.cv_results_['params'])
print(random_search.cv_results_['mean_test_score'])
print(random_search.cv_results_['rank_test_score'])

[{'kernel': 'linear', 'C': 1}, {'kernel': 'rbf', 'C': 1}, {'kernel': 'linear', 'C': 10}, {'kernel': 'linear', 'C': 0.1}, {'kernel': 'rbf', 'C': 10}]
[0.95763547 0.67586207 0.9364532  0.95763547 0.72512315]
[1 5 3 1 4]


In [20]:
combined = zip(random_search.cv_results_['params'], random_search.cv_results_['mean_test_score'])
combined = sorted(list(combined), key=lambda x: x[1], reverse=True)
for param, score in combined:
    print(f"Accuracy {100*score:.3f}%: {param}")

Accuracy 95.764%: {'kernel': 'linear', 'C': 1}
Accuracy 95.764%: {'kernel': 'linear', 'C': 0.1}
Accuracy 93.645%: {'kernel': 'linear', 'C': 10}
Accuracy 72.512%: {'kernel': 'rbf', 'C': 10}
Accuracy 67.586%: {'kernel': 'rbf', 'C': 1}


In [21]:
combined = zip(grid_search.cv_results_['params'], grid_search.cv_results_['mean_test_score'])
combined = sorted(list(combined), key=lambda x: x[1], reverse=True)
for param, score in combined:
    print(f"Accuracy {100*score:.3f}%: {param}")

Accuracy 95.764%: {'C': 0.1, 'kernel': 'linear'}
Accuracy 95.764%: {'C': 1, 'kernel': 'linear'}
Accuracy 93.645%: {'C': 10, 'kernel': 'linear'}
Accuracy 72.512%: {'C': 10, 'kernel': 'rbf'}
Accuracy 67.586%: {'C': 1, 'kernel': 'rbf'}
Accuracy 66.232%: {'C': 0.1, 'kernel': 'rbf'}
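
The same comparison can also be viewed as a table, since `cv_results_` is a dictionary of equal-length arrays that loads directly into a pandas DataFrame. The following is a self-contained sketch that refits a small SVM grid on the wine dataset; the variable names and the choice of columns are illustrative assumptions, and the scores it prints will not necessarily match the output above because the data split differs.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X_demo, y_demo = load_wine(return_X_y=True)

search = GridSearchCV(
    SVC(random_state=432),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One row per hyperparameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ["param_C", "param_kernel", "mean_test_score",
           "std_test_score", "rank_test_score", "mean_fit_time"]
summary = results[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

Including `std_test_score` alongside the mean helps judge whether two candidates with nearly identical accuracy are really distinguishable.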
