## Hyperparameter Optimization

### Cross-validation

Finding the optimal tuning parameters for a machine learning problem can often be very difficult. We may encounter **overfitting**, which means our machine learning model trains too specifically on our training dataset and causes higher levels of error when applied to our test/holdout datasets. Or, we may run into **underfitting**, which means our model doesn’t train specifically enough to our training dataset. This also leads to higher levels of error when applied to test/holdout datasets.
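To make the two failure modes concrete, here is a minimal sketch that compares training and test accuracy for decision trees of different depths. The synthetic dataset, the amount of label noise (`flip_y`), and the depth values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, so memorizing the training set does not generalize
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in [1, 5, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A very shallow tree tends to score poorly on both sets (underfitting), while the unrestricted tree fits the training set perfectly but scores noticeably lower on the test set (overfitting).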

When conducting a normal train/validation/test split, the model trains on a randomly selected portion of the data, validates on a separate portion, and is finally tested on a holdout dataset. In practice this can lead to some **issues**, especially when the size of the dataset is relatively **small**, because the observations you set aside may be ones that would have been key to training an optimal model. Keeping a percentage of the data out of the training phase, even if it is only **15–25%**, withholds information that would otherwise help our model train more **effectively**.

A solution to this problem is **Cross Validation**. Cross validation works by splitting our dataset into random groups, holding one group out as the test set, and training the model on the remaining groups. This process is repeated so that each group is held out as the test group exactly once, and the scores from all the runs are averaged to estimate how well the model performs.

One of the most common types of cross validation is **K-Fold Cross Validation**, where $k$ is the number of folds the dataset is split into. Using $k=5$ is a common starting point and is easy to demonstrate, as shown below:

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*zSlot50Mu-NDODADz3pz4g.png" width=500>

Here we see five iterations of the model, each of which treats a **different fold** as the **test set** and **trains on the other four folds**. Once all five iterations are complete, the five test scores are **averaged** (mean) to give the **final cross validation score**.
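The five-fold procedure can be sketched by hand with scikit-learn's `KFold`. The synthetic dataset and the choice of a random forest are assumptions made for the example; any estimator with `fit` and `score` would work the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Split the row indices into 5 shuffled folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # Train a fresh model on four folds, score it on the held-out fold
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
    fold_scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.3f}")

# The cross-validation result is the mean of the per-fold scores
cv_score = np.mean(fold_scores)
print(f"Mean CV accuracy = {cv_score:.3f}")
```

In practice, `sklearn.model_selection.cross_val_score` wraps this loop in a single call; the explicit version is shown here only to make each step visible.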

While cross validation can greatly benefit model development, there is also an important **drawback** that should be considered. Because the model has to be trained from scratch $k$ times, once per fold, it can get **computationally expensive** as your dataset gets larger and as the value of $k$ increases.

For example, running cross validation with $k = 10$ on a dataset with **1 million** observations requires you to train **10 separate models**, each of which trains on roughly **900,000** observations (nine of the ten folds) and is tested on the remaining 100,000. This won't really be an issue with **small datasets**, where the compute time is on the scale of minutes, but when working with **larger datasets** with sizes on the scale of many GB or TB, the time required will **significantly increase**.
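The arithmetic behind this cost can be sketched in a few lines. The helper `kfold_cost` is a hypothetical name introduced for this example, and it assumes the number of rows divides evenly by $k$.

```python
def kfold_cost(n_samples, k):
    """Return (number of fits, training rows per fit, test rows per fit) for k-fold CV.

    Assumes n_samples is evenly divisible by k.
    """
    test_rows = n_samples // k
    train_rows = n_samples - test_rows
    return k, train_rows, test_rows

for k in (5, 10):
    fits, train_rows, test_rows = kfold_cost(1_000_000, k)
    total_rows_trained = fits * train_rows
    print(f"k={k}: {fits} fits, {train_rows:,} training rows each, "
          f"{total_rows_trained:,} rows processed in total")
```

Every observation is used for training in $k-1$ of the fits, so the total training work grows roughly linearly with $k$.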



### Hyperparameter Tuning

When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture. Often, we don't immediately know what the optimal model architecture should be for a given model, and thus we'd like to be able to explore a range of possibilities. In true machine learning fashion, we'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically. Parameters which define the model architecture are referred to as hyperparameters, and thus this process of searching for the ideal model architecture is referred to as hyperparameter tuning.

These hyperparameters might address model design questions such as:

- What **degree** of polynomial features should I use for my linear model?
- What should be the **maximum depth** allowed for my decision tree?
- What should be the **minimum number of samples** required at a leaf node in my decision tree?
- How many **trees** should I include in my random forest?
- How many **neurons** should I have in my neural network layer?
- How many **layers** should I have in my neural network?
- What should I set my **learning rate** to for gradient descent?

I want to be absolutely clear: hyperparameters are not model parameters, and they cannot be directly trained from the data. Model parameters are learned during training when we optimize a loss function using something like **gradient descent**. The process for learning parameter values is shown below:

<img src="https://www.jeremyjordan.me/content/images/2017/11/Screen-Shot-2017-11-02-at-1.28.26-PM.png" width="600">

Whereas the model parameters specify how to transform the input data into the desired output, the hyperparameters define how our model is actually structured. Unfortunately, there's no way to calculate **which way should I update my hyperparameter to reduce the loss?** (i.e., gradients) in order to find the optimal model architecture; thus, we generally resort to experimentation to figure out what works best.
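The distinction shows up directly in scikit-learn's API: hyperparameters are passed to the constructor before training, while model parameters only exist after `fit`. The sketch below uses `LogisticRegression` on a synthetic dataset purely as an illustration; the values of `C` and `max_iter` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by us before training and passed to the constructor
model = LogisticRegression(C=0.5, max_iter=200)
hyperparams = model.get_params()
print("C =", hyperparams["C"], "| max_iter =", hyperparams["max_iter"])

# Model parameters do not exist until the model has been trained
has_coef_before_fit = hasattr(model, "coef_")
print("coef_ available before fit:", has_coef_before_fit)

# Model parameters: learned from the data by optimizing the loss
model.fit(X, y)
print("Learned weights shape:", model.coef_.shape)
print("Learned intercept shape:", model.intercept_.shape)
```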

In general, this process includes:

1. Define a model
2. Define the range of possible values for all hyperparameters
3. Define a method for sampling hyperparameter values
4. Define a cross-validation method
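The four steps above can be sketched as a small hand-written search loop. The decision tree, the candidate values, and the helper name `make_model` are assumptions for illustration; the sampling method used here is simply "try every combination".

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 1. Define a model (as a factory, so each candidate gets a fresh instance)
def make_model(**hyperparams):
    return DecisionTreeClassifier(random_state=0, **hyperparams)

# 2. Define the range of possible values for all hyperparameters
search_space = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

# 3. Define a method for sampling hyperparameter values (here: every combination)
candidates = [dict(zip(search_space, values))
              for values in product(*search_space.values())]

# 4. Define a cross-validation method (5-fold)
cv = 5

# Score every candidate and keep the best one
scored = [(cross_val_score(make_model(**hp), X, y, cv=cv).mean(), hp)
          for hp in candidates]
best_score, best_hp = max(scored, key=lambda item: item[0])
print("Best hyperparameters:", best_hp, "| mean CV accuracy:", round(best_score, 3))
```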

### Hyperparameter Tuning Methods

Recall from step 3 above that hyperparameter tuning methods differ in how we sample possible model architecture candidates from the space of possible hyperparameter values. This is often referred to as "searching" the hyperparameter space for the optimum values.

In the following visualization, the **x** and **y** dimensions represent two hyperparameters, and the **z** dimension represents the model's score (defined by some evaluation metric) for the architecture defined by **x** and **y**.

<img src="https://www.jeremyjordan.me/content/images/2017/11/hyperparameter_space.png" width=500>

If we had access to such a plot, choosing the ideal hyperparameter combination would be trivial. However, calculating such a plot at the granularity visualized above would be prohibitively expensive. Thus, we are left to blindly explore the hyperparameter space in hopes of locating the hyperparameter values which lead to the maximum score.

For each method, I'll discuss how to search for the optimal structure of a Random Forest classifier:
- How many **estimators** (i.e., decision trees) should I use?
- What should be the maximum allowable **depth** for each decision tree?

#### Grid Search
Grid search is arguably the most basic hyperparameter tuning method. With this technique, we simply build a model for each possible combination of all of the hyperparameter values provided, evaluate each model, and select the architecture which produces the best results.

For example, we would define a list of values to try for both ``n_estimators`` and ``max_depth``, and a grid search would build a model for each possible combination.

Performing grid search over the defined hyperparameter space:

In [None]:
n_estimators = [10, 50, 100, 200]
max_depth = [3, 10, 20, 40]

Would yield the following models:

In [None]:
from sklearn.ensemble import RandomForestClassifier
RandomForestClassifier(n_estimators=10, max_depth=3)
RandomForestClassifier(n_estimators=10, max_depth=10)
RandomForestClassifier(n_estimators=10, max_depth=20)
RandomForestClassifier(n_estimators=10, max_depth=40)

RandomForestClassifier(n_estimators=50, max_depth=3)
RandomForestClassifier(n_estimators=50, max_depth=10)
RandomForestClassifier(n_estimators=50, max_depth=20)
RandomForestClassifier(n_estimators=50, max_depth=40)

RandomForestClassifier(n_estimators=100, max_depth=3)
RandomForestClassifier(n_estimators=100, max_depth=10)
RandomForestClassifier(n_estimators=100, max_depth=20)
RandomForestClassifier(n_estimators=100, max_depth=40)

RandomForestClassifier(n_estimators=200, max_depth=3)
RandomForestClassifier(n_estimators=200, max_depth=10)
RandomForestClassifier(n_estimators=200, max_depth=20)
RandomForestClassifier(n_estimators=200, max_depth=40)

Each model would be fit to the training data and evaluated on the validation data. As you can see, this is an **exhaustive** sampling of the hyperparameter space and can be quite inefficient.
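To see how quickly the number of candidates grows, the grid can be enumerated with scikit-learn's `ParameterGrid`. This is a small counting sketch; the larger grid and the 5-fold setting are assumptions used to show the multiplication.

```python
from sklearn.model_selection import ParameterGrid

# The 4 x 4 grid from the example above
grid = ParameterGrid({
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [3, 10, 20, 40],
})
n_candidates = len(grid)
print("Candidate models:", n_candidates)

# Peek at the first few combinations
for params in list(grid)[:3]:
    print(params)

# Adding two more hyperparameters with two values each multiplies the grid by 4
bigger_grid = ParameterGrid({
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [3, 10, 20, 40],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
})
n_folds = 5
n_fits = len(bigger_grid) * n_folds
print("Candidates:", len(bigger_grid), "| fits with 5-fold CV:", n_fits)
```

Each additional hyperparameter multiplies the number of candidates, and cross-validation multiplies the number of fits again by the number of folds.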

<img src="https://www.jeremyjordan.me/content/images/2017/11/grid_search.gif" width=500>

To implement grid search, scikit-learn provides the **GridSearchCV** class in `sklearn.model_selection`.

The computation can take a long time, but GridSearchCV removes the manual effort of writing a loop over every combination. It builds a model for each combination of the given hyperparameter values, evaluates each one across the given cross-validation folds, ranks the results, and returns the best-performing model along with its score.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Step 1: Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=10,
                           random_state=42, n_classes=2)

# Step 2: Define the parameter grid
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [3, 10, 20, 40],
    'bootstrap': [True, False],
    'criterion': ["gini", "entropy"]
}

# Step 3: Initialize the RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Step 4: Use GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)

# Step 5: Fit the model
grid_search.fit(X, y)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)
print("Best mean Cross-validation Accuracy score: ", round(grid_search.best_score_, 3))

In [None]:
# Plot the results
def plot_grid_search(cv_results, grid_param_1, grid_param_2, name_param_1, name_param_2):
    # Initialize an empty array to store mean scores
    scores_mean = np.zeros((len(grid_param_2), len(grid_param_1)))

    # Iterate over all combinations and compute mean scores
    for i, val1 in enumerate(grid_param_1):
        for j, val2 in enumerate(grid_param_2):
            # Filter results for specific n_estimators and max_depth
            mask = (cv_results['param_' + name_param_1] == val1) & (cv_results['param_' + name_param_2] == val2)
            # Average over all bootstrap and criterion combinations
            scores_mean[j, i] = np.mean(cv_results['mean_test_score'][mask])

    # Plotting
    for idx, val in enumerate(grid_param_2):
        plt.plot(grid_param_1, scores_mean[idx, :], '-o', label=name_param_2 + ': ' + str(val))

    plt.title("Grid Search Scores (Averaged over Bootstrap and Criterion)")
    plt.xlabel(name_param_1)
    plt.ylabel('CV Average Score')
    plt.legend(loc="best")
    plt.grid(True)

# Create the figure first so the curves are drawn on it rather than on a default-sized figure
plt.figure(figsize=(8, 6))
plot_grid_search(grid_search.cv_results_, param_grid['n_estimators'], param_grid['max_depth'], 'n_estimators', 'max_depth')
plt.show()

#### Random search
Random search differs from grid search in that we no longer provide a discrete set of values to explore for each hyperparameter; rather, we provide a statistical distribution for each hyperparameter from which values may be randomly sampled.

We'll define a sampling distribution for each hyperparameter:

In [None]:
import numpy as np

min_estimators, max_estimators = 2, 400
n_estimators = np.random.randint(min_estimators, max_estimators)
low, high = 1, 40
max_depth = np.random.randint(low, high)

The grid search discussed above becomes computationally expensive as the number of hyperparameters and candidate values grows, because it tries every combination of the given hyperparameters. **Randomized Search** instead trains models on randomly sampled hyperparameter combinations, so the number of models trained is set by the search budget and is typically much **smaller** than in grid search.

In simple terms, random search trains and tests our model on a random selection of hyperparameter combinations drawn from the given ranges, rather than on every combination.
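As a small illustration of sampling from distributions, scikit-learn's `ParameterSampler` draws a fixed number of random combinations. The use of `scipy.stats.randint` and the budget of 10 draws are assumptions for this sketch.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Distributions (objects with an rvs method) or lists of values to choose from
param_distributions = {
    "n_estimators": randint(2, 400),  # integers from 2 to 399 (upper bound is exclusive)
    "max_depth": randint(1, 40),      # integers from 1 to 39
    "bootstrap": [True, False],       # lists are sampled uniformly
}

# Draw 10 random hyperparameter combinations; random_state makes the draws repeatable
samples = list(ParameterSampler(param_distributions, n_iter=10, random_state=42))
for params in samples:
    print(params)
```

The number of models to train is now fixed by `n_iter`, no matter how wide the ranges are.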

<img src="https://www.jeremyjordan.me/content/images/2017/11/random_search.gif" width=500>

To implement random search, scikit-learn provides the **RandomizedSearchCV** class in `sklearn.model_selection`.

In [None]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=10,
                           random_state=42, n_classes=2)

# Define the parameter distributions
# (distribution objects are sampled by RandomizedSearchCV using its own random_state,
#  which keeps the search reproducible)
param_distributions = {
    'n_estimators': randint(2, 400),  # integers sampled uniformly from 2 to 399
    'max_depth': randint(1, 40),      # integers sampled uniformly from 1 to 39
    'bootstrap': [True, False],
    'criterion': ["gini", "entropy"]
}

# Initialize the RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Use RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_distributions,
                                   n_iter=100, cv=5, n_jobs=-1, verbose=2, random_state=42)

# Fit the model
random_search.fit(X, y)

# Print the best parameters
print("Best parameters found: ", random_search.best_params_)
print("Best mean Cross-validation Accuracy score: ", round(random_search.best_score_, 3))

In [None]:
# Plot the results for RandomizedSearchCV
def plot_random_search(cv_results, name_param_1, name_param_2):
    scores = cv_results['mean_test_score']
    param_1_values = cv_results['param_' + name_param_1]
    param_2_values = cv_results['param_' + name_param_2]

    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(param_1_values, param_2_values, c=scores, cmap='viridis')
    plt.colorbar(scatter, label='CV Average Score')
    plt.xlabel(name_param_1)
    plt.ylabel(name_param_2)
    plt.title(f"Random Search CV Scores ({name_param_1} vs {name_param_2})")
    plt.grid(True)
    plt.show()

# Call the plotting function
plot_random_search(random_search.cv_results_, 'n_estimators', 'max_depth')

#### Grid Search vs. Random Search



|                       GridSearchCV                       |                        RandomizedSearchCV                        |
|:--------------------------------------------------------:|:----------------------------------------------------------------:|
|                   Grid is well-defined                   |                     Grid is not well-defined                     |
|           Discrete values for hyperparameters            |         Continuous values and statistical distributions          |
|        Defined size for the hyperparameter space         |                       No such restriction                        |
| Picks the best combination from the hyperparameter space |           Picks samples from the hyperparameter space            |
|                 Samples are not created                  |    Samples are created and specified by the range and n_iter     |
|       Often less efficient than RandomizedSearchCV       | Often reaches comparable or better results with fewer model fits |
|      Guided flow to search for the best combination      |                  As the name says, no guidance                   |

<div align="center">
    The pictorial representation below illustrates the difference between GridSearchCV and RandomizedSearchCV.
    <br><br>
    <img src="https://editor.analyticsvidhya.com/uploads/73200GSRS-CV.png" width=500>
</div>
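One commonly cited reason random search can do well is that, for the same number of trials, it tries more distinct values of each individual hyperparameter. The sketch below compares the two under an assumed budget of 16 trials, using the value ranges from the earlier examples; it only counts values and does not train any models.

```python
import numpy as np

rng = np.random.default_rng(0)
budget = 16

# Grid search: 4 x 4 = 16 trials, but only 4 distinct values per hyperparameter
grid_points = [(n, d) for n in [10, 50, 100, 200] for d in [3, 10, 20, 40]]

# Random search: 16 trials, each with freshly drawn values
random_points = list(zip(rng.integers(2, 400, size=budget),
                         rng.integers(1, 40, size=budget)))

grid_distinct_n = len({n for n, _ in grid_points})
random_distinct_n = len({int(n) for n, _ in random_points})

print("Trials (grid / random):", len(grid_points), "/", len(random_points))
print("Distinct n_estimators values (grid):  ", grid_distinct_n)
print("Distinct n_estimators values (random):", random_distinct_n)
```

If only one of the two hyperparameters has a strong effect on the score, the grid spends its 16 trials on just 4 settings of it, while the random draws cover many more.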

### Conclusion
So far we have taken a detailed look at hyperparameters from a machine learning point of view. Please remember a few things:

- Each model has its own set of hyperparameters, which must be chosen carefully and adjusted during hyperparameter tuning.
- All hyperparameters are **NOT** equally important, and there are no fixed rules for which ones matter most. Where possible, search over continuous ranges instead of a few discrete values.
- Make sure to use **K-Fold Cross Validation** during hyperparameter tuning to get a more reliable score for each hyperparameter combination.
- Choose the best-performing combination of hyperparameters to build strong results.

### Hyperparameter Optimization Libraries

- Ray Tune: Hyperparameter Optimization Framework - https://www.ray.io/
- Optuna - https://optuna.org/
- Hyperopt - https://github.com/hyperopt/hyperopt-sklearn
- Polyaxon
- Talos
- Spearmint
- GPyOpt