# Hyperparameter tuning

Hello again! Do you know what cross-validation is? We're going to use it a bit in this chapter.

You know that to train a machine learning model, you need to set some values and options that control how training behaves; these are known as hyperparameters. To give an example of these hyperparameters, take the `RandomForestRegressor` class (we'll look at the machine learning algorithms that scikit-learn offers later, so don't worry about it for now):

In [None]:
from sklearn.ensemble import RandomForestRegressor

# "gini" is a classification criterion; regressors use e.g. "squared_error"
model = RandomForestRegressor(
    n_estimators=10,
    criterion="squared_error",
    max_depth=10,
    max_leaf_nodes=100,
)


Here the hyperparameters are the number of trees, the splitting criterion, the maximum depth, and the maximum number of leaf nodes per tree.
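
As a quick way to see which hyperparameters an estimator exposes, the sketch below calls `get_params()`, which every scikit-learn estimator provides. The values passed to the constructor are illustrative assumptions, not recommendations.

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative values only; they are not tuned for any dataset.
forest = RandomForestRegressor(n_estimators=10, max_depth=10, max_leaf_nodes=100)

# get_params() returns a dict with every hyperparameter,
# including those left at their default values.
params = forest.get_params()
for name in ("n_estimators", "criterion", "max_depth", "max_leaf_nodes"):
    print(f"{name}: {params[name]}")
```

Anything that appears in this dictionary can be used as a key in a search space.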

These values have a significant impact on the model's performance and can be the difference between a poor model and one that performs well.

Although the default hyperparameters in scikit-learn classes are reasonable values, they are not necessarily optimal for all datasets or all machine learning problems. Therefore, it is important to perform a hyperparameter search to find values that maximize the model's performance on our dataset.

Conducting this search takes time and effort, but it is an investment worth making for the improvement these parameters represent in our model.

Scikit-learn offers us several options when it comes to searching for these hyperparameters systematically rather than manually.

The two main techniques are *grid search* and *random search*. Each has its advantages and disadvantages; in this book, I will discuss random search.

A small note: in scikit-learn, the built-in hyperparameter searches are always paired with cross-validation, so each candidate is scored on data it was not trained on rather than on a single lucky split.
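
To make the link with cross-validation concrete, here is a minimal sketch of what a search does for each candidate: build a model with that configuration and score it with `cross_val_score`. The synthetic dataset and the two candidate configurations are assumptions made to keep the example small and fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption, so the sketch runs quickly and offline).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Two hypothetical candidate configurations to compare.
candidates = [{"max_depth": 3}, {"max_depth": None}]

mean_scores = []
for params in candidates:
    model = RandomForestRegressor(n_estimators=20, random_state=42, **params)
    # For regressors the default score is R^2, computed once per fold.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores.append(scores.mean())
    print(params, round(scores.mean(), 3))
```

The search classes automate this loop and keep track of the best candidate for us.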

## Random search

Now, let's see an example in a regression problem.

First, let's load the dataset and split it into training and test sets:

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing_dataset = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(
    housing_dataset.data,
    housing_dataset.target,
    random_state=42,
)


Then we will create a regression model:

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()


We must define the parameter space in which we are going to search. Random search uses this space to randomly generate combinations of hyperparameters. Each combination is used to create a new instance of our `RandomForestRegressor`, which is then scored with cross-validation; the combination with the best score wins.

In [None]:
# Some of the parameters are
# commented out because they will
# take too long to execute, but you
# can try them out!

param_distributions = {
    # 'n_estimators': [100, 1000, 2000],
    # 'criterion': ["squared_error", "absolute_error", "friedman_mse"],
    # 'max_depth': [None, 10, 100],
    'max_features': ["sqrt", "log2"],
    'max_leaf_nodes': [
        None, 10, 
        # 100, 1000
    ]
}
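
To get a feel for what "randomly generate combinations" means, the sketch below draws a few combinations with `ParameterSampler` from `sklearn.model_selection`. The search space here is a hypothetical one, slightly larger than the one above, so that there is something to sample from.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical search space used only for this illustration.
space = {
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10, 100],
    "n_estimators": [100, 1000, 2000],
}

# Draw 5 of the 18 possible combinations; random_state makes the draw repeatable.
samples = list(ParameterSampler(space, n_iter=5, random_state=42))
for sample in samples:
    print(sample)
```

Each printed dictionary is one candidate configuration that would be evaluated with cross-validation.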


Next, we import the `RandomizedSearchCV` class:

In [None]:
from sklearn.model_selection import RandomizedSearchCV

We create an instance, passing it the model and the search space. `n_iter` sets how many randomly sampled combinations the search will try, `cv` sets the number of cross-validation folds, and `random_state=42` makes the sampling reproducible. Keep in mind that the search space above, as written, contains only four combinations, so scikit-learn will warn that `n_iter=10` exceeds it and evaluate just those four.

In [None]:
search = RandomizedSearchCV(model, param_distributions, n_iter=10, cv=5, random_state=42)

Finally, we call `fit` to begin the search; this receives the training data:

In [None]:
search.fit(X_train, y_train)

This will take a little while, but when finished we'll be able to access the best parameters using the `best_params_` attribute and we can evaluate the best model obtained through the `score` method:

In [None]:
print("Best hyperparameters: ", search.best_params_)
print("Test score: ", search.score(X_test, y_test))
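
Beyond the single best result, it can be useful to look at how every candidate performed. The sketch below reads the `cv_results_` attribute of a fitted search. It rebuilds a small search on synthetic data (an assumption, so it runs on its own and finishes quickly); with the `search` object from above, you would read `search.cv_results_` in the same way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the housing data (assumption, for speed).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

demo_search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    {"max_features": ["sqrt", "log2"], "max_leaf_nodes": [None, 10]},
    n_iter=4,
    cv=3,
    random_state=42,
)
demo_search.fit(X, y)

# cv_results_ is a dict of arrays with one entry per candidate.
results = demo_search.cv_results_
for i in results["rank_test_score"].argsort():
    print(
        results["rank_test_score"][i],
        results["params"][i],
        round(results["mean_test_score"][i], 3),
    )
```

Candidates whose mean scores are very close are a sign that the corresponding hyperparameter matters little on this dataset.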


## Training a model with the best parameters

To train the final model, we can take the best hyperparameters and pass them to the constructor. This creates a fresh model with the best configuration the search found and trains it on all of our training data:

In [None]:
best_model = RandomForestRegressor(**search.best_params_)

best_model.fit(X_train, y_train)
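
As an alternative to rebuilding the model by hand, note that `RandomizedSearchCV` has a `refit` parameter that defaults to `True`: after the search, it retrains the best configuration on all the data passed to `fit` and exposes the result as `best_estimator_`. The sketch below shows this on synthetic data (an assumption, to keep it self-contained).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the housing data (assumption, for speed).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

demo_search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    {"max_features": ["sqrt", "log2"], "max_leaf_nodes": [None, 10]},
    n_iter=4,
    cv=3,
    refit=True,  # the default, written out here for clarity
    random_state=42,
)
demo_search.fit(X_tr, y_tr)

# The refit model is an ordinary, already trained estimator.
refit_model = demo_search.best_estimator_
print(refit_model.score(X_te, y_te))
print(demo_search.score(X_te, y_te))  # delegates to the refit model
```

Both lines print the same test score, because `score` on the search object simply forwards to `best_estimator_`.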


```{hint}
As an exercise, practice using a grid search with `GridSearchCV`. Be careful when using too many parameters, because grid search takes time to execute.
```

## No guarantee of the best solution

It's important to note that hyperparameter search does not guarantee finding the optimal set of hyperparameters for a given model. The optimal combination may lie outside the manually specified search space, and a random search may simply never sample it. Therefore, treat hyperparameter search as an iterative process: inspect the results, adjust the search space, and search again.
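
One possible way to iterate is sketched below: run a first search over a wide range, then a second one over a narrower range around the best value found. The ranges, the synthetic data, and the use of `scipy.stats.randint` to describe integer ranges are all assumptions chosen for illustration; the second pass is not guaranteed to score better than the first.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (an assumption, so the sketch runs quickly and offline).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)


def run_search(space):
    """Run a small random search over the given space and return it fitted."""
    search = RandomizedSearchCV(
        RandomForestRegressor(n_estimators=20, random_state=42),
        space,
        n_iter=5,
        cv=3,
        random_state=42,
    )
    return search.fit(X, y)


# First pass: a wide, hypothetical range (randint's upper bound is exclusive).
first = run_search({"max_depth": randint(2, 30)})
best_depth = int(first.best_params_["max_depth"])

# Second pass: a narrower range centred on the best depth found so far.
low, high = max(2, best_depth - 3), best_depth + 4
second = run_search({"max_depth": randint(low, high)})

print("first pass: ", first.best_params_, round(first.best_score_, 3))
print("second pass:", second.best_params_, round(second.best_score_, 3))
```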

## In conclusion

Hyperparameter search is a crucial step when you want to get the most out of your data. In scikit-learn, this search is strongly linked to cross-validation, although in theory, they are two independent concepts.

Scikit-learn offers two main classes for hyperparameter search: `GridSearchCV` and `RandomizedSearchCV`. The first performs an exhaustive search over all possible combinations of the specified hyperparameter values, while the second evaluates a random subset of those combinations. In general, `RandomizedSearchCV` can be more efficient than `GridSearchCV` when the hyperparameter search space is large.
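
To put rough numbers on that difference, the sketch below counts how many model fits each approach would need for the full search space from this chapter (with the commented-out values included), assuming 5-fold cross-validation and 10 random iterations. It only counts candidates and does not train anything.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# The full search space from this chapter, with every value enabled.
space = {
    "n_estimators": [100, 1000, 2000],
    "criterion": ["squared_error", "absolute_error", "friedman_mse"],
    "max_depth": [None, 10, 100],
    "max_features": ["sqrt", "log2"],
    "max_leaf_nodes": [None, 10, 100, 1000],
}
cv_folds = 5

# Grid search tries every combination: 3 * 3 * 3 * 2 * 4 = 216 candidates.
grid_size = len(ParameterGrid(space))

# Random search tries only as many candidates as we ask for.
random_size = len(list(ParameterSampler(space, n_iter=10, random_state=42)))

print("grid search fits:  ", grid_size * cv_folds)    # 1080
print("random search fits:", random_size * cv_folds)  # 50
```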

Also remember that it's not a magical solution, and sometimes you need to iterate on choosing the best search space.