# Bayesian Hyper-parameter Search

We continue the prediction exercise from Lab 6: predicting the median house price in California districts. We'll use the pipeline as set up in that lab, with a random forest regressor (which performed best).

The two hyper-parameters we tuned last time were:
* `n_clusters`: Number of geographic clusters to create and measure distance from.
* `max_features`: A hyper-parameter of the random forest regressor that controls the number of features considered at each split during tree construction.

Recall that the randomized search had an edge over the grid search. However, the randomized search picks every hyper-parameter value at random, ignoring the results of earlier trials. In this lab, we'll try to pick later hyper-parameter values in regions that are likely to yield an improvement, based on the trials evaluated so far. As the comparison in this lab is against the randomized search, we'll skip the grid search strategy.
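To make the idea concrete, here is a minimal sketch of a model-guided search on a one-dimensional toy problem. It assumes a hypothetical `objective` function standing in for an expensive cross-validation run, a Gaussian-process surrogate with a fixed RBF kernel, and expected improvement as the rule for choosing the next point. It is an illustration only and does not reproduce how `scikit-optimize` works internally.

```python
import math
import random

def objective(x):
    # Hypothetical 1-D "validation error" curve; stands in for an expensive CV run.
    return (x - 0.7) ** 2 + 0.1 * math.sin(15 * x)

def rbf(a, b, length_scale=0.15):
    # Squared-exponential kernel: nearby inputs are assumed to have similar errors.
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def gp_predict(xs, ys, x_new, noise=1e-4):
    # Posterior mean and standard deviation of a GP fitted to the centred errors.
    y_mean = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_new = [rbf(a, x_new) for a in xs]
    alpha = solve(K, [y - y_mean for y in ys])
    v = solve(K, k_new)
    mu = y_mean + sum(k * a for k, a in zip(k_new, alpha))
    var = max(1.0 - sum(k * w for k, w in zip(k_new, v)), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    # Expected amount by which a new evaluation undercuts the best error so far.
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2))
    return (best - mu) * cdf + sigma * pdf

def bayes_search_1d(n_initial=4, n_total=12, seed=0):
    rng = random.Random(seed)
    candidates = [i / 100 for i in range(101)]
    xs = rng.sample(candidates, n_initial)  # random warm-up points
    ys = [objective(x) for x in xs]
    while len(xs) < n_total:
        best = min(ys)
        untried = [c for c in candidates if c not in xs]
        scores = [expected_improvement(*gp_predict(xs, ys, c), best) for c in untried]
        x_next = untried[scores.index(max(scores))]  # most promising candidate
        xs.append(x_next)
        ys.append(objective(x_next))
    return xs, ys

xs, ys = bayes_search_1d()
best_y = min(ys)
best_x = xs[ys.index(best_y)]
print(f"best x = {best_x:.2f}, error = {best_y:.4f}")
```

The first few points are random, as in a randomized search. Every later point is chosen where the surrogate predicts either a low error or high uncertainty, which is the trade-off the rest of this lab relies on.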

These are the steps:
1. From the last lab
   1. Download the data and split it into training and test sets
   1. Set up the pipeline
   1. Evaluate randomly picked hyper-parameter values (`n_trials = 20` below) and identify the best for comparison
1. Carry out a Bayesian hyper-parameter search using the `scikit-optimize` package
   1. Compare the CV error with that found from `RandomizedSearchCV`
1. Evaluate the best model found on the test data

First we need to install `scikit-optimize`.

In [None]:
!pip install scikit-optimize  # well established, easy to use, hyper-parameter tuning package

Collecting scikit-optimize
  Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl.metadata (9.7 kB)
Collecting pyaml>=16.9 (from scikit-optimize)
  Downloading pyaml-25.7.0-py3-none-any.whl.metadata (12 kB)
Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl (107 kB)
Downloading pyaml-25.7.0-py3-none-any.whl (26 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-25.7.0 scikit-optimize-0.10.2


The following three blocks are from the last lab.

First we download the data.

In [None]:
import numpy as np
import pandas as pd

file_url = 'https://drive.google.com/uc?export=download&id=1NL9jMOXOxmG88N_Ih5FeDPZOVmlFXcu6'
housing = pd.read_csv(file_url)


Skipping the exploratory data analysis done last time, we split the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X = housing.drop("median_house_value", axis=1)
y = housing["median_house_value"].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


Second, we set up the preprocessing pipeline, taken from the last lab.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    # Inherits BaseEstimator to use as a custom scikit-learn estimator, gaining methods like set_params and get_params.
    # Inherits TransformerMixin for compatibility with scikit-learn pipelines and the fit_transform method.
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        # Initializes the custom transformer with default parameters for clustering and RBF kernel.
        self.n_clusters = n_clusters  # Number of clusters to form.
        self.gamma = gamma  # Coefficient for RBF.
        self.random_state = random_state  # Random state for reproducibility.

    def fit(self, X, y=None, sample_weight=None):
        # Fit method to perform k-means clustering on X.
        # Creates a KMeans instance with the specified number of clusters and random state.
        self.kmeans_ = KMeans(self.n_clusters, n_init='auto', random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)  # Fits the KMeans model to the data.
        return self  # Returns self to allow method chaining.

    def transform(self, X):
        # Transform method to calculate the RBF kernel between X and the k-means cluster centers.
        # This method effectively computes the similarity of each sample in X to each cluster center.
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        # Provides output feature names for the transformation, aiding in interpretability.
        # Useful when the transformed data is passed to other estimators or inspected.
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore") # unknown values in test → 0 for all 0/1 cols.
  )

def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # a fixed string to be appended to the name of the ratio pipeline step

ratio_pipeline = make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler())

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)

default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

preprocessing = ColumnTransformer([
        ("bedrooms", ratio_pipeline, ["total_bedrooms", "total_rooms"]),
        ("rooms_per_house", ratio_pipeline, ["total_rooms", "households"]),
        ("people_per_house", ratio_pipeline, ["population", "households"]),
        ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                               "households", "median_income"]),
        ("geo", cluster_simil, ["latitude", "longitude"]),
        ("cat", cat_pipeline, ["ocean_proximity"]),
    ],
    remainder=default_num_pipeline, force_int_remainder_cols=False # one column remaining: housing_median_age
)

preprocessing


Next we append the `RandomForestRegressor` to complete the pipeline.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing, RandomForestRegressor(n_estimators=10, random_state=42))
n_trials = 20

First, carry out the randomized search and note the best hyper-parameter values found and the cross-validation (CV) error at those values.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {'columntransformer__geo__n_clusters': randint(low=3, high=100),
                  'randomforestregressor__max_features': randint(low=2, high=20),
                  }

rnd_search = RandomizedSearchCV(
    forest_reg, param_distribs, n_iter=n_trials, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42)

rnd_search.fit(X_train, y_train)

# Let's check the random search results
rnd_res = pd.DataFrame(rnd_search.cv_results_)
rnd_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
rnd_res.filter(regex = '(^param_|mean_test_score)', axis=1)


| | param_columntransformer__geo__n_clusters | param_randomforestregressor__max_features | mean_test_score |
|---|---|---|---|
| 12 | 94 | 17 | -44581.677891 |
| 9 | 78 | 13 | -44660.41107 |
| 2 | 85 | 12 | -44774.027999 |
| 1 | 74 | 8 | -44953.529312 |
| 16 | 53 | 8 | -45021.707991 |
| 14 | 64 | 13 | -45034.691755 |
| 11 | 93 | 11 | -45229.558357 |
| 6 | 90 | 13 | -45282.653753 |
| 10 | 91 | 18 | -45430.511489 |
| 0 | 54 | 16 | -45577.234942 |


Note that the score is the negative of the RMSE, hence the negative values: scikit-learn searches maximize the score, so error metrics are negated.
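As a small illustration of the sign convention, the sketch below computes an RMSE by hand on made-up numbers (the `y_true` and `y_pred` values are hypothetical) and then recovers the best RMSE from the top three `mean_test_score` values in the table above.

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the average squared residual.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical targets and predictions, for illustration only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
error = rmse(y_true, y_pred)
score = -error  # what a 'neg_root_mean_squared_error' scorer would report

# Top three mean_test_score values from the randomized search table.
scores = [-44581.677891, -44660.41107, -44774.027999]
best_score = max(scores)   # highest score ...
best_rmse = -best_score    # ... corresponds to the lowest RMSE
print(f"best score = {best_score}, best RMSE = {best_rmse}")
```

Maximizing the negated error and minimizing the error select the same trial, so the table sorted by descending score is also sorted by ascending RMSE.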

Can we do better using a Bayesian search? `BayesSearchCV` from `scikit-optimize` is almost a drop-in replacement for `RandomizedSearchCV`.

In [None]:
from skopt import BayesSearchCV
from skopt.space import Integer

# RandomForestRegressor was already imported above, so it is not imported again here.
param_distribs = {
    'columntransformer__geo__n_clusters': Integer(3, 100),  # slightly different syntax; both bounds are inclusive
    'randomforestregressor__max_features': Integer(2, 20),  # sampled uniformly by default
}

bayes_search = BayesSearchCV(
    forest_reg, param_distribs, n_iter=n_trials, cv=3,
    optimizer_kwargs={'n_initial_points': 10},  # the default, made explicit here:
      # the first 10 points are picked at random, the rest by Bayesian optimization
    scoring='neg_root_mean_squared_error', random_state=42)

bayes_search.fit(X_train, y_train)

# Let's check the bayesian search results
bayes_res = pd.DataFrame(bayes_search.cv_results_)
bayes_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
bayes_res.filter(regex = '(^param_|mean_test_score)', axis=1)



| | param_columntransformer__geo__n_clusters | param_randomforestregressor__max_features | mean_test_score |
|---|---|---|---|
| 11 | 100 | 17 | -44598.780703 |
| 8 | 96 | 15 | -44731.334019 |
| 10 | 100 | 20 | -44766.235454 |
| 12 | 100 | 16 | -44878.915038 |
| 18 | 95 | 20 | -44927.867077 |
| 1 | 84 | 18 | -45019.17571 |
| 14 | 100 | 11 | -45020.855959 |
| 5 | 74 | 19 | -45081.000079 |
| 7 | 56 | 19 | -45182.046503 |
| 6 | 63 | 16 | -45258.154026 |
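One difference between the two search spaces is easy to miss. `scipy.stats.randint(low, high)` excludes `high`, whereas `skopt.space.Integer(low, high)` includes both bounds. That is why `n_clusters = 100` and `max_features = 20` appear in the Bayesian results but cannot appear in the randomized ones. The sketch below uses only plain Python ranges to enumerate the reachable values under each convention; it does not call either library.

```python
# Reachable values under each convention, enumerated with plain Python ranges.
bounds = {"n_clusters": (3, 100), "max_features": (2, 20)}

only_in_bayes = {}
for name, (low, high) in bounds.items():
    randint_values = set(range(low, high))      # scipy.stats.randint: high excluded
    integer_values = set(range(low, high + 1))  # skopt.space.Integer: high included
    only_in_bayes[name] = sorted(integer_values - randint_values)

print(only_in_bayes)  # {'n_clusters': [100], 'max_features': [20]}
```

To make the two spaces identical, one option would be to use `randint(low=3, high=101)` and `randint(low=2, high=21)` on the randomized side.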


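The best CV errors are similar: an RMSE of about 44,582 for the randomized search and about 44,599 for the Bayesian search, so the Bayesian search did not improve on the randomized search in this run. Let's measure the out-of-sample prediction performance on the test data set aside in the beginning.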

In [None]:
from sklearn.metrics import root_mean_squared_error

final_predictions = bayes_search.best_estimator_.predict(X_test)
final_rmse = root_mean_squared_error(y_test, final_predictions)
print(f'The RMSE of the selected model is {final_rmse:.0f}.')

The RMSE of the selected model is 44006.


The test error can sometimes be a little lower than the cross-validation error (here about 44,006 versus 44,599) because the final model was trained on the entire training data, whereas for each cross-validation fold the model was trained on only 2/3 of the training data.
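The 2/3 figure follows directly from using `cv=3`. The sketch below works through the arithmetic for a hypothetical training-set size (`n_train` is an assumption, not the lab's actual row count), splitting the rows into folds that are as equal as possible.

```python
def kfold_sizes(n_samples, n_splits):
    # Validation-fold sizes when folds are as equal as possible:
    # the first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_train = 15_000  # hypothetical training-set size, for illustration only
fit_fractions = []
for fold, n_val in enumerate(kfold_sizes(n_train, 3)):
    n_fit = n_train - n_val  # rows left for fitting in this fold
    fit_fractions.append(n_fit / n_train)
    print(f"fold {fold}: fit on {n_fit} rows, validate on {n_val} rows")

print(f"final refit uses all {n_train} rows")
```

Each CV model sees two of the three folds, while the refitted best estimator sees all three, so it has half again as much data to learn from.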