

The goal is to find the set of hyperparameters that maximizes the
generalization performance, estimated by cross-validation on a training set.

In [64]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

In [65]:
data.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


### Defining a regression pipeline:
* with a `StandardScaler` to normalize the numerical data;
* with `sklearn.neighbors.KNeighborsRegressor` as a predictive model.

In [66]:
from sklearn import set_config
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

set_config(display="diagram")  # render the pipeline as a diagram when displayed

scaler = StandardScaler()
model = make_pipeline(scaler, KNeighborsRegressor())
model

In [67]:
model.get_params()

{'memory': None,
 'steps': [('standardscaler', StandardScaler()),
  ('kneighborsregressor', KNeighborsRegressor())],
 'verbose': False,
 'standardscaler': StandardScaler(),
 'kneighborsregressor': KNeighborsRegressor(),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'kneighborsregressor__algorithm': 'auto',
 'kneighborsregressor__leaf_size': 30,
 'kneighborsregressor__metric': 'minkowski',
 'kneighborsregressor__metric_params': None,
 'kneighborsregressor__n_jobs': None,
 'kneighborsregressor__n_neighbors': 5,
 'kneighborsregressor__p': 2,
 'kneighborsregressor__weights': 'uniform'}

### Hyperparameter tuning 

We use `RandomizedSearchCV` with `n_iter=20` to find the best set of
hyperparameters by tuning the following parameters of the `model`:

- the parameter `n_neighbors` of the `KNeighborsRegressor` with values
  `np.logspace(0, 3, num=10).astype(np.int32)`;
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values
  `True` or `False`.
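
As a quick check of what this search space contains, the sketch below prints the candidate values for `n_neighbors` and counts the distinct combinations. The variable names `candidates` and `n_combinations` are illustrative and not part of the original notebook.

```python
import numpy as np

# Ten values evenly spaced on a log scale between 10**0 and 10**3,
# truncated to integers by the cast.
candidates = np.logspace(0, 3, num=10).astype(np.int32)
print(candidates.tolist())

# 10 values of n_neighbors x 2 values of with_mean x 2 values of with_std.
n_combinations = len(candidates) * 2 * 2
print(n_combinations)
```

Since every parameter is given as a finite list of values, a search with `n_iter=20` evaluates half of these combinations.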


In [68]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

model_random_search = RandomizedSearchCV(
    model, param_distributions=param_distributions, n_iter=20,
    n_jobs=2, verbose=1, random_state=1)

model_random_search.fit(data_train, target_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [69]:
model_random_search.best_params_

{'standardscaler__with_std': True,
 'standardscaler__with_mean': False,
 'kneighborsregressor__n_neighbors': 10}

### Parallel coordinate plot

To simplify the axes of the plot, we rename the columns of the dataframe and select only the mean test score and the hyperparameter values.

In [70]:
import pandas as pd

cv_results = pd.DataFrame(model_random_search.cv_results_)

column_name_mapping = {
    "param_kneighborsregressor__n_neighbors": "n_neighbors",
    "param_standardscaler__with_mean": "centering",
    "param_standardscaler__with_std": "scaling",
    "mean_test_score": "mean test score",
}

cv_results = cv_results.rename(columns=column_name_mapping)
cv_results = cv_results[column_name_mapping.values()].sort_values(
    "mean test score", ascending=False)

In addition, the parallel coordinate plot from plotly expects all data to be numeric. Thus, we convert the boolean indicators of whether the data were centered or scaled into integers, mapping `True` to 1 and `False` to 0. As `n_neighbors` has `dtype=object`, we also convert it explicitly to an integer.

In [71]:
column_scaler = ["centering", "scaling"]
cv_results[column_scaler] = cv_results[column_scaler].astype(np.int64)
cv_results["n_neighbors"] = cv_results["n_neighbors"].astype(np.int64)
cv_results

Unnamed: 0,n_neighbors,centering,scaling,mean test score
17,10,0,1,0.687926
18,4,0,1,0.674812
6,46,0,1,0.668778
9,100,0,1,0.648317
16,2,1,1,0.629772
15,215,1,1,0.617295
12,215,0,1,0.617295
10,464,1,1,0.567164
0,1,0,1,0.508809
13,1000,1,1,0.486503


In [74]:
import plotly.express as px

fig = px.parallel_coordinates(
    cv_results,
    color="mean test score",
    dimensions=["n_neighbors", "centering", "scaling", "mean test score"],
    color_continuous_scale=px.colors.diverging.Tealrose,
)
fig.show()

### Analysis

Selecting the best performing models (i.e. with a mean test R² score above ~0.68), we observe that in this case:

- scaling the data is important. All the best performing models use scaled features;

- centering the data does not have a strong impact. Both approaches, centering and not centering, can lead to good models;

- using a moderate number of neighbors works best, while using too few or too many hurts performance. In particular, no pipeline with `n_neighbors=1` or `n_neighbors=1000` can be found among the best models. However, scaling the features has an even stronger impact than the choice of `n_neighbors` in this problem.
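
One plausible reason centering matters so little here is that a nearest-neighbors model depends on the data only through distances between samples, and subtracting the same vector from every sample leaves those distances unchanged. This is consistent with the table above, where the two candidates with `n_neighbors=215` that differ only in centering have the same mean test score. The sketch below checks the invariance on synthetic data; the helper `euclidean_distance_matrix` and the data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small synthetic dataset with a non-zero mean (illustration only).
X = rng.normal(loc=50.0, scale=3.0, size=(6, 2))


def euclidean_distance_matrix(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Centering subtracts the per-column mean from every row.
X_centered = X - X.mean(axis=0)

# The shift cancels out in each pairwise difference, so distances match.
same = np.allclose(
    euclidean_distance_matrix(X), euclidean_distance_matrix(X_centered)
)
print(same)
```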

In this case, the models with scaled features perform better than the models with non-scaled features because all the variables are expected to be predictive and we would rather avoid some of them being comparatively ignored.
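
To illustrate what "comparatively ignored" means for a distance-based model, the sketch below measures how much each feature contributes to the squared Euclidean distances between samples, before and after standardization. The data are synthetic and the helper `squared_distance_share` is a hypothetical name introduced for this example; the standardization is written out with NumPy and is assumed to match what `StandardScaler` does with its default settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Two synthetic features on very different scales (illustrative values only).
small = rng.normal(loc=0.0, scale=1.0, size=n_samples)
large = rng.normal(loc=0.0, scale=1_000.0, size=n_samples)
X = np.column_stack([small, large])


def squared_distance_share(A):
    """Fraction of the summed squared pairwise distances due to each feature."""
    diff = A[:, None, :] - A[None, :, :]
    per_feature = (diff ** 2).sum(axis=(0, 1))
    return per_feature / per_feature.sum()


# Standardize each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

share_raw = squared_distance_share(X)
share_scaled = squared_distance_share(X_scaled)
print("share per feature without scaling:", share_raw)
print("share per feature with scaling:   ", share_scaled)
```

Without scaling, the large-scale feature accounts for nearly all of the distance, so the neighbors are chosen almost regardless of the other feature. After standardization both features contribute equally.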

If the variables on smaller scales were not predictive, one could see performance decrease after scaling the features: noisy features would contribute more to the prediction after scaling, and scaling would therefore increase overfitting.
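
The following sketch reproduces this situation on synthetic data, not on the California housing dataset: one predictive feature on a large scale and five pure-noise features on a small scale. All names and numeric values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500

# One predictive feature on a large scale ...
informative = rng.normal(scale=100.0, size=(n_samples, 1))
# ... and five pure-noise features on a much smaller scale.
noise = rng.normal(scale=0.01, size=(n_samples, 5))
X = np.hstack([informative, noise])
# The target depends only on the informative feature.
y = informative.ravel() + rng.normal(scale=1.0, size=n_samples)

raw_model = KNeighborsRegressor(n_neighbors=5)
scaled_model = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=5)
)

# Mean R2 score over the default 5-fold cross-validation.
score_raw = cross_val_score(raw_model, X, y).mean()
score_scaled = cross_val_score(scaled_model, X, y).mean()
print(f"without scaling: {score_raw:.3f}")
print(f"with scaling:    {score_scaled:.3f}")
```

In this setup the unscaled model effectively ignores the noise features, whereas scaling gives them the same weight as the predictive one, so the scaled pipeline is expected to score lower.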