```
Copyright (c) Gradient Institute. All rights reserved.
Licensed under the Apache 2.0 License.
```


# Model selection demonstration

This notebook uses a standard regression dataset to demonstrate model selection for both stages of the `TwoStageRidge` model.

We also demonstrate how to make the model nonlinear in the control covariates.

In [9]:
import numpy as np

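# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook needs an older scikit-learn release to run as written.
from sklearn.datasets import load_boston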
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, BayesianRidge
from sklearn.kernel_approximation import Nystroem
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score

from twostageridge import TwoStageRidge, make_first_stage_scorer, make_combined_stage_scorer

## Load the data - Boston housing

In [3]:
data = load_boston()
X, y = data.data, data.target
N, D = X.shape

# Standardise the targets
y -= y.mean()
y /= y.std()

# Shuffle the data
pint = np.random.permutation(N)
y = y[pint]
X = X[pint, :]

# Index of the "treatment" column; the two stage model needs to be given this.
treatment_ind = 0
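
The shuffle above is unseeded, so the fold assignments, and therefore the scores reported below, can change between runs. Here is a minimal sketch of a seeded variant on synthetic stand-in arrays. The seed value, the array shapes, and the names `y_std`, `perm`, `X_shuffled` and `y_shuffled` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

# Synthetic stand-ins for the notebook's X and y (shapes are arbitrary).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# Standardise the targets without modifying the original array in place.
y_std = (y - y.mean()) / y.std()

# A seeded generator makes the permutation repeatable across runs.
rng = np.random.default_rng(42)
perm = rng.permutation(len(y_std))
X_shuffled, y_shuffled = X[perm], y_std[perm]
```

Applying the same `perm` to both arrays keeps each row of `X` aligned with its target.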

## Non-linear models

This dataset is known to have some non-linear relationships.

### TwoStageRidge

In [4]:
# TwoStageRidge + Nystroem
control_ind = np.delete(np.arange(D), treatment_ind)

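# NOTE: The treatment column is passed through untransformed, so the model
# stays linear in the treatment; only the controls get the Nystroem features.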
model = make_pipeline(
    StandardScaler(),
    ColumnTransformer([
        ("treatment", 'passthrough', [treatment_ind]),
        ("controls", Nystroem(n_components=300), control_ind)
    ]),
    TwoStageRidge(treatment_index=treatment_ind)
)
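
The pipeline above passes `treatment_index=treatment_ind` to the final estimator, so it relies on the treatment still sitting at that column position after the `ColumnTransformer`. The sketch below checks this on synthetic data. The data, the smaller `n_components`, the fixed `random_state`, and the names `X_demo`, `ct` and `Z` are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.kernel_approximation import Nystroem

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))  # synthetic stand-in data

treatment_ind = 0
control_ind = np.delete(np.arange(X_demo.shape[1]), treatment_ind)

ct = ColumnTransformer([
    ("treatment", "passthrough", [treatment_ind]),
    ("controls", Nystroem(n_components=10, random_state=0), control_ind),
])
Z = ct.fit_transform(X_demo)

# One passthrough column followed by 10 Nystroem features.
print(Z.shape)
# The treatment is untouched and sits in the first output column.
print(np.allclose(Z[:, 0], X_demo[:, treatment_ind]))
```

Transformer outputs are concatenated in the order they are listed, which is why listing the treatment first keeps it at column 0. With a non-zero `treatment_ind` the index given to the final estimator would need to be the treatment's position in the transformed array, not in the raw data.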

#### Separate stage model selection

First let's try doing model selection *separately* on each of the
stages in the two stage model.

We perform two grid searches. The first is for the first
stage of the two stage ridge regression model, and uses the
`make_first_stage_scorer` function to create a scorer that
evaluates how well the model predicts the treatment variable.
The second tunes the second stage regulariser, with the first
stage parameters held at their best values.
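
To make the idea of a first stage score concrete, here is a rough illustration using a plain `Ridge` on synthetic data. It is not the internals of `TwoStageRidge` or of `make_first_stage_scorer`, which are not shown in this notebook. It only demonstrates what it means to score a model on how well it predicts the treatment from the controls. The arrays `W` and `z`, the coefficients and the train/test split are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 3))  # synthetic control covariates
# Synthetic treatment: a linear function of the controls plus a little noise.
z = W @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Fit on the first 150 rows, score on the held-out 50.
stage1 = Ridge(alpha=0.1).fit(W[:150], z[:150])
stage1_r2 = r2_score(z[150:], stage1.predict(W[150:]))
print(f"held-out R^2 for treatment prediction: {stage1_r2:.3f}")
```

The target being scored here is the treatment, not the outcome `y`. That is the distinction the special scorer exists to make.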

In [5]:
# Model selection for stage 1
gs = GridSearchCV(
    model,
    param_grid={
        "twostageridge__regulariser1": [1e-3, 1e-2, 0.1, 1, 10],
        "columntransformer__controls__gamma": [1e-3, 1e-2, 0.1, 1.0]
    },
    cv=10,
    scoring=make_first_stage_scorer(r2_score)  # Note this special scorer function! 
)
gs.fit(X, y)
print(f"best stage-1 score R^2: {gs.best_score_:.4f}")
print(f"best stage-1 parameters: \n\t{gs.best_params_}\n")


# Model selection for stage 2
gs = GridSearchCV(
    gs.best_estimator_,
    param_grid={
        "twostageridge__regulariser2": [1e-3, 1e-2, 0.1, 1, 10]
    },
    cv=10
)
gs.fit(X, y)
print(f"best stage-2 score R^2: {gs.best_score_:.4f}")
print(f"best stage-2 parameters: \n\t{gs.best_params_}")

best stage-1 score R^2: 0.5539
best stage-1 parameters: 
	{'columntransformer__controls__gamma': 0.01, 'twostageridge__regulariser1': 0.01}

best stage-2 score R^2: 0.8679
best stage-2 parameters: 
	{'twostageridge__regulariser2': 0.001}


#### Combined model selection 

Now let's see what happens if we do model selection on both stages simultaneously.

This needs only one grid search. The `make_combined_stage_scorer` function combines
the scores of the first and second stage models in predicting the treatments and
outcomes respectively. In this case the scores are simply added.
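
Because the two scores are added, a combined R^2 is not bounded by 1 the way a single R^2 is. The sketch below shows the arithmetic with a hypothetical helper, `combined_r2`, written for this illustration. It is not the library's scorer, and the toy arrays are made up.

```python
import numpy as np
from sklearn.metrics import r2_score


def combined_r2(z_true, z_pred, y_true, y_pred):
    """Hypothetical helper: treatment R^2 plus outcome R^2."""
    return r2_score(z_true, z_pred) + r2_score(y_true, y_pred)


z_true = np.array([0.0, 1.0, 2.0, 3.0])  # toy treatments
y_true = np.array([1.0, 3.0, 2.0, 5.0])  # toy outcomes

# Perfect predictions at both stages give the upper bound of 2.
best = combined_r2(z_true, z_true, y_true, y_true)

# Predicting the mean at both stages gives R^2 = 0 for each, so 0 in total.
baseline = combined_r2(
    z_true, np.full(4, z_true.mean()),
    y_true, np.full(4, y_true.mean()),
)
print(best, baseline)
```

A combined score between 1 and 2 therefore does not indicate a bug. It just means the two stages together explain more than one stage could on its own scale.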

In [7]:
# Joint model selection for both stages
gs = GridSearchCV(
    model,
    param_grid={
        "twostageridge__regulariser1": [1e-3, 1e-2, 0.1, 1, 10],
        "twostageridge__regulariser2": [1e-3, 1e-2, 0.1, 1, 10],
        "columntransformer__controls__gamma": [1e-3, 1e-2, 0.1, 1.0],
    },
    cv=10,
    scoring=make_combined_stage_scorer(r2_score)  # Note this special scorer function! 
)
gs.fit(X, y)

print(f"best combined score R^2: {gs.best_score_:.4f}")
print(f"best parameters: \n\t{gs.best_params_}")

best combined score R^2: 1.4258
best parameters: 
	{'columntransformer__controls__gamma': 0.1, 'twostageridge__regulariser1': 1, 'twostageridge__regulariser2': 0.01}


Let's get an approximate idea of how each stage performs in this model.

In [14]:
scores = cross_validate(gs.best_estimator_, X, y, cv=10, scoring=make_first_stage_scorer(r2_score))
print(f"First stage R^2: {scores['test_score'].mean():.4f}")

scores = cross_validate(gs.best_estimator_, X, y, cv=10, scoring="r2")
print(f"Second stage R^2: {scores['test_score'].mean():.4f}")

First stage R^2: 0.5523
Second stage R^2: 0.8612
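
As a quick check, the two per-stage scores should roughly add up to the combined score found by the grid search. The figures below are copied from the cell outputs above.

```python
# Figures copied from the cell outputs above.
first_stage_r2 = 0.5523
second_stage_r2 = 0.8612
combined_best_score = 1.4258

stage_sum = first_stage_r2 + second_stage_r2
gap = combined_best_score - stage_sum
print(f"sum of per-stage R^2: {stage_sum:.4f}")      # 1.4135
print(f"gap to grid-search score: {gap:.4f}")        # 0.0123
```

The sum is close to, but not exactly, the grid-search score. One plausible reason is that `Nystroem` draws a random subset of the training points on each fit and no `random_state` is set in the pipeline, so refitting gives slightly different features. This is an inference from the code, not something the notebook verifies.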


### BayesianRidge

For comparison, we fit a `BayesianRidge` regressor, which uses type II maximum likelihood to learn its regularisation prior. This model is completely non-linear, unlike the two stage models above, which are non-linear only in the control covariates.

In [4]:
# BayesianRidge + Nystroem
model = make_pipeline(
    StandardScaler(),
    Nystroem(n_components=300),
    BayesianRidge()
)

gs = GridSearchCV(model, param_grid={"nystroem__gamma": [1e-3, 1e-2, 0.1, 1.0]}, cv=10)
gs.fit(X, y)
print(f"best score R^2: {gs.best_score_:.4f}")
print(f"best parameters: \n\t{gs.best_params_}")

best score R^2: 0.8834
best parameters: 
	{'nystroem__gamma': 0.1}
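
Only `gamma` is searched in the cell above because `BayesianRidge` estimates its own regularisation from the data. A minimal sketch on synthetic data shows the fitted precisions it exposes. The data and the names `X_demo`, `y_demo` and `br` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # synthetic features
true_coef = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y_demo = X_demo @ true_coef + 0.1 * rng.normal(size=100)

br = BayesianRidge().fit(X_demo, y_demo)

# alpha_ is the estimated noise precision,
# lambda_ is the estimated precision of the weights.
print(f"noise precision alpha_:   {br.alpha_:.3f}")
print(f"weight precision lambda_: {br.lambda_:.3f}")
```

Both values are learned during `fit`, which is why the comparison model needs no grid over a regulariser, while the two stage model above searches over `regulariser1` and `regulariser2`.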
