<h1>CS4618: Artificial Intelligence I</h1>
<h1>Model Selection</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_absolute_error

from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from joblib import dump

In [4]:
# Use pandas to read the CSV file into a DataFrame
df = pd.read_csv("../datasets/dataset_corkA.csv")

# Shuffle the dataset
df = df.sample(frac=1, random_state=2)
df.reset_index(drop=True, inplace=True)

# The features we want to select
features = ["flarea", "bdrms", "bthrms"]

# Extract the features but leave as a DataFrame
X = df[features]

# Target values, converted to a 1D numpy array
y = df["price"].values

<h1>Parameters and Hyperparameters</h1>
<ul>
    <li>In machine learning, we distinguish between <b>parameters of the model</b> and <b>hyperparameters of
        the learning algorithm</b>.
    </li>
    <li><b>Parameters</b> are the variables whose values the learning algorithm is trying to find:
        <ul>
            <li>They define the model that the learning algorithm outputs.</li>
            <li>E.g. in the case of linear regression, $\v{\beta}_0, \v{\beta}_1,\ldots,\v{\beta}_n$ are 
                the parameters.
            </li>
        </ul>
    </li>
    <li><b>Hyperparameters</b> are the variables whose values are set by us:
        <ul>
            <li>These values are inputs to the learning algorithm.</li>
            <li>E.g. the number of neighbours, $k$, for the kNN algorithm. (In scikit-learn's
                <code>KNeighborRegression</code> class, this variable is called <code>n_neighbors</code>.)
            </li>
            <li>E.g. if you are using Gradient Descent for linear regression, then the learning rate
                ($\alpha$) and the number of iterations are hyperparameters. (sickit-learn's
                <code>SGDRegressor</code> class calls these <code>learning_rate</code> and
                <code>max_iter</code>.) If you are also using simulated annealing, then there may be
                further hyperparameters.
            </li>
        </ul>
    </li>
    <li>Since the learning algorithm does not set them, 
        <ul>
            <li>how will we choose the values for hyperparameters that we use 
                for error estimation?
            </li>
            <li>and what values will we use in our final model?</li>
        </ul>
    </li>
</ul>

<h1>Error Estimation and Model Selection</h1>
<ul>
    <li>The process by which we find good values for the hyperparameters is called <b>model selection</b>.</li>
    <li>The way we evaluate a particular model (to estimate how it will perform on unseen examples) is
        called <b>error estimation</b>.
    </li>
    <li>So what we want to do is use model selection to find good hyperparameter values and then use
        error estimation to estimate future performance.
    </li>
</ul>

<h1>Model Selection and Error Estimation Done Wrong!</h1>
<ul>
    <li>It is tempting, <em>but wrong</em>, to do the following:
        <ul>
            <li>Randomly partition the dataset into training set and test set (e.g. 80%-20%).</li>
            <li>Train the predictor on the training set using one set of hyperparameter values; 
                test it on the test set. 
            </li>
            <li>If not happy with the MAE, train the predictor on the training set using a different 
                set of hyperparameter values; test it on the test set.
            </li>
            <li>Keep doing this until you are satisfied.</li>
        </ul>
    </li>
</ul>

In [5]:
# Create the object that shuffles and splits the data
ss = ShuffleSplit(n_splits=1, train_size=0.8, random_state=2)

In [6]:
# Create a preprocessor
preprocessor = ColumnTransformer([
        ("scaler", StandardScaler(), features)], 
        remainder="passthrough")

In [8]:
# Create a pipeline that combines the preprocessor with 1NN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=1))])

# Error estimation for k=1
cross_val_score(knn_model, X, y, scoring="neg_mean_absolute_error", cv=ss)

array([-91.6344086])

In [9]:
# Create a pipeline that combines the preprocessor with 2NN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=2))])

# Error estimation for k=2
cross_val_score(knn_model, X, y, scoring="neg_mean_absolute_error", cv=ss)

array([-74.81182796])

In [10]:
# Create a pipeline that combines the preprocessor with 3NN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=3))])

# Error estimation for k=3
cross_val_score(knn_model, X, y, scoring="neg_mean_absolute_error", cv=ss)

array([-69.84587814])

<ul>
    <li>And so on until we're confident we've found a good value for $k$.</li>
</ul>

<ul>
    <li>It is just as wrong to use $k$-fold cross-validation in the same way.</li>
</ul>

In [11]:
# Create a pipeline that combines the preprocessor with 1NN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=1))])

# Error estimation for k=1
np.mean(cross_val_score(knn_model, X, y, scoring="neg_mean_absolute_error", cv=10))

-83.64074930619797

In [12]:
# Create a pipeline that combines the preprocessor with 2NN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=2))])

# Error estimation for k=2
np.mean(cross_val_score(knn_model, X, y, scoring="neg_mean_absolute_error", cv=10))

-73.79824236817763

In [13]:
# Create a pipeline that combines the preprocessor with 3NN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=3))])

# Error estimation for k=3
np.mean(cross_val_score(knn_model, X, y, scoring="neg_mean_absolute_error", cv=10))

-72.60975948196116

<h2>Why is this wrong?</h2>
<ul>
    <li>It's an example of <b>leakage</b>: 
        <ul>
            <li>Information about the test set is being used to develop the model.</li>
        </ul>
    </li>
    <li>This means that error estimation will often be over optimistic.</li>
</ul>

<h1>Model Selection and Error Estimation Done Right!</h1>
<ul>
    <li>The simplest approach is to randomly partition the dataset into three, e.g. 60%-20%-20%:
        <ul>
            <li>training set;</li>
            <li><b>validation set</b>;</li>
            <li>test set.</li>
        </ul>
    </li>
    <li>Compute the <b>validation errors</b>:
        <ul>
            <li>Train the predictor on the training set using one set of hyperparameter values; 
                test it on the <em>validation set</em>.
            </li>
            <li>If not happy with the MAE, train the predictor on the training set using a different 
                set of hyperparameter values; 
                test it on the <em>validation set</em>.
            </li>
            <li>Keep doing this until you are satisfied.</li>
        </ul>
    </li>
    <li>Model selection: choose the hyperparameter values that gave you the lowest validation error.
    </li>
    <li>Error estimation: using these hyperparameter values, train the predictor on the union of the training set 
        and validation set; test this model 
        on the <em>test set</em>.
    </li>
    <li>You can tweak and tune your model as much as you like based on validation error. But, during this
        process, you must never use the test set. Only when you've done tweaking and tuning should you 
        test your chosen model on the test set.
    </li>
    <li>Of course, this method requires an even bigger dataset: one we can split into three large enough 
        partitions!
    </li>
</ul>

In [14]:
# Split off the test set: 20% of the dataset.
dev_df, test_df = train_test_split(df, train_size=0.8, random_state=2)

In [15]:
# Extract the features but leave as a DataFrame
dev_X = dev_df[features]
test_X = test_df[features]

# Target values, converted to a 1D numpy array
dev_y = dev_df["price"].values
test_y = test_df["price"].values

In [16]:
# Create the object that shuffles and splits the dev data
# Why 0.75? Because 0.75 of 80% of the data is 60% of the original dataset.
ss = ShuffleSplit(n_splits=1, train_size=0.75, random_state=2)

In [17]:
# Create a pipeline that combines the preprocessor with 1NN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=1))])

# Error estimation for k=1
cross_val_score(knn_model, dev_X, dev_y, scoring="neg_mean_absolute_error", cv=ss)

array([-88.97849462])

In [None]:
# Repeat the previous cell but with different values for k
# Let's suppose k=3 is the one with the lowest MAE

In [18]:
# So now with k=3, re-train on the train+validation sets and test on the test set

knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=3))])

knn_model.fit(dev_X, dev_y)

# Error estimation on the test set.
mean_absolute_error(test_y, knn_model.predict(test_X))

69.84587813620071

<ul>
    <li>Suppose your dataset is not large enough to split into three in this way.</li>
    <li>In this case, it is common to use $k$-fold cross-validation for computing the validation errors.</li>
    <li>If so, replace <code>cv=ss</code> with <code>cv=10</code>.</li>
</ul>

In [22]:
# Create a pipeline that combines the preprocessor with 1NN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=1))])

# Error estimation for k=1
np.mean(cross_val_score(knn_model, dev_X, dev_y, scoring="neg_mean_absolute_error", cv=10))

-78.99495021337125

In [None]:
# Repeat the previous cell but with different values for k
# Let's suppose k=3 is the one with the lowest MAE

In [23]:
# So now with k=3, re-train on the train+validation sets and test on the test set

knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor(n_neighbors=3))])

knn_model.fit(dev_X, dev_y)

# Error estimation on the test set.
mean_absolute_error(test_y, knn_model.predict(test_X))

69.84587813620071

<h1>Grid Search</h1>
<ul>
    <li>When we want to try lots of hyperparameter values, the code becomes quite repetitive.</li>
    <li>Instead, we can specify the values we wish to try for each hyperparameter, and <b>grid search</b>
        will try all <em>combinations</em> of these values.
    </li>
</ul>

In [27]:
# Grid Search (using holdout for the validation errors)

# Create a pipeline that combines the preprocessor with kNN
knn_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", KNeighborsRegressor())])

# Create a dictionary of hyperparameters and values to try
param_grid = {"predictor__n_neighbors" : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

# Create the grid search object which will find the best hyperparameter values based on validation error
gs = GridSearchCV(knn_model, param_grid, scoring="neg_mean_absolute_error", cv=ss)

# Run grid search by calling fit
gs.fit(dev_X, dev_y)

In [28]:
# We can then find out all results (gs.cv_results_) or best result (gs.best_score_) or 
# best hyperparameter values (gs.best_params_)

gs.best_params_, gs.best_score_

({'predictor__n_neighbors': 10}, -69.64193548387097)

<ul>
    <li>So now we have done model selection, we can do error estimation.</li>
</ul>

In [29]:
# Now we re-train on train+validation and test on the test set
knn_model.set_params(**gs.best_params_) 
knn_model.fit(dev_X, dev_y)
mean_absolute_error(test_y, knn_model.predict(test_X))

63.77526881720429

<ul>
    <li>In fact, we can avoid two lines of code, by including <code>refit=True</code> as an argument to the <code>GridSearchCV</code>. This means it retrains using the best parameters for us.</li>
    <li>The simplified code is:</li>
</ul>

In [None]:
# Create the grid search object which will find the best hyperparameter values based on validation error
gs = GridSearchCV(knn_model, param_grid, scoring="neg_mean_absolute_error", cv=ss, refit=True)

# Run grid search by calling fit. It will also re-train on train+validation using the best parameters.
gs.fit(dev_X, dev_y)

In [30]:
# Now test on the test set
mean_absolute_error(test_y, gs.predict(test_X))

63.77526881720429

<ul>
    <li>Again, if our dataset is too small to split into three, we might prefer to use $k$-fold cross-validation 
        for computing the validation errors in model selection.
    </li>
</ul>

In [34]:
# Grid Search (using k-fold cross-validation for the validation errors)

# Create the grid search object which will find the best hyperparameter values based on validation error
gs = GridSearchCV(knn_model, param_grid, scoring="neg_mean_absolute_error", cv=10, refit=True)

# Run grid search by calling fit. It will also re-train on train+validation using the best parameters
gs.fit(dev_X, dev_y)

In [35]:
# Now test on the test set
mean_absolute_error(test_y, gs.predict(test_X))

67.76344086021506

<h1>Randomized Search</h1>
<ul>
    <li>When the number of combinations of hyperparameter values is high, Grid Search's exhaustive
        approach may take too long.
    </li>
    <li>We can instead use Randomized Search:
        <ul>
            <li>Replace <code>GridSearchCV</code> by <code>RandomizedSearchCV</code>.</li>
            <li>Supply an extra argument to <code>RandomizedSearchCV</code>: <code>n_iter</code>
                is how many combinations to try.
            </li>
        </ul>
    </li>
</ul>

In [31]:
# Randomized Search (using k-fold cross-validation for the validation errors)

# Create the randomized search object which will find the best hyperparameter values based on validation error. 
rs = RandomizedSearchCV(knn_model, param_grid, scoring="neg_mean_absolute_error", cv=10, n_iter=5, random_state=2, refit=True)

# Run grid search by calling fit. It will also re-train on train+validation using the best parameters.
rs.fit(dev_X, dev_y)

In [32]:
# Now test on the test set
mean_absolute_error(test_y, gs.predict(test_X))

63.77526881720429

<h1>Final Steps</h1>
<ul>
    <li>So now we could do error estimation for linear regression. (There's no model selection
        because there are no hyperparameters.) This will enable us to see which predictor is better:
        kNN or linear regression.
    </li>
    <li>Make sure you use the same splits for this, otherwise the two will not be comparable.
</ul>

In [33]:
# Error estimation for linear regression

linear_model = Pipeline([
    ("preprocessor", preprocessor),
    ("predictor", LinearRegression())])
linear_model.fit(dev_X, dev_y)
mean_absolute_error(test_y, linear_model.predict(test_X))

66.01046377907335

<ul>
    <li>The winner may change, depending on the randomness.</li>
    <li>Let's say that linear regression is slightly better. So this is the model we might deploy.</li>
    <li>If we want to deploy it, train it on the entire dataset and save the model.</li>
</ul>

In [None]:
linear_model.fit(X, y)
dump(linear_model, 'models/my_model.pkl') # For this to work, create a folder called models!

<h1>Nested $k$-Fold Cross Validation (Advanced!)</h1>
<ul>
    <li><strong>You can ignore the rest of this notebook</strong>.</li>
    <li>Above we took two approaches depending on the size of our dataset:
        <ul>
            <li>Big dataset: split off a test set (using <code>train_test_split</code>) and
                split the rest into a training and validation set (using
                <code>ShuffleSplit</code>).
            </li>
            <li>Dataset not quite so big: split off a test set (using <code>train_test_split</code>) and
                then obtain multiple training and validation sets using k-fold cross-validation.
            </li>
        </ul>
    </li>
    <li>It is natural to wonder whether, for even smaller datasets, we could use k-fold-cross-validation
        to obtain our test sets as well as our validation sets.
    </li>
    <li>The answer is: kind-of.</li>
    <li>This is called nested $k$-fold cross-validation: it's like a nested for-loop.
</ul>

In [None]:
# Create the grid search object which will find the best hyperparameter values based on validation error
gs = GridSearchCV(knn_model, param_grid, scoring="neg_mean_absolute_error", cv=10)

# Run grid search repeatedly by using cross-val_score
np.mean(cross_val_score(gs, X, y, scoring="neg_mean_absolute_error", cv=10))

<ul>
    <li>There's a very subtle catch: this can only be used for error estimation and not for
        model selection.
        <ul>
            <li>The reason is: a possibly different model (set of hyperparamter values) is selected
                for each of the outer folds.
            </li>
            <li>You do not end up with a single winner, so you cannot use this to tell you
                what hyperparameter values to use going forward.
            </li>
        </ul>
    </li>
    <li>So what use is it?
        <ul>
            <li>Suppose you are an academic, comparing your new whizzo learning algorithm (you hope) 
                with existing algorithms.
            </li>
            <li>You are interested in error estimation, but you don't need to know the winning hyperparameter
                values, because you have no intention of 'going live' (deploying a winning model).
            </li>
            <li>It's important you use the best hyperparameter values for your competitors, otherwise you 
                can be accused of dishonestly giving your whizzo algorithm an unfair advantage.
            </li>
            <li>As an academic, you may also have the problem of a small dataset.</li>
        </ul>
        In this case, nested $k$-fold cross-validation may be exactly what you need for making your comparisons.
    </li>
</ul>