In [2]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
%matplotlib inline

import sys
sys.path.append('../')
from lib.processing_functions import convert_to_pandas

# Model selection

## Outline

Goal: Understand modeling workflow and explain steps for model selection.

Topics:

- **Modeling workflow**: overview of the workflow
- **Model validation**: methodology for checking how well a model fits a given dataset
- **Hyperparameter optimization**: choosing a set of model parameters that optimize a model's performance (grid search)

##  Modeling workflow

In summary, an example workflow for training a predictive model consists of the following steps:

- **data splitting**: split the dataset into a train/validation set and a test set
- **model selection**: find the best model on the train data using grid search (*includes validation*)
- **model evaluation**: evaluate the performance of the final model

The output of this workflow is a model that can be used for prediction and/or inference.

In this notebook we focus on the **model selection** part. Since in the previous notebook we have already studied model fitting, we continue with model validation before we explain grid search. The next notebook will cover model evaluation.

<img src="../images/example_workflow.png" style="display: block;margin-left: auto;margin-right: auto;width: 600px"/>

## Model validation

We should not test model performance on data used for training: 

- complex models tend to **overfit** the training data;
- overfitted models **don't generalize** well to yet-unseen data;
- so we should validate on unseen data!

### Validation

The solution is to split the data into different data sets:

- **train set**: used for training the model
- **validation set**: used for hyperparameter optimization (if necessary)
- **test set**: used for evaluating the final model 

<img src="../images/data_splitting.png" style="display: block;margin-left: auto;margin-right: auto;"/>

###  Holdout validation

Simple dataset splitting is often problematic:
* we want to learn from as much data as possible
* we also want to test on a large data set
* taking a single partitioning might not be very robust
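
To make the robustness issue concrete, here is a minimal sketch that repeats the same holdout experiment with ten different random partitions of the Iris data. The names `X_demo`, `y_demo` and `holdout_scores`, the 30% test size and `max_iter=1000` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Repeat the same holdout experiment, changing only the random partition.
holdout_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

holdout_scores = np.array(holdout_scores)
print("min: {:.3f}, max: {:.3f}, std: {:.3f}".format(
    holdout_scores.min(), holdout_scores.max(), holdout_scores.std()))
```

The spread between the lowest and highest score gives a rough idea of how much a single split can mislead.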

### K-fold cross-validation

Split the data into $k$ disjoint folds:

- fit the model on the $k-1$ train folds
- estimate performance on the remaining fold
- repeat this procedure $k$ times, so that each fold serves as the validation fold once

![cv](../images/cross_validation_E.png)
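
The procedure can be sketched by hand with NumPy. This is only an illustration of the idea, assuming a shuffled 5-fold split of the Iris data; in practice scikit-learn's own cross-validation helpers would normally be used. The names `folds` and `fold_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)
k = 5

# Shuffle the sample indices once, then cut them into k disjoint folds.
rng = np.random.RandomState(0)
folds = np.array_split(rng.permutation(len(X_demo)), k)

fold_scores = []
for i in range(k):
    test_idx = folds[i]
    # All folds except the i-th one form the train set.
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print("scores per fold:", np.round(fold_scores, 3))
print("mean score: {:.3f}".format(np.mean(fold_scores)))
```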

## Hyperparameter optimization

Optimization of parameters that are not directly learned within an estimator:
* Model hyperparameters specify *how* the model learns.

Hyperparameters are optimized by changing them and observing the (cross-) validation score.

Examples of hyperparameters:

- Regularization strength for LASSO and Ridge regression
- Number and depth of trees in random forest algorithm

### Anatomy of a hyperparameter search

The search for optimal hyperparameters consists of the following components:

- a hyperparameter space
- a method for searching or sampling candidates
- a cross-validation scheme
- a score or evaluation function

### Grid search

Brute-force search over a set of possible hyperparameter candidates.

- Specify a list of values for different hyperparameters.
- Evaluate the model performance for every possible parameter combination using cross-validation.
- Optimal hyperparameter set: highest performance score.
- Evaluate your best performing model on the test data.
- Build your final production model on the whole dataset.
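
The first four steps can be sketched as an explicit loop. This is a minimal illustration, assuming a small hypothetical grid over `C` and `fit_intercept` for logistic regression; `candidate_grid`, `best_params` and the grid values are illustrative and not taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (ParameterGrid, cross_val_score,
                                     train_test_split)

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Every combination of the listed values is one candidate.
candidate_grid = ParameterGrid({'C': [0.01, 1.0, 100.0],
                                'fit_intercept': [True, False]})

best_score, best_params = -1.0, None
for params in candidate_grid:
    model = LogisticRegression(max_iter=5000, **params)
    # Cross-validate each candidate on the train data only.
    score = cross_val_score(model, X_tr, y_tr, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

# Refit the best candidate on the full train set, then evaluate on the test set.
final_model = LogisticRegression(max_iter=5000, **best_params).fit(X_tr, y_tr)
print("best params:", best_params)
print("cross-validation score: {:.3f}".format(best_score))
print("test score: {:.3f}".format(final_model.score(X_te, y_te)))
```

The test set is touched only once, after the best candidate has been chosen.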

### Hyperparameter search &  cross-validation

![hs](../images/hyperparameter-search.png)

## Tools for model selection

Scikit-learn has several tools for dataset splitting and cross-validation:

- **`train_test_split`**: simple splitting of a dataset into a train and test set
- **`cross_val_score`**: evaluate score by cross-validation 
- **`permutation_test_score`**: evaluate the significance of a cross-validated score
- **`cross_val_predict`**: generates estimates by cross-validation
- **cross-validation iterators**: large set of iterators which generate dataset splits according to different strategies

Next we discuss some of these tools in more detail.

## Simple dataset splitting
 
Function for splitting the dataset into a train and a test set:

```model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None)```

Usage details:
- an arbitrary number of arrays can be split
- `test_size`/`train_size` can be specified as
    - `None` : automatically complement the other split (default is 25% test, 75% train)
    - `float` : proportion of the dataset to be included
    - `int` : absolute number of samples to be included
- set the `random_state` to make splitting deterministic
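
A small sketch of these options on toy arrays (the arrays `features`, `targets` and `sample_ids` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(24).reshape(12, 2)   # row i is [2i, 2i + 1]
targets = np.arange(12)                   # target i belongs to row i
sample_ids = np.array(list("abcdefghijkl"))

# Several arrays are split consistently; a float test_size is a proportion.
f_tr, f_te, t_tr, t_te, id_tr, id_te = train_test_split(
    features, targets, sample_ids, test_size=0.25, random_state=0)
print(len(f_tr), len(f_te))   # 9 3

# An int test_size is an absolute number of samples.
a_tr, a_te = train_test_split(targets, test_size=4, random_state=0)
print(len(a_tr), len(a_te))   # 8 4

# The same random_state reproduces the same split.
b_tr, b_te = train_test_split(targets, test_size=4, random_state=0)
print(np.array_equal(a_te, b_te))   # True
```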

### Dataset splitting example
Sample a training set from the Iris dataset while withholding 40% of the data for testing:

In [3]:
from sklearn.model_selection import train_test_split

X, y = convert_to_pandas(datasets.load_iris())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, 
                                                    random_state=0)

print("X shape: {}, y shape: {}".format(
    X.shape, y.shape))
print("X_train shape: {}, y_train shape: {}".format(
    X_train.shape, y_train.shape))
print("X_test shape: {}, y_test shape: {}".format(
    X_test.shape, y_test.shape))

X shape: (150, 4), y shape: (150,)
X_train shape: (90, 4), y_train shape: (90,)
X_test shape: (60, 4), y_test shape: (60,)


## Cross-validated metrics

Function for evaluating a metric score by cross-validation:

   `model_selection.cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1, fit_params=None, ...)`

### Cross-validated accuracy

Compute the cross-validated accuracy of a logistic regression model on the Iris dataset using 5-fold cross-validation:

In [4]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=0.1)

scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

print("Scores per fold: {}".format(scores))
print("Mean accuracy: {0:.2f}".format(scores.mean()))

Scores per fold: [ 0.76666667  0.86666667  0.83333333  0.83333333  0.8       ]
Mean accuracy: 0.82


## Cross-validation iterators

By specifying an integer $k$ for the `cv` parameter we get the default $k$-fold cross-validation (for classifiers, scikit-learn uses the stratified variant). We can also be more specific on the type of cross-validation that we want to perform by creating a cross-validation iterator.

Many different cross-validation iterators are supported:

- `StratifiedKFold`
- `ShuffleSplit`
- `LeavePOut`
- ...

Cross-validation iterators generate sets of train and test indices that can be used for splitting the data. 
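
As a minimal sketch, an iterator can be consumed directly through its `split` method or passed as the `cv` argument of `cross_val_score`. The choice of `ShuffleSplit` with four splits and the names `splitter` and `iterator_scores` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
splitter = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)

# split() yields one (train indices, test indices) pair per split.
split_sizes = []
for train_idx, test_idx in splitter.split(X_demo):
    assert len(np.intersect1d(train_idx, test_idx)) == 0  # no overlap
    split_sizes.append((len(train_idx), len(test_idx)))
print("split sizes:", split_sizes)

# The same iterator can be handed to cross_val_score via cv.
iterator_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=splitter)
print("scores per split:", np.round(iterator_scores, 3))
```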

**StratifiedKFold**: for each fold/split the original class ratios are preserved.

In [5]:
from sklearn.model_selection import StratifiedKFold, KFold

X_data = np.array([0,0,0,0,0,0,0,0,0,0,0,0])
y_data = np.array([0,0,0,0,0,0,0,0,0,1,1,1])

stratifier = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
kfolder = KFold(n_splits=3, shuffle=True, random_state=1)

stratified_splits = stratifier.split(X_data, y_data)
regular_splits = kfolder.split(X_data, y_data)

def label_ratios(splits):
    "determine class label ratios for all splits"
    return [
        {label: sum(y_data[dset])/len(y_data[dset]) 
         for label, dset in zip(('train', 'test'), split)}
        for split in splits
    ]

print("class ratio dataset:\n{}".format(sum(y_data)/len(y_data)))
print("\nstratified splits:\n{}".format(label_ratios(stratified_splits)))
print("\n regular splits:\n{}".format(label_ratios(regular_splits)))

class ratio dataset:
0.25

stratified splits:
[{'train': 0.25, 'test': 0.25}, {'train': 0.25, 'test': 0.25}, {'train': 0.25, 'test': 0.25}]

 regular splits:
[{'train': 0.25, 'test': 0.25}, {'train': 0.375, 'test': 0.0}, {'train': 0.125, 'test': 0.5}]


**LeaveOneGroupOut**: hold out all samples belonging to one group at a time

In [6]:
from sklearn.model_selection import LeaveOneGroupOut

x_data = range(7)
labels = [2010, 2010, 
          2011, 2011, 2011, 
          2012, 2012]

logo = LeaveOneGroupOut()
logo_splits = logo.split(x_data, groups=labels)

for i, split in enumerate(logo_splits):
    print(f"{i+1} train: {split[0]}\ttest: {split[1]}\ttest labels: {[labels[x] for x in split[1]]}")

1 train: [2 3 4 5 6]	test: [0 1]	test labels: [2010, 2010]
2 train: [0 1 5 6]	test: [2 3 4]	test labels: [2011, 2011, 2011]
3 train: [0 1 2 3 4]	test: [5 6]	test labels: [2012, 2012]


### Data shuffling

Data is often ordered and shuffling may be essential to get a meaningful cross-validation result.

Note that the opposite may also be true! This is usually the case when working with time series data.

Some iterators, like `KFold`, allow shuffling of the data before splitting. However, note that:

- by default **no shuffling** occurs
- `random_state` needs to be set to make it deterministic.

Other iterators inherently shuffle the data, for example `ShuffleSplit`.
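
A toy illustration of why this matters, assuming labels that are sorted by class (the array `ordered_labels` and the helper `labels_per_test_fold` are made up for this sketch):

```python
import numpy as np
from sklearn.model_selection import KFold

ordered_labels = np.repeat([0, 1, 2], 4)   # [0 0 0 0 1 1 1 1 2 2 2 2]

def labels_per_test_fold(cv):
    """Return the labels that end up in each test fold."""
    return [ordered_labels[test_idx].tolist()
            for _, test_idx in cv.split(ordered_labels)]

# Without shuffling, every test fold contains exactly one class.
unshuffled = labels_per_test_fold(KFold(n_splits=3))
print(unshuffled)   # [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]

# With shuffling, a fixed random_state makes the folds reproducible.
shuffled = labels_per_test_fold(KFold(n_splits=3, shuffle=True, random_state=0))
shuffled_again = labels_per_test_fold(KFold(n_splits=3, shuffle=True, random_state=0))
print(shuffled == shuffled_again)   # True
```

In the unshuffled case a classifier would always be tested on a class it never saw during training.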

### Grid search example

Optimize both the type and the amount of regularization for the logistic regression model:

In [7]:
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
param_grid = {
    'C': np.logspace(-4.0, 4.0, num=10), 
    'penalty': ['l1', 'l2'] 
}

# the 'liblinear' solver supports both the 'l1' and the 'l2' penalty
clf = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=param_grid, cv=5)
clf = clf.fit(X_train, y_train)

grid_size = np.prod([len(value) for value in param_grid.values()])
print("grid size: {}".format(grid_size))
print("best score: {}".format(clf.best_score_))
print("best estimator: {}".format(clf.best_estimator_)) 

# with the default refit=True, best_estimator_ is already refit on the whole train set
best_model = clf.best_estimator_
evaluation_score = best_model.score(X_test, y_test)
print("Evaluation score on test set: {}".format(evaluation_score))

grid size: 20
best score: 0.9809523809523809
best estimator: LogisticRegression(C=166.81005372000558, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
Evaluation score on test set: 0.9777777777777777


## Estimator specific cross-validation

Some estimators have model selection via cross-validation built in:

- `ElasticNetCV`: Elastic Net model with iterative fitting along a regularization path
- `LassoCV`: Lasso linear model with iterative fitting along a regularization path
- `LogisticRegressionCV`: Logistic Regression CV (aka logit, MaxEnt) classifier
- `RidgeCV`: Ridge regression with built-in cross-validation

Most focus on optimizing the **strength of the regularizer**.

### Built-in cross-validation example

Run the logistic regression with built-in cross-validation on the whole Iris data and compute the mean score: 

In [8]:
from sklearn.linear_model import LogisticRegressionCV

scores = cross_val_score(LogisticRegressionCV(), X, y)

print("LogisticRegressionCV mean score: {}".format(scores.mean()))

LogisticRegressionCV mean score: 0.9268790849673202


## Advanced hyperparameter optimization: randomized search

A probabilistic search over discrete or continuous ranges of hyperparameters. For each entry in the `param_distributions` dict,

+ if a discrete list of values is given, values are sampled uniformly from this list (without replacement when every entry is a list)
+ if a continuous distribution object is given (imported, say, from `scipy.stats`), values will be sampled with replacement following the particular form of the distribution; for instance: `param_distributions={'alpha': scipy.stats.norm()}`

Why prefer the randomized approach to uniformly searching each hyperparameter subspace? Empirically, it has been shown that often a variation in one or more hyperparameters has little effect on the generalizability of the resulting model. We see this in the following image:

![](../images/randomized_search.png) Source: <a href="http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf">J. Bergstra and Y. Bengio</a>

The scikit-learn API call is:

```python
from sklearn.model_selection import RandomizedSearchCV
```

In addition to the parameters `GridSearchCV` accepts, `RandomizedSearchCV` also accepts `n_iter`, the number of hyperparameter candidates that will be sampled and assessed. Using this parameter, we can choose a resource budget which is independent of the number of hyperparameters and the number of possible values they can take, essentially decoupling our runtime from our search space size. In this way, adding hyperparameters to `param_distributions` doesn't *per se* increase the cost of the search or decrease its efficiency.
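
A minimal sketch of such a fixed-budget search, assuming a log-uniform distribution for `C` and a budget of eight candidates. The ranges, the budget and the names `search_space` and `search` are illustrative choices, and `scipy.stats.loguniform` requires a reasonably recent SciPy.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = load_iris(return_X_y=True)

search_space = {
    'C': loguniform(1e-3, 1e2),        # continuous distribution
    'fit_intercept': [True, False],    # discrete list
}

# n_iter fixes the number of sampled candidates, whatever the size of the space.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000),
    param_distributions=search_space,
    n_iter=8, cv=5, random_state=0)
search.fit(X_demo, y_demo)

print("candidates evaluated:", len(search.cv_results_['params']))   # 8
print("best params:", search.best_params_)
print("best cross-validation score: {:.3f}".format(search.best_score_))
```

Adding a further entry to `search_space` would leave the number of fits unchanged at `n_iter` times the number of folds, plus the final refit.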

# Review Questions 

1. What might go wrong if you only split train/test once? 
2. When might a grid search be a better/worse idea than a randomized search? 
3. Is it important to add a seed to our cross-validation calls? 

# Exercises: [lab 4 - Model selection](../labs/lab_04_model_selection.ipynb)