# Lab 04 - Model Selection

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
%matplotlib inline
sns.set_style("darkgrid")

import sys
sys.path.append('../')
from lib.processing_functions import convert_to_pandas

## Exercise goals:

- Get a feeling for how cross-validation iterators split the data.
- Learn how to compute cross-validated predictions and scores.
- Practice hyperparameter optimization using grid search.

---
## Exercise 1: Cross-validation iterators

We will visualize the splits generated by different cross-validation iterators.
These iterators are explained in more detail in [3.1.2. Cross validation iterators](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators).
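
As a minimal sketch of what a cross-validation iterator produces, the snippet below runs `KFold` on a tiny array and collects the train and test indices of each split. The array `X_toy` and the choice of 3 folds are illustrative assumptions, not part of the lab data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 samples with 1 feature (arbitrary values, for illustration only)
X_toy = np.arange(6).reshape(-1, 1)

# 3 folds without shuffling: each test fold is a contiguous block of samples
kf = KFold(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in kf.split(X_toy)]

for i, (train, test) in enumerate(splits):
    print("split {}: train={} test={}".format(i, train, test))
```

Every sample lands in exactly one test fold, and `split` yields index arrays rather than the data itself.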

The iterators will be demonstrated on the Iris dataset:

In [None]:
# import the Iris dataset
X, y = convert_to_pandas(datasets.load_iris())

The functions below can be used to visualize the train and test sets generated by a cross-validation iterator:

In [None]:
from sklearn.preprocessing import LabelEncoder


def plot_labels(y, n_split, title):
    """Visualize the order of the original labels.
    
    Parameters
    ----------
    y : 1d array-like
        Ground truth (correct) labels.
    n_split : int
        Number of equal splits.
    title : str
        Title of the plot.
    """
    f, ax = plt.subplots(figsize=(15,1))
    labels = np.tile(LabelEncoder().fit_transform(y), [n_split, 1])
    ax.imshow(labels, interpolation='nearest', aspect='auto')
    ax.grid(None)
    ax.set_yticks([0])
    ax.set_title(title, y=1.5, size=14)

    
def plot_cv(cv, X, y, title):
    """Visualize the order for a cross-validation scheme.

    Parameters
    ----------
    cv : CrossValidator
        Cross-validation iterator.
    X : 2d array-like
        Features.
    y : 1d array-like
        Ground truth (correct) labels.
    title : str
        Title of the plot.
    """
    f, ax = plt.subplots(figsize=(15,1), sharex=True)
    masks = []
    n_samples = len(y)
    for train, test in cv.split(X, y):
        mask = np.zeros(n_samples, dtype=bool)
        mask[test] = 1
        masks.append(mask)
    ax.imshow(masks, interpolation='nearest', aspect='auto')
    ax.set_xlabel('sample')
    ax.set_yticks(range(0, cv.n_splits))
    ax.set_yticklabels(range(1, cv.n_splits+1))
    ax.grid(None)
    ax.set_title(title, y=1.5, size=14)

Generate the following three cross-validation iterators: 

- 5-fold cross-validation
- stratified 5-fold cross-validation 
- shuffle split with a test set size 20% of the dataset and 5 iterations

```python
# TODO: Replace <FILL IN> with appropriate code
# import the cv iterators
<FILL IN>

# number of samples
n_samples = <FILL IN>

#create 5-Fold
cv_5fold = <FILL IN>

#create stratified 5-Fold
cv_strat_5fold = <FILL IN>

#create shuffle split .2 test set size and 5 iterations
cv_shuffle = <FILL IN>
```

In [None]:
%load ../answers/04_01_cv.py

Now visualize the splits using the `plot_cv` function:

In [None]:
plot_labels(y, 5, 'Labels (colours)')
plot_cv(cv_5fold, X, y, '5-fold')
plot_cv(cv_strat_5fold, X, y, 'Stratified 5-fold')
plot_cv(cv_shuffle, X, y, 'Shuffle split')

**Question**: Are the black samples in the visualization the test or the train samples? Which method uses the same test samples in multiple splits?

Also intriguing is the difference between regular 5-fold and stratified 5-fold splits. To explain this, let's compare the test set label counts for both methods:  

In [None]:
def test_label_counts(X, y, cv, name):
    """Count the number of samples per label in each test fold.

    Parameters
    ----------
    X : 2d array-like
        Features.
    y : 1d array-like
        Ground truth (correct) labels.
    cv : CrossValidator
        Cross-validation iterator.
    name : str
        Name of the cross-validation iterator.
    """
    label_counts = pd.DataFrame(columns=y.unique())
    for i, split in enumerate(cv.split(X, y)):
        train, test = split
        split_name = '{} test_set_{}'.format(name, i)
        # index by position so this also works when y has a non-default index
        label_counts.loc[split_name,:] = y.iloc[test].value_counts()
    return label_counts.fillna(0)

print(test_label_counts(X, y, cv_5fold, '5-fold'))
print(test_label_counts(X, y, cv_strat_5fold, 'stratified 5-fold'))

**Question**: What is happening here? Would this problem become less severe if we ran regular K-Fold with `shuffle=True`?

---
## Exercise 2: Cross-validated scoring and prediction

Let's compare the cross-validation scores and predictions of two different regression methods on the Boston dataset. We'll compare `LinearRegression` and `RandomForestRegressor`.

More information on cross-validation scores and predictions can be found in [3.1.1. Computing cross-validated metrics](http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics).

In [None]:
# load the Boston dataset
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
X, y = convert_to_pandas(datasets.load_boston())

### 2.1 Cross-validated scores

The cross-validation scorers can be used to quickly get a cross-validated score of a model and see if it is able to learn something.
We'll use it to assess the performance of `LinearRegression` and `RandomForestRegressor`.
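
As a rough sketch of what such a scorer does under the hood, the loop below fits a fresh copy of a model on each training fold and scores it on the held-out fold. The synthetic data, the `Ridge` estimator, and all `*_toy` names are illustrative assumptions and not part of the lab.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic regression problem (hypothetical data, not the lab dataset)
X_toy, y_toy = make_regression(n_samples=100, n_features=5,
                               noise=10.0, random_state=0)

cv_toy = KFold(n_splits=4, shuffle=True, random_state=0)
model = Ridge(alpha=1.0)

manual_scores = []
for train, test in cv_toy.split(X_toy, y_toy):
    fold_model = clone(model)                 # unfitted copy for this fold
    fold_model.fit(X_toy[train], y_toy[train])
    # for regressors, .score returns R^2 on the held-out fold
    manual_scores.append(fold_model.score(X_toy[test], y_toy[test]))
manual_scores = np.array(manual_scores)

print(manual_scores)         # one score per split
print(manual_scores.mean())  # average across splits
```

The spread of the per-split scores is as informative as their mean: a large spread suggests the estimate depends strongly on how the data happened to be split.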

Create a 5-fold cross-validation iterator with shuffling: 

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.model_selection import KFold

# create 5-Fold iterator with shuffle
cv = <FILL IN>
```

In [None]:
%load ../answers/04_02_5fold.py

Compute the cross-validated score for the `LinearRegression` estimator using the previously-created iterator `cv`: 

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.model_selection import <FILL IN>
from sklearn.linear_model import LinearRegression

# compute the cross-validated score for the LinearRegression 
lr_scores = <FILL IN>

print("cv scores: {}".format(lr_scores))
```

In [None]:
%load ../answers/04_03_lr_scores.py

Do the same for the `RandomForestRegressor` estimator:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.ensemble import RandomForestRegressor

# compute the cross-validated score for the RandomForestRegressor 
rf_scores = <FILL IN>

print("cv scores: {}".format(rf_scores))
```

In [None]:
%load ../answers/04_04_rf_scores.py

Show boxplots of both results:

In [None]:
results = pd.DataFrame([lr_scores, rf_scores], 
                       index=['linear regression', 'random forest regressor'],
                       columns = ["split_{}".format(k) for k in range(cv.n_splits)]).T
ax = results.boxplot()
ax.set_ylabel('$R^2$');

**Question**: Which model performs best? Is this the best that both models can do, or do you see opportunities to improve one or both?

### 2.2 Cross-validated predictions

The cross-validation predictor `cross_val_predict` runs cross-validation and returns the predictions for the data points when they are in the test fold.
This is a quick way to get predictions from our model and see if it has any clear biases.
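
A minimal sketch on synthetic data illustrates the shape of the result: every sample receives exactly one out-of-fold prediction. The data, the `Ridge` estimator, and the `*_toy` names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression problem (hypothetical data, not the lab dataset)
X_toy, y_toy = make_regression(n_samples=60, n_features=3,
                               noise=5.0, random_state=0)
cv_toy = KFold(n_splits=3, shuffle=True, random_state=0)

# Each prediction comes from a model that did not see that sample in training
y_toy_pred = cross_val_predict(Ridge(), X_toy, y_toy, cv=cv_toy)

print(y_toy_pred.shape)             # same length as y_toy
print(np.mean(y_toy_pred - y_toy))  # mean residual; far from 0 hints at bias
```

Because the predictions come from several different fitted models, they are suited to diagnostics such as residual plots rather than to reporting a single model's performance.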

Compute the cross-validated predictions for the `LinearRegression` and `RandomForestRegressor` estimators using the `cv` iterator created in 2.1:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.model_selection import <FILL IN>

# compute the cross-validated predictions for the LinearRegression 
lr_y_pred = <FILL IN>
# compute the cross-validated predictions for the RandomForestRegressor 
rf_y_pred = <FILL IN>
```

In [None]:
%load ../answers/04_05_xval_predictions.py

Use the `truth_vs_prediction` function below to plot `y` versus `y_pred`:

In [None]:
def truth_vs_prediction(y, y_pred, ax=None, title=None):
    """Plot truth versus predictions.

    Parameters
    ----------
    y : 1d array-like
        Ground truth (correct) target values.

    y_pred : 1d array-like
        Predicted target values, as returned by a regressor.
        
    ax : matplotlib.axes
        Axes to plot on.
        
    title : str
        Title for plot.
    """
    if ax is None:
        ax = plt.subplot()
    y_max = y.max()
    ax.plot(y, y_pred, 'o', markersize=2)
    ax.plot([0,y_max], [0,y_max],':r')
    ax.set_xlabel('Truth')
    ax.set_ylabel('Prediction')
    ax.set_xlim(0,y_max)
    ax.set_ylim(0,y_max)
    ax.set_aspect('equal')
    ax.set_title("truth versus prediction" if title is None else title)

# plot the truth versus cv predictions for both regressors
fig, axes = plt.subplots(1, 2, figsize=(8,4), sharey=True)
fig.set_size_inches(11.0, 8.5)
truth_vs_prediction(y, lr_y_pred, axes[0], 'linear regression')
truth_vs_prediction(y, rf_y_pred, axes[1], 'random forest regression')

**Question**: Does the random forest regressor suffer from the same systematic error as the linear regression?

---
## Exercise 3: Hyperparameter optimization

We shouldn't just use the default hyperparameters of our models: the defaults might not be good choices for our problem!
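
To make the idea of searching over hyperparameters concrete, here is a small sketch that tries a few candidate values with cross-validation and reports the best one. It deliberately uses a different estimator (`KNeighborsClassifier`) and synthetic data; the grid values and `*_toy` names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification problem (hypothetical data, not the lab dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=10,
                                   n_informative=5, random_state=0)

cv_toy = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
toy_grid = {'n_neighbors': [1, 5, 15]}  # candidate hyperparameter values

# Each candidate is fitted and scored on every split; the best mean score wins
search = GridSearchCV(KNeighborsClassifier(), toy_grid, cv=cv_toy)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.cv_results_['mean_test_score'])  # one mean score per candidate
```

If the best candidate sits at the edge of the grid, the grid probably needs to be extended in that direction.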

Let's return to the digits dataset and optimize the Linear SVM model fitted in Lab 2:

In [None]:
# load the digits dataset
X, y = convert_to_pandas(datasets.load_digits())

### 3.1 Cross-validated grid search

Create a stratified 5-fold cross-validation iterator with shuffling: 

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.model_selection import StratifiedKFold

# create stratified 5-Fold iterator with shuffle
cv = <FILL IN>
```

In [None]:
%load ../answers/04_06_stratified.py

In Lab 2 we ran `LinearSVC` with an 'l2' penalty and `C=1.0`. The penalty refers to [regularization](http://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html) of the model. 

Perform a cross-validated grid search over the `C` parameter that regulates the amount of regularization. Use the following grid values [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1] and the previously created cross-validation iterator `cv`:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.svm import LinearSVC
from sklearn.model_selection import <FILL IN>

# specify the parameter grid
param_grid = {'C': <FILL IN>}

# perform the gridsearch using the cv-iterator and param_grid
grid_clf = <FILL IN>

# fit the data
grid_clf.<FILL IN>
```

In [None]:
%load ../answers/04_07_grid_search.py

Display the cross-validation results for the different values of `C`:

```python
# display the grid scores
grid_clf.<FILL IN>
```

In [None]:
%load ../answers/04_08_grid_scores.py

**Question**: Did we pick a large enough parameter range?

### 3.2 Validation curve

To get a bit more insight into how cross-validated grid search works, we can plot a validation curve. This plot shows both the training and cross-validation scores for the different hyperparameter candidates.

In [None]:
from sklearn.model_selection import validation_curve

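# param_name and param_range are keyword-only in recent scikit-learn releases
train_scores, valid_scores = validation_curve(LinearSVC(), X, y,
                                              param_name='C',
                                              param_range=param_grid['C'])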
best_C = grid_clf.best_params_['C']

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)

fig, ax = plt.subplots()
fig.set_size_inches(11.0, 8.5)
ax.set_title("Validation Curve with Linear SVM")
ax.set_xlabel("C")
ax.set_ylabel("Score")
ax.set_ylim(0.8, 1.1)
ax.semilogx(param_grid['C'], train_scores_mean, label="Training score", color="r")
ax.fill_between(param_grid['C'], train_scores_mean - train_scores_std,
                train_scores_mean + train_scores_std, alpha=0.2, color="r")
ax.semilogx(param_grid['C'], valid_scores_mean, label="Cross-validation score", 
            color="g")
ax.fill_between(param_grid['C'], valid_scores_mean - valid_scores_std,
                 valid_scores_mean + valid_scores_std, alpha=0.2, color="g")
ax.plot([best_C,best_C], [0.8,1.1], 'k:', label='best C')
ax.legend(loc="best")

**Question**: Which curve relates to the (very good) score we obtained in Lab 2? See the scikit-learn documentation on [learning curves](http://scikit-learn.org/stable/modules/learning_curve.html) for more information!

In [None]:
%load ../answers/04_questions.py