# Lab 05 - Model Evaluation

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
%matplotlib inline
sns.set_style("darkgrid")

import sys
sys.path.append('../')
from lib.processing_functions import convert_to_pandas

## Exercise goals:

- see the full flow from model selection through evaluation to persistence
- get some practice with classification metrics


---
## Exercise 1: Persist the best model

We will optimize and evaluate the performance of multiple regression models on the Boston dataset and store the best model on disk. The models that will be considered are `LinearRegression`, `Lasso` and `Ridge`.

In [None]:
# load the Boston dataset
X, y = convert_to_pandas(datasets.load_boston())

### 1.1 Prepare models

Instead of using grid search for optimizing the Lasso and Ridge models, we will use the model versions with built-in cross-validated parameter optimization.

In Lab 1 we saw that the features vary a lot in their spread. However, Lasso and Ridge regression usually perform best when all features are scaled to the same range. The Ridge and Lasso models can automatically do this by setting `normalize=True`.

All features are penalized equally during regularization.
Consider two features A and B that have equal predictive power, but where B takes very small values.
Without regularization, the model assigns B a large coefficient so that it contributes as much to the output as A.
Regularization, however, punishes B much more, because its coefficient is so large and not because it has less predictive power.
Putting the features on an equal scale levels the playing field.
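
The effect can be sketched on synthetic data. The snippet below is only an illustration under stated assumptions: it uses a hand-written closed-form ridge solution without intercept (the helper `ridge_fit` is hypothetical, not part of scikit-learn), made-up features `a` and `b`, and an arbitrary penalty `alpha=10`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)          # feature A, unit scale
b = rng.normal(size=n) * 0.01   # feature B, same information but tiny values
# both features contribute equally to the target
y = a + 100 * b + rng.normal(scale=0.1, size=n)

def ridge_fit(X, y, alpha):
    """Hypothetical helper: closed-form ridge weights (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

X_raw = np.column_stack([a, b])
X_scaled = X_raw / X_raw.std(axis=0)  # put both features on unit scale

alpha = 10.0
w_raw = ridge_fit(X_raw, y, alpha)
w_scaled = ridge_fit(X_scaled, y, alpha)

# contribution of each feature to the prediction, in units of y
contrib_raw = w_raw * X_raw.std(axis=0)
contrib_scaled = w_scaled * X_scaled.std(axis=0)
print("raw features:   ", contrib_raw)
print("scaled features:", contrib_scaled)
```

On the raw features the penalty nearly removes B's contribution, while on the scaled features A and B end up contributing about equally.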

In addition, some optimization algorithms converge to better solutions if the input features are in the same range.
Note that scaling doesn't always make sense: if your features already share the same unit and scale (e.g. euros), there may be no need to scale them.


#### Bonus question

The documentation mentions that setting `normalize=True` is not the same as standardization.

> What's the difference between normalization and standardization in this case? What different kinds of normalization can you find?

Standardization is scaling a feature to have zero mean and unit variance.
Normalization is scaling a vector to have unit norm.
If the vector has zero mean, standardization and normalization differ only by a constant factor (`sqrt(len(X))`).

In different areas, normalization can be used as a term for:

- scaling the samples (for e.g. clustering);
- min-max scaling (scale a vector in the range of [0, 1]);
- ...?
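
As a rough illustration of these scaling variants, the sketch below applies three of them to a small made-up vector `x`. It assumes the population standard deviation (NumPy's default) and, for the unit-norm case, that the vector is centred first.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# standardization: zero mean, unit variance
standardized = (x - x.mean()) / x.std()

# unit-norm scaling after centring: zero mean, Euclidean norm of 1
centered = x - x.mean()
unit_norm = centered / np.linalg.norm(centered)

# min-max scaling: values mapped onto [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# for a centred vector the first two differ by a constant factor sqrt(n)
ratio = standardized / unit_norm
print(ratio, np.sqrt(len(x)))
```

The ratio is the same for every element, which is why the two scalings give the same fit up to a rescaling of the coefficients (and of the effective penalty strength).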

Complete the cell below to initialize and store the three regression models:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.linear_model import LassoCV, RidgeCV, LinearRegression

# Initialize the three models
lasso_reg = <FILL IN>
ridge_reg = <FILL IN>
linear_reg = <FILL IN>

# Store the models in a dict
models = dict(lasso=lasso_reg, ridge=ridge_reg, linear=linear_reg)
```

In [None]:
%load ../answers/05_01_initialize.py

### 1.2 Define scorers

We will not use the models' default scoring metrics to measure the model performance. Instead, we will use two common metrics: negative mean absolute error (MAE) and negative mean squared error (MSE). 
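
The metrics are negated because scikit-learn scorers follow a "greater is better" convention, so an error has to be flipped in sign before it can be maximized. A minimal sketch of both quantities on made-up predictions (the arrays `y_true` and `y_pred` are purely illustrative):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error

# negated versions: closer to zero means a better model
neg_mae, neg_mse = -mae, -mse
print(neg_mae, neg_mse)
```

Note that MSE weighs the single large error (2.0) much more heavily than MAE does.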

Print the names of all common scorers below to figure out how to specify the two scorers:

In [None]:
from sklearn.metrics import SCORERS

# print the common scorer names
print("SCORERS: {}".format(list(SCORERS.keys())))

```python
# TODO: Replace <FILL IN> with appropriate code
# Store the scorers' names as strings in a dict
scorers = dict(mae=<FILL IN>, mse=<FILL IN>)
```

In [None]:
%load ../answers/05_02_scorers.py

### 1.3 Create cross-validation iterator

The cross-validation strategy for scoring the three models will be 5-fold cross-validation (with shuffling); create the iterator:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.model_selection import KFold

# create 5-Fold iterator
cv = <FILL IN>
```

In [None]:
%load ../answers/05_03_5fold.py
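
To make concrete what a shuffled k-fold iterator produces, here is a NumPy-only sketch. The generator `shuffled_kfold_indices` and its `seed` parameter are hypothetical names used for illustration; this is not how scikit-learn implements `KFold`, only the same idea.

```python
import numpy as np

def shuffled_kfold_indices(n_samples, n_splits, seed=None):
    """Hypothetical helper: yield (train_idx, test_idx) for shuffled k-fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)        # shuffle once up front
    folds = np.array_split(indices, n_splits)   # k roughly equal folds
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_splits) if j != i]
        )
        yield train_idx, test_idx

splits = list(shuffled_kfold_indices(n_samples=10, n_splits=5, seed=0))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```

Every sample lands in exactly one test fold, and each model is scored once per fold on data it was not trained on.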

### 1.4 Perform model selection
Now let's compute the cross-validated scores using the previously defined models, scorers and cross-validation iterator, and store the results:

In [None]:
from sklearn.model_selection import cross_val_score

# create a DataFrame for storing the results
index = pd.Index([(scorer, model) for scorer in scorers.keys() for model in models.keys()],
                 name=['scorer','model'])
columns = ["split_{}".format(split) for split in range(cv.n_splits)]
results = pd.DataFrame(columns=columns, index=index)

# evaluate the models with the scorers
for scorer in scorers.keys():
    for model in models.keys():
        scoring = scorers[scorer]
        current = models[model]
        # compute cross-validated score
        results.loc[(scorer,model),:] = cross_val_score(
            current, X, y, scoring=scoring, cv=cv
        )

results

### 1.5 Evaluate model performance

Visualize the results to select the best model:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5.5))
for idx, scorer in enumerate(scorers.keys()):
    results.T[scorer].plot(kind='box', title=scorer, ax=axes[idx])

**Question**: Which model performs best for which metric?

### 1.6 Persist optimal model

Select the model that performs best.

Before we can store this model, we have to train it first on the full dataset: 

```python
# TODO: Replace <FILL IN> with appropriate code
reg = <FILL IN>
```

In [None]:
%load ../answers/05_04_full.py

Now, store the regression model on disk: 

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.externals import joblib
file_name = 'boston_best_model.pkl'
joblib.<FILL IN>
```

In [None]:
%load ../answers/05_05_dump.py
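
Persisting an object generally follows a dump-then-load round trip. The sketch below shows that pattern with the standard-library `pickle` module and a plain dict standing in for a fitted model. The dict contents and the temporary file are assumptions for illustration only, and this is not the joblib call the exercise asks for.

```python
import os
import pickle
import tempfile

# stand-in for a fitted model (illustrative values only)
model_stand_in = {"coef": [0.5, -1.2], "intercept": 3.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # write the object to disk
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)
    # read it back, as a later session would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```

Only load pickled files from sources you trust, since unpickling can execute arbitrary code.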

**Question**: Rerun the cells in sections 1.4 and 1.5 a few times. What is happening?

---
## Exercise 2: Classification metrics

We will play around with several classification metrics. These metrics will be applied to the classification results for a noisier version of the digits dataset:

In [None]:
# load the digits dataset
X, y = convert_to_pandas(datasets.load_digits())

# add some white noise
np.random.seed(10)
X = np.abs(X + np.random.randn(*X.shape) * 7)

from lib.processing_functions import display_digits
fig = display_digits(X, y)  

### 2.1 Split the data

Split the data into a single train and test set, use `random_state=7`:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.model_selection import train_test_split

# split the data set into a single train and test set
X_train, X_test, y_train, y_test = <FILL IN>
print(X_train.shape, X_test.shape)
```

In [None]:
%load ../answers/05_06_split.py

### 2.2 Fit random forest classifier 

Fit the `RandomForestClassifier` model to the data and perform a grid search to optimize the `n_estimators` parameter (note: keep `n_estimators<=200`; to speed up the fitting process, you can try modifying the `n_jobs` parameter):

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# specify the parameter grid
param_grid = {'n_estimators': <FILL IN>}

# perform the gridsearch using the param_grid
grid_clf = <FILL IN>

# fit the model on train data
grid_clf.<FILL IN>

# make predictions on test data
y_pred = grid_clf.<FILL IN>

# get prediction probabilities
y_score = grid_clf.<FILL IN>
```

In [None]:
%load ../answers/05_07_fit_rf.py

### 2.3 Default model score

Start simple by printing the default model score for the test set:

```python
# TODO: Replace <FILL IN> with appropriate code
# compute default score on the test set
accuracy = grid_clf.<FILL IN>

print("accuracy: {}".format(accuracy))
```

In [None]:
%load ../answers/05_08_default_score.py

### 2.4 Confusion matrix

Analyze the performance of the classifier by plotting the normalized confusion matrix, showing the predicted label percentages for each true label:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.metrics import confusion_matrix

# compute confusion matrix
cm = <FILL IN>
```

In [None]:
%load ../answers/05_09_confusion.py

In [None]:
# normalize such that rows sum to 1
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.imshow(cm_normalized, cmap='Blues', interpolation='nearest')
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted')
plt.xticks(range(10))
plt.yticks(range(10))
plt.title("Normalized confusion matrix")
plt.colorbar();
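
To see what the row normalization does, the sketch below builds a confusion matrix by hand for a made-up three-class problem. The toy labels are assumptions; rows are true labels and columns are predicted labels, matching the axis labels in the plot above.

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2, 1])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
# count each (true, predicted) pair
np.add.at(cm, (y_true, y_pred), 1)

# divide each row by the number of samples with that true label
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(cm)
print(cm_normalized.round(2))
```

After normalization each row sums to 1, and the diagonal entry of a row is the fraction of that class that was predicted correctly.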

**Question**: Which noisy digit seems hardest to predict, and which one easiest?

### 2.5 Classification report
To get a quick overview of the classification performance for each class print the classification report:

In [None]:
from sklearn.metrics import classification_report

# print the classification report
print(classification_report(y_test, y_pred, labels=range(10)))

**Question**: What does it mean that the precision for digit 8 is still ok but the recall is very low?
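
As a reminder of how the two quantities are defined, the sketch below computes precision and recall for one label from true/false positive and false negative counts. The toy labels are made up and unrelated to the digits results.

```python
import numpy as np

y_true = np.array([8, 8, 8, 8, 8, 3, 3, 3, 3, 3])
y_pred = np.array([8, 3, 3, 3, 3, 3, 3, 3, 3, 3])

label = 8
tp = np.sum((y_pred == label) & (y_true == label))  # predicted 8, is 8
fp = np.sum((y_pred == label) & (y_true != label))  # predicted 8, is not 8
fn = np.sum((y_pred != label) & (y_true == label))  # is 8, predicted otherwise

precision = tp / (tp + fp)  # of all predicted 8s, how many are real 8s
recall = tp / (tp + fn)     # of all real 8s, how many were found
print(precision, recall)
```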

### 2.6 One-versus-rest classification
Let's have a look at two one-versus-rest transformations of the problem, turning the problem into two binary classification tasks of

- is this sample an 8 or is it not an 8
- is this sample a 5 or is it not a 5

For both these problems we will plot the receiver operating characteristic (ROC) and compute the area under the curve (AUC).
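
For intuition, a ROC curve can be traced by sweeping a decision threshold over the scores and recording the true and false positive rates at each step. The NumPy sketch below does this for six made-up samples and integrates the curve with the trapezoidal rule. It is a simplified illustration, not scikit-learn's implementation.

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.9])

# sweep thresholds from high to low; predict positive if score >= threshold
thresholds = np.sort(np.unique(scores))[::-1]
n_pos = np.sum(y_true == 1)
n_neg = np.sum(y_true == 0)

fpr, tpr = [0.0], [0.0]  # start with nothing predicted positive
for t in thresholds:
    predicted_pos = scores >= t
    tpr.append(np.sum(predicted_pos & (y_true == 1)) / n_pos)
    fpr.append(np.sum(predicted_pos & (y_true == 0)) / n_neg)

fpr, tpr = np.array(fpr), np.array(tpr)
# area under the curve via the trapezoidal rule
auc_value = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(auc_value)
```

The resulting area equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, here 8 out of 9.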

First create a vector that is 1 where `y_test==8` and 0 where `y_test!=8`, then do the same for digit 5:

```python
# TODO: Replace <FILL IN> with appropriate code
# convert labels into binary classification problem
y_test_8 = (y_test == 8).astype(float)
y_test_5 = <FILL IN>
```

In [None]:
%load ../answers/05_10_convert.py

Second, we need the probability scores for each sample; these scores give the probability of the sample belonging to each of the 10 possible labels. The previously computed `y_score` contains these prediction probability scores:

In [None]:
# select sample
idx = 29

# get sample probability scores
sample_prob = y_score[idx, :]

# plot label prediction probabilities
prob_df = pd.DataFrame({'prob': sample_prob}).rename_axis('digit')
title = 'sample #{} probability scores - true label={}'
prob_df.plot(kind='bar', title=title.format(idx, y_test.iloc[idx]), rot=0);

**Question**: What digit is sample 29 (`idx=29`), and what probability did our model give to it being that digit?

Extract the probability scores of predicting the digit 8 and do the same for digit 5:

```python
# TODO: Replace <FILL IN> with appropriate code
# assign probability score of predicting digit
y_score_8 = y_score[:, 8] 
y_score_5 = <FILL IN>
```

In [None]:
%load ../answers/05_11_score.py

Use the binary labels and digit probability scores to create ROC curves for both digits and compute the area under the curve:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.metrics import roc_curve, auc
fpr, tpr, roc_auc = dict(), dict(), dict()

fpr[8], tpr[8], _ = roc_curve(y_test_8, y_score_8)
roc_auc[8] = auc(fpr[8], tpr[8])

fpr[5], tpr[5], _ = <FILL IN>
roc_auc[5] = <FILL IN>
```

In [None]:
%load ../answers/05_12_roc.py

In [None]:
plt.plot(fpr[8], tpr[8], label='digit 8: ROC curve (AUC = %0.2f)' % roc_auc[8])
plt.plot(fpr[5], tpr[5], label='digit 5: ROC curve (AUC = %0.2f)' % roc_auc[5])
plt.plot([0, 1], [0, 1], 'k--', label='luck')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4, fontsize=16)

**Question**: Interpret the results. Do they match the conclusions we drew from the confusion matrix? What AUC would we get if the classifier just made random predictions?

### 2.7 Prediction results
Run the cell below to visualize 40 of the samples in the test set. If the label is red, then the classifier made a classification error:


In [None]:
fig = display_digits(X_test, y_test, y_pred, n_max=40) 

Classifying these digits is not easy, is it?!

In [None]:
%load ../answers/05_questions.py