# Measuring prediction performance

Here we will discuss how to use **validation sets** to get a better measure of
performance for a classifier.

## Using the K-neighbors classifier

Here we'll continue to look at the digits data, but we'll switch to the
K-neighbors classifier.  This is an instance-based classifier: it predicts
the label of an unknown point from the labels of the *K* nearest training
points in feature space.
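
To make the idea concrete, here is a minimal pure-Python sketch of the prediction rule, assuming Euclidean distance and a simple majority vote. The helper `knn_predict` and the toy points are hypothetical and are not how scikit-learn implements `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two clusters in a 2D feature space
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["a", "a", "b", "b"]

print(knn_predict(points, labels, (0.2, 0.1), k=1))
print(knn_predict(points, labels, (0.8, 0.9), k=3))
```

With `k=1`, every training point is its own nearest neighbor, so the rule always returns the stored label for data it has already seen.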

In [None]:
# Get the data
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target

In [None]:
# Instantiate and train the classifier
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

In [None]:
# Check the results using metrics
from sklearn import metrics
y_pred = clf.predict(X)

In [None]:
print(metrics.confusion_matrix(y, y_pred))

Apparently, we've found a perfect classifier!  But this is misleading
for the reasons we saw before: the classifier essentially "memorizes"
all the samples it has already seen.  To really test how well this
algorithm does, we need to try some samples it *hasn't* yet seen.

This problem can also occur with regression models. In the following we fit another model, a decision tree (which can likewise memorize its training data), to the Boston Housing price dataset:

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

data = load_boston()
clf = DecisionTreeRegressor().fit(data.data, data.target)
predicted = clf.predict(data.data)
expected = data.target

plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')

Here again the predictions are seemingly perfect as the model was able to perfectly memorize the training set.

## A Better Approach: Using a validation set

Learning the parameters of a prediction function and testing it on the
same data is a methodological mistake: a model that would just repeat
the labels of the samples that it has just seen would have a perfect
score but would fail to predict anything useful on yet-unseen data.

To detect over-fitting, we have to define two different sets:

- a training set X_train, y_train which is used for learning the parameters of a predictive model
- a testing set X_test, y_test which is used for evaluating the fitted predictive model

In scikit-learn such a random split can be quickly computed with the
`train_test_split` helper function.  It can be used this way:

In [None]:
from sklearn.model_selection import train_test_split
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

print("%r, %r, %r" % (X.shape, X_train.shape, X_test.shape))

Now we train on the training data, and test on the testing data:

In [None]:
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
print(metrics.confusion_matrix(y_test, y_pred))

In [None]:
print(metrics.classification_report(y_test, y_pred))

The averaged f1-score is often used as a convenient measure of the
overall performance of an algorithm.  It appears in the bottom row
of the classification report; it can also be accessed directly:

In [None]:
metrics.f1_score(y_test, y_pred, average='weighted')
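
As a rough illustration of what the weighted average means, the sketch below computes the per-class f1-score (the harmonic mean of precision and recall) and averages it with weights proportional to the number of true samples in each class. The helper `weighted_f1` and the toy labels are hypothetical; this is a sketch of the definition, not scikit-learn's implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class f1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class that were predicted as this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight each class by its number of true samples
        total += f1 * n_true
    return total / len(y_true)


# Hypothetical labels, for illustration only
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2]
print(weighted_f1(y_true_toy, y_pred_toy))
```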

The over-fitting we saw previously can be quantified by computing the
f1-score on the training data itself:

In [None]:
metrics.f1_score(y_train, clf.predict(X_train), average='weighted')

In [None]:
# Whatever score is used, a higher score generally means better performance.
# Which metric to use depends on the problem at hand.

### Validation with a Regression Model

These validation metrics also work in the case of regression models.  Here we'll use
a gradient-boosted regression tree, a meta-estimator built on the
``DecisionTreeRegressor`` we showed above.  We'll start by doing the train-test split
as we did with the classification case:

In [None]:
data = load_boston()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print("%r, %r, %r" % (X.shape, X_train.shape, X_test.shape))

Next we'll compute the training and validation scores using the single decision tree
that we saw before:

In [None]:
est = DecisionTreeRegressor().fit(X_train, y_train)

validation_score = metrics.explained_variance_score(
    y_test, est.predict(X_test))

print("validation: %r" % validation_score)

training_score = metrics.explained_variance_score(
    y_train, est.predict(X_train))

print("training: %r" % training_score)

This large gap between validation and training scores is characteristic
of a **high variance** model.  Decision trees are not entirely useless,
however: by combining many individual decision trees within ensemble
estimators such as Gradient Boosted Trees or Random Forests, we can get
much better performance:

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
est = GradientBoostingRegressor().fit(X_train, y_train)

validation_score = metrics.explained_variance_score(
    y_test, est.predict(X_test))

print("validation: %r" % validation_score)

training_score = metrics.explained_variance_score(
    y_train, est.predict(X_train))

print("training: %r" % training_score)

This model is still over-fitting the data, but not by as much as the single tree.
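
For reference, the explained variance score used in this section is one minus the ratio between the variance of the residuals and the variance of the true targets. The sketch below follows that definition with population variances; the helper name and the toy values are hypothetical.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """1 - Var(y_true - y_pred) / Var(y_true), using population variances."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


# Hypothetical targets and predictions, for illustration only
y_true_toy = [3.0, 5.0, 7.0, 9.0]
print(explained_variance(y_true_toy, y_true_toy))            # perfect predictions
print(explained_variance(y_true_toy, [6.0, 6.0, 6.0, 6.0]))  # constant prediction
```

A perfect model scores 1, while a model that always predicts the same value explains none of the variance and scores 0.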

## Exercise: Model Selection via Validation

Here we saw K-neighbors classification of the digits. We've also seen support vector
machine classification of digits, and below we add Gaussian naive Bayes. Now that we
have these validation tools in place, we can ask quantitatively which of the three
estimators works best for the digits dataset.

Take a moment and determine the answers to these questions for the digits dataset:

- With the default hyper-parameters for each estimator, which gives the best f1 score
  on the **validation set**?  Recall that hyperparameters are the parameters set when
  you instantiate the classifier: for example, the ``n_neighbors`` in

          clf = KNeighborsClassifier(n_neighbors=1)

  To use the default values, simply leave the hyperparameters unspecified.
- For each classifier, which value for the hyperparameters gives the best results for
  the digits data?  For ``LinearSVC``, use ``loss='squared_hinge'`` and ``loss='hinge'``
  (called ``'l2'`` and ``'l1'`` in older scikit-learn releases).  For
  ``KNeighborsClassifier`` use ``n_neighbors`` between 1 and 10. Try also
  ``GaussianNB``. Note that it has essentially no hyperparameters to tune.
- Bonus: do the same exercise on the Iris data rather than the Digits data.  Does the
  same classifier/hyperparameter combination win out in this case?

In [None]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

### Solution
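
One possible way to approach the exercise is sketched below; it is not the only solution. It assumes the same 75% / 25% split with `random_state=0` used earlier, compares the three estimators with default hyperparameters, and scans only `n_neighbors` (the `loss` options of `LinearSVC` can be compared in the same way). `LinearSVC` may emit a convergence warning on this unscaled data.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Default hyperparameters for each estimator
candidates = {
    'LinearSVC': LinearSVC(),
    'GaussianNB': GaussianNB(),
    'KNeighborsClassifier': KNeighborsClassifier(),
}

scores = {}
for name, model in candidates.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = f1_score(y_test, y_pred, average='weighted')
    print("%s: %.3f" % (name, scores[name]))

# Scan n_neighbors from 1 to 10 for the K-neighbors classifier
knn_scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    knn_scores[k] = f1_score(y_test, knn.predict(X_test), average='weighted')
best_k = max(knn_scores, key=knn_scores.get)
print("best n_neighbors: %d (f1 = %.3f)" % (best_k, knn_scores[best_k]))
```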

# General Strategy

<img src="img/predictive_modeling_data_flow.png">

**Even better**: when we have to choose hyper-parameters in the model, we want to be sure we are not overfitting on a particular subset of validation data.


<img src="img/Cross-val.png">

# Let's start from tabular data from the Titanic Kaggle challenge

Let us have a look at the Titanic dataset from the Kaggle Getting Started challenge at:

https://www.kaggle.com/c/titanic-gettingStarted

We can load the CSV file as a pandas data frame in one line.

First, let's have a look at the raw file:

In [None]:
!head -5 ./data/titanic_train.csv

Now, let's use pandas

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('./data/titanic_train.csv')

pandas data frames have an HTML table representation in the IPython notebook. Let's have a look at the first 5 rows:

In [None]:
data.head(5)

In [None]:
data.columns

In [None]:
data.count()

The data frame has 891 rows. Some passengers have missing information though: in particular Age and Cabin info can be missing. The meaning of the columns is explained on the challenge website:

https://www.kaggle.com/c/titanic-gettingStarted/data

and copied here:

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```



In [None]:
list(data.columns)

In [None]:
data.shape

A data frame can be converted into a numpy array by accessing the `values` attribute:

In [None]:
data.values

However this cannot be directly fed to a scikit-learn model:


- the target variable (survival) is mixed with the input data

- some attributes such as unique ids have no predictive value for the task

- the values are heterogeneous (string labels for categories, integers and floating point numbers)

- some attribute values are missing (nan: "not a number")

In [None]:
data.dtypes

## Predicting survival

The goal of the challenge is to predict whether a passenger survived from the other known attributes. Let us have a look at the `Survived` column:

In [None]:
survived_column = data['Survived']
survived_column.dtype

`data.Survived` is an instance of the pandas `Series` class with an integer dtype:

In [None]:
type(survived_column)

`Series` can be seen as homogeneous, 1D columns. `DataFrame` instances are heterogeneous collections of columns with the same length.

The original data frame can be aggregated by counting rows for each possible value of the `Survived` column:

In [None]:
data.groupby('Survived').count()

For example, the `Cabin` counts above suggest that a missing cabin value is a good indicator that
the passenger did not survive.

In [None]:
data.groupby('Survived').mean()

In [None]:
np.mean(survived_column == 0)

In this subset of the full passenger list, about 62% of the passengers perished in the event. So if we are to build a predictive model from this data, a baseline model to compare the performance to would be to always predict death. Such a constant model would reach around 62% predictive accuracy (which is higher than predicting at random).
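
The accuracy of such a constant baseline is simply the frequency of the most common class. The sketch below illustrates this with a hypothetical list of 100 labels that has the same 62% / 38% imbalance; `majority_baseline_accuracy` is an illustrative helper, not part of any library.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Hypothetical labels with a 62% / 38% imbalance (0 = died, 1 = survived)
toy_labels = [0] * 62 + [1] * 38
print(majority_baseline_accuracy(toy_labels))
```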

pandas `Series` instances can be converted to regular 1D numpy arrays by using the `values` attribute:

In [None]:
target = survived_column.values

In [None]:
data.plot(kind='scatter', x='Age', y='Fare', c='Survived', s=50);

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(data.get(['Fare', 'Pclass','Age']), alpha=0.2, figsize=(8,8),diagonal='kde');

In [None]:
data.get(['Fare','Age','Survived']).groupby('Survived').mean().plot(kind='bar');

## Training a predictive model on numerical features

`sklearn` estimators all work with homogeneous numerical feature descriptors passed as a numpy array. Therefore passing the raw data frame will not work out of the box.

**Let us start simple** and build a first model that only uses readily available numerical features as input, namely `data.Fare`, `data.Pclass` and `data.Age`.

In [None]:
numerical_features = data.get(['Fare', 'Pclass', 'Age'])
numerical_features.head(5)

Unfortunately some passengers do not have age information:

In [None]:
numerical_features.count()

Let's use the pandas `fillna` method to impute the median age for those passengers. First, we compute the median of each column after dropping the rows with missing values:

In [None]:
median_features = numerical_features.dropna().median()
median_features

A possible strategy for filling the missing values is to replace them with these medians:

In [None]:

imputed_features = numerical_features.fillna(median_features)
imputed_features.count()

After imputation, the column medians are:

In [None]:
imputed_features.median()

--> **Filling the missing values (imputing) is a critical step for any data analysis**

Now that the data frame is clean, we can convert it into a homogeneous numpy array of floating point values:

In [None]:
features_array = imputed_features.values
features_array

In [None]:
features_array.dtype

Let's take 80% of the data for training a first model and keep 20% for computing its generalization score:

In [None]:
from sklearn.model_selection import train_test_split

features_train, features_test, target_train, target_test = train_test_split(
    features_array, target, test_size=0.20, random_state=0)

In [None]:
features_train.shape

In [None]:
features_test.shape

In [None]:
target_train.shape

In [None]:
target_test.shape

### Let's start with a simple model from sklearn, namely `LogisticRegression`:

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=1.)
logreg.fit(features_train, target_train)

In [None]:
target_predicted = logreg.predict(features_test)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(target_test, target_predicted)

This first model has around 73% accuracy: this is better than our baseline that always predicts death.

We can use a different score:

In [None]:

from sklearn.metrics import precision_score

precision_score(target_test, target_predicted)

**Quiz**: why two different score values? What is the difference?

Solution

<img src="img/Conf_Matrix.png">

Also notice that about 62% of the passengers in the data did not survive, so the 73% accuracy should be compared
with a 62% baseline, not with 50%: the improvement is modest.

## Model evaluation and interpretation

### Interpreting linear model weights

The `coef_` attribute of a fitted linear model such as `LogisticRegression` holds the weight of each feature:

In [None]:
feature_names = numerical_features.columns
feature_names

In [None]:
logreg.coef_

In [None]:
x = np.arange(len(feature_names))
plt.bar(x, logreg.coef_.ravel())
_ = plt.xticks(x, feature_names, rotation=30)  # bars are centered on x in matplotlib >= 2.0

In this case, survival is slightly positively linked with Fare (the higher the fare, the higher the likelihood the model will predict survival), while passengers from first class and of lower ages are predicted to survive more often than older people from the 3rd class.

First-class cabins were closer to the lifeboats and children and women reportedly had the priority. Our model seems to capture that historical data. We will see later if the sex of the passenger can be used as an informative predictor to increase the predictive accuracy of the model.
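
The way the weights turn into a prediction can be sketched as follows: a logistic model passes the weighted sum of the features plus an intercept through a sigmoid to obtain a probability. The weights and intercept below are hypothetical values chosen for illustration, not the fitted `logreg.coef_` and `logreg.intercept_`.

```python
import math


def predict_survival_proba(features, weights, intercept):
    """Logistic model: sigmoid of the weighted sum of the features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for (Fare, Pclass, Age); not the fitted values of `logreg`
toy_weights = [0.005, -0.9, -0.03]
toy_intercept = 2.0

first_class_child = [80.0, 1, 8.0]
third_class_adult = [8.0, 3, 40.0]
print(predict_survival_proba(first_class_child, toy_weights, toy_intercept))
print(predict_survival_proba(third_class_adult, toy_weights, toy_intercept))
```

A positive weight pushes the probability up as the feature grows, and a negative weight pushes it down, which is how the signs in the bar plot can be read.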

### Alternative evaluation metrics

It is possible to see the details of the false positive and false negative errors by computing the confusion matrix:

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(target_test, target_predicted)
print(cm)

The true labels are shown as the rows and the predicted labels as the columns:

In [None]:
def plot_confusion(cm):
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('Confusion matrix')
    plt.colorbar()

    target_names = ['not survived', 'survived']

    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=60)
    plt.yticks(tick_marks, target_names)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    # Convenience function to adjust plot parameters for a clear layout.
    plt.tight_layout()
    
plot_confusion(cm)

We can normalize the counts by dividing each row by the total number of true "not survived" and "survived" samples; the second column of the result then gives the false and true positive rates for survival.

In [None]:
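# keepdims=True keeps the row sums as a column so that each row is divided by its own total
print(cm.astype(np.float64) / cm.sum(axis=1, keepdims=True))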

We can therefore observe that the fact that the target classes are not balanced in the dataset makes the accuracy score not very informative.

scikit-learn provides alternative classification metrics to evaluate model performance on imbalanced data, such as precision, recall and f1 score:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(target_test, target_predicted,
                            target_names=['not survived', 'survived']))

The precision, recall and f1-score above quantify the quality of a binary classifier on imbalanced data at the default fixed decision threshold of 0.5.

Logistic regression is a probabilistic model: instead of just predicting a binary outcome (survived or not) given the input features, it can also estimate the posterior probability of the outcome given the input features using the `predict_proba` method:

In [None]:
target_predicted_proba = logreg.predict_proba(features_test)
target_predicted_proba[:5]

By default the decision threshold is 0.5: if we vary the decision threshold from 0 to 1 we could generate a family of binary classifier models that address all the possible trade offs between false positive and false negative prediction errors.
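
The trade-off can be illustrated with a small sketch that applies several thresholds to a list of predicted probabilities and counts the resulting errors. The labels and probabilities are hypothetical, and treating a probability equal to the threshold as positive is an assumption of this sketch.

```python
def count_errors(y_true, proba_positive, threshold):
    """Count false positives and false negatives at a given decision threshold."""
    y_pred = [1 if p >= threshold else 0 for p in proba_positive]
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn


# Hypothetical labels and predicted probabilities of the positive class
toy_true = [0, 0, 0, 1, 0, 1, 1, 1]
toy_proba = [0.1, 0.2, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, count_errors(toy_true, toy_proba, threshold))
```

Lowering the threshold removes false negatives at the cost of more false positives, and raising it does the opposite.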

We can summarize the performance of a binary classifier for all the possible thresholds by plotting the ROC curve and quantifying the Area under the ROC curve:

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

def plot_roc_curve(target_test, target_predicted_proba):
    fpr, tpr, thresholds = roc_curve(target_test, target_predicted_proba[:, 1])
    
    roc_auc = auc(fpr, tpr)
    # Plot ROC curve
    plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate or (1 - Specificity)')
    plt.ylabel('True Positive Rate or (Sensitivity)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")

In [None]:
plot_roc_curve(target_test, target_predicted_proba)

Here the area under the ROC curve is 0.756, which is very similar to the accuracy (0.732). However, the ROC-AUC score of a random model is expected to be 0.5 on average, while the accuracy score of a random model depends on the class imbalance of the data. ROC-AUC can be seen as a way to calibrate the predictive accuracy of a model against class imbalance.
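
One way to see why an uninformative model scores 0.5: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below computes it by brute force on hypothetical scores; `roc_auc_by_ranking` is an illustrative helper, not the scikit-learn implementation.

```python
def roc_auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores, for illustration only
toy_true = [0, 0, 0, 1, 0, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]
print(roc_auc_by_ranking(toy_true, toy_scores))
# A constant score carries no ranking information
print(roc_auc_by_ranking(toy_true, [0.5] * 8))
```

Because only the ranking of positives against negatives matters, the result does not depend on how many samples each class contains.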

### Cross-validation

We previously decided to randomly split the data to evaluate the model on 20% of held-out data. However, the randomness of the split might have a significant impact on the estimated accuracy:

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features_array, target, test_size=0.20, random_state=0)

logreg.fit(features_train, target_train).score(features_test, target_test)

What happens if I change the random seed?

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features_array, target, test_size=0.20, random_state=1)

logreg.fit(features_train, target_train).score(features_test, target_test)

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features_array, target, test_size=0.20, random_state=2)

logreg.fit(features_train, target_train).score(features_test, target_test)

So instead of using a single train / test split, we can use a group of them and compute the min, max and mean scores as an estimation of the real test score while not underestimating the variability:

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(logreg, features_array, target, cv=5)
scores

In [None]:
scores.min(), scores.mean(), scores.max()

In [None]:
scores.mean(), scores.std()
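
The splitting behind cross-validation can be sketched in a few lines: the samples are cut into `n_folds` blocks and each block serves once as the test set while the rest is used for training. The generator below is a plain, unshuffled K-fold written for illustration; for classifiers, `cross_val_score` with an integer `cv` uses a stratified variant that also preserves class proportions, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for consecutive, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=10, n_folds=3):
    print(train, test)
```

Every sample appears in exactly one test fold, so each one contributes to the score exactly once.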

`cross_val_score` reports accuracy by default, but it can also be used to report other performance metrics such as ROC-AUC or f1-score:

In [None]:
scores = cross_val_score(logreg, features_array, target, cv=5,
                         scoring='roc_auc')
scores.min(), scores.mean(), scores.max()

**Exercise**:

- Compute cross-validated scores for other classification metrics ('precision', 'recall', 'f1', 'accuracy'...).

- Change the number of cross-validation folds between 3 and 10: what is the impact on the mean score? on the processing time?

- Try changing some internal parameters of the logistic regression (for example `C`).

Hints:

The list of classification metrics is available in the online documentation:

  http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
  
You can use the `%%time` cell magic on the first line of an IPython cell to measure the time of the execution of the cell. 