# Model Evaluation

Prof. Dr. Georgios K. Ouzounis<br/>
[georgios.ouzounis@go.kauko.lt](mailto:georgios.ouzounis@go.kauko.lt)

<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/02/17085331/scikit-learn-logo.png" alt="sci-kit learn" width="300" style="float: left; margin-right: 10px;" />

The contents of this session are taken directly from the source site
http://scikit-learn.org/stable/index.html 

<style TYPE="text/css">
code.has-jax {font: inherit; font-size: 100%; background: inherit; border: inherit;}
</style>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
    tex2jax: {
        inlineMath: [['$','$'], ['\\(','\\)']],
        skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] // removed 'code' entry
    }
});
MathJax.Hub.Queue(function() {
    var all = MathJax.Hub.getAllJax(), i;
    for(i = 0; i < all.length; i += 1) {
        all[i].SourceElement().parentNode.className += ' has-jax';
    }
});
</script>
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-AMS_HTML-full"></script>

## Contents

- the scoring parameter
- classification metrics
- regression metrics
- validation curves

## The scoring parameter 

The scoring parameter selects the built-in scoring strategy used to evaluate the quality of a model’s predictions.

Model selection and evaluation tools, such as:

- [model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) and 
- [model_selection.cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score), 

take a scoring parameter that controls what metric they apply to the estimators evaluated.

Designate a scorer object with the scoring parameter [(click here to see all possible values)](http://scikit-learn.org/stable/modules/model_evaluation.html). 

All scorer objects follow the convention that higher return values are better than lower return values. 

Examples:

In [None]:
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

In [None]:
iris = datasets.load_iris()

In [None]:
X, y = iris.data, iris.target

In [None]:
clf = svm.SVC(probability=True, random_state=0)

In [None]:
cross_val_score(clf, X, y, scoring='neg_log_loss')

In [None]:
model = svm.SVC()

In [None]:
cross_val_score(model, X, y, scoring='accuracy')

|Scoring|Function|Comments|
|:------|:-------|:-------|
| ‘accuracy’ | [metrics.accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) | |
| ‘average_precision’ | [metrics.average_precision_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) | |
| ‘f1’ | [metrics.f1_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) | for binary targets |
| ‘f1_micro’ | [metrics.f1_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) | micro-averaged |
| ‘f1_macro’ | [metrics.f1_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) | macro-averaged |
| ‘f1_weighted’ | [metrics.f1_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) | weighted average |
| ‘f1_samples’ | [metrics.f1_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) | by multilabel sample |
| ‘neg_log_loss’ | [metrics.log_loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn.metrics.log_loss) | requires predict_proba support |
| ‘precision’ etc. | [metrics.precision_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) | suffixes apply as with ‘f1’ |
| ‘recall’ etc. | [metrics.recall_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score) | suffixes apply as with ‘f1’ |
| ‘roc_auc’ | [metrics.roc_auc_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) | |
 


Some scoring parameters for regression:

|Scoring | Function | Comment |
|:------:|:--------:|:-------:|
| ‘explained_variance’ | [metrics.explained_variance_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score) | |
| ‘neg_mean_absolute_error’ | [metrics.mean_absolute_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error) | |
| ‘neg_mean_squared_error’ | [metrics.mean_squared_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) | |
| ‘neg_mean_squared_log_error’ | [metrics.mean_squared_log_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html#sklearn.metrics.mean_squared_log_error) | |
| ‘neg_median_absolute_error’ | [metrics.median_absolute_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error) | |
| ‘r2’ | [metrics.r2_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score) | |
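
The `neg_` prefix in the table above follows from the "higher is better" convention: error metrics are negated so that a larger score still means a better model. The sketch below illustrates this on made-up data; `X_reg`, `y_reg` and the choice of `LinearRegression` are assumptions for illustration only, not part of the original session.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: a noisy linear relationship (illustrative only).
rng = np.random.RandomState(0)
X_reg = rng.rand(100, 1)
y_reg = 3.0 * X_reg[:, 0] + rng.normal(scale=0.1, size=100)

# 'neg_mean_squared_error' returns the negated MSE, so every score is <= 0.
neg_mse = cross_val_score(LinearRegression(), X_reg, y_reg,
                          scoring='neg_mean_squared_error', cv=5)

# Flip the sign to recover the usual non-negative MSE for each fold.
mse_per_fold = -neg_mse
print(mse_per_fold.mean())
```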


## Classification Metrics

The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance.

Some of these are restricted to the binary classification case:

| Metric | Description |
|--------|-------------|
| [precision_recall_curve(y_true, probas_pred)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve) | Compute precision-recall pairs for different probability thresholds |
| [roc_curve(y_true, y_score\[, pos_label, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve) | Compute Receiver operating characteristic (ROC) |


Others also work in the multiclass case:

| Metric | Description |
|--------|-------------|
| [cohen_kappa_score(y1, y2\[, labels, weights, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html#sklearn.metrics.cohen_kappa_score) | Cohen’s kappa: a statistic that measures inter-annotator agreement. |
| [confusion_matrix(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix) | Compute confusion matrix to evaluate the accuracy of a classification |
| [hinge_loss(y_true, pred_decision\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html#sklearn.metrics.hinge_loss) | Average hinge loss (non-regularized) |
| [matthews_corrcoef(y_true, y_pred\[, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html#sklearn.metrics.matthews_corrcoef) | Compute the Matthews correlation coefficient (MCC) |


Some also work in the multilabel case:


| Metric | Description |
|--------|-------------|
| [accuracy_score(y_true, y_pred\[, normalize, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) | Accuracy classification score. |
| [classification_report(y_true, y_pred\[, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) | Build a text report showing the main classification metrics | 
| [f1_score(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) | Compute the F1 score, also known as balanced F-score or F-measure |
| [fbeta_score(y_true, y_pred, beta\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#sklearn.metrics.fbeta_score) | Compute the F-beta score |
| [hamming_loss(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html#sklearn.metrics.hamming_loss) | Compute the average Hamming loss. |
| [jaccard_score(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html#sklearn.metrics.jaccard_score) | Jaccard similarity coefficient score |
| [log_loss(y_true, y_pred\[, eps, normalize, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn.metrics.log_loss) | Log loss, aka logistic loss or cross-entropy loss. |
| [precision_recall_fscore_support(y_true, y_pred)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support) | Compute precision, recall, F-measure and support for each class |
| [precision_score(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) | Compute the precision |
| [recall_score(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score) | Compute the recall |
| [zero_one_loss(y_true, y_pred\[, normalize, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.zero_one_loss.html#sklearn.metrics.zero_one_loss) | Zero-one classification loss. |


And some work with binary and multi-label (but not multiclass) problems:

| Metric | Description |
|--------|-------------|
| [average_precision_score(y_true, y_score\[, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) | Compute average precision (AP) from prediction scores |
| [roc_auc_score(y_true, y_score\[, average, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) | Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores. |


The following sections present some of the more frequently used metrics in more detail.


### The accuracy score

The [accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) function computes the [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision), either the fraction (default) or the count (normalize=False) of correct predictions.

In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the fraction of correct predictions over $n_{samples}$ is defined as:

$accuracy(y,\hat{y}) = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1} 1(\hat{y}_i = y_i)$

where $1(x)$ is the [indicator function](https://en.wikipedia.org/wiki/Indicator_function).

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

In [None]:
y_pred = [0, 2, 1, 3]

In [None]:
y_true = [0, 1, 2, 3]

In [None]:
accuracy_score(y_true, y_pred)

In [None]:
accuracy_score(y_true, y_pred, normalize=False)

In the multilabel case with binary label indicators:

In [None]:
accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
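
As a sanity check, the definition above can be reproduced by hand with NumPy. This is a minimal sketch: the names `manual_fraction`, `manual_count` and `subset_accuracy` are illustrative, and the data repeats the small examples used in this section.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 3])
y_pred = np.array([0, 2, 1, 3])

# The indicator 1(y_hat_i = y_i) is an element-wise comparison.
manual_fraction = np.mean(y_pred == y_true)  # fraction of correct predictions
manual_count = np.sum(y_pred == y_true)      # count, as with normalize=False

# Multilabel case: a sample counts as correct only if every label matches.
Y_true = np.array([[0, 1], [1, 1]])
Y_pred = np.ones((2, 2), dtype=int)
subset_accuracy = np.mean(np.all(Y_true == Y_pred, axis=1))

print(manual_fraction, manual_count, subset_accuracy)
```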


### Precision, Recall and F-Measures

Intuitively, [precision](https://en.wikipedia.org/wiki/Precision_and_recall#Precision) is the ability of the classifier not to label as positive a sample that is negative, and [recall](https://en.wikipedia.org/wiki/Precision_and_recall#Recall) is the ability of the classifier to find all the positive samples.

The [F-measure](https://en.wikipedia.org/wiki/F1_score) ($F_{\beta}$ and $F_1$ measures) can be interpreted as a weighted harmonic mean of the precision and recall. An $F_{\beta}$ measure reaches its best value at 1 and its worst score at 0. With $\beta=1$, $F_{\beta}$ and $F_1$ are equivalent, and the recall and the precision are equally important.

<img src="https://www.kdnuggets.com/images/precision-recall-relevant-selected.jpg" />

The traditional F-measure or balanced F-score (**$F_1$ score**) is the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean#Harmonic_mean_of_two_numbers) of precision and recall:

$F_1 = \frac{2}{\frac{1}{recall} + \frac{1}{precision}} = 2\times\frac{precision\;\times\;recall}{precision\;+\;recall}$

The general formula for positive real $\beta$ is:

$F_{\beta} = (1 + \beta^2)\times\frac{precision\;\times\;recall}{(\beta^2 \times precision)\;+\;recall}$

The formula in terms of [Type I and type II errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors):

$F_{\beta} = \frac{(1+\beta^2)\times true\;positives}{(1+\beta^2)\times (true\;positives) + \beta^2\times (false\;negatives) + (false\;positives)}$
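
The two forms of $F_{\beta}$ above should agree with each other and with `fbeta_score`. The following sketch checks this on a small made-up binary example; the value `beta = 2.0` and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, fbeta_score,
                             precision_score, recall_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
beta = 2.0  # illustrative choice: recall weighted more heavily than precision

# Form 1: from precision and recall.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f_from_pr = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Form 2: from true positives, false negatives and false positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
f_from_counts = (1 + beta**2) * tp / ((1 + beta**2) * tp + beta**2 * fn + fp)

f_sklearn = fbeta_score(y_true, y_pred, beta=beta)
print(f_from_pr, f_from_counts, f_sklearn)
```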


- The [precision_recall_curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve) computes a precision-recall curve from the ground truth label and a score given by the classifier by varying a decision threshold.
- The [average_precision_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) function computes the [average precision (AP)](http://en.wikipedia.org/w/index.php?title=Information_retrieval&oldid=793358396#Average_precision) from prediction scores. The value is between 0 and 1 and higher is better. AP is defined as $AP = \sum_n(R_n - R_{n-1})P_n$ where $P_n$ and $R_n$ are the precision and recall at the nth threshold. With random predictions, the $AP$ is the fraction of positive samples.

Note that the [precision_recall_curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve) function is restricted to the binary case. The [average_precision_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) function works only in binary classification and multilabel indicator format.

Several functions allow you to analyze the precision, recall and F-measures score:


| Metric | Description |
|--------|-------------|
| [average_precision_score(y_true, y_score\[, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) | Compute average precision (AP) from prediction scores |
| [f1_score(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) | Compute the F1 score, also known as balanced F-score or F-measure | 
| [fbeta_score(y_true, y_pred, beta\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#sklearn.metrics.fbeta_score) | Compute the F-beta score |
| [precision_recall_curve(y_true, probas_pred)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve) | Compute precision-recall pairs for different probability thresholds |
| [precision_recall_fscore_support(y_true, y_pred)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support) | Compute precision, recall, F-measure and support for each class |
| [precision_score(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) | Compute the precision |
| [recall_score(y_true, y_pred\[, labels, …\])](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score) | Compute the recall |
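
The sum $AP = \sum_n(R_n - R_{n-1})P_n$ can be evaluated directly from the output of `precision_recall_curve`. This is a sketch on a four-sample toy problem; `manual_ap` is an illustrative name, and the sign flip is needed because the function returns recall in decreasing order.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (precision, recall) pair per threshold, plus a final (1, 0) point.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# recall decreases along the array, so np.diff(recall) <= 0: negate the sum.
manual_ap = -np.sum(np.diff(recall) * precision[:-1])

ap = average_precision_score(y_true, y_scores)
print(manual_ap, ap)
```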


### Confusion Matrix

The [confusion_matrix function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix) evaluates classification accuracy by computing the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

By definition, entry $i,j$ in a confusion matrix is the number of observations actually in group $i$, but predicted to be in group $j$. Here is an example:

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_true = [2, 0, 2, 2, 0, 1]

In [None]:
y_pred = [0, 0, 2, 2, 0, 2]

In [None]:
confusion_matrix(y_true, y_pred)

Here is a visual representation of such a confusion matrix (this figure comes from the [confusion matrix](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py) example):

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_confusion_matrix_001.png"/>

For binary problems, we can get counts of true negatives, false positives, false negatives and true positives as follows:


In [None]:
y_true = [0, 0, 0, 1, 1, 1, 1, 1]

In [None]:
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]


In [None]:
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

In [None]:
tn, fp, fn, tp
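
To make the row and column convention concrete, the matrix can be rebuilt by counting (true, predicted) pairs by hand. This is a minimal sketch using the multiclass example from this section; `manual_cm` and `recall_per_class` are illustrative names.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Entry (i, j): samples whose true class is i and predicted class is j.
n_classes = 3
manual_cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    manual_cm[t, p] += 1

cm = confusion_matrix(y_true, y_pred)

# Rows sum to the number of true samples per class, so the diagonal
# divided by the row sums gives the recall of each class.
recall_per_class = np.diag(cm) / cm.sum(axis=1)
print(cm)
print(recall_per_class)
```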

### Classification Report

The [classification_report function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) builds a text report showing the main classification metrics. Here is a small example with custom *target_names* and inferred labels:


In [None]:
from sklearn.metrics import classification_report

In [None]:
y_true = [0, 1, 2, 2, 0]

In [None]:
y_pred = [0, 0, 2, 1, 0]

In [None]:
target_names = ['class 0', 'class 1', 'class 2']

In [None]:
print(classification_report(y_true, y_pred, target_names=target_names))

### Receiver Operating Characteristic (ROC)

A [receiver operating characteristic (ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate. [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_roc_001.png"/>

The function [roc_curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve) computes the [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). It requires the true binary value and the target scores, which can either be probability estimates of the positive class, confidence values, or binary decisions.  

In [None]:
import numpy as np
from sklearn.metrics import roc_curve

In [None]:
y = np.array([1, 1, 2, 2])

In [None]:
scores = np.array([0.1, 0.4, 0.35, 0.8])

In [None]:
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
fpr

In [None]:
tpr

In [None]:
thresholds

The [roc_auc_score function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) computes the **area under the receiver operating characteristic (ROC) curve**, which is also denoted by **AUC** or **AUROC**. By computing the area under the ROC curve, the curve information is summarized in one number. For more information see the [Wikipedia article on AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).

In [None]:
import numpy as np
from sklearn.metrics import roc_auc_score

In [None]:
y_true = np.array([0, 0, 1, 1])

In [None]:
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

In [None]:
roc_auc_score(y_true, y_scores)
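
The same number can be obtained in two other ways, which may help with interpretation: by integrating the curve returned by `roc_curve` with `sklearn.metrics.auc`, and by counting the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below assumes there are no tied scores; `pairwise` is an illustrative name.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the curve via the trapezoidal rule on the (fpr, tpr) points.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
area_from_curve = auc(fpr, tpr)

# Direct computation.
area_direct = roc_auc_score(y_true, y_scores)

# Ranking view: share of (positive, negative) pairs ordered correctly.
pos = y_scores[y_true == 1]
neg = y_scores[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(area_from_curve, area_direct, pairwise)
```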

## Regression Metrics

The [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module implements several loss, score, and utility functions to measure regression performance. Some of those have been enhanced to handle the multi-output case: [mean_squared_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error), [mean_absolute_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error), [explained_variance_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score) and [r2_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score). 

### Explained Variance Score

The [explained_variance_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score) computes the [explained variance regression score](https://en.wikipedia.org/wiki/Explained_variation).

If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is the [variance](https://en.wikipedia.org/wiki/Variance) (the square of the standard deviation), then the explained variance is estimated as follows:

$explained\_variance(y,\hat{y}) = 1 - \frac{Var\{y-\hat{y}\}}{Var\{y\}}$

The best possible score is 1.0, lower values are worse.


Here is a small example of usage of the explained_variance_score function:

In [None]:
from sklearn.metrics import explained_variance_score

In [None]:
y_true = [3, -0.5, 2, 7]

In [None]:
y_pred = [2.5, 0.0, 2, 8]

In [None]:
explained_variance_score(y_true, y_pred)

In [None]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]

In [None]:
y_pred = [[0, 2], [-1, 2], [8, -5]]

In [None]:
explained_variance_score(y_true, y_pred, multioutput='raw_values')

In [None]:
explained_variance_score(y_true, y_pred, multioutput=[0.3, 0.7])
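
The formula can be checked by hand with `np.var`. The sketch below also illustrates one consequence of the definition: since $Var\{y-\hat{y}\}$ does not change when a constant is added to the predictions, the score ignores a systematic offset. The names `manual_ev` and `ev_shifted` are illustrative.

```python
import numpy as np
from sklearn.metrics import explained_variance_score

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# 1 - Var{y - y_hat} / Var{y}, using the population variance.
manual_ev = 1 - np.var(y_true - y_pred) / np.var(y_true)
ev = explained_variance_score(y_true, y_pred)

# Shifting every prediction by a constant leaves the score unchanged.
ev_shifted = explained_variance_score(y_true, y_pred + 1.0)

print(manual_ev, ev, ev_shifted)
```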

### Mean Absolute Error

The [mean_absolute_error function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error) computes the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error), a risk metric corresponding to the expected value of the absolute error loss or $l1$-norm loss.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean absolute error $(MAE)$ estimated over $n_{samples}$ is defined as:

$MAE(y,\hat{y}) = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}|y_i - \hat{y}_i|$

Here is a small example of usage of the [mean_absolute_error function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error):



In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
y_true = [3, -0.5, 2, 7]

In [None]:
y_pred = [2.5, 0.0, 2, 8]

In [None]:
mean_absolute_error(y_true, y_pred)

In [None]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]

In [None]:
y_pred = [[0, 2], [-1, 2], [8, -5]]


In [None]:
mean_absolute_error(y_true, y_pred)

In [None]:
mean_absolute_error(y_true, y_pred, multioutput='raw_values')

In [None]:
mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
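
For the multi-output calls above it can help to see what the `multioutput` options compute. This sketch reproduces them with NumPy; `per_output`, `uniform` and `weighted` are illustrative names.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([[0.5, 1], [-1, 1], [7, -6]])
y_pred = np.array([[0, 2], [-1, 2], [8, -5]])

# 'raw_values': one MAE per output column.
per_output = np.mean(np.abs(y_true - y_pred), axis=0)

# Default ('uniform_average'): plain mean of the per-output errors.
uniform = per_output.mean()

# A list of weights: weighted average of the per-output errors.
weighted = np.average(per_output, weights=[0.3, 0.7])

print(per_output, uniform, weighted)
```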

### Mean Squared Error

The [mean_squared_error function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) computes [mean square error](https://en.wikipedia.org/wiki/Mean_squared_error), a risk metric corresponding to the expected value of the squared (quadratic) error or loss.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean squared error $(MSE)$ estimated over $n_{samples}$ is defined as:


$MSE(y,\hat{y}) = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}(y_i - \hat{y}_i)^2$

Here is a small example of usage of the [mean_squared_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)  function:

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
y_true = [3, -0.5, 2, 7]

In [None]:
y_pred = [2.5, 0.0, 2, 8]

In [None]:
mean_squared_error(y_true, y_pred)

In [None]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]

In [None]:
y_pred = [[0, 2], [-1, 2], [8, -5]]

In [None]:
mean_squared_error(y_true, y_pred)
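
A hand computation of the formula is sketched below, together with the root mean squared error (RMSE), which is simply the square root of the MSE and is expressed in the units of the target. The RMSE is taken here with `np.sqrt` as an illustrative choice; the variable names are assumptions as well.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# Mean of the squared residuals.
manual_mse = np.mean((y_true - y_pred) ** 2)
mse = mean_squared_error(y_true, y_pred)

# RMSE: same information, but in the units of y.
rmse = np.sqrt(mse)

# Multi-output: per-column MSE, averaged uniformly by default.
Y_true = np.array([[0.5, 1], [-1, 1], [7, -6]])
Y_pred = np.array([[0, 2], [-1, 2], [8, -5]])
mse_per_output = np.mean((Y_true - Y_pred) ** 2, axis=0)

print(manual_mse, rmse, mse_per_output)
```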

### Median Absolute Error

The [median_absolute_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error) is particularly interesting because it is robust to outliers. The loss is calculated by taking the median of all absolute differences between the target and the prediction.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the median absolute error $(MedAE)$ estimated over $n_{samples}$ is defined as:

$MedAE(y,\hat{y}) = median(|y_1-\hat{y}_1|, ... , |y_n-\hat{y}_n|)$

The median_absolute_error does not support multi-output.

Here is a small example of usage of the [median_absolute_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error) function:


In [None]:
from sklearn.metrics import median_absolute_error

In [None]:
y_true = [3, -0.5, 2, 7]

In [None]:
y_pred = [2.5, 0.0, 2, 8]

In [None]:
median_absolute_error(y_true, y_pred)
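
The robustness to outliers can be illustrated by corrupting a single prediction and comparing the median absolute error with the mean absolute error. This is a sketch; the outlier value 100 and the name `y_pred_outlier` are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
y_pred_outlier = [2.5, 0.0, 2, 100]  # hypothetical: one wildly wrong prediction

medae_clean = median_absolute_error(y_true, y_pred)
medae_outlier = median_absolute_error(y_true, y_pred_outlier)

mae_clean = mean_absolute_error(y_true, y_pred)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)

# The median is unaffected by the single outlier; the mean is dominated by it.
print(medae_clean, medae_outlier)
print(mae_clean, mae_outlier)
```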

### R2 Score, the Coefficient of Determination

The [r2_score function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score) computes $R^2$, the coefficient of determination. It provides a measure of how well future samples are likely to be predicted by the model. The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of $y$, disregarding the input features, would get an $R^2$ score of 0.0.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the score $R^2$ estimated over $n_{samples}$ is defined as:

$R^2(y,\hat{y}) = 1 - \frac{\sum_{i=0}^{n_{samples}-1}(y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_{samples}-1}(y_i - \bar{y})^2}$

where 

$\bar{y} = \frac{1}{n_{samples}}\sum_{i=0}^{n_{samples}-1}y_i$

Here is a small example of usage of the coefficient of determination function:

In [None]:
from sklearn.metrics import r2_score

In [None]:
y_true = [3, -0.5, 2, 7]

In [None]:
y_pred = [2.5, 0.0, 2, 8]

In [None]:
r2_score(y_true, y_pred)

In [None]:
y_true = [[0.5, 1], [-1, 1], [7, -6]]

In [None]:
y_pred = [[0, 2], [-1, 2], [8, -5]]

In [None]:
r2_score(y_true, y_pred, multioutput='variance_weighted')

In [None]:
r2_score(y_true, y_pred, multioutput='uniform_average')

In [None]:
r2_score(y_true, y_pred, multioutput='raw_values')

In [None]:
r2_score(y_true, y_pred, multioutput=[0.3, 0.7])
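
The definition and the two special cases mentioned earlier (a constant model scoring 0.0, and negative scores for models worse than that) can be verified by hand. This is a minimal sketch; `y_const` and `y_bad` are made-up predictions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

# A constant model that always predicts the mean of y scores exactly 0.
y_const = np.full_like(y_true, y_true.mean(), dtype=float)
r2_const = r2_score(y_true, y_const)

# A model worse than the constant one gets a negative score.
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)

print(manual_r2, r2_const, r2_bad)
```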

## Validation Curves

Plotting Scores to Evaluate Models 

The generalization error of each estimator can be decomposed in terms of bias, variance and noise:

- the **bias** of an estimator is its average error for different training sets;
- the **variance** of an estimator indicates how sensitive it is to varying training sets;
- **noise** is a property of the data.


Consider the function $f(x) = \cos(\frac{3}{2}\pi x)$ and some noisy samples from that function. Let's use three different estimators to fit the function: linear regression with polynomial features of degree 1, 4 and 15.

<a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html"><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_underfitting_overfitting_001.png"/></a>

Observing the results we see that:

- the first estimator can at best provide only a poor fit to the samples and the true function because it is too simple (high bias). This is known as **underfitting**!
- the second estimator approximates the true function almost perfectly;
- the last estimator approximates the training data perfectly but does not fit the true function very well, i.e. it is very sensitive to varying training data (high variance). This is known as **overfitting**!
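
The three fits can be compared numerically with cross-validated mean squared error. The sketch below follows the spirit of the linked scikit-learn example, but the sample size, noise level, random seed and the names `cv_mse` and `model` are assumptions made here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of f(x) = cos(1.5 * pi * x); sizes and seed are illustrative.
rng = np.random.RandomState(0)
n_samples = 30
X = np.sort(rng.rand(n_samples))
y = np.cos(1.5 * np.pi * X) + rng.randn(n_samples) * 0.1

cv_mse = {}
for degree in (1, 4, 15):
    # Polynomial features followed by ordinary least squares.
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    scores = cross_val_score(model, X[:, np.newaxis], y,
                             scoring='neg_mean_squared_error', cv=10)
    cv_mse[degree] = -scores.mean()  # undo the sign convention of the scorer
    print(degree, cv_mse[degree])
```

Under these assumptions the degree-4 model should have the lowest cross-validated error, with degree 1 underfitting and degree 15 overfitting.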



Bias and variance are inherent properties of estimators and we want to keep them as low as possible!

- select learning algorithms and hyperparameters that keep both as low as possible;
- the variance of a model can usually be reduced by using more training data, but collecting more data is only worthwhile if the true function is too complex to be approximated by an estimator with a lower variance.

The condition is easy to see in 1D problems like the earlier one. In high-dimensional spaces models can become very difficult to visualize. For these cases we use the tools that follow.


To validate a model we need a **scoring function**, for example accuracy for classifiers. 

Choose multiple hyperparameters of an estimator using grid search to get the maximum score on one or more  validation sets. 

Once hyperparameters have been optimized against a validation score, that score is biased and is no longer a good estimate of the generalization. To get a proper estimate of the generalization we have to compute the score on a separate test set.
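
A minimal sketch of this workflow is given below: tune with grid search on a training split, then score once on a held-out test split. The iris data, the `SVC` estimator, the parameter grid and the split size are all assumptions chosen for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Hold out a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Illustrative grid; each candidate is scored by 5-fold cross-validation.
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(svm.SVC(), param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)

# Validation score of the selected model (optimistically biased by selection).
validation_score = search.best_score_

# Score on the untouched test set: the estimate of generalization to report.
test_score = search.score(X_test, y_test)

print(search.best_params_, validation_score, test_score)
```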


Plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.

In the example below we vary the parameter $\gamma$ of an SVM on the digits dataset:


<a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html"><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_validation_curve_001.png"/></a>

1. If the training score and the validation score are both low, the estimator is underfitting.
2. If the training score is high and the validation score is low, the estimator is overfitting; if both are high, it is working well.
3. A low training score and a high validation score is usually not possible.

[Get the code here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html#sphx-glr-auto-examples-model-selection-plot-validation-curve-py)
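
The scores behind such a plot can be computed with `sklearn.model_selection.validation_curve`. The following is a sketch only: the subset of 500 digits (used to keep the run short), the range of $\gamma$ values and `cv=3` are assumptions, and the plotting step is left out.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # assumption: small subset to keep the example fast

# Illustrative range of gamma values on a logarithmic scale.
param_range = np.logspace(-6, -1, 5)

# One row per gamma value, one column per cross-validation fold.
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name='gamma', param_range=param_range,
    cv=3, scoring='accuracy')

train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

for gamma, tr, va in zip(param_range, train_mean, valid_mean):
    print(f"gamma={gamma:.0e}  train={tr:.2f}  validation={va:.2f}")
```

A large gap between the training and validation means at the largest $\gamma$ corresponds to the overfitting case in the list above.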


### Learning Curve

**A learning curve shows the validation and training score of an estimator for varying numbers of training samples.** 

It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. 

If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data. 

Example: naive Bayes roughly converges to a low score:



<a href="https://scikit-learn.org/stable/modules/learning_curve.html#learning-curve"><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_learning_curve_0011.png"/></a>

We will probably have to use an estimator or a parametrization of the current estimator that can learn more complex concepts (i.e. has a lower bias).

If the training score is much greater than the validation score for the maximum number of training samples, adding more training samples will most likely increase generalization.

Example: SVM benefiting from more training examples:

<a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html"><img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_learning_curve_002.png"/></a>

[Get code here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py)
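
The numbers behind a learning curve can be obtained with `sklearn.model_selection.learning_curve`. This sketch uses Gaussian naive Bayes on the digits dataset; the dataset, the five training-set sizes and `cv=5` are assumptions for illustration, and plotting is omitted.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Fractions of the available training data to evaluate (illustrative).
train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring='accuracy')

# Rows correspond to training-set sizes, columns to cross-validation folds.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, valid_mean):
    print(f"n_train={n}  train={tr:.2f}  validation={va:.2f}")
```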