# Model evaluation

There are three different approaches to evaluating the quality of a model's predictions:
* Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. This is not discussed on this page, but in each estimator’s documentation.
* Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section The scoring parameter: defining model evaluation rules.
* Metric functions: The metrics module implements functions assessing prediction error for specific purposes. These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics.

Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions.
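The three approaches listed above can be put side by side in a short sketch. The choice of LogisticRegression, the iris data, the train/test split and the max_iter value are assumptions made for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 1. Estimator score method: the default criterion (accuracy for classifiers)
score_method = clf.score(X_test, y_test)

# 2. Scoring parameter: the metric is chosen by name inside a cross-validation tool
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            scoring='accuracy', cv=5)

# 3. Metric function: called directly on true labels and predictions
metric_value = accuracy_score(y_test, clf.predict(X_test))

print(score_method, cv_scores.mean(), metric_value)
```

For a classifier, the first and third values coincide because the default score is accuracy.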

## The scoring parameter

Model selection and evaluation tools, such as model_selection.GridSearchCV and model_selection.cross_val_score, take a scoring parameter that controls which metric they apply to the estimators being evaluated.

For the most common use cases, you can designate a scorer object with the scoring parameter; the table below shows all possible values. All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics which measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error, which returns the negated value of the metric.

<table border="1" class="docutils">
<colgroup>
<col width="26%">
<col width="40%">
<col width="33%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Scoring</th>
<th class="head">Function</th>
<th class="head">Comment</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><strong>Classification</strong></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>‘accuracy’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score" title="sklearn.metrics.accuracy_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.accuracy_score</span></code></a></td>
<td>&nbsp;</td>
</tr>
<tr class="row-even"><td>‘average_precision’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score" title="sklearn.metrics.average_precision_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.average_precision_score</span></code></a></td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>‘f1’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score" title="sklearn.metrics.f1_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.f1_score</span></code></a></td>
<td>for binary targets</td>
</tr>
<tr class="row-even"><td>‘f1_micro’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score" title="sklearn.metrics.f1_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.f1_score</span></code></a></td>
<td>micro-averaged</td>
</tr>
<tr class="row-odd"><td>‘f1_macro’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score" title="sklearn.metrics.f1_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.f1_score</span></code></a></td>
<td>macro-averaged</td>
</tr>
<tr class="row-even"><td>‘f1_weighted’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score" title="sklearn.metrics.f1_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.f1_score</span></code></a></td>
<td>weighted average</td>
</tr>
<tr class="row-odd"><td>‘f1_samples’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score" title="sklearn.metrics.f1_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.f1_score</span></code></a></td>
<td>by multilabel sample</td>
</tr>
<tr class="row-even"><td>‘neg_log_loss’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.log_loss.html#sklearn.metrics.log_loss" title="sklearn.metrics.log_loss"><code class="xref py py-func docutils literal"><span class="pre">metrics.log_loss</span></code></a></td>
<td>requires <code class="docutils literal"><span class="pre">predict_proba</span></code> support</td>
</tr>
<tr class="row-odd"><td>‘precision’ etc.</td>
<td><a class="reference internal" href="generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score" title="sklearn.metrics.precision_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.precision_score</span></code></a></td>
<td>suffixes apply as with ‘f1’</td>
</tr>
<tr class="row-even"><td>‘recall’ etc.</td>
<td><a class="reference internal" href="generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score" title="sklearn.metrics.recall_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.recall_score</span></code></a></td>
<td>suffixes apply as with ‘f1’</td>
</tr>
<tr class="row-odd"><td>‘roc_auc’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score" title="sklearn.metrics.roc_auc_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.roc_auc_score</span></code></a></td>
<td>&nbsp;</td>
</tr>
<tr class="row-even"><td><strong>Clustering</strong></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>‘adjusted_rand_score’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score" title="sklearn.metrics.adjusted_rand_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.adjusted_rand_score</span></code></a></td>
<td>&nbsp;</td>
</tr>
<tr class="row-even"><td><strong>Regression</strong></td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>‘neg_mean_absolute_error’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error" title="sklearn.metrics.mean_absolute_error"><code class="xref py py-func docutils literal"><span class="pre">metrics.mean_absolute_error</span></code></a></td>
<td>&nbsp;</td>
</tr>
<tr class="row-even"><td>‘neg_mean_squared_error’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error" title="sklearn.metrics.mean_squared_error"><code class="xref py py-func docutils literal"><span class="pre">metrics.mean_squared_error</span></code></a></td>
<td>&nbsp;</td>
</tr>
<tr class="row-odd"><td>‘neg_median_absolute_error’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error" title="sklearn.metrics.median_absolute_error"><code class="xref py py-func docutils literal"><span class="pre">metrics.median_absolute_error</span></code></a></td>
<td>&nbsp;</td>
</tr>
<tr class="row-even"><td>‘r2’</td>
<td><a class="reference internal" href="generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score" title="sklearn.metrics.r2_score"><code class="xref py py-func docutils literal"><span class="pre">metrics.r2_score</span></code></a></td>
<td>&nbsp;</td>
</tr>
</tbody>
</table>

In [1]:
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
X, y = iris.data, iris.target
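# neg_log_loss requires predict_proba, so probability estimates are enabled
clf = svm.SVC(probability=True, random_state=0)
print(cross_val_score(clf, X, y, scoring='neg_log_loss'))

# An unrecognised scoring string raises a ValueError listing the valid options
model = svm.SVC()
cross_val_score(model, X, y, scoring='wrong_choice')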

[-0.07490352 -0.16449405 -0.06685511]


ValueError: 'wrong_choice' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
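To illustrate the sign convention for error metrics described above, here is a minimal sketch using neg_mean_squared_error. The synthetic regression data, the LinearRegression model and the variable names are assumptions chosen for the example:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only to have something to score
X_reg, y_reg = make_regression(n_samples=100, n_features=5, noise=10.0,
                               random_state=0)

# The scorer returns the negated MSE, so every value is <= 0
neg_mse = cross_val_score(LinearRegression(), X_reg, y_reg,
                          scoring='neg_mean_squared_error', cv=5)

# Flip the sign to recover the usual (non-negative) mean squared error
mse = -neg_mse
print(neg_mse)
print(mse)
```

Values closer to zero indicate a better model under this scorer, which keeps the "higher is better" rule intact.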

#### Defining your scoring strategy from metric functions

The module sklearn.metrics also exposes a set of simple functions measuring a prediction error given ground truth and prediction:

* functions ending with _score return a value to maximize, the higher the better.
* functions ending with _error or _loss return a value to minimize, the lower the better. When converting into a scorer object using make_scorer, set the greater_is_better parameter to False (True by default; see the parameter description below).

Metrics available for various machine learning tasks are detailed in sections below.

Many metrics are not given names to be used as scoring values, sometimes because they require additional parameters, such as fbeta_score. In such cases, you need to generate an appropriate scoring object. The simplest way to generate a callable object for scoring is by using make_scorer. That function converts metrics into callables that can be used for model evaluation.
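As a rough sketch of what make_scorer does with a loss function, the snippet below wraps mean_absolute_error with greater_is_better=False and compares the result with the predefined 'neg_mean_absolute_error' scorer obtained through get_scorer. The data and the LinearRegression model are assumptions for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, make_scorer, mean_absolute_error

X_reg, y_reg = make_regression(n_samples=50, n_features=3, noise=5.0,
                               random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)

# Plain metric function: lower is better
mae = mean_absolute_error(y_reg, reg.predict(X_reg))

# Wrapped as a scorer: the sign is flipped so that higher is better
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
wrapped_value = mae_scorer(reg, X_reg, y_reg)

# The predefined scorer with the equivalent name
builtin_value = get_scorer('neg_mean_absolute_error')(reg, X_reg, y_reg)

print(mae, wrapped_value, builtin_value)
```

A scorer is called with the fitted estimator, the data and the targets, whereas a metric function is called with the targets and the predictions.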

One typical use case is to wrap an existing metric function from the library with non-default values for its parameters, such as the beta parameter for the fbeta_score function:

In [2]:
from sklearn.metrics import fbeta_score, make_scorer
ftwo_scorer = make_scorer(fbeta_score, beta=2)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
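The cell above only builds the grid search. A possible way to run it is sketched below, assuming a synthetic binary dataset from make_classification (fbeta_score with default averaging expects binary targets); the dataset, max_iter and cv values are illustrative choices, not part of the original example:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Hypothetical binary classification data for the demonstration
X_bin, y_bin = make_classification(n_samples=200, n_features=10,
                                   random_state=0)

ftwo_scorer = make_scorer(fbeta_score, beta=2)
grid = GridSearchCV(LinearSVC(max_iter=10000), param_grid={'C': [1, 10]},
                    scoring=ftwo_scorer, cv=3)
grid.fit(X_bin, y_bin)

# best_score_ is the mean cross-validated F2 score of the best candidate
print(grid.best_params_, grid.best_score_)
```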

The second use case is to build a completely custom scorer object from a simple Python function using make_scorer, which can take several parameters:

* the Python function you want to use (my_custom_loss_func in the example below)
* whether the Python function returns a score (greater_is_better=True, the default) or a loss (greater_is_better=False). If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
* for classification metrics only: whether the Python function you provided requires continuous decision certainties (needs_threshold=True). The default value is False.
* any additional parameters, such as beta in fbeta_score or labels in f1_score.

Here is an example of building custom scorers, and of using the greater_is_better parameter:

In [3]:
import numpy as np
def my_custom_loss_func(ground_truth, predictions):
    diff = np.abs(ground_truth - predictions).max()
    return np.log(1 + diff)

# The loss scorer will negate the return value of my_custom_loss_func,
#  which will be np.log(2), 0.693, given the values for ground_truth
#  and predictions defined below.
loss  = make_scorer(my_custom_loss_func, greater_is_better=False)
score = make_scorer(my_custom_loss_func, greater_is_better=True)
# Two samples with one feature each, matching the two target values
ground_truth = [[1], [1]]
predictions  = [0, 1]
from sklearn.dummy import DummyClassifier
# What is the dummy classifier? It is covered in the Dummy Estimators section below
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf = clf.fit(ground_truth, predictions)
loss(clf, ground_truth, predictions)   # -0.693..., the negated loss

score(clf, ground_truth, predictions)  # 0.693...

0.69314718055994529
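A custom scorer built this way can be passed wherever a scoring parameter is accepted. The sketch below reuses my_custom_loss_func with cross_val_score; the Ridge model and the synthetic data are assumptions picked for the example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def my_custom_loss_func(ground_truth, predictions):
    # Log of (1 + largest absolute error); always >= 0
    diff = np.abs(ground_truth - predictions).max()
    return np.log(1 + diff)

loss = make_scorer(my_custom_loss_func, greater_is_better=False)

X_reg, y_reg = make_regression(n_samples=60, n_features=3, noise=5.0,
                               random_state=0)

# One negated loss value per fold; values closer to zero are better
cv_losses = cross_val_score(Ridge(), X_reg, y_reg, scoring=loss, cv=3)
print(cv_losses)
```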

## Other scoring functions

There are so many different scoring functions that we cannot go over all of them here, but we will go over some extremely useful ones.

#### Confusion Matrix

The confusion_matrix function evaluates classification accuracy by computing the confusion matrix.

By definition, entry i, j in a confusion matrix is the number of observations actually in group i, but predicted to be in group j. Here is an example:

In [4]:
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
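To make the definition concrete, the same matrix can be rebuilt by hand. This is only a sketch of the counting rule (row = actual group, column = predicted group), not how the library implements it:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Map each label to a row/column position, in sorted label order
labels = sorted(set(y_true) | set(y_pred))
index = {label: k for k, label in enumerate(labels)}

# Entry [i, j] counts observations actually in group i but predicted as group j
manual = np.zeros((len(labels), len(labels)), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    manual[index[actual], index[predicted]] += 1

print(manual)
print(np.array_equal(manual, confusion_matrix(y_true, y_pred)))
```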

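For binary problems, the four entries of the matrix can be unpacked into the counts of true negatives, false positives, false negatives and true positives:

In [5]: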
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tn, fp, fn, tp


(2, 1, 2, 3)
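These four counts are enough to derive the usual binary metrics. The sketch below computes precision, recall and accuracy from them and checks the results against the corresponding functions in sklearn.metrics:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                   # 3 / 4
recall = tp / (tp + fn)                      # 3 / 5
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 5 / 8

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(accuracy, accuracy_score(y_true, y_pred))
```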

#### Classification Report

The classification_report function builds a text report showing the main classification metrics. Here is a small example with custom target_names and inferred labels:

In [6]:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))



             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5
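When the numbers are needed in code rather than as text, one option is precision_recall_fscore_support, which returns the same per-class values as arrays. A minimal sketch on the same data:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

# One entry per class, in sorted label order
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
for label, p, r, f, s in zip([0, 1, 2], precision, recall, f1, support):
    print(label, round(p, 2), round(r, 2), round(f, 2), s)

# Support-weighted averages, matching the last row of the report
weighted = precision_recall_fscore_support(y_true, y_pred, average='weighted')
print(np.round(weighted[:3], 2))
```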



#### Dummy Estimators

Dummy estimators are covered here because they did not feel right in any of the other sections.

When doing supervised learning, a simple sanity check consists of comparing one’s estimator against simple rules of thumb. DummyClassifier implements several such simple strategies for classification:
* stratified generates random predictions by respecting the training set class distribution.
* most_frequent always predicts the most frequent label in the training set.
* prior always predicts the class that maximizes the class prior (like most_frequent) and predict_proba returns the class prior.
* uniform generates predictions uniformly at random.
* constant always predicts a constant label that is provided by the user. A major motivation of this method is F1-scoring, when the positive class is in the minority.

Note that with all these strategies, the predict method completely ignores the input data!

To illustrate DummyClassifier, first let’s create an imbalanced dataset:

In [7]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
y[y != 1] = -1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Now let's compare the accuracy of SVC with that of the most_frequent strategy:

In [8]:
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test) 



0.63157894736842102

In [9]:
clf = DummyClassifier(strategy='most_frequent',random_state=0)
clf.fit(X_train, y_train)

clf.score(X_test, y_test)  

0.57894736842105265

So the linear SVC doesn't do much better than the dummy classifier! But with a simple kernel change:

In [10]:
clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)  


0.97368421052631582
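For comparison, the other DummyClassifier strategies listed earlier can be scored on the same kind of split. The sketch below rebuilds the imbalanced iris labels so that it runs on its own; the variable names and the constant label chosen are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
y_imb = np.where(y_iris == 1, 1, -1)   # class 1 vs. everything else
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_imb, random_state=0)

baseline_scores = {}
for strategy in ['stratified', 'most_frequent', 'prior', 'uniform']:
    dummy = DummyClassifier(strategy=strategy, random_state=0).fit(X_tr, y_tr)
    baseline_scores[strategy] = dummy.score(X_te, y_te)

# The constant strategy needs the label to predict; here the minority class
dummy = DummyClassifier(strategy='constant', constant=1).fit(X_tr, y_tr)
baseline_scores['constant'] = dummy.score(X_te, y_te)

for strategy, accuracy in baseline_scores.items():
    print(strategy, accuracy)
```

Any real model should beat all of these baselines by a clear margin before its score is taken seriously.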

DummyRegressor also implements four simple rules of thumb for regression:
* mean always predicts the mean of the training targets.
* median always predicts the median of the training targets.
* quantile always predicts a user provided quantile of the training targets.
* constant always predicts a constant value that is provided by the user.
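The four DummyRegressor strategies can be tried on a tiny hand-made target vector. The data below is made up for the example, and the quantile is set to 0.5 so that its result can be compared with the median:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Features are irrelevant for dummy estimators; only the targets matter
X_toy = np.zeros((5, 1))
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

regressors = {
    'mean': DummyRegressor(strategy='mean'),
    'median': DummyRegressor(strategy='median'),
    'quantile': DummyRegressor(strategy='quantile', quantile=0.5),
    'constant': DummyRegressor(strategy='constant', constant=0.0),
}

baseline_predictions = {}
for name, reg in regressors.items():
    reg.fit(X_toy, y_toy)
    # Every input receives the same prediction, so one sample is enough
    baseline_predictions[name] = float(reg.predict(X_toy[:1])[0])

print(baseline_predictions)
```

The outlier at 10.0 pulls the mean above the median, which is one reason to check both baselines when the targets are skewed.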