### Intro

* 3 approaches to eval prediction quality:
   * Estimator score (`score` method, provided by each estimator)
   * Scoring parameter (provided by cross-validation tools)
   * Metric function (metrics module)

Also: [dummy estimators](http://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)

### Scoring parameters

* convention: higher return values are better than lower ones, so loss metrics are exposed negated (ex: `neg_log_loss`)

In [43]:
#example
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = svm.SVC(
    probability=True, 
    random_state=0)

print(cross_val_score(
    clf, X, y, 
    scoring='neg_log_loss'))

#model = svm.SVC()
#cross_val_score(
#    model, X, y, 
#    scoring='wrong_choice')

[-0.07475338 -0.16911634 -0.0698804 ]


### Example - defining score strategy from metric functions

* Use case #1: wrap existing metric from library with non-default parameter values, ex: `beta`.

In [44]:
#example - defining score strategy from metric functions
from sklearn.metrics import fbeta_score, make_scorer
ftwo_scorer = make_scorer(
    fbeta_score, 
    beta=2)

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
grid = GridSearchCV(
    LinearSVC(), 
    param_grid={'C': [1, 10]}, 
    scoring=ftwo_scorer)

* Use case #2: build custom scorer from python function
* using [make_scorer](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer)


In [45]:
import numpy as np
def my_custom_loss_func(ground_truth, predictions):
    diff = np.abs(ground_truth - predictions).max()
    return np.log(1 + diff)

# The `loss` scorer negates the return value of my_custom_loss_func,
#  which is np.log(2), 0.693, for the X and y values defined below.

loss  = make_scorer(my_custom_loss_func, greater_is_better=False)
score = make_scorer(my_custom_loss_func, greater_is_better=True)
X = [[1], [1]]   # two samples, one feature each
y = [0, 1]       # one target per sample

from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf = clf.fit(X, y)

print(loss(clf, X, y))

print(score(clf, X, y))


-0.69314718056
0.69314718056


### Classification metrics

### accuracy score

* either the fraction (default) or the count (`normalize=False`) of correct predictions

[demo](plot_permutation_test_for_classification.ipynb)

In [46]:
# accuracy score
import numpy as np
from sklearn.metrics import accuracy_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

print(accuracy_score(
        y_true, y_pred))
print(accuracy_score(
        y_true, y_pred, normalize=False))

0.5
2


### [Cohen's Kappa](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html#sklearn.metrics.cohen_kappa_score) | [Wikipedia](https://en.wikipedia.org/wiki/Cohen%27s_kappa)

* Intended to compare labelings by human annotators, not a classifier vs ground truth
* range [-1,+1]; >0.8 generally considered good; 0 = basically random

In [47]:
from sklearn.metrics import cohen_kappa_score
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
cohen_kappa_score(y_true, y_pred)

0.4285714285714286
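
The kappa value above can be reproduced by hand as (observed agreement - chance agreement) / (1 - chance agreement). The following is a minimal sketch in plain Python; the helper name `cohen_kappa` is illustrative and not part of scikit-learn.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (illustrative helper)."""
    n = len(labels_a)
    # Observed agreement: fraction of positions where both labelings match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label counts, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

kappa = cohen_kappa([2, 0, 2, 2, 0, 1], [0, 0, 2, 2, 0, 2])
print(kappa)
```

Here the observed agreement is 4/6 and the chance agreement is 15/36, which gives roughly 0.4286.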

### [Confusion Matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix) | [Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix) | [Demo](plot_confusion_matrix.ipynb)

* definition: data(i,j) in confusion matrix = #observations actually in group i, but predicted to be in group j.

In [48]:
# Confusion matrix

from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
print("matrix:\n",confusion_matrix(
        y_true, y_pred))

# To get counts of (true/false)(positives/negatives)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("counts:\n",tn, fp, fn, tp)

matrix:
 [[2 0 0]
 [0 0 1]
 [1 0 2]]
counts:
 2 1 2 3


### [Classification Report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report)

* returns text report with main metrics

In [49]:
# classification report

from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']

print(classification_report(
        y_true, y_pred, target_names=target_names))

             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5



### [Hamming loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html#sklearn.metrics.hamming_loss) | [Wikipedia](https://en.wikipedia.org/wiki/Hamming_distance)

* returns the fraction of labels that are predicted incorrectly (avg Hamming distance between true and predicted label sets)

In [50]:
# hamming loss
from sklearn.metrics import hamming_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]

print(hamming_loss(
        y_true, y_pred))

# multilabel use case, binary label indicators

print(hamming_loss(
        np.array([[0, 1], [1, 1]]), 
        np.zeros((2, 2))))


0.25
0.75
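
As a rough check on both outputs, the Hamming loss can be sketched as "mismatched label slots divided by total label slots". The helper below is hypothetical (not a scikit-learn function) and treats each multilabel row as a list of binary indicators.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of label slots that differ (illustrative helper)."""
    def flatten(y):
        # Multilabel rows are lists; multiclass entries are plain labels.
        return [v for row in y
                for v in (row if isinstance(row, (list, tuple)) else [row])]
    t, p = flatten(y_true), flatten(y_pred)
    return sum(a != b for a, b in zip(t, p)) / len(t)

# Multiclass case: 1 of 4 labels is wrong.
multiclass = hamming_loss_manual([2, 2, 3, 4], [1, 2, 3, 4])
# Multilabel case: 3 of the 4 indicator slots are wrong.
multilabel = hamming_loss_manual([[0, 1], [1, 1]], [[0, 0], [0, 0]])
print(multiclass, multilabel)
```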


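### [Jaccard score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html#sklearn.metrics.jaccard_similarity_score) | [Wikipedia](https://en.wikipedia.org/wiki/Jaccard_index)

* note: `jaccard_similarity_score` was deprecated in scikit-learn 0.21 and removed in 0.23 in favor of `jaccard_score`; the cell below needs a release that still ships the old function.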

In [51]:
# jaccard score

import numpy as np
from sklearn.metrics import jaccard_similarity_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
print(jaccard_similarity_score(
        y_true, y_pred))

print(jaccard_similarity_score(
        y_true, y_pred, normalize=False))

0.5
2


### [Precision](https://en.wikipedia.org/wiki/Precision_and_recall#Precision), [Recall](https://en.wikipedia.org/wiki/Precision_and_recall#Recall), [F1-score](https://en.wikipedia.org/wiki/F1_score)

* precision: ability to avoid labeling a negative sample as positive
* recall: ability to find all positive samples
* F-measure: weighted harmonic mean of precision & recall (best=1, worst=0)
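
The three definitions above can be sketched directly from true-positive, false-positive and false-negative counts. This is a minimal illustration, assuming binary labels with at least one actual and one predicted positive; `binary_prf` is a made-up helper name, not a scikit-learn API.

```python
def binary_prf(y_true, y_pred, beta=1.0, positive=1):
    """Precision, recall and F-beta for one positive class (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)   # how many predicted positives are real
    recall = tp / (tp + fn)      # how many real positives were found
    b2 = beta ** 2
    # Weighted harmonic mean: beta > 1 favors recall, beta < 1 favors precision.
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

y_true = [0, 1, 0, 1]
y_pred = [0, 1, 0, 0]
p, r, f_half = binary_prf(y_true, y_pred, beta=0.5)
_, _, f_one = binary_prf(y_true, y_pred, beta=1.0)
_, _, f_two = binary_prf(y_true, y_pred, beta=2.0)
print(p, r, f_half, f_one, f_two)
```

With one true positive, no false positives and one false negative, precision is 1.0 and recall is 0.5, so the F-beta value drops as `beta` shifts weight toward recall.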

[precision/recall curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve) |
[avg precision score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) |
[demo](plot_precision_recall.ipynb)

In [52]:
# binary classification examples

from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]

print("precision score:\n",metrics.precision_score(
        y_true, y_pred))
print("recall score:\n",metrics.recall_score(
        y_true, y_pred))
print("f1 score:\n",metrics.f1_score(
        y_true, y_pred))
print("fbeta score, beta=0.5\n",metrics.fbeta_score(
        y_true, y_pred, beta=0.5))
print("fbeta score, beta=1.0\n",metrics.fbeta_score(
        y_true, y_pred, beta=1))
print("fbeta score, beta=2.0\n",metrics.fbeta_score(
        y_true, y_pred, beta=2))
print("precision recall fscore support\n",metrics.precision_recall_fscore_support(
        y_true, y_pred, beta=0.5))

import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, threshold = precision_recall_curve(
    y_true, y_scores)
print("\n")
print("precision:\n",precision)
print("recall:\n",recall)
print("threshold:\n",threshold)
print("avg precision score:\n",average_precision_score(y_true, y_scores))

precision score:
 1.0
recall score:
 0.5
f1 score:
 0.666666666667
fbeta score, beta=0.5
 0.833333333333
fbeta score, beta=1.0
 0.666666666667
fbeta score, beta=2.0
 0.555555555556
precision recall fscore support
 (array([ 0.66666667,  1.        ]), array([ 1. ,  0.5]), array([ 0.71428571,  0.83333333]), array([2, 2]))


precision:
 [ 0.66666667  0.5         1.          1.        ]
recall:
 [ 1.   0.5  0.5  0. ]
threshold:
 [ 0.35  0.4   0.8 ]
avg precision score:
 0.791666666667


### Multiclass / multilabel classification metrics

* precision, recall, f-measures can be applied to each label independently

In [53]:
# example
from sklearn import metrics
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print("precision:\n",metrics.precision_score(
        y_true, y_pred, average='macro'))
print("recall:\n",metrics.recall_score(
        y_true, y_pred, average='micro'))
print("f1 score:\n",metrics.f1_score(
        y_true, y_pred, average='weighted')) 
print("fbeta score, avg=macro:\n",metrics.fbeta_score(
        y_true, y_pred, average='macro', beta=0.5))
print("precision, recall, fbeta, support (avg=None):\n",metrics.precision_recall_fscore_support(
        y_true, y_pred, beta=0.5, average=None))

precision:
 0.222222222222
recall:
 0.333333333333
f1 score:
 0.266666666667
fbeta score, avg=macro:
 0.238095238095
precision, recall, fbeta, support (avg=None):
 (array([ 0.66666667,  0.        ,  0.        ]), array([ 1.,  0.,  0.]), array([ 0.71428571,  0.        ,  0.        ]), array([2, 2, 2]))


### [Hinge loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html#sklearn.metrics.hinge_loss) | [Wikipedia](https://en.wikipedia.org/wiki/Hinge_loss)

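* binary case: returns avg of max(0, 1 - y * decision) over samples, with y in {-1, +1}; zero for samples classified correctly with margin >= 1, growing linearly otherwise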

In [54]:
# hinge loss, SVM classifier, binary class problem

from sklearn import svm
from sklearn.metrics import hinge_loss

X = [[0], [1]]
y = [-1, 1]

est = svm.LinearSVC(random_state=0)
est.fit(X, y)

pred_decision = est.decision_function([[-2], [3], [0.5]])

print("decision:\n",pred_decision)
print("loss:\n",hinge_loss(
        [-1, 1, 1], 
        pred_decision))

decision:
 [-2.18177944  2.36355888  0.09088972]
loss:
 0.303036760385
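
The binary loss printed above can be approximated by hand from the decision values. This is a minimal sketch using the rounded decision values from the output; `hinge_loss_manual` is a hypothetical helper name.

```python
def hinge_loss_manual(y_true, decision):
    """Mean of max(0, 1 - y * d) for labels in {-1, +1} (illustrative helper)."""
    return sum(max(0.0, 1.0 - y * d)
               for y, d in zip(y_true, decision)) / len(y_true)

# Decision values copied (rounded) from the cell output above.
decision = [-2.18177944, 2.36355888, 0.09088972]
loss = hinge_loss_manual([-1, 1, 1], decision)
print(loss)
```

Only the third sample contributes: it is on the correct side of the boundary but inside the margin, so it adds 1 - 0.0909 to the sum.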


In [55]:
# hinge loss, SVM classifier, multiclass problem

X = np.array([[0], [1], [2], [3]])
Y = np.array([0, 1, 2, 3])

labels = np.array([0, 1, 2, 3])
est = svm.LinearSVC()
est.fit(X, Y)

pred_decision = est.decision_function([[-1], [2], [3]])
print("decision:\n",pred_decision)

y_true = [0, 2, 3]
print("loss:\n",hinge_loss(
        y_true, pred_decision, labels=labels))

decision:
 [[ 1.27271735  0.03419428 -0.68377145 -1.4016886 ]
 [-1.45454347 -0.58117471 -0.37609483 -0.17096867]
 [-2.36363041 -0.78629771 -0.27353595  0.23927131]]
loss:
 0.564106298623


### [Log loss (logistic regression loss, cross-entropy loss)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn.metrics.log_loss) | [demo](plot_calibration_multiclass.ipynb)


In [56]:
#example
from sklearn.metrics import log_loss
y_true = [0, 0, 1, 1]
y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]

print("loss:\n",log_loss(
        y_true, y_pred)) 

loss:
 0.173807336691
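
Log loss is the mean negative log of the probability assigned to the true class. A minimal sketch follows, assuming class labels are integers that index the probability columns; `log_loss_manual` is an illustrative name only.

```python
import math

def log_loss_manual(y_true, y_prob):
    """Mean negative log-probability of the true class (illustrative helper)."""
    # row[label] picks the predicted probability of the sample's true class.
    return -sum(math.log(row[label])
                for label, row in zip(y_true, y_prob)) / len(y_true)

y_true = [0, 0, 1, 1]
y_prob = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
ll = log_loss_manual(y_true, y_prob)
print(ll)
```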


### [Matthews correlation coefficient (MCC)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html#sklearn.metrics.matthews_corrcoef) | [wiki](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient)

* a measure of binary (2-class) classification model quality; range [-1, +1]: +1 = perfect prediction, 0 = no better than random, -1 = total disagreement

In [57]:
#example
from sklearn.metrics import matthews_corrcoef
y_true = [+1, +1, +1, -1]
y_pred = [+1, -1, +1, +1]
print("MCC:\n",matthews_corrcoef(
        y_true, y_pred)) 

MCC:
 -0.333333333333
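
For the binary case, the coefficient can be sketched from the four confusion-matrix counts. The helper `mcc_binary` below is hypothetical and assumes none of the four marginal sums is zero.

```python
import math

def mcc_binary(y_true, y_pred, positive=1):
    """Matthews correlation coefficient from TP/TN/FP/FN (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator

mcc = mcc_binary([+1, +1, +1, -1], [+1, -1, +1, +1])
print(mcc)
```

With TP=2, TN=0, FP=1 and FN=1 this evaluates to -1/3, in line with the output above.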


### [Receiver Operating Characteristic (ROC)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve) | [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)

* illustrates binary classifier quality as its discrimination threshold is varied. 
* plots true positive rate (TPR = true positives / all actual positives, also called sensitivity) vs false positive rate (FPR = false positives / all actual negatives)

[demo](plot_roc.ipynb) | 
[demo w/ CV](plot_roc_crossval.ipynb) |
[demo - species modeling](plot_species_distribution_modeling.ipynb)

In [58]:
#example

import numpy as np
from sklearn.metrics import roc_curve

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)

print("FPR:\n",fpr)
print("TPR:\n",tpr)
print("thresholds:\n",thresholds)

FPR:
 [ 0.   0.5  0.5  1. ]
TPR:
 [ 0.5  0.5  1.   1. ]
thresholds:
 [ 0.8   0.4   0.35  0.1 ]
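
The curve points can be sketched by sweeping a threshold over the distinct scores and counting true and false positives at each step. This is a simplified illustration, not a reimplementation of `roc_curve`: the hypothetical helper `roc_points` keeps every distinct threshold and adds no extra start point.

```python
def roc_points(y_true, scores, pos_label):
    """FPR/TPR at each distinct score threshold, highest first (illustrative helper)."""
    n_pos = sum(y == pos_label for y in y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr, thresholds = [], [], []
    for thr in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(s >= thr and y == pos_label for y, s in zip(y_true, scores))
        fp = sum(s >= thr and y != pos_label for y, s in zip(y_true, scores))
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
        thresholds.append(thr)
    return fpr, tpr, thresholds

fpr, tpr, thresholds = roc_points([1, 1, 2, 2], [0.1, 0.4, 0.35, 0.8], pos_label=2)
print(fpr)
print(tpr)
print(thresholds)
```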


### [Zero One Loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.zero_one_loss.html#sklearn.metrics.zero_one_loss) | [demo](plot_adaboost_hastie_10_2.ipynb)

In [59]:
# zero one loss
from sklearn.metrics import zero_one_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]

print("loss:\n",zero_one_loss(
        y_true, y_pred))
print("loss, not normalized:\n",zero_one_loss(
        y_true, y_pred, normalize=False))

loss:
 0.25
loss, not normalized:
 1


### [Brier score loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.brier_score_loss.html#sklearn.metrics.brier_score_loss) | [Wikipedia](https://en.wikipedia.org/wiki/Brier_score) | [demo](plot_calibration_curve.ipynb)

* returns the mean squared difference between the actual outcome (0 or 1) and the predicted probability of that outcome (between 0 and 1)
* lower score = more accurate prediction

In [60]:
# brier score loss
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0])
y_true_categorical = np.array(["spam", "ham", "ham", "spam"])
y_prob = np.array([0.1, 0.9, 0.8, 0.4])
y_pred = np.array([0, 1, 1, 0])

print(brier_score_loss(
        y_true, y_prob))

print(brier_score_loss(
        y_true, 1-y_prob, pos_label=0))

print(brier_score_loss(
        y_true_categorical, y_prob, pos_label="ham"))

print(brier_score_loss(
        y_true, y_prob > 0.5))


0.055
0.055
0.055
0.0
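
The first and last values above follow directly from the mean squared difference. A minimal sketch, assuming the positive class is encoded as 1; `brier_manual` is an illustrative name.

```python
def brier_manual(y_true, y_prob):
    """Mean squared gap between outcome (0/1) and predicted probability."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.4]

soft = brier_manual(y_true, y_prob)
# Thresholding at 0.5 turns the probabilities into hard 0/1 predictions.
hard = brier_manual(y_true, [1 if p > 0.5 else 0 for p in y_prob])
print(soft, hard)
```

The hard predictions are all correct here, so their score is 0, while the probabilistic forecast is penalized for every bit of uncertainty.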


### Multi-Label Ranking metrics

### [Coverage Error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.coverage_error.html#sklearn.metrics.coverage_error)

* finds how far down the score-ranked label list one must go, on average, so that all true labels of a sample are covered

In [61]:
#example
import numpy as np
from sklearn.metrics import coverage_error

y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
coverage_error(y_true, y_score)

2.5
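
One way to read the 2.5 above: for each sample, count how many labels score at least as high as the lowest-scored true label, then average. The sketch below assumes binary indicator rows with at least one true label each; `coverage_error_manual` is a hypothetical helper.

```python
def coverage_error_manual(y_true, y_score):
    """Average ranking depth needed to cover all true labels (illustrative helper)."""
    depths = []
    for truth, scores in zip(y_true, y_score):
        # Score of the true label that is ranked lowest for this sample.
        lowest_true = min(s for t, s in zip(truth, scores) if t == 1)
        # Number of labels ranked at or above that score.
        depths.append(sum(s >= lowest_true for s in scores))
    return sum(depths) / len(depths)

y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
cov = coverage_error_manual(y_true, y_score)
print(cov)
```

The first sample needs the top 2 labels and the second needs all 3, hence the average of 2.5.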

### [Label Ranking Avg Precision (LRAP)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.label_ranking_average_precision_score.html#sklearn.metrics.label_ranking_average_precision_score)

* for each ground truth label: the fraction of labels ranked at or above it that are also true labels; returns the average over labels and samples (best = 1)

In [62]:
# example LRAP
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
label_ranking_average_precision_score(y_true, y_score) 

0.41666666666666663
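
The LRAP value can be sketched per true label as "true labels ranked at or above it" divided by "all labels ranked at or above it". This is a simplified illustration, assuming every sample has at least one true label; `lrap_manual` is not a scikit-learn function.

```python
def lrap_manual(y_true, y_score):
    """Label ranking average precision (illustrative helper)."""
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        true_scores = [s for t, s in zip(truth, scores) if t == 1]
        ratios = []
        for s_true in true_scores:
            rank = sum(s >= s_true for s in scores)            # all labels at or above
            true_above = sum(s >= s_true for s in true_scores)  # true labels at or above
            ratios.append(true_above / rank)
        per_sample.append(sum(ratios) / len(ratios))
    return sum(per_sample) / len(per_sample)

y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
lrap = lrap_manual(y_true, y_score)
print(lrap)
```

The true label is ranked 2nd in the first sample (1/2) and 3rd in the second (1/3), and the mean of those two ratios is about 0.4167.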

### [Ranking Loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.label_ranking_loss.html#sklearn.metrics.label_ranking_loss)

In [63]:
#example
import numpy as np
from sklearn.metrics import label_ranking_loss

y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
print(label_ranking_loss(
        y_true, y_score))

# With the following prediction, the ranking is perfect and the loss is minimal
y_score = np.array([[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
print(label_ranking_loss(
        y_true, y_score))

0.75
0.0
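
Ranking loss can be sketched as the fraction of (true label, false label) pairs in which the false label scores at least as high as the true one, averaged over samples. In this illustration ties are counted as errors, which is an assumption of the sketch; `ranking_loss_manual` is a hypothetical helper.

```python
def ranking_loss_manual(y_true, y_score):
    """Average fraction of wrongly ordered (true, false) label pairs."""
    losses = []
    for truth, scores in zip(y_true, y_score):
        true_scores = [s for t, s in zip(truth, scores) if t == 1]
        false_scores = [s for t, s in zip(truth, scores) if t == 0]
        # A pair is wrong when a false label is scored at or above a true label.
        wrong = sum(f >= t for t in true_scores for f in false_scores)
        losses.append(wrong / (len(true_scores) * len(false_scores)))
    return sum(losses) / len(losses)

y_true = [[1, 0, 0], [0, 0, 1]]
first = ranking_loss_manual(y_true, [[0.75, 0.5, 1], [1, 0.2, 0.1]])
second = ranking_loss_manual(y_true, [[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
print(first, second)
```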


### Regression metrics

### [Explained Variance Score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score)

* Optimum = 1.0; lower values are worse (can be negative)

In [64]:
#example

from sklearn.metrics import explained_variance_score

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

print("score #1:\n",explained_variance_score(
        y_true, y_pred)) 

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0,   2], [-1, 2], [8, -5]]

print("score #2 (multiout=raw):\n",explained_variance_score(
        y_true, y_pred, 
        multioutput='raw_values'))

print("score #2 (multiout=given):\n",explained_variance_score(
    y_true, y_pred, multioutput=[0.3, 0.7]))

score #1:
 0.957173447537
score #2 (multiout=raw):
 [ 0.96774194  1.        ]
score #2 (multiout=given):
 0.990322580645


### [Mean Absolute Error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error) | [wiki](https://en.wikipedia.org/wiki/Mean_absolute_error)

In [65]:
#example

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_absolute_error(
        y_true, y_pred))

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(mean_absolute_error(
        y_true, y_pred))

print(mean_absolute_error(
        y_true, y_pred, multioutput='raw_values'))

print(mean_absolute_error(
        y_true, y_pred, multioutput=[0.3, 0.7]))


0.5
0.75
[ 0.5  1. ]
0.85


### [Mean Squared Error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) | [wiki](https://en.wikipedia.org/wiki/Mean_squared_error) | [demo: GBR](plot_gradient_boosting_regression.ipynb)

In [66]:
# example
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_squared_error(
        y_true, y_pred))

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(mean_squared_error(
        y_true, y_pred))


0.375
0.708333333333


### [Median Absolute Error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error)

* Robust to outliers
* No multioutput support

In [67]:
# example

from sklearn.metrics import median_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
median_absolute_error(y_true, y_pred)

0.5
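
For the single-output data used in the last three cells, the mean absolute, mean squared and median absolute errors can be compared side by side using only the standard library. This is a small sketch; the variable names are illustrative.

```python
from statistics import mean, median

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]  # [0.5, 0.5, 0, 1]

mae = mean(abs_errors)                    # mean absolute error
mse = mean(e ** 2 for e in abs_errors)    # mean squared error
medae = median(abs_errors)                # median absolute error
print(mae, mse, medae)
```

Because the median ignores the size of the largest errors, replacing the last prediction with a wildly wrong value would change `mae` and `mse` but leave `medae` at 0.5.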

### [R^2 score (coefficient of determination)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score) | [wiki](https://en.wikipedia.org/wiki/Coefficient_of_determination)

* Returns measure of how well future samples will be predicted by current model.
* Best case = 1.0, can be <0.0
* 0.0 = constant model that always predicts the expected value (mean) of y, disregarding input features

In [68]:
# example
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(r2_score(
        y_true, y_pred))

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(r2_score(
        y_true, y_pred, multioutput='variance_weighted'))

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(r2_score(
        y_true, y_pred, multioutput='uniform_average'))

print(r2_score(
        y_true, y_pred, multioutput='raw_values'))

print(r2_score(
        y_true, y_pred, multioutput=[0.3, 0.7]))

0.948608137045
0.938256658596
0.936800526662
[ 0.96543779  0.90816327]
0.92534562212


### Clustering metrics

[instance clustering](clustering.ipynb) |
[biclustering](biclustering.ipynb)

### Dummy estimators

[Dummy classifiers](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier) | [Dummy regressors](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html#sklearn.dummy.DummyRegressor)

* use case: supervised learning - comparing an estimator against simple rules of thumb

In [69]:
# create unbalanced dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
y[y != 1] = -1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)

In [70]:
# compare accuracy of SVC & most_frequent
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

clf1 = SVC(
    kernel='linear', 
    C=1).fit(X_train, y_train)

clf2 = DummyClassifier(
    strategy='most_frequent',
    random_state=0).fit(X_train, y_train)

clf1.score(X_test, y_test) ,clf2.score(X_test, y_test)  


(0.63157894736842102, 0.57894736842105265)

In [71]:
# Linear-kernel SVC doesn't do much better than the dummy classifier. Now change the kernel.
clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)  

0.97368421052631582

In [72]:
# accuracy boosted to near 100%