# Evaluation Metrics and Scoring

We will use the popular scikit-learn (sklearn) library to demonstrate how evaluation metrics work. There are multiple ways to perform evaluation in sklearn. First, each ML algorithm implemented in sklearn comes with its own evaluation method "out of the box". This will be demonstrated later in the course, when we use these algorithms.
The second, which is featured in this notebook, is to use the functions in sklearn.metrics.

For a complete reference, refer to:
https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
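To make the two routes concrete, here is a minimal sketch. It assumes a LinearRegression model fit on a made-up one-feature dataset; the data and the names `X_demo`, `y_demo`, and `model` are illustrative and not part of the course material. For scikit-learn regressors the built-in `score` method reports R², so it should agree with `r2_score` from sklearn.metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical toy data: one feature and a noisy, roughly linear target.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

model = LinearRegression().fit(X_demo, y_demo)

# Route 1: the estimator's own evaluation method.
score_builtin = model.score(X_demo, y_demo)

# Route 2: a function from sklearn.metrics applied to the predictions.
score_metrics = r2_score(y_demo, model.predict(X_demo))

print(score_builtin, score_metrics)
```

The rest of this notebook follows route 2, which works on any pair of label and prediction arrays, whether or not a fitted sklearn model produced them.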

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Regression Evaluation Metrics

In [None]:
# Let's create a small toy dataset containing labels (y) and predictions
# (y_hat) only.  We will not be needing features here.
# The data here will be stored in numpy arrays, although generally
# you can apply the sklearn evaluation metrics on pandas Series (columns of
# dataframes) or even python lists.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.0, 2.5, 2.7, 5.0, 5.5])


In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y, y_hat)

0.31799999999999995

In [None]:
# Verify that the manual computation below gives the same value:
MSE = np.mean(np.square(y-y_hat))
MSE

0.31799999999999995

In [None]:
from sklearn.metrics import mean_absolute_error

MAE = mean_absolute_error(y, y_hat)
MAE

0.45999999999999996

In [None]:
# Verify it's the same
np.mean(np.abs(y-y_hat))

0.45999999999999996

In [None]:
from sklearn.metrics import mean_absolute_percentage_error

MAPE = mean_absolute_percentage_error(y, y_hat)
MAPE

0.13999999999999996

In [None]:
# Verify it's the same
np.mean(np.abs(y - y_hat)/np.abs(y))

0.13999999999999996

In [None]:
from sklearn.metrics import r2_score
r2_score(y, y_hat)

In [None]:
# Verify it's the same (np.var computes the population variance, ddof=0, by default)
1 - MSE / np.var(y)
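One way to read R² is as a comparison against a baseline that always predicts the mean of the labels. The sketch below reuses the toy regression arrays from above and is only meant as an illustration; the names `ss_res`, `ss_tot`, and `y_mean_baseline` are introduced here for that purpose.

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.0, 2.5, 2.7, 5.0, 5.5])

ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# Baseline: always predict the mean of the labels.
y_mean_baseline = np.full_like(y, y.mean())
r2_baseline = r2_score(y, y_mean_baseline)

print(r2_manual, r2_baseline)
```

The mean baseline scores exactly 0, a perfect prediction scores 1, and a prediction worse than the baseline gives a negative value.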

## Classification - Hard Prediction

In [None]:
y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([1,1,1,-1,1,1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1])


In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_hat)

In [None]:
# Verify that it is the same as below.
# "y == y_hat" is an array of True/False, with True at indices where the two
# arrays are equal and False otherwise. The np.sum of that array is the number
# of Trues, because when you sum booleans, each True counts as 1
# and each False counts as 0.
np.sum(y == y_hat)/len(y)

In [None]:
# Error rate
from sklearn.metrics import zero_one_loss
zero_one_loss(y, y_hat)

In [None]:
# Verify it's the same
np.sum(y != y_hat)/len(y)
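Accuracy and error rate are complementary fractions of the same count. As a small sketch, both functions accept `normalize=False`, which returns raw counts instead of fractions; the names `n_correct` and `n_wrong` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, zero_one_loss

y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([1,1,1,-1,1,1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1])

# With normalize=False the functions return counts rather than fractions.
n_correct = accuracy_score(y, y_hat, normalize=False)  # correct predictions
n_wrong = zero_one_loss(y, y_hat, normalize=False)     # misclassifications

print(n_correct, n_wrong, len(y))
```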

In [None]:
from sklearn.metrics import confusion_matrix
conf = confusion_matrix(y, y_hat)
conf

In [None]:
# Check that it is the same
TN, FP, FN, TP = \
    np.sum((y==-1)*(y_hat==-1)), np.sum((y==-1)*(y_hat==1)), \
    np.sum((y==1)*(y_hat==-1)), np.sum((y==1)*(y_hat==1))
np.array([[TN, FP],
          [FN, TP]])
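The four counts can also be read straight off sklearn's matrix. In the sketch below, passing `labels=[-1, 1]` makes the row and column order explicit (rows are true labels, columns are predictions), and `ravel()` flattens the 2x2 matrix row by row. The lowercase names are introduced here to avoid clashing with the variables above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([1,1,1,-1,1,1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1])

# Row 0 = true -1, row 1 = true +1; column 0 = predicted -1, column 1 = predicted +1.
tn, fp, fn, tp = confusion_matrix(y, y_hat, labels=[-1, 1]).ravel()

print(tn, fp, fn, tp)
```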

In [None]:
from sklearn.metrics import precision_score
precision = precision_score(y, y_hat)
precision

In [None]:
# Verify it's the same
TP/(TP+FP)

In [None]:
from sklearn.metrics import recall_score
recall = recall_score(y, y_hat)
recall

In [None]:
# Verify it's the same
TP/(TP+FN)

In [None]:
from sklearn.metrics import f1_score
f1_score(y, y_hat)

In [None]:
# Verify it's the same
2 * precision * recall / (precision + recall)
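Because F1 is the harmonic mean of precision and recall, it can also be written directly in terms of the confusion-matrix counts as 2·TP / (2·TP + FP + FN). The following sketch recomputes the counts with boolean masks so that it is self-contained; `f1_counts` is a name introduced here.

```python
import numpy as np
from sklearn.metrics import f1_score

y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([1,1,1,-1,1,1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1])

tp = np.sum((y == 1) & (y_hat == 1))    # true positives
fp = np.sum((y == -1) & (y_hat == 1))   # false positives
fn = np.sum((y == 1) & (y_hat == -1))   # false negatives

# F1 expressed with counts only; true negatives do not appear.
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_counts, f1_score(y, y_hat))
```

True negatives play no role in this formula, so F1 ignores how well the negative class is recognized.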

## Classification - Soft Prediction

In [None]:
# label: -1 or +1
# prediction: probability of +1
y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([0.9,0.95,0.7,0.2,0.1,0.051,0.06,0.8,0.89,0.49,0.4,0.45,0.61,0.3,0.35,0.36])

In [None]:
# Convert soft prediction to hard prediction using threshold
thresh = 0.6
y_hat_hard = np.where(y_hat > thresh, 1, -1)
y_hat_hard
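The choice of threshold changes the hard predictions and therefore the hard-prediction metrics. The sketch below sweeps a few thresholds, chosen arbitrarily for illustration, and records precision and recall for each; the `results` dictionary is a name introduced here.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([0.9,0.95,0.7,0.2,0.1,0.051,0.06,0.8,
                  0.89,0.49,0.4,0.45,0.61,0.3,0.35,0.36])

results = {}
for thresh in (0.25, 0.6, 0.85):  # illustrative thresholds
    y_hat_hard = np.where(y_hat > thresh, 1, -1)
    results[thresh] = (precision_score(y, y_hat_hard),
                       recall_score(y, y_hat_hard))

for thresh, (p, r) in results.items():
    print(f"thresh={thresh}: precision={p:.3f}, recall={r:.3f}")
```

Raising the threshold can only remove predicted positives, so recall never increases as the threshold goes up; precision can move in either direction.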

In [None]:
# Cross-entropy loss
from sklearn.metrics import log_loss

In [None]:
# Map the -1/+1 labels to 0/1 via (y+1)/2, so that y_hat reads as the probability of class 1.
log_loss(((y+1)/2), y_hat)

In [None]:
# Check that it's the same
np.mean(-(1+y)/2 * np.log(y_hat) - (1-y)/2 * np.log(1-y_hat))
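Cross-entropy is an average of per-example penalties, and a confident wrong prediction is penalized heavily. As a sketch, the code below computes the loss of each example separately and locates the worst one; `y01`, `per_sample`, and `worst` are names introduced here.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([0.9,0.95,0.7,0.2,0.1,0.051,0.06,0.8,
                  0.89,0.49,0.4,0.45,0.61,0.3,0.35,0.36])

y01 = (y + 1) // 2  # integer 0/1 labels

# Loss of each example: -log of the probability assigned to its true class.
per_sample = -(y01 * np.log(y_hat) + (1 - y01) * np.log(1 - y_hat))

worst = int(np.argmax(per_sample))
print(worst, y[worst], y_hat[worst], per_sample[worst])
print(per_sample.mean(), log_loss(y01, y_hat))
```

The largest penalty goes to the positive example that was given a probability of only 0.051.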

In [None]:
from sklearn.metrics import roc_curve
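# ROC curve: TPR (y-axis) plotted against FPR (x-axis).
# The third return value holds the thresholds that produce each point.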
FPR, TPR, _ = roc_curve(y, y_hat)

In [None]:
plt.plot(FPR, TPR)
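Each point on the ROC curve corresponds to one threshold. The sketch below recomputes TPR and FPR by hand for every threshold that `roc_curve` returns, assuming (as the sklearn documentation describes) that an example is predicted positive when its score is greater than or equal to the threshold. The lowercase names are introduced here.

```python
import numpy as np
from sklearn.metrics import roc_curve

y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([0.9,0.95,0.7,0.2,0.1,0.051,0.06,0.8,
                  0.89,0.49,0.4,0.45,0.61,0.3,0.35,0.36])

fpr, tpr, thresholds = roc_curve(y, y_hat)

pos_scores = y_hat[y == 1]    # scores of the true positives
neg_scores = y_hat[y == -1]   # scores of the true negatives

# Fraction of positives (resp. negatives) scored at or above each threshold.
tpr_manual = np.array([np.mean(pos_scores >= t) for t in thresholds])
fpr_manual = np.array([np.mean(neg_scores >= t) for t in thresholds])

print(np.allclose(tpr, tpr_manual), np.allclose(fpr, fpr_manual))
```

The first returned threshold lies above every score, so the curve starts at (0, 0); the last one is low enough that everything is predicted positive, so the curve ends at (1, 1).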

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y, y_hat)

In [None]:
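# Check that it equals the equivalent pairwise definition of AUC: the fraction
# of (positive, negative) pairs in which the positive example gets the higher score.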

negative_indices = np.where(y==-1)[0]
positive_indices = np.where(y==1)[0]

# number of (negative,positive) pairs for which y_hat is ordered correctly
num_pairs_ordered_correctly = np.sum(
    [[y_hat[j] > y_hat[k]
      for k in negative_indices]
        for j in positive_indices])

# number of (negative, positive) pairs
num_pairs = len(negative_indices) * len(positive_indices)

# AUC = ratio of last two expressions
num_pairs_ordered_correctly / num_pairs
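The same pairwise count can be written without Python loops by using NumPy broadcasting. This sketch also counts tied pairs as one half, which is the usual convention for matching the area under the ROC curve; the toy data here has no ties between a positive and a negative example, so that term is zero. The names `diff` and `auc_pairs` are introduced here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1])
y_hat = np.array([0.9,0.95,0.7,0.2,0.1,0.051,0.06,0.8,
                  0.89,0.49,0.4,0.45,0.61,0.3,0.35,0.36])

pos_scores = y_hat[y == 1]
neg_scores = y_hat[y == -1]

# diff[j, k] = score of positive j minus score of negative k (8 x 8 matrix).
diff = pos_scores[:, None] - neg_scores[None, :]

# Correctly ordered pairs count 1, tied pairs count 1/2.
auc_pairs = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(auc_pairs, roc_auc_score(y, y_hat))
```

An AUC below 0.5, as here, means that a randomly chosen positive example is more likely than not to be scored below a randomly chosen negative one.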