In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
import scipy.linalg as linalg
import sklearn.linear_model
import sklearn.model_selection
import pandas as pd
from sklearn.datasets import make_blobs, make_moons
from sklearn.datasets import load_iris
from IPython.display import Markdown as md
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

# Applied Machine Learning

## Metrics: regression and classification

### Metrics and loss

- A metric is how we measure how good our model is
- A loss function is what the model optimizes during training
- They are not always the same (in practice they often differ)

### Example

- In classification we might be interested in accuracy
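- We can't optimize accuracy directly: it is piecewise constant, so its gradient is zero almost everywhere, and minimizing the 0-1 loss exactly is NP-hard in general
- That's why we use surrogate losses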

### Baseline

- No matter what metric we use, we always want to know how the simplest solution (e.g. the best constant prediction) scores
- Relating our solution to this baseline helps us understand things better

### Metric design

- We will consider a few metrics next
- You can always design your own metric

# Regression metrics

### Mean Squared Error (MSE)

$\textsf{MSE}(y, \hat y) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2$

- Usually optimized directly: loss = metric
- Punishes large errors heavily, so it is sensitive to outliers
- What is the baseline (best constant)?
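
One way to answer the baseline question numerically is to scan constant predictions and see which one gives the lowest MSE. The sketch below uses a small made-up target vector; the names `mse`, `candidates`, and `best_constant` are illustrative, not part of the original notebook.

```python
import numpy as np

# Toy targets, chosen only for illustration.
y = np.array([1.0, 4.0, 8.0, 10.0])

def mse(y_true, y_pred):
    """Mean squared error; y_pred may be an array or a single constant."""
    return np.mean((y_true - y_pred) ** 2)

# Evaluate MSE for a grid of constant predictions between 0 and 12.
candidates = np.linspace(0.0, 12.0, 1201)
losses = np.array([mse(y, c) for c in candidates])
best_constant = candidates[np.argmin(losses)]

print("best constant on the grid:", best_constant)
print("mean of y:", y.mean())
print("MSE of the mean baseline:", mse(y, y.mean()), "variance of y:", np.var(y))
```

The grid minimum lands on the mean of `y`, and the MSE of that baseline is the variance of the target.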

### Root MSE (RMSE)

$\textsf{RMSE}(y, \hat y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2}$

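- A metric; usually not the loss that is optimized directly
- The minimizer of MSE is also the minimizer of RMSE, because the square root is monotonic
- RMSE is in the same units as the target, so it is easier to interpret than MSE: e.g. prediction is 3, RMSE = 0.5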
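
A minimal sketch of the unit argument, on made-up numbers where every prediction is off by exactly 0.5:

```python
import numpy as np

# Hypothetical targets and predictions; every prediction is off by 0.5.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.5, 2.5, 6.5])

mse = np.mean((y_true - y_pred) ** 2)  # in squared units of the target
rmse = np.sqrt(mse)                    # back in the units of the target

print("MSE :", mse)
print("RMSE:", rmse)
```

Here RMSE recovers the typical error size of 0.5, while MSE reports 0.25 in squared units.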

### Mean Absolute Error (MAE)

$\textsf{MAE}(y, \hat y) = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat y_i \right| $

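- Not differentiable at zero error, but can still be optimized directly
- Errors are in the units of the target, so the values are easy to interpret
- Tolerates outliers, thus more robust
- For MSE the best constant is the mean; for MAE it is the median
- Example: for [1, 4, 8, 10], minimize |x-1| + |x-4| + |x-8| + |x-10|; any x between 4 and 8 attains the minimum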
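
The example above can be checked directly. This is a small sketch; `total_abs_error` is a helper name introduced here for illustration.

```python
import numpy as np

y = np.array([1.0, 4.0, 8.0, 10.0])

def total_abs_error(x):
    """Sum of absolute deviations of the constant prediction x from y."""
    return np.sum(np.abs(y - x))

for x in [1.0, 4.0, np.median(y), 8.0, 10.0]:
    print(f"x = {x:4.1f}  sum |x - y_i| = {total_abs_error(x):.1f}")
```

With an even number of points, every value between the two middle points (here 4 and 8) gives the same minimal sum; `np.median` returns their midpoint.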

### $R^2$ aka coefficient of determination

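- It is a good idea to compare our performance to a baseline
- Given $\overline y$ as the mean of $y$, $R^2$ is defined as
$$
R^2 = 1 - \frac{\textit{residual sum of squares}}{\textit{total sum of squares}} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat y_i)^2}{\sum_{i=1}^{N} (y_i - \overline y)^2} = 1 - \frac{\textsf{MSE}(y, \hat y)}{\textsf{MSE}(y, \overline y)}
$$
- It is the fraction of the target variance explained by the model: 1 is perfect, 0 matches the constant-mean baseline, negative is worse than the baseline
- For least-squares linear regression with an intercept, it equals the squared correlation between ground truth and predictions on the training data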
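
As a sanity check, $R^2$ can be computed by hand as one minus the residual sum of squares over the total sum of squares (the standard definition) and compared against `sklearn.metrics.r2_score`. The data below is made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# The constant-mean baseline scores exactly 0.
baseline_pred = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline_pred)

print("manual R^2  :", r2_manual)
print("sklearn R^2 :", r2_score(y_true, y_pred))
print("baseline R^2:", r2_baseline)
```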

### Some errors are different

- Is \\$100 vs \\$101 the same error as \\$1000 vs \\$1001 when predicting prices?
- If not, the metric should measure relative rather than absolute error
- This typically matters when the target column has a long tail (e.g. an exponential distribution)

### Mean Absolute Percentage Error (MAPE)

$\textsf{MAPE}(y, \hat y) = \frac{1}{N} \sum_{i=1}^{N} \left|\frac{ y_i - \hat y_i}{y_i}\right| $

- The values are interpretable and easy to report to someone else
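- MAPE is a weighted MAE with weights $1/|y_i|$, so the best constant is the weighted median, not the plain median
- It is undefined when some $y_i = 0$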
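
A short sketch of the formula above, applied to the price example: both predictions are off by one dollar, but the relative error differs by a factor of ten. The function name `mape` is introduced here for illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; assumes no zero targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

cheap = mape([100.0], [101.0])        # 1 dollar off on a 100 dollar item
expensive = mape([1000.0], [1001.0])  # 1 dollar off on a 1000 dollar item

print(f"MAPE for 100 vs 101:   {cheap:.3%}")
print(f"MAPE for 1000 vs 1001: {expensive:.3%}")
```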

### RMSE in log space

- It is common practice to take the logarithm of skewed features
- We can do the same with the target column
- RMSE measured in this log space penalizes relative rather than absolute errors, which addresses the problem
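
The following sketch compares plain RMSE with RMSE in log space on made-up prices where both predictions are 10% too high. It assumes strictly positive targets and predictions; `rmse` and `rmse_log` are names chosen here.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmse_log(y_true, y_pred):
    """RMSE between log-targets and log-predictions (inputs must be > 0)."""
    return rmse(np.log(y_true), np.log(y_pred))

# Hypothetical prices; each prediction overshoots by 10%.
y_true = np.array([100.0, 1000.0])
y_pred = np.array([110.0, 1100.0])

print("RMSE        :", rmse(y_true, y_pred))      # dominated by the large price
print("RMSE in logs:", rmse_log(y_true, y_pred))  # equals log(1.1) for both rows
```

In log space both rows contribute the same error, `log(1.1)`, because the relative miss is the same.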

# Classification metrics

### Accuracy

- A metric you can measure but not optimize directly
- With imbalanced classes it becomes too optimistic
- What happens if we predict whether someone is sick with Ebola?
- The accuracy of an "always healthy" algorithm: 99.999999999%
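
The effect is easy to reproduce on synthetic labels. The counts below (10 sick people among 100,000) are hypothetical and chosen only to make the imbalance visible.

```python
import numpy as np

# Hypothetical, heavily imbalanced data: 10 sick people among 100,000.
n_total, n_sick = 100_000, 10
y_true = np.zeros(n_total, dtype=int)
y_true[:n_sick] = 1

# The "always healthy" classifier predicts the negative class for everyone.
y_pred = np.zeros(n_total, dtype=int)

accuracy = np.mean(y_true == y_pred)
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.4f}, recall = {recall:.1f}")
```

The accuracy looks excellent while the classifier finds none of the sick people.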

### Log loss

Given $M$ classes and $N$ examples:
$$\textsf{log loss} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log \hat y_{ij},$$
where $y_{ij}$ indicates whether the $i$th example belongs to the $j$th class, and $\hat y_{ij}$ is the predicted probability that the $i$th example belongs to the $j$th class

### Log loss

- A loss function that is optimized directly
- The baseline is to predict the class frequencies of the target for every example
- In the multiclass case it is also known as cross-entropy
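
A minimal sketch of the formula and of the class-frequency baseline, on made-up labels. The helper `log_loss_manual` and the clipping constant `eps` are choices made here, not part of the original notebook.

```python
import numpy as np

def log_loss_manual(y_onehot, proba, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    proba = np.clip(proba, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(np.sum(y_onehot * np.log(proba), axis=1))

# Hypothetical labels for M = 3 classes and N = 6 examples.
labels = np.array([0, 0, 0, 1, 1, 2])
y_onehot = np.eye(3)[labels]

# Baseline: predict the class frequencies for every example.
priors = y_onehot.mean(axis=0)
baseline = log_loss_manual(y_onehot, np.tile(priors, (len(labels), 1)))

# For comparison: predict a uniform distribution for every example.
uniform = log_loss_manual(y_onehot, np.full((len(labels), 3), 1.0 / 3.0))

print("class frequencies:", priors)
print("baseline log loss:", baseline)
print("uniform log loss :", uniform)
```

The baseline log loss equals the entropy of the class distribution, and it is lower than the loss of the uniform prediction.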

### Confusion matrix

$C_{ij} = \textsf{the number of examples of the i-th class predicted as the j-th class}$

- Ideally, the confusion matrix is diagonal
- Plotting the matrix as an image (heatmap) can help spot which classes get confused
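
The definition can be implemented in a few lines and checked against `sklearn.metrics.confusion_matrix`, which uses the same convention (rows are true classes, columns are predicted classes). The labels below are made up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for 3 classes.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

n_classes = 3
C = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    C[t, p] += 1  # example of class t predicted as class p

print(C)
print("matches sklearn:", np.array_equal(C, confusion_matrix(y_true, y_pred)))
print("accuracy from the diagonal:", np.trace(C) / C.sum())
```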

### Confusion matrix for binary

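- We should declare one class as Positive and the other as Negative
- With rows as true classes and columns as predicted classes (Positive first), $C = \left[
 \begin{matrix}
  TP & FN \\
  FP & TN
 \end{matrix}
\right]$
- scikit-learn's `confusion_matrix` sorts the labels, so for labels 0/1 with 1 as Positive it returns $\left[\begin{matrix} TN & FP \\ FN & TP \end{matrix}\right]$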

In [4]:
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[48,  5],
       [ 4, 86]])
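
To read the four counts off the output above, the matrix can be flattened. This sketch hard-codes the printed matrix and assumes label 1 is treated as Positive; with sorted labels 0 and 1, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# The matrix printed by confusion_matrix(y_test, y_pred) above.
C = np.array([[48, 5],
              [4, 86]])

# Assumption: label 1 is the Positive class, so rows/columns are ordered 0, 1.
tn, fp, fn, tp = C.ravel()
accuracy = (tp + tn) / C.sum()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy = {accuracy:.3f}")
```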

### Precision

$\textsf{precision} = \frac{TP}{TP + FP}$

- The fraction of predicted Positives that are truly Positive
- Quite easy to get near-perfect precision: predict Positive only when very confident, but recall gets bad

### Recall

$\textsf{recall} = \frac{TP}{TP+FN}$

- The fraction of actual Positives that are predicted as Positive
- Quite easy to get perfect recall: predict everything as Positive, but precision gets bad
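
The two degenerate strategies can be sketched on made-up labels and scores. The helper `precision_recall` and the data are illustrative only.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the Positive class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return float(precision), float(recall)

# Hypothetical labels and classifier scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

# Strategy 1: call everything Positive -> perfect recall, poor precision.
everything_positive = np.ones_like(y_true)
# Strategy 2: call only the single most confident example Positive.
only_most_confident = (scores >= scores.max()).astype(int)

print("everything positive :", precision_recall(y_true, everything_positive))
print("only most confident :", precision_recall(y_true, only_most_confident))
```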

### Combining ratios: $F_1$

- If you want to combine two values, the arithmetic average is the usual choice
- To combine two ratios it is wiser to use the harmonic average, called $F_1$: $F_1 = \frac{2pr}{p+r}$
- Precision=0.1, Recall=0.9, what are the averages: arithmetic and harmonic?
- Arithmetic: (0.1 + 0.9)/2 = 0.5
- Harmonic: (2 * 0.1 * 0.9) / (0.1 + 0.9) = 0.18
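
The same arithmetic as a short sketch; the function name `f1` is chosen here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 0.1, 0.9
arithmetic = (p + r) / 2
harmonic = f1(p, r)

print(f"arithmetic mean = {arithmetic:.2f}")
print(f"harmonic mean (F1) = {harmonic:.2f}")
```

The harmonic mean stays close to the smaller of the two values, so a classifier cannot hide poor precision behind high recall.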

### Thresholds

- Most classifiers can actually predict a soft score (e.g. a probability)
- In binary classification we choose a threshold $t$: $\hat y > t$ leads to Positive, otherwise Negative

### Receiver Operating Characteristic (ROC) curve

- Plot the points (FPR, TPR) for all thresholds
- FPR is the ratio of false positives to all actual negatives, $\frac{FP}{FP+TN}$ (1 - specificity)
- TPR is the ratio of true positives to all actual positives, $\frac{TP}{TP+FN}$ (sensitivity, i.e. recall)
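
A minimal sketch of how the curve's points arise, using the standard definitions FPR = FP / (FP + TN) and TPR = TP / (TP + FN). The labels, scores, and thresholds are made up; `roc_point` is a helper introduced here.

```python
import numpy as np

def roc_point(y_true, scores, t):
    """Return (FPR, TPR) when predicting Positive for scores > t."""
    y_hat = scores > t
    tp = np.sum(y_hat & (y_true == 1))
    fp = np.sum(y_hat & (y_true == 0))
    fn = np.sum(~y_hat & (y_true == 1))
    tn = np.sum(~y_hat & (y_true == 0))
    return float(fp / (fp + tn)), float(tp / (tp + fn))

# Hypothetical labels and soft scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Sweep the threshold from strict to permissive.
thresholds = [0.9, 0.5, 0.37, 0.2, 0.0]
points = [roc_point(y_true, scores, t) for t in thresholds]

for t, (fpr, tpr) in zip(thresholds, points):
    print(f"t = {t:4.2f}  FPR = {fpr:.2f}  TPR = {tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0) towards (1, 1); `sklearn.metrics.roc_curve` performs this sweep over all distinct scores.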

### Area Under Curve

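- It would be convenient to compare different ROC curves with a single number: the area under the curve (AUC)
- AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, i.e. the probability of a pair being ordered correctly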
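
The ranking interpretation can be checked by counting correctly ordered (positive, negative) pairs and comparing with `sklearn.metrics.roc_auc_score`. The data is made up; ties are counted as half, which is the usual convention.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and soft scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Score difference for every (positive, negative) pair.
diff = pos[:, None] - neg[None, :]
auc_pairs = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print("fraction of correctly ordered pairs:", auc_pairs)
print("sklearn roc_auc_score              :", roc_auc_score(y_true, scores))
```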

### Going multiclass

- We can measure the same things per class (one class vs the rest)
- One can average them over classes: mean (macro) Precision, mean Recall, ...
- Other options include chance-corrected agreement measures such as Cohen's Kappa
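
A short sketch of these options with scikit-learn, on made-up labels. `average="macro"` takes the unweighted mean of the per-class values; other averaging modes exist and may suit imbalanced problems better.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

# Hypothetical true and predicted labels for 3 classes.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Per-class values, then their unweighted (macro) mean.
per_class_precision = precision_score(y_true, y_pred, average=None)
per_class_recall = recall_score(y_true, y_pred, average=None)
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")

# Cohen's kappa: agreement between y_true and y_pred corrected for chance.
kappa = cohen_kappa_score(y_true, y_pred)

print("precision per class:", per_class_precision, "macro:", macro_precision)
print("recall per class   :", per_class_recall, "macro:", macro_recall)
print("Cohen's kappa      :", kappa)
```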