# Evaluation Metrics 

## Classification metrics

### Accuracy
The proportion of correctly predicted instances out of the total instances.

 $\text{Accuracy} =\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$


In [None]:
def accuracy(labels, predicted):
    return sum([l == p for l, p in zip(labels, predicted)]) / len(labels)

labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1]

accuracy(labels, predicted)

Accuracy is not a good metric when the classes are heavily imbalanced: a classifier that always predicts the majority class still gets a high accuracy.

In [None]:
labels =    [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

accuracy(labels, predicted)

The following metrics are defined for two-class problems but can be extended to multi-class problems. To do so, the functions below take the class of interest as an extra argument (`klass`).
- Use them when you want a separate analysis for each class.

### Precision
The proportion of true positive instances out of the total instances predicted as positive.

$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

In [None]:
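def precission(labels, predicted, klass):
    # Number of instances predicted as `klass` (TP + FP)
    c = sum(p == klass for p in predicted)
    if c > 0:
        # True positives divided by all predictions of `klass`
        return sum(l == p and l == klass for l, p in zip(labels, predicted)) / c
    else:
        # Convention used here: no predictions of `klass` means no false positives
        return 1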
    
labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1]

precission(labels, predicted, 1), precission(labels, predicted, 0)

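Precision for a single class should be analyzed with care. In the example below, class 1 gets a precision of 1.0 even though the classifier misses five of its seven instances: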

In [None]:
labels =    [0, 1, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 1]

precission(labels, predicted, 1)

In [None]:
precission(labels, predicted, 0)

### Recall (Sensitivity or True Positive Rate)
The proportion of true positive instances out of the total actual positive instances.

$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

In [None]:
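def recall(labels, predicted, klass):
    # True positives divided by all actual instances of `klass` (TP + FN)
    true_positives = sum(l == p and l == klass for l, p in zip(labels, predicted))
    return true_positives / sum(l == klass for l in labels)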

labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1]

recall(labels, predicted, 1), recall(labels, predicted, 0)

In [None]:
labels =    [0, 1, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 1]

recall(labels, predicted, 1), recall(labels, predicted, 0)

### F1 Score
The harmonic mean of precision and recall, providing a balance between the two.

$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
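
To see why a harmonic mean is used instead of a plain average, the sketch below compares both for a classifier with perfect precision but low recall. The helper names and the values 1.0 and 0.25 are illustrative assumptions, not results from this notebook.

```python
def harmonic_mean(p, r):
    # F1-style combination: dominated by the smaller of the two values
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def arithmetic_mean(p, r):
    # Plain average, shown only for comparison
    return (p + r) / 2

p, r = 1.0, 0.25  # hypothetical precision and recall
print(harmonic_mean(p, r))    # 0.4
print(arithmetic_mean(p, r))  # 0.625
```

The harmonic mean stays close to the weaker of the two metrics, so a high F1 score requires both precision and recall to be high.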
     

In [None]:
def f1_score(labels, predicted, klass):
    p = precission(labels, predicted, klass)
    r = recall(labels, predicted, klass)
    # Avoid dividing by zero when both precision and recall are 0
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)

labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1]

f1_score(labels, predicted, 1), f1_score(labels, predicted, 0)

In [None]:
labels =    [0, 1, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 1]

f1_score(labels, predicted, 1), f1_score(labels, predicted, 0)

In [None]:
# imbalance case
labels =    [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
f1_score(labels, predicted, 1)

### ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
The AUC represents the degree or measure of separability, indicating how well the model distinguishes between classes.

Many classification models use a threshold to decide whether an object belongs to one class or the other. For example, suppose a classification algorithm returns some sort of "probability" for one particular class. Once we set a threshold value, the probabilities are turned into classes and the metrics can be calculated.

To build the most common form of the ROC curve, we need another metric, the False Positive Rate (FPR): the fraction of actual negatives that are wrongly predicted as positive. The closer it is to zero, the better.

$\text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}$

This is used together with the True Positive Rate, or recall.

In [None]:
import numpy as np

def fpr(labels, predicted, klass):
    return sum([l != klass and p == klass for l, p in zip(labels, predicted)]) / sum([p!=klass for p in labels])


labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted_probs = [0.3, 0.4, 0.1, 0.6, 0.0, 0.95, 0.4, 0.6, 0.3, 0.9]

Once we set a threshold, for example 0.5, we can turn predicted_probs into predicted classes.

In [None]:
threshold = 0.5
predicted = [0 if p < threshold else 1 for p in predicted_probs]
print(labels)
print(predicted)
recall(labels, predicted, 1), fpr(labels, predicted, 1)

Let's explore what happens for different threshold values.

In [None]:
print(labels)
print(predicted_probs)
all_r = []
all_f = []
print()
print("thr, TPR, FPR")
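# Round the grid: np.arange yields values such as 0.30000000000000004,
# which would put a score of exactly 0.3 below the "0.3" threshold
for threshold in np.round(np.arange(0, 1.1, 0.1), 1):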
    predicted = [0 if p < threshold else 1 for p in predicted_probs]
    r, f = recall(labels, predicted, 1), fpr(labels, predicted, 1)
    all_r.append(r)
    all_f.append(f)
    print(f"{threshold:.1f}, {r:.2f}, {f:.2f}, {predicted}")

In [None]:
import matplotlib.pyplot as plt

plt.plot(all_f, all_r, marker='o')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

A ROC curve shows the tradeoff involved in choosing a threshold for a particular method.
- Good methods can increase the TPR without increasing the FPR much.
- Bad methods can only increase the TPR at the cost of a proportional increase in the FPR.

Let's put all the code in a function in order to test this in practice.

In [None]:
def create_ROC_report(labels, predicted_probs):
    all_r = []
    all_f = []
    print()
    print("thr, TPR, FPR")
    # Round the grid to avoid thresholds such as 0.30000000000000004
    for threshold in np.round(np.arange(0, 1.1, 0.1), 1):
        predicted = [0 if p < threshold else 1 for p in predicted_probs]
        r, f = recall(labels, predicted, 1), fpr(labels, predicted, 1)
        all_r.append(r)
        all_f.append(f)
        print(f"{threshold:.1f}, {r:.2f}, {f:.2f}, {predicted}")

    plt.plot(all_f, all_r, marker='o')
    plt.xlabel('FPR')
    plt.ylabel('TPR')

In a random classifier, the predicted probability is not related to the real class.

In [None]:
n_points = 100

labels = [0 for _ in range(n_points//2 )] + [1 for _ in range(n_points//2 )]
predicted_probs = np.random.uniform(0, 1, (n_points,))
create_ROC_report(labels, predicted_probs)

In [None]:
n_points = 100

labels = [0 for _ in range(n_points//2 )] + [1 for _ in range(n_points//2 )]
predicted_probs = np.hstack([
    np.random.uniform(0, 0.49, (n_points//2,)),
    np.random.uniform(0.51, 1, (n_points//2,))
])

create_ROC_report(labels, predicted_probs)

The area under the ROC curve (ROC-AUC) is about 0.5 in the random case and reaches its maximum of 1 in the perfect case, so it can be used as a metric. Values below 0.5 indicate a classifier that does worse than random guessing.
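
One way to turn the ROC curve into a single number is to integrate it with the trapezoidal rule. The sketch below is only an illustration under stated assumptions: `roc_auc_sketch` is a hypothetical helper, class 1 is treated as the positive class, both classes are assumed to be present, and every distinct score is used as a threshold instead of the fixed 0.1 grid used above.

```python
def roc_auc_sketch(labels, predicted_probs):
    """Approximate the area under the ROC curve with the trapezoidal rule."""
    # Thresholds in decreasing order; the infinite one gives the point (0, 0)
    thresholds = [float("inf")] + sorted(set(predicted_probs), reverse=True)
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos

    points = []  # (FPR, TPR) pairs, with FPR non-decreasing
    for thr in thresholds:
        tp = sum(1 for l, p in zip(labels, predicted_probs) if p >= thr and l == 1)
        fp = sum(1 for l, p in zip(labels, predicted_probs) if p >= thr and l != 1)
        points.append((fp / n_neg, tp / n_pos))

    # Sum the trapezoids between consecutive points of the curve
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted_probs = [0.3, 0.4, 0.1, 0.6, 0.0, 0.95, 0.4, 0.6, 0.3, 0.9]
print(round(roc_auc_sketch(labels, predicted_probs), 2))  # 0.82
```

Using every distinct score as a threshold traces the full curve, whereas a coarse grid can skip points and slightly change the area.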

### Confusion Matrix
A matrix showing the counts of true positive, true negative, false positive, and false negative predictions, useful for a more detailed performance analysis.

In many classification problems, especially those with more than two classes, the number of errors might be less important than the distribution of errors. This is completely missed by global metrics like accuracy.

Let's look at these three examples.

In [None]:
labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1, 0, 0]
accuracy(labels, predicted)

In [None]:
labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0]
accuracy(labels, predicted)

In [None]:
labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [1, 1, 1, 1, 1, 1, 1, 1]
accuracy(labels, predicted)

In all three examples the classifier misclassifies half of the objects, but to improve the result you need to know where the errors are located. This is what the confusion matrix shows.

In [None]:
from sklearn.metrics import confusion_matrix

labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1, 0, 0]
confusion_matrix(labels, predicted)

In a confusion matrix, each object is counted in the row of its real class and the column of its assigned class.
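
This counting rule can be sketched in a few lines of plain Python. `confusion_counts` is a hypothetical helper that assumes the classes are the integers 0 to `n_classes - 1`; it is meant as an illustration, not as a replacement for `sklearn.metrics.confusion_matrix`.

```python
def confusion_counts(labels, predicted, n_classes):
    # matrix[i][j] counts objects of real class i that were assigned class j
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for real, assigned in zip(labels, predicted):
        matrix[real][assigned] += 1
    return matrix

labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1, 0, 0]
print(confusion_counts(labels, predicted, 2))  # [[2, 2], [2, 2]]
```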

In [None]:
labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0]
confusion_matrix(labels, predicted)

In [None]:
labels =    [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [1, 1, 1, 1, 1, 1, 1, 1]
confusion_matrix(labels, predicted)

With more than two classes, the interpretation can be even more useful.

In [None]:
labels =    [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 0, 0, 1, 1, 1, 2, 2, 1, 1, 1]
confusion_matrix(labels, predicted)
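
The per-class precision and recall defined earlier can also be read off a confusion matrix: recall divides each diagonal entry by its row total, precision by its column total. The sketch below assumes the matrix of the three-class example, written out by hand as a list of lists; `per_class_metrics` is a hypothetical helper.

```python
# Confusion matrix of the three-class example, written out by hand
# (rows: real class, columns: assigned class)
matrix = [
    [4, 0, 0],
    [0, 3, 1],
    [0, 3, 1],
]

def per_class_metrics(matrix):
    n = len(matrix)
    report = {}
    for k in range(n):
        row_total = sum(matrix[k])                       # actual instances of k
        col_total = sum(matrix[i][k] for i in range(n))  # instances assigned to k
        # Assumed conventions for empty rows/columns: recall 0, precision 1
        recall_k = matrix[k][k] / row_total if row_total else 0.0
        precision_k = matrix[k][k] / col_total if col_total else 1.0
        report[k] = (precision_k, recall_k)
    return report

for klass, (p, r) in per_class_metrics(matrix).items():
    print(klass, p, r)
# 0 1.0 1.0
# 1 0.5 0.75
# 2 0.5 0.25
```

Here class 2 is mostly confused with class 1, which lowers the recall of class 2 and the precision of class 1 at the same time.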

## Regression Metrics

Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values.

$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$

Mean Squared Error (MSE): The average of the squared differences between predicted and actual values.

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$

Root Mean Squared Error (RMSE): The square root of the mean squared error.

$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$
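
The three formulas above translate almost directly into code. This is a minimal plain-Python sketch; the function names and the sample numbers are assumptions made for illustration.

```python
import math

def mae(values, predicted):
    # Average absolute difference between actual and predicted values
    return sum(abs(v - p) for v, p in zip(values, predicted)) / len(values)

def mse(values, predicted):
    # Average squared difference; large errors weigh more than small ones
    return sum((v - p) ** 2 for v, p in zip(values, predicted)) / len(values)

def rmse(values, predicted):
    # Square root of the MSE, expressed in the same units as the data
    return math.sqrt(mse(values, predicted))

values =    [3.0, 5.0, 2.0, 7.0]  # hypothetical actual values
predicted = [2.0, 5.0, 4.0, 8.0]  # hypothetical predictions

print(mae(values, predicted))             # 1.0
print(mse(values, predicted))             # 1.5
print(round(rmse(values, predicted), 4))  # 1.2247
```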
     

### R-squared (Coefficient of Determination)
The proportion of the variance in the dependent variable that is predictable from the independent variables.

$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$

where $\bar{y}$ is the mean of the actual values.

The coefficient of determination, denoted as R2, also indicates the goodness of fit of a model, measured against a baseline that always predicts the mean $\bar{y}$.
- It is at most 1. For a model that is at least as good as the mean baseline it lies between 0 and 1, and it becomes negative when the model is worse than that baseline.
- R2 of 0 means that the model explains none of the variance: it is no better than always predicting the mean of the actual values.
- R2 of 1 means that the predictions match the actual values without error.
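
The formula can be wrapped in a small function to check these reference points. `r2_sketch` is a hypothetical helper and the numbers are made up for illustration; the last call shows that predictions worse than the mean give a value below zero.

```python
def r2_sketch(values, predicted):
    mean = sum(values) / len(values)
    # Squared error of the model (numerator of the formula)
    ss_res = sum((v - p) ** 2 for v, p in zip(values, predicted))
    # Squared error of the mean baseline (denominator of the formula)
    ss_tot = sum((v - mean) ** 2 for v in values)
    return 1 - ss_res / ss_tot

values = [1.0, 2.0, 3.0]
print(r2_sketch(values, [1.0, 2.0, 3.0]))  # 1.0  (perfect predictions)
print(r2_sketch(values, [2.0, 2.0, 2.0]))  # 0.0  (always predicts the mean)
print(r2_sketch(values, [3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```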

In [None]:
np.random.seed(31416)  
values = np.random.randint(1, 100, 10)
predicted = values + np.random.normal(0, 10, 10)  
values, predicted

In [None]:
plt.scatter(values, predicted)
plt.show()

In [None]:
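# Total sum of squares: squared deviations of the data from its mean
# (proportional to the variance of the data)
variance_of_data = np.sum((values - np.mean(values))**2)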
variance_of_data

In [None]:
# Residual sum of squares: squared errors of the predictions
variance_of_model = np.sum((values - predicted)**2)
variance_of_model

In [None]:
R2 = 1 - (variance_of_model / variance_of_data)
R2

Adding more noise makes the predictions worse, which is reflected in a lower R2.

In [None]:
for noise in range(0, 40, 3):
    predicted = values + np.random.normal(0, noise, 10)  
    variance_of_data = np.sum((values - np.mean(values))**2)
    variance_of_model = np.sum((values - predicted)**2)
    R2 = 1 - (variance_of_model / variance_of_data)
    print(noise, R2)

#### R2 or MSE?

Advantages of R2:

- Statistical significance: it can be tested, for example with an F-test in linear regression.

- Scale-free comparison: R2 does not depend on the units of the dependent variable.

Advantages of MSE:

- Mathematical Simplicity.

- Faster calculation.


Choosing Between R2 and MSE:

- Model Assessment: Use R2 when you want to assess how well your model explains the variance in the dependent variable relative to a baseline model. It provides a holistic view of model performance in terms of variance explained.

- Model Optimization: Use MSE (or related metrics like RMSE) as a loss function during model training and optimization. MSE directly penalizes larger errors, making it suitable for training models to minimize prediction errors.
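
The scale-free property listed among the advantages of R2 can be checked numerically. In this sketch the data and helper names are assumptions: the same hypothetical measurements are expressed in metres and in centimetres, and both metrics are computed for each version.

```python
def mse_of(values, predicted):
    return sum((v - p) ** 2 for v, p in zip(values, predicted)) / len(values)

def r2_of(values, predicted):
    mean = sum(values) / len(values)
    ss_res = sum((v - p) ** 2 for v, p in zip(values, predicted))
    ss_tot = sum((v - mean) ** 2 for v in values)
    return 1 - ss_res / ss_tot

# Hypothetical measurements in metres, then the same data in centimetres
values_m = [1.0, 2.0, 3.0, 4.0]
predicted_m = [1.5, 2.0, 2.5, 4.0]
values_cm = [100 * v for v in values_m]
predicted_cm = [100 * p for p in predicted_m]

# MSE grows by a factor of 100**2, while R2 is unchanged
print(mse_of(values_m, predicted_m), mse_of(values_cm, predicted_cm))
print(r2_of(values_m, predicted_m), r2_of(values_cm, predicted_cm))
```

This is why MSE values are only comparable between models evaluated on the same target, while R2 can be read without knowing the units.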