# Classification Model Assessment

This notebook discusses the model assessment metrics commonly used in classification. It is based on Chapter 3 of HOML.

We'll work with the MNIST dataset again, with a focus on the binary classifier that predicts "5" or "not-5".

## Setup

Taken from the 11b notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# use as_frame=False to get data as NumPy arrays
mnist = fetch_openml('mnist_784', as_frame=False)

X, y = mnist.data, mnist.target

# Split into the predefined train and test sets
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# create boolean labels for the 5 / not-5 classifier
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

# always scale features for LogReg
scaler = StandardScaler()

# apply scaling without introducing data leakage
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logr_bin = LogisticRegression()

logr_bin.fit(X_train_scaled, y_train_5)

Does it correctly predict an image we know is labeled "5"?

In [None]:
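# NOTE: X[0] is the raw, unscaled image. logr_bin was fit on scaled features,
# so X_train_scaled[0] is the input that matches the training data.
some_digit = X[0]
logr_bin.predict([some_digit])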

In [None]:
# Train accuracy
train_score = logr_bin.score(X_train_scaled, y_train_5)

# Test accuracy (using the actual test set)
test_score = logr_bin.score(X_test_scaled, y_test_5)

print(f"Training accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")

In [None]:
# Make predictions
y_test_pred = logr_bin.predict(X_test_scaled)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test_5, y_test_pred))

print("\nClassification Report:")
print(classification_report(y_test_5, y_test_pred))

To better understand the consequences of FP vs. FN errors, it is common to calculate performance metrics separately for each class. The exception is accuracy, which is inherently an overall measure of performance.

The classification report is divided into by-class (top) and overall (bottom) scores.

For the "False" class (negative class):
- Precision: 0.98 (98% of predicted negatives were actually negative)
- Recall: 0.99 (99% of actual negatives were correctly identified)
- F1-score: 0.99 (harmonic mean of precision and recall)
- Support: 9108 samples

For the "True" class (positive class):
- Precision: 0.90 (90% of predicted positives were actually positive)
- Recall: 0.84 (84% of actual positives were correctly identified)
- F1-score: 0.87 (harmonic mean of precision and recall)
- Support: 892 samples
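
The per-class figures follow directly from the four cells of the confusion matrix. The sketch below recomputes them in plain Python. The counts are illustrative values chosen to be roughly consistent with the rounded report, not the notebook's actual output, and `class_metrics` is a hypothetical helper.

```python
# Illustrative confusion-matrix counts (assumed, not the notebook's actual output).
tn, fp, fn, tp = 9025, 83, 145, 747


def class_metrics(hits, false_alarms, misses):
    """Return (precision, recall, f1) for one class.

    hits:         samples of this class that were predicted as this class
    false_alarms: samples of the other class that were predicted as this class
    misses:       samples of this class that were predicted as the other class
    """
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# "True" class: TP are hits, FP are false alarms, FN are misses.
pos = class_metrics(tp, fp, fn)
# "False" class: the roles flip, so TN are hits, FN are false alarms, FP are misses.
neg = class_metrics(tn, fn, fp)

print("True : precision=%.2f recall=%.2f f1=%.2f" % pos)
print("False: precision=%.2f recall=%.2f f1=%.2f" % neg)
```

Note that the same FP and FN counts appear in both classes with their roles swapped, which is why a model can score very differently on the two classes when one class is much larger.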

Overall metrics:
- Accuracy: 0.98 (98% of all predictions were correct)
- Macro avg: 0.94 precision, 0.91 recall, 0.93 F1 (simple average across classes)
- Weighted avg: 0.98 for precision, recall, and F1 (weighted by class support)
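
To see how the two averages differ, the sketch below recomputes them for precision from the per-class values quoted in the report. Because those inputs are already rounded to two decimals, the results can differ from the report in the last digit.

```python
# Per-class precision and support as quoted in the report (rounded values).
precision = {"False": 0.98, "True": 0.90}
support = {"False": 9108, "True": 892}

total = sum(support.values())

# Macro average: every class counts equally, regardless of its size.
macro = sum(precision.values()) / len(precision)

# Weighted average: each class contributes in proportion to its support.
weighted = sum(precision[c] * support[c] for c in precision) / total

print(f"macro precision:    {macro:.3f}")
print(f"weighted precision: {weighted:.3f}")
```

The weighted average sits close to the majority-class score because about 91% of the samples are negatives, so the macro average is usually the more revealing of the two on imbalanced data.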

Note: accuracy is a single overall score, not an F1 value, even though the report prints it in the f1-score column. The macro and weighted average rows, by contrast, have separate values for precision, recall, and f1-score.

Observations:
- Imbalanced dataset (9108 negatives, 892 positives)
- Performs better on majority "False" class
- Good performance on the minority "True" class
- The relatively high precision (0.90) for the "True" class indicates that when the model predicts "True," it's usually correct, while the slightly lower recall (0.84) shows it misses some positive cases.

## Precision and Recall

For individual classifier metrics, use the appropriate SKL functions from `sklearn.metrics`.

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

precision_score(y_test_5, y_test_pred)

In [None]:
recall_score(y_test_5, y_test_pred)

From the recall, we see that the model detects only 83.7% of the 5s in the test data.

By default, these functions report the score for the positive class only, which matches the results above. To score the negative class instead, pass `pos_label=0`, which matches the `False` label because `False == 0` in Python.

In [None]:
precision_score(y_test_5, y_test_pred, pos_label=0)
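
Scoring the negative class just means treating a different label as "positive". The plain-Python sketch below makes that explicit on a tiny made-up example; `precision_for` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def precision_for(y_true, y_pred, pos_label):
    """Precision when `pos_label` is treated as the positive class."""
    # True labels of every sample that was predicted as `pos_label`.
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == pos_label]
    if not predicted_pos:
        return 0.0
    return sum(t == pos_label for t in predicted_pos) / len(predicted_pos)


# Tiny made-up example: True means "5", False means "not-5".
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]

print(precision_for(y_true, y_pred, pos_label=True))   # positive class
print(precision_for(y_true, y_pred, pos_label=False))  # negative class
print(precision_for(y_true, y_pred, pos_label=0))      # same as False, since 0 == False
```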

In [None]:
f1_score(y_test_5, y_test_pred)

## Precision / Recall Trade-off

We've seen that precision and recall are influenced by the decision threshold. The model's `predict` method does not let you set that value, but we can call its `decision_function` method (introduced in 11b), which returns a score for each observation, and then apply any threshold we like to those scores.

In [None]:
y_scores = logr_bin.decision_function([some_digit])
print(f"Score for the first observation: {y_scores[0]:.4f}")

In [None]:
threshold = 0
(y_scores > threshold)

In [None]:
# increase threshold
threshold = 150
(y_scores > threshold)

We know that `some_digit` is, in fact, a 5. By default, our model correctly classifies it as such.

For `LogisticRegression` a value of 0 produced by `decision_function` is equivalent to a 0.5 probability from `predict_proba`. So by setting the threshold to zero and comparing it with `y_scores = 132.7`, we are reproducing the normal classification.
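
The link between the two scales is the logistic (sigmoid) function. The sketch below is a minimal plain-Python illustration, assuming the binary case where the probability is the sigmoid of the decision score; `sigmoid` and `logit` are helper names introduced here.

```python
import math


def sigmoid(z):
    """Logistic function: maps a decision score z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid: maps a probability back to a decision score."""
    return math.log(p / (1.0 - p))


# A score of 0 sits exactly on the 0.5 probability boundary.
print(sigmoid(0.0))  # 0.5

# Thresholding the score at 0 and the probability at 0.5 give the same decision.
for z in (-3.0, -0.5, 0.5, 3.0):
    print(z, z > 0, sigmoid(z) > 0.5)

# A stricter probability cutoff corresponds to a positive score threshold.
print(logit(0.9))
```

Because the sigmoid is monotonic, any probability cutoff can be translated into an equivalent score threshold and vice versa.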

By repeating that comparison after increasing `threshold` to 150, we predict a not-5 result. This confirms that raising the threshold decreases recall by increasing the number of FN results.
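
The same effect can be seen across a whole set of observations. The sketch below uses made-up scores and labels (not taken from the notebook) and a hypothetical helper, `precision_recall_at`, to compare two thresholds. Returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
# Made-up decision scores and labels (True = "5"); not taken from the notebook.
scores = [-4.0, -2.5, -1.0, -0.5, 0.5, 1.0, 2.0, 3.5]
labels = [False, False, True, False, False, True, True, True]


def precision_recall_at(threshold):
    """Return (precision, recall, false_negatives) for `score > threshold`."""
    preds = [s > threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    # Assumed convention: precision is 1.0 when no positives are predicted.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall, fn


for t in (0.0, 1.5):
    p, r, fn = precision_recall_at(t)
    print(f"threshold={t:>4}: precision={p:.2f} recall={r:.2f} FN={fn}")
```

In this toy data, raising the threshold removes the one false positive but turns a true positive into a false negative, so precision rises while recall falls.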

So, we know that different decisions may benefit from higher recall or precision, and that this is controlled by the threshold. How do we decide what value to use for threshold?

We can use `cross_val_predict` to get out-of-fold scores for all instances in the training set by telling it to return decision scores (the z values) instead of predictions.

In [None]:
from sklearn.model_selection import cross_val_predict

y_scores = cross_val_predict(logr_bin, X_train_scaled, y_train_5, cv=5, method="decision_function")

Then use `precision_recall_curve` to compute both metrics for all possible thresholds, and use matplotlib to plot the results.

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

plt.figure(figsize=(10, 6))
# precisions and recalls have one more element than thresholds, hence [:-1]
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)

# Find the threshold value where precision and recall are approximately equal
# (This appears to be around threshold = 0)
crossing_idx = np.argmin(np.abs(precisions[:-1] - recalls[:-1]))
crossing_point_threshold = thresholds[crossing_idx]

# Mark the crossing point with circles and a dotted vertical line
plt.plot(crossing_point_threshold, precisions[crossing_idx], "bo", markersize=8)
plt.plot(crossing_point_threshold, recalls[crossing_idx], "go", markersize=8)
plt.vlines(crossing_point_threshold, 0, 1.0, "k", "dotted", label="threshold")

# beautify the figure: add grid, legend, axis limits, labels, and title
plt.grid(True)
plt.legend(loc="upper left", fontsize=10)
plt.xlabel("Threshold", fontsize=12)
plt.ylabel("Score", fontsize=12)
plt.title("Precision and Recall versus the decision threshold")
plt.xlim([-80, 20])
plt.ylim([0, 1.05])
plt.show()

In [None]:
print(crossing_point_threshold)

In [None]:
# Create the plot with proper sizing
plt.figure(figsize=(6, 4))

# Plot the precision-recall curve
plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve", color='#1f77b4')

# Beautify the figure
plt.grid(True)
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision versus Recall', fontsize=14)
plt.axis([0, 1.1, 0, 1.05])  # Set axis limits

# Add the legend for the curve
plt.legend(loc="lower left")
plt.tight_layout()
plt.show()

Based on this chart, how would you set your threshold?
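
One common way to approach this question (an assumption about the intended workflow, not the only option) is to fix a target precision and take the lowest threshold that reaches it, which keeps recall as high as possible under that constraint. The sketch below does this in plain Python on made-up scores. `lowest_threshold_for_precision` is a hypothetical helper, and it predicts positive when `score >= threshold`.

```python
def lowest_threshold_for_precision(scores, labels, target):
    """Return (threshold, precision, recall) for the smallest candidate
    threshold whose precision reaches `target`, or None if none does.

    Candidate thresholds are the distinct scores; a sample is predicted
    positive when its score is >= the threshold.
    """
    total_pos = sum(labels)
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        # At least one sample always scores >= t, so this is never zero.
        predicted_pos = sum(preds)
        precision = tp / predicted_pos
        if precision >= target:
            return t, precision, tp / total_pos
    return None


# Made-up decision scores and labels (True = "5"); not taken from the notebook.
scores = [-4.0, -2.5, -1.0, -0.5, 0.5, 1.0, 2.0, 3.5]
labels = [False, False, True, False, False, True, True, True]

print(lowest_threshold_for_precision(scores, labels, target=0.9))
print(lowest_threshold_for_precision(scores, labels, target=0.7))
```

The right target depends on which error is more costly: a higher target precision suits cases where false positives are expensive, while a lower one preserves recall when missing a positive is worse.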