# Classification Model Assessment

This notebook discusses the model assessment metrics commonly used in classification. It is based on Chapter 3 of HOML.

We'll work with the MNIST dataset again, with a focus on the binary classifier that predicts "5" or "not-5".

## Setup

Taken from the 11b notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# use as_frame=False to get data as NumPy arrays
mnist = fetch_openml('mnist_784', as_frame=False)

X, y = mnist.data, mnist.target

# Split into the predefined train and test sets
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# create boolean labels for the 5 / not-5 classifier
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

# always scale features for LogReg
scaler = StandardScaler()

# apply scaling without introducing data leakage
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logr_bin = LogisticRegression()

logr_bin.fit(X_train_scaled, y_train_5)
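
The scaling step in the setup cell learns its statistics from the training data only and then reuses them on the test data. A minimal plain-Python sketch of that idea follows; the one-feature values are hypothetical and chosen only for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical one-feature data, for illustration only.
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [6.0]

# "fit": learn the statistics from the training data only.
train_mean = fmean(train_values)
train_std = pstdev(train_values)  # population std (ddof=0), the same convention StandardScaler uses

# "transform": reuse the training statistics for both sets.
train_scaled = [(v - train_mean) / train_std for v in train_values]
test_scaled = [(v - train_mean) / train_std for v in test_values]

print(train_scaled)
print(test_scaled)
```

Because the test value never influences `train_mean` or `train_std`, no information from the test set leaks into the fitted model.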

Does it correctly predict an image we know is labeled "5"?

In [None]:
# WRONG
# some_digit = X[0]

# must use scaled data as input for model predictions!
some_digit = X_train_scaled[0]
logr_bin.predict([some_digit])

In [None]:
# Train accuracy
train_score = logr_bin.score(X_train_scaled, y_train_5)

# Test accuracy (using the actual test set)
test_score = logr_bin.score(X_test_scaled, y_test_5)
# y_test_5 was created in the setup cell: y_test_5 = (y_test == '5')

print(f"Training accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")

In [None]:
# Make predictions
y_test_pred = logr_bin.predict(X_test_scaled)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test_5, y_test_pred))

print("\nClassification Report:")
print(classification_report(y_test_5, y_test_pred))

To better understand the consequences of FP vs. FN errors, it is common to calculate performance metrics separately for each class. The exception is accuracy, which is inherently an overall measure of performance.

The classification report is divided into by-class (top) and overall (bottom) scores.

For the "False" class (negative class):
- Precision: 0.98 (98% of predicted negatives were actually negative)
- Recall: 0.99 (99% of actual negatives were correctly identified)
- F1-score: 0.99 (harmonic mean of precision and recall)
- Support: 9108 samples

For the "True" class (positive class):
- Precision: 0.90 (90% of predicted positives were actually positive)
- Recall: 0.84 (84% of actual positives were correctly identified)
- F1-score: 0.87 (harmonic mean of precision and recall)
- Support: 892 samples

Overall metrics:
- Accuracy: 0.98 (98% of all predictions were correct)
- Macro avg: 0.94 precision, 0.91 recall, 0.93 F1 (simple average across classes)
- Weighted avg: 0.98 for precision, recall, and F1 (weighted by class support)

Note: accuracy is an overall score and is unrelated to the f1-score, even though the report prints it in the f1-score column. The table is somewhat confusing in that regard. The macro and weighted average rows, by contrast, have separate values for precision, recall, and f1-score.
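
The macro and weighted rows can be reproduced by hand. The sketch below does this for precision, using the rounded per-class values quoted above, so small differences from the report are expected.

```python
# Per-class precision and support, copied from the rounded report values above.
precision = {"False": 0.98, "True": 0.90}
support = {"False": 9108, "True": 892}

# Macro average: every class counts equally.
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print(f"Macro precision:    {macro_precision:.2f}")
print(f"Weighted precision: {weighted_precision:.2f}")
```

The macro value comes out at 0.94, matching the report. The weighted value comes out at about 0.97 rather than 0.98, most likely because the inputs here were already rounded to two decimals.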

Observations:
- Imbalanced dataset (9108 negatives, 892 positives)
- Performs better on majority "False" class
- Good performance on the minority "True" class
- The relatively high precision (0.90) for the "True" class indicates that when the model predicts "True," it's usually correct, while the slightly lower recall (0.84) shows it misses some positive cases.
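
The class imbalance is also why accuracy alone can mislead here. As a rough sketch, consider a hypothetical baseline that labels every test image "not-5", using the support counts from the report:

```python
# Class counts from the test-set support column above.
n_negative = 9108  # not-5
n_positive = 892   # 5

# A hypothetical baseline that predicts "not-5" for every image.
true_negatives = n_negative
false_negatives = n_positive
true_positives = 0

baseline_accuracy = true_negatives / (n_negative + n_positive)
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall:   {baseline_recall:.4f}")
```

This do-nothing baseline reaches about 91% accuracy while detecting none of the 5s, which is why the per-class precision and recall matter more than the headline accuracy.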

## Precision and Recall

For individual classifier metrics, use the appropriate SKL functions from `metrics`.

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

precision_score(y_test_5, y_test_pred)

In [None]:
recall_score(y_test_5, y_test_pred)

From the recall, we see that the model detects only 83.7% of the 5s in the test data.

By default, these functions report the score for the positive class only, which matches the results above. To score the negative class instead, pass `pos_label=0`, which here selects the `False` label.

In [None]:
precision_score(y_test_5, y_test_pred, pos_label=0)

In [None]:
f1_score(y_test_5, y_test_pred)
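
The three scores in this section reduce to simple ratios of confusion-matrix counts. The sketch below spells them out; the helper name `precision_recall_f1` and the counts are hypothetical, chosen only to make the arithmetic easy to follow, and the helper assumes non-zero denominators.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from counts; assumes non-zero denominators."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts, not taken from the MNIST model.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two scores than a simple average would.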

## Precision / Recall Trade-off

We've seen that precision and recall are influenced by the decision threshold. SKL does not let you set that value directly, but we can use the model's `decision_function` method (introduced in 11b), which returns a score for each observation. Then we can use any threshold to make predictions.

In [None]:
y_scores = logr_bin.decision_function([some_digit])
print(f"Score for the first observation: {y_scores[0]:.4f}")

In [None]:
threshold = 0
(y_scores > threshold)

In [None]:
# increase threshold
threshold = 2
(y_scores > threshold)

We know that `some_digit` is, in fact, a 5. By default, our model correctly classifies it as such.

For `LogisticRegression` a value of 0 produced by `decision_function` is equivalent to a 0.5 probability from `predict_proba`. So by setting the threshold to zero and comparing it with `y_scores = 1.9739`, we are reproducing the normal classification.
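
For a binary logistic regression, the link between the two outputs is the sigmoid function. A small sketch, where `sigmoid` is a helper defined here and not part of SKL:

```python
import math

def sigmoid(z):
    """Map a decision score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A score of 0 sits exactly on the 0.5 probability boundary.
print(sigmoid(0.0))

# The score reported above for some_digit.
print(round(sigmoid(1.9739), 3))

# Thresholding the score at 0 gives the same answer as thresholding the probability at 0.5.
for z in (-2.0, -0.1, 0.1, 2.0):
    assert (z > 0) == (sigmoid(z) > 0.5)
```

Under this mapping, the score of 1.9739 corresponds to a probability of roughly 0.88.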

By repeating that comparison after increasing `threshold` to 2, we predict a not-5 result. This confirms that raising the threshold decreases recall by increasing the number of FN results.
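
The same effect can be seen on a handful of observations. In this sketch the scores and labels are hypothetical, and precision is defined as 1.0 when nothing is predicted positive (a convention assumed here to avoid dividing by zero).

```python
# Hypothetical decision scores and true labels, for illustration only.
scores = [-4.0, -1.5, -0.5, 0.5, 1.0, 2.5, 4.0]
labels = [False, False, True, False, True, True, True]

def precision_recall_at(threshold):
    """Classify as positive when score > threshold, then score the result."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum(not p and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for threshold in (0, 2):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Moving the threshold from 0 to 2 turns one true positive into a false negative and removes the only false positive, so recall drops while precision rises.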

So, we know that different decisions may benefit from higher recall or precision, and that this is controlled by the threshold. How do we decide what value to use for threshold?

We can use `cross_val_predict` to get the scores of all instances in the training set, but tell it to return the decision scores (the z values) instead of predictions.

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

y_scores = cross_val_predict(logr_bin, X_train_scaled, y_train_5, cv=5, method="decision_function")

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Then use `precision_recall_curve` to compute both metrics for all possible thresholds, and use matplotlib to plot the results.

In [None]:
plt.figure(figsize=(8, 4))  # extra code – it's not needed, just formatting
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
plt.vlines(threshold, 0, 1.05, "k", "dotted", label="threshold")

# extra code: marks the precision and recall values at the chosen threshold
idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-17, 17, 0, 1.05])
plt.grid()
plt.xlabel("Threshold (decision_function)")
plt.legend(loc="center right")

plt.show()

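Note that the `decision_function` score for our `some_digit` example (about 1.97) falls just below the raised threshold of 2. The prediction is correct at the default threshold, but it is not a highly confident one. `predict_proba` shows the corresponding probability.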

In [None]:
logr_bin.predict_proba([some_digit])

Another useful visualization is the Precision-Recall curve, which shows us the relationship between those values.

In [None]:
plt.figure(figsize=(8, 5))  # extra code – not needed, just formatting

plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve")

# extra code: marks the point on the curve that corresponds to threshold 2.0
plt.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
plt.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
plt.plot([recalls[idx]], [precisions[idx]], "ko", label="Point above threshold 2.0")
plt.text(0.6, 0.62, "Higher\nthreshold", color="#333333")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.axis([0, 1.0, 0, 1.05])
plt.grid()
plt.legend(loc="lower left")

plt.show()

From this we can see that precision falls off quickly once recall exceeds about 0.8, the point at which precision is about 0.9.

These two plots are complementary. Both can help us visualize the precision/recall trade-off, but only the first can help us choose the corresponding threshold to use when generating new predictions.

We can also use the `argmax` method to find the corresponding threshold in the data, as we did before.

In [None]:
# use argmax to find index of first threshold that gives at least 90% precision
idx_90_prec = (precisions >= 0.90).argmax()

# Get the corresponding threshold value
thresh_90_prec = thresholds[idx_90_prec]

print(f"Threshold value: {thresh_90_prec:.4f}")

This aligns well with the prior chart.

To make predictions using that "ideal" threshold, simply create a boolean array:

In [None]:
y_train_pred_90 = (y_scores >= thresh_90_prec)
precision_score(y_train_5, y_train_pred_90)

In [None]:
recall_score(y_train_5, y_train_pred_90)

These scores demonstrate that our method works, and that it is easy to generate predictions for any threshold value.
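
The threshold search can also be mirrored in plain Python, which makes the logic behind the `argmax` trick explicit: walk through the candidate thresholds from lowest to highest and stop at the first one that reaches the target precision. The scores and labels below are hypothetical.

```python
# Hypothetical decision scores and true labels, for illustration only.
scores = [-4.0, -1.5, -0.5, 0.5, 1.0, 2.5, 4.0]
labels = [False, False, True, False, True, True, True]
target_precision = 0.90

def precision_recall_at(threshold):
    """Classify as positive when score >= threshold, then score the result."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum(not p and y for p, y in zip(predicted, labels))
    return tp / (tp + fp), tp / (tp + fn)

# Each observed score is a candidate threshold; try them in increasing order.
chosen = None
for threshold in sorted(scores):
    precision, recall = precision_recall_at(threshold)
    if precision >= target_precision:
        chosen = (threshold, precision, recall)
        break

print(chosen)
```

Taking the lowest qualifying threshold keeps recall as high as possible for the chosen precision target.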

## ROC Curve

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
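
Each point on the ROC curve pairs the true positive rate (recall) with the false positive rate at one threshold. As a rough sketch of what is computed per threshold, using hypothetical scores and labels and a helper named `roc_point` that is not part of SKL:

```python
# Hypothetical decision scores and true labels, for illustration only.
scores = [-4.0, -1.5, -0.5, 0.5, 1.0, 2.5, 4.0]
labels = [False, False, True, False, True, True, True]

def roc_point(threshold):
    """Return (fpr, tpr) when classifying as positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum(not p and y for p, y in zip(predicted, labels))
    tn = sum(not p and not y for p, y in zip(predicted, labels))
    tpr = tp / (tp + fn)  # share of actual positives that were detected
    fpr = fp / (fp + tn)  # share of actual negatives that were flagged by mistake
    return fpr, tpr

for threshold in (0, 2):
    fpr_t, tpr_t = roc_point(threshold)
    print(f"threshold={threshold}: FPR={fpr_t:.2f}, TPR={tpr_t:.2f}")
```

Raising the threshold moves the point toward the lower left of the plot: fewer false positives, but also fewer detected positives.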

Calculate a point of interest.

In [None]:
idx_for_threshold_at_90 = (thresholds <= thresh_90_prec).argmax()
tpr_90, fpr_90 = tpr[idx_for_threshold_at_90], fpr[idx_for_threshold_at_90]

Plot the curve.

In [None]:
import matplotlib.patches as patches  # for the curved arrow

plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, linewidth=2, label="ROC curve")
plt.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
plt.plot([fpr_90], [tpr_90], "ko", label="Threshold for 90% precision")

plt.gca().add_patch(patches.FancyArrowPatch(
    (0.20, 0.89), (0.07, 0.70),
    connectionstyle="arc3,rad=.4",
    arrowstyle="Simple, tail_width=1.5, head_width=8, head_length=10",
    color="#444444"))
plt.text(0.12, 0.71, "Higher\nthreshold", color="#333333")
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.grid()
plt.axis([0, 1, 0, 1])
plt.legend(loc="lower right", fontsize=12)

plt.show()

Calculate area under the curve, ROC AUC score:

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)
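
One way to read the ROC AUC score: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. A minimal sketch of that pairwise interpretation on hypothetical scores:

```python
# Hypothetical decision scores, split by true class, for illustration only.
positive_scores = [-0.5, 1.0, 2.5, 4.0]
negative_scores = [-4.0, -1.5, 0.5]

# Count the positive/negative pairs that are ranked correctly.
wins = 0.0
for p in positive_scores:
    for n in negative_scores:
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5  # ties count as half a win

auc = wins / (len(positive_scores) * len(negative_scores))
print(f"Pairwise AUC: {auc:.4f}")
```

Here 11 of the 12 pairs are ranked correctly. A perfect ranking gives 1.0 and a random ranking gives about 0.5, which matches the diagonal line in the plot.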

## K-Nearest Neighbors

In [None]:
# numpy, pandas, matplotlib, StandardScaler, classification_report, and
# confusion_matrix were already imported in the setup cell
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Set random seed for reproducibility
np.random.seed(42)

# ---- Create a pipeline with preprocessing and model ----
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('knn', KNeighborsClassifier(n_neighbors=5))  # Default KNN model
])

# ---- Train the model ----
pipeline.fit(X_train, y_train_5)

# ---- Make predictions ----
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]  # Probability of positive class

# ---- Evaluate the model ----
accuracy = accuracy_score(y_test_5, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test_5, y_pred))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test_5, y_pred)
print(cm)

In [None]:
y_probs = cross_val_predict(pipeline, X_train, y_train_5, cv=5, method="predict_proba")[:, 1]
# use the KNN probabilities here, not the logistic regression y_scores
fpr, tpr, thresholds = roc_curve(y_train_5, y_probs)

In [None]:
roc_auc_score(y_train_5, y_probs)

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, linewidth=2, label="ROC curve")
plt.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
#plt.plot([fpr_90], [tpr_90], "ko", label="Threshold for 90% precision")

plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.grid()
plt.axis([0, 1, 0, 1])
plt.legend(loc="lower right", fontsize=12)

plt.show()