# Multiclass Classification

This notebook covers binary and multiclass classification. It is based on Chapter 3 of HOML.

We'll be exploring the MNIST dataset, a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. MNIST is considered the "hello world" of classification. It provides an opportunity for us to explore how ML is applied to image data.

## Setup

Yet another way to load a dataset... this time with `fetch_openml` from `sklearn.datasets`. This will take a minute.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_openml

# use as_frame=False to get data as NumPy arrays
mnist = fetch_openml('mnist_784', as_frame=False)
type(mnist)

We've specified NumPy arrays as they are better suited to working with image data than DataFrames.

The `Bunch` object is a kind of dictionary where the keys can be accessed as attributes. Most contain:

- `DESCR`: a description of the dataset
- `data`: the input data
- `target`: the labels

In [None]:
print(mnist.DESCR)

Extract the data and target.

In [None]:
X, y = mnist.data, mnist.target
type(X), type(y)

In [None]:
X

In [None]:
y

Image data is typically represented as a series of pixels, each with "color" data in red, green, blue values. For example, using 8-bit color where there are $2^8=256$ possible values, a red pixel is represented by $(255, 0, 0)$. In this case, the images are greyscale, so each pixel only has an intensity value in the $[0,255]$ range.

From the results above, we can see that $X$ is a 2-D array with 70k rows, each holding 784 values, and that $y$ has the corresponding labels. To see what labels are present, we'll use NumPy's `unique` function.


In [None]:
np.unique(y)

We see that all digits from 0 to 9 are represented. It is important to know if the dataset is balanced, meaning that all classes are roughly equally represented. We can check that visually with a bar chart of the label counts.

In [None]:
# Create a dictionary of {value: count}
unique_values, counts = np.unique(y, return_counts=True)
digit_counts = dict(zip(unique_values, counts))

# Plot the counts as a bar chart using matplotlib
plt.figure(figsize=(10, 5))
plt.bar(digit_counts.keys(), digit_counts.values())
plt.xlabel('Digit')
plt.ylabel('Count')
plt.title('Distribution of digits in MNIST dataset')
plt.xticks(list(digit_counts.keys()))
plt.grid(axis='y', alpha=0.3)
plt.show()

This shows that the labels are reasonably well balanced. Data that appears perfectly balanced should be scrutinized as it is unusual in the real world.

Each of the 784 elements in a row of $X$ represents one pixel. We can use Matplotlib's `imshow` function to look at an image, once the data is properly formatted. The images in this dataset are square (2-D), but each one is stored as a flat vector (1-D), so we need to reshape it into a 28 x 28 array (28 x 28 = 784).

This step is for our visualization only. 
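
To make the reshape concrete, here is a minimal sketch using a made-up vector of 784 values rather than a real image. It assumes NumPy's default row-major ordering.

```python
import numpy as np

# A made-up flat "image": 784 intensity values in row-major order
flat = np.arange(784) % 256

# Reshape to 28 rows x 28 columns for display
image = flat.reshape(28, 28)

# Pixel (row r, column c) of the image is element r * 28 + c of the vector
r, c = 3, 5
print(image.shape, image[r, c] == flat[r * 28 + c])

# Flattening recovers the original vector, so no information is lost
restored = image.reshape(-1)
print(np.array_equal(restored, flat))
```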

In [None]:
# define a utility function to plot an image
def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    # use binary for greyscale images
    plt.imshow(image, cmap="binary")
    plt.axis("off")

In [None]:
# select a digit and plot it
some_digit = X[0]

plot_digit(some_digit)
plt.show()

I... guess that's a 5?

In [None]:
y[0]

It is!

Before we go any further we need to follow best practices and set aside a test set. Per the dataset description, the first 60k images are for training and the last 10k are for test.

In [None]:
# Split into the predefined train and test sets
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

## Training a Binary Classifier

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, KFold

We'll start by building a binary classifier that will discriminate between "5" and "not-5". We've done this before, but this is a good opportunity to explore the method in the context of a different kind of data. We'll also discuss a few additional nuances.

Start by building a boolean mask for the training data based on the label "5". This will act as our labels for the binary classifier, indicating "5" (True) or "not-5" (False).
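
The idea can be seen on a handful of made-up labels. Note that the labels are strings here, as they are in the MNIST targets loaded above, so the comparison is against `"5"` and not the integer `5`.

```python
import numpy as np

# Made-up labels, stored as strings like the MNIST targets loaded above
labels = np.array(["5", "0", "4", "5", "9"])

# Element-wise comparison yields a boolean array of the same length
is_five = (labels == "5")
print(is_five.tolist())   # [True, False, False, True, False]
print(int(is_five.sum())) # number of positives: 2
```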

In [None]:
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

In [None]:
y_train_5[:10]

SKL offers several optimization algorithms ('solvers') for logistic regression. While none use plain gradient descent, some solvers like `sag` and `saga` are based on that method. Regardless, they all benefit from standardizing features before fitting the model. This improves convergence by preventing features with larger scales from dominating the optimization process. This preprocessing step is considered a best practice for logistic regression, even when using the default solver (`lbfgs`).
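
The sketch below shows what standardization does on a tiny made-up array. The third column is constant, which mimics a pixel that is blank in every image; `StandardScaler` leaves such a column at zero instead of dividing by a zero standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 has a large scale, column 1 a small one,
# and column 2 is constant (like an always-blank border pixel)
X_toy = np.array([[0.0, 0.1, 0.0],
                  [128.0, 0.2, 0.0],
                  [255.0, 0.3, 0.0]])

scaler_toy = StandardScaler()
X_toy_scaled = scaler_toy.fit_transform(X_toy)

# Manual z-score for the non-constant columns: (x - mean) / std
cols = X_toy[:, :2]
manual = (cols - cols.mean(axis=0)) / cols.std(axis=0)

print(np.allclose(X_toy_scaled[:, :2], manual))
print(X_toy_scaled[:, 2])  # constant column stays at zero
```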

In [None]:
# Create a scaler
scaler = StandardScaler()

# Fit the scaler on training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Only transform the test data (no fitting)
X_test_scaled = scaler.transform(X_test)

Now we'll fit the model as usual, and log the elapsed time.

In [None]:
logr_bin = LogisticRegression()

In [None]:
%%timeit -n1 -r1
logr_bin.fit(X_train_scaled, y_train_5)

Does it correctly predict an image we know is labeled "5"?

In [None]:
# Before we do predictions we need to redefine `some_digit`,
# which currently refers to X[0], unscaled data
# since the model is fit using scaled data, we need to work with that
# from now on!

some_digit = X_train_scaled[0]

logr_bin.predict([some_digit])

Yes! But how does it do overall? Let's score the model on the training and test sets to get a rough estimate of its performance.

In [None]:
# Train accuracy
train_score = logr_bin.score(X_train_scaled, y_train_5)

# Test accuracy (using the actual test set; y_test_5 was created above)
test_score = logr_bin.score(X_test_scaled, y_test_5)

print(f"Training accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")

Both are over 97% - that seems very high! There are (at least) four questions that should come to mind based on this result.

### What is accuracy, exactly?

Accuracy is, as you might expect, the percentage of correct predictions:

$$
\begin{align}
\text{Accuracy} &= \frac{\text{Correct Predictions}}{\text{Total Predictions}} \\[5pt]
&= \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Number of Observations}}
\end{align}
$$

This is an imperfect measure of performance, as we'll soon see, and there are many other metrics to consider.
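
As a worked example of the formula, the sketch below counts true positives and true negatives for eight made-up predictions and compares the result with a direct element-wise check.

```python
import numpy as np

# Made-up labels and predictions for eight observations
y_true_toy = np.array([True, False, False, True, False, False, True, False])
y_pred_toy = np.array([True, False, True, False, False, False, True, False])

true_pos = np.sum(y_true_toy & y_pred_toy)     # predicted True, actually True
true_neg = np.sum(~y_true_toy & ~y_pred_toy)   # predicted False, actually False

accuracy_from_counts = (true_pos + true_neg) / len(y_true_toy)
accuracy_direct = np.mean(y_true_toy == y_pred_toy)
print(accuracy_from_counts, accuracy_direct)   # 0.75 0.75
```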

### Training accuracy is higher than test. What does that imply?

When performance in training exceeds that in test, it may suggest some degree of overfitting.

In this case, the difference is very small, suggesting that the model generalizes well. For overfitting, we would typically expect a larger gap, often on the order of a few percentage points.

Keep in mind that this is a simple estimate of performance and that cross-validation would give a more rigorous result.

### How good is 97%?

97% accuracy sounds impressive at first glance, but we need context to truly evaluate it. Is it actually good, or just seemingly good? To answer this properly, we should establish a baseline model for comparison.

For binary classification, it is common to use SKL's `DummyClassifier` as a baseline. This model ignores the input features and, by default, always predicts the most frequent class in the training labels. For our 5 / not-5 model, the most common label is `False` (not-5), since each digit accounts for only about 10% of the images.

In [None]:
from sklearn.dummy import DummyClassifier

dummy_bin = DummyClassifier()
dummy_bin.fit(X_train_scaled, y_train_5)

# does it predict any True values?
# (the dummy model ignores the features, but we pass the scaled data for consistency)
print(any(dummy_bin.predict(X_train_scaled)))

This result indicates that no 5s are detected, which matches our expectations. Now let's score it.

In [None]:
# Train accuracy
train_score = dummy_bin.score(X_train_scaled, y_train_5)

# Test accuracy (using the actual test set; y_test_5 was created above)
test_score = dummy_bin.score(X_test_scaled, y_test_5)

print(f"Training accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")

Even a trivial model gets 90%+ accuracy for this data. That should also be expected since the dataset is balanced with 10 labels, so only 1/10th of our predictions should be wrong if we always predict not-5.

This clearly demonstrates why accuracy is not the generally preferred performance measure for classifiers, especially when dealing with imbalanced (aka skewed) datasets. Instead we will rely on interpreting the confusion matrix in a comprehensive fashion, weighing the strengths and weaknesses of our model with the business objectives in mind.
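
The sketch below illustrates the point on made-up labels with a 10% positive rate, using integer labels (1 = positive) for simplicity. The `most_frequent` strategy is set explicitly here, and the feature array is a placeholder, since the dummy model never looks at it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up skewed labels: 10 positives (1) among 100 observations
y_skewed = np.array([1] * 10 + [0] * 90)
X_ignored = np.zeros((100, 1))  # placeholder; DummyClassifier does not use the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_ignored, y_skewed)
y_hat = baseline.predict(X_ignored)

acc = accuracy_score(y_skewed, y_hat)
rec = recall_score(y_skewed, y_hat)
cm = confusion_matrix(y_skewed, y_hat)   # rows: true class, columns: predicted

print(acc)   # high accuracy
print(rec)   # yet no positive is ever found
print(cm)
```

Accuracy is 90% even though recall for the positive class is zero, which is the kind of weakness the confusion matrix exposes.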

We will return to model evaluation shortly.

### How can I understand the results?

It depends on the questions you have. Here we will identify images that the model finds difficult to classify.

First, we'll use the `decision_function` to identify images that are near the decision threshold. The `decision_function` returns the linear combination of features and weights, i.e.

$$z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$$

This is the raw $z$ score before applying the sigmoid function,

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The $z$ value is not a probability in the range $[0, 1]$; it can take any real value, from negative infinity to positive infinity. For a binary classifier, positive $z$ values mean the model predicts class $1$ ("5"), and negative values mean it predicts class $0$ ("not-5"). Zero is the decision boundary, so observations with a $z$ value near zero are only weakly classified.
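
The relationship between $z$, the sigmoid, and the predicted class can be checked numerically. This is a minimal sketch on a small synthetic binary problem (a stand-in for the 5 / not-5 task), assuming the default `LogisticRegression` settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem (a stand-in for the 5 / not-5 task)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
clf_syn = LogisticRegression().fit(X_syn, y_syn)

z = clf_syn.decision_function(X_syn)   # raw scores, unbounded
p = 1.0 / (1.0 + np.exp(-z))           # sigmoid maps them into (0, 1)

# The sigmoid of z matches the predicted probability of the positive class
probs_match = np.allclose(p, clf_syn.predict_proba(X_syn)[:, 1])

# Predicting the positive class is the same as checking z > 0
preds_match = np.array_equal(clf_syn.predict(X_syn), (z > 0).astype(int))
print(probs_match, preds_match)
```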

In the code below, the expression `np.argsort(np.abs(decision_scores))[:25]` returns the indices that would sort the absolute values from smallest to largest and selects the 25 examples closest to zero. Those examples are plotted, with their true labels ("5" or "not-5").
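
Here is the same selection on six made-up scores, which makes it easier to see what `argsort` returns.

```python
import numpy as np

# Made-up decision scores for six observations
scores_toy = np.array([2.5, -0.1, 0.7, -3.0, 0.05, -1.2])

# argsort returns positions that would sort |score| in ascending order
order = np.argsort(np.abs(scores_toy))
closest_two = order[:2]

print(order.tolist())        # [4, 1, 2, 5, 0, 3]
print(closest_two.tolist())  # [4, 1]
```

Positions 4 and 1 hold the scores 0.05 and -0.1, the two values nearest the boundary regardless of sign.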

In [None]:
# Get the decision function values for all training examples
decision_scores = logr_bin.decision_function(X_train_scaled)

# Find examples near the decision boundary (small absolute scores)
boundary_indices = np.argsort(np.abs(decision_scores))[:25]

# Plot these examples
plt.figure(figsize=(10, 8))
for i, idx in enumerate(boundary_indices):
    plt.subplot(5, 5, i+1)
    plt.imshow(X_train[idx].reshape(28, 28), cmap='binary')
    plt.title(f"{'5' if y_train_5[idx] else 'not-5'}")
    plt.axis('off')
plt.suptitle('Digits Near Decision Boundary')
plt.tight_layout()
plt.show()

Looking at the images, we can see why the model might struggle with these particular digits:

- Some "5"s are written in unusual ways that make them look similar to other digits
- Several "3"s appear that could be mistaken for "5"s due to similar curved structures
- Some "8"s and "6"s near the decision boundary share structural similarities with "5"s
- There are even some actual "5"s labeled as "not-5" and vice versa, suggesting possible labeling errors or extremely ambiguous handwriting

The middle row shows how a curved "6" could be confused with a "5", while the true "5"s in row 4 have various styles that demonstrate the variability in handwriting the model must handle.

This visualization reveals the complexity of the classification problem beyond abstract accuracy metrics, showing the actual edge cases where human interpretation might even disagree with the labels.

## Training a Multiclass Classifier

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier

We'll demonstrate the use of OvR, OvO, and direct multinomial methods with Logistic Regression, as described in the slides.

### Compare Approaches

First, we'll use the default multinomial method. `max_iter` was set to 1000 because the default value of 100 did not converge (you will get a warning message in this case). Alternatively (or additionally), you may find other solvers converge better in some cases.

Note that this method leverages all CPU cores.

In [None]:
multi_clf = LogisticRegression(max_iter=1000)

We'll use Jupyter's `%%timeit` cell magic to measure the time required. This runs the cell in a different **namespace**, so any variable created in it isn't available afterwards. Hence, I created the `multi_clf` instance first, before timing the `fit` below.

In [None]:
%%timeit -n1 -r1
multi_clf.fit(X_train_scaled, y_train)

Since the `fit` method updates the existing `multi_clf` object directly (in-place modification), it is still accessible.
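
An alternative that this notebook does not use is to time the call manually with `time.perf_counter`. The sketch below assumes a tiny made-up dataset; because the code runs in the regular namespace, a model created inside the timed section remains available.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset so the example runs quickly
rng = np.random.default_rng(0)
X_small = rng.normal(size=(200, 4))
y_small = (X_small[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

start = time.perf_counter()
timed_clf = LogisticRegression().fit(X_small, y_small)  # created inside the timed code
elapsed = time.perf_counter() - start

print(f"fit took {elapsed:.4f} s")
print(hasattr(timed_clf, "coef_"))  # the fitted model is still available
```

Unlike `%%timeit`, this measures a single run and does not repeat or average, so it is closest in spirit to the `-n1 -r1` settings used here.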

In [None]:
multi_clf

In [None]:
# Score on both training and test sets
train_multi_score = multi_clf.score(X_train_scaled, y_train)
test_multi_score = multi_clf.score(X_test_scaled, y_test)
print(f"Multinomial Training Accuracy: {train_multi_score:.4f}")
print(f"Multinomial Test Accuracy: {test_multi_score:.4f}")
print(f"Difference (Train-Test): {train_multi_score-test_multi_score:.4f}")

Next we'll use OvR. This will train 10 classifiers - one for each digit - using 60k observations of 784 values. In spite of that, it takes only about 2x as long, thanks to efficient utilization of CPU cores.

In [None]:
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))

In [None]:
%%timeit -n1 -r1
ovr_clf.fit(X_train_scaled, y_train)

In [None]:
# Score on both training and test sets
train_ovr_score = ovr_clf.score(X_train_scaled, y_train)
test_ovr_score = ovr_clf.score(X_test_scaled, y_test)
print(f"OvR Training Accuracy: {train_ovr_score:.4f}")
print(f"OvR Test Accuracy: {test_ovr_score:.4f}")
print(f"Difference (Train-Test): {train_ovr_score - test_ovr_score:.4f}")

Finally, we'll use OvO. This will train 10 * 9 / 2 = 45 different classifiers, each on a subset of the data. In this case it runs slightly faster than the OvR approach, despite many more models.
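
The number of underlying models can be confirmed from the fitted wrappers. To keep it fast, this sketch uses the first 500 images of scikit-learn's bundled 8 x 8 digits dataset as a small stand-in for MNIST; the variable names are chosen for the example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled 8x8 digits (a small stand-in for MNIST)
X_dig, y_dig = load_digits(return_X_y=True)
X_dig = StandardScaler().fit_transform(X_dig[:500])
y_dig = y_dig[:500]

ovr_small = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_dig, y_dig)
ovo_small = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_dig, y_dig)

n_classes = len(ovr_small.classes_)
n_ovr = len(ovr_small.estimators_)   # one model per class
n_ovo = len(ovo_small.estimators_)   # one model per pair of classes
print(n_classes, n_ovr, n_ovo)
```

Each OvO model sees only the images from its two classes, which is one reason 45 small fits can be competitive in time with 10 fits on the full training set.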

In [None]:
ovo_clf = OneVsOneClassifier(LogisticRegression(max_iter=1000))

In [None]:
%%timeit -n1 -r1
ovo_clf.fit(X_train_scaled, y_train)

In [None]:
# Score on both training and test sets
train_ovo_score = ovo_clf.score(X_train_scaled, y_train)
test_ovo_score = ovo_clf.score(X_test_scaled, y_test)
print(f"OvO Training Accuracy: {train_ovo_score:.4f}")
print(f"OvO Test Accuracy: {test_ovo_score:.4f}")
print(f"Difference (Train-Test): {train_ovo_score - test_ovo_score:.4f}")

### Summarize the Scores

In [None]:
# Assuming you've already run the models and collected these scores
# Store all results in a dictionary
results = {
    'Method': ['Multinomial', 'OvR', 'OvO'],
    'Training Accuracy': [train_multi_score, train_ovr_score, train_ovo_score],
    'Test Accuracy': [test_multi_score, test_ovr_score, test_ovo_score]
}

# Create a DataFrame
results_df = pd.DataFrame(results)

# Calculate the difference
results_df['Difference (Train-Test)'] = results_df['Training Accuracy'] - results_df['Test Accuracy']

# Display the table
print("Comparison of Classification Methods:")
print(results_df.to_string(index=False, float_format=lambda x: f"{x:.4f}"))

# Create a plot
plt.figure(figsize=(10, 6))

# Bar plot for accuracies
ax1 = plt.subplot(121)
results_df_melt = pd.melt(results_df, 
                          id_vars=['Method'], 
                          value_vars=['Training Accuracy', 'Test Accuracy'],
                          var_name='Dataset', value_name='Accuracy')
sns.barplot(x='Method', y='Accuracy', hue='Dataset', data=results_df_melt, ax=ax1)
ax1.set_title('Training vs Test Accuracy')
ax1.set_ylim(0.85, 1.0)  # Adjust as needed to make differences visible
ax1.grid(axis='y', alpha=0.3)

# Bar plot for differences
ax2 = plt.subplot(122)
sns.barplot(x='Method', y='Difference (Train-Test)', data=results_df, ax=ax2)
ax2.set_title('Overfitting (Train-Test Difference)')
ax2.grid(axis='y', alpha=0.3)
ax2.set_ylim(0, 0.05)  # Adjust as needed

plt.tight_layout()
plt.show()

At least in this case, the choice between these methods involves trade-offs:

- For highest possible accuracy, OvO wins
- For best generalization, OvR is most stable
- Multinomial is a good middle ground

These results are consistent with a plausible explanation, though a single train-test split cannot confirm it: OvO trains many specialized classifiers that can better capture complex decision boundaries but may learn some training set noise, while OvR's simpler approach generalizes more consistently but with slightly lower overall accuracy.

### Confusion Matrix

As we did for binary classification, we'll create a confusion matrix to better see how this model is predicting each class.

We'll use the `ConfusionMatrixDisplay` to easily create that based on model predictions for the test data. Here, `normalize` and `values_format` are used to display each cell as a percentage of total number of images in the row (the true class). For example, the bottom left cell tells us that 1% of images labeled "9" were misclassified as "0". Normalizing in this fashion allows us to make direct comparisons of the results.
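
To see what row normalization does numerically, the sketch below applies `confusion_matrix` with `normalize="true"` to a made-up three-class example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for a three-class problem
y_true_cm = ["a", "a", "a", "a", "b", "b", "c", "c", "c", "c"]
y_pred_cm = ["a", "a", "a", "b", "b", "b", "c", "c", "a", "b"]

# normalize="true" divides each row by the number of true samples in that class
cm_norm = confusion_matrix(y_true_cm, y_pred_cm, normalize="true")
print(cm_norm)
print(cm_norm.sum(axis=1))  # each row sums to 1
```

Because every row sums to 1, a class with few images can be compared directly with a class that has many.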

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# For multiclass classifier (using OvR as an example)
y_pred_multi = ovr_clf.predict(X_test_scaled)

# Pass an explicit Axes so the figure size applies to this plot;
# without it, from_predictions creates its own figure
fig, ax = plt.subplots(figsize=(10, 8))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_multi,
                                        normalize="true",
                                        values_format=".0%",
                                        ax=ax)
ax.set_title("Multiclass Classifier Confusion Matrix - All Predictions")
plt.tight_layout()
plt.show()

Only 86% of the images of 5s were correctly classified. Most commonly, 5s were misclassified as 3s (4% of the time), with 8s (3%) and 6s (2%) also prevalent.

To make errors stand out more, we can put zero weight on the correct predictions. This amplifies error patterns.
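
The effect of the weights can be checked on a made-up example. In this sketch, correct predictions get weight 0, so the diagonal vanishes and each row shows how that class's errors are distributed.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels; every class has at least one misclassified sample
y_true_err = np.array(["a", "a", "a", "a", "b", "b", "c", "c", "c", "c"])
y_pred_err = np.array(["a", "a", "a", "b", "b", "c", "c", "c", "a", "b"])

# Weight 1 for misclassified samples, 0 for correct ones
weights = (y_pred_err != y_true_err).astype(int)

cm_errors = confusion_matrix(y_true_err, y_pred_err,
                             sample_weight=weights, normalize="true")
print(cm_errors)
```

Note that the percentages change meaning in this view: a cell now reports a share of that class's errors, not a share of all its images.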

In [None]:
sample_weight = (y_pred_multi != y_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_multi,
                                        sample_weight=sample_weight,
                                        normalize="true",
                                        values_format=".0%")
plt.title("Multiclass Classifier Confusion Matrix - Errors Only")
plt.tight_layout()
plt.show()

From this we can see that, for example, 1s are most commonly misclassified as 8s, and 4s are most commonly misclassified as 9s.

Tools like these can help us see where the model is doing well or poorly. This information can then be used to target improvements in preprocessing, parameter tuning, feature engineering, model selection, etc., potentially leading to a more robust classifier.

### Explore Results

Let's look at the results of the multinomial approach. If we call the `decision_function` method for a single observation, it will return 10 scores, one for each class.

In [None]:
# decision_function expects an array of observations,
# so we put it in a list
z_values = multi_clf.decision_function([some_digit])

In [None]:
z_values

In [None]:
# create a list of sorted z_values with their labels
# decision_function returns an array of results,
# one per observation, so we have to take the first [0]
label_values = [(label, score) for label, score in enumerate(z_values[0])]

sorted_pairs = sorted(label_values, key=lambda x: x[1], reverse=True)

for label, score in sorted_pairs:
    print(f"{label}: {score:.4f}")

This shows that the multinomial model is most confident that the pixel data in `some_digit` is a $5$, with a $z$ value of 12.9053. The second most confident prediction is $3$, with 11.2705.

In [None]:
multi_clf.classes_

In this case the result is rather trivial, since the indices and class names match. In other circumstances it is more helpful to know the index of each label.
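
A case where indices and labels differ makes the role of `classes_` clearer. The sketch below uses the bundled iris dataset with string labels as an assumed stand-in, and also checks that, for the default multinomial model, applying softmax to the scores reproduces `predict_proba`.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris with string labels, so column indices and class names differ
iris = load_iris()
X_iris = iris.data
y_iris = iris.target_names[iris.target]   # e.g. "setosa", "versicolor", ...

clf_iris = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)

z_iris = clf_iris.decision_function(X_iris[:5])   # shape (5, 3): one column per class
best_columns = np.argmax(z_iris, axis=1)

# classes_ maps each column index back to its label
labels_from_scores = clf_iris.classes_[best_columns]
print(labels_from_scores)
print(np.array_equal(labels_from_scores, clf_iris.predict(X_iris[:5])))

# For the multinomial model, softmax of the scores gives the class probabilities
probs_from_scores = softmax(z_iris, axis=1)
print(np.allclose(probs_from_scores, clf_iris.predict_proba(X_iris[:5])))
```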

### Important Reminder

This analysis relies on a single predetermined train-test split without verifying that the observations are assigned to the splits in a random and balanced manner. This approach could introduce bias if the original data has inherent ordering. Cross-validation would provide more reliable performance estimates by reducing the variance in evaluation metrics and giving a better approximation of how the model would perform on unseen data. Additionally, no hyperparameter optimization was performed and important parameters like regularization strength (C), penalty types, or alternative solvers for LogisticRegression were not explored. These could significantly impact the relative performance and overfitting behaviors observed.
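
As one possible way to act on this reminder, the sketch below runs stratified 5-fold cross-validation. It is an illustration under assumptions: it uses the small bundled digits dataset instead of MNIST to keep the run short, and the fold count and random seed are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_digits(return_X_y=True)

# Putting the scaler in a pipeline refits it on each training fold,
# so no information from the held-out fold leaks into the scaling
model_cv = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified, shuffled folds keep class proportions similar in every split
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model_cv, X_cv, y_cv, cv=folds, scoring="accuracy")

print(f"Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

The spread across folds gives a sense of how much a single split's score could vary, which the train-test comparison above cannot show.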