# Logistic regression on the MNIST handwritten digits

In this miniproject, you will apply logistic regression to the problem of handwritten digit recognition. The data set used below is large in both sample size and dimension. To speed up computation, you can use PCA to reduce the dimension considerably. You will then apply logistic regression (no regularization) to the dimension-reduced data.

Note: Regularization could be used as an alternative to PCA to handle the large number of features. It would also effectively handle multicollinearity, but it is computationally more intensive.


Import libraries

In [None]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
import scipy.io

Load the MNIST data set (first download the file mnist.mat and save it in the same folder as the notebook)

In [None]:
mat = scipy.io.loadmat('mnist.mat')
mat

In [None]:
X_train = mat['Xtr']
y_train = mat['ytr'].ravel()  # loadmat returns 2-D arrays; flatten the labels to 1-D
X_test = mat['Xtst']
y_test = mat['ytst'].ravel()

X_train.shape, y_train.shape, X_test.shape, y_test.shape

Apply PCA to reduce the dimension from 784 to 50

In [None]:
pca50 = PCA(n_components=50).fit(X_train)

X_train_pca = pca50.transform(X_train)
X_test_pca = pca50.transform(X_test)

X_train_pca.shape, X_test_pca.shape
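
Before moving on, it can help to check how much variance the kept components retain. The sketch below is only an illustration on synthetic data (the array `X_demo` and the choice of 20 components are assumptions, not part of the assignment); with the real data you would inspect `pca50.explained_variance_ratio_` in the same way.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the image matrix (NOT MNIST): 500 samples, 100 features
# generated from 10 latent directions plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 100))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 100))

pca_demo = PCA(n_components=20).fit(X_demo)

# Fraction of the total variance captured by each kept component, and in total.
ratios = pca_demo.explained_variance_ratio_
retained = ratios.sum()

# Cumulative curve: smallest number of components whose share reaches 95%.
# (If the kept components never reach 95%, this evaluates to n_components + 1.)
cumulative = np.cumsum(ratios)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(f"variance retained by 20 components: {retained:.3f}")
print(f"components needed for 95% of the variance: {n_for_95}")
```

Note that PCA is fitted on the training set only and then applied to both sets, exactly as in the cell above.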

The following experiments are all based on the dimension-reduced data (training and test).

## (1) Classifying a pair of digits

First, let's consider the pair $\mathbf{\{0,1\}}$, which should be relatively easy. Train a binary logistic regression classifier on all the training images that contain either digit and evaluate the trained classifier on the test images containing one of the two digits. Display the confusion matrix. What is the overall classification error?

Do the same thing with the pair $\mathbf{\{1,7\}}$ instead. How does it compare with the pair {0,1} in terms of overall error?

Which pair of digits do you think are the hardest to distinguish? Perform binary logistic regression with those images (training and test) and report the overall error.
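
One possible way to organize a pair experiment is sketched below. It is not a reference solution: the helper `pair_experiment` is a hypothetical name, the data are synthetic Gaussian classes rather than MNIST, and a very large `C` is used as an assumption to approximate "no regularization" in scikit-learn's `LogisticRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def pair_experiment(X_tr, y_tr, X_te, y_te, a, b):
    """Fit binary logistic regression on classes a and b; return (confusion matrix, test error)."""
    y_tr, y_te = np.ravel(y_tr), np.ravel(y_te)
    # Keep only the samples whose label is one of the two classes.
    tr, te = np.isin(y_tr, [a, b]), np.isin(y_te, [a, b])
    # Assumption: a very large C stands in for "no regularization".
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(X_tr[tr], y_tr[tr])
    pred = clf.predict(X_te[te])
    cm = confusion_matrix(y_te[te], pred, labels=[a, b])  # rows: true, columns: predicted
    return cm, float(np.mean(pred != y_te[te]))

# Synthetic stand-in data (NOT MNIST): three overlapping Gaussian classes in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0, 0], [2, 0, 0, 0, 0], [0, 2, 0, 0, 0]], dtype=float)

def sample(n_per_class):
    X = np.vstack([c + rng.normal(size=(n_per_class, 5)) for c in centers])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

X_tr_demo, y_tr_demo = sample(200)
X_te_demo, y_te_demo = sample(100)

cm01, err01 = pair_experiment(X_tr_demo, y_tr_demo, X_te_demo, y_te_demo, 0, 1)
print(cm01)
print(f"test error for pair (0, 1): {err01:.3f}")
```

With the real data, the same helper could be called with `X_train_pca`, `y_train`, `X_test_pca`, `y_test` and any pair of digits. The overall error equals one minus the trace of the confusion matrix divided by its sum.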

## (2) Classifying three digits together

Consider the triple of digits $\mathbf{\{0,1,2\}}$, which should also be relatively easy. We need to train a multiclass logistic regression classifier on all the training images that contain one of the three digits and evaluate the trained classifier on the test images containing one of the three digits. There are three ways to extend binary logistic regression to the multiclass setting.

(2a) Perform the $\textbf{one-versus-rest}$ multiclass logistic regression on the three digits {0,1,2} and report the overall test error.

(2b) Perform $\textbf{multinomial}$ logistic regression with the three digits {0,1,2} and report the overall test error.

(2c) Perform the $\textbf{one-versus-one}$ multiclass logistic regression on the three digits {0,1,2} and report the overall test error. You will need to implement this extension from scratch first.

(2d) Which combination of three digits do you think are the hardest to distinguish? Perform each of the three extensions of multiclass logistic regression with those images (training and test) and report the overall test errors (using a bar plot).
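
The three extensions in (2a)-(2c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the expected solution: `OneVsOneLogistic` and `make_lr` are hypothetical names, the data are synthetic, a large `C` approximates "no regularization", ties in the vote go to the smallest label, and the plain `LogisticRegression` call is taken to be multinomial, which holds for the default `lbfgs` solver in recent scikit-learn releases.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def make_lr():
    # Assumption: a very large C stands in for "no regularization".
    return LogisticRegression(C=1e6, max_iter=1000)

class OneVsOneLogistic:
    """One binary classifier per pair of classes; prediction by majority vote."""

    def fit(self, X, y):
        y = np.ravel(y)
        self.classes_ = np.unique(y)  # sorted class labels
        self.models_ = []
        for a, b in combinations(self.classes_, 2):
            mask = (y == a) | (y == b)
            self.models_.append(make_lr().fit(X[mask], y[mask]))
        return self

    def predict(self, X):
        rows = np.arange(X.shape[0])
        votes = np.zeros((X.shape[0], len(self.classes_)), dtype=int)
        for model in self.models_:
            # Map each predicted label to its column and add one vote there.
            cols = np.searchsorted(self.classes_, model.predict(X))
            votes[rows, cols] += 1
        # Ties go to the smallest label because argmax returns the first maximum.
        return self.classes_[votes.argmax(axis=1)]

# Synthetic stand-in data (NOT MNIST): three overlapping Gaussian classes in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0, 0], [2, 0, 0, 0, 0], [0, 2, 0, 0, 0]], dtype=float)

def sample(n_per_class):
    X = np.vstack([c + rng.normal(size=(n_per_class, 5)) for c in centers])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

X_tr_demo, y_tr_demo = sample(200)
X_te_demo, y_te_demo = sample(100)

models = {
    "one-vs-rest": OneVsRestClassifier(make_lr()),
    "multinomial": make_lr(),
    "one-vs-one": OneVsOneLogistic(),
}
errors = {}
for name, model in models.items():
    pred = model.fit(X_tr_demo, y_tr_demo).predict(X_te_demo)
    errors[name] = float(np.mean(pred != y_te_demo))
    print(f"{name}: test error = {errors[name]:.3f}")
```

For K classes the one-versus-one scheme trains K(K-1)/2 binary classifiers, so three digits need three models and all ten digits need 45.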

## (3) Classifying all 10 digits simultaneously

Perform each of the three extensions of multiclass logistic regression with the full data set and report the overall test errors (using a bar plot).
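
A rough end-to-end sketch of such a comparison with a bar plot is given below. To stay self-contained it uses scikit-learn's bundled 8x8 digits (`load_digits`) as a stand-in for `mnist.mat`, keeps 30 principal components instead of 50, approximates "no regularization" with a large `C`, and relies on scikit-learn's `OneVsOneClassifier` as a cross-check rather than a from-scratch implementation. All of these are assumptions for illustration, and the resulting numbers say nothing about MNIST itself.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Stand-in data (NOT mnist.mat): 1797 images of size 8x8, i.e. 64 features.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit PCA on the training part only, then project both parts.
pca = PCA(n_components=30).fit(X_tr)
Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)

def make_lr():
    # Assumption: a very large C stands in for "no regularization".
    return LogisticRegression(C=1e6, max_iter=1000)

models = {
    "one-vs-rest": OneVsRestClassifier(make_lr()),
    "multinomial": make_lr(),  # multinomial with the default lbfgs solver in recent releases
    "one-vs-one": OneVsOneClassifier(make_lr()),
}

errors = {}
for name, model in models.items():
    pred = model.fit(Z_tr, y_tr).predict(Z_te)
    errors[name] = float(np.mean(pred != y_te))

# Bar plot of the overall test error of each extension.
fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(list(errors), list(errors.values()))
ax.set_ylabel("test error")
ax.set_title("Multiclass logistic regression (stand-in digits data)")
fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes. The fits may emit convergence warnings because the weak regularization lets the coefficients grow large on nearly separable classes.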

Summarize your findings by commenting on which version of multiclass logistic regression works best on the handwritten digits data set and which digits are the hardest to distinguish from the rest.