[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nkeriven/ensta-mt12/blob/main/notebooks/02a_discriminant_analysis/N2_discriminant-analysis_zip_digits.ipynb)

# Handwritten digits recognition

We aim to recognize handwritten digits in digital images.
After normalization (see [here](http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info.txt) for more details on these data, which are taken from zip codes on US postal envelopes), each image consists of $16 \times 16$ pixels, and each pixel is quantized as a gray level in the interval $[-1, 1]$.
Thus, we have $X \in [-1,1]^{256}$, i.e., $d = 256$, and $Y \in \{0,1, \ldots, 9 \}$.
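
To make the vector/image correspondence concrete, here is a minimal sketch on a synthetic vector (not a real digit). It assumes the row-major layout that `np.reshape` uses by default, which is also what the display cells of this notebook rely on; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one row of the data matrix:
# 256 gray levels drawn uniformly in [-1, 1].
x = rng.uniform(-1.0, 1.0, size=256)

# Row-major reshape: pixel (i, j) of the image is entry 16 * i + j of x.
img = x.reshape(16, 16)

print(img.shape)                   # (16, 16)
print(img[3, 5] == x[16 * 3 + 5])  # True
```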

## Part I

First, we use a training set of $n=257$ samples (file `zip_train.mat`) and a test set of $255$ samples (file `zip_test.mat`).
The code cells below let you:
- [Display some images from each class](#Display-some-digits)
- [Compare several discriminant analysis models and perform a Linear Discriminant Analysis on the training set under the Naïve Bayes assumption (diagonal covariance matrix)](#Compare-the-performances-obtained-for-different-variants-of-discriminant-analysis)
- [Display as an image the estimated mean values for each class $k=0, \ldots, 9$](#Display-as-an-image-the-estimated-mean-values-for-each-class)
- [Display some image realizations according to the generative model that has been learned](#Display-some-image-realizations-according-to-the-generative-model-that-has-been-learned)


### Load data sets

You can either download the data yourself and place it in the same folder as the notebook, or use the commands below.

In [None]:
!wget https://raw.githubusercontent.com/nkeriven/ensta-mt12/main/notebooks/data/zip_train.mat -O zip_train.mat
!wget https://raw.githubusercontent.com/nkeriven/ensta-mt12/main/notebooks/data/zip_test.mat -O zip_test.mat

In [None]:
# Import modules
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.io as spio

# Warning: put the data files in the notebook directory
data = spio.loadmat("zip_train.mat")
Xtrain = data["Xtrain"]
Ytrain = data["Ytrain"]
Xshape = Xtrain.shape
Ytrain = np.reshape(Ytrain, (Xshape[0],))
Yshape = Ytrain.shape

print(f"Xtrain is (n={Xshape[0]},p={Xshape[1]}) sized")
print(f"Ytrain is a (n={Yshape[0]},) sized vector of responses")

data_test = spio.loadmat("zip_test.mat")
Xtest = data_test["Xtest"]
Ytest = data_test["Ytest"]
Ytest = np.reshape(Ytest, (Xtest.shape[0],))
print(f"Xtest is (n={Xtest.shape[0]},p={Xtest.shape[1]}) sized")

### Display some digits

In [None]:
fig = plt.figure(figsize=(12, 5))  # to specify the size of the images
for i in range(10):
    fig.add_subplot(2, 5, i+1)
    plt.axis('off')
    mplot = plt.imshow(np.reshape(Xtrain[i], (16, 16)), cmap="gray_r")
    mplot.axes.get_xaxis().set_visible(False)
    mplot.axes.get_yaxis().set_visible(False)

### Compare the performances obtained for different variants of discriminant analysis

- LDA and QDA are already implemented in scikit-learn as [LinearDiscriminantAnalysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis) and [QuadraticDiscriminantAnalysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html#sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis).
- QDA with the Naïve Bayes assumption is available as [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html).
- Finally, LDA with the Naïve Bayes assumption can be obtained as a special case of [LinearDiscriminantAnalysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis) by setting the shrinkage parameter to 1 (`shrinkage=1`).



In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB as QDA_NB


def print_discr_analysis_perf(model, msg):
    # train the discr_analysis model
    model.fit(Xtrain, Ytrain)
    # compute the train/test misclassification rates (mcr)
    Ytrain_hat = model.predict(Xtrain)
    mcr_train = np.mean(Ytrain_hat != Ytrain)
    Ytest_hat = model.predict(Xtest)
    mcr_test = np.mean(Ytest_hat != Ytest)
    print("{:7s} misclassification rates: train = {:0.3f}, test = {:0.3f}".
          format(msg, mcr_train, mcr_test))


# QDA case
print_discr_analysis_perf(QDA(), 'QDA')
# QDA + NB case
print_discr_analysis_perf(QDA_NB(), 'QDA+NB')
# LDA case
print_discr_analysis_perf(LDA(), 'LDA')
# LDA + NB case
print_discr_analysis_perf(LDA(solver="eigen", shrinkage=1), 'LDA+NB')

#### Exercise I.1:
- Can you explain why scikit-learn raises a warning when we train the QDA and LDA models (but not in the QDA+NB and LDA+NB cases)?
- Why does linear discriminant analysis under the Naïve Bayes assumption seem the most appropriate among all the discriminant analysis methods here?
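
One way to explore the first question is to look at the rank of an empirical covariance matrix when there are fewer samples than dimensions. The sketch below uses synthetic Gaussian data (not the zip digits), with about 25 samples per class as a rough stand-in for $257$ samples split over $10$ classes; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_per_class, p = 25, 256                 # assumed: roughly 257 / 10 samples per class
X_k = rng.normal(size=(n_per_class, p))  # synthetic stand-in for one class

# Full empirical covariance of the class (p x p)
S_k = np.cov(X_k, rowvar=False)
rank = np.linalg.matrix_rank(S_k)
print(f"rank of the full covariance: {rank} (dimension p = {p})")

# Keeping only the diagonal (Naive Bayes assumption) gives an invertible matrix
# as soon as every feature has a non-zero empirical variance.
S_diag = np.diag(np.diag(S_k))
diag_rank = np.linalg.matrix_rank(S_diag)
print(f"rank of the diagonal covariance: {diag_rank}")
```

After centering, $n_k$ samples span at most $n_k - 1$ directions, so the full covariance cannot be inverted when $n_k \le p$, whereas the diagonal version does not suffer from this.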

### Display as an image the estimated mean values for each class
We now focus on the LDA classifier with the *NB assumption* (i.e., the same *diagonal* covariance matrix for all classes).

In [None]:
# Train LDA + NB
lda_nb = LDA(solver="eigen",  shrinkage=1)
lda_nb.fit(Xtrain, Ytrain)

In [None]:
# show the mean vector of each class estimated by LDA
fig = plt.figure(figsize=(12, 5))  # to specify the size of the images
for k in range(0, 10):
    fig.add_subplot(2, 5, k+1)
    Xmean = lda_nb.means_[k]
    mplot = plt.imshow(np.reshape(Xmean, (16, 16)), cmap="gray_r")
    plt.title(k)
    # hide the axis
    plt.axis('off')
plt.show()

### Display some image realizations according to the generative model that has been learned

In [None]:
print('Draw some Gaussian vectors according to the LDA with NB model')
fig = plt.figure(figsize=(12, 5))  # to specify the size of the images
for k in range(0, 10):
    fig.add_subplot(2, 5, k+1)
    # generate the gaussian vector
    Xsynth = lda_nb.means_[k] + \
        np.real(sp.linalg.sqrtm(lda_nb.covariance_)) @ np.random.randn(256)
    mplot = plt.imshow(np.reshape(Xsynth, (16, 16)), cmap="gray_r")
    plt.title(k)
    # hide the axis
    plt.axis('off')
plt.show()
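
The sampling step above relies on the fact that if $z \sim \mathcal{N}(0, I)$, then $\mu + \Sigma^{1/2} z \sim \mathcal{N}(\mu, \Sigma)$. The following sketch checks this numerically on a small synthetic covariance matrix (dimension 4 instead of 256); `sigma` and `mu` are made-up values, not quantities learned from the digits.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)

# Synthetic symmetric positive definite covariance and mean (illustrative only)
A = rng.normal(size=(4, 4))
sigma = A @ A.T + 0.1 * np.eye(4)
mu = np.array([1.0, -1.0, 0.0, 2.0])

# Matrix square root: root @ root equals sigma up to numerical error
root = np.real(sqrtm(sigma))

# Draw many samples mu + root @ z, written row-wise as z @ root.T
z = rng.normal(size=(100_000, 4))
samples = mu + z @ root.T

# The empirical covariance of the samples should be close to sigma
emp_cov = np.cov(samples, rowvar=False)
max_abs_err = np.max(np.abs(emp_cov - sigma))
print(f"largest entry-wise covariance error: {max_abs_err:.4f}")
```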

#### Exercise I.2:
Run the cells above and below several times to observe different random realizations.
- Do these synthetic examples seem realistic to you?
- What is the benefit of such a model?
- Comparing with the QDA-based synthetic examples in the cell below, what can we conclude (remember that QDA generalizes catastrophically to the test data here)?

#### Synthetic examples for QDA: display some image realizations based on the QDA generative model

In [None]:
qda = QDA(store_covariance=True)
qda.fit(Xtrain, Ytrain)

print('Draw some Gaussian vectors according to the QDA model')
fig = plt.figure(figsize=(12, 5))  # to specify the size of the images
for k in range(0, 10):
    fig.add_subplot(2, 5, k+1)
    # generate the gaussian vector
    Xsynth = qda.means_[k] + \
        np.real(sp.linalg.sqrtm(qda.covariance_[k])) @ np.random.randn(256)
    mplot = plt.imshow(np.reshape(Xsynth, (16, 16)), cmap="gray_r")
    plt.title(k)
    # hide the axis
    plt.axis('off')
plt.show()

## Part II: Regularized discriminant analysis

To obtain more flexibility for LDA methods, a common procedure is regularized (linear) discriminant analysis, in which all classes share the same covariance matrix

$$
\hat{\Sigma}_\gamma = (1-\gamma) \hat{\Sigma} + \gamma \, \mathrm{diag}(\hat{\Sigma}),
$$

where $\hat{\Sigma}$ is the empirical pooled covariance matrix, $\mathrm{diag}{(\hat{\Sigma})}$ is the diagonal matrix with diagonal entries equal to those of $\hat{\Sigma}$ (Naïve Bayes empirical estimate), and $\gamma \in [0,1]$ is the amount of regularization.
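
As an illustration, the formula can be written directly in NumPy. This is a sketch of the definition given in the text on synthetic data, not of scikit-learn's internal implementation; the helper name `regularized_covariance` is hypothetical.

```python
import numpy as np


def regularized_covariance(sigma_hat, gamma):
    """Return (1 - gamma) * sigma_hat + gamma * diag(sigma_hat)."""
    return (1.0 - gamma) * sigma_hat + gamma * np.diag(np.diag(sigma_hat))


rng = np.random.default_rng(0)
n, p = 20, 50                        # fewer samples than dimensions
X = rng.normal(size=(n, p))          # synthetic data, for illustration only
sigma_hat = np.cov(X, rowvar=False)  # singular because n <= p

ranks = {}
for gamma in (0.0, 0.5, 1.0):
    sigma_gamma = regularized_covariance(sigma_hat, gamma)
    ranks[gamma] = np.linalg.matrix_rank(sigma_gamma)
    print(f"gamma = {gamma:.1f}: rank = {ranks[gamma]} / {p}")
```

For any $\gamma > 0$ the regularized matrix is the sum of a positive semi-definite matrix and a positive definite diagonal one, so it is invertible, and its diagonal is the same as that of $\hat{\Sigma}$.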

Within sklearn, the regularization coefficient $\gamma$ corresponds to the `shrinkage` parameter of [LinearDiscriminantAnalysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis).


The code cells below let you:
- [Display misclassification rate curves as a function of the regularization coefficient](#Display-misclassification-rate-curves-as-a-function-of-the-regularized-discriminant-analysis-coefficient-%24%5Cgamma%24)
- [Estimate the optimal parameter $\gamma$ in an automatic way](#Automatic-estimation-of-the-optimal-regularization-parameter-$\gamma$)
- [Train LDA on a larger dataset](#Dealing-with-larger-datasets)
- [Display the confusion matrix to see the most common errors](#Display-the-confusion-matrix)


### Display misclassification rate curves as a function of the regularized discriminant analysis coefficient $\gamma$

In [None]:
res = 30  # you can crank up the resolution but it will not change a lot
gamma = np.linspace(1e-6, 1, res) # array of regularization coefficients.
# 0 results in numerical issue, we take a small minimum value

#### EXO: display test and train error vs gamma. Give the optimal value wrt test error
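
A possible starting point for this exercise is sketched below. So that it runs on its own, it uses the $8 \times 8$ digits bundled with scikit-learn (`load_digits`) as a stand-in for the zip data, with a deliberately small training set; the helper `mcr_curves` is a hypothetical name, and the numbers it produces are not those of the zip digits.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split


def mcr_curves(X_tr, y_tr, X_te, y_te, gammas):
    """Train/test misclassification rates of shrinkage LDA for each gamma."""
    train_err, test_err = [], []
    for g in gammas:
        model = LinearDiscriminantAnalysis(solver="eigen", shrinkage=float(g))
        model.fit(X_tr, y_tr)
        train_err.append(np.mean(model.predict(X_tr) != y_tr))
        test_err.append(np.mean(model.predict(X_te) != y_te))
    return np.array(train_err), np.array(test_err)


# Stand-in data (assumption): 200 training images, the rest used for testing
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=200, random_state=0, stratify=y
)

gammas = np.linspace(1e-6, 1, 30)
train_err, test_err = mcr_curves(X_tr, y_tr, X_te, y_te, gammas)

best = gammas[np.argmin(test_err)]
print(f"gamma with the lowest test error: {best:.3f} (mcr = {test_err.min():.3f})")
```

With the notebook's data, the same helper can be called as `mcr_curves(Xtrain, Ytrain, Xtest, Ytest, gamma)`, and both returned arrays can be plotted against `gamma` with `plt.plot`.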

#### Exercise II.1: 
- What happens in the limiting cases $\gamma=0$ and $\gamma=1$? Which methods correspond to these particular cases of regularized discriminant analysis?
- What are good choices here for the value of $\gamma$?
- In practice, when there is no test set, which common procedure can we use to estimate the optimal value of $\gamma$? (Note: we will see below an effective and computationally cheap alternative.)
- (*Optional*) Compare with the performance and computational cost obtained for a k-NN classifier.
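
For the third question, one common option is K-fold cross-validation over a grid of $\gamma$ values. The sketch below shows how this could look with `GridSearchCV`; it again uses `load_digits` as a stand-in dataset (an assumption made so the example is self-contained), and the grid and number of folds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Stand-in data (assumption): the first 300 images of the bundled digits
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:300], y[:300]

# Candidate regularization coefficients (arbitrary grid)
param_grid = {"shrinkage": np.linspace(0.01, 1.0, 10).tolist()}

# 5-fold cross-validation on the training data only: no test set is needed
search = GridSearchCV(
    LinearDiscriminantAnalysis(solver="eigen"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_small, y_small)

gamma_cv = search.best_params_["shrinkage"]
cv_mcr = 1.0 - search.best_score_
print(f"selected gamma = {gamma_cv:.2f}, cross-validated mcr = {cv_mcr:.3f}")
```

Each grid value requires fitting the model once per fold, which is the cost that an analytic choice of $\gamma$ avoids.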

### Automatic estimation of the optimal regularization parameter $\gamma$

In the high-dimensional framework, i.e., large dimension $p$ and large sample size $n$, it becomes possible to estimate the optimal shrinkage parameter (the one that minimizes the Frobenius norm $\|\Sigma-\widehat{\Sigma}_\gamma\|$) in a **simple analytic way**, following the result introduced in
 > Ledoit O, Wolf M. *Honey, I Shrunk the Sample Covariance Matrix*. The Journal of Portfolio Management 30(4), 110-119, 2004.

With sklearn, this is enabled by setting the `shrinkage` parameter of the `LinearDiscriminantAnalysis` class to `'auto'`, i.e., `shrinkage='auto'`.

In [None]:
lda_auto = LDA(solver="eigen", shrinkage="auto")
lda_auto.fit(Xtrain, Ytrain)

y_hat = lda_auto.predict(Xtrain)
mcr_train = np.mean(Ytrain != y_hat)

y_hat = lda_auto.predict(Xtest)
mcr_test = np.mean(Ytest != y_hat)

print("Auto regularized LDA mcr: train = {:0.3f}, test = {:0.3f}".
          format(mcr_train, mcr_test))
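
To get a feeling for how an analytic shrinkage estimate behaves, the sketch below applies the standalone `sklearn.covariance.LedoitWolf` estimator to synthetic Gaussian data with an increasing number of samples. Note that this estimator shrinks toward a scaled identity matrix, so its coefficient is only indicative of the $\gamma$ defined above, and no claim is made here about the exact preprocessing done inside `LinearDiscriminantAnalysis`.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)

p = 50
scales = np.linspace(0.5, 3.0, p)  # synthetic per-feature standard deviations

shrinkages = {}
for n in (30, 300, 3000):
    # Independent Gaussian features with different variances (illustrative data)
    X = rng.normal(size=(n, p)) * scales
    shrinkages[n] = LedoitWolf().fit(X).shrinkage_
    print(f"n = {n:5d}: estimated shrinkage = {shrinkages[n]:.3f}")
```

The estimated coefficient is obtained in closed form from a single pass over the data, with no grid search and no held-out set.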

#### Exercise II.2:
- Is this in good agreement with the estimates of the optimal regularized LDA performance derived previously?
- What are the benefits of this 'automatic' method compared to cross-validation?

### Dealing with larger datasets
We now consider the larger datasets `Xtrain_full` (Matlab file `zip_train_full.mat`) and `Xtest_full` (`zip_test_full.mat`):

In [None]:
!wget https://raw.githubusercontent.com/nkeriven/ensta-mt12/main/notebooks/data/zip_train_full.mat -O zip_train_full.mat
!wget https://raw.githubusercontent.com/nkeriven/ensta-mt12/main/notebooks/data/zip_test_full.mat -O zip_test_full.mat

In [None]:
# Warning: put the data files in the notebook directory
data = spio.loadmat("zip_train_full.mat")
Xtrain_full = data["Xtrain_full"]
Ytrain_full = data["Ytrain_full"]
Xshape = Xtrain_full.shape
Ytrain_full = np.reshape(Ytrain_full, (Xshape[0],))
Yshape = Ytrain_full.shape

print("Xtrain_full is (n={},p={}) sized".format(Xshape[0], Xshape[1]))
print("Ytrain_full is a (n={},) sized vector of responses".format(Yshape[0]))

data_test = spio.loadmat("zip_test_full.mat")
Xtest_full = data_test["Xtest_full"]
Ytest_full = data_test["Ytest_full"]
Ytest_full = np.reshape(Ytest_full, (Xtest_full.shape[0],))
print("Xtest_full is (n={},p={}) sized".format(Xtest_full.shape[0], Xtest_full.shape[1]))

#### Exercise II.3:
- Apply all the previous algorithms to these larger datasets and compare the performance with the previous results. What can you conclude? Do we still get numerical warnings? Why?
- Now compare the optimal values of the regularization parameter $\gamma$ for LDA. How can this be explained?
- (*Optional*) Now compare with a regularized QDA (set the `reg_param` parameter). How can this be explained?
- (*Optional*) Compare with the performance/cost ratio obtained for a k-NN classifier.

### Display the *confusion matrix*

The *confusion matrix* (see the [scikit-learn user guide](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) for some examples) is a useful tool in supervised learning to evaluate classification accuracy. Quoting Wikipedia:
>Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa, depending on the convention). The name stems from the fact that it makes it easy to see the confusion between two classes (i.e. commonly mislabeling one as another). 
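
A tiny hand-made example may help to read such a matrix. The labels below are invented for illustration; with scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted ones.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for a 3-class toy problem
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 2]]

# The misclassification rate is the share of off-diagonal counts
mcr = 1.0 - np.trace(cm) / cm.sum()

# Most frequent confusion: largest off-diagonal entry (first one in case of ties)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_k, pred_k = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"mcr = {mcr:.3f}, most frequent confusion: true {true_k} predicted as {pred_k}")
```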

In [None]:
from sklearn.metrics import confusion_matrix
lda_auto = LDA(solver="eigen", shrinkage="auto")
lda_auto.fit(Xtrain_full, Ytrain_full)

y_hat = lda_auto.predict(Xtest_full)
confusion_matrix(Ytest_full, y_hat) # y_hat is the auto regularized LDA prediction

In [None]:
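from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(lda_auto, Xtest_full, Ytest_full)
plt.show()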

#### Exercise II.4:
- What are the most common confusions between classes?