# <center>+++ Project is currently in progress ;-) +++</center>

# Principal component analysis on MNIST

In this notebook, I apply **principal component analysis (PCA)** to the MNIST data set of handwritten digits.

## 1. Set up the notebook and load the MNIST data

In contrast to previous projects, I will work with the entire MNIST data set in this project.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

In [4]:
# Get data from csv files
train_images = np.loadtxt("MNIST/mnist_train.csv", delimiter=",")
test_images = np.loadtxt("MNIST/mnist_test.csv", delimiter=",")

In [5]:
print(np.shape(train_images))
print(np.shape(test_images))

(60000, 785)
(10000, 785)


In [8]:
train_data = train_images[:,1:]
train_labels = train_images[:,:1]
test_data = test_images[:,1:]
test_labels = test_images[:,:1]

print('Shape of training data: {}'.format(np.shape(train_data)))
print('Shape of training labels: {}'.format(np.shape(train_labels)))
print('Shape of test data: {}'.format(np.shape(test_data)))
print('Shape of test labels: {}'.format(np.shape(test_labels)))

Shape of training data: (60000, 784)
Shape of training labels: (60000, 1)
Shape of test data: (10000, 784)
Shape of test labels: (10000, 1)


## 2. Statistics

In principal component analysis, the eigenvectors of the data's covariance matrix that belong to the largest eigenvalues indicate the directions of highest variance. These directions can be interpreted as the projected features of the data that contain the most information.

* *The ith **diagonal entry** of the covariance matrix is the variance in the ith coordinate (the ith pixel).*
* *The ith **eigenvalue** of the covariance matrix is the variance in the direction of the ith eigenvector.*
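
Both statements can be checked numerically. The following is a minimal sketch on synthetic data, not on MNIST itself; the names `toy_data`, `mixing` and `toy_sigma` are hypothetical and chosen only for illustration. It assumes the same covariance convention as this notebook (columns are variables, normalization by N).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the image matrix: 500 samples, 3 correlated features
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
toy_data = rng.normal(size=(500, 3)) @ mixing

# Covariance with columns as variables, normalized by N (bias=True)
toy_sigma = np.cov(toy_data, rowvar=False, bias=True)

# Claim 1: the diagonal entries are the per-coordinate variances
diag_matches = np.allclose(toy_sigma.diagonal(), toy_data.var(axis=0))

# Claim 2: each eigenvalue is the variance of the data projected
# onto the corresponding eigenvector
eigenvalues, eigenvectors = np.linalg.eigh(toy_sigma)
projected = toy_data @ eigenvectors  # column i = coordinate along eigenvector i
eig_matches = np.allclose(eigenvalues, projected.var(axis=0))

print(diag_matches, eig_matches)
```

`np.var` uses the 1/N normalization by default, which is why it agrees with `bias=True` here; with the default `np.cov` normalization (1/(N-1)) the comparison would need `ddof=1`.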

In [10]:
# Compute covariance matrix (columns are variables; bias=True normalizes by N)
sigma = np.cov(train_data, rowvar=False, bias=True)
# Compute coordinate-wise variances, in increasing order
coordinate_variances = np.sort(sigma.diagonal())
# Compute variances in eigenvector directions
eigenvector_variances = np.sort(np.linalg.eigvalsh(sigma))

In [36]:
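print('Highest variances \n\nin coordinate direction \tin eigenvector direction')
# Show the ten largest variances of each kind, in decreasing order
top_coordinate = np.rint(coordinate_variances[::-1][:10])
top_eigenvector = np.rint(eigenvector_variances[::-1][:10])
for coord, eig in zip(top_coordinate, top_eigenvector):
    print('{}\t\t\t\t{}'.format(coord, eig))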

Highest variances 

in coordinate direction 	in eigenvector direction
12953.0				332719.0
12934.0				243280.0
12752.0				211504.0
12736.0				184773.0
12689.0				166924.0
12682.0				147842.0
12682.0				112176.0
12680.0				98873.0
12645.0				94695.0
12629.0				80808.0


The highest variances in eigenvector directions are far larger than the highest variances along individual pixel coordinates: the largest eigenvalue (332719) is more than 25 times the largest pixel variance (12953). This shows that the **first eigenvector directions contain more information**, in the sense of variance, than any single image pixel coordinate.
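
The total variance is the same in both bases, because the trace of the covariance matrix equals the sum of its eigenvalues. The eigenvector basis only redistributes that variance so that it is concentrated in the first few directions. The sketch below illustrates this on synthetic data rather than on MNIST; the names `latent`, `loadings` and `toy_data` are hypothetical, and the two-factor structure is an assumption made only to produce correlated features.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 1000 samples, 5 features driven by 2 latent factors plus noise
latent = rng.normal(size=(1000, 2))
loadings = rng.normal(size=(2, 5))
toy_data = latent @ loadings + 0.1 * rng.normal(size=(1000, 5))

toy_sigma = np.cov(toy_data, rowvar=False, bias=True)

# Variances along coordinates and along eigenvectors, largest first
pixel_variances = np.sort(toy_sigma.diagonal())[::-1]
eigen_variances = np.sort(np.linalg.eigvalsh(toy_sigma))[::-1]

# The total variance does not depend on the basis (trace = sum of eigenvalues)
total_equal = np.isclose(pixel_variances.sum(), eigen_variances.sum())

# Fraction of the total variance captured by the top-k directions
coord_fraction = np.cumsum(pixel_variances) / pixel_variances.sum()
eigen_fraction = np.cumsum(eigen_variances) / eigen_variances.sum()

print(total_equal)
print(np.round(coord_fraction, 3))
print(np.round(eigen_fraction, 3))
```

For every k, the top-k eigenvector directions capture at least as much variance as the top-k coordinates. The same cumulative fractions could be computed from `coordinate_variances` and `eigenvector_variances` in this notebook to see how many principal components are needed to retain a given share of the variance.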