# Playing around with PCA on MNIST data.
stough 202-

We're going to look at Principal Component Analysis (PCA) on the MNIST data today, and draw an analogy with some of the block transform and image compression stuff we've done before.

## Import 
There's a lot to import: we need the MNIST data, PCA, and all of our block transform materials.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# For the MNIST data
from torchvision import datasets, transforms
from torchvision.utils import make_grid

# For PCA 
from sklearn.decomposition import PCA

from skimage.util import montage

# For importing some of our own utility codes.
import sys  
sys.path.insert(0, '../dip_utils')

from matrix_utils import (arr_info,
                          make_linmap)
from vis_utils import (vis_rgb_cube,
                       vis_hists,
                       vis_pair)

from wavelet_utils import (make_haar_matrix,
                           make_random_basis,
                           make_klt_basis,
                           make_dct_matrix,
                           make_standard_matrix,
                           vis_blocks)

## Loading and Formatting the MNIST Data

In [None]:
# Thank you: https://www.aiworkbox.com/lessons/load-mnist-dataset-from-pytorch-torchvision
# https://pytorch.org/docs/stable/torchvision/datasets.html
mnist_trainset = datasets.MNIST(root='/home/dip365/data', train=True, download=False, transform=None)
mnist_testset = datasets.MNIST(root='/home/dip365/data', train=False, download=False, transform=None)

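# Each dataset item is a (PIL image, label) pair. Flatten every 28x28 image to a 784-vector.
x_train = np.stack([np.array(x).ravel() for x,_ in mnist_trainset])
x_test = np.stack([np.array(x).ravel() for x,_ in mnist_testset])
# Labels as an explicit 64-bit integer array ('long' varies in size across platforms).
y_train = np.array([y for _,y in mnist_trainset], dtype=np.int64)
y_test = np.array([y for _,y in mnist_testset], dtype=np.int64)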

# These are Nx784. Scale the pixel values from [0, 255] down to [0, 1].
x_train = x_train.astype(np.float32)/255.0
x_test = x_test.astype(np.float32)/255.0

## PCA 
We'll run PCA on the MNIST training set to see its power on this kind of data. Let's use [pca_spanFaces](./pca_spanFaces.ipynb) as a guide. Each row of `x_train` represents one $28\times28$ image, flattened to 784 values. PCA won't know anything about that layout, and will just think we have 60K points in a 784-D space.
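
As a rough sketch of what that fit could look like with scikit-learn's `PCA`: the array `x_demo` below is a random stand-in so the cell runs on its own (it is not MNIST), and the choice of 50 components is arbitrary. With the real data you would pass `x_train` instead.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for x_train (an assumption so this runs without the dataset):
# 1000 random "images", each flattened to 784 values in [0, 1].
rng = np.random.default_rng(0)
x_demo = rng.random((1000, 784)).astype(np.float32)

# Keep the first 50 principal components (50 is an arbitrary choice).
pca = PCA(n_components=50)
coeffs = pca.fit_transform(x_demo)   # coordinates of each row in the PCA basis

print(pca.mean_.shape)        # (784,)     the mean "image", flattened
print(pca.components_.shape)  # (50, 784)  one basis vector per row
print(coeffs.shape)           # (1000, 50)

# Fraction of the total variance captured by the kept components.
kept_variance = pca.explained_variance_ratio_.sum()
print(kept_variance)
```

On random noise the kept variance will be small, since no direction is special. On real digits we would expect a few components to carry much more of it, but that is something to check on `x_train` rather than assume.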

## Thinking about PCA as a pattern basis,
like we did with block transforms and image compression.
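
To make the analogy concrete, here is a minimal NumPy-only sketch. It assumes a random stand-in array `x_demo` in place of `x_train`, and computes PCA by hand with an SVD rather than with any of our utility functions. Each principal component is a 784-vector, so it can be reshaped into a $28\times28$ pattern, and an image is approximated as the mean plus a weighted sum of a few of those patterns, just like keeping a few block transform coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.random((500, 784))          # stand-in for x_train (N x 784), not MNIST

# PCA by hand: center the data, then take the SVD. The rows of vt are
# orthonormal basis vectors, ordered by decreasing variance.
mean = x_demo.mean(axis=0)
_, _, vt = np.linalg.svd(x_demo - mean, full_matrices=False)

k = 16                                   # number of patterns to keep (arbitrary)
basis = vt[:k]                           # (16, 784)
patterns = basis.reshape(k, 28, 28)      # each basis vector viewed as an image

# Tile the 16 patterns into a 4x4 grid (112 x 112), e.g. for plt.imshow(grid).
grid = patterns.reshape(4, 4, 28, 28).transpose(0, 2, 1, 3).reshape(4 * 28, 4 * 28)

# "Compress" one image: keep only k coefficients, then rebuild from them.
img = x_demo[0]
coeffs = basis @ (img - mean)            # analysis: k numbers instead of 784
approx = mean + coeffs @ basis           # synthesis: mean plus weighted patterns

print(np.linalg.norm(img - approx))      # reconstruction error with k patterns
```

The difference from the block transforms is where the patterns come from: there the basis was fixed ahead of time (Haar, DCT), while here it is learned from the data itself.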

In [None]:
H = make_klt_basis(np.reshape(x_train[1], (28,28)), size=28)

In [None]:
vis_blocks(H)
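
Note that `H` above is built from a single image, not from the whole training set. The internals of `make_klt_basis` aren't shown here, so the following is only a sketch of one common way to get a KLT basis from a single $28\times28$ image: treat its 28 rows as samples in a 28-D space and eigendecompose their covariance. The names `img`, `H_demo`, and `pattern_01` are made up for illustration, the image is a random stand-in, and the result is not guaranteed to match what our utility returns.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((28, 28))               # stand-in for one 28x28 digit, not MNIST

# Treat each of the 28 rows as one sample in a 28-D space.
centered = img - img.mean(axis=0)
cov = centered.T @ centered / (img.shape[0] - 1)   # 28 x 28 covariance matrix

# eigh returns eigenvalues in ascending order; reorder to largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
H_demo = eigvecs[:, order].T             # rows are orthonormal basis vectors

# A separable 2D pattern built from basis vectors 0 and 1 (outer product),
# in the same spirit as the 2D patterns of a block transform.
pattern_01 = np.outer(H_demo[0], H_demo[1])

print(H_demo.shape, pattern_01.shape)    # (28, 28) (28, 28)
```

The contrast with the PCA above is the sample set: this basis only knows about the rows of one image, while PCA on `x_train` sees all 60K images as 784-D points.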