# Discussion 03

## Principal Components

Welcome to Discussion 03. In this discussion, we'll gain a deeper understanding of principal components and how they are used for dimensionality reduction.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cluster

plt.rcParams['figure.figsize'] = (7,7)

Let's start by generating some data:

In [None]:
C = np.array([
    [3, -2],
    [-2, 3]
])

In [None]:
np.random.seed(42)
X = np.random.multivariate_normal([0,0], C, 300)

In [None]:
plt.scatter(*X.T)
plt.gca().set_aspect(1)

**Question 01**. What is the *top* eigenvector of the covariance matrix, $C$? What is the *second* eigenvector?

In [None]:
...
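
As a starting point for Question 01, here is a minimal sketch of one possible approach. It assumes we use `np.linalg.eigh`, which is designed for symmetric matrices and returns eigenvalues in ascending order. The names `eigvals`, `eigvecs`, `u1`, and `u2` are illustrative and not part of the original notebook.

```python
import numpy as np

C = np.array([
    [3, -2],
    [-2, 3]
])

# eigh returns eigenvalues in ascending order; the matching unit
# eigenvectors are the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(C)

u1 = eigvecs[:, -1]  # top eigenvector (largest eigenvalue)
u2 = eigvecs[:, 0]   # second eigenvector (smallest eigenvalue)

print(eigvals)
print(u1, u2)
```

Keep in mind that eigenvectors are only determined up to sign, so the signs you see may differ from a neighbor's.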

**Question 02**. Plot both eigenvectors on top of the data. Using color, distinguish the top eigenvector from the second eigenvector.

In [None]:
...

**Question 03**. For each data point $\vec x^{(i)}$, project it onto the first eigenvector, $\vec u^{(1)}$, to get a new vector: $(\vec x^{(i)} \cdot \vec u^{(1)}) \, \vec u^{(1)}$. Plot your new points on top of the previous plot.

In [None]:
...
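
The projection in Question 03 can be sketched as follows. This is one possible vectorized version, not the only way to do it. It regenerates the data so that it runs on its own, and the names `coeffs` and `X_proj` are illustrative.

```python
import numpy as np

C = np.array([[3, -2], [-2, 3]])
np.random.seed(42)
X = np.random.multivariate_normal([0, 0], C, 300)

# Top eigenvector of C (last column, since eigh sorts ascending).
_, eigvecs = np.linalg.eigh(C)
u1 = eigvecs[:, -1]

# Dot product of every data point with u1, all at once.
coeffs = X @ u1                 # shape (300,)

# Scale u1 by each coefficient to get the projected points.
X_proj = np.outer(coeffs, u1)   # shape (300, 2)

print(X_proj.shape)
```

Calling `plt.scatter(*X_proj.T)` after the earlier scatter plot should then place the projected points along the line spanned by $\vec u^{(1)}$.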

**Question 04**. Now let's create a new dataset, $Z$, in the following way. Given a vector $\vec x = (x_1, x_2)^T$, we produce a new representation $\vec z = (z_1, z_2)^T$, where $z_1 = \vec x \cdot \vec u^{(1)}$ and $z_2 = \vec x \cdot \vec u^{(2)}$. Plot $Z$ as a scatter plot.

In [None]:
...

You should notice that your new scatter plot is a rotated version of the original scatter plot of $X$.
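
To see why the new plot looks like a turned copy of the old one, here is a small sketch that builds $Z$ and checks that the change of basis preserves lengths. It assumes we stack the eigenvectors as the columns of a matrix `U` (an illustrative name). Depending on the signs of the eigenvectors, the transformation may be a rotation combined with a reflection.

```python
import numpy as np

C = np.array([[3, -2], [-2, 3]])
np.random.seed(42)
X = np.random.multivariate_normal([0, 0], C, 300)

_, eigvecs = np.linalg.eigh(C)
U = eigvecs[:, ::-1]  # columns: u1 (top eigenvector), then u2

# Row i of Z is (x_i . u1, x_i . u2).
Z = X @ U

# U has orthonormal columns, so it does not stretch or shrink anything.
print(np.allclose(U.T @ U, np.eye(2)))
print(np.allclose(np.linalg.norm(Z, axis=1), np.linalg.norm(X, axis=1)))
```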

**Question 05**. Compute the covariance matrix for $Z$. What do you notice about the off-diagonal entries? What does this mean, informally?

In [None]:
...
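
One way to check Question 05 is sketched below, using `np.cov` with `rowvar=False` so that each column is treated as a variable. The name `C_Z` is illustrative. Since the eigenvectors come from the true covariance $C$ and `C_Z` is estimated from 300 samples, we should expect the off-diagonal entries to be close to zero rather than exactly zero.

```python
import numpy as np

C = np.array([[3, -2], [-2, 3]])
np.random.seed(42)
X = np.random.multivariate_normal([0, 0], C, 300)

_, eigvecs = np.linalg.eigh(C)
U = eigvecs[:, ::-1]  # columns: u1 (top eigenvector), then u2
Z = X @ U

# Sample covariance of the new representation; columns are variables.
C_Z = np.cov(Z, rowvar=False)
print(C_Z)
```

Informally, small off-diagonal entries mean that $z_1$ and $z_2$ are (nearly) uncorrelated, while the diagonal entries show how much variance lies along each eigenvector.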

Now let's try this on a more interesting dataset. The cell below will download the MNIST digits.

In [None]:
%%bash
if [[ ! -e "mnist.npz" ]]; then
    wget 'https://f000.backblazeb2.com/file/jeldridge-data/mnist.npz'
fi

In [None]:
mnist = np.load('mnist.npz')

We'll use only the "training" split. PCA has no notion of training versus testing; it's all just "data".

In [None]:
X = mnist['train'].T
X.shape

There are 60,000 images, each a point in 784 dimensions (one dimension per pixel of a 28 x 28 image).

**Question 06**. Compute the covariance matrix. What should be its size? Check to make sure.

In [None]:
C = ...
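
As a sanity check on the expected size, here is a sketch on a small random stand-in for the MNIST matrix. `X_demo` and `C_demo` are hypothetical names, and the sketch assumes each row is one image, as in `X` above. With 784 features, the covariance matrix should be 784 x 784 no matter how many images there are.

```python
import numpy as np

# Hypothetical stand-in for X: 1000 random "images" with 784 pixels each.
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 784))

# rowvar=False tells np.cov that columns (pixels) are the variables.
C_demo = np.cov(X_demo, rowvar=False)

print(C_demo.shape)
```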

**Question 07**. Compute the eigenvectors of the covariance matrix.

In [None]:
...
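
A minimal sketch for Question 07, again on a random stand-in rather than the real MNIST data (`X_demo`, `C_demo`, and `order` are hypothetical names). It assumes we want the eigenvectors sorted from largest to smallest eigenvalue, so that the "top" eigenvector is column 0.

```python
import numpy as np

# Hypothetical stand-in for X: 500 random "images" with 784 pixels each.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 784))
C_demo = np.cov(X_demo, rowvar=False)

# eigh is appropriate because a covariance matrix is symmetric.
eigvals, eigvecs = np.linalg.eigh(C_demo)

# eigh sorts ascending; reorder so the largest eigenvalue comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

print(eigvecs.shape)
```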

**Question 08**. Each of the eigenvectors is a unit vector -- a direction in 784 dimensions. We can think of each eigenvector as a vector of "mixing coefficients" which creates a particular mixture of the features in the original image (i.e., the pixel intensities). That is, if $\vec u = (u_1, u_2, \ldots, u_{784})^T$, we can think of $u_1$ as the "mixing coefficient" of pixel 1, $u_2$ as the coefficient of pixel 2, and so forth.

This interpretation allows us to visualize the 784-dimensional eigenvectors as images by reshaping them into 28 x 28 arrays.

Visualize the top five eigenvectors of the covariance matrix as images. Then do the same for the *bottom* five eigenvectors.

In [None]:
...

In [None]:
...
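
The reshaping step in Question 08 can be sketched like this, once more on a random stand-in instead of the real digits. The names `X_demo`, `C_demo`, `top_images`, and `bottom_images` are hypothetical, and the sketch assumes the 784 pixels are stored row by row, so that a plain `reshape` recovers the 28 x 28 layout.

```python
import numpy as np

# Hypothetical stand-in for X: 500 random "images" with 784 pixels each.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 784))
C_demo = np.cov(X_demo, rowvar=False)

# Columns of eigvecs are eigenvectors, sorted by ascending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C_demo)

top5 = eigvecs[:, ::-1][:, :5]  # five largest eigenvalues, largest first
bottom5 = eigvecs[:, :5]        # five smallest eigenvalues

# Turn each 784-dimensional eigenvector into a 28 x 28 array.
top_images = top5.T.reshape(5, 28, 28)
bottom_images = bottom5.T.reshape(5, 28, 28)

print(top_images.shape, bottom_images.shape)
```

Each 28 x 28 array can then be displayed with something like `plt.imshow(top_images[0], cmap='gray')`. On random data the images will look like noise; the real digits should give more structured pictures.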