# Week 7: More PCA

## Goals:
- Build a 'PCA' class
- See what happens with distances after projecting

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Problem

Using the data set in `data/UN_IRE_data_smaller.csv`, perform PCA. Build your own class to interact with the data.
1. Write special methods such as `__init__` and `__repr__`.
2. Write a method to compute all of the principal components (return them in order).
3. Find a reasonable $k$ such that almost all of the total variability is captured in the first $k$ principal components. 
4. Project the data onto the first $k$ components.
5. Plot the projected data on the first two components. 
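
One possible shape for such a class is sketched below. It is a starting point under stated assumptions, not a reference solution: the method names `fit`, `choose_k`, and `project` and the 95% threshold are illustrative choices, the data is synthetic rather than read from the CSV, and features are assumed to sit in rows with samples in columns (a table loaded with pandas would typically need `.to_numpy().T` first).

```python
import numpy as np


class PCA:
    """Minimal PCA sketch; assumes features in rows and samples in columns."""

    def __init__(self, X):
        X = np.asarray(X, dtype=float)
        # Center each feature (row) before computing the covariance.
        self.mean = X.mean(axis=1, keepdims=True)
        self.X = X - self.mean
        self.eigenvalues = None
        self.components = None  # rows will hold the principal components

    def __repr__(self):
        m, n = self.X.shape
        return f"PCA(features={m}, samples={n})"

    def fit(self):
        """Compute all principal components, ordered by decreasing variance."""
        C = np.cov(self.X)              # np.cov treats rows as variables
        vals, vecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
        order = np.argsort(vals)[::-1]
        self.eigenvalues = vals[order]
        self.components = vecs[:, order].T
        return self.components

    def choose_k(self, threshold=0.95):
        """Smallest k whose components explain at least `threshold` of the variance."""
        ratios = np.cumsum(self.eigenvalues) / np.sum(self.eigenvalues)
        k = int(np.searchsorted(ratios, threshold)) + 1
        return min(k, len(ratios))

    def project(self, k):
        """Coordinates of the centered data in the first k components (k x n)."""
        return self.components[:k] @ self.X


# Synthetic stand-in data: 5 features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2, 200))
mixing = rng.normal(size=(5, 2))
X = mixing @ latent + 0.01 * rng.normal(size=(5, 200))

pca = PCA(X)
pca.fit()
k = pca.choose_k(0.95)
Y_k = pca.project(k)
print(pca, "k =", k, "projected shape =", Y_k.shape)
```

The rows `Y_k[0]` and `Y_k[1]` (when `k >= 2`) are the coordinates you would pass to `ax.scatter` for the two-component plot.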

## Distances

Recall that with PCA, we transform our data $X$ into a new basis given by the rows of $P$:
$$
	PX = Y
$$

The rows of $P$ are the eigenvectors of the covariance matrix $C_X$ of $X$, obtained by eigendecomposition. Thus, the covariance matrix of $Y$ is diagonal:
$$
	C_Y = \begin{pmatrix}
		\lambda_1 \\ 
		& \lambda_2 \\ 
		& & \ddots \\ 
		& & & \lambda_m
	\end{pmatrix}
$$

with $\lambda_1\geq \lambda_2 \geq \cdots \geq \lambda_m$.
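
A quick numerical check of this claim can be sketched as follows. The data here is synthetic (an assumption for illustration), with features in rows and samples in columns so that $PX = Y$ applies directly; the variable names are not from the course material.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in data: 4 correlated features (rows), 500 samples (columns).
A = rng.normal(size=(4, 4))
X = A @ rng.normal(size=(4, 500))
X = X - X.mean(axis=1, keepdims=True)  # center each feature

C_X = np.cov(X)
eigenvalues, E = np.linalg.eigh(C_X)   # ascending order, orthonormal columns
order = np.argsort(eigenvalues)[::-1]  # reorder so lambda_1 >= lambda_2 >= ...
eigenvalues, E = eigenvalues[order], E[:, order]

P = E.T                                # rows of P are the principal components
Y = P @ X
C_Y = np.cov(Y)

print(np.round(C_Y, 6))
print("diagonal with sorted eigenvalues:", np.allclose(C_Y, np.diag(eigenvalues)))
```

The off-diagonal entries of `C_Y` vanish up to floating-point error, and the diagonal holds the eigenvalues of `C_X` in decreasing order.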

If all but the first $k$ eigenvalues are $0$ (that is, $\lambda_{k+1} = \cdots = \lambda_m = 0$), then projecting onto the first $k$ principal components **preserves distances**.

Question: What happens when all but the first $k$ eigenvalues are just *close* to $0$?
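
One way to explore this numerically is sketched below, as an experiment rather than a full answer. Because $P$ is orthogonal, distances between columns of $Y$ equal those between columns of $X$; keeping only the first $k$ coordinates drops terms from each distance, so projected distances can only shrink. The helper names `project_top_k` and `distances_to`, the synthetic rank-$k$ data, and the noise level are all assumptions made for illustration.

```python
import numpy as np


def project_top_k(X, k):
    """Center X (features in rows) and return its coordinates in the top-k components."""
    Xc = X - X.mean(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(np.cov(Xc))
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:k]].T @ Xc


def distances_to(M, i):
    """Euclidean distance from column i of M to every column (including itself)."""
    return np.linalg.norm(M - M[:, [i]], axis=0)


rng = np.random.default_rng(2)
m, n, k = 6, 100, 2
basis = rng.normal(size=(m, k))
signal = basis @ rng.normal(size=(k, n))  # exactly rank k: trailing eigenvalues are 0

gaps = {}
for noise in (0.0, 0.01):
    # With noise > 0 the trailing eigenvalues are small but nonzero.
    X = signal + noise * rng.normal(size=(m, n))
    Y_k = project_top_k(X, k)
    gaps[noise] = distances_to(X, 0) - distances_to(Y_k, 0)
    print(f"noise={noise}: largest distance gap = {gaps[noise].max():.2e}")
```

Try increasing `noise` or lowering `k` and watch how the gap between original and projected distances grows.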

In [None]:
def get_distances(M, i):
	"""Return the distances from column i of M to every other column."""
	ncols = M.shape[1]
	return np.array(
		[np.linalg.norm(M[:, j] - M[:, i]) for j in range(ncols) if j != i]
	)

In [None]:
# Sanity check: squared distances between integer columns should be whole numbers (up to rounding)
M = np.random.randint(-3, 3, (3, 6))
i = 2
print(M)
print(get_distances(M, i)**2)

In [None]:
fig, ax = plt.subplots()
# ax.scatter(c='b', label='old')
# ax.scatter(c='orange', label='new')
ax.set_title('Comparing distances before and after PCA')
ax.grid()
ax.legend()