# Principal Component Analysis 
These notes follow Jon Shlens's "A Tutorial on Principal Component Analysis" (2003).
## 1) Change of Basis
"Is there another basis, which is a linear combination of the original basis, that best expresses our data set?" 
Let $\bf{X}$ and $\bf{Y}$ be $m\times n$ matrices related by an $m \times m$ linear transformation $\bf{P}$. $\bf{X}$ is the original recorded data set and $\bf{Y}$ is a re-representation of that data set:
$$
\begin{equation}
\bf{PX} = \bf{Y}
\end{equation}
$$
We define the following quantities:

* $\bf{p_i}$ are the rows of $\bf{P}$
* $\bf{x_i}$ are the columns of $\bf{X}$
* $\bf{y_i}$ are the columns of $\bf{Y}$

Then each $\bf{y_i}$ has the form 
$$\bf{y_i} =  \begin{bmatrix}
                \langle \bf{p_1, x_i}\rangle \\
                \vdots \\ 
                \langle \bf{p_m, x_i}\rangle \\
            \end{bmatrix}$$

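That is, the $j$th coefficient of $\bf{y_i}$ is the projection of $\bf{x_i}$ onto the $j$th row of $\bf{P}$.
In other words, the rows of $\bf{P}$, $\{\bf{p_1}, \dots, \bf{p_m} \}$, act as a new set of basis vectors, and each column of $\bf{Y}$ holds the coordinates of the corresponding column of $\bf{X}$ in that basis.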
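
To make the projection reading concrete, here is a minimal NumPy sketch. The sizes, the random data, and the names `Y_by_dots` and `entries_match` are illustrative assumptions, not part of the tutorial. It checks that each entry of $\bf{Y}$ is the dot product of a row of $\bf{P}$ with a column of $\bf{X}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                      # m measurement types, n samples (illustrative sizes)
X = rng.normal(size=(m, n))      # columns x_i are individual samples
P = rng.normal(size=(m, m))      # rows p_j are the new basis vectors

Y = P @ X                        # the change of basis, PX = Y

# Rebuild Y one entry at a time: Y[j, i] = <p_j, x_i>
Y_by_dots = np.array([[P[j] @ X[:, i] for i in range(n)] for j in range(m)])

entries_match = np.allclose(Y, Y_by_dots)
print(entries_match)
```

Here `P` is an arbitrary matrix; nothing yet makes it a good choice of basis.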

We now come to the question of choosing a good $\bf{P}$. 

## 2) Choosing P
We want a data set that has low levels of noise and low redundancy. 
### Noise 
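A common measure for noise is the signal-to-noise ratio (SNR), a ratio of variances:
$$
\begin{equation}
\text{SNR} = \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{noise}}}
\end{equation}
$$
A high SNR ($\gg 1$) indicates a clean measurement, while a low SNR indicates noisy data. How the two variances are actually measured is specific to the data and the measurement devices.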
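
As a small illustration of the ratio, the sketch below builds a synthetic recording where the signal and the noise are known separately. That is an assumption that rarely holds for real data, and the sine wave, the noise level, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 1000)
signal = np.sin(2 * np.pi * 5 * t)            # hypothetical clean signal
noise = rng.normal(scale=0.1, size=t.size)    # hypothetical measurement noise

# SNR as the ratio of the two sample variances
snr = signal.var(ddof=1) / noise.var(ddof=1)
print(f"SNR = {snr:.1f}")
```

With these settings the signal variance is far larger than the noise variance, so the SNR is well above 1.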

### Redundancy 
In short, we do not want dimensions that tell the same story, i.e. measurement types that are highly correlated with one another.

Both the SNR and the redundancy can be read off the covariance matrix.

## Covariance Matrix
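Consider two random variables $a, b$, each with mean $\mu = 0$. Their covariance is the expected product of their deviations from the mean:
$$
\sigma^2_{ab} = E[(a - \mu_a)(b - \mu_b)] = E[ab]
$$
Likewise, if $\bf{a}$ and $\bf{b}$ are row vectors that each hold $n$ zero-mean observations, the sample covariance is
$$
\sigma^2_{ab} \equiv \frac{1}{n-1}\bf{ab}^T 
$$

Then, for our data matrix $\bf{X}$, whose $m$ rows are the zero-mean measurement types, the covariance matrix $\bf{S_X}$ is 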
$$
\begin{equation}
\bf{S_X} \equiv \frac{1}{n-1}\bf{XX}^T
\end{equation}
$$
### Properties 
* $\bf{S_X}$ is a square symmetric $m \times m$ matrix.
* The diagonal terms of $\bf{S_X}$ are the variances of the individual measurement types.
* The off-diagonal terms are the covariances between pairs of measurement types.
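
The properties above can be checked numerically. The following is a sketch under assumed synthetic data (the sizes, the deliberate correlation between rows 0 and 1, and the variable names are illustrative). It centers each row, forms $\frac{1}{n-1}\bf{XX}^T$, and compares the result with NumPy's `np.cov`, which treats rows as variables by default.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 200                              # m measurement types, n samples
X = rng.normal(size=(m, n))
X[1] += 0.8 * X[0]                         # make row 1 partly redundant with row 0
X = X - X.mean(axis=1, keepdims=True)      # zero-mean rows, as the formula assumes

S_X = (X @ X.T) / (n - 1)                  # covariance matrix, m x m

is_symmetric = np.allclose(S_X, S_X.T)
matches_np_cov = np.allclose(S_X, np.cov(X))
diag_is_variance = np.allclose(np.diag(S_X), X.var(axis=1, ddof=1))

print(S_X.shape, is_symmetric, matches_np_cov, diag_is_variance)
```

The entry `S_X[0, 1]` is visibly non-zero here because row 1 was built from row 0, which is exactly the redundancy the off-diagonal terms measure.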

### Optimizing $\bf{S_X}$

Suppose we want to optimize certain properties of $\bf{S_X}$, such as redundancy and SNR. 
What might this look like? We will refer to the manipulated $\bf{S_X}$ as $\bf{S_Y}$. 

We begin with redundancy. 

#### Redundancy 
In the ideal case every off-diagonal term of $\bf{S_Y}$ is 0, meaning that no two measurement types co-vary. Removing redundancy therefore amounts to diagonalizing $\bf{S_Y}$.

To do this, PCA assumes that:

1. All the basis vectors, i.e. the rows of $\bf{P}$, are orthonormal, as motivated by the argument above.
2. The directions with the largest variances are the most important. This also suggests how one might build an orthonormal $\bf{P}$ step by step: take the direction of largest variance first, then repeatedly take the direction of largest variance among those perpendicular to all previously selected directions. Directions with small variance are selected last and treated as the least important.

## 3) Summary Of Assumptions and Limits

1. Linearity

2. Mean and variance are sufficient statistics 

3. Large variances have important dynamics 

4. The principal components are orthogonal

## 4) Solving PCA : Eigenvectors of Covariance 
We summarize our goal:

Find some orthonormal matrix $\bf{P}$ where $\bf{Y} = \bf{PX}$ such that $\bf{S_Y} \equiv 
\frac{1}{n-1}\bf{YY}^T$ is diagonalized. The rows of $\bf{P}$ are the principal components
of $\bf{X}$. 

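Right now we have two unknowns, $\bf{P}$ and $\bf{Y}$. Substituting $\bf{Y} = \bf{PX}$ leaves $\bf{P}$ as the only one:
$$
\begin{align*}
\bf{S_Y} &= \frac{1}{n-1}\bf{YY}^T \\
    &= \frac{1}{n-1}(\bf{PX})(\bf{PX})^T \\
    &= \frac{1}{n-1}\bf{PX}\bf{X}^T\bf{P}^T
\end{align*}
$$
We define $\bf{A} = \bf{XX}^T$, which is symmetric, so that $\bf{S_Y} = \frac{1}{n-1}\bf{PAP}^T$. 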

**Theorem.** A real matrix is symmetric if and only if it is orthogonally diagonalizable.
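
One way to use the theorem, following the standard argument: since $\bf{A} = \bf{XX}^T$ is symmetric, it can be written as $\bf{A} = \bf{EDE}^T$ with $\bf{E}$ orthogonal and $\bf{D}$ diagonal, and choosing $\bf{P} = \bf{E}^T$ gives $\bf{S_Y} = \frac{1}{n-1}\bf{D}$. The sketch below checks this numerically with `np.linalg.eigh`. The synthetic data, the `mixing` matrix, and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 500
# Hypothetical mixing matrix so that the rows of X are correlated
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.5, 0.5, 0.0],
                   [0.0, 0.3, 0.1]])
X = mixing @ rng.normal(size=(m, n))
X = X - X.mean(axis=1, keepdims=True)      # zero-mean rows

A = X @ X.T                                # symmetric m x m matrix
eigenvalues, E = np.linalg.eigh(A)         # columns of E are orthonormal eigenvectors

# eigh returns eigenvalues in ascending order; put the largest variance first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, E = eigenvalues[order], E[:, order]

P = E.T                                    # rows of P are the principal components
Y = P @ X
S_Y = (Y @ Y.T) / (n - 1)

off_diagonal = S_Y - np.diag(np.diag(S_Y))
is_diagonal = np.allclose(off_diagonal, 0.0, atol=1e-8)
P_is_orthonormal = np.allclose(P @ P.T, np.eye(m))
variances_match = np.allclose(np.diag(S_Y), eigenvalues / (n - 1))

print(is_diagonal, P_is_orthonormal, variances_match)
```

Sorting the eigenvalues in descending order is a design choice that matches the assumption that the directions of largest variance matter most; `eigh` is used rather than `eig` because it is intended for symmetric matrices.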
