# Week 8: Principal Component Analysis (PCA)

Assume we have feature scaled and mean normalized some data.  

We could project our data for dimensionality reduction on many planes/lines.   

PCA would tell you which line to choose.  

**Goal of PCA** Find direction (vector $u^{(1)} \in R^n$) onto which to project the data so as to minimize the projection error.  

This case shows 2 dim to 1 dim.  Generally: 
**Reduce n-dim to k-dim** by finding $k$ vectors $u^{(1)} \ldots u^{(k)}$ so as to minimize projection errors.

<img src="figures/fig18.png" width=300>

### Example:  $R^3 \to R^2$

Find 2 vectors so that the plane they define minimizes the projection distance to the actual data.

<img src="figures/fig19.png" width=300>

## PCA Is Not Linear Regression

**Lin Reg:** Find the straight line that minimizes the squared error between the line and the points, measured vertically (along $y$).  
- tries to use values of x to predict y.

**PCA:** Tries to minimize the orthogonal (shortest) distance between the points and the line.
- there is no special variable y I'm trying to predict. Just have multiple features treated equally.
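
To make the difference concrete, here is a minimal sketch on made-up 2-D data (the arrays `x` and `y` and the helper names are illustrative assumptions, not from the course). It compares the least-squares slope with the slope of the first principal direction, and measures both lines with both error definitions.

```python
import numpy as np

# Made-up data; both x and y already have zero mean
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-1.5, -1.0, 0.5, 0.0, 2.0])

# Linear regression (no intercept needed for zero-mean data):
# minimizes the vertical squared error
reg_slope = np.sum(x * y) / np.sum(x * x)

# PCA: the first principal direction minimizes the orthogonal squared error
X = np.column_stack([x, y])
U, _, _ = np.linalg.svd((X.T @ X) / len(x))
u1 = U[:, 0]
pca_slope = u1[1] / u1[0]   # the ratio is unaffected by the sign of u1

def vertical_error(slope):
    """Sum of squared vertical distances to the line y = slope * x."""
    return np.sum((y - slope * x) ** 2)

def orthogonal_error(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return np.sum((y - slope * x) ** 2) / (1 + slope ** 2)

print(reg_slope, pca_slope)
```

The two slopes differ: each line is the best one only under its own error measure.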

<img src="figures/fig20.png" width=500>

## Principal Component Analysis Algorithm

### Data Preprocessing

Train set: $x^{(i)}, \  i = 1 \ldots m$   

Now mean normalize:  

$\mu_j = \frac{1}{m} \sum_{i=1}^m x_j^{(i)}$  

$$x_j^{(i)} \leftarrow x_j^{(i)}-\mu_j$$

If the features have different scales, also divide by the standard deviation $s_j$ of feature $j$, so that the combined update from the raw data is:  

$$x_j^{(i)} \leftarrow \frac{x_j^{(i)}-\mu_j}{s_j}$$
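
A minimal NumPy sketch of this preprocessing step, assuming the data is stored as a matrix `X` with one example per row (the function name `normalize_features` and the sample values are made up for illustration):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and scale each column (feature) of X, shape (m, n)."""
    mu = X.mean(axis=0)            # per-feature mean mu_j
    s = X.std(axis=0)              # per-feature standard deviation s_j
    s = np.where(s == 0, 1.0, s)   # guard: leave constant features unscaled
    return (X - mu) / s, mu, s

# Made-up data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
X_norm, mu, s = normalize_features(X)
print(X_norm)
```

Returning `mu` and `s` alongside the normalized data makes it possible to apply the same transformation to new examples later.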

### Compute Covariance Matrix $\Sigma$

$$\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T \in R^{n\times n}$$

- $\Sigma$ is always symmetric positive semi-definite.
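
As a sketch, with examples stored as rows of a matrix `X` (an assumption about the data layout; the random data is made up), the sum of outer products collapses to a single matrix product. The symmetry and positive semi-definiteness can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # m = 50 examples, n = 3 features (made-up data)
X = X - X.mean(axis=0)           # mean-normalize first

m = X.shape[0]
Sigma = (X.T @ X) / m            # vectorized form, shape (n, n)

# Explicit sum of outer products, matching the formula term by term
Sigma_loop = sum(np.outer(x, x) for x in X) / m

is_symmetric = np.allclose(Sigma, Sigma.T)
eigenvalues = np.linalg.eigvalsh(Sigma)   # eigenvalues of a symmetric matrix
print(is_symmetric, eigenvalues)
```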

### Compute Singular Value Decomposition 

$$\Sigma = U S V^T$$  

The column vectors of $U$ are the principal directions you want for PCA. Just take the first $k$ of them. 
$$U = [u^{(1)}, u^{(2)}, \ldots, u^{(n)}] \in R^{n\times n}$$  

$$U_{reduce} = [u^{(1)}, u^{(2)}, \ldots, u^{(k)}] \in R^{n\times k}$$    
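
A minimal sketch of this step using `np.linalg.svd` on made-up correlated data. Note that NumPy returns the singular values as a 1-D array (sorted in descending order) rather than as a diagonal matrix, and returns $V^T$ rather than $V$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # made-up correlated data
X = X - X.mean(axis=0)
Sigma = (X.T @ X) / X.shape[0]

U, S, Vt = np.linalg.svd(Sigma)   # Sigma == U @ diag(S) @ Vt

k = 2
U_reduce = U[:, :k]               # first k columns of U, shape (n, k)
print(U_reduce.shape)
```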

### Compute z The Reduced Data Vector

Then form $z$, the reduced data vector, from $U_{reduce}$ and the data vector $x \in R^{n\times 1}$:  

$$z = U_{reduce}^T x \in R^{k \times 1}$$

**NOTE: Do not add the bias feature $x_0 = 1$. Here $x \in R^n$, not $R^{n+1}$.**
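
A small sketch of the projection on made-up 2-D data lying along the diagonal (the helper name `project` is an illustrative assumption). With examples as rows of `X`, all $z^{(i)}$ can be computed at once as `X @ U_reduce`. The sign of each column of $U$ is arbitrary, so $z$ is only determined up to sign.

```python
import numpy as np

def project(X, U_reduce):
    """Map rows of X (m, n) to rows of Z (m, k); row i equals U_reduce^T x_i."""
    return X @ U_reduce

# Made-up, zero-mean data with exactly n = 2 columns (no x_0 = 1 column)
X = np.array([[ 1.0,  1.0],
              [-1.0, -1.0],
              [ 2.0,  2.0],
              [-2.0, -2.0]])

U, S, _ = np.linalg.svd((X.T @ X) / X.shape[0])
U_reduce = U[:, :1]          # k = 1
Z = project(X, U_reduce)     # shape (4, 1)

z_first = U_reduce.T @ X[0]  # single-vector form: z = U_reduce^T x
print(Z.ravel())
```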

## Applying PCA: Reconstruction from Compressed Representation

Now that we have a reduced data vector, how can we go back to the original space?  

$$ z \in R^k \to x \in R^n, \ k<n$$ 

$$ z = U^T_{reduce} x $$

Fact is, we will end up with an **approximation** to the original data: 

$$ x \approx x_{approx} = U_{reduce} z $$

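**Note:** $U_{reduce}$ has orthonormal columns, because it consists of the first $k$ columns of the orthogonal matrix $U$, so $U_{reduce}^T U_{reduce} = I_k$. For $k<n$ it is not square, so $U_{reduce} U_{reduce}^T \neq I_n$, which is why the reconstruction is only approximate.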
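
A minimal sketch of compression followed by reconstruction, on made-up data with one low-variance feature. It also checks numerically that $U_{reduce}^T U_{reduce}$ is the $k \times k$ identity while $U_{reduce} U_{reduce}^T$ is not the $n \times n$ identity when $k < n$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])   # made-up data
X = X - X.mean(axis=0)
U, S, _ = np.linalg.svd((X.T @ X) / X.shape[0])

k = 2
U_reduce = U[:, :k]
Z = X @ U_reduce              # compress: row i is U_reduce^T x_i
X_approx = Z @ U_reduce.T     # reconstruct: row i is U_reduce z_i

cols_orthonormal = np.allclose(U_reduce.T @ U_reduce, np.eye(k))
is_identity_nn = np.allclose(U_reduce @ U_reduce.T, np.eye(3))

# Average squared projection error
mean_sq_error = np.mean(np.sum((X - X_approx) ** 2, axis=1))
print(cols_orthonormal, is_identity_nn, mean_sq_error)
```

Here the reconstruction error comes entirely from the discarded third direction.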

## Applying PCA: Choosing the Number of Principal Components $k$

**Average squared projection error:**

\begin{align}
 \frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}-x_{\text {approx }}^{(i)}\right\|^{2}
 \end{align}
 
**Total variation in data:** 
\begin{align} 
\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}
\end{align}

**Rule of thumb: 99% of variance is retained**

Choose $k$ = smallest value so that:

\begin{align}
 \frac{\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}-x_{\text {approx }}^{(i)}\right\|^{2}}{
\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}} \leq 0.01 \ \ (1\%)
\end{align}

Other common values:  
- 0.05 (95% retained)
- 0.10 (90% retained)

**Notes:** Turns out much real life data is highly correlated so you can reduce many of the features while retaining much of the variance.

## Implementing Choosing $k$

1. Try with $k=1$
2. Compute $U_{reduce}$, $z^{(1)}, \ldots, z^{(m)}$, and $x_{approx}^{(1)}, \ldots, x_{approx}^{(m)}$
3. Check if: 
    $$\begin{align}
 \frac{\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}-x_{\text {approx }}^{(i)}\right\|^{2}}{
\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}} \leq 0.01 \ \ (1\%)
\end{align}$$
4. If not, then increase $k$. Repeat until you reach threshold.
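
The procedure above can be sketched as follows. The function name and the made-up data are illustrative assumptions. The SVD is deliberately recomputed inside the loop to mirror the steps as written.

```python
import numpy as np

def choose_k_brute_force(X, threshold=0.01):
    """Smallest k whose relative projection error is <= threshold.

    X must be mean-normalized, shape (m, n), and not all zeros.
    """
    m, n = X.shape
    total_variation = np.mean(np.sum(X ** 2, axis=1))
    for k in range(1, n + 1):
        # Mirrors the procedure above: PCA is rerun for every candidate k
        U, _, _ = np.linalg.svd((X.T @ X) / m)
        U_reduce = U[:, :k]
        X_approx = (X @ U_reduce) @ U_reduce.T
        error = np.mean(np.sum((X - X_approx) ** 2, axis=1))
        if error / total_variation <= threshold:
            return k
    return n

# Made-up zero-mean data whose features have variances 100, 4 and 0.01
signs = np.array([[ 1,  1,  1],
                  [-1,  1, -1],
                  [ 1, -1, -1],
                  [-1, -1,  1]], dtype=float)
X = signs * np.array([10.0, 2.0, 0.1])

k = choose_k_brute_force(X)
print(k)
```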

**This is horribly inefficient**, because PCA is rerun for every candidate $k$.

You can do this instead ->  **MUCH MORE EFFICIENT!**

When calling SVD, you get $U, S, V$. 

Matrix $S$ is diagonal; it contains the singular values.  
These values give exactly the test quantity above:

\begin{align}
%
\frac{\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}-x_{\text {approx }}^{(i)}\right\|^{2}}{\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}} 
%
=
%
1 - \frac{ \sum_{i=1}^k S_{ii} }{ \sum_{i=1}^n S_{ii} } 
\leq 0.01 \ \ (1\%) 
%
\end{align}

**equivalently**

\begin{align}
 \frac{ \sum_{i=1}^k S_{ii} }{ \sum_{i=1}^n S_{ii} } 
\geq 0.99 \ \ (99\%)
\end{align}  
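
A short sketch of why this shortcut holds, assuming mean-normalized data and $\Sigma = \frac{1}{m}\sum_{i=1}^m x^{(i)} (x^{(i)})^T$. Because $\Sigma$ is symmetric positive semi-definite, the $S_{ii}$ from its SVD are its eigenvalues. Because $U_{reduce}$ has orthonormal columns, $\|x - U_{reduce}U_{reduce}^T x\|^2 = \|x\|^2 - \|U_{reduce}^T x\|^2$. Here $\operatorname{tr}$ denotes the trace, a notation introduced only for this sketch:

\begin{align}
\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}
&= \operatorname{tr}(\Sigma) = \sum_{i=1}^n S_{ii} \\
\frac{1}{m} \sum_{i=1}^{m}\left\|x^{(i)}-x_{\text{approx}}^{(i)}\right\|^{2}
&= \operatorname{tr}(\Sigma) - \operatorname{tr}\left(U_{reduce}^T \, \Sigma \, U_{reduce}\right)
= \sum_{i=1}^n S_{ii} - \sum_{i=1}^k S_{ii}
\end{align}

Dividing the second line by the first gives $1 - \frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}}$.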

Only need to call SVD once! With the loop above, you'd have to rerun PCA each time you update $k$.  

You should report the smallest value of $k$ that gives you what you need. 
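
A minimal sketch of the efficient version, assuming the singular values come back as a 1-D array sorted in descending order (as `np.linalg.svd` returns them). The function name and the made-up data are illustrative assumptions.

```python
import numpy as np

def choose_k_from_singular_values(S, retained=0.99):
    """Smallest k with sum(S[:k]) / sum(S) >= retained; S sorted descending."""
    ratios = np.cumsum(S) / np.sum(S)
    meets = np.nonzero(ratios >= retained)[0]
    # +1 converts the 0-based index into k; fall back to all components
    return int(meets[0]) + 1 if meets.size else len(S)

# Made-up zero-mean data whose features have variances 100, 4 and 0.01
signs = np.array([[ 1,  1,  1],
                  [-1,  1, -1],
                  [ 1, -1, -1],
                  [-1, -1,  1]], dtype=float)
X = signs * np.array([10.0, 2.0, 0.1])

_, S, _ = np.linalg.svd((X.T @ X) / X.shape[0])   # a single SVD call
k = choose_k_from_singular_values(S, retained=0.99)
print(S, k)
```

For this data the first component alone retains about 96% of the variance, so 99% needs $k = 2$ while 95% is met with $k = 1$.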

## Applying PCA: Advice for Applying PCA

Supervised learning speedup:  

Given original data set:  
${(x^{(i)}, y^{(i)})}^m_1$  

With $n = 10^4$ features per example, the learning algorithm is slow. Instead, reduce the dimension with PCA.  

Extract inputs:
- unlabeled dataset: $ x^{(i)} \in R^{10000} \ \rightarrow \ z^{(i)} \in R^{1000}$
- new training set: ${(z^{(i)}, y^{(i)})}^m_1$   

Feed the reduced training data set to the learning algo.  

For ex: we would train a hypothesis $h_\theta (z)$ using $z^{(i)}$:

$$x \ \rightarrow \ z \ \rightarrow \ h_\theta(z) = \frac{1}{1 + \exp(-\theta^T z)}$$

The PCA mapping should be defined by running PCA on the training set only:

$$PCA: x^{(i)} \ \to \ z^{(i)}$$

Think of it as "PCA learns the parameter $U_{reduce}$" (along with the normalization means and scales). Once these are fit on the original training set, reuse them unchanged for the cross-validation set $x^{(i)}_{cv}$ and the test set $x^{(i)}_{test}$.
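
A minimal sketch of this fit-once, reuse-everywhere pattern. The function names `fit_pca` and `transform`, the split sizes, and the random data are illustrative assumptions. For brevity only the mean is stored; a per-feature scale would be stored and reused in the same way.

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn the PCA parameters (mean and U_reduce) from the training set only."""
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    U, _, _ = np.linalg.svd((Xc.T @ Xc) / Xc.shape[0])
    return mu, U[:, :k]

def transform(X, mu, U_reduce):
    """Apply the already-fitted mapping x -> z to any data set."""
    return (X - mu) @ U_reduce

rng = np.random.default_rng(3)
X_all = rng.normal(size=(120, 5))                 # made-up data
X_train, X_cv, X_test = X_all[:80], X_all[80:100], X_all[100:]

mu, U_reduce = fit_pca(X_train, k=2)              # fit on the training set only
Z_train = transform(X_train, mu, U_reduce)
Z_cv = transform(X_cv, mu, U_reduce)              # reuse the same mu and U_reduce
Z_test = transform(X_test, mu, U_reduce)
print(Z_train.shape, Z_cv.shape, Z_test.shape)
```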


### Bad Use Of PCA: To Prevent Overfitting

Idea: use $z^{(i)}$ instead of $x^{(i)}$ to reduce num features to $k$ from $n$. Therefore fewer features = less likely to overfit.

This is bad because PCA throws away information without looking at the labels $y$! Regularization with a penalty term often works better, because the minimization still considers all of the parameters being fitted.

<img src="figures/fig21.png">

### Bad Use Of PCA: Design Of ML System

1. get train set: $(x^{(i)}, y^{(i)})$
2. run PCA reduce $x^{(i)} \to z^{(i)}$
3. train logistic regression on $(z^{(i)}, y^{(i)})$
4. test on test set:
    1. map $x^{(i)}_{test} \to  z^{(i)}_{test}$.
    2. run $h_\theta(z)$ on $(z^{(i)}_{test}, y^{(i)}_{test})$
    
INSTEAD: 
- always try the original (raw) data $x^{(i)}$ FIRST, before using PCA. I.e. cut out step 2 unless you have good reason to believe PCA is needed, for example because training on $x^{(i)}$ is too slow or uses too much memory.
- if you don't, you will end up spending time on things like choosing $k$ before knowing whether PCA helps at all.