**Motivation I: Data Compression**

- We may want to reduce the dimension of our features if we have a lot of redundant data.
- To do this, we find two highly correlated features, plot them, and fit a new line that describes both features accurately. We then project every data point onto this single line, so that one new feature replaces the two original ones.

- Dimensionality reduction reduces the amount of data we have to store in computer memory and speeds up our learning algorithm.


**Motivation II: Visualization**

- Data with more than three dimensions is hard to visualize. We can reduce the dimension of our data to 3 or fewer in order to plot it.

- We need to find new features, $z_1, z_2$ (and perhaps $z_3$), that can effectively summarize all the other features.

Example: hundreds of features related to a country's economic system may all be combined into one feature that you call "Economic Activity."

# PCA

## **Problem formulation**

Given two features, $x_1$ and $x_2$, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature.

The same can be done with three features, where we map them to a plane.

The goal of PCA is to minimize the average squared distance from every data point to the projection line. This quantity is the projection error.

Reduce from 2D to 1D: find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.

General case:

Reduce from $n$ dimensions to $k$ dimensions: find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ onto which to project the data so as to minimize the projection error.

If we are converting from 3d to 2d, we will project our data onto two directions (a plane), so $k$ will be 2.
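
To make the projection error concrete, here is a minimal NumPy sketch. The toy data and the helper name `projection_error` are assumptions made for illustration and are not part of the original notes. It measures the average squared distance from each mean-centered point to its projection onto a direction $u$.

```python
import numpy as np

def projection_error(X, u):
    """Average squared distance from the rows of X to the line spanned by u."""
    u = u / np.linalg.norm(u)          # make sure u is a unit vector
    X_proj = np.outer(X @ u, u)        # projection of every row onto the line
    return np.mean(np.sum((X - X_proj) ** 2, axis=1))

# Toy data (assumed for illustration): two strongly correlated features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2.0 * x1 + 0.1 * rng.normal(size=200)])
X = X - X.mean(axis=0)                 # work with mean-centered data

good = projection_error(X, np.array([1.0, 2.0]))   # along the correlation
bad = projection_error(X, np.array([2.0, -1.0]))   # perpendicular to it
```

In this sketch `good` should come out far smaller than `bad`, because the first direction follows the line the data is scattered around. PCA searches for the direction that makes this error as small as possible.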

## Principal Component Analysis Algorithm

Before we can apply PCA, we must preprocess the data. How much preprocessing is needed depends on whether PCA should effectively work with the covariance matrix or the correlation matrix.

“Covariance” indicates the direction of the linear relationship between variables. “Correlation”, on the other hand, measures both the strength and the direction of the linear relationship between two variables.

**Use the correlation matrix when the ranges and scales of the variables differ widely, and use the covariance matrix to preserve variance when the variables have similar ranges and scales or share the same units of measure.** Running PCA on features that have been scaled to unit variance amounts to using the correlation matrix.


**Data preprocessing**

- Given training set: $x^{(1)},x^{(2)},\dots,x^{(m)}$
- Preprocess (feature scaling/mean normalization):
    - Compute the mean of each feature: $\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_{j}^{(i)}$
    - Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$
- If different features are on different scales, also scale them to have a comparable range of values:
    - $x_j^{(i)} = \dfrac{x_j^{(i)} - \mu_j}{s_j}$, where $s_j$ measures the spread of feature $j$ (for example, its standard deviation or its range)
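
The preprocessing steps above can be sketched in NumPy as follows. The helper name `normalize_features` and the toy data are hypothetical, and the choice of the standard deviation for $s_j$ is an assumption (the range would work as well).

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and scale each column of X (m examples by n features)."""
    mu = X.mean(axis=0)                 # mu_j: mean of feature j
    s = X.std(axis=0)                   # s_j: here, the standard deviation
    s = np.where(s == 0, 1.0, s)        # avoid dividing by zero for constant features
    return (X - mu) / s, mu, s

# Toy data (assumed): two features on very different scales.
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0]])
X_norm, mu, s = normalize_features(X)
```

Returning `mu` and `s` alongside the normalized data makes it possible to apply exactly the same transformation to new examples later.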


1. **Compute Covariance Matrix**

$$\Sigma = \dfrac{1}{m}\sum^m_{i=1}(x^{(i)})(x^{(i)})^T$$

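- The covariance matrix is denoted by capital sigma, $\Sigma$; do not confuse it with the summation symbol $\sum$.
- Note that $x^{(i)}$ is an $n \times 1$ vector and $(x^{(i)})^T$ is a $1 \times n$ vector, so each product $x^{(i)}(x^{(i)})^T$ is an $n \times n$ matrix, which matches the dimensions of $\Sigma$. If $X$ is the $m \times n$ matrix that stores the examples row-wise, the same sum can be written as $\Sigma = \frac{1}{m}X^TX$.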
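
As a quick check of the dimensions, the sketch below computes the covariance matrix both as a sum of outer products and in vectorized form. The random toy data is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # toy data: m = 50 examples, n = 3 features
X = X - X.mean(axis=0)                  # mean normalization comes first
m, n = X.shape

# Sum of outer products x^(i) (x^(i))^T, exactly as in the formula.
Sigma_loop = sum(np.outer(x, x) for x in X) / m

# Equivalent vectorized form for row-wise stored examples.
Sigma = (X.T @ X) / m
```

Both versions produce the same $n \times n$ matrix; the vectorized form is the one you would normally use.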

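The fraction of variance retained by the first $k$ principal components is given by:
 $\displaystyle\frac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}}$
where $S$ is the diagonal matrix of singular values obtained from the SVD of the covariance matrix $\Sigma$.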


2. **Compute the Singular Value Decomposition (SVD)** of the mean-normalized data matrix $X$, which decomposes it into the matrix product of three simple matrices:
    - a rotation matrix $U$ (an $m \times m$ orthogonal matrix)
    - a scaling & projecting matrix $\Sigma$ (an $m \times n$ diagonal matrix of singular values; this is the usual SVD symbol and is not the covariance matrix from step 1)
    - and another rotation matrix $V^T$ (an $n \times n$ orthogonal matrix)
    - Extract the first two unit vectors that define the first two principal components from the rows of $V^T$ (the columns of $V$). These are the directions $u^{(1)}, u^{(2)}$ from the problem formulation.
    - Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to $k$ dimensions by projecting it onto the hyperplane defined by the first $k$ principal components.
        - Selecting this hyperplane ensures that the projection will preserve as much variance as possible.
    





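Each example $x \in \mathbb{R}^n$ is then mapped to its compressed representation $z \in \mathbb{R}^k$ by projecting it onto the first $k$ principal components:

$$z = \begin{bmatrix}\vert & \vert & & \vert \\ u^{(1)} & u^{(2)} & \dots & u^{(k)}\\ \vert & \vert & & \vert \end{bmatrix}^T x=\begin{bmatrix} \text{---} & (u^{(1)})^T & \text{---}\\\text{---} & (u^{(2)})^T & \text{---}\\ & \vdots & \\ \text{---} & (u^{(k)})^T & \text{---}\end{bmatrix}x$$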



```python
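import numpy as np

# X is assumed to be an (m, n) array with one example per row.
X_centered = X - X.mean(axis=0)        # mean normalization
U, s, Vt = np.linalg.svd(X_centered)   # rows of Vt are the principal components
c1 = Vt.T[:, 0]                        # first principal component
c2 = Vt.T[:, 1]                        # second principal component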
```

**Project the training set onto the plane defined by the first two principal components**

```python
W2 = Vt.T[:, :2]            # (n, 2) matrix whose columns are the first two components
X2D = X_centered.dot(W2)    # (m, 2) array of projected examples
```


## Reconstruction from Compressed Representation
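
Reconstruction maps a compressed point $z$ back to an approximation of the original $x$. Because the principal components are orthonormal, this can be done by multiplying by the untransposed component matrix: $x_{approx} = U_{reduce}\,z$, where $U_{reduce}$ (a name assumed here) is the $n \times k$ matrix with columns $u^{(1)}, \dots, u^{(k)}$. Below is a minimal NumPy sketch that follows the `Vt` and `W2` naming of the projection code; the toy data, which lies close to a plane, is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): 3 features that lie close to a 2D plane.
B = rng.normal(size=(100, 2))
X = B @ np.array([[1.0, 0.5, 0.2], [0.0, 1.0, -0.3]]) + 0.01 * rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt.T[:, :2]                 # first two principal components as columns
Z = X_centered @ W2              # compress: each row is z^(i)
X_approx = Z @ W2.T              # reconstruct: each row is x_approx^(i)

# Average squared projection error of the reconstruction.
error = np.mean(np.sum((X_centered - X_approx) ** 2, axis=1))
```

The reconstruction lives in the original $n$-dimensional space but lies exactly on the plane spanned by the retained components, so `error` is small only when the discarded directions carry little variance. To return to the uncentered data, add the feature means back to `X_approx`.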


## Choosing the Number of Principal Components

How do we choose k, also called the number of principal components? Recall that k is the dimension we are reducing to.

One way to choose k is by using the following formula:

- Given the average squared projection error: $\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2$

- Also given the total variation in the data: $\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2$

- Choose k to be the smallest value such that: $\dfrac{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2}{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2} \leq 0.01$


In other words, the average squared projection error divided by the total variation should be at most one percent, so that **99% of the variance is retained.**

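- Use SVD of the covariance matrix $\Sigma$ to obtain the diagonal matrix $S$, whose diagonal entries $S_{ii}$ are sorted in decreasing order.
- Check for 99% of retained variance using the $S$ matrix as follows:
    -  $\displaystyle\frac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}} \geq 0.99$
- This is equivalent to the projection-error criterion above but cheaper to evaluate, because the SVD is computed once and only the sum in the numerator changes as $k$ increases.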
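
The selection rule can be sketched as below. The helper name `choose_k` and the toy data are hypothetical; the sketch assumes the input is already mean-normalized and takes $S$ from the SVD of the covariance matrix.

```python
import numpy as np

def choose_k(X_centered, retained=0.99):
    """Smallest k whose leading components retain the requested share of variance."""
    m = X_centered.shape[0]
    Sigma = (X_centered.T @ X_centered) / m       # covariance matrix
    S = np.linalg.svd(Sigma, compute_uv=False)    # S_11 >= S_22 >= ... >= S_nn
    ratio = np.cumsum(S) / np.sum(S)              # variance retained for k = 1..n
    k = int(np.searchsorted(ratio, retained) + 1) # first k with ratio >= retained
    return min(k, len(S)), ratio

# Toy data (assumed): 5 features driven by 2 underlying factors plus small noise.
rng = np.random.default_rng(0)
B = rng.normal(size=(200, 2))
X = B @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
k, ratio = choose_k(X - X.mean(axis=0))
```

Since the toy data has only two underlying factors, `k` should not exceed 2 here. If you start from the singular values of the data matrix itself (as returned by `np.linalg.svd(X_centered)`), square them before forming the ratio.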
    
## Advice for Applying PCA


The most common use of PCA is to speed up supervised learning.

Given a training set with a large number of features (e.g. $x^{(1)},\dots,x^{(m)} \in \mathbb{R}^{10000}$ ) we can use PCA to reduce the number of features in each example of the training set (e.g. $z^{(1)},\dots,z^{(m)} \in \mathbb{R}^{1000}$).

Note that the mapping from $x^{(i)}$ to $z^{(i)}$ should be fit on the training set only, not on the cross-validation or test sets. Once it is defined on the training set, apply the same mapping (including the same means and scaling) to the cross-validation and test sets.
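
A minimal sketch of this fit-on-train, apply-everywhere pattern is shown below. The function names `fit_pca` and `apply_pca`, the variable `U_reduce`, and the random toy data are assumptions made for illustration.

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn the mean and the top-k principal directions from the training set only."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:k].T                    # U_reduce has shape (n, k)

def apply_pca(X, mu, U_reduce):
    """Map examples to z using parameters learned on the training set."""
    return (X - mu) @ U_reduce

rng = np.random.default_rng(0)
X_all = rng.normal(size=(120, 10))         # toy data (assumed)
X_train, X_test = X_all[:100], X_all[100:]

mu, U_reduce = fit_pca(X_train, k=3)
Z_train = apply_pca(X_train, mu, U_reduce)
Z_test = apply_pca(X_test, mu, U_reduce)   # reuses the training mu and U_reduce
```

Keeping the fit and apply steps separate makes it hard to accidentally recompute the mean or the components on held-out data.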

Applications:
- Compression
    - Reduce the space needed to store the data
    - Speed up the learning algorithm
- Visualization of data
    - Choose $k = 2$ or $k = 3$
    
Bad use of PCA: trying to prevent overfitting. We might think that reducing the number of features with PCA would be an effective way to address overfitting. It might work, but it is not recommended, because PCA does not consider the labels $y$ and may therefore discard information that is useful for prediction. Using just regularization will be at least as effective.

Don't assume you need to do PCA. Try your full machine learning algorithm without PCA first. Then use PCA if you find that you need it.

