# Principal Component Analysis #

## ***Vocabulary***

**Covariance Matrix**

# Lecture Notes #

## ***1.11.0 Introduction***

#### **Introduction**

PCA is possibly the most important technique for reducing dimensionality of data.

Introductory details:
- Dimensionality reduction technique
- PCA works for any target dimension $k$ (the number of vectors we keep)
- PCA looks at the data to find a new representation, meaning it is data-dependent

Preview of the PCA process:

<br>
<center>
    <img src="images/1.11.10.png" alt="Professor Notes" />
</center>
<br>

#### **High-level goal of PCA**

The goal of PCA is to find vectors $v_1, ..., v_k$ such that:

$$ \forall x \in S\;\;\;\; x\approx \sum_{j=1}^k a_jv_j$$

#### **A Note About Data Preprocessing**

For the dataset $S$, we must perform the following preprocessing:
- subtract the mean or center of mass from each data point (the mean will now be 0)
- normalize by the standard deviation of each feature (we want all features to be on a similar scale)
    - For every feature $i$, compute (on the centered data):
 $$ \sigma_i = \sqrt{\frac{1}{m}\sum_{j=1}^m(x_i^j)^2} $$
    - Then divide the $i^{th}$ feature of every data point by $\sigma_i$
- We want each feature to, as closely as possible, have mean 0 and standard deviation 1
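
The preprocessing steps above can be sketched in a few lines. This is a minimal illustration assuming NumPy and a data matrix with one data point per row; the function name `standardize` and the handling of constant features are choices made here, not part of the lecture.

```python
import numpy as np

def standardize(X):
    """Center each feature at 0 and scale it to unit standard deviation.

    X is an (m, n) array: m data points (rows), n features (columns).
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)          # subtract the per-feature mean
    # sigma_i = sqrt((1/m) * sum_j (x_i^j)^2), computed on the centered data
    sigma = np.sqrt((X_centered ** 2).mean(axis=0))
    sigma[sigma == 0] = 1.0                  # assumption: leave constant features unscaled
    return X_centered / sigma

# Toy data: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Z = standardize(X)
```

After this step every column of `Z` has mean 0 and standard deviation 1, so no single feature dominates the variance just because of its units.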

#### **How to Begin**
Find $v_1$, the unit vector whose line through the origin minimizes the average squared distance to the data points. So,

$$ \underset{v, \;||v||_2 \;= 1}{\min}\;\;\;\frac{1}{m}\sum_{j=1}^m(distance\;between\;x^j\;and\;v)^2 $$

<br>
<center>
    <img src="images/1.11.2.png" alt="Professor Notes" />
</center>
<br>

We want to find the direction such that, when we project all of the points orthogonally onto it, the sum of the squared distances from the points to their projections is minimal. The difference from linear regression is in the loss: linear regression measures the vertical distance from each point to the line, whereas here we measure the distance from each point to its orthogonal projection onto the line, and then take squares.

## ***1.11.1 Explaining PCA***

#### **Objective**

Recall our objective function which we wish to minimize:
$$ \underset{v, \;||v||_2 \;= 1}{\min}\;\;\;\frac{1}{m}\sum_{j=1}^m(distance\;between\;x^j\;and\;v)^2 $$

<br>
<center>
    <img src="images/1.11.3.png" alt="Professor Notes" />
</center>
<br>

Because of the Pythagorean theorem, as illustrated above, $||x||_2^2 = \langle x, v \rangle^2 + (distance\;between\;x\;and\;v)^2$. Since $||x||_2^2$ is fixed, minimizing the distance is equivalent to finding a $v$ that maximizes the length of the projection of $x$ onto $v$:

$$ \underset{v, \;||v||_2 \;= 1}{\max}\;\;\;\frac{1}{m}\sum_{j=1}^m\langle x^j, v \rangle ^2 $$

The maximizing $v$ is referred to as **the direction of maximal variance**. The objective value is equal to the sample variance of our data in the direction $v$ (recall that $var = \mathbf{E}[x^2] - (\mathbf{E}[x])^2$, where $(\mathbf{E}[x])^2 = 0$ since we subtracted out the mean from each point).
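
The equivalence between minimizing distance and maximizing projection can be checked numerically. The sketch below assumes NumPy and uses randomly generated data and a random unit vector purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)                 # center the data

v = rng.normal(size=3)
v = v / np.linalg.norm(v)              # unit vector

proj_len = X @ v                       # <x^j, v> for every point
residual = X - np.outer(proj_len, v)   # part of each x^j orthogonal to v
dist_sq = (residual ** 2).sum(axis=1)  # squared distance from x^j to the line
norm_sq = (X ** 2).sum(axis=1)         # ||x^j||^2

# Pythagoras: ||x||^2 = <x, v>^2 + dist^2 for every point
pythagoras_holds = np.allclose(norm_sq, proj_len ** 2 + dist_sq)

# Mean of squared projections equals the sample variance along v (the mean is 0)
variance_along_v = np.mean(proj_len ** 2)
```

Because `norm_sq` does not depend on `v`, any `v` that makes `dist_sq` smaller on average must make `proj_len ** 2` larger on average by the same amount.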

#### **Main Idea**

The main goal is to find the vector, or vectors, that retain the most variance from the dataset. In the example below, $v_2$ is the better vector because it preserves more of the variance of the points.

<br>
<center>
    <img src="images/1.11.4.png" alt="Professor Notes" />
</center>
<br>

***Question: That was a good example for 1 vector, but what about $k$ vectors/components?*** 

The optimization problem for this is:

$$ \underset{subspaces\;S\;of\;dimension\;k}{\max}\;\;\;\frac{1}{m}\sum_{j=1}^m (length\;of\;x^j\;projected\;onto\;S)^2 $$

A convenient basis for $S$ would be an orthonormal basis $v_1, ..., v_k$. This is because if we want to understand the length of some point $x$ projected onto $S$, and we have an orthonormal basis, then we can just project the point $x$ onto each vector in the basis and take the sum of the squares directly.

Assuming that we have an orthonormal basis:

$$ (distance\;from\;x^j\;to\;S)^2 = ||x^j||^2-(\langle x^j,v_1\rangle^2 + ... + \langle x^j,v_k\rangle^2) $$
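
The identity above can be verified on a small example. This sketch assumes NumPy; the subspace is a random one, and its orthonormal basis is obtained with a QR factorization, which is just a convenient way to produce orthonormal columns for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
# Orthonormal basis of a random k-dimensional subspace S (columns of V)
V, _ = np.linalg.qr(rng.normal(size=(n, k)))

x = rng.normal(size=n)
coeffs = V.T @ x                     # <x, v_i> for i = 1..k
x_proj = V @ coeffs                  # orthogonal projection of x onto S

proj_len_sq = np.sum(coeffs ** 2)    # sum of squared inner products
dist_sq = np.sum((x - x_proj) ** 2)  # squared distance from x to S
```

With an orthonormal basis, the squared length of the projection is just the sum of squared inner products, with no cross terms to account for.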

With all of this context...

#### **The Formal PCA Objective**

$$ \underset{v_1,...,v_k\;;\;orthonormal}{max}\;\;\;\frac{1}{m}\sum_{j=1}^m \sum_{i=1}^k\langle x^j,v_i\rangle ^2 $$

***Question: Let's assume we have $v_1, ..., v_k$... How do we express $x$ once we have these vectors?***

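We can find this by simply taking its projection onto each basis vector (this is an approximation unless $x$ lies in the span of $v_1, ..., v_k$): $$x \approx \langle x, v_1\rangle \cdot v_1 + \langle x, v_2\rangle \cdot v_2 + ... + \langle x, v_k\rangle \cdot v_k$$

As such, $x$ can be written as a vector in $\mathbf{R}^k$ corresponding to these projections: $(\langle x, v_1\rangle, ..., \langle x, v_k\rangle)$.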
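
A small sketch of this change of representation, assuming NumPy. The orthonormal vectors here are random stand-ins rather than actual principal components, and the helper names `encode` and `decode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 100, 4, 2
# Columns of V: an orthonormal basis of R^n (random, for illustration only)
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
X = rng.normal(size=(m, n))

def encode(X, V_k):
    """Row j of the result holds (<x^j, v_1>, ..., <x^j, v_k>)."""
    return X @ V_k

def decode(Z, V_k):
    """Map coordinates back to R^n: sum_i <x^j, v_i> v_i."""
    return Z @ V_k.T

Z = encode(X, V[:, :k])           # each point is now a vector in R^k
X_hat = decode(Z, V[:, :k])       # approximation of X using only k vectors
X_full = decode(encode(X, V), V)  # using all n vectors recovers X exactly
```

Keeping all $n$ vectors loses nothing; keeping only $k < n$ gives an approximation whose quality depends on how well the chosen vectors capture the data.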

## ***1.11.2 Applications***

#### **Applications of PCA**

**1. Understanding genomes - "Genes Mirror Geography Within Europe"**

Researchers took 1,400 people from Europe and represented each person by about 200,000 genetic markers in their genome. So, each person was represented purely by genetic material, and the number of features was $\approx$ 200,000.

This corresponds to a matrix of dimension 1400 x 200,000. The researchers ran PCA on this data to find vectors $v_1$ and $v_2$. So now, each person corresponds to two numbers. Then the researchers plotted the two numbers, and color coded each point according to the country of origin. The plot is below, and as you can see, it is almost exactly a map of Europe rotated 16 degrees.

<br>
<center>
    <img src="images/1.11.5.png" alt="Professor Notes" />
</center>
<br>

**2. Image Data Compression - "Eigenfaces"**

This is a strategy for compressing image data, where the images are of faces. Each data point is an image (vector of pixels), and each image has 65,000 pixels. So that means we have 65,000 features, with each feature being a pixel. 

Researchers ran PCA on this dataset with $k \approx 100 \;to\;150$. Now each image is approximately a linear combination of at most 150 vectors of length 65,000.

Below is one of the original images, which was a vector of length 65,000, next to its compressed version, which is represented by a vector of length $\approx$ 150.

<br>
<center>
    <img src="images/1.11.6.png" alt="Professor Notes" />
</center>
<br>

## ***1.11.3 The Big Question***

### **How do we find these vectors v1 through vk?**

Recall the optimization problem:

$$ \underset{v_1,...,v_k\;;\;orthonormal}{max}\;\;\;\frac{1}{m}\sum_{j=1}^m \sum_{i=1}^k\langle x^j,v_i\rangle ^2 $$

### **Setup**
- Let $X$ be an $m$ by $n$ matrix 
- $m$ is the number of points in our training set 
- $n$ is the dimension.
- $v$ denotes a column vector 
- $v^T$ denotes a row vector 
- $v^Tv$ is an inner product (scalar) 
- $vv^T$ is an outer product (matrix)

We will now look at $X^TX$, which is an $n$ by $n$ matrix. Multiplying this matrix by $\frac{1}{m}$ yields the **sample covariance matrix**, $\frac{1}{m}X^TX$ (assuming the data has been centered). The $(i,j)^{th}$ entry of $X^TX$ corresponds to "how similar is feature $i$ to feature $j$".
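
As a sanity check, the sketch below builds $\frac{1}{m}X^TX$ by hand and compares it with NumPy's `np.cov`. It assumes NumPy and random centered data; `rowvar=False` tells `np.cov` that columns are features, and `bias=True` makes it divide by $m$ instead of $m - 1$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 200, 3
X = rng.normal(size=(m, n))
X = X - X.mean(axis=0)       # centering is needed for X^T X / m to be a covariance

cov = (X.T @ X) / m          # n x n sample covariance matrix

# Same quantity via NumPy's built-in estimator
cov_numpy = np.cov(X, rowvar=False, bias=True)
```

The diagonal entries of `cov` are the per-feature variances, and the off-diagonal entry $(i, j)$ is the covariance between features $i$ and $j$.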

---

**Notes:**
- $X^TX$ is a symmetric matrix.
- All eigenvalues of symmetric matrices are real, and for matrices of the form $X^TX$ they are also $\ge 0$.
- For a matrix $A$, $v$ is an eigenvector if $A*v = \lambda*v$ for some $\lambda \in \mathbf{R}$, where $\lambda$ is an eigenvalue.
- An orthogonal matrix is a square matrix whose columns are orthonormal $\iff A^TA = I$. Therefore also $AA^T = I$.

**Spectral Theorem:** \
Every symmetric matrix $A$ has an eigendecomposition: $$A=Q*D*Q^T$$ Where $Q$ is an orthogonal matrix, and $D$ is a diagonal matrix. The entries of $D$ are the eigenvalues of $A$.
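
The decomposition can be computed numerically. This sketch assumes NumPy and an arbitrary symmetric matrix chosen for illustration; `np.linalg.eigh` is NumPy's routine for symmetric matrices and returns eigenvalues in ascending order with the eigenvectors as the columns of `Q`.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # any symmetric matrix works here

eigenvalues, Q = np.linalg.eigh(A)
D = np.diag(eigenvalues)

A_rebuilt = Q @ D @ Q.T           # should reproduce A
```

Each column `Q[:, i]` satisfies `A @ Q[:, i] == eigenvalues[i] * Q[:, i]` up to floating-point error.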

---

Let's try to compute $v_1$ now. Recall $X$ is the matrix corresponding to $S$, so $X$ is $m$ by $n$. If we multiply $X$ by a vector $v$, we get the following, where each $x_i$ is a row of $X$:

$$
X\cdot v = \begin{bmatrix}
  \langle x_{1}, v \rangle \\
  \vdots \\
  \langle x_{m}, v \rangle
\end{bmatrix}
$$


If we take the inner product with itself:

$$ (Xv)^T \cdot (Xv) = \sum_{i=1}^m \langle x_i, v \rangle ^ 2 $$

Which can also be written as:

$$ v^TX^TXv = (Xv)^T \cdot (Xv) = \sum_{i=1}^m \langle x_i, v \rangle ^ 2 $$

Notice the similarity to the PCA objective function.
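
The three expressions above can be compared directly. This is a small numerical check assuming NumPy, with random data standing in for $X$ and $v$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))
v = rng.normal(size=3)
v /= np.linalg.norm(v)

sum_of_squares = sum(np.dot(x_i, v) ** 2 for x_i in X)  # sum_i <x_i, v>^2
quadratic_form = v @ (X.T @ X) @ v                      # v^T X^T X v
norm_form = np.linalg.norm(X @ v) ** 2                  # (Xv)^T (Xv)
```

All three values agree, which is what lets us trade the sum over data points for a single quadratic form in the $n$ by $n$ matrix $X^TX$.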

Recall we are trying to find a $v$ that maximizes $\sum_{i=1}^m \langle x_i, v \rangle ^2$ (the factor $\frac{1}{m}$ does not change which $v$ is best). Equivalently (as shown above), we want to find a vector $v$ that maximizes $v^T(X^TX)v$. We will call $(X^TX)$ $A$. This is called "maximizing a quadratic form".

With this notation, our new goal is:

$$ \underset{v, \;||v||_2 \;= 1}{\max}\;\;\;v^TAv $$

---

Let's look at a simple case: $A$ is diagonal.

<br>
<center>
    <img src="images/1.11.7.png" alt="Professor Notes" />
</center>
<br>

***Question: Which $v$ should we choose?***

We should pick $v = (1, 0, ..., 0)$. Why? $v$ has to have norm 1, and $A$ is constructed such that $\lambda_1$ is the largest eigenvalue. So we should pick the $v$ that puts all of its weight on that largest eigenvalue.

Another way to view $v^TAv$:

$$ v^TAv = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix} \cdot
    \begin{bmatrix} 
        \lambda_1 \cdot v_1\\ 
        \vdots \\ 
        \lambda_n \cdot v_n 
    \end{bmatrix} 
 = \sum_{i=1}^n v_i^2 \cdot  \lambda_i
$$

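So whatever $v$ you choose, the value of the objective will be the sum of $v_i^2 \cdot \lambda_i$, which is a weighted average of the eigenvalues since the weights $v_i^2$ sum to 1.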
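
To see why $e_1$ is optimal in the diagonal case, one short argument (a sketch, assuming the eigenvalues are ordered $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_n$ and using $||v||_2 = 1$) is:

$$ v^TAv = \sum_{i=1}^n \lambda_i v_i^2 \;\le\; \lambda_1 \sum_{i=1}^n v_i^2 = \lambda_1 $$

Equality holds at $v = (1, 0, ..., 0)$, where the sum collapses to $\lambda_1$.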

---

**However, we don't know if $A$ is diagonal**, but we do know that $A = Q*DQ^T$ where $D$ *is* diagonal. So $A$ is *almost* diagonal. Thus, instead of picking a vector that has a 1 in the first component and 0 everywhere else, the correct thing to do is...

Let $e_1 = (1, 0, ..., 0, 0)$, the vector with a 1 in the first component and 0 everywhere else.

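Choose $v = Q*e_1$, which is the first column of $Q$. This is the vector that maximizes the objective function, and it is the top eigenvector of $A$ (assuming the eigenvalues in $D$ are ordered so that the largest comes first).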

### **Breakdown**

**Step 1: Covariance Matrix $A$**

We start with a covariance matrix $A$:

$$
A = \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}
$$

**Step 2: Compute Eigenvalues and Eigenvectors**

To find the eigenvalues ($\lambda$) and eigenvectors ($v$) of $A$, we solve the characteristic equation:

$$
\text{det}(A - \lambda I) = 0
$$

This results in the eigenvalues:

$$
\lambda_1 = 3.618 \quad \text{and} \quad \lambda_2 = 1.382
$$

Next, for each eigenvalue $\lambda_i$, we solve the equation:

$$
(A - \lambda_i I)v_i = 0
$$

For $\lambda_1 = 3.618$:

$$
(A - 3.618I)v_1 = 0 \implies \begin{pmatrix} -0.618 & 1 \\ 1 & -1.618 \end{pmatrix} v_1 = 0
$$

The solution yields the eigenvector:

$$
v_1 = \begin{pmatrix} 0.850 \\ 0.526 \end{pmatrix}
$$

For $\lambda_2 = 1.382$:

$$
(A - 1.382I)v_2 = 0 \implies \begin{pmatrix} 1.618 & 1 \\ 1 & 0.618 \end{pmatrix} v_2 = 0
$$

The solution yields the eigenvector:

$$
v_2 = \begin{pmatrix} -0.526 \\ 0.850 \end{pmatrix}
$$

**Step 3: Diagonalization of $A$**

Using the eigenvectors $v_1$ and $v_2$, we can form the matrix $Q$:

$$
Q = \begin{pmatrix} 0.850 & -0.526 \\ 0.526 & 0.850 \end{pmatrix}
$$

The diagonal matrix $\Lambda$ (which contains the eigenvalues) is:

$$
\Lambda = \begin{pmatrix} 3.618 & 0 \\ 0 & 1.382 \end{pmatrix}
$$

So, the diagonalization of $A$ is given by:

$$
A = Q \Lambda Q^T
$$

**Step 4: Maximize the Objective Function**

The objective function in PCA is to maximize the variance of the data projected onto $v$. This is done by maximizing $v^T A v$, where $v$ is a unit vector.

$$
v^T A v = v^T Q \Lambda Q^T v
$$

Substituting $u = Q^T v$ (where $u$ is also a unit vector, since $Q$ is orthogonal):

$$
v^T A v = u^T \Lambda u = \sum_{i=1}^n \lambda_i u_i^2
$$

To maximize $v^T A v$, we set $u = e_1 = (1, 0, \dots, 0)$ to align with the largest eigenvalue $\lambda_1$:

$$
v = Q e_1 = \begin{pmatrix} 0.850 \\ 0.526 \end{pmatrix}
$$

**Step 5: Interpretation**

Choosing $v = Q e_1$ means selecting the eigenvector $v_1$ corresponding to the largest eigenvalue $\lambda_1$. This vector $v_1$ is the direction that maximizes the variance of the data when projected onto it, which is the first principal component in PCA.
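
The worked example can be reproduced numerically. This is a sketch assuming NumPy; `np.linalg.eigh` returns eigenvalues in ascending order, so they are reordered here, and the sign flip is only a convention since eigenvectors are defined up to sign.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

eigenvalues, Q = np.linalg.eigh(A)       # ascending eigenvalues, eigenvectors in columns
order = np.argsort(eigenvalues)[::-1]    # reorder so the largest eigenvalue comes first
eigenvalues, Q = eigenvalues[order], Q[:, order]

v1 = Q[:, 0]                             # top eigenvector, i.e. Q @ e_1
if v1[0] < 0:                            # fix a sign convention for display
    v1 = -v1

print(np.round(eigenvalues, 3))          # [3.618 1.382]
print(np.round(v1, 3))                   # [0.851 0.526]
```

The first component of `v1` is 0.8507 to four decimal places, so it rounds to 0.851 where the notes above write 0.850.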


## ***1.11.4 Recap***

### **Example Optimization Problem**

Our optimization problem is:

$$ \underset{v, \;||v||_2 \;= 1}{\max}\;\;\;v^TAv $$

Where $A$ corresponds to a covariance matrix $X^TX$ (some people use $\frac{1}{m}X^TX$, but we will not; the scaling does not change the maximizing $v$).

And the "easy case" was when $A$ was a diagonal matrix. For example, let 

$$ A = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} $$

<br>
<center>
    <img src="images/1.11.8.png" alt="Professor Notes" />
</center>
<br>

The visualization shows how the chosen vector will capture the axis that keeps the most information about the data.

### **Example Solution to Optimization Problem**

The solution to

$$ \underset{v, \;||v||_2 \;= 1}{\max}\;\;\; v^T\begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}v $$

is $v = (1, 0)$, which corresponds precisely to the stretching on the horizontal axis shown above. This is the direction of maximum variance.
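
As a quick check of this solution, every unit vector in the plane can be written as $v = (\cos t, \sin t)$ (the angle $t$ is notation introduced here for the check):

$$ v^TAv = 3\cos^2 t + \sin^2 t = 1 + 2\cos^2 t \le 3 $$

The maximum value 3 is attained when $\cos^2 t = 1$, that is, at $v = \pm(1, 0)$.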

### **Rotation Matrices**

A rotation matrix is an orthogonal matrix that rotates your coordinate axes. 

This matrix will rotate the axes $\theta$ degrees counterclockwise:

$$ \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} $$

This matrix will rotate the axes $\theta$ degrees clockwise:

$$ \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} $$
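
A small sketch of these two matrices, assuming NumPy. The helper name `rotation` is hypothetical; it builds the counterclockwise matrix, and the clockwise one is obtained by negating the angle, which is the same as taking the transpose.

```python
import numpy as np

def rotation(theta_degrees):
    """2x2 matrix that rotates vectors counterclockwise by theta_degrees."""
    t = np.deg2rad(theta_degrees)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

R = rotation(90)
e1 = np.array([1.0, 0.0])
rotated = R @ e1                  # approximately (0, 1): the x direction moves to the y direction

clockwise_30 = rotation(-30)      # same matrix as rotation(30).T
```

Since `R.T @ R` is the identity, rotations are orthogonal matrices, and undoing a rotation is just multiplying by the transpose.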

### **Covariance Matrix Rotation**

For any matrix of the form $X^TX$ (a covariance matrix), you can write that covariance matrix as a diagonal matrix sandwiched between a rotation matrix and its transpose.

For example:

$$ \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \cdot \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}$$

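Recall that $\frac{1}{\sqrt{2}}$ is both the $\cos$ and the $\sin$ of 45 degrees. So, reading the product from right to left, this is a rotation 45 degrees clockwise, then a stretch by 3 along the horizontal axis, then a rotation 45 degrees counterclockwise.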
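
The factorization in this example can be multiplied out to confirm it. This sketch assumes NumPy and uses the same matrices as the example above.

```python
import numpy as np

s = 1 / np.sqrt(2)
Q = np.array([[s, -s],
              [s,  s]])          # rotation by 45 degrees counterclockwise
D = np.diag([3.0, 1.0])          # stretch by 3 horizontally, by 1 vertically

A = Q @ D @ Q.T                  # rotate clockwise, scale, rotate back
top = Q[:, 0]                    # first column of Q: the direction (1, 1) / sqrt(2)
```

The product comes out to the matrix with 2 on the diagonal and 1 off the diagonal, and `top` is its eigenvector for the eigenvalue 3, so the direction of maximum variance is the diagonal line $y = x$.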

<br>
<center>
    <img src="images/1.11.9.png" alt="Professor Notes" />
</center>
<br>

#### **Online Resources**

Online matrix calculators, such as BlueBit, can be used for matrix operations including eigenvector decomposition.

## ***1.11.5 Spectral Theorem***

#### **Spectral Theorem Definition**

Any symmetric matrix can be written: $$A=Q*D*Q^T$$ Where $Q$ is an orthogonal matrix, and $D$ is a diagonal matrix with real values on the diagonal. The entries of $D$ are the eigenvalues of $A$.

Furthermore, if $A = X^TX$, then all eigenvalues are $\ge 0$.

#### **Proof**

**Claim 1: For any $v$, $v^TAv \ge 0$.**

This follows from $A = X^TX$: $v^TAv = (Xv)^T \cdot Xv = ||Xv||_2^2 \ge 0$.

**Claim 2: $A$ cannot have negative eigenvalues**

Let's assume for contradiction that $\lambda_i \lt 0$ (the $i^{th}$ eigenvalue is negative).

Rewriting $A$ by the Spectral Theorem, $A = Q*D*Q^T$. Let's now consider the vector $Q*e_i$, where $e_i = (0, 0, ..., 1, 0, ..., 0)$ and the 1 is in the $i^{th}$ position in the vector. We now call $v = Q*e_i$.

Consider now $v^TAv$. By substitution this is equal to $e_i^TQ^TQDQ^TQe_i$. Since $Q^TQ = I$, we are left with $e_i^TDe_i$. $D*e_i$ is a vector of all 0's except for $\lambda_i$ in the $i^{th}$ position, and we assumed that $\lambda_i$ is negative. Thus, $e_i^TDe_i = \lambda_i \lt 0$.

However, this contradicts Claim 1, so the assumption is false and $A$ has no negative eigenvalues.
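
Both claims can be observed numerically. This is an illustrative check, not a proof; it assumes NumPy and a random matrix $X$, and uses `np.linalg.eigvalsh`, which returns the eigenvalues of a symmetric matrix in ascending order.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 4))
A = X.T @ X

# Claim 1: v^T A v = ||Xv||^2 >= 0 for many random vectors v (one per row of vs)
vs = rng.normal(size=(1000, 4))
quad_forms = np.sum((vs @ A) * vs, axis=1)

# Claim 2: no eigenvalue of A is negative (up to floating-point error)
eigenvalues = np.linalg.eigvalsh(A)
```

In floating point a zero eigenvalue can show up as a tiny negative number, so comparisons against 0 should allow a small tolerance.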

---

This theorem and properties are why we can view or interpret covariance matrices geometrically, as rotations and scalings.

## ***1.11.6 Eigenvector Decomposition***

To recap the process of PCA:

<br>
<center>
    <img src="images/1.11.10.png" alt="Professor Notes" />
</center>
<br>

Proving that the $i^{th}$ row of $Q^T$ (equivalently, the $i^{th}$ column of $Q$) is an eigenvector of $A$:

<br>
<center>
    <img src="images/1.11.11.png" alt="Professor Notes" />
</center>
<br>

***Question: What algorithm should we use to compute this decomposition?***

This is where the **"singular value decomposition"** (SVD) of the matrix comes in. If this is computed, then you also get the eigenvalue/eigenvector decomposition of the matrix. There exist polynomial-time algorithms for computing the SVD, but they are expensive, and we would like to avoid them for large datasets. So instead, the method used in practice is the power method. (The lecture cuts off abruptly here.)
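
Since the lecture ends before describing the power method, here is a minimal sketch of the standard textbook version (repeatedly multiply by $A$ and renormalize), assuming NumPy. The function name, iteration count, and tolerance are choices made here, not something taken from the lecture.

```python
import numpy as np

def power_method(A, num_iters=1000, tol=1e-12, seed=0):
    """Approximate the top eigenvalue and eigenvector of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)               # random unit-length starting vector
    for _ in range(num_iters):
        w = A @ v                        # multiplying by A amplifies the top direction
        w_norm = np.linalg.norm(w)
        if w_norm == 0:                  # v is in the null space of A; nothing to amplify
            break
        w /= w_norm
        converged = np.linalg.norm(w - v) < tol
        v = w
        if converged:
            break
    return v @ A @ v, v                  # estimated eigenvalue (v^T A v) and eigenvector

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
top_eigenvalue, top_vector = power_method(A)
```

On this matrix the estimate converges to the top eigenvalue $\approx 3.618$ found in the breakdown above. Each iteration costs only one matrix-vector product, which is why this style of method is attractive when a full decomposition is too expensive.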

The next lecture covers SVD.

# Personal Notes #

none yet