# An Introduction to Principal Component Analysis (PCA)

![Nirenberg’s handwritten genetic code chart, 1965.](nirenberg-handwritten-1965.png)
*First Summary of Genetic Code by Marshall W. Nirenberg. 18 January 1965. Link: https://profiles.nlm.nih.gov/spotlight/jj/catalog/nlm:nlmuid-101584910X475-img*

Although PCA was first described in 1901 by Karl Pearson, this 1965 table serves as an illustrative example of the types of data that brought forth PCA.

## Before PCA

Biology and the social sciences dominated research at the turn of the 20th century. Researchers were collecting a great many measurements, such as height, weight, and limb lengths, in an attempt to understand and link variables.

Handling these measurements and identifying correlations among them involved brute force and intuition. There were simply too many measurements for each subject. Working with these datasets (summarizing, visualizing, and drawing meaningful conclusions) was cumbersome and difficult.

## A Little Bit of This and a Little Bit of That

Not only were biologists recording every measurement they could get, they were also deeply interested in how these measurements correlated with one another, and in whether a measured variable could be explained by a combination of other variables.

PCA is a systematic way of identifying which combinations of variables best describe what we see.

## The Algorithm

Suppose we have some data points:
```
Measurements of the Species Gizmus Widgetmus

Leg Length: [1, 2, 3]
Hair Color (Scale 1-10 from black to white): [2, 3, 5]
Hair Length: [3, 5, 4]
Age: [4, 4, 6]
```
We want to know which measurements, or combinations of measurements, account for the most variation in the data. Here are the steps:

Step 1. Center the Data / Normalization

Step 2. Compute Covariance

Step 3. Compute the Eigenvalues and Eigenvectors (Solve det(C - λI) = 0)

Step 4. Build the Equation for Each Component


In [72]:
import numpy as np
from rich.console import Console
console = Console()

"""
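Let's center some data.
Each row is one measurement (feature); each column is one individual (sample).
"""
data = np.array([
    [1, 2, 3],  # Leg Length
    [2, 3, 5],  # Hair Color
    [3, 5, 4],  # Hair Length
    [4, 4, 6],  # Age
])

# Subtract each measurement's own mean, i.e. average along each row (axis=1).
# axis=0 would instead average the four different measurements of one individual.
avg = np.mean(data, axis=1, keepdims=True)
centered_data = data - avg

console.print(centered_data)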

In [73]:
"""
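Compute the Covariance

Cov(X, Y) = Σ (Xi - Xmean) * (Yi - Ymean) / (n - 1)

"""
# np.cov treats each row as one variable (rowvar=True by default) and
# returns the 4 x 4 matrix of covariances between the measurements.
covariance = np.cov(centered_data)
console.print(covariance)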
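
As a small worked instance of the covariance formula above, the sketch below evaluates it by hand for two of the measurements. The helper name `sample_covariance` and the variable names are assumptions made for illustration; the notebook itself relies on `np.cov`.

```python
def sample_covariance(x, y):
    """Cov(X, Y) = sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


leg_length = [1, 2, 3]
hair_color = [2, 3, 5]

# Off-diagonal entry of the covariance matrix: how leg length and hair color vary together.
cov_leg_hair = sample_covariance(leg_length, hair_color)

# Diagonal entry: the covariance of a variable with itself is its variance.
var_leg = sample_covariance(leg_length, leg_length)

print(cov_leg_hair, var_leg)
```

Up to floating-point round-off, the two values are 1.5 and 1.0, which correspond to one off-diagonal and one diagonal entry of the covariance matrix of the uncentered measurements.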

In [74]:
"""
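Compute the eigenvalues and eigenvectors by solving det(C - λI) = 0
"""
# The covariance matrix is symmetric, so eigh applies. It returns real
# eigenvalues in ascending order, with eigenvector k stored in column k.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
console.print("Eigenvalues:", eigenvalues)
console.print("Eigenvectors:", eigenvectors)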
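
One way to sanity-check this step is to confirm that every eigenvalue/eigenvector pair really satisfies C v = λ v. The sketch below is a minimal check under the assumption that each row of the data is one measurement; the names `cov`, `pairs_ok`, and `trace_ok` are illustrative and not part of the notebook.

```python
import numpy as np

# Same toy data as above: rows are measurements, columns are individuals.
data = np.array([[1, 2, 3], [2, 3, 5], [3, 5, 4], [4, 4, 6]], dtype=float)
cov = np.cov(data)  # np.cov subtracts each row's mean internally

# eigh is meant for symmetric matrices such as a covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Column k of `eigenvectors` pairs with eigenvalues[k]; check C v = lambda v for each pair.
pairs_ok = all(
    np.allclose(cov @ eigenvectors[:, k], eigenvalues[k] * eigenvectors[:, k])
    for k in range(len(eigenvalues))
)

# The eigenvalues sum to the total variance, i.e. the trace of C.
trace_ok = np.isclose(eigenvalues.sum(), np.trace(cov))

print(pairs_ok, trace_ok)
```

Note that the eigenvectors are read column by column; reading the matrix row by row would mix entries from different components.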

In [75]:
"""
Interpreting Results
"""
# Sort the components from largest to smallest eigenvalue.
idx = np.argsort(eigenvalues)[::-1]
# Covariance eigenvalues cannot be negative; clip tiny negative round-off to zero.
eigenvalues = np.clip(eigenvalues[idx], 0.0, None)
eigenvectors = eigenvectors[:, idx]

total_variance = sum(eigenvalues)
console_out = "Here's how much of the variance in the data each principal component explains:\n"
for i, eigenval in enumerate(eigenvalues):
    console_out += f"Component {i+1}: {(eigenval / total_variance) * 100:.1f} %\n"
console.print(console_out)

console_out = "" # reusing string variable
console_out += "But what do the principal components represent? The vectors point in the directions of most variance.\n"
console_out += f"Eigenvalues: [{', '.join(f'{eig:.2f}' for eig in eigenvalues)}]\n"
# Eigenvectors are the COLUMNS of the matrix, so iterate over its transpose.
for i, eigenvect in enumerate(eigenvectors.T):
    console_out += f"Vector {i+1}: {eigenvect}\n"
console.print(console_out)
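
To see why an eigenvalue is read as "variance explained", the sketch below evaluates each component for every individual and measures how much those values vary. It is a hedged illustration that assumes rows are measurements and columns are individuals; `centered`, `order`, `scores`, and `score_variances` are names introduced here, not taken from the notebook.

```python
import numpy as np

data = np.array([[1, 2, 3], [2, 3, 5], [3, 5, 4], [4, 4, 6]], dtype=float)
centered = data - data.mean(axis=1, keepdims=True)  # center each measurement (row)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data))
order = np.argsort(eigenvalues)[::-1]               # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Row k of `scores` holds component k evaluated for each of the three individuals.
scores = eigenvectors.T @ centered

# Sample variance (n - 1 in the denominator) of each component's values.
score_variances = scores.var(axis=1, ddof=1)

print(np.allclose(score_variances, eigenvalues))
```

The variance of each component's values matches its eigenvalue, which is why dividing an eigenvalue by the sum of all eigenvalues gives that component's share of the total variance.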

## What Does This Mean?

(INTERPRETATION SHOULD BE CITED AND ATTRIBUTED TO JOLLIFFE)

Used this way, PCA is a way of discovering which measurements and features account for the most variance in the data. It picks up correlations automatically, because they are encoded in the covariance matrix, without anyone having to inspect pairs of measurements one by one.

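With each measurement centered on its own mean across the three individuals, two components carry all of the variance: Component 1 and Component 2. Both are mixes of the four measurements. The overall sign of each component is arbitrary, because an eigenvector can be flipped without changing the direction it describes.

Most of the variance in the data lies along a single direction in which all four measurements rise together, weighted most heavily toward hair color and age: longer-legged individuals tend to be older and to have lighter hair.

$$ \text{Component 1 } (\approx 81\%): \quad PC1 \approx 0.46 \cdot \text{Leg Length} + 0.71 \cdot \text{Hair Color} + 0.16 \cdot \text{Hair Length} + 0.50 \cdot \text{Age} $$

The rest of the variance is explained mostly by hair length set against age. This means a smaller share of the variance comes from this: along this direction, longer hair goes with lower age, and shorter hair with higher age.

$$ \text{Component 2 } (\approx 19\%): \quad PC2 \approx -0.16 \cdot \text{Leg Length} + 0.03 \cdot \text{Hair Color} - 0.91 \cdot \text{Hair Length} + 0.39 \cdot \text{Age} $$

That's it. No other combination of features explains any additional variance in the data.

$$ \text{Component 3 } (0\%): \quad \text{Irrelevant} $$

$$ \text{Component 4 } (0\%): \quad \text{Irrelevant} $$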

The maximum number of relevant PCs we can get using PCA is the smaller of the number of features and the number of samples minus 1. We had 4 features and 3 samples, so at most min(4, 3 - 1) = 2 components can carry variance (CITE THIS IT COMES FROM LEMMA: RANK OF COVARIANCE MATRIX). So having 2 PCs checks out.
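
The count can also be checked numerically. The sketch below is a minimal check, assuming the same layout as before (rows are measurements, columns are individuals); the `1e-10` threshold for treating an eigenvalue as nonzero is an arbitrary choice made for this example.

```python
import numpy as np

data = np.array([[1, 2, 3], [2, 3, 5], [3, 5, 4], [4, 4, 6]], dtype=float)
n_features, n_samples = data.shape

cov = np.cov(data)

# Rank of the covariance matrix = number of directions that carry variance.
rank = np.linalg.matrix_rank(cov)

# Count eigenvalues that are meaningfully above floating-point round-off.
eigenvalues = np.linalg.eigvalsh(cov)
n_nonzero = int(np.sum(eigenvalues > 1e-10))

print(rank, n_nonzero, min(n_features, n_samples - 1))
```

All three numbers agree for this dataset: centering removes one degree of freedom from the three samples, so only two directions are left to carry variance.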