# An Introduction to Principal Component Analysis (PCA)

![Nirenberg’s handwritten genetic code chart, 1965.](nirenberg-handwritten-1965.png)
*First Summary of Genetic Code by Marshall W. Nirenberg. 18 January 1965. Link: https://profiles.nlm.nih.gov/spotlight/jj/catalog/nlm:nlmuid-101584910X475-img*

Although PCA was first described by Karl Pearson in 1901, this 1965 table illustrates the kind of data that motivated it.

## Before PCA

Biology and the social sciences dominated research at the turn of the 20th century. Researchers were collecting a great many measurements, such as height, weight, and limb lengths, in an attempt to understand and link variables.

Handling these measurements and identifying correlations among them involved brute force and intuition. There were simply too many measurements for each subject. Working with these datasets (summarizing, visualizing, and drawing meaningful conclusions) was cumbersome and difficult.

## A Little Bit of This and a Little Bit of That

Biologists were not only recording every measurement they could get; they were also deeply interested in how these measurements correlated with one another, and in whether a measured variable could be explained by a combination of other variables.

PCA is a systematic way of identifying which combinations of variables best describe what we see.

## The Algorithm

Suppose we have some data points:
```
Measurements of the Species Gizmus Widgetmus
[height, width, length]
[1, 2, 3]
[2, 3, 5]
[3, 5, 4]
[4, 4, 6]
```
We want to know which measurements, or combinations of measurements, account for most of the variation in the data.
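
As a quick first look at the table above, the sample variance of each measurement can be computed on its own. This is a minimal sketch, assuming NumPy is available; the names `data` and `variances` are illustrative and not part of the original notebook.

```python
import numpy as np

# Rows are samples; columns are [height, width, length] from the table above.
data = np.array([
    [1, 2, 3],
    [2, 3, 5],
    [3, 5, 4],
    [4, 4, 6],
], dtype=float)

# Sample variance of each measurement (ddof=1 divides by n - 1).
variances = data.var(axis=0, ddof=1)
print(dict(zip(["height", "width", "length"], variances)))
```

For this particular dataset all three variances come out equal (5/3 each), so no single measurement stands out by itself. That is one reason to look at combinations of measurements, which is what the steps below work toward.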

Step 1. Center the Data / Normalization

Step 2. Compute Covariance

Step 3. Solve det(C - λI) = 0 for the Eigenvalues and Eigenvectors

Step 4. Build the Equation


In [None]:
import numpy as np
from rich.console import Console
console = Console()

"""
Let's center some data
"""
data = [
    [1,2,3],
    [2,3,5],
    [3,5,4],
    [4,4,6]
]

# Rows are samples; columns are the measurements [height, width, length].
n_samples = len(data)
n_measurements = len(data[0])

# Mean of each measurement (column), taken across all samples.
avg = [sum(sample[m] for sample in data) / n_samples for m in range(n_measurements)]

# Subtract each measurement's mean from every sample.
centered_data = [
    [sample[m] - avg[m] for m in range(n_measurements)]
    for sample in data
]
        
console.print(centered_data)
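
Centering means subtracting each measurement's mean, taken across all samples, from that measurement's column. A minimal NumPy sketch of that step follows; `column_means` and `centered` are illustrative names, and the use of broadcasting is an assumption about style rather than something the notebook prescribes.

```python
import numpy as np

# Rows are samples; columns are [height, width, length].
data = np.array([
    [1, 2, 3],
    [2, 3, 5],
    [3, 5, 4],
    [4, 4, 6],
], dtype=float)

# axis=0 averages down each column, giving one mean per measurement.
column_means = data.mean(axis=0)

# Broadcasting subtracts the matching column mean from every row.
centered = data - column_means

print("Column means:", column_means)
print("Centered data:\n", centered)
```

After centering, every column should average to zero, which is an easy sanity check to run on `centered.mean(axis=0)`.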

In [16]:
"""
Compute the Covariance

Cov(X,Y) = Σ ( Xi - Xmean ) * ( Yi - Ymean ) / ( n - 1 )

"""
covariance = np.cov(np.array(centered_data).T)
console.print(covariance)
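
To connect the formula in the cell above to `np.cov`, the covariance matrix can also be built by hand from the centered data. This is a sketch under the assumption that rows are samples and columns are measurements; `manual_cov` is an illustrative name.

```python
import numpy as np

# Rows are samples; columns are [height, width, length].
data = np.array([
    [1, 2, 3],
    [2, 3, 5],
    [3, 5, 4],
    [4, 4, 6],
], dtype=float)

n = data.shape[0]
centered = data - data.mean(axis=0)

# Entry (i, j) is the sum over samples of
# (X_i - mean_i) * (X_j - mean_j), divided by n - 1.
manual_cov = centered.T @ centered / (n - 1)

# rowvar=False tells np.cov that each column is a variable.
library_cov = np.cov(data, rowvar=False)

print(manual_cov)
print("Matches np.cov:", np.allclose(manual_cov, library_cov))
```

The diagonal of the result holds each measurement's variance, and the off-diagonal entries hold the pairwise covariances.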

In [12]:
"""
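Find the eigenvalues λ and eigenvectors of the covariance matrix C.
The eigenvalues are the solutions of det(C - λI) = 0;
np.linalg.eig computes them numerically.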
"""
eigenvalues, eigenvectors = np.linalg.eig(covariance)
console.print("Eigenvalues:", eigenvalues)
console.print("Eigenvectors:", eigenvectors)
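
The notebook does not spell out Step 4. One common way to finish a PCA, assumed here, is to sort the eigenpairs by eigenvalue, report how much of the total variance each one accounts for, and project the centered data onto the leading eigenvectors. In this sketch `k = 2`, `explained_ratio`, and `projected` are illustrative choices, and `np.linalg.eigh` is swapped in for `np.linalg.eig` because a covariance matrix is symmetric.

```python
import numpy as np

# Rows are samples; columns are [height, width, length].
data = np.array([
    [1, 2, 3],
    [2, 3, 5],
    [3, 5, 4],
    [4, 4, 6],
], dtype=float)

centered = data - data.mean(axis=0)
covariance = np.cov(centered, rowvar=False)

# eigh is intended for symmetric matrices; it returns real eigenvalues
# in ascending order, with eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# Reorder so the largest eigenvalue (most variance) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance captured by each principal component.
explained_ratio = eigenvalues / eigenvalues.sum()

# Project the centered data onto the first k principal components.
k = 2  # assumed number of components to keep
projected = centered @ eigenvectors[:, :k]

print("Explained variance ratio:", explained_ratio)
print("Projected data:\n", projected)
```

Each column of `eigenvectors` gives the weights of one combination of height, width, and length. The signs of those weights are arbitrary, so the projected values may be flipped between runs or libraries without changing their meaning.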