
# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Principal Components (Analysis)

_Author:_ Timothy Book, General Assembly DC

## Agenda:
1. **What** problem(s) will PCA solve?
1. **How** does it solve these problems?
1. **When** is PCA used in practice?

# **What** problem(s) does PCA solve?
PCA is an **unsupervised learning** method: it looks only at the features and never at a target. That makes it a natural way to **preprocess** data before it goes into a supervised learning method. Specifically, it can be used to "solve" the following common problems:

1. Too many columns ($p \gg n$)
1. Multicollinearity
1. Usually, both of the above at the same time

# **How** does it work?
Suppose you have $p$ feature columns. The **first principal component** is the linear combination of all $p$ columns (with the coefficient vector normalized to unit length) that has the **maximum variance**. That is,

$$z_1 = c_1x_1 + c_2x_2 + \cdots + c_px_p$$

The **second principal component** is another linear combination of the $p$ features that accounts for the maximum of the _remaining_ variance after the first. Another condition is that the second PC must be **orthogonal (perpendicular)** to the first.

The **third principal component** maximizes the remaining variance while being orthogonal (read: _uncorrelated_) to the first two, and so on.
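
To make the definitions above concrete, here is a minimal sketch that builds principal components by hand with `numpy`. It assumes made-up data (`X_toy` is hypothetical, not one of the datasets in this lesson) and only centers the columns rather than standardizing them. The coefficients $c_1, \ldots, c_p$ come out as the eigenvectors of the covariance matrix, sorted by eigenvalue.

```python
import numpy as np

# Made-up data for illustration only: 200 rows, 3 columns, two of them correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center each column so variance is measured around the mean.
X_centered = X_toy - X_toy.mean(axis=0)

# The eigenvectors of the covariance matrix hold the coefficients c_1, ..., c_p.
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder so the largest comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Column j of Z is the j-th principal component (a linear combination of the columns).
Z = X_centered @ eigvecs

print("Variance of each PC:", Z.var(axis=0, ddof=1))
print("Eigenvalues:        ", eigvals)
print("PC1 direction . PC2 direction:", eigvecs[:, 0] @ eigvecs[:, 1])
```

The variance of each component should match its eigenvalue, the variances should come out in decreasing order, and the dot product between any two directions should be zero up to floating-point error.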

# **Geometric** interpretation:
Suppose you have two $x$-variables that look like this:

<img src="imgs/p1.png" width="500px"/>

$x_1$ and $x_2$ are clearly correlated. So any model that uses both of them to predict a $y$-variable will suffer from _multicollinearity_. So should we only pick one? How do we know which to pick? What if, instead of two $x$-variables, we have many? And their correlations are a little weaker? What do we do?!

Geometrically, PCA **rotates** the axes to **decorrelate** your $x$-variables.

<img src="imgs/p2.png" width="500px"/>
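
The rotation idea can be checked numerically. The sketch below assumes two made-up correlated variables (not the data in the pictures above) and uses the singular value decomposition to find the new axes. The resulting matrix is orthogonal, so it acts as a rotation (possibly combined with a flip) and leaves each point's distance from the origin unchanged.

```python
import numpy as np

# Two made-up, strongly correlated variables (illustration only).
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X_toy = np.column_stack([x1, x2])
X_centered = X_toy - X_toy.mean(axis=0)

# The right singular vectors of the centered data are the principal directions.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
rotation = Vt.T  # 2x2 orthogonal matrix

# Express every point in the rotated coordinate system.
Z = X_centered @ rotation

corr_before = np.corrcoef(X_centered, rowvar=False)[0, 1]
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]

print(f"Correlation before rotating: {corr_before:.3f}")
print(f"Correlation after rotating:  {corr_after:.3f}")
```

The points themselves have not moved relative to each other; only the axes used to describe them have changed.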

## **When** is this used IRL?
In class, I frequently give the example of working with **genomic data**. It is expensive to gather rows, since genetic testing is difficult. But once you sample someone, you typically collect thousands of genetic markers (columns). However, only a few of them are significant, and many of them are correlated. Biostatisticians will often use only the first few PCs in their analyses.

From a more data sciency perspective, PCA is often performed on image data to get compressed, lower-dimensional versions of images that are easier to work with in other types of analyses. PCA is very heavily employed in **image processing** for this reason.

# **Example 1 in Python**: Leggooooo

In [None]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('data/bikeshare.csv', index_col='datetime', parse_dates=True)

In [None]:
df.head()

In [None]:
xvars = ['temp', 'atemp', 'humidity', 'holiday', 'workingday', 'windspeed']
X = df[xvars]
y = df['count']

In [None]:
sns.pairplot(X)

In [None]:
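# Scaling does not depend on k, so do it once outside the loop.
sc = StandardScaler()
X_sc = sc.fit_transform(X)

# Regress y on the first k principal components and report the training R^2.
# There are 6 features, so k runs from 1 through 6.
for k in range(1, len(xvars) + 1):
    pc = PCA(n_components=k)
    X_pc = pc.fit_transform(X_sc)
    lm = LinearRegression()
    lm.fit(X_pc, y)
    print(f"k = {k}: training Rsq = {lm.score(X_pc, y)}")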

In [None]:
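# Fit PCA with all 6 components to see how much variance each one explains.
sc = StandardScaler()
X_sc = sc.fit_transform(X)
pc = PCA(n_components=6)
pc.fit(X_sc)
plt.plot(range(1, 7), pc.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Proportion of variance explained')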

In [None]:
pc.explained_variance_ratio_
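
One common way to read these ratios is cumulatively: keep the smallest number of components whose explained variance adds up to some cutoff. The sketch below is only an illustration under stated assumptions: the data are synthetic (built from two hidden factors plus noise), the 90% cutoff is an arbitrary choice, and every name ending in `_demo` is hypothetical, chosen so it does not overwrite the bikeshare variables above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only): 6 columns driven by 2 hidden factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 6))
X_demo = latent @ loadings + 0.2 * rng.normal(size=(300, 6))

# Standardize, then fit PCA with all components kept.
X_demo_sc = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(X_demo_sc)

# Running total of the proportion of variance explained.
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the (arbitrary) cutoff.
threshold = 0.90
k = int(np.argmax(cum_var >= threshold)) + 1

print("Cumulative explained variance:", np.round(cum_var, 3))
print(f"Components needed to reach {threshold:.0%}: {k}")
```

`sklearn` can also make this choice for you: passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.90)`) keeps enough components to reach that share of the variance.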

Our preprocessing is now starting to get pretty complicated. Luckily, `sklearn` gives us a way to smash it all together in a data science "pipeline."

In [None]:
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('pc', PCA(n_components=2)),
    ('lm', LinearRegression())
])

In [None]:
pipe.fit(X, y)
pipe.score(X, y)

## **Example 2**: Speed Dating Data

In [None]:
dating = pd.read_csv('data/speed_dating.csv').dropna()

In [None]:
dating.shape

In [None]:
dating.columns

In [None]:
dating.head()

In [None]:
# Pin both ends of the color scale so that 0 correlation sits at the neutral midpoint.
sns.heatmap(dating.iloc[:, 2:].corr(), cmap='coolwarm', vmin=-1, vmax=1)

In [None]:
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('pc', PCA()),
    ('lm', LinearRegression())
])

X = dating.iloc[:, 2:].drop('objective_attractiveness', axis=1)
y = dating['objective_attractiveness']

In [None]:
X.shape

In [None]:
pipe.get_params()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

In [None]:
rsq_list = []
for k in range(1, X.shape[1] + 1):
    pipe.set_params(pc__n_components=k)
    pipe.fit(X_train, y_train)
    rsq = pipe.score(X_test, y_test)
    rsq_list.append(rsq)
    print(f"k = {k}: Rsq = {rsq}")

In [None]:
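# Plot against k = 1, 2, ... so the x-axis shows the number of components, not the list index.
plt.plot(range(1, len(rsq_list) + 1), rsq_list)
plt.xlabel('Number of principal components (k)')
plt.ylabel('Test Rsq')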