# Principal Component Analysis

In [None]:
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.linear_model import LinearRegression

from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import train_test_split, cross_val_score

import seaborn as sns
from matplotlib import pyplot as plt

%config Completer.use_jedi = False

Think of the predictors in your dataset as dimensions in what we can usefully call "feature space". If we're predicting house prices, then we might have a 'square feet' dimension or a 'number of bathrooms' dimension, etc. Then each record (of a house or a house sale, say) would be represented as a point (or vector) in this feature space. Some would score higher on the 'latitude' dimension or lower on the 'number of bedrooms' dimension, or whatever.

One difficulty is that, despite our working nomenclature, these things aren't really *dimensions* in the truest sense, since they're not independent of each other. When we talk about the x-, y-, and z-dimensions of Euclidean 3-space, for example, one important feature is that values of x have no bearing (per se) on values of y or of z. I can move three units along the x-dimension without changing my y- or z-position.

But the same thing is generally not true for datasets. When I increase my position along the 'number of bedrooms' dimension (or, better, *direction*), I also tend to increase my position along, say, the 'square feet' direction as well.

This is problematic for a couple of reasons. One is that my model could in effect be "double-counting" certain features of my signal, which can lead to overfit models. Another is that, if my goal is inference or explanation, I'm going to have a very hard time distinguishing between the idea that the number of bedrooms is what's *really* predictive of housing prices and the idea that the number of square feet is what's really so predictive.
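
To make the "non-independent directions" idea concrete, here is a minimal sketch on made-up data. The `square_feet` and `bedrooms` arrays are hypothetical (they are not taken from any dataset in this notebook), and the numbers used to generate them are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: square footage, and a bedroom count that grows with it
square_feet = rng.normal(loc=2000, scale=500, size=500)
bedrooms = np.round(square_feet / 600 + rng.normal(scale=0.5, size=500))

# The off-diagonal entry is the Pearson correlation between the two "dimensions"
corr = np.corrcoef(square_feet, bedrooms)[0, 1]
print(f"correlation(square_feet, bedrooms) = {corr:.2f}")
```

A correlation well above zero means that moving along one direction tends to move us along the other, which is exactly what true dimensions would not do.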

From a predictive standpoint, we may sometimes have a feature space so large (often as a product of one-hot encoding) that there is no conceivable way to produce a model that is not highly overfit to the training data.

The idea behind Principal Component Analysis (PCA) is to transform our dataset into something more useful for building models. What we want to do is to build new dimensions (predictors) out of the dimensions we are given in such a way that:

(1) each dimension we draw captures as much of the remaining variance among our predictors as possible; and <br/>
(2) each dimension we draw is orthogonal to the ones we've already drawn.

## Motivation

Think back to multiple linear regression for a moment.

The fundamental idea is that I can get a better prediction for my dependent variable by considering a *linear combination of my predictors* than I can get by considering any one predictor by itself.

$\rightarrow$ **PCA insight**: If the combinations of predictors work better than the predictors themselves, then let's just treat the combinations as our primary dimensions!

But one problem with having lots of predictors is that it raises the chance that some will be nearly *collinear*.

$\rightarrow$ **PCA insight**: Since we're reconstructing our dimensions anyway, let's make sure that the dimensions we construct are mutually orthogonal! <br/>
$\rightarrow$ **PCA insight**: Moreover, since we'll be capturing much of the variance among our predictors in the first few dimensions we construct, we'll be able in effect to *reduce the dimensionality* of our problem. Thus PCA is a fundamental tool in *dimensionality reduction*.

In [None]:
cars = pd.read_csv('cars.csv')

In [None]:
cars.head()

In [None]:
cars.dtypes

**Data Formatting**

In the cell below, reformat the column names so that:
- There are no leading or trailing spaces
- All spaces and dashes have been replaced with underscores

In [None]:
# Your code here

In the cell below, change `'cubicinches'` and `'weightlbs'` to a numeric datatype. Replace any values that cannot be converted with `np.nan`.

In [None]:
# Your code here

In the cell below, separate `'mpg'` from the rest of the data, and create a train-test split.
- Assign the `'mpg'` column to the variable `y`.
- Assign all other columns to the variable `X`.
- Create a train-test split with a `random_state` of 20.


In [None]:
# Your code here

In [None]:
number_selector = make_column_selector(dtype_include='number')
object_selector = make_column_selector(dtype_include='object')

column_transform = make_column_transformer(
                    (StandardScaler(), number_selector),
                    (OneHotEncoder(), object_selector),
                    remainder='passthrough')

preprocessing = make_pipeline(column_transform, SimpleImputer())

In [None]:
preprocessing.fit(X_train)

In [None]:
X_tr_pp = preprocessing.transform(X_train)
X_te_pp = preprocessing.transform(X_test)

In [None]:
# Let's construct a linear regression

lr = LinearRegression().fit(X_tr_pp, y_train)

# Score on train
lr.score(X_tr_pp, y_train)

In [None]:
# Score on test

lr.score(X_te_pp, y_test)

In [None]:
# Get the coefficients of the best-fit hyperplane

lr.coef_

Thus, our best-fit hyperplane (omitting the intercept) is given by:

$- 1.555\times cyl\_sd + 2.189\times in^3\_sd - 1.154\times hp\_sd - 4.681\times lbs.\_sd  - 0.267\times time_{60}\_sd + 2.604\times yr\_sd + 0.708\times brand_{Europe} + 0.912\times brand_{Japan} - 1.620\times brand_{US}$

## Eigenvalues and Eigenvectors

The key idea is to diagonalize (i.e. find the eigendecomposition of) the covariance matrix. The decomposition will produce a set of orthogonal vectors that explain as much of the remaining variance as possible. These are our [principal components](https://math.stackexchange.com/questions/23596/why-is-the-eigenvector-of-a-covariance-matrix-equal-to-a-principal-component).

In [None]:
matrix = np.array([[0,1], [1,0]])
matrix

In [None]:
vector = [5,2]
matrix.dot(vector)

In [None]:
np.linalg.eig(matrix)

In [None]:
matrix.dot([0.70710678,0.70710678])

The definition of an eigenvector is: $\vec{x}$ is an eigenvector of the matrix $A$ if $A\vec{x} = \lambda\vec{x}$, for some scalar $\lambda$. That is, the vector is oriented in just such a direction that multiplying the matrix by it serves only to lengthen or shorten the original vector.
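
As a quick check of this definition, the sketch below applies the swap matrix from the cells above to three vectors. The names `M`, `same_dir`, `flipped`, and `other` are introduced here only for illustration.

```python
import numpy as np

M = np.array([[0, 1], [1, 0]])    # swaps the two coordinates

same_dir = np.array([1, 1])       # candidate eigenvector, lambda = 1
flipped = np.array([1, -1])       # candidate eigenvector, lambda = -1
other = np.array([5, 2])          # not an eigenvector

# M x = lambda x holds for the first two vectors
print(np.array_equal(M @ same_dir, 1 * same_dir))    # True
print(np.array_equal(M @ flipped, -1 * flipped))     # True

# For `other`, the result [2 5] points in a new direction
print(M @ other)
```

Note that an eigenvalue can be negative, in which case the vector is reversed as well as rescaled.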

Suppose we have the matrix
$A =
\begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22} \\
\end{bmatrix}
$.

Let's calculate the eigendecomposition of this matrix.

In order to do this, we set $(A - \lambda I)\vec{x} = 0$. One trivial solution is $\vec{x} = \vec{0}$, but if there are more interesting solutions, then it must be that $|A - \lambda I| = 0$, which is to say that some column vector in $A - \lambda I$ must be expressible as a linear combination of the other columns. (Otherwise, there would be no way to "undo" the multiplicative effect of a column vector on $\vec{x}$!) For more on this point, see [this page](http://www2.math.uconn.edu/~troby/math2210f16/LT/sec1_7.pdf).

So we have:

$\begin{vmatrix}
a_{11} - \lambda & a_{12} \\
a_{21} & a_{22} - \lambda
\end{vmatrix} = 0$

$(a_{11} - \lambda)(a_{22} - \lambda) - a_{12}a_{21} = 0$

$\lambda^2 - (a_{11} + a_{22})\lambda + a_{11}a_{22} - a_{12}a_{21} = 0$

$\lambda = \frac{a_{11} + a_{22}\pm\sqrt{(a_{11} + a_{22})^2 + 4(a_{12}a_{21} - a_{11}a_{22})}}{2}$

Suppose e.g. we had

$A = \begin{bmatrix}
5 & 3 \\
3 & 5
\end{bmatrix}$.

We can use the equation we just derived to solve for the eigenvalues of this matrix. Then we can plug *those* into our eigenvector definition to solve for the eigenvectors:

So:

### Eigenvalues

$\lambda = \frac{5+5\pm\sqrt{(5+5)^2+4(3\times 3 - 5\times 5)}}{2} = 5\pm\frac{\sqrt{36}}{2} = 2, 8$.
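
The quadratic formula above translates directly into code. The helper `eigenvalues_2x2` below is a hypothetical function written for this sketch, and it assumes the discriminant is non-negative (which always holds for a real symmetric matrix).

```python
import numpy as np

def eigenvalues_2x2(mat):
    """Eigenvalues of a 2x2 matrix via the quadratic formula (real case only)."""
    (a11, a12), (a21, a22) = mat
    trace = a11 + a22
    disc = trace**2 + 4 * (a12 * a21 - a11 * a22)
    root = np.sqrt(disc)   # assumes disc >= 0
    return float((trace - root) / 2), float((trace + root) / 2)

example = np.array([[5, 3], [3, 5]])

print(eigenvalues_2x2(example))                 # (2.0, 8.0)
print(np.sort(np.linalg.eigvals(example)))      # NumPy's answer, for comparison
```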

### Eigenvectors

Now we can plug those in. If we plug in $\lambda = 8$, then we get:

$\begin{bmatrix}
5-8 & 3 \\
3 & 5-8
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix}
=
\begin{bmatrix}
-3 & 3 \\
3 & -3
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix} = 0.$

So:

$-3x_1 + 3x_2 = 0$ (or $3x_1 - 3x_2 = 0$)

$x_1 = x_2$.

Therefore, any nonzero 2-element column vector whose two elements are equal (same magnitude, same sign) is an eigenvector of this matrix, with eigenvalue 8.

It is standard to scale eigenvectors to a magnitude of 1, and so we would write this eigenvector as
$\begin{bmatrix}
\frac{\sqrt{2}}{2} \\
\frac{\sqrt{2}}{2}
\end{bmatrix}$.

If we plug in $\lambda = 2$, we find a second eigenvector equal to
$\begin{bmatrix}
-\frac{\sqrt{2}}{2} \\
\frac{\sqrt{2}}{2}
\end{bmatrix}$. (I'll leave the derivation as an exercise.) More generally, any nonzero 2-element column vector whose two elements have equal magnitude and opposite signs is an eigenvector of this matrix, with eigenvalue 2.

**Thus we can express the full diagonalization of our matrix as follows**:

$A = \begin{bmatrix}
5 & 3 \\
3 & 5
\end{bmatrix} =
\begin{bmatrix}
\frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\
\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2}
\end{bmatrix}
\begin{bmatrix}
8 & 0 \\
0 & 2
\end{bmatrix}
\begin{bmatrix}
\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\
-\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2}
\end{bmatrix}$

### In Code

In [None]:
# We can use np.linalg.eig()

A = np.array([[5, 3], [3, 5]])
np.linalg.eig(A)

In [None]:
# np.linalg.eig(X) returns a pair of NumPy arrays, the first containing
# the eigenvalues of X and the second containing the eigenvectors of X.

values, vectors = np.linalg.eig(A)

In [None]:
values

In [None]:
# np.diag()

np.diag(values)

In [None]:
# Reconstruct A by multiplication

vectors.dot(np.diag(values)).dot(vectors.T)

## PCA by Hand

What follows is indebted to [Sebastian Raschka](http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html#pca-vs-lda).

In [None]:
# We'll start by producing the covariance matrix for the columns of X_tr_pp.

cov_mat = np.cov(X_tr_pp, rowvar=False)
cov_mat.shape

In [None]:
cov_mat

In [None]:
np.linalg.eig(cov_mat)

In [None]:
# Let's assign the results of eig(cov_mat) to a pair of variables.

eigvals, eigvecs = np.linalg.eig(cov_mat)

In [None]:
# The columns of "eigvecs" are the eigenvectors!

eigvecs

In [None]:
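# The eigenvectors of the covariance matrix are our principal components.
# np.linalg.eig() does not guarantee any ordering, so we sort the
# eigenvectors by descending eigenvalue before taking the first three.

order = np.argsort(eigvals)[::-1]
pcabh = eigvecs[:, order[:3]]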

In [None]:
pcabh

Now, to transform our data points into the space defined by the principal components, we simply need to compute the dot-product of `X_tr_pp` with those principal components.

Why? Think about what this matrix product looks like:

We take a row of `X_tr_pp` and multiply it by a column of `pcabh`, pairwise. The row of `X_tr_pp` represents the values for the columns in the original space. The column of `pcabh` represents the weights we need on each of the original columns in order to transform a value into principal-component space. And so the product of these two matrices will be each row, transformed into principal-component space!
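
The same row-times-column logic can be sketched end to end on a small synthetic matrix. Everything named `*_demo` below is hypothetical data made up for this example. The sketch also centers the columns first, which is what sklearn's `PCA` does internally, and uses `np.linalg.eigh`, a variant of `eig` intended for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 200 rows, 4 correlated columns
latent_demo = rng.normal(size=(200, 2))
mixing_demo = rng.normal(size=(2, 4))
X_demo = latent_demo @ mixing_demo + 0.1 * rng.normal(size=(200, 4))

# Center the columns so the projection is measured from the data's mean
X_centered = X_demo - X_demo.mean(axis=0)

# Eigendecomposition of the (symmetric) covariance matrix
vals_demo, vecs_demo = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Sort by descending eigenvalue and keep the top two eigenvectors as columns
order_demo = np.argsort(vals_demo)[::-1]
components_demo = vecs_demo[:, order_demo[:2]]      # shape (4, 2)

# Each row of X_centered dotted with each column of components_demo
X_projected = X_centered @ components_demo          # shape (200, 2)

print(X_projected.shape)
# The covariance of the projected data is diagonal: the new columns are
# uncorrelated, and their variances are the two largest eigenvalues
print(np.round(np.cov(X_projected, rowvar=False), 3))
```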

In [None]:
X_tr_pp[:5, :]

In [None]:
X_tr_pp.dot(pcabh)[:5, :]

# Sklearn

In [None]:
# Naturally, sklearn has a shortcut for this!

pca = PCA(n_components=3) # Check out how `n_components` works

X_train_new = pca.fit_transform(X_tr_pp)

In [None]:
# Let's check out the explained variance

pca.explained_variance_

In [None]:
# The ratio is often more informative

pca.explained_variance_ratio_
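
On the `n_components` parameter: an integer asks for exactly that many components, while a float between 0 and 1 asks sklearn to keep as many components as are needed to reach that fraction of explained variance. The sketch below illustrates both on hypothetical synthetic data (`X_nc_demo` is made up for this example).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 6 columns driven mostly by 2 latent factors
X_nc_demo = (rng.normal(size=(300, 2)) @ rng.normal(size=(2, 6))
             + 0.05 * rng.normal(size=(300, 6)))

pca_int = PCA(n_components=3).fit(X_nc_demo)       # keep exactly 3 components
pca_frac = PCA(n_components=0.95).fit(X_nc_demo)   # keep enough for >= 95% of variance

print(pca_int.n_components_)
print(pca_frac.n_components_, pca_frac.explained_variance_ratio_.sum())
```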

In [None]:
# We can also check out the Principal Components themselves

pca.components_

In [None]:
X_train.columns

The results of our PCA are as follows: 

**PC1** = 0.450 * cylinders_sd + 0.464 * cubicinches_sd + 0.455 * hp_sd + 0.433 * lbs_sd - 0.350 * timeto60_sd - 0.188 * year_sd - 0.068 * Europe - 0.073 * Japan + 0.141 * US

**PC2** = -0.132 * cylinders_sd - 0.1 * cubicinches_sd + 0.005 * hp_sd - 0.194 * lbs_sd - 0.123 * timeto60_sd - 0.938 * year_sd + 0.13 * Europe + 0.022 * Japan - 0.152 * US

**PC3** = 0.189 * cylinders_sd + 0.142 * cubicinches_sd - 0.143 * hp_sd + 0.341 * lbs_sd + 0.851 * timeto60_sd - 0.236 * year_sd - 0.041 * Europe - 0.132 * Japan + 0.091 * US

## Orthogonality

These principal components should also be mutually orthogonal. If they are, then the dot product of any two of them should be 0. Let's check!

In [None]:
pca.components_[0].dot(pca.components_[1])

In [None]:
pca.components_[0].dot(pca.components_[2])

In [None]:
pca.components_[1].dot(pca.components_[2])

## Transformed dimensions have zero correlation

In [None]:
np.corrcoef(X_train_new.T)

## Visualizations

In [None]:
X_test_new = pca.transform(X_te_pp)

In [None]:
# Reassembling the whole dataset for the sake of visualization

X_transformed = np.vstack([X_train_new, X_test_new])
y_new = np.concatenate([y_train, y_test])

In [None]:
f, a = plt.subplots()
a.plot(X_transformed[:, 0], y_new, 'r.');

In [None]:
f, a = plt.subplots()
a.plot(X_transformed[:, 1], y_new, 'g.');

In [None]:
f, a = plt.subplots()
a.plot(X_transformed[:, 2], y_new, 'k.');

In [None]:
df = pd.DataFrame(np.hstack([X_transformed, y_new[:, np.newaxis]]),
                  columns=['PC1', 'PC2', 'PC3', 'y'])
df.head()

In [None]:
sns.relplot(data=df,
            x='PC1',
            y='PC2',
           hue='y');

## Relation to Linear Regression

Question: Is the first principal component the same line we would get if we constructed an ordinary least-squares regression line?

Answer: No. The best-fit line minimizes the sum of squared errors, i.e. the sum of squared ("vertical") distances between the predictions and the real values of the dependent variable. Principal Component Analysis, by contrast, is not a modeling procedure and so has no target. The first principal component thus cannot minimize the distances between predictions and real values; instead, it minimizes the sum of squared ("perpendicular") distances between the data points and *it (the line) itself*.
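
A small numerical sketch of this difference, on hypothetical 2-D data (`x_demo` and `y_demo` are made up, with a true slope of 2 and a fair amount of noise): the OLS slope and the slope of the first principal component come out noticeably different.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D data with a noisy linear relationship
x_demo = rng.normal(size=300)
y_demo = 2.0 * x_demo + rng.normal(scale=2.0, size=300)

# OLS slope: minimizes squared vertical distances
ols_slope = np.polyfit(x_demo, y_demo, deg=1)[0]

# First principal component: top eigenvector of the 2x2 covariance matrix,
# which minimizes squared perpendicular distances
vals_2d, vecs_2d = np.linalg.eigh(np.cov(x_demo, y_demo))
pc1_direction = vecs_2d[:, np.argmax(vals_2d)]
pc1_slope = pc1_direction[1] / pc1_direction[0]

print(f"OLS slope: {ols_slope:.2f}, PC1 slope: {pc1_slope:.2f}")
```

With this much noise in `y_demo`, the principal component is steeper than the regression line, because it treats both axes symmetrically rather than attributing all the error to the vertical direction.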

Suppose we look at MPG vs. z-scores of weight in lbs. Let's make a scatter plot:

In [None]:
f, a = plt.subplots()

# Column 3 of X_tr_pp holds the scaled 'weightlbs' values
# (the scaled numeric columns keep their original order).
a.scatter(X_tr_pp[:, 3], y_train)
a.set_xlabel('weight z-scores (lbs.)')
a.set_ylabel('efficiency (MPG)')
a.set_title('MPG vs. Weight');

Let's add the best-fit line:

In [None]:
# Fit once and read both parameters off the same model
simple_lr = LinearRegression().fit(X_tr_pp[:, 3].reshape(-1, 1), y_train)
beta1 = simple_lr.coef_
beta0 = simple_lr.intercept_

In [None]:
f, a = plt.subplots()

a.scatter(X_tr_pp[:, 3], y_train)
a.plot(X_tr_pp[:, 3],
       beta1[0] * X_tr_pp[:, 3] + beta0,
      c='r', label='best-fit line')
a.set_xlabel('weight z-scores (lbs.)')
a.set_ylabel('efficiency (MPG)')
a.set_title('MPG vs. Weight')
plt.legend();

Now let's see what the principal component looks like. We'll make use of the `inverse_transform()` method of `PCA()` objects.

In [None]:
# Two-column array: scaled weight and MPG
weight_mpg = np.column_stack((X_tr_pp[:, 3], y_train))

pc1 = PCA(n_components=1).fit(weight_mpg)
pc = pc1.transform(weight_mpg)

pc_inv = pc1.inverse_transform(pc)

In [None]:
f, a = plt.subplots()

a.scatter(X_tr_pp[:, 3], y_train)
a.plot(X_tr_pp[:, 3],
       beta1[0] * X_tr_pp[:, 3] + beta0,
      c='r', label='best-fit line')
a.plot(pc_inv[:, 0],
       pc_inv[:, 1],
      c='b', label='principal component')
a.set_xlabel('weight z-scores (lbs.)')
a.set_ylabel('efficiency (MPG)')
a.set_title('MPG vs. Weight')
plt.legend();

Check out this post, to which I am indebted, for more on this subtle point: https://shankarmsy.github.io/posts/pca-vs-lr.html

## Modeling with New Dimensions

Now that we have optimized our features, we can build a new model with them!

In [None]:
lr_pca = LinearRegression()
lr_pca.fit(X_train_new, y_train)
lr_pca.score(X_train_new, y_train)

In [None]:
X_test_new = pca.transform(X_te_pp)

In [None]:
lr_pca.score(X_test_new, y_test)

In [None]:
lr_pca.coef_

Thus, our best-fit hyperplane (omitting the intercept) is given by:

$-2.967\times PC1 - 1.162\times PC2 -2.486\times PC3$

Of course, since the principal components are just linear combinations of our original predictors, we could re-express this hyperplane in terms of those original predictors!
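
Here is a minimal sketch of that re-expression on hypothetical data (all `*_demo` names and the true coefficients are made up). If the model's weights on the principal components are $w_k$ and component $k$ puts weight $c_{kj}$ on original feature $j$, then the coefficient on feature $j$ is $\sum_k w_k c_{kj}$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical centered predictors (5 columns) and a target
X_re_demo = rng.normal(size=(100, 5))
X_re_demo -= X_re_demo.mean(axis=0)
y_re_demo = X_re_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=100)

# Principal components as rows, laid out like sklearn's `pca.components_`
_, _, Vt = np.linalg.svd(X_re_demo, full_matrices=False)
comps_demo = Vt[:3]                          # shape (3, 5)
Z_demo = X_re_demo @ comps_demo.T            # scores in PC space, shape (100, 3)

# Least-squares fit on the PC scores, with an intercept column
design = np.column_stack([np.ones(len(Z_demo)), Z_demo])
solution = np.linalg.lstsq(design, y_re_demo, rcond=None)[0]
intercept_demo, pc_weights = solution[0], solution[1:]

# Coefficient on original feature j: sum over k of pc_weights[k] * comps_demo[k, j]
original_coefs = pc_weights @ comps_demo     # shape (5,)

# Both parameterizations give the same predictions
pred_pc = intercept_demo + Z_demo @ pc_weights
pred_orig = intercept_demo + X_re_demo @ original_coefs
print(np.allclose(pred_pc, pred_orig))       # True
```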

And if the PCA was worth anything, we should expect the new linear model to be *different from* the first!

Recall that we had:

**PC1** = 0.450 * cylinders_sd + 0.464 * cubicinches_sd + 0.455 * hp_sd + 0.433 * lbs_sd - 0.350 * timeto60_sd - 0.188 * year_sd - 0.068 * Europe - 0.073 * Japan + 0.141 * US

**PC2** = -0.132 * cylinders_sd - 0.1 * cubicinches_sd + 0.005 * hp_sd - 0.194 * lbs_sd - 0.123 * timeto60_sd - 0.938 * year_sd + 0.13 * Europe + 0.022 * Japan - 0.152 * US

**PC3** = 0.189 * cylinders_sd + 0.142 * cubicinches_sd - 0.143 * hp_sd + 0.341 * lbs_sd + 0.851 * timeto60_sd - 0.236 * year_sd - 0.041 * Europe - 0.132 * Japan + 0.091 * US

Therefore, our new PCA-made hyperplane can be expressed as:

$-2.967\times(0.450\times cylinders\_sd + 0.464\times cubicinches\_sd + 0.455\times hp\_sd + 0.433\times lbs\_sd - 0.350\times timeto60\_sd - 0.188\times year\_sd - 0.068\times Europe - 0.073\times Japan + 0.141\times US)$ <br/> $- 1.162\times(-0.132\times cylinders\_sd - 0.1\times cubicinches\_sd + 0.005\times hp\_sd - 0.194\times lbs\_sd - 0.123\times timeto60\_sd - 0.938\times year\_sd + 0.13\times Europe + 0.022\times Japan - 0.152\times US)$ <br/> $- 2.486\times(0.189\times cylinders\_sd + 0.142\times cubicinches\_sd - 0.143\times hp\_sd + 0.341\times lbs\_sd + 0.851\times timeto60\_sd - 0.236\times year\_sd - 0.041\times Europe - 0.132\times Japan + 0.091\times US)$

Let's make these calculations:

In [None]:
def pca_original(feature_names, model, pca, class_index=1):
    """
    Re-express the coefficients of a model that was fit on
    PCA-transformed data in terms of the original features.

    Returns a dict mapping each feature name to its coefficient.
    """
    coefs = np.asarray(model.coef_)
    # For multiclass classification problems, model.coef_ is a 2-D
    # matrix with one row of coefficients per class, so we collect
    # the row for the desired class. Otherwise we use coef_ as is.
    if coefs.ndim == 2 and coefs.shape[0] > class_index:
        weights = coefs[class_index]
    else:
        weights = coefs

    # Each original-feature coefficient is the PC weights dotted with
    # that feature's column of pca.components_
    coeffs = {}
    for idx, name in enumerate(feature_names):
        coeffs[name] = np.round(weights @ pca.components_[:, idx], 3)
    return coeffs

In [None]:
feature_names = ['cylinders_sd', 'cubicinches_sd', 'horsepower_sd', 
                 'weightlbs_sd','timeto60_sd', 'year_sd', 'Europe',
                'Japan', 'US']

pca_original(feature_names, lr_pca, pca)

So our best-fit hyperplane using PCA is:

$-1.659\times cyl\_sd -1.62\times in^3\_sd-1.003\times hp\_sd-1.911\times lbs.\_sd -0.936\times time_{60}\_sd + 2.237\times yr\_sd -0.052\times brand_{Europe} + 0.52\times brand_{Japan} -0.468\times brand_{US}$


Recall that our first linear regression model had:

$- 1.555\times cyl\_sd + 2.189\times in^3\_sd - 1.154\times hp\_sd - 4.681\times lbs.\_sd  - 0.267\times time_{60}\_sd + 2.604\times yr\_sd + 0.708\times brand_{Europe} + 0.912\times brand_{Japan} - 1.620\times brand_{US}$

which is clearly a different hyperplane.

# Importance of scaling

In [None]:
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [None]:
unscaled = make_pipeline(PCA(.95), LogisticRegression())
scaled = make_pipeline(StandardScaler(), PCA(.95), LogisticRegression())
data = load_wine()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']

In [None]:
cross_val_score(unscaled, X, y)

In [None]:
cross_val_score(scaled, X, y)

In [None]:
pca = PCA(2)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_pca = pca.fit_transform(X)
X_pca = pd.DataFrame(X_pca, columns=['f1', 'f2'])
X_scaled_pca = pca.fit_transform(X_scaled)
X_scaled_pca = pd.DataFrame(X_scaled_pca, columns=['f1', 'f2'])

fig, ax = plt.subplots(1,2, figsize=(15,6))
for label in pd.Series(y).unique():
    frame = X_pca[y==label]
    frame_scaled = X_scaled_pca[y==label]
    ax[0].scatter(frame.f1, frame.f2, label=label)
    ax[1].scatter(frame_scaled.f1, frame_scaled.f2, label=label)
ax[0].set_title('PCA Unscaled')
ax[1].set_title('PCA Scaled')
plt.legend();
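
Why scaling matters can also be seen in a NumPy-only sketch. The two columns below are hypothetical (made up for this example) and sit on very different scales. Without standardization the first principal component points almost entirely along the large-scale column; after standardization both columns contribute equally.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical correlated features on very different scales
big = rng.normal(scale=1000.0, size=500)
small = 0.0008 * big + rng.normal(scale=0.6, size=500)
X_scale_demo = np.column_stack([big, small])

def first_component(data):
    """Top eigenvector of the covariance matrix of `data`."""
    vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return vecs[:, np.argmax(vals)]

raw_pc1 = first_component(X_scale_demo)

# Standardize each column to zero mean and unit variance
X_std_demo = (X_scale_demo - X_scale_demo.mean(axis=0)) / X_scale_demo.std(axis=0)
std_pc1 = first_component(X_std_demo)

# Absolute values, since the sign of an eigenvector is arbitrary
print(np.round(np.abs(raw_pc1), 3))
print(np.round(np.abs(std_pc1), 3))
```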

# Selecting `n_components`

In [None]:
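pca = PCA().fit(X_scaled)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
# Plot against 1..n so the x-axis reads as a component count
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance ratio');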
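
One common way to turn a cumulative-variance curve into a number is to pick the smallest component count that reaches a chosen threshold. The sketch below assumes a 95% threshold and uses hypothetical variance ratios (`ratios_demo` is made up, not the wine data's output).

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in descending order
ratios_demo = np.array([0.55, 0.25, 0.10, 0.06, 0.03, 0.01])
cumulative_demo = np.cumsum(ratios_demo)

threshold = 0.95
# Index of the first cumulative value that reaches the threshold, plus one
n_keep = int(np.argmax(cumulative_demo >= threshold)) + 1

print(n_keep)   # 4
```

Passing a float such as `PCA(.95)` asks sklearn to make a similar choice for us.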

In [None]:
pca = PCA(12)
pipeline = make_pipeline(StandardScaler(), pca, LogisticRegression())
cross_val_score(pipeline, X, y)

In [None]:
pipeline.fit(X, y)

In [None]:
pca_original(X.columns, pipeline.steps[-1][1], pipeline.steps[-2][1])

## Extra Resources

- [StatQuest's Longform PCA video](https://www.youtube.com/watch?v=_UVHneBUBW0)
- [3Blue1Brown Video on Eigenvectors](https://www.youtube.com/watch?v=PFDu9oVAE-g)
- [Python Data Science Handbook - In Depth PCA](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html#:~:text=PCA%20is%20fundamentally%20a%20dimensionality,and%20engineering%2C%20and%20much%20more)