# Principal Component Analysis

<a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a> (PCA) is a technique for reducing the dimensionality of training data to mitigate the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a>.

The idea is to find a linear transformation of the original dataset to a lower-dimensional dataset that preserves most of the data's variance. The algorithm finds the direction of highest variance, then (recursively) projects the data onto the remaining orthogonal subspace and finds the next direction of highest variance, and keeps going until some stopping point (typically a number of dimensions or a percentage of variance). Given a set of training features X, this is equivalent to first creating the covariance matrix C (by subtracting the mean of each feature from X and computing X.T * X, up to a constant scaling factor), then using <a href="https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix">eigendecomposition</a> (C = Q D Q.T, where D is diagonal with the eigenvalues sorted in decreasing order, and the columns of Q are the corresponding eigenvectors), and projecting the features in X onto the first "k" eigenvectors before further training:
<center><img src="images/pca.jpg" style="width:500px;height:250px;"></center>
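
The eigendecomposition route described above can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the helper name `pca_eig` and the toy data are assumptions made for this example, and the covariance is scaled by 1/(n-1), which rescales the eigenvalues but leaves the eigenvectors unchanged.

```python
import numpy as np

def pca_eig(X, k):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix."""
    # Center each feature (column) at zero mean
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features
    C = Xc.T @ Xc / (X.shape[0] - 1)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Fraction of the total variance carried by each component
    ratio = eigvals / eigvals.sum()
    # Project the centered data onto the first k eigenvectors
    return Xc @ eigvecs[:, :k], ratio

# Toy data (an assumption for this sketch): 5 features with very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

Z, ratio = pca_eig(X_demo, k=2)
print(Z.shape)  # (200, 2)

# Stopping by percentage of variance: smallest k whose cumulative ratio reaches 95%
k95 = int(np.searchsorted(np.cumsum(ratio), 0.95) + 1)
print(k95, ratio)
```

The projected columns are uncorrelated with each other, which is what diagonalizing the covariance matrix buys. In practice, `sklearn.decomposition.PCA` computes the same projection (up to the sign of each component) using a singular value decomposition.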

Note that this is quite imperfect, because:
* PCA removes dimensions based purely on their variance, independent of their impact on the target variable. So low-variance features that are highly correlated with the target being predicted may be removed.
* It can be hard to interpret PCA features, as they are linear combinations of existing features.
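
To make the interpretability point concrete, the sketch below prints each principal component as a weighted sum of the original features, using the `components_` attribute of a fitted scikit-learn `PCA`. The feature names and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_names = ["height", "weight", "age"]  # hypothetical names for this sketch
X_demo = rng.normal(size=(100, 3))
X_demo[:, 1] += 0.8 * X_demo[:, 0]  # make "weight" correlate with "height"

fitted = PCA(n_components=2).fit(X_demo)

# Each row of components_ holds the weights of one principal component
# over the original features
for i, weights in enumerate(fitted.components_):
    terms = " ".join(f"{w:+.2f}*{name}" for w, name in zip(weights, feature_names))
    print(f"PC{i + 1} = {terms}")
```

Every component generally mixes several original features, so a model coefficient on "PC1" does not map back to a single measurable quantity.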

In [30]:
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(12)

def pca(X, dim=2):
    """ Perform PCA on X and return the reduced data.
    :param X: data matrix
    :param dim: dimension of the reduced data
    :return: reduced data matrix, percent of variance explained by each component """
    # Use a distinct name so the local variable does not shadow this function
    reducer = PCA(n_components=dim)
    x = reducer.fit_transform(X)
    return x, reducer.explained_variance_ratio_

def pca_correlation_issue():
    """ This shows the issue with PCA removing features that are highly correlated with the output, just because they have lower variance."""
    # Generate data: 100 samples, 10 features. One of the features matches the desired prediction perfectly,
    # but has much lower variance than the other nine (which are scaled by 1000)
    n_samples = 100
    n_features = 10
    y = np.random.randn(n_samples, 1)
    x = np.random.randn(n_samples, n_features - 1) * 1000
    x = np.hstack((x, y))

    # That data gives a great model (MSE = 0), since it uses the low variance feature to predict the output
    from sklearn import linear_model
    model = linear_model.LinearRegression()
    model.fit(x, y)
    print(f"MSE on original data with {x.shape[1]} features: {np.mean((model.predict(x) - y) ** 2):.2f}")

    # Now let's try PCA, dropping only the single lowest-variance direction
    x, _ = pca(x, dim=x.shape[1] - 1)
    model.fit(x, y)
    print(f"MSE on PCA-reduced data with {x.shape[1]} features: {np.mean((model.predict(x) - y) ** 2):.2f}")

    # Even if all the features are first scaled to have the same variance, PCA still performs worse,
    # because it partially removes the feature that is highly correlated with the output
    x = np.random.randn(n_samples, n_features - 1)
    x = np.hstack((x, y))
    x = x / np.std(x, axis=0)
    model.fit(x, y)
    print(f"MSE on normalized dataset with {x.shape[1]} features: {np.mean((model.predict(x) - y) ** 2):.2f}")
    x, _ = pca(x, dim=x.shape[1]//2)
    model.fit(x, y)
    print(f"MSE on PCA-reduced data with {x.shape[1]} features: {np.mean((model.predict(x) - y) ** 2):.2f}")

pca_correlation_issue()

MSE on original data with 10 features: 0.00
MSE on PCA-reduced data with 9 features: 0.98
MSE on normalized dataset with 10 features: 0.00
MSE on PCA-reduced data with 5 features: 0.50
