# Principal Component Analysis

Up to this point, you've seen supervised machine learning algorithms: you provide the algorithm with labeled target data, whether numerical or categorical, and it learns to predict that target from a set of features for each observation. Principal Component Analysis (PCA) is a little different. It is an unsupervised machine learning algorithm. Unsupervised learning algorithms use no labels; they transform existing data, based on its structure, into new representations that can then feed into a larger data pipeline.

PCA reorients the data onto new feature dimensions, chosen so that each successive dimension accounts for the maximum remaining variance among the observations' features. These dimensions correspond to the eigenvectors of the covariance matrix of the original dataset. This allows you to reduce the dimension of the dataset while preserving as much of the statistical information (variance) inherent in the data as possible.
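
To make this concrete, here is a minimal sketch on synthetic two-feature data. The data, the seed, and the variable names are illustrative assumptions and are not part of the cars example used later. It checks that the variance of the data projected onto the leading eigenvector of the covariance matrix matches the largest eigenvalue.

```python
import numpy as np

# Synthetic 2-D data with strongly correlated features (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

# Center the data, then decompose its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered.T)
# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigvals_demo, eigvecs_demo = np.linalg.eigh(cov)

# The eigenvector with the largest eigenvalue is the first principal component.
first_pc = eigvecs_demo[:, -1]
projected = centered @ first_pc

# The variance along that direction equals the largest eigenvalue.
print(projected.var(ddof=1), eigvals_demo[-1])
# Share of the total variance captured by the first component.
print(eigvals_demo[-1] / eigvals_demo.sum())
```

Because the two features are nearly collinear, a single component retains almost all of the variance, which is the situation where dropping dimensions costs the least.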

A primary reason for performing dimensionality reduction is the **curse of dimensionality**. While adding features often improves the performance of a supervised machine learning model, the volume of the n-dimensional feature space grows exponentially with the number of features. As a result, observations lie further and further from each other, and more and more observations are needed to train a model. PCA can be used to reduce a large feature set to a smaller handful of meaningful features. Admittedly, the new features, or principal components, are less directly interpretable from an analysis point of view.
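
The effect can be illustrated with a rough sketch: hold the number of points fixed, draw them uniformly from a unit cube, and measure how far each point is from its nearest neighbor as the dimension grows. The sample size, the dimensions tried, and the helper name `mean_nearest_neighbor_distance` are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 200  # fixed sample size (arbitrary choice for illustration)


def mean_nearest_neighbor_distance(dim):
    """Average distance from each point to its nearest neighbor in the unit cube."""
    points = rng.uniform(size=(n_points, dim))
    # Pairwise Euclidean distances via broadcasting.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    return dists.min(axis=1).mean()


nn_distances = {dim: mean_nearest_neighbor_distance(dim) for dim in (2, 10, 50)}
for dim, dist in nn_distances.items():
    print(dim, round(dist, 3))
```

With the same 200 points, the nearest neighbor moves steadily further away as dimensions are added, so the data become sparser unless the number of observations grows as well.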

Examples of unsupervised algorithms include:
* Clustering (KMeans, Hierarchical Agglomerative Clustering, etc.)
* Dimensionality Reduction (PCA, Singular Value Decomposition (SVD))
* Generative Modeling

If you want to really dig into your linear algebra background, here's another resource on the calculation of eigenvectors:  
* [FINDING EIGENVALUES AND EIGENVECTORS](https://www.scss.tcd.ie/~dahyotr/CS1BA1/SolutionEigen.pdf)
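
As a quick numerical companion to that resource, the following sketch checks the defining relation $A v = \lambda v$ for a small symmetric matrix. The matrix `A` is an arbitrary example chosen here, not something taken from the dataset.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eig_vals, eig_vecs = np.linalg.eig(A)

# Column i of eig_vecs pairs with eig_vals[i]; check A v = lambda v for each pair.
for lam, v in zip(eig_vals, eig_vecs.T):
    print(lam, np.allclose(A @ v, lam * v))
```

Note that the eigenvectors are the columns of the returned matrix, not its rows.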

In [75]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
mat = np.array([[2, 1, 3], [4, 2, 5], [-3, -1, 1]])

> **References**  
N.B.: What follows is indebted to http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html#pca-vs-lda

In [69]:
mat_inv = np.linalg.inv(mat)

In [70]:
mat.dot(mat_inv)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [71]:
cars = pd.read_csv('/Users/gdamico/Desktop/cars.csv')
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261 entries, 0 to 260
Data columns (total 8 columns):
mpg             261 non-null float64
 cylinders      261 non-null int64
 cubicinches    261 non-null object
 hp             261 non-null int64
 weightlbs      261 non-null object
 time-to-60     261 non-null int64
 year           261 non-null int64
 brand          261 non-null object
dtypes: float64(1), int64(4), object(3)
memory usage: 16.4+ KB


In [72]:
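# Blank strings mark missing values: swap in a placeholder 0 so the column can be cast to int
cars[' cubicinches'] = cars[' cubicinches'].replace(' ', 0)
cars[' cubicinches'] = cars[' cubicinches'].map(int)
# 261 / 259 corrects the mean for the placeholder zeros (this assumes exactly two of the 261 rows were blank)
cars[' cubicinches'] = cars[' cubicinches'].replace(0, 261 / 259 * cars[' cubicinches'].mean())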

In [73]:
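# Same treatment for weight: placeholder 0 for blanks, cast to int, then fill with the corrected mean
cars[' weightlbs'] = cars[' weightlbs'].replace(' ', 0)
cars[' weightlbs'] = cars[' weightlbs'].map(int)
# 261 / 259 corrects the mean for the placeholder zeros (this assumes exactly two of the 261 rows were blank)
cars[' weightlbs'] = cars[' weightlbs'].replace(0, 261 / 259 * cars[' weightlbs'].mean())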

In [74]:
X = cars.drop([' brand', 'mpg'], axis=1)
X.dtypes

 cylinders        int64
 cubicinches    float64
 hp               int64
 weightlbs      float64
 time-to-60       int64
 year             int64
dtype: object

In [78]:
ss = StandardScaler()
X_scaled = ss.fit_transform(X)
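
For reference, standardization subtracts each column's mean and divides by its standard deviation. `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below reproduces that arithmetic with NumPy on hypothetical stand-in data, not the cars data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features on very different scales (stand-ins, not the cars data).
demo = rng.normal(loc=[6.0, 200.0, 3000.0], scale=[1.5, 100.0, 800.0], size=(261, 3))

# Standardize each column: subtract its mean, divide by its population std (ddof=0).
demo_scaled = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(demo_scaled.mean(axis=0).round(6))  # each column mean is numerically zero
print(demo_scaled.std(axis=0).round(6))   # each column std is one
```

Scaling matters for PCA because the method is driven by variance: without it, a feature measured in thousands would dominate one measured in single digits.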


In [82]:
cov_mat = np.cov(X_scaled.T)
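
As a sanity check on what `np.cov` computes, the sketch below compares it with the explicit formula $Z^\top Z / (n - 1)$ for standardized data $Z$. The random matrix `raw` is a stand-in chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 261
raw = rng.normal(size=(n, 4))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # standardized with ddof=0

cov_np = np.cov(Z.T)            # np.cov treats each row of its input as a variable
cov_manual = Z.T @ Z / (n - 1)  # valid because the columns of Z have zero mean

print(np.allclose(cov_np, cov_manual))
print(np.diag(cov_np))          # each entry is n / (n - 1), slightly above 1
```

The diagonal is slightly above 1 because the scaling divides by the population standard deviation while `np.cov` divides by `n - 1`. This is why the eigenvalues of such a matrix sum to a little more than the number of features.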

In [83]:
np.linalg.eig(cov_mat)

(array([4.33232011, 0.86354026, 0.59656752, 0.13119557, 0.03453943,
        0.06491403]),
 array([[ 0.45262519,  0.16567065,  0.19134957,  0.68172934,  0.4975751 ,
          0.1368691 ],
        [ 0.46651081,  0.13891271,  0.15733722,  0.15852654, -0.81207642,
          0.23176961],
        [ 0.46098113,  0.02694489, -0.13906066, -0.59872616,  0.28849458,
          0.57072793],
        [ 0.44040047,  0.23558892,  0.33158932, -0.36215923,  0.08457073,
         -0.70871877],
        [-0.35222468,  0.15137815,  0.85641321, -0.14254556,  0.05041327,
          0.31099536],
        [-0.21674803,  0.9349401 , -0.27549486, -0.01276263, -0.00529948,
          0.05309351]]))
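
Each eigenvalue measures the variance captured along its eigenvector, so dividing by their sum gives the share of variance explained by each component. The sketch below does this for the eigenvalues printed above, copied by hand into a hypothetical array named `eigvals_shown`. By this arithmetic the first component accounts for roughly 72% of the variance and the first three together for about 96%.

```python
import numpy as np

# Eigenvalues copied from the output above.
eigvals_shown = np.array([4.33232011, 0.86354026, 0.59656752,
                          0.13119557, 0.03453943, 0.06491403])

# Sort in descending order, since np.linalg.eig does not guarantee any ordering.
ordered = np.sort(eigvals_shown)[::-1]
explained_ratio = ordered / ordered.sum()
cumulative = np.cumsum(explained_ratio)

print(explained_ratio.round(3))
print(cumulative.round(3))
```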

In [85]:
eigvals, eigvecs = np.linalg.eig(cov_mat)

In [87]:
eigpairs = [(eigvals[i], eigvecs[:, i]) for i in range(len(eigvals))]
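
One caveat: `np.linalg.eig` does not promise sorted eigenvalues. In the output above the last two (0.0345 and 0.0649) are not in descending order, so taking the first few pairs works here only because the three largest happen to come first. A more defensive sketch sorts the pairs explicitly; the diagonal matrix `cov_demo` is an illustrative stand-in for a covariance matrix.

```python
import numpy as np

# Small symmetric matrix standing in for a covariance matrix (illustrative values).
cov_demo = np.array([[2.0, 0.0, 0.0],
                     [0.0, 5.0, 0.0],
                     [0.0, 0.0, 3.0]])
vals_demo, vecs_demo = np.linalg.eig(cov_demo)

pairs = [(vals_demo[i], vecs_demo[:, i]) for i in range(len(vals_demo))]
# Sort on the eigenvalue only; comparing the arrays themselves would fail on ties.
pairs.sort(key=lambda pair: pair[0], reverse=True)

sorted_vals = [float(val) for val, _ in pairs]
print(sorted_vals)
```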

In [89]:
eigpairs[0][1]

array([ 0.45262519,  0.46651081,  0.46098113,  0.44040047, -0.35222468,
       -0.21674803])

In [91]:
pcabh = np.hstack((eigpairs[0][1].reshape(6, 1),
                   eigpairs[1][1].reshape(6, 1),
                   eigpairs[2][1].reshape(6, 1)))
pcabh

array([[ 0.45262519,  0.16567065,  0.19134957],
       [ 0.46651081,  0.13891271,  0.15733722],
       [ 0.46098113,  0.02694489, -0.13906066],
       [ 0.44040047,  0.23558892,  0.33158932],
       [-0.35222468,  0.15137815,  0.85641321],
       [-0.21674803,  0.9349401 , -0.27549486]])

In [93]:
X_scaled.dot(pcabh)

array([[ 3.28064647e+00, -6.32132091e-01,  6.98091889e-02],
       [-1.86575305e+00,  1.17740757e-01, -1.33836007e+00],
       [ 2.57540388e+00, -1.23106831e+00, -4.29997142e-01],
       [ 3.39369241e+00, -1.06445704e+00, -4.95182374e-01],
       [-2.09719845e+00, -1.99746071e-01, -2.12338673e-01],
       [ 1.58008843e+00,  1.57591418e+00,  9.53495103e-01],
       [ 3.04449786e+00, -2.55838022e-02,  2.98639004e-01],
       [ 4.71432404e+00, -8.68913402e-01, -7.40071994e-01],
       [-1.02541897e+00,  1.09653795e+00,  1.28455646e+00],
       [-2.38656660e+00,  8.17832836e-01, -5.25840665e-01],
       [-2.17210083e+00,  1.15917879e+00, -5.25423984e-01],
       [-1.91708611e+00,  5.67474331e-01, -7.34224814e-01],
       [ 2.33248073e+00,  1.47660076e-01,  4.21652653e-01],
       [ 3.17034033e+00, -3.10730750e-01, -6.41595546e-01],
       [-6.49962723e-01,  1.31224408e+00,  8.32520790e-01],
       [ 2.40374997e+00,  1.61957000e+00,  4.40668777e-01],
       [-2.50821750e+00,  3.95578921e-01

In [92]:
pca = PCA(n_components=3)
pca.fit_transform(X_scaled)

array([[ 3.28064647e+00,  6.32132091e-01,  6.98091889e-02],
       [-1.86575305e+00, -1.17740757e-01, -1.33836007e+00],
       [ 2.57540388e+00,  1.23106831e+00, -4.29997142e-01],
       [ 3.39369241e+00,  1.06445704e+00, -4.95182374e-01],
       [-2.09719845e+00,  1.99746071e-01, -2.12338673e-01],
       [ 1.58008843e+00, -1.57591418e+00,  9.53495103e-01],
       [ 3.04449786e+00,  2.55838022e-02,  2.98639004e-01],
       [ 4.71432404e+00,  8.68913402e-01, -7.40071994e-01],
       [-1.02541897e+00, -1.09653795e+00,  1.28455646e+00],
       [-2.38656660e+00, -8.17832836e-01, -5.25840665e-01],
       [-2.17210083e+00, -1.15917879e+00, -5.25423984e-01],
       [-1.91708611e+00, -5.67474331e-01, -7.34224814e-01],
       [ 2.33248073e+00, -1.47660076e-01,  4.21652653e-01],
       [ 3.17034033e+00,  3.10730750e-01, -6.41595546e-01],
       [-6.49962723e-01, -1.31224408e+00,  8.32520790e-01],
       [ 2.40374997e+00, -1.61957000e+00,  4.40668777e-01],
       [-2.50821750e+00, -3.95578921e-01
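
Comparing the two outputs above, the manual projection and the `PCA` result agree except that the second column has the opposite sign. That is expected: if $v$ is an eigenvector then so is $-v$, so each component is determined only up to sign. The sketch below reproduces the effect with NumPy alone by comparing an eigendecomposition route with an SVD route on synthetic data. The data and the `mixing` matrix are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data standing in for X_scaled (not the cars data).
latent = rng.normal(size=(200, 3))
mixing = np.array([[1.0, 0.5, 0.2, 0.0],
                   [0.0, 1.0, 0.5, 0.2],
                   [0.0, 0.0, 1.0, 0.5]])
demo_X = latent @ mixing
demo_X = (demo_X - demo_X.mean(axis=0)) / demo_X.std(axis=0)

# Route 1: eigenvectors of the covariance matrix, largest eigenvalues first.
vals, vecs = np.linalg.eigh(np.cov(demo_X.T))
top = vecs[:, np.argsort(vals)[::-1][:2]]
scores_eig = demo_X @ top

# Route 2: singular value decomposition of the centered data matrix.
U, S, Vt = np.linalg.svd(demo_X, full_matrices=False)
scores_svd = demo_X @ Vt[:2].T

# The columns agree up to a factor of +1 or -1.
same_up_to_sign = np.allclose(np.abs(scores_eig), np.abs(scores_svd))
signs = np.sign((scores_eig * scores_svd).sum(axis=0))
print(same_up_to_sign, signs)
```

A sign flip changes nothing about the variance captured or the distances between projected points, so either result is an equally valid set of principal component scores.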

In [65]:
np.linalg.eig(X.T.dot(X))

(array([3.53260068e+09, 5.55327692e+07, 4.76177106e+05, 6.86136991e+04,
        6.96841134e+01, 7.47144039e+02]),
 array([[ 1.57262914e-03,  9.18598870e-04, -1.26325211e-02,
         -9.22353065e-03,  9.98433926e-01,  5.36816943e-02],
        [ 5.89313698e-02,  1.29082736e-01, -9.33404935e-01,
         -3.29160452e-01, -1.46500396e-02, -7.66434219e-03],
        [ 3.02013547e-02,  2.86777053e-02, -3.25427062e-01,
          9.40995182e-01,  8.93873324e-03, -8.25279405e-02],
        [ 8.47426044e-01,  5.16053753e-01,  1.24702485e-01,
          4.29483936e-04, -3.65911375e-04,  2.56834607e-03],
        [ 4.06887543e-03, -9.63092035e-03,  3.31177413e-02,
         -7.61597387e-02,  5.32186559e-02, -9.95068527e-01],
        [ 5.26749507e-01, -8.46232924e-01, -7.77520411e-02,
         -1.72017610e-02, -1.67677136e-03,  8.98345581e-03]]))
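
The cell above decomposes `X.T.dot(X)` for the unscaled features, and the leading eigenvector puts most of its weight on the fourth and sixth features (weight and year), which have the largest raw magnitudes. The sketch below shows the same tendency on two hypothetical, independent features with very different scales, and how standardizing removes it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independent hypothetical features; the second has a much larger scale.
small = rng.normal(loc=0.0, scale=1.0, size=300)
large = rng.normal(loc=0.0, scale=1000.0, size=300)
unscaled = np.column_stack([small, large])

vals_u, vecs_u = np.linalg.eigh(unscaled.T @ unscaled)
leading_unscaled = vecs_u[:, -1]  # eigh sorts eigenvalues in ascending order

standardized = (unscaled - unscaled.mean(axis=0)) / unscaled.std(axis=0)
vals_s, vecs_s = np.linalg.eigh(np.cov(standardized.T))

print(np.abs(leading_unscaled).round(3))  # weight sits almost entirely on the large-scale feature
print(vals_s.round(3))                    # after scaling, both eigenvalues are close to 1
```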

## Visualizing updated feature spaces