# Principal Component Analysis
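- PCA reduces the dimensionality of data by projecting it onto a lower-dimensional subspace that keeps as much of the variance as possible.
- In theory PCA only discards information, so it should not increase model performance; in practice performance may still increase, e.g. when the discarded directions carry mostly noise.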

Sources
- https://plot.ly/ipython-notebooks/principal-component-analysis/
- https://www.coursera.org/lecture/machine-learning/choosing-the-number-of-principal-components-S1bq1
- https://stats.stackexchange.com/questions/55034/how-does-pca-improve-the-accuracy-of-a-predictive-model
- https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/

In [1]:
import numpy as np

### Initialize a random X (10 samples, 3 features)

In [2]:
np.random.seed(10)
X = np.random.randint(100, size=(10,3))
X

array([[ 9, 15, 64],
       [28, 89, 93],
       [29,  8, 73],
       [ 0, 40, 36],
       [16, 11, 54],
       [88, 62, 33],
       [72, 78, 49],
       [51, 54, 77],
       [69, 13, 25],
       [13, 92, 86]])

### Column-wise mean of X

In [3]:
X_mean = np.mean(X, axis=0)
X_mean

array([37.5, 46.2, 59. ])

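### Column-wise standard deviation of X

In [4]:
# Population standard deviation (ddof=0) of each column
X_sd = np.std(X, axis=0)
X_sd

array([28.91107054, 31.72317765, 22.17205448])

### Standardization of X

In [5]:
# Center each column and scale it to unit variance
X_std = (X - X_mean) / X_sd
X_std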

array([[-0.98578155, -0.98350803,  0.2255091 ],
       [-0.32859385,  1.34917127,  1.53346186],
       [-0.29400502, -1.20416688,  0.63142547],
       [-1.29708099, -0.1954407 , -1.03734185],
       [-0.74365977, -1.1095988 , -0.2255091 ],
       [ 1.74673573,  0.49805855, -1.17264731],
       [ 1.19331451,  1.00242165, -0.4510182 ],
       [ 0.46694916,  0.24587701,  0.81183275],
       [ 1.08954803, -1.04655342, -1.53346186],
       [-0.84742625,  1.44373935,  1.21774913]])
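
A quick sanity check on the standardization step is to confirm that every column of the result has mean 0 and standard deviation 1. The sketch below rebuilds X from the same seed as the notebook; the names col_means and col_sds are introduced here for illustration only.

```python
import numpy as np

# Same seed and shape as the notebook, intended to reproduce X
np.random.seed(10)
X = np.random.randint(100, size=(10, 3))

# Standardize: subtract the column mean, divide by the column
# (population) standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Each standardized column should have mean 0 and standard deviation 1
col_means = X_std.mean(axis=0)
col_sds = X_std.std(axis=0)
print(np.allclose(col_means, 0.0), np.allclose(col_sds, 1.0))
```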

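### Covariance Matrix of X_std

Because X_std has zero mean and unit variance in every column, its covariance matrix is also the correlation matrix of X.

In [6]:
# Population covariance: divide by n (not n - 1), matching np.std's default
X_cov = np.matmul(X_std.T, X_std) / X_std.shape[0]
X_cov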
X_cov

array([[ 1.        ,  0.17761524, -0.43087725],
       [ 0.17761524,  1.        ,  0.40661502],
       [-0.43087725,  0.40661502,  1.        ]])
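
The matrix above can be cross-checked against NumPy's built-in routines. This is a minimal sketch, assuming the same seed as the notebook; corr_of_X and cov_of_X_std are illustrative names. np.corrcoef is applied to the raw X, and np.cov is applied to X_std with bias=True so that it divides by n as the manual computation does.

```python
import numpy as np

np.random.seed(10)
X = np.random.randint(100, size=(10, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Manual computation, as in the notebook
X_cov = X_std.T @ X_std / X_std.shape[0]

# rowvar=False: columns are variables, rows are observations
corr_of_X = np.corrcoef(X, rowvar=False)
# bias=True: normalize by n instead of n - 1
cov_of_X_std = np.cov(X_std, rowvar=False, bias=True)

print(np.allclose(X_cov, corr_of_X), np.allclose(X_cov, cov_of_X_std))
```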

### Eigenvalues and Eigenvectors of the Covariance Matrix

Each eigenvector gives the direction of a principal component, and its eigenvalue gives the variance of the data along that direction.

In [7]:
# Column i of eig_vecs is the eigenvector belonging to eig_vals[i]
eig_vals, eig_vecs = np.linalg.eig(X_cov)
print("Eigen Values:\n",eig_vals,"\n")
print("Eigen Vectors: \n",eig_vecs)

Eigen Values:
 [0.31222915 1.17725232 1.51051853] 

Eigen Vectors: 
 [[ 0.54483383 -0.68156765 -0.48848914]
 [-0.52653672 -0.73144979  0.43329008]
 [ 0.65262178 -0.02113638  0.75738898]]
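
The eigen-decomposition can be verified directly from its definition: for each pair, X_cov times the eigenvector should equal the eigenvalue times the eigenvector. The sketch below checks that, and also compares against np.linalg.eigh, which is designed for symmetric matrices and returns eigenvalues in ascending order. It assumes the same seed as the notebook; residual and eigh_vals are illustrative names.

```python
import numpy as np

np.random.seed(10)
X = np.random.randint(100, size=(10, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_cov = X_std.T @ X_std / X_std.shape[0]

eig_vals, eig_vecs = np.linalg.eig(X_cov)

# Definition check: X_cov @ v_i - lambda_i * v_i should be ~0 for every column i
residual = X_cov @ eig_vecs - eig_vecs * eig_vals
print(np.allclose(residual, 0.0))

# For a symmetric matrix the eigenvectors form an orthonormal set
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(3)))

# The eigenvalues sum to the trace, i.e. the total variance (3 here)
print(np.isclose(eig_vals.sum(), np.trace(X_cov)))

# eigh exploits symmetry and returns eigenvalues sorted ascending
eigh_vals, _ = np.linalg.eigh(X_cov)
print(np.allclose(np.sort(eig_vals), eigh_vals))
```

Eigenvectors are only defined up to sign, so eig and eigh may return vectors that point in opposite directions; this does not change the subspace they span.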


In [8]:
# Pair each eigenvalue with its eigenvector (column i of eig_vecs)
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
eig_pairs

[(0.3122291500261839, array([ 0.54483383, -0.52653672,  0.65262178])),
 (1.1772523178708882, array([-0.68156765, -0.73144979, -0.02113638])),
 (1.5105185321029295, array([-0.48848914,  0.43329008,  0.75738898]))]

In [9]:
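# Sort by eigenvalue only, largest first. A plain sort() compares whole tuples,
# so two equal eigenvalues would make it compare the arrays and raise an error.
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)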
eig_pairs

[(1.5105185321029295, array([-0.48848914,  0.43329008,  0.75738898])),
 (1.1772523178708882, array([-0.68156765, -0.73144979, -0.02113638])),
 (0.3122291500261839, array([ 0.54483383, -0.52653672,  0.65262178]))]
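
An alternative to sorting a list of pairs is to sort with index arrays, which keeps the eigenvalues and eigenvectors as NumPy arrays. This is a sketch of that variant, not the notebook's own method; order, sorted_vals, and sorted_vecs are illustrative names.

```python
import numpy as np

np.random.seed(10)
X = np.random.randint(100, size=(10, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_cov = X_std.T @ X_std / X_std.shape[0]
eig_vals, eig_vecs = np.linalg.eig(X_cov)

# argsort gives ascending order; reverse it for largest-first
order = np.argsort(eig_vals)[::-1]
sorted_vals = eig_vals[order]
# Reorder the columns so column i still matches sorted_vals[i]
sorted_vecs = eig_vecs[:, order]

print(sorted_vals)
```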

### Projection Matrix
The projection matrix is built from the top two principal components (the eigenvectors with the two largest eigenvalues), stacked as columns.

In [10]:
p_matrix = np.hstack((eig_pairs[0][1].reshape(3,1),
                      eig_pairs[1][1].reshape(3,1)))
p_matrix

array([[-0.48848914, -0.68156765],
       [ 0.43329008, -0.73144979],
       [ 0.75738898, -0.02113638]])

### X_std projected onto 2D

In [11]:
X_std_dim = np.matmul(X_std, p_matrix)
X_std_dim

array([[ 0.22619741,  1.38649711],
       [ 1.90652417, -0.79530393],
       [ 0.10009939,  1.06782588],
       [-0.23674382,  1.04892915],
       [-0.28830654,  1.3232367 ],
       [-1.52560775, -1.53003788],
       [-0.49017803, -1.53701278],
       [ 0.49330966, -0.51526333],
       [-2.14712071,  0.05531241],
       [ 1.96182623, -0.50418333]])
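
One way to see what the projection achieves is to look at the covariance of the projected data: the two new columns should be uncorrelated, with variances equal to the two largest eigenvalues. The sketch below checks this, using np.linalg.eigh in place of np.linalg.eig for convenience (a swap relative to the notebook); X_proj and proj_cov are illustrative names.

```python
import numpy as np

np.random.seed(10)
X = np.random.randint(100, size=(10, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_cov = X_std.T @ X_std / X_std.shape[0]

# eigh returns ascending eigenvalues; flip to get largest first
vals, vecs = np.linalg.eigh(X_cov)
vals, vecs = vals[::-1], vecs[:, ::-1]

# Top two eigenvectors as columns, then project
p_matrix = vecs[:, :2]
X_proj = X_std @ p_matrix

# Covariance of the projected data (population normalization, as before)
proj_cov = X_proj.T @ X_proj / X_proj.shape[0]
print(np.round(proj_cov, 6))
```

The off-diagonal entries come out as (numerically) zero, and the diagonal holds the variances along the first and second principal components.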

### Approximation of X_std reconstructed from 2D

In [12]:
X_std_approx = np.matmul(X_std_dim, p_matrix.T)
X_std_approx

array([[-1.05548656, -0.91614392,  0.1420139 ],
       [-0.38926292,  1.4078029 ,  1.46079024],
       [-0.77669304, -0.73768894,  0.0532442 ],
       [-0.59926939, -0.86981775, -0.20147772],
       [-0.76104071, -1.09280156, -0.24632863],
       [ 1.78806714,  0.45811517, -1.12313904],
       [ 1.28702483,  0.91185839, -0.33876856],
       [ 0.11021041,  0.59063543,  0.3845181 ],
       [ 1.011146  , -0.97078436, -1.62737467],
       [-0.61469576,  1.21882463,  1.49652217]])

### Information loss after applying PCA

The loss is the squared reconstruction error as a percentage of the total squared magnitude of X_std.

In [14]:
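# Squared error between the standardized data and its 2D reconstruction
num = np.sum((X_std - X_std_approx)**2)
# Total squared magnitude (proportional to the total variance) of X_std
den = np.sum((X_std)**2)
loss = num*100 / den
print("Loss : {0:.2f}%".format(loss))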


Loss : 10.41%
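
The loss reported above is not a coincidence of this dataset: it equals the share of total variance carried by the discarded eigenvalue, 0.31222915 / 3 ≈ 10.41%. The sketch below reproduces that relationship using the X printed earlier in the notebook (hard-coded here so the numbers do not depend on the random generator). It uses np.linalg.eigh for convenience, and explained and loss_pct are illustrative names.

```python
import numpy as np

# X as printed in the notebook
X = np.array([[ 9, 15, 64], [28, 89, 93], [29,  8, 73], [ 0, 40, 36],
              [16, 11, 54], [88, 62, 33], [72, 78, 49], [51, 54, 77],
              [69, 13, 25], [13, 92, 86]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_cov = X_std.T @ X_std / X_std.shape[0]

# Eigenvalues sorted largest first
vals, vecs = np.linalg.eigh(X_cov)
vals, vecs = vals[::-1], vecs[:, ::-1]

# Fraction of total variance carried by each component
explained = vals / vals.sum()

# Keep two components, reconstruct, and measure the relative squared error
p_matrix = vecs[:, :2]
X_std_approx = X_std @ p_matrix @ p_matrix.T
loss_pct = 100 * np.sum((X_std - X_std_approx) ** 2) / np.sum(X_std ** 2)

print("Explained variance ratio:", np.round(explained, 4))
print("Loss : {0:.2f}%".format(loss_pct))
```

This also gives a practical rule for choosing the number of components: keep enough of the largest eigenvalues that their cumulative share of the total reaches the amount of variance you want to retain.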
