# Principal Component Analysis

### Assumptions
* The data matrix X stores attributes (dimensions) in rows and samples in columns.

In [240]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

We will use the data matrix below throughout the development of the algorithm to demonstrate and test the intermediate functions we define.

In [241]:
# sample data matrix for testing/demonstration
# (no random seed is set, so rerunning this cell produces a different matrix)

A = np.random.randint(10, size=(3,5))

print(A)
print(A.shape)

[[0 2 6 6 2]
 [9 5 4 6 1]
 [9 3 5 4 8]]
(3, 5)
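
Because the matrix is drawn at random, the outputs in the rest of this notebook only reproduce if the same values are used. One option is to hard-code the matrix printed above. The name `A_fixed` is introduced here purely for illustration and is not part of the original notebook.

```python
import numpy as np

# The exact matrix printed above, hard-coded so later outputs can be reproduced.
# A_fixed is an illustrative name (an assumption of this sketch).
A_fixed = np.array([[0, 2, 6, 6, 2],
                    [9, 5, 4, 6, 1],
                    [9, 3, 5, 4, 8]])

# 3 attributes (rows) by 5 samples (columns), matching the assumption stated at the top.
print(A_fixed.shape)
```

Assigning `A = A_fixed` before running the later cells should give the same intermediate results as shown here.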


## Step 1: Centering the Data Matrix

Per the algorithm for PCA in section 10.6 of the book, we must first center the data matrix so that each dimension has a mean of 0. We achieve this by computing the mean of each dimension (row) and then subtracting that row's mean from every element in the row.

In [242]:
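# D attributes (rows) and `samples` observations (columns)
D, samples = A.shape
# mean of each attribute across all samples, shape (D,)
rowmeans = np.mean(A, axis=1)
# repeat each row mean `samples` times and reshape, so row i is filled with the mean of row i
offsetmatrix = np.repeat(rowmeans, samples, axis=0).reshape((D,samples))
centered = A - offsetmatrix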

print(centered)

[[-3.2 -1.2  2.8  2.8 -1.2]
 [ 4.   0.  -1.   1.  -4. ]
 [ 3.2 -2.8 -0.8 -1.8  2.2]]
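
The explicit offset matrix built with `np.repeat` can also be expressed through NumPy broadcasting. The sketch below is an equivalent formulation, not the notebook's own code. It assumes the matrix printed earlier and uses the illustrative names `A_fixed` and `centered_check`, then confirms that every row mean is numerically zero.

```python
import numpy as np

# Matrix printed earlier in this notebook; A_fixed is an illustrative name.
A_fixed = np.array([[0, 2, 6, 6, 2],
                    [9, 5, 4, 6, 1],
                    [9, 3, 5, 4, 8]])

# keepdims=True keeps the means as a (3, 1) column, so they broadcast across the columns.
row_means = A_fixed.mean(axis=1, keepdims=True)
centered_check = A_fixed - row_means

# Every row of a correctly centered matrix should average to (numerically) zero.
print(np.allclose(centered_check.mean(axis=1), 0.0))
```

Broadcasting avoids materializing the full offset matrix, which matters little at this size but keeps the code shorter.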


## Step 2: Standardization

The next step in the PCA algorithm standardizes the data by dividing each centered row by that row's standard deviation, so that every dimension has unit variance. Note that `np.std` computes the population standard deviation (`ddof=0`) by default. We show this below.

In [243]:
rowstds = np.std(centered, axis=1)
standardized = (centered.T / rowstds).T

print(standardized)

[[-1.33333333 -0.5         1.16666667  1.16666667 -0.5       ]
 [ 1.53392998  0.         -0.38348249  0.38348249 -1.53392998]
 [ 1.38218948 -1.2094158  -0.34554737 -0.77748158  0.95025527]]
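
A quick sanity check on the standardization step, written as a sketch under the same assumptions as before (`A_fixed` holds the matrix printed earlier, and all names are illustrative). The zero-spread guard is an addition of this sketch: a constant row has a standard deviation of 0 and cannot be scaled.

```python
import numpy as np

# Matrix printed earlier in this notebook; A_fixed is an illustrative name.
A_fixed = np.array([[0, 2, 6, 6, 2],
                    [9, 5, 4, 6, 1],
                    [9, 3, 5, 4, 8]])

centered_check = A_fixed - A_fixed.mean(axis=1, keepdims=True)

# Population standard deviation of each row (np.std defaults to ddof=0).
row_stds = centered_check.std(axis=1, keepdims=True)

# Guard added for this sketch: a constant row would cause a division by zero.
if np.any(row_stds == 0):
    raise ValueError("cannot standardize a row with zero standard deviation")

standardized_check = centered_check / row_stds

# After standardization each row should have mean 0 and (population) std 1.
print(np.allclose(standardized_check.mean(axis=1), 0.0))
print(np.allclose(standardized_check.std(axis=1), 1.0))
```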


## Step 3: Eigendecomposition of the Covariance Matrix

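We first compute the covariance matrix of the centered and standardized data array, and then compute the eigendecomposition of that covariance matrix. The diagonal entries are 1.25 rather than 1 because `np.cov` defaults to the sample estimator (dividing by N - 1 = 4), while the rows were standardized with the population standard deviation (dividing by N = 5), giving 5/4 = 1.25.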

In [244]:
covmatrix = np.cov(standardized)
print(covmatrix)

[[ 1.25       -0.31956875 -0.75588487]
 [-0.31956875  1.25        0.12422941]
 [-0.75588487  0.12422941  1.25      ]]
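
To make the covariance computation concrete, the sketch below rebuilds it by hand from the standardized matrix and compares it with `np.cov`. It assumes the matrix printed earlier (`A_fixed`, an illustrative name) and is a check, not a replacement for the cell above. It also shows that using `ddof=0` consistently yields the correlation matrix of the original data, with a unit diagonal.

```python
import numpy as np

# Matrix printed earlier in this notebook; A_fixed is an illustrative name.
A_fixed = np.array([[0, 2, 6, 6, 2],
                    [9, 5, 4, 6, 1],
                    [9, 3, 5, 4, 8]])

# Center and standardize each row (population std, ddof=0).
Z = A_fixed - A_fixed.mean(axis=1, keepdims=True)
Z = Z / Z.std(axis=1, keepdims=True)
n_samples = Z.shape[1]

# Sample covariance of zero-mean rows: Z Z^T / (N - 1).
cov_manual = Z @ Z.T / (n_samples - 1)
print(np.allclose(cov_manual, np.cov(Z)))

# The diagonal equals N / (N - 1) because of the ddof mismatch between std and cov.
print(np.allclose(np.diag(cov_manual), n_samples / (n_samples - 1)))

# With ddof=0 in both places, the result is the correlation matrix of the raw data.
corr = np.cov(Z, ddof=0)
print(np.allclose(corr, np.corrcoef(A_fixed)))
```

The ddof mismatch only rescales the covariance matrix by a constant, so it changes the eigenvalues by the same factor and leaves the eigenvectors unchanged.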


In [258]:
# Compute the eigenvalues and unit-length eigenvectors of the covariance matrix of the standardized data
# np.linalg.eigh returns the eigenvalues in ascending order, with the matching eigenvectors as columns

res = np.linalg.eigh(covmatrix)

# flip the order so the eigenvalues and vectors are sorted in descending order based on eigenvalue
eigvals = np.flip(res[0])
eigvecs = np.flip(res[1], axis=1)

print(eigvals)
print(eigvecs)

[2.12055179 1.16188943 0.46755879]
[[-0.6844767  -0.10734127  0.7210891 ]
 [ 0.34305938 -0.92017176  0.18866426]
 [ 0.64327436  0.37651267  0.66666056]]
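
The sketch below checks the defining properties of the decomposition: each column `v` satisfies `C v = lambda v`, the eigenvectors are orthonormal, and the eigenvalues sum to the trace of the covariance matrix. It assumes the matrix printed earlier, and all names (`A_fixed`, `C`, `vals`, `vecs`) are illustrative. The final line computes the share of total variance carried by each component, which is one common way to choose how many components to keep.

```python
import numpy as np

# Matrix printed earlier in this notebook; A_fixed is an illustrative name.
A_fixed = np.array([[0, 2, 6, 6, 2],
                    [9, 5, 4, 6, 1],
                    [9, 3, 5, 4, 8]])

# Center, standardize (ddof=0), and form the sample covariance as in the steps above.
Z = A_fixed - A_fixed.mean(axis=1, keepdims=True)
Z = Z / Z.std(axis=1, keepdims=True)
C = np.cov(Z)

# eigh returns ascending eigenvalues; reverse both outputs to get descending order.
vals, vecs = np.linalg.eigh(C)
vals, vecs = vals[::-1], vecs[:, ::-1]

# C @ v_i should equal lambda_i * v_i for every column v_i.
print(np.allclose(C @ vecs, vecs * vals))

# Eigenvectors of a symmetric matrix returned by eigh are orthonormal.
print(np.allclose(vecs.T @ vecs, np.eye(3)))

# The eigenvalues sum to the trace of C (the total variance).
print(np.isclose(vals.sum(), np.trace(C)))

# Fraction of the total variance explained by each component, largest first.
explained_ratio = vals / vals.sum()
print(explained_ratio)
```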


In [259]:
DESIRED_COMPONENTS = 2

# keep the eigenvectors of the DESIRED_COMPONENTS largest eigenvalues as the columns of B
B = eigvecs[:, 0:DESIRED_COMPONENTS]
print(B)

[[-0.6844767  -0.10734127]
 [ 0.34305938 -0.92017176]
 [ 0.64327436  0.37651267]]


In [247]:
# cross-check against scikit-learn, which expects one sample per row, hence the transpose
pca = PCA(n_components=DESIRED_COMPONENTS, svd_solver='full')
pca.fit(standardized.T)
print(pca.get_covariance())

[[ 1.25       -0.31956875 -0.75588487]
 [-0.31956875  1.25        0.12422941]
 [-0.75588487  0.12422941  1.25      ]]


In [262]:
# project a test point with scikit-learn (transform expects one sample per row)
x = np.array([1,1,1])
pca.transform(x.reshape(1,-1))

array([[ 0.30185704, -0.65100036]])

In [260]:
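# the same projection with our own basis B: coordinates of x in the principal subspace
# (the standardized data has zero mean, so no offset is subtracted; component signs can differ in general)
B.T @ x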

array([ 0.30185704, -0.65100036])

In [261]:
# map the 2-D coordinates back to the original 3-D space (the reconstruction of x)
B @ (B.T @ x)

array([-0.1367349 ,  0.70258703, -0.05093299])
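
The last two cells project a point onto the principal subspace and map it back. The sketch below repeats that with NumPy only and adds two checks: the reconstruction residual is orthogonal to the columns of the basis, and the same basis projects the whole standardized data set at once. It assumes the matrix printed earlier, and the names `A_fixed`, `B_check`, `code`, and `recon` are illustrative.

```python
import numpy as np

# Matrix printed earlier in this notebook; A_fixed is an illustrative name.
A_fixed = np.array([[0, 2, 6, 6, 2],
                    [9, 5, 4, 6, 1],
                    [9, 3, 5, 4, 8]])

# Center, standardize (ddof=0), and decompose the covariance as in the steps above.
Z = A_fixed - A_fixed.mean(axis=1, keepdims=True)
Z = Z / Z.std(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh(np.cov(Z))

# Columns of B_check: eigenvectors of the two largest eigenvalues.
B_check = vecs[:, ::-1][:, :2]

x = np.array([1.0, 1.0, 1.0])
code = B_check.T @ x        # 2-D coordinates (signs depend on the eigenvector signs)
recon = B_check @ code      # reconstruction in 3-D (independent of eigenvector signs)

# The part of x lost by the projection is orthogonal to the retained directions.
residual = x - recon
print(np.allclose(B_check.T @ residual, 0.0))

# Project every sample (column of Z) at once: the result has shape (2, 5).
projected = B_check.T @ Z
print(projected.shape)
```

The reconstruction does not depend on the sign chosen for each eigenvector, because `B_check @ B_check.T` is unchanged when a column is negated. The 2-D coordinates themselves can flip sign between libraries or runs.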