# Principal Component Analysis (PCA) Algorithm

In this section, we walk through an implementation of the PCA algorithm outlined in section 10.6 of the textbook. The algorithm consists of four main steps:

1. Centering the data set samples around the origin
2. Standardizing the data set by dividing each attribute by its standard deviation
3. Computing the eigendecomposition of the covariance matrix of the centered, standardized data set and choosing the number of dimensions to keep
4. Projecting vectors onto the resulting principal subspace to approximate them in the original space


#### Assumptions

* The data matrix X is structured such that rows are attributes and columns are samples
* The number of rows in data matrix X is less than the number of columns
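
Data is often stored the other way around, with one sample per row. The following is a minimal sketch of rearranging such a table into the layout assumed above. The `DataFrame` and its column names (`attr_1`, `attr_2`, `attr_3`) are hypothetical and simply reuse the demonstration values from this section.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one row per sample, one column per attribute
df = pd.DataFrame({
    "attr_1": [2, 8, 2, 1, 5],
    "attr_2": [8, 7, 2, 2, 6],
    "attr_3": [4, 0, 5, 0, 4],
})

# Transpose so that rows are attributes and columns are samples
X = df.to_numpy().T

# Check both assumptions: 2-D layout and fewer attributes than samples
n_attributes, n_samples = X.shape
print(X.shape, n_attributes < n_samples)
```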

In [26]:
# import libraries

import numpy as np
import pandas as pd
import sklearn.decomposition  # a plain `import sklearn` may not load the decomposition submodule
import sys

sys.path.append('../src/')
from PCA import PCA

We will use the data matrix below throughout the development of the algorithm to demonstrate intermediate steps. Following the assumptions above, its 3 rows are attributes and its 5 columns are samples.

In [27]:
# sample data matrix for testing/demonstration

A = np.array([
    [2,8,2,1,5],
    [8,7,2,2,6],
    [4,0,5,0,4]
])

print(A)
print(A.shape)

[[2 8 2 1 5]
 [8 7 2 2 6]
 [4 0 5 0 4]]
(3, 5)


## Step 1: Centering the Data Matrix

Per the algorithm for PCA in section 10.6 of the book, we must first center the data matrix so that each dimension has a mean of 0. We achieve this by computing the mean of each dimension (row) and then subtracting that mean from every element of the row.

In [28]:
D, samples = A.shape
rowmeans = np.mean(A, axis=1)
offsetmatrix = np.repeat(rowmeans, samples, axis=0).reshape((D,samples))
centered = A - offsetmatrix

print(centered)

[[-1.6  4.4 -1.6 -2.6  1.4]
 [ 3.   2.  -3.  -3.   1. ]
 [ 1.4 -2.6  2.4 -2.6  1.4]]
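
The same centering can also be written with NumPy broadcasting instead of building an explicit offset matrix. This is a minimal sketch of an equivalent alternative, not the code used in the rest of this section; the names `rowmeans_col` and `centered_bc` are introduced here only for illustration.

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

# keepdims=True keeps the means as a (3, 1) column, which broadcasts across the 5 samples
rowmeans_col = A.mean(axis=1, keepdims=True)
centered_bc = A - rowmeans_col

# Every row of the centered matrix should now have a mean of (numerically) zero
print(np.allclose(centered_bc.mean(axis=1), 0.0))
```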


## Step 2: Standardization

The next step in the PCA algorithm standardizes the data matrix by dividing each row (attribute) by its standard deviation, so that every attribute is measured on the same scale. We show this below.

In [29]:
rowstds = np.std(centered, axis=1)
standardized = (centered.T / rowstds).T

print(standardized)

[[-0.62092042  1.70753116 -0.62092042 -1.00899568  0.54330537]
 [ 1.18585412  0.79056942 -1.18585412 -1.18585412  0.39528471]
 [ 0.64993368 -1.2070197   1.11417203 -1.2070197   0.64993368]]
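
As a quick sanity check, each standardized row should have a standard deviation of 1. The sketch below recomputes the step with broadcasting and verifies this; it assumes the same demonstration matrix and uses `np.std` with its default `ddof=0`.

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

centered = A - A.mean(axis=1, keepdims=True)

# Population standard deviation of each row (ddof=0 is NumPy's default)
rowstds_col = centered.std(axis=1, keepdims=True)
standardized = centered / rowstds_col

# After standardization every row has zero mean and unit standard deviation
print(np.allclose(standardized.mean(axis=1), 0.0))
print(np.allclose(standardized.std(axis=1), 1.0))
```

An attribute that is constant across all samples would have a standard deviation of 0, and the division would then produce `nan` or `inf` values, so such attributes need to be handled before this step.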


## Step 3: Eigendecomposition of the Covariance Matrix

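We must first find the covariance matrix of the centered and standardized data array. We then compute the eigendecomposition of this covariance matrix. Following this, we can choose a desired number of dimensions to use for dimensionality reduction. Note that `np.cov` normalizes by N - 1 by default while `np.std` in Step 2 normalized by N, so with N = 5 samples the diagonal entries below are 5/4 = 1.25 rather than 1.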

In [30]:
covmatrix = np.cov(standardized)
print(covmatrix)

[[ 1.25        0.69030098 -0.39635072]
 [ 0.69030098  1.25        0.04587658]
 [-0.39635072  0.04587658  1.25      ]]
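
Because the standardized rows already have zero mean, the covariance matrix can also be written out by hand as a matrix product. The sketch below, which assumes the same demonstration matrix, computes it this way and compares the result with `np.cov`; the name `manual_cov` is introduced here for illustration.

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

centered = A - A.mean(axis=1, keepdims=True)
standardized = centered / centered.std(axis=1, keepdims=True)
n_samples = standardized.shape[1]

# Sample covariance of zero-mean rows: Z Z^T / (N - 1)
manual_cov = standardized @ standardized.T / (n_samples - 1)

print(np.allclose(manual_cov, np.cov(standardized)))
print(np.diag(manual_cov))  # each entry is N / (N - 1)
```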


In [31]:
# Compute the eigenvalues and unit-length eigenvectors of the covariance matrix of the standardized data
# np.linalg.eigh returns the eigenvalues in ascending order, with the matching eigenvectors as columns

res = np.linalg.eigh(covmatrix)

# flip the order so the eigenvalues and vectors are sorted in descending order based on eigenvalue
eigvals = np.flip(res[0])
eigvecs = np.flip(res[1], axis=1)

print(eigvals)
print(eigvecs)

[2.026786  1.2895867 0.4336273]
[[ 0.71551818 -0.02918716 -0.69798413]
 [ 0.61644272  0.4964592   0.61116826]
 [-0.32868237  0.86756923 -0.3732178 ]]
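
The decomposition can be checked against the defining properties of an eigendecomposition of a symmetric matrix. The following sketch assumes the same demonstration matrix and verifies that each column satisfies C v = λ v, that the eigenvectors are orthonormal, and that the eigenvalues sum to the trace of the covariance matrix.

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

centered = A - A.mean(axis=1, keepdims=True)
standardized = centered / centered.std(axis=1, keepdims=True)
covmatrix = np.cov(standardized)

# eigh returns ascending eigenvalues; reverse both outputs for descending order
eigvals, eigvecs = np.linalg.eigh(covmatrix)
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

# C v_j = lambda_j v_j for every column j (eigvals broadcasts across columns)
print(np.allclose(covmatrix @ eigvecs, eigvecs * eigvals))
# Eigenvectors of a symmetric matrix form an orthonormal set
print(np.allclose(eigvecs.T @ eigvecs, np.eye(3)))
# The eigenvalues sum to the trace, i.e. the total variance
print(np.isclose(eigvals.sum(), np.trace(covmatrix)))
```

Each eigenvector is only determined up to sign, so a different solver or platform may return some columns negated. The projection in Step 4 is unaffected by such sign flips.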


We show below how we execute dimensionality reduction. We simply choose as many eigenvectors (in descending order of eigenvalue) as the number of dimensions we desire and place them as the columns of a matrix. We call this matrix B.

In [32]:
DESIRED_DIMENSIONS = 2

B = eigvecs[:, 0:DESIRED_DIMENSIONS]
print(B)

[[ 0.71551818 -0.02918716]
 [ 0.61644272  0.4964592 ]
 [-0.32868237  0.86756923]]
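
Here the number of dimensions is fixed by hand. One common alternative is to keep the smallest number of components whose eigenvalues account for a chosen fraction of the total variance. The sketch below illustrates that idea; the helper `choose_dimensions` and the 0.85 threshold are assumptions made for illustration and are not part of the algorithm as presented in this section. The eigenvalues are copied from the output printed above.

```python
import numpy as np

def choose_dimensions(eigvals, threshold):
    """Hypothetical helper: smallest k whose top-k eigenvalues explain
    at least `threshold` of the total variance (eigvals sorted descending)."""
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off when threshold is close to 1
    return min(k, len(eigvals))

# Eigenvalues in descending order, as printed earlier in this section
eigvals = np.array([2.026786, 1.2895867, 0.4336273])

print(np.cumsum(eigvals) / np.sum(eigvals))  # cumulative explained-variance ratio
print(choose_dimensions(eigvals, 0.85))
```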


## Step 4: Projection

Using our dimension-reducing matrix, we can take vectors from the original space and project them onto the principal subspace, which has fewer dimensions than the original space. To express this projection in the original space, we multiply each component by that attribute's original standard deviation and add back its mean.

In [33]:
# x = a vector from the original space

x = np.array([1,1,1])


# x_standardized = x transformed into centered, standardized space

x_standardized = (x - rowmeans) / rowstds


# x_principal = the vector obtained by transforming x_standardized into the principal subspace

x_principal = B @ (B.T @ x_standardized)


# x_approx = x_principal transformed back into the original space ~> An approximation of x after dimension reduction

x_approx = (x_principal * rowstds) + rowmeans

print("x =", x)
print("x_standardized =", x_standardized)
print("x_principal =", x_principal)
print("x_approx =", x_approx)

x = [1 1 1]
x_standardized = [-1.00899568 -1.58113883 -0.74278135]
x_principal = [-0.99842797 -1.59039212 -0.73713071]
x_approx = [1.02723108 0.97659083 1.01217185]


We see that in this example the approximation is quite close to the original sample after dimension reduction: each component of x_approx is within 0.03 of the corresponding component of x.
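
The projection can be split into two stages to make the reduction explicit: `B.T @ x_standardized` yields the low-dimensional coordinates, and multiplying by `B` maps them back into the standardized space. The following is a minimal sketch under the same setup as above; the names `z` and `error` are introduced here for illustration.

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])
x = np.array([1, 1, 1])

rowmeans = A.mean(axis=1)
rowstds = A.std(axis=1)
standardized = (A - rowmeans[:, None]) / rowstds[:, None]

# Top-2 eigenvectors (descending eigenvalue order) as the columns of B
_, eigvecs = np.linalg.eigh(np.cov(standardized))
B = eigvecs[:, ::-1][:, :2]

x_standardized = (x - rowmeans) / rowstds
z = B.T @ x_standardized                      # 2 coordinates in the principal subspace
x_principal = B @ z                           # back in the 3-D standardized space
x_approx = x_principal * rowstds + rowmeans   # back in the original units

# Euclidean distance between x and its approximation
error = np.linalg.norm(x - x_approx)
print(z.shape, error)
```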

## Implementation

Now that we have illustrated the steps, we show the implementation of this procedure as a Python class.

```python
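import numpy as np


class PCA:
    """PCA on a data matrix X whose rows are attributes and whose columns are samples."""

    def __init__(self, X, dimensions):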
        # center the data matrix
        D, samples = X.shape
        self.rowmeans = np.mean(X, axis=1)
        centered = X - np.repeat(self.rowmeans, samples, axis=0).reshape((D,samples))

        # standardize the centered data matrix
        self.rowstds = np.std(centered, axis=1)
        standardized = (centered.T / self.rowstds).T

        # compute the covariance matrix
        self.covmatrix = np.cov(standardized)

        # compute the eigendecomposition of the covariance matrix
        res = np.linalg.eigh(self.covmatrix)
        eigvals = np.flip(res[0])
        eigvecs = np.flip(res[1], axis=1)

        # compute B
        self.B = eigvecs[:, 0:dimensions]

    def reducer(self, x):
        x_standardized = (x - self.rowmeans) / self.rowstds
        x_principal = self.B @ (self.B.T @ x_standardized)
        x_reduced = (x_principal * self.rowstds) + self.rowmeans
        return x_reduced

    def covariance_matrix(self):
        return self.covmatrix
```

As a check, we compare the class against scikit-learn's `PCA`, fit on the same standardized data (transposed, since scikit-learn expects samples as rows). The two covariance matrices agree. The reconstructions differ because the scikit-learn cell passes `x` in its original units to a model fit on standardized data, whereas `reducer` standardizes `x` first.


In [34]:


pca = sklearn.decomposition.PCA(n_components=DESIRED_DIMENSIONS, svd_solver='full')
pca.fit(standardized.T)
print(pca.get_covariance())
print(pca.inverse_transform(pca.transform(x.reshape(1,-1))))


[[ 1.25        0.69030098 -0.39635072]
 [ 0.69030098  1.25        0.04587658]
 [-0.39635072  0.04587658  1.25      ]]
[[0.6789038  1.28115797 0.82830725]]


In [35]:
pca_ = PCA(A, DESIRED_DIMENSIONS)
print(pca_.covariance_matrix())
print(pca_.reducer(x))

[[ 1.25        0.69030098 -0.39635072]
 [ 0.69030098  1.25        0.04587658]
 [-0.39635072  0.04587658  1.25      ]]
[1.02723108 0.97659083 1.01217185]
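
For a like-for-like comparison of the reconstructions, the vector handed to scikit-learn should be standardized first and the result mapped back to the original units afterwards. The sketch below assumes this is the intended comparison; it recomputes the manual projection rather than importing the class, and the alias `SklearnPCA` is used only to avoid a name clash with the class above.

```python
import numpy as np
from sklearn.decomposition import PCA as SklearnPCA

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])
x = np.array([1, 1, 1])

rowmeans = A.mean(axis=1)
rowstds = A.std(axis=1)
standardized = (A - rowmeans[:, None]) / rowstds[:, None]

# scikit-learn expects samples as rows, hence the transpose
sk = SklearnPCA(n_components=2, svd_solver="full")
sk.fit(standardized.T)

# Standardize x, project and reconstruct, then undo the standardization
x_std = (x - rowmeans) / rowstds
x_sk = sk.inverse_transform(sk.transform(x_std.reshape(1, -1)))[0]
x_sk_original_units = x_sk * rowstds + rowmeans

# Manual projection, following the steps in this section
_, eigvecs = np.linalg.eigh(np.cov(standardized))
B = eigvecs[:, ::-1][:, :2]
x_manual = (B @ (B.T @ x_std)) * rowstds + rowmeans

print(np.allclose(x_sk_original_units, x_manual))
```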
