# Mini-project 1: Principal Component Analysis (PCA)

## Algorithm Development Steps

In this section, we walk through the steps behind implementing the PCA algorithm outlined in section 10.6 of the textbook. The algorithm will consist of four main steps:

1. Centering the original data set samples around the origin
2. Standardizing the data set based on standard deviation
3. Computing the eigendecomposition of the covariance matrix from the centered, standardized data set and choosing the number of dimensions to keep
4. Projecting samples onto the resulting principal subspace to approximate them in the original space


#### Assumptions

* The data matrix X is structured such that rows are attributes and columns are samples
* The number of rows in data matrix X is less than the number of columns

In [12]:
# import libraries

import numpy as np
import pandas as pd
import sklearn.decomposition as skl
import sys

sys.path.append('../src/')
from PCA import PCA

We use the data matrix below throughout the development of the algorithm to demonstrate intermediate steps.

In [13]:
# sample data matrix for testing/demonstration

A = np.array([
    [2,8,2,1,5],
    [8,7,2,2,6],
    [4,0,5,0,4]
])

print(A)
print(A.shape)

[[2 8 2 1 5]
 [8 7 2 2 6]
 [4 0 5 0 4]]
(3, 5)


### Step 1: Centering the Data Matrix

Per the algorithm for PCA in section 10.6 of the book, we must first center the data matrix so that each dimension has a mean of 0. We achieve this by computing the mean of each dimension (row) and then subtracting that mean from every element in the row.

In [14]:
D, samples = A.shape
rowmeans = np.mean(A, axis=1)
offsetmatrix = np.repeat(rowmeans, samples, axis=0).reshape((D,samples))
centered = A - offsetmatrix

print(centered)

[[-1.6  4.4 -1.6 -2.6  1.4]
 [ 3.   2.  -3.  -3.   1. ]
 [ 1.4 -2.6  2.4 -2.6  1.4]]
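The same centering can also be written with NumPy broadcasting instead of an explicit offset matrix. The sketch below is an alternative formulation, not the notebook's own code; the names `rowmeans_col` and `centered_bc` are introduced here purely for illustration.

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

# keepdims=True keeps the row means as a (3, 1) column, so NumPy broadcasts
# the subtraction across all 5 samples without building an offset matrix
rowmeans_col = A.mean(axis=1, keepdims=True)
centered_bc = A - rowmeans_col

print(centered_bc)

# sanity check: every row of the centered matrix has (numerically) zero mean
print(np.allclose(centered_bc.mean(axis=1), 0.0))
```

Broadcasting avoids materializing a second D-by-samples array, which may matter for larger data sets.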


### Step 2: Standardization

The next step in the PCA algorithm standardizes each attribute (row) of the centered data matrix by dividing it by that attribute's standard deviation, as shown below.

In [15]:
rowstds = np.std(centered, axis=1)
standardized = (centered.T / rowstds).T

print(standardized)

[[-0.62092042  1.70753116 -0.62092042 -1.00899568  0.54330537]
 [ 1.18585412  0.79056942 -1.18585412 -1.18585412  0.39528471]
 [ 0.64993368 -1.2070197   1.11417203 -1.2070197   0.64993368]]
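One practical caveat is that an attribute with zero spread would cause a division by zero in this step. The sketch below wraps centering and standardization in a hypothetical helper, `standardize_rows` (not part of the original code), that leaves constant rows at zero. Whether that is the right policy depends on the data set, so treat it as one possible convention.

```python
import numpy as np


def standardize_rows(X):
    """Center each row and scale it to unit population standard deviation.

    Hypothetical helper for illustration: rows with zero spread are left
    at zero instead of being divided by zero.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=1, keepdims=True)
    stds = centered.std(axis=1, keepdims=True)    # ddof=0, NumPy's default
    safe_stds = np.where(stds == 0, 1.0, stds)    # avoid 0/0 for constant rows
    return centered / safe_stds


A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

Z = standardize_rows(A)
print(Z.std(axis=1))  # each row now has unit standard deviation

# a constant attribute stays at zero rather than becoming NaN
constant_case = standardize_rows(np.array([[3, 3, 3], [1, 2, 3]]))
print(constant_case)
```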


### Step 3: Eigendecomposition of the Covariance Matrix

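We first compute the covariance matrix of the centered and standardized data array, and then compute its eigendecomposition. Following this, we can choose a desired number of dimensions to use for dimensionality reduction. Note that the diagonal entries below are 1.25 rather than 1: np.std normalizes by the number of samples n = 5 by default, while np.cov normalizes by n - 1 = 4, which gives 5/4 = 1.25.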

In [16]:
covmatrix = np.cov(standardized)
print(covmatrix)

[[ 1.25        0.69030098 -0.39635072]
 [ 0.69030098  1.25        0.04587658]
 [-0.39635072  0.04587658  1.25      ]]
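The following sketch recomputes the covariance matrix by hand to make the default normalization of `np.cov` explicit. It is a numerical check under the notebook's own setup (rows are attributes, population standard deviation for scaling); the variable names `cov_manual` and `cov_biased` are introduced here for illustration.

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

centered = A - A.mean(axis=1, keepdims=True)
standardized = centered / centered.std(axis=1, keepdims=True)  # ddof=0
n = standardized.shape[1]

# np.cov treats rows as variables and divides by n - 1 by default
cov_manual = standardized @ standardized.T / (n - 1)
print(np.allclose(cov_manual, np.cov(standardized)))

# dividing by n instead (bias=True) yields the correlation matrix of A,
# whose diagonal is exactly 1
cov_biased = np.cov(standardized, bias=True)
print(np.allclose(cov_biased, np.corrcoef(A)))

# with the default normalization the diagonal is n / (n - 1)
print(np.diag(np.cov(standardized)))
```

The two normalizations differ only by the constant factor n / (n - 1), so they share the same eigenvectors and the choice does not change the principal directions.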


In [17]:
# Compute the eigenvalues and unit-length eigenvectors of the covariance matrix of the standardized data
# np.linalg.eigh returns the eigenvalues in ascending order, with the eigenvectors as matching columns

res = np.linalg.eigh(covmatrix)

# flip the order so the eigenvalues and vectors are sorted in descending order based on eigenvalue
eigvals = np.flip(res[0])
eigvecs = np.flip(res[1], axis=1)

print(eigvals)
print(eigvecs)

[2.026786  1.2895867 0.4336273]
[[ 0.71551818 -0.02918716 -0.69798413]
 [ 0.61644272  0.4964592   0.61116826]
 [-0.32868237  0.86756923 -0.3732178 ]]
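As a sanity check on the decomposition, the sketch below verifies the defining properties of the result: each column satisfies the eigenvalue equation, the columns are orthonormal, and the eigenvalues sum to the trace of the covariance matrix. The covariance matrix is copied from the printed output above, so values agree only up to the printed precision. The sign of each eigenvector is arbitrary and may differ between platforms.

```python
import numpy as np

# covariance matrix copied from the printed output above (rounded to 8 decimals)
covmatrix = np.array([
    [1.25, 0.69030098, -0.39635072],
    [0.69030098, 1.25, 0.04587658],
    [-0.39635072, 0.04587658, 1.25],
])

vals, vecs = np.linalg.eigh(covmatrix)  # ascending eigenvalues
eigvals = vals[::-1]                    # reorder to descending
eigvecs = vecs[:, ::-1]                 # keep columns aligned with eigvals

# each column v_i satisfies C v_i = lambda_i v_i
print(np.allclose(covmatrix @ eigvecs, eigvecs * eigvals))

# the eigenvectors of a symmetric matrix form an orthonormal basis
print(np.allclose(eigvecs.T @ eigvecs, np.eye(3)))

# the eigenvalues sum to the trace, i.e. the total variance
print(eigvals.sum(), np.trace(covmatrix))
```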


We show below how we execute dimensionality reduction. We choose as many eigenvectors as the number of dimensions we desire, taking them in descending order of eigenvalue, and we arrange them as the columns of a matrix. We call this matrix B.

In [18]:
DESIRED_DIMENSIONS = 2

B = eigvecs[:, 0:DESIRED_DIMENSIONS]
print(B)

[[ 0.71551818 -0.02918716]
 [ 0.61644272  0.4964592 ]
 [-0.32868237  0.86756923]]
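The notebook fixes the number of dimensions by hand. One common heuristic for choosing it, which is an assumption added here and not taken from the textbook section, is to keep the smallest number of components whose eigenvalues account for a chosen fraction of the total variance. The threshold of 0.85 below is an arbitrary illustrative value.

```python
import numpy as np

# eigenvalues copied from the printed output above, in descending order
eigvals = np.array([2.026786, 1.2895867, 0.4336273])

# fraction of the total variance carried by each principal direction
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)
print(explained_ratio)
print(cumulative)

# hypothetical threshold: smallest N capturing at least 85% of the variance
THRESHOLD = 0.85
n_components = int(np.searchsorted(cumulative, THRESHOLD) + 1)
print(n_components)
```

For this data matrix the first two components carry roughly 88% of the variance, which is consistent with the choice of two dimensions used in the notebook.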


### Step 4: Projection

Using our dimension-reducing matrix, we can take vectors from the original space and project them onto the principal subspace, which has fewer dimensions than the original space. To express this projection in the original space, we multiply each vector component by the corresponding original standard deviation and add the corresponding mean.

In [19]:
# x = a vector from the original space

x = np.array([1,1,1])


# x_standardized = x transformed into centered, standardized space

x_standardized = (x - rowmeans) / rowstds


# x_principal = the vector obtained by transforming x_standardized into the principal subspace

x_principal = B @ (B.T @ x_standardized)


# x_approx = x_principal transformed back into the original space ~> An approximation of x after dimension reduction

x_approx = (x_principal * rowstds) + rowmeans

print("x =", x)
print("x_standardized =", x_standardized)
print("x_principal =", x_principal)
print("x_approx =", x_approx)

x = [1 1 1]
x_standardized = [-1.00899568 -1.58113883 -0.74278135]
x_principal = [-0.99842797 -1.59039212 -0.73713071]
x_approx = [1.02723108 0.97659083 1.01217185]


In this example, the approximation stays within 0.03 of the original sample in every component, even though we reduced from three dimensions to two.
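To see how the approximation depends on the number of retained dimensions, the sketch below repeats the projection for N = 1, 2, 3 and records the Euclidean distance between x and its approximation. It re-derives the intermediate quantities so that it runs on its own; the dictionaries `approximations` and `errors` are names introduced here for illustration.

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

rowmeans = A.mean(axis=1)
rowstds = A.std(axis=1)
standardized = (A - rowmeans[:, None]) / rowstds[:, None]

# eigenvectors sorted by descending eigenvalue
vals, vecs = np.linalg.eigh(np.cov(standardized))
eigvecs = vecs[:, ::-1]

x = np.array([1, 1, 1])
x_standardized = (x - rowmeans) / rowstds

approximations = {}
errors = {}
for n in range(1, 4):
    B = eigvecs[:, :n]
    # project onto the n-dimensional principal subspace, then undo standardization
    x_approx = (B @ (B.T @ x_standardized)) * rowstds + rowmeans
    approximations[n] = x_approx
    errors[n] = np.linalg.norm(x - x_approx)
    print(n, x_approx, errors[n])
```

With N equal to the full dimension, B is a square orthogonal matrix, so the projection is the identity and the sample is recovered up to floating-point rounding.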

## Algorithm Implementation

Now that we have illustrated the steps for the algorithm, we show the implementation of this procedure as a Python class.

```python
import numpy as np


# Python class implementing PCA dimension reduction
class PCA:
    # initialize the PCA class with a given data set X (required)
    # optionally supply N, the number of reduction dimensions
    def __init__(self, X, N=1):
        # set the class variables
        self.X = X
        self.N = N

        # center the data matrix
        D, samples = X.shape
        self.rowmeans = np.mean(X, axis=1)
        self.centered = X - np.repeat(self.rowmeans, samples, axis=0).reshape((D,samples))

        # standardize the centered data matrix
        self.rowstds = np.std(self.centered, axis=1)
        self.standardized = (self.centered.T / self.rowstds).T

        # compute the covariance matrix
        self.covmatrix = np.cov(self.standardized)

        # compute the eigendecomposition of the covariance matrix
        res = np.linalg.eigh(self.covmatrix)
        self.eigvals = np.flip(res[0])
        self.eigvecs = np.flip(res[1], axis=1)

        # compute B
        self.B = self.eigvecs[:, 0:self.N]

    # set N, the number of dimensions to reduce to
    def set_N(self, N):
        # set N, recompute B
        self.N = N
        self.B = self.eigvecs[:, 0:self.N]

    # center and standardize variance to 1
    def standardize_sample(self, x):
        return (x - self.rowmeans) / self.rowstds

    # shift sample back to original data space
    def unstandardize_sample(self, x):
        return (x * self.rowstds) + self.rowmeans

    # return the covariance matrix of the centered, standardized data
    def get_covariance_matrix(self):
        return self.covmatrix

    # transform a standardized sample of D dimensions into N dimensions
    def transform_reduce(self, x):
        return self.B.T @ x

    # transform a dimension-reduced sample of N dimensions into D dimensions
    # the result is centered and standardized
    def transform_inverse(self, z):
        return self.B @ z

    # perform end-to-end transformation
    # centers, standardizes, reduces, inverts, and unstandardizes
    # this function takes a sample and "approximates" it using PCA with given N
    def transform(self, x):
        x_standardized = self.standardize_sample(x)
        x_principal = self.transform_inverse(self.transform_reduce(x_standardized))
        x_transformed = self.unstandardize_sample(x_principal)
        return x_transformed
```

### Implementation Testing

Now that we have an implementation, we can test it against the scikit-learn implementation of PCA to confirm that both give the same result on a test sample. The two differ in preprocessing: our implementation centers and standardizes the data set before computing the covariance matrix and its eigendecomposition, whereas scikit-learn's PCA centers the data but does not scale it to unit variance. It also expects samples as rows rather than columns. Hence, we fit scikit-learn's PCA on the transposed standardized data set, standardize the sample point before transforming it, and undo the standardization afterwards. Below, we show the results of dimension reduction on the sample x=(1,1,1) defined above. Both implementations compute the same covariance matrix and the same transformed sample.

In [20]:
# Our implementation of PCA dimension reduction

pca = PCA(A, N=DESIRED_DIMENSIONS)
print(pca.get_covariance_matrix())
print(pca.transform(x))

[[ 1.25        0.69030098 -0.39635072]
 [ 0.69030098  1.25        0.04587658]
 [-0.39635072  0.04587658  1.25      ]]
[1.02723108 0.97659083 1.01217185]


In [21]:
# scikit-learn implementation of PCA dimension reduction
# scikit-learn's PCA centers the data but does not scale it to unit variance
# Hence, we fit it on the standardized data set, transposed so that rows are samples

pca_skl = skl.PCA(n_components=DESIRED_DIMENSIONS, svd_solver='full')
pca_skl.fit(standardized.T)
print(pca_skl.get_covariance())
print(pca.unstandardize_sample(pca_skl.inverse_transform(pca_skl.transform(pca.standardize_sample(x).reshape(1,-1)))))

[[ 1.25        0.69030098 -0.39635072]
 [ 0.69030098  1.25        0.04587658]
 [-0.39635072  0.04587658  1.25      ]]
[[1.02723108 0.97659083 1.01217185]]


## Analysis of Triathlon Data Set

We have obtained a data set from https://www.kaggle.com/mpwolke/wired-differently-triathlon/data. This data represents results from an Ironman 70.3 triathlon race that took place in 2019. We want to perform PCA on the data set and compare PCA dimension reduction with SVD dimension reduction. We also want to explore the correlations among the sub-events (swim, bike, run) and overall time across the race participants.

### PCA and SVD Dimension Reduction

We will start by performing PCA and SVD dimension reduction on each sample in our data set. We will vary the number of dimensions and observe the accuracy of each method. We use the SVD implementation provided by the scikit-learn library, consistent with the SVD algorithm outlined in section 10.4 of the book.
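Before turning to the triathlon data, the relationship between the two methods can be sketched on the small demonstration matrix A from earlier. The example below uses np.linalg.svd rather than the scikit-learn routine mentioned above, so it is a stand-in for illustration only. It assumes the same preprocessing as the PCA steps (centering and population-standard-deviation scaling).

```python
import numpy as np

A = np.array([
    [2, 8, 2, 1, 5],
    [8, 7, 2, 2, 6],
    [4, 0, 5, 0, 4],
])

centered = A - A.mean(axis=1, keepdims=True)
standardized = centered / centered.std(axis=1, keepdims=True)
n = standardized.shape[1]

# thin SVD: standardized = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(standardized, full_matrices=False)

# squared singular values divided by n - 1 are the eigenvalues of np.cov(standardized)
eig_from_svd = s**2 / (n - 1)
print(eig_from_svd)

# rank-k approximation: keep only the k largest singular values
k = 2
approx_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# the Frobenius-norm error equals the norm of the discarded singular values
svd_error = np.linalg.norm(standardized - approx_k)
print(svd_error, np.sqrt(np.sum(s[k:] ** 2)))

# projecting with the top-k left singular vectors gives the same approximation,
# which is the PCA projection B @ B.T applied to every sample at once
B = U[:, :k]
print(np.allclose(B @ (B.T @ standardized), approx_k))
```

On standardized data the two approaches span the same subspace, so any difference observed on the real data set would come from preprocessing choices rather than from the decomposition itself.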