# Linear Algebra
> Principal Component Analysis

- toc: true
- branch: true
- badges: true
- comments: true
- author: Gui Osorio

Principal Component Analysis (PCA for short) is a technique used to reduce the dimensionality of data. Let's say we have a dataset of dimension 100x250 (100 rows, 250 columns), and we want to use it to train a machine learning model. With more features than samples, training on every column of this dataset is impractical and inefficient.

To overcome this, we can use PCA, which will reduce the dimensionality of our data and facilitate building our ML model.

## PCA from scratch

Let's go through the Principal Component Analysis algorithm step-by-step using only NumPy.

In [1]:
#collapse-hide
import numpy as np

First, let's define a 3x2 matrix M:

In [2]:
M = np.array([[1,2], [3,4], [5,6]])
M

array([[1, 2],
       [3, 4],
       [5, 6]])

Our first step is to calculate the mean of each column.

In [3]:
col_means = np.mean(M, axis=0)
# axis=0 averages down the rows, giving one mean per column; axis=1 would give one mean per row

Next, we need to center the column values by subtracting the respective mean.

In [4]:
centered_M = M - col_means
centered_M

array([[-2., -2.],
       [ 0.,  0.],
       [ 2.,  2.]])

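Once we have the centered matrix, we can calculate its covariance matrix. We pass the transpose because, by default, np.cov treats each row as a variable and each column as an observation.

A covariance matrix tells us how each pair of columns varies together: the diagonal holds the variance of each column, and the off-diagonal entries hold the covariance between two columns. Note that these are covariances, not correlation scores, so they are not scaled to the range [-1, 1].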

In [5]:
covar = np.cov(centered_M.T)
covar

array([[4., 4.],
       [4., 4.]])
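To make the covariance step less of a black box, here is a minimal sketch that rebuilds the same matrix from the centered data, assuming the sample covariance (division by n - 1), which is the np.cov default. The name manual_covar is introduced here for illustration only.

```python
import numpy as np

M = np.array([[1, 2], [3, 4], [5, 6]])
centered_M = M - np.mean(M, axis=0)

# Sample covariance of centered data X: (X^T X) / (n - 1)
n_samples = centered_M.shape[0]
manual_covar = centered_M.T @ centered_M / (n_samples - 1)

print(manual_covar)
# Compare against NumPy's result from the step above
print(np.allclose(manual_covar, np.cov(centered_M.T)))
```

Since np.cov subtracts the mean internally, calling np.cov(M, rowvar=False) on the uncentered matrix should give the same values without the explicit transpose.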

Next, we need to calculate the eigendecomposition of this covariance matrix, and store the resulting eigenvectors and eigenvalues.

Eigenvectors represent the directions of the new axes onto which we project the data, whereas the eigenvalues represent how much variance the data has along each of those directions.

Eigenvalues close to 0 correspond to directions that carry almost no variance, so they are not relevant. When performing PCA, we choose the k largest eigenvalues to keep; their eigenvectors are the most relevant components of the features (called principal components).

In [6]:
eig_vals, eig_vecs = np.linalg.eig(covar)
print(eig_vals) # eigenvalues close to 0 represent components which are not relevant
print('--------------------------------')
print(eig_vecs)

[8. 0.]
--------------------------------
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
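In this example the largest eigenvalue happens to come first, but np.linalg.eig does not guarantee any particular order. A minimal sketch of how the eigenpairs could be sorted and how much variance each component explains (the names order and explained_ratio are illustrative, not part of the original notebook):

```python
import numpy as np

M = np.array([[1, 2], [3, 4], [5, 6]])
covar = np.cov(M, rowvar=False)
eig_vals, eig_vecs = np.linalg.eig(covar)

# Sort eigenvalues from largest to smallest and reorder the
# eigenvectors (stored as columns) to match
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

# Fraction of the total variance captured by each component
explained_ratio = eig_vals / eig_vals.sum()
print(explained_ratio)
```

Here the first component should account for essentially all of the variance, which is one way to decide how many components k to keep.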


Finally, it's time to project our data onto the eigenvectors. This corresponds to the dot product of the transpose of our eigenvectors and the transpose of the centered matrix calculated earlier, transposed back so that each row is again one sample.

In [7]:
transformed_M = eig_vecs.T.dot(centered_M.T).T
transformed_M

array([[-2.82842712,  0.        ],
       [ 0.        ,  0.        ],
       [ 2.82842712,  0.        ]])

This result indicates that we can transform our original 3x2 matrix into a 3x1 matrix without losing information, since the second column is all zeros (this could have already been concluded after noticing an eigenvalue of 0).
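The reduction to a 3x1 matrix can be sketched by keeping only the eigenvector with the largest eigenvalue. The snippet below also maps the reduced data back to the original space to check how much was lost; k, W, reduced_M and reconstructed_M are names chosen for this illustration.

```python
import numpy as np

M = np.array([[1, 2], [3, 4], [5, 6]])
col_means = np.mean(M, axis=0)
centered_M = M - col_means

eig_vals, eig_vecs = np.linalg.eig(np.cov(centered_M.T))

# Keep the k eigenvectors with the largest eigenvalues
k = 1
top_k = np.argsort(eig_vals)[::-1][:k]
W = eig_vecs[:, top_k]                 # shape (2, k)

# Project onto the kept component(s): 3x2 -> 3x1
reduced_M = centered_M @ W

# Map back to the original space and undo the centering
reconstructed_M = reduced_M @ W.T + col_means

print(reduced_M.shape)
print(np.allclose(reconstructed_M, M))
```

For this particular matrix the reconstruction should match M, because the discarded component has an eigenvalue of 0. On real data the match would only be approximate.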

## PCA with scikit-learn

Now, let's make things easy and use scikit-learn to help in the process.

In [8]:
#collapse-hide
from sklearn.decomposition import PCA

We can easily achieve the same results with fewer lines of code using scikit-learn!

First, let's create a PCA object and fit it to our matrix.

In [10]:
# Initialize
pca = PCA(n_components=2)

# Fit
pca.fit(M)

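After fitting the model, we have access to all of the values calculated in the NumPy approach. Two differences are worth keeping in mind when comparing the outputs: components_ stores each eigenvector as a row rather than a column, and eigenvectors are only defined up to sign, so some entries may be negated. Tiny values such as 2.2e-16 are floating-point rounding error and should be read as 0.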

In [14]:
# Eigenvalues
print('EIGENVALUES')
print(pca.explained_variance_)
print('--------------------------------')

# Eigenvectors
print('EIGENVECTORS')
print(pca.components_)
print('--------------------------------')

# Transformed data
transformed_M2 = pca.transform(M)
print('TRANSFORMED MATRIX')
print(transformed_M2)

EIGENVALUES
[8.00000000e+00 2.25080839e-33]
--------------------------------
EIGENVECTORS
[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]
--------------------------------
TRANSFORMED MATRIX
[[-2.82842712e+00  2.22044605e-16]
 [ 0.00000000e+00  0.00000000e+00]
 [ 2.82842712e+00 -2.22044605e-16]]
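Since the second component carries no variance, we could also ask scikit-learn for a single component directly. This is a minimal sketch using n_components=1, fit_transform, explained_variance_ratio_ and inverse_transform from scikit-learn's PCA; the variable names pca_1, reduced_M2 and restored_M are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

M = np.array([[1, 2], [3, 4], [5, 6]])

# Keep only the first principal component: 3x2 -> 3x1
pca_1 = PCA(n_components=1)
reduced_M2 = pca_1.fit_transform(M)

print(reduced_M2.shape)
# Share of the total variance kept by the retained component
print(pca_1.explained_variance_ratio_)

# Map the reduced data back to the original 2 columns
restored_M = pca_1.inverse_transform(reduced_M2)
print(np.allclose(restored_M, M))
```

The signs in reduced_M2 may differ from the NumPy result, but the absolute values should agree.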


**You have now been introduced to a valuable dimensionality reduction technique: Principal Component Analysis!**