# Principal Component Analysis (PCA) - From Scratch

Principal Component Analysis (PCA) is a widely used machine learning method for dimensionality reduction. It uses simple matrix operations from linear algebra and statistics to project the original data into the same number of dimensions or fewer. In other words, data with m columns (features) is projected into a subspace with m or fewer columns while retaining as much of the variance in the original data as possible.

PCA is an operation applied to a dataset, represented by an n x m matrix A, that results in a projection of A, which we will call B.

`B = PCA(A)`

### Step 1
The first step is to calculate the mean value of each column. With np.mean(), this means averaging over the rows, either with np.mean(A, axis=0) or, equivalently, with np.mean(A.T, axis=1).

`M = mean(A)`

### Step 2
Next, we need to center the values in each column by subtracting the mean column value.

`C = A - M`

### Step 3
The next step is to calculate the covariance matrix of the centered matrix C. The covariance matrix is an m x m matrix holding the covariance of every column with every other column. Its diagonal entries are each column's covariance with itself, that is, its variance.

`V = cov(C)`
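
As a minimal sketch of what `cov` computes, assuming the sample covariance with an n - 1 denominator (NumPy's default), the covariance matrix of centered data can be written as a single matrix product and compared against `np.cov`. The example matrix is the one used later in this notebook, and the names `V_manual` and `V_numpy` are only illustrative.

```python
import numpy as np

# example data: 3 samples (rows), 2 features (columns)
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
C = A - A.mean(axis=0)              # center each column
n = C.shape[0]                      # number of samples

# sample covariance as a matrix product, shape m x m
V_manual = C.T @ C / (n - 1)

# rowvar=False tells NumPy that columns, not rows, are the variables
V_numpy = np.cov(C, rowvar=False)

print(V_manual)
print(np.allclose(V_manual, V_numpy))
```

Passing `rowvar=False` has the same effect as transposing the input before calling `np.cov`.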

### Step 4
Next, we calculate the eigendecomposition of the covariance matrix V. This results in a list of eigenvalues and a list of eigenvectors. The eigenvectors represent the directions or components for the reduced subspace of B, whereas the eigenvalues represent the magnitudes for those directions, that is, the variance of the data along each one.

`eigenvalues, eigenvectors = eig(V)` 
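
The defining property of an eigenpair, V v = λ v, can be checked numerically. The sketch below assumes the 2 x 2 covariance matrix from this notebook's example data and uses `np.linalg.eigh`, a NumPy routine for symmetric matrices that returns eigenvalues in ascending order. It is one possible choice, not the only one.

```python
import numpy as np

# covariance matrix of the example data used in this notebook
V = np.array([[4.0, 4.0], [4.0, 4.0]])

# eigh assumes a symmetric input; eigenvalues are returned in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(V)

# each column of `eigenvectors` is one eigenvector
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    # check V v = lambda v up to floating-point tolerance
    print(np.allclose(V @ v, eigenvalues[i] * v))

print(eigenvalues)
```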

### Step 5
The eigenvectors can be sorted by their eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for A. If all eigenvalues have a similar value, then the existing representation may already be reasonably compressed or dense and the projection may offer little. If there are eigenvalues close to zero, they represent components or axes of B that may be discarded. A total of m or fewer components must be selected to comprise the chosen subspace. Ideally, we would select the k eigenvectors, called principal components, that have the k largest eigenvalues.

`B = select(eigenvalues, eigenvectors)`
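
One way the `select` step might look in NumPy is sketched below: sort the eigenpairs by descending eigenvalue with `np.argsort`, then keep the first k columns. The value `k = 1` and the variable names `order` and `explained` are assumptions made for illustration.

```python
import numpy as np

# covariance matrix of the example data used in this notebook
V = np.array([[4.0, 4.0], [4.0, 4.0]])
eigenvalues, eigenvectors = np.linalg.eig(V)

# indices that sort the eigenvalues from largest to smallest
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # reorder the columns to match

k = 1                                   # illustrative choice of subspace size
B = eigenvectors[:, :k]                 # m x k matrix of principal components

# fraction of the total variance carried by each component
explained = eigenvalues / eigenvalues.sum()

print(B.shape)
print(explained)
```

The `explained` fractions give a simple basis for choosing k: components whose share is close to zero contribute little to the projection.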

### Note
Other matrix decomposition methods can be used, such as Singular-Value Decomposition (SVD). In that case, the values are generally referred to as singular values and the vectors of the subspace as principal components.

The procedure described in Steps 1 to 5 is called the covariance method for calculating the PCA, although there are alternative ways to calculate it.
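
As a rough sketch of the SVD route, assuming the data is centered first, the singular values s of C relate to the eigenvalues of the covariance matrix by λ = s² / (n - 1), and the rows of the returned `Vt` give the principal directions up to sign. The names `eigenvalues_from_svd` and `components` are illustrative.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
C = A - A.mean(axis=0)              # center each column
n = C.shape[0]

# thin SVD of the centered data: C = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# singular values map to covariance eigenvalues
eigenvalues_from_svd = s ** 2 / (n - 1)

# rows of Vt are the principal directions (signs may differ from eig)
components = Vt

print(eigenvalues_from_svd)
print(components)
```

This route avoids forming the covariance matrix explicitly.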

### Result
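Once chosen, data can be projected into the subspace via matrix multiplication. Here C is the centered original data that we wish to project, B^T is the transpose of the matrix of chosen principal components (the B selected in Step 5, not the final projection), and P is the projection of C. Because C stores one sample per row, it is transposed so that the shapes line up, and P then holds one sample per column.

`P = B^T . C^T`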
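
The shapes involved are easy to get wrong, so here is a small sketch of the projection under the assumption that samples are stored as rows (C is n x m) and eigenvectors as columns (B is m x k). In that layout the product can be written either as B^T applied to C^T, giving one sample per column, or as C times B, giving one sample per row. The names `P_cols` and `P_rows` are illustrative.

```python
import numpy as np

# centered example data, n x m (3 samples, 2 features)
C = np.array([[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]])

# eigenvectors of the example covariance matrix as columns, m x k
B = np.array([[0.70710678, -0.70710678],
              [0.70710678,  0.70710678]])

P_cols = B.T @ C.T    # k x n: one projected sample per column
P_rows = C @ B        # n x k: one projected sample per row

print(P_rows.shape)
print(np.allclose(P_cols.T, P_rows))
```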


## Import libraries

In [1]:
import numpy as np
from sklearn.decomposition import PCA

## Manually execute PCA

In [2]:
# define a matrix
A = np.array([[1, 2], [3, 4], [5, 6]])
print(A)

[[1 2]
 [3 4]
 [5 6]]


In [3]:
# Step 1: calculate the mean of each column
M = np.mean(A, axis=0)
print(M)

[3. 4.]


In [4]:
# Step 2: center columns by subtracting column means
C = A - M
print(C)

[[-2. -2.]
 [ 0.  0.]
 [ 2.  2.]]


In [5]:
# Step 3: calculate covariance matrix of centered matrix
V = np.cov(C.T)
print(V)

[[4. 4.]
 [4. 4.]]


In [6]:
# Step 4: eigendecomposition of covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(V)
print(eigenvectors)
print(eigenvalues)

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
[8. 0.]


In [7]:
# Step 5: select k eigenvectors (here all of them)
B = eigenvectors
print(B)

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]


In [8]:
# Result: project data
P = B.T.dot(C.T)
print(P.T)

[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]


The second column of the projection is all zeros, matching the zero second eigenvalue, so only the first eigenvector is required. This suggests that we could project our 3×2 matrix onto a 3×1 matrix with essentially no loss.
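
To illustrate the point, the following sketch keeps only the eigenvector with the largest eigenvalue, projects the data onto it, and then maps the 3×1 result back to the original space. The names `B1`, `P1` and `A_restored` are illustrative, and the reconstruction step (multiplying by the transposed components and adding the mean back) is an addition to the steps described above.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
M = A.mean(axis=0)                  # column means
C = A - M                           # centered data

eigenvalues, eigenvectors = np.linalg.eig(np.cov(C, rowvar=False))

# keep only the eigenvector with the largest eigenvalue, as an m x 1 matrix
top = np.argmax(eigenvalues)
B1 = eigenvectors[:, [top]]

P1 = C @ B1                         # 3 x 1 projection
A_restored = P1 @ B1.T + M          # map back to the original 2-D space

print(P1.shape)
print(np.allclose(A_restored, A))
```

For this dataset the reconstruction matches the original matrix up to floating-point error, because the discarded component carries no variance.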

## Reusable Principal Component Analysis

We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily. When creating the class, the number of components can be specified as a parameter.

The class is first fit on a dataset by calling the fit() method, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() method. Once fit, the eigenvalues and principal components can be accessed on the PCA instance via the explained_variance_ and components_ attributes. Note that components_ stores one principal component per row, so it is the transpose of the eigenvector matrix returned by np.linalg.eig() in the manual example.

In [9]:
# define a matrix
A = np.array([[1, 2], [3, 4], [5, 6]])
print(A)

[[1 2]
 [3 4]
 [5 6]]


In [10]:
# create the PCA instance
pca = PCA(2)

In [11]:
# fit on data
pca.fit(A)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [12]:
# access values and vectors
print(pca.components_)
print(pca.explained_variance_)

[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]
[8. 0.]


In [13]:
# transform data
B = pca.transform(A)
print(B)

[[-2.82842712e+00 -2.22044605e-16]
 [ 0.00000000e+00  0.00000000e+00]
 [ 2.82842712e+00  2.22044605e-16]]
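
The reuse described above can be sketched as follows: the PCA instance is fit once and then applied to rows it has not seen. The array `new_rows` is made-up data for illustration, and `n_components=1` is an assumed choice based on the zero second eigenvalue. `explained_variance_ratio_` is a fitted attribute that reports each component's share of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

A = np.array([[1, 2], [3, 4], [5, 6]])

# keep a single component and fit it on the original data
pca = PCA(n_components=1)
pca.fit(A)

# made-up unseen samples; transform() centers them with the fitted mean
new_rows = np.array([[7, 8], [2, 3]])
projected = pca.transform(new_rows)

print(projected.shape)
print(pca.explained_variance_ratio_)
```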
