## Principal Component Analysis
### What Is Principal Component Analysis?

Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality
of data. It can be thought of as a projection method in which data with m columns (features) is
projected into a subspace with m or fewer columns while retaining as much of the variance of
the original data as possible.

\begin{equation*}
A = 
\begin{pmatrix}
a_{1,1} & a_{1,2} \\
a_{2,1} & a_{2,2} \\
a_{3,1} & a_{3,2}
\end{pmatrix}
\end{equation*}

Given an n x m matrix A, such as the 3 x 2 matrix above, we want to calculate its projection B:

$$B = PCA(A)$$

The first step is to calculate the mean values of each column.

$$M = mean(A)$$

Next, we need to center the values in each column by subtracting the mean column value.

$$C = A - M$$

The next step is to calculate the covariance matrix of the centered matrix C.

$$V = cov(C)$$
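As a worked illustration of the steps so far, suppose A holds the values 1 to 6 (an assumed example, chosen to match the matrix used in the NumPy code later in this section). With n = 3 rows, and assuming the sample covariance with an n - 1 divisor (the default of NumPy's cov()), the mean, centered matrix and covariance matrix work out as:

\begin{equation*}
A =
\begin{pmatrix}
1 & 2 \\
3 & 4 \\
5 & 6
\end{pmatrix},
\quad
M =
\begin{pmatrix}
3 & 4
\end{pmatrix},
\quad
C = A - M =
\begin{pmatrix}
-2 & -2 \\
0 & 0 \\
2 & 2
\end{pmatrix}
\end{equation*}

\begin{equation*}
V = \frac{1}{n-1} C^{T} C = \frac{1}{2}
\begin{pmatrix}
8 & 8 \\
8 & 8
\end{pmatrix}
=
\begin{pmatrix}
4 & 4 \\
4 & 4
\end{pmatrix}
\end{equation*}

Both columns vary identically in this example, so all of the variance lies along a single direction.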

Finally, we calculate the eigendecomposition of the covariance matrix V. This results in a
list of eigenvalues and a list of eigenvectors.

$$values, vectors = eig(V)$$

We then select the k eigenvectors that have the k largest eigenvalues. These eigenvectors are
called the principal components.

$$B = select(values, vectors)$$
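The selection step can be sketched in NumPy as follows. This is a minimal illustration, not a library routine: `select_components` is a hypothetical helper name, the data values are assumed for the example, and `eigh` is used here instead of `eig` on the assumption that the covariance matrix is symmetric.

```python
import numpy as np

def select_components(values, vectors, k):
    """Hypothetical helper: keep the k eigenvectors with the largest eigenvalues."""
    order = np.argsort(values)[::-1]  # indices sorted from largest to smallest eigenvalue
    top = order[:k]
    # eigenvectors are stored as columns, so select columns
    return values[top], vectors[:, top]

# assumed example data
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
C = A - A.mean(axis=0)               # center each column
V = np.cov(C, rowvar=False)          # columns are the variables
values, vectors = np.linalg.eigh(V)  # eigh returns eigenvalues in ascending order

top_values, B = select_components(values, vectors, k=1)
print(top_values)
print(B.shape)
```

Sorting explicitly matters because `eig` makes no promise about the order of the eigenvalues it returns.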

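Other matrix decomposition methods can be used, such as Singular-Value Decomposition,
or SVD. In that case the values are generally referred to as singular values (the eigenvalues of V
equal the squared singular values of C divided by n - 1) and the vectors of the subspace are
referred to as principal components. Once chosen, the centered data can be projected into the
subspace via matrix multiplication. Each column of P is one projected sample, so the transpose
of P has one row per sample.

$$P = B^{T} \cdot C^{T}$$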
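The SVD route mentioned above can be sketched as follows. This is an illustrative sketch under assumed example data, not a reference implementation. It keeps samples as rows, so the projection is computed as `C @ Vt.T`, and it checks numerically that the squared singular values divided by n - 1 match the eigenvalues of the covariance matrix.

```python
import numpy as np

# assumed example data
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
n = A.shape[0]
C = A - A.mean(axis=0)  # center each column

# thin SVD of the centered data: C = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# singular values relate to covariance eigenvalues by s**2 / (n - 1)
eigenvalues_from_svd = s**2 / (n - 1)

# rows of Vt are the principal directions; project with samples as rows
P = C @ Vt.T

print(eigenvalues_from_svd)
print(P)
```

`svd` returns the singular values in descending order, so no extra sorting step is needed with this approach.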


### Calculating Principal Component Analysis:
There is no pca() function in NumPy, but we can easily calculate the Principal Component
Analysis step-by-step using NumPy functions:

In [2]:
# Principal component analysis

from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig

# defining an array
A = array([
    [1, 2],
    [3, 4],
    [5, 6]
])
print(f"A: \n{A}\n")

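# column means
M = mean(A, axis=0)

# center columns by subtracting column means
C = A - M

# covariance matrix of the centered data (cov expects variables in rows, hence C.T)
V = cov(C.T)

# eigendecomposition of the covariance matrix
# (eig does not guarantee any ordering of the eigenvalues)
values, vectors = eig(V)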

print(f"vectors: \n{vectors}\n")
print(f"values: {values}\n")

# projecting data
P = vectors.T.dot(C.T)
print(f"P.T: \n{P.T}")

A: 
[[1 2]
 [3 4]
 [5 6]]

vectors: 
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

values: [8. 0.]

P.T: 
[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]
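Because the second eigenvalue is zero, the second column of the projection carries no information and can be dropped. The sketch below keeps only the component with the largest eigenvalue and then maps the reduced data back to the original space. The variable names `A_restored` and `explained` are illustrative, and exact reconstruction holds only because this particular dataset lies on a single line.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
M = A.mean(axis=0)
C = A - M
values, vectors = np.linalg.eig(np.cov(C, rowvar=False))

# keep only the eigenvector (column) with the largest eigenvalue
B = vectors[:, [np.argmax(values)]]  # shape (2, 1)

# project with samples as rows: one value per sample
P = C @ B                            # shape (3, 1)

# map back to the original space and undo the centering
A_restored = P @ B.T + M

# fraction of the total variance captured by the kept component
explained = values.max() / values.sum()

print(P)
print(A_restored)
print(explained)
```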


### Principal Component Analysis In Scikit-Learn:

We can calculate a Principal Component Analysis on a dataset using the PCA class in the
scikit-learn library. The benefit of this approach is that once the projection is calculated, it can
be applied to new data again and again quite easily. When creating the class, the number of
components can be specified with the n_components parameter. The class is first fit on a dataset
by calling the fit() method, and then the original dataset or other data can be projected into a
subspace with the chosen number of dimensions by calling the transform() method. Once fit, the
eigenvalues (the variance explained by each component) and the principal components can be
accessed on the PCA object via the explained_variance_ and components_ attributes. Note that
components_ stores one component per row, whereas eig() returns the eigenvectors as columns,
and the sign of a component may differ between the two approaches:

In [6]:
# PCA using scikit-learn

from numpy import array
from sklearn.decomposition import PCA

# defining an array
A = array([
    [1, 2],
    [3, 4],
    [5, 6]
])
print(f"A: \n{A}\n")

# creating the transform, keeping 2 components
pca = PCA(n_components=2)

# fitting transform
pca.fit(A)

# accessing values and vectors
print(f"pca.components_: \n{pca.components_}\n")
print(f"pca.explained_variance_: {pca.explained_variance_}\n")

# transforming data
B = pca.transform(A)
print(f"B: \n{B}")


A: 
[[1 2]
 [3 4]
 [5 6]]

pca.components_: 
[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]

pca.explained_variance_: [8. 0.]

B: 
[[-2.82842712e+00 -2.22044605e-16]
 [ 0.00000000e+00  0.00000000e+00]
 [ 2.82842712e+00  2.22044605e-16]]
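To illustrate reusing a fitted projection on other data, the sketch below fits PCA with a single component and applies it to two new rows. The rows in `new_rows` are made-up values for illustration, chosen to lie on the same line as the training data so that they can be recovered exactly with inverse_transform().

```python
import numpy as np
from sklearn.decomposition import PCA

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# fit once, keeping only the first principal component
pca = PCA(n_components=1).fit(A)

# made-up rows that were not part of the training data
new_rows = np.array([[7, 8], [0, 1]], dtype=float)

# project the new rows with the already fitted transform
projected = pca.transform(new_rows)

# map the 1-D projection back to the original 2-D space
restored = pca.inverse_transform(projected)

print(f"explained_variance_ratio_: {pca.explained_variance_ratio_}")
print(f"projected: \n{projected}")
print(f"restored: \n{restored}")
```

The explained_variance_ratio_ attribute reports the fraction of total variance captured by each kept component, which can help when choosing n_components.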
