## PCA
                         
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a high-dimensional dataset into one with fewer variables, chosen so that the new variables retain as much of the dataset's variance as possible. PCA is often applied before supervised or unsupervised machine learning steps to reduce the number of features, which can lower computational cost and the risk of overfitting.

The overall goal of PCA is to reduce the d dimensions (features) of a dataset by projecting it onto a k-dimensional subspace, where k < d. The approach used to complete PCA can be summarized as follows:

1. Standardize the data.
2. Use the standardized data to generate a covariance matrix (or perform Singular Value Decomposition).
3. Obtain eigenvectors (principal components) and eigenvalues from the covariance matrix. Each eigenvector has a corresponding eigenvalue.
4. Sort the eigenvalues in descending order.
5. Select the k eigenvectors with the largest eigenvalues, where k is the number of dimensions used in the new feature space.
6. Construct a new matrix with the selected k eigenvectors.
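
The six steps above can be sketched as a single NumPy function. This is a minimal illustration, not a reference implementation: the function name `pca_project` and its return values are assumptions made for this example, and it standardizes the features as step 1 describes.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    # 1. standardize: zero mean and unit variance per feature
    #    (constant features are only centered, to avoid dividing by zero)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0
    Z = (X - mean) / std
    # 2. covariance matrix of the standardized data (d x d)
    cov = np.cov(Z, rowvar=False)
    # 3. eigendecomposition; eigh is intended for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. sort eigenvalues (and their eigenvector columns) in descending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # 5-6. keep the first k eigenvectors as the columns of W (d x k)
    W = eigenvectors[:, :k]
    # project the standardized data into the k-dimensional subspace
    return Z @ W, W, eigenvalues

X = np.array([[1, 2], [3, 4], [5, 6]])
projected, W, eigenvalues = pca_project(X, k=1)
print(projected.shape)  # (3, 1)
```

The sign of each column of `W` is arbitrary, so the projected values may come out negated from one run or library to another.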




## Manually Calculate Principal Component Analysis 

There is no pca() function in NumPy, but we can easily calculate the Principal Component Analysis step-by-step using NumPy functions.

The example below defines a small 3×2 matrix, centers the data in the matrix, calculates the covariance matrix of the centered data, and then computes the eigendecomposition of the covariance matrix. The eigenvectors are taken as the principal components, and each eigenvalue gives the variance explained by its component. The eigenvectors are then used to project the centered data.

In [24]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])

In [3]:
A

array([[1, 2],
       [3, 4],
       [5, 6]])

In [4]:
# calculate the mean of each column
M = mean(A, axis=0)

In [5]:
M

array([3., 4.])

In [6]:
# center columns by subtracting column means
C = A - M
C

array([[-2., -2.],
       [ 0.,  0.],
       [ 2.,  2.]])

In [7]:
# calculate covariance matrix of centered matrix
V = cov(C.T)
V

array([[4., 4.],
       [4., 4.]])

In [8]:
# eigendecomposition of covariance matrix
# (each column of `vectors` is an eigenvector; values[i] belongs to vectors[:, i])
values, vectors = eig(V)
print(vectors)
print(values)

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
[8. 0.]
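
In this example the eigenvalues happen to come back largest first, but `numpy.linalg.eig` does not guarantee any particular order. Steps 4 and 5 of the summary can be sketched as follows, assuming the covariance matrix from the cell above; the names `order`, `k`, and `W` are introduced only for this illustration.

```python
import numpy as np

# covariance matrix computed in the cell above
V = np.array([[4.0, 4.0], [4.0, 4.0]])

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
values, vectors = np.linalg.eigh(V)

# reorder so the largest eigenvalue and its eigenvector column come first
order = np.argsort(values)[::-1]
values = values[order]
vectors = vectors[:, order]

# keep only the top k eigenvectors as the projection matrix
k = 1
W = vectors[:, :k]
print(W.shape)  # (2, 1)
```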


In [9]:
# project the centered data onto the eigenvectors
# (each row of P.T is one sample in the new coordinate system)
P = vectors.T.dot(C.T)
print(P.T)

[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]
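
The second column of the projection is all zeros because the second eigenvalue is 0, so a single component carries all of the variance here. A small sketch of what that implies, assuming the same matrix `A` as above (the names `ratio`, `P1`, and `A_rebuilt` are illustrative only): keeping one component and mapping back recovers the original data.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
M = A.mean(axis=0)
C = A - M

# eigendecomposition of the covariance matrix, sorted largest eigenvalue first
values, vectors = np.linalg.eigh(np.cov(C, rowvar=False))
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

# fraction of the total variance carried by each component
ratio = values / values.sum()

# keep one component, project, then map back to the original feature space
W = vectors[:, :1]
P1 = C @ W                # shape (3, 1)
A_rebuilt = P1 @ W.T + M

print(np.allclose(A_rebuilt, A))  # True
```

On real data the discarded eigenvalues are rarely exactly zero, so the reconstruction would only approximate the original.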


In [10]:
## Principal Component Analysis with scikit-learn
from numpy import array
from sklearn.decomposition import PCA

A = array([[1, 2], [3, 4], [5, 6]])

In [13]:
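# create the transform, keeping both components
pca = PCA(n_components=2)
# fit the transform to the data
pca.fit(A)

# principal axes (one per row) and the variance explained by each
print(pca.components_)
print(pca.explained_variance_)

# project the data onto the principal axes
B = pca.transform(A)
print(B)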

[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]
[8.00000000e+00 2.25080839e-33]
[[-2.82842712e+00  2.22044605e-16]
 [ 0.00000000e+00  0.00000000e+00]
 [ 2.82842712e+00 -2.22044605e-16]]
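
The scikit-learn output matches the manual calculation up to the sign of the second component and floating-point noise. scikit-learn documents its `PCA` as based on Singular Value Decomposition rather than an explicit covariance matrix. The sketch below is not scikit-learn's implementation; it only shows, with plain NumPy, how the SVD route mentioned in step 2 yields the same components and variances.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
C = A - A.mean(axis=0)

# thin SVD of the centered data: C = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# rows of Vt are the principal axes
components = Vt
# squared singular values, scaled by n - 1, equal the covariance eigenvalues
explained_variance = s ** 2 / (len(A) - 1)

# projecting onto the axes is equivalent to scaling the columns of U by s
P = C @ Vt.T

print(np.allclose(P, U * s))  # True
```

Because an eigenvector or singular vector multiplied by -1 is still valid, different routines may return components with flipped signs.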
