# Principal Component Analysis

This notebook explains the basics of principal component analysis (PCA), a dimensionality reduction method. It works through a small example: a matrix with three features ($X_{1}$, $X_{2}$, $X_{3}$) and four observations.

In [2]:
import numpy as np
import pandas as pd

In [3]:
# Each list holds one feature measured on four observations.
X_1 = [2, 4, 9, 5]
X_2 = [4, 8, 7, 1]
X_3 = [3, 6, 2, 5]

X = np.array([X_1, X_2, X_3]).T
X

array([[2, 4, 3],
       [4, 8, 6],
       [9, 7, 2],
       [5, 1, 5]])

The NumPy array above holds the three features as columns. The first step is to find the expected value of each feature, using the sample mean as the estimator:

\begin{align}
\mathbf{E(X)} = \frac{1}{N} \sum_{n=1}^{N} x_{n}
\end{align}

In [266]:
X_mean = np.mean(X, axis=0)
print("Expected value:", X_mean)

Expected value: [5. 5. 4.]


In [267]:
X_feature_std = np.std(X, axis=0)
print("Standard deviation of each feature:", X_feature_std)

Standard deviation of each feature: [2.54950976 2.73861279 1.58113883]


In [268]:
# Center the data by subtracting each feature's mean (no scaling is applied).
X_std = X - X_mean
print("Deviation matrix:\n", X_std)

Deviation matrix:
 [[-3. -1. -1.]
 [-1.  3.  2.]
 [ 4.  2. -2.]
 [ 0. -4.  1.]]


In [269]:
# Sample covariance: divide by N - 1, where N is the number of observations.
Cov_matrix = np.dot(X_std.T, X_std) / (X_std.shape[0] - 1)
print("Covariance matrix: \n", Cov_matrix)

Covariance matrix: 
 [[ 8.66666667  2.66666667 -2.33333333]
 [ 2.66666667 10.         -0.33333333]
 [-2.33333333 -0.33333333  3.33333333]]
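For reference, the estimator computed in the cell above can be written out as follows. Here $X_{c}$ is a symbol introduced only for this note and stands for the deviation matrix `X_std`, $\bar{x}$ is the vector of feature means, and $N = 4$ is the number of observations, so the normalising factor is $1/(N-1) = 1/3$. Dividing by $N-1$ rather than $N$ gives the unbiased sample covariance, which is also the default behaviour of `np.cov`.

\begin{align}
\mathbf{S} = \frac{1}{N-1} \sum_{n=1}^{N} (x_{n} - \bar{x})(x_{n} - \bar{x})^{T} = \frac{1}{N-1} X_{c}^{T} X_{c}
\end{align}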


In [270]:
X_std.T

array([[-3., -1.,  4.,  0.],
       [-1.,  3.,  2., -4.],
       [-1.,  2., -2.,  1.]])

In [271]:
# np.cov expects one variable per row and divides by N - 1 by default.
np.cov(X_std.T)

array([[ 8.66666667,  2.66666667, -2.33333333],
       [ 2.66666667, 10.        , -0.33333333],
       [-2.33333333, -0.33333333,  3.33333333]])

In [272]:
np.linalg.eig(np.cov(X_std.T))

(array([12.41661146,  7.1857126 ,  2.39767594]),
 array([[-0.64554098, -0.66129609,  0.38205279],
        [-0.73895116,  0.66721154, -0.09370137],
        [ 0.19294569,  0.34280642,  0.9193779 ]]))

In [287]:
values, vectors = np.linalg.eig(Cov_matrix)

In [288]:
print("Eigenvalues (variance along each principal component):", values)

Eigenvalues (variance along each principal component): [12.41661146  7.1857126   2.39767594]


In [289]:
print("Eigenvectors (one per column): \n", vectors)

Eigenvectors (one per column): 
 [[-0.64554098 -0.66129609  0.38205279]
 [-0.73895116  0.66721154 -0.09370137]
 [ 0.19294569  0.34280642  0.9193779 ]]
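As a sanity check on the decomposition, the sketch below verifies that each eigenvector $v$ satisfies $\mathbf{S} v = \lambda v$ and reports how much of the total variance each component carries. It is an illustration under stated assumptions, not part of the original analysis: it swaps in `np.linalg.eigh` (intended for symmetric matrices, returns eigenvalues in ascending order) in place of `np.linalg.eig`, and the names `X_c`, `cov`, `residual` and `explained_ratio` are introduced here.

```python
import numpy as np

# Deviation matrix copied from the output above: 4 observations x 3 features.
X_c = np.array([[-3., -1., -1.],
                [-1.,  3.,  2.],
                [ 4.,  2., -2.],
                [ 0., -4.,  1.]])

# Sample covariance with the 1/(N-1) normalisation.
cov = X_c.T @ X_c / (X_c.shape[0] - 1)

# eigh assumes a symmetric matrix and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# Reorder so that the direction with the largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column v of eigvecs should satisfy cov @ v = lambda * v.
residual = cov @ eigvecs - eigvecs * eigvals
print("max residual:", np.abs(residual).max())

# Share of the total variance captured by each component.
explained_ratio = eigvals / eigvals.sum()
print("explained variance ratio:", explained_ratio)
```

The eigenvalues sum to the trace of the covariance matrix, so the ratios add up to one.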


In [293]:
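# Project the centered data onto the eigenvectors.
# Each row of P holds the scores of one principal component.
P = np.dot(vectors.T, X_std.T)
print(P)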

[[ 2.4826284  -1.18542112 -4.44595759  3.14875031]
 [ 0.97387029  3.34854356 -1.99637411 -2.32603975]
 [-1.97183491  1.17559891 -0.49794736  1.29418336]]
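One way to see what the projection achieves is to compute the covariance of the projected scores: the components should be uncorrelated, with variances equal to the eigenvalues. The sketch below assumes the printed values of `P` (copied by hand, so accurate only to the displayed precision); `score_cov` is a name introduced here.

```python
import numpy as np

# Scores copied from the printed P above (rows = components, columns = observations).
P = np.array([[ 2.4826284 , -1.18542112, -4.44595759,  3.14875031],
              [ 0.97387029,  3.34854356, -1.99637411, -2.32603975],
              [-1.97183491,  1.17559891, -0.49794736,  1.29418336]])

# np.cov treats each row as a variable and divides by N - 1 by default.
score_cov = np.cov(P)

# Off-diagonal entries should be close to zero; the diagonal should
# match the eigenvalues of the original covariance matrix.
print(np.round(score_cov, 4))
```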


In [274]:
from sklearn.decomposition import PCA

In [297]:
pca_model = PCA(n_components=3, svd_solver='full')
pca_model.fit(X)

PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='full', tol=0.0, whiten=False)

In [298]:
print("Mean of the variables:", pca_model.mean_)

Mean of the variables: [5. 5. 4.]


In [299]:
pca_model.singular_values_

array([6.10326424, 4.64296649, 2.68198207])
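scikit-learn reports singular values here rather than eigenvalues. The two are related by $\lambda_{i} = s_{i}^{2} / (N-1)$, where $s_{i}$ are the singular values of the centered data matrix. The sketch below checks this with NumPy alone, assuming the deviation matrix printed earlier; the names `X_c`, `s` and `eig_from_svd` are introduced for illustration. In scikit-learn the same quantities are exposed as `pca_model.explained_variance_`.

```python
import numpy as np

# Deviation matrix copied from the earlier output: 4 observations x 3 features.
X_c = np.array([[-3., -1., -1.],
                [-1.,  3.,  2.],
                [ 4.,  2., -2.],
                [ 0., -4.,  1.]])

# Singular values of the centered data, returned in descending order.
s = np.linalg.svd(X_c, compute_uv=False)
print("singular values:", s)

# Squaring and dividing by N - 1 recovers the covariance eigenvalues.
eig_from_svd = s**2 / (X_c.shape[0] - 1)
print("eigenvalues from SVD:", eig_from_svd)
```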

In [300]:
pca_model.components_

array([[ 0.64554098,  0.73895116, -0.19294569],
       [-0.66129609,  0.66721154,  0.34280642],
       [-0.38205279,  0.09370137, -0.9193779 ]])

In [301]:
Y = pca_model.transform(X)
print(Y)

[[-2.4826284   0.97387029  1.97183491]
 [ 1.18542112  3.34854356 -1.17559891]
 [ 4.44595759 -1.99637411  0.49794736]
 [-3.14875031 -2.32603975 -1.29418336]]


In [302]:
Y.T

array([[-2.4826284 ,  1.18542112,  4.44595759, -3.14875031],
       [ 0.97387029,  3.34854356, -1.99637411, -2.32603975],
       [ 1.97183491, -1.17559891,  0.49794736, -1.29418336]])

In [303]:
P

array([[ 2.4826284 , -1.18542112, -4.44595759,  3.14875031],
       [ 0.97387029,  3.34854356, -1.99637411, -2.32603975],
       [-1.97183491,  1.17559891, -0.49794736,  1.29418336]])
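Comparing the two arrays, the manual projection `P` and the scikit-learn scores `Y.T` agree except for the sign of some rows. This is expected: an eigenvector is only defined up to a factor of $\pm 1$, so different solvers may return opposite directions for the same component. The sketch below makes the comparison explicit using the printed values; `signs` is a name introduced here.

```python
import numpy as np

# Manual projection (rows = components), copied from the output above.
P = np.array([[ 2.4826284 , -1.18542112, -4.44595759,  3.14875031],
              [ 0.97387029,  3.34854356, -1.99637411, -2.32603975],
              [-1.97183491,  1.17559891, -0.49794736,  1.29418336]])

# scikit-learn scores (rows = observations), copied from the output above.
Y = np.array([[-2.4826284 ,  0.97387029,  1.97183491],
              [ 1.18542112,  3.34854356, -1.17559891],
              [ 4.44595759, -1.99637411,  0.49794736],
              [-3.14875031, -2.32603975, -1.29418336]])

# One sign per component: +1 if both results point the same way, -1 if flipped.
signs = np.sign(np.sum(P * Y.T, axis=1))
print("sign per component:", signs)  # [-1.  1. -1.]

# After undoing the flips, the two sets of scores coincide.
print("equal up to sign:", np.allclose(P, signs[:, None] * Y.T))  # True
```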