## Principal Component Analysis (PCA)

PCA describes the underlying structure of the data by finding linearly uncorrelated (orthogonal) directions, the principal components, along which the data varies.

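Each principal component consists of:
- an eigenvector (direction), which gives the direction along which the data is spread out, and
- an eigenvalue (magnitude), which gives the variance of the data along that direction. The eigenvector with the largest eigenvalue marks the direction of highest variance.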
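
The direction and magnitude are linked by the eigenvalue equation. As a short sketch, write $\Sigma$ for the covariance matrix of the centered data, $v_i$ for a unit-length eigenvector and $\lambda_i$ for its eigenvalue (this notation is chosen here for illustration and does not come from the text above):

$$
\Sigma v_i = \lambda_i v_i, \qquad \operatorname{Var}(v_i^\top x) = v_i^\top \Sigma v_i = \lambda_i \quad \text{for } \lVert v_i \rVert = 1
$$

So the variance of the data projected onto $v_i$ is exactly $\lambda_i$, which is why sorting by eigenvalue ranks the directions by how much spread they capture.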

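PCA is used for dimensionality reduction. By ranking the eigenvalues (variances), we can keep the components that carry the most variance in the independent variables and drop the rest. PCA looks only at the independent variables and does not use the dependent variable, so high variance is a proxy for information, not a guarantee of predictive value.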

With fewer dimensions in the independent variables, there are fewer relationships with the dependent variable to consider (the low-variance components are dropped), which can reduce the risk of overfitting.
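
To make the reduction step concrete, here is a minimal sketch under stated assumptions. It uses hypothetical toy data (200 samples, 3 features, with the third feature nearly a copy of the first), and the names `X`, `explained_ratio`, `k` and `W` are illustrative. It ranks the eigenvalues of the covariance matrix, then projects the data onto the top `k` eigenvectors.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 3 features.
# Feature 2 is almost a copy of feature 0, so one direction carries little variance.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = x0 + 0.05 * rng.normal(size=200)
X = np.column_stack([x0, x1, x2])          # shape (200, 3), rows are samples

# Center the data and compute the sample covariance matrix (3 x 3).
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# so reorder to have the largest variance first.
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

# Share of the total variance captured by each component.
explained_ratio = eig_vals / eig_vals.sum()

# Keep the k components with the largest eigenvalues and project onto them.
k = 2
W = eig_vecs[:, :k]                        # shape (3, k)
X_reduced = X_centered @ W                 # shape (200, k)

print("Explained variance ratio:", explained_ratio)
print("Reduced data shape:", X_reduced.shape)
```

The choice of `k` is a judgment call; a common heuristic is to keep enough components to reach a target share of the total variance.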



In [4]:
import numpy as np
np.random.seed(1)  # fix the seed so the samples are reproducible

# Class 1: 20 samples from a 3-D Gaussian with zero mean and identity covariance
vec1 = np.array([0, 0, 0])
mat1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sample_for_class1 = np.random.multivariate_normal(vec1, mat1, 20).T
assert sample_for_class1.shape == (3, 20), "The dimension of the sample_for_class1 matrix is not 3x20"

# Class 2: 20 samples from a 3-D Gaussian with mean (1, 1, 1) and identity covariance
vec2 = np.array([1, 1, 1])
mat2 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sample_for_class2 = np.random.multivariate_normal(vec2, mat2, 20).T
assert sample_for_class2.shape == (3, 20), "The dimension of the sample_for_class2 matrix is not 3x20"

# Stack both classes column-wise: PCA ignores the class labels
all_data = np.concatenate((sample_for_class1, sample_for_class2), axis=1)
assert all_data.shape == (3, 40), "The dimension of the all_data matrix is not 3x40"

In [5]:
# Mean of each dimension (row) across all 40 samples
mean_dim1 = np.mean(all_data[0, :])
mean_dim2 = np.mean(all_data[1, :])
mean_dim3 = np.mean(all_data[2, :])

# Arrange the means as a 3x1 column vector
mean_vector = np.array([[mean_dim1], [mean_dim2], [mean_dim3]])

print('The Mean Vector:\n', mean_vector)

# Scatter matrix: sum over samples of the outer product (x - mean)(x - mean)^T
scatter_matrix = np.zeros((3,3))
for i in range(all_data.shape[1]):
    diff = all_data[:, i].reshape(3, 1) - mean_vector
    scatter_matrix += diff.dot(diff.T)
print('The Scatter Matrix is :\n', scatter_matrix)


The Mean Vector:
 [[0.41667492]
 [0.69848315]
 [0.49242335]]
The Scatter Matrix is :
 [[38.4878051  10.50787213 11.13746016]
 [10.50787213 36.23651274 11.96598642]
 [11.13746016 11.96598642 49.73596619]]
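
The scatter matrix computed above accumulates one outer product per sample. As a hedged sketch, the same quantity can be written as a single matrix product, and it should equal the sample covariance matrix scaled by `n - 1`. The names `data`, `centered`, `scatter_loop` and `scatter_vec` are illustrative, and the data is regenerated here only so the block runs on its own.

```python
import numpy as np

# Rebuild a (3, 40) data set of the same form: two 3-D Gaussian classes, 20 samples each.
np.random.seed(1)
class1 = np.random.multivariate_normal(np.zeros(3), np.eye(3), 20).T
class2 = np.random.multivariate_normal(np.ones(3), np.eye(3), 20).T
data = np.concatenate((class1, class2), axis=1)

# Center each dimension (row) on its mean.
mean_vec = data.mean(axis=1, keepdims=True)   # shape (3, 1)
centered = data - mean_vec                    # shape (3, 40)

# Loop form: sum of outer products, one per sample (column).
scatter_loop = np.zeros((3, 3))
for i in range(data.shape[1]):
    d = centered[:, i].reshape(3, 1)
    scatter_loop += d @ d.T

# Vectorized form: one matrix product gives the same sum.
scatter_vec = centered @ centered.T

# np.cov treats rows as variables by default and divides by n - 1,
# so the scatter matrix is (n - 1) times the sample covariance matrix.
n = data.shape[1]
cov = np.cov(data)

print("Loop and vectorized forms agree:", np.allclose(scatter_loop, scatter_vec))
print("Scatter equals (n - 1) * cov:", np.allclose(scatter_vec, (n - 1) * cov))
```

Because the two matrices differ only by the constant factor `n - 1`, they share the same eigenvectors, and their eigenvalues differ by that same factor. Either one can therefore be used to find the principal directions.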
