# Principal Component Analysis

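Principal Component Analysis (PCA) is a widely used technique that re-expresses data along orthogonal directions of decreasing variance, so that the number of features can be reduced before the data is passed to a model.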

## Pseudocode

1. Standardize the dataset so that every feature has mean 0 and standard deviation 1. This step is very important because PCA compares the variance of each feature; if the features are not on a common scale, the method will not work properly.
2. Find the line through the data along which the spread (variance) is largest.
3. Optimize the line's parameters so that the error is smallest.
4. Draw the next dimension orthogonal to the previous ones.
5. Go back to step 4 until all dimensions are consumed.
6. The output of PCA is a set of weights together with an ordering that reflects how important each component is for explaining the data.
7. If necessary, start reducing the number of features.
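
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `pca_sketch` is a hypothetical helper, the data is stored with features in rows (as in the rest of this notebook), and the directions of maximum variance are obtained from an eigendecomposition of the covariance matrix rather than from an explicit line search.

```python
import numpy as np

def pca_sketch(data, n_components):
    """Hypothetical helper: PCA on an array of shape (n_features, n_samples)."""
    # Step 1: standardize each feature (row) to mean 0 and standard deviation 1.
    mean = data.mean(axis=1, keepdims=True)
    std = data.std(axis=1, keepdims=True)
    standardized = (data - mean) / std

    # Steps 2-5: the orthogonal directions of maximum variance are the
    # eigenvectors of the covariance matrix (np.cov treats rows as variables).
    cov = np.cov(standardized)
    eig_val, eig_vec = np.linalg.eigh(cov)  # eigh is meant for symmetric matrices

    # Step 6: order the directions by decreasing eigenvalue.
    order = np.argsort(eig_val)[::-1]
    eig_val, eig_vec = eig_val[order], eig_vec[:, order]

    # Step 7: keep only the leading components and project the data onto them.
    components = eig_vec[:, :n_components]
    return eig_val, components, components.T @ standardized

# Stand-in data: three features with very different scales.
rng = np.random.default_rng(0)
demo = rng.normal(size=(3, 50)) * np.array([[5.0], [1.0], [0.1]])
eig_val, components, projected = pca_sketch(demo, n_components=2)
print(projected.shape)
```

Because of the standardization in step 1, the very different scales of the three stand-in features do not decide which direction comes first.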

### Creating a multivariate random dataset

In [8]:
import numpy as np
np.random.seed(1)

vec1 = np.array([0, 0, 0])
mat1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sample_for_class1 = np.random.multivariate_normal(vec1, mat1, 20).T
assert sample_for_class1.shape == (3, 20), "The dimension of the sample_for_class1 matrix is not 3x20"

vec2 = np.array([1, 1, 1])
mat2 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sample_for_class2 = np.random.multivariate_normal(vec2, mat2, 20).T
assert sample_for_class2.shape == (3, 20), "The dimension of the sample_for_class2 matrix is not 3x20"

all_data = np.concatenate((sample_for_class1, sample_for_class2), axis=1)
assert all_data.shape == (3, 40), "The dimension of the all_data matrix is not 3x40"

In [9]:
mean_dim1 = np.mean(all_data[0, :])
mean_dim2 = np.mean(all_data[1, :])
mean_dim3 = np.mean(all_data[2, :])

mean_vector = np.array([[mean_dim1], [mean_dim2], [mean_dim3]])

print('The Mean Vector:\n', mean_vector)

scatter_matrix = np.zeros((3,3))
for i in range(all_data.shape[1]):
    scatter_matrix += (all_data[:, i].reshape(3, 1) - mean_vector).dot((all_data[:, i].reshape(3, 1) - mean_vector).T)
print('The Scatter Matrix is :\n', scatter_matrix)

The Mean Vector:
 [[0.41667492]
 [0.69848315]
 [0.49242335]]
The Scatter Matrix is :
 [[38.4878051  10.50787213 11.13746016]
 [10.50787213 36.23651274 11.96598642]
 [11.13746016 11.96598642 49.73596619]]

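The scatter matrix is the sum of the outer products of the mean-centered samples. As a minimal sketch on stand-in random data (not the notebook's `all_data`), the loop can be written as a single matrix product, and the result is the sample covariance matrix scaled by `n - 1`:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(3, 40))  # stand-in data, features in rows

# Center every feature (row) around its mean.
centered = data - data.mean(axis=1, keepdims=True)

# Vectorized scatter matrix: one matrix product instead of a loop.
scatter_vectorized = centered @ centered.T

# Loop version: sum of outer products, one per sample (column).
scatter_loop = np.zeros((3, 3))
for i in range(data.shape[1]):
    col = centered[:, i].reshape(3, 1)
    scatter_loop += col @ col.T

n = data.shape[1]
print(np.allclose(scatter_vectorized, scatter_loop))
print(np.allclose(scatter_vectorized, (n - 1) * np.cov(data)))
```

Since the two matrices differ only by a constant factor, they have the same eigenvectors, and the eigenvalues differ by that same factor.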

### Compute the eigenvectors of the scatter matrix

In [10]:
eig_val, eig_vec = np.linalg.eig(scatter_matrix)
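# np.linalg.eig returns the eigenvectors as the columns of eig_vec, so iterate over the transpose
for ev in eig_vec.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))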

In [12]:
eig_vec

array([[-0.49210223, -0.64670286,  0.58276136],
       [-0.47927902, -0.35756937, -0.8015209 ],
       [-0.72672348,  0.67373552,  0.13399043]])
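
Each column of the returned matrix is one eigenvector. A quick way to convince yourself is to check the defining relation `A v = lambda v` column by column. The sketch below uses the rounded scatter matrix printed earlier as a stand-in, and also shows `np.linalg.eigh`, which is designed for symmetric matrices and returns the eigenvalues in ascending order:

```python
import numpy as np

# Stand-in: the rounded scatter matrix as printed in this notebook.
scatter = np.array([[38.4878051, 10.50787213, 11.13746016],
                    [10.50787213, 36.23651274, 11.96598642],
                    [11.13746016, 11.96598642, 49.73596619]])

eig_val, eig_vec = np.linalg.eig(scatter)

for i in range(len(eig_val)):
    v = eig_vec[:, i]  # i-th eigenvector is the i-th COLUMN
    # Defining relation of an eigenpair: scatter @ v == eig_val[i] * v
    np.testing.assert_allclose(scatter @ v, eig_val[i] * v, atol=1e-8)
    # Eigenvectors are returned with unit length.
    np.testing.assert_allclose(np.linalg.norm(v), 1.0)

# eigh exploits the symmetry and returns eigenvalues in ascending order.
eigh_val, eigh_vec = np.linalg.eigh(scatter)
np.testing.assert_allclose(np.sort(eig_val), eigh_val)
print("all eigenpair checks passed")
```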

In [14]:
# Build a list of (eigenvalue, eigenvector) tuples; eigenvector i is column i of eig_vec
eig_pairs = [(np.abs(eig_val[i]), eig_vec[:,i]) for i in range(len(eig_val))]

# Sort the tuples by eigenvalue, largest first
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Verify that the list is correctly sorted by decreasing eigenvalue
for i in eig_pairs:
    print(i[0])

65.16936779078195
32.69471296321796
26.596203282097097
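
The sorted eigenvalues indicate how much of the total scatter each component carries. As a small sketch, assuming the three eigenvalues printed above are typed in by hand, the share per component and the cumulative share can be computed like this:

```python
import numpy as np

# Eigenvalues copied from the printed output above, largest first.
eig_val_sorted = np.array([65.16936779078195, 32.69471296321796, 26.596203282097097])

# Fraction of the total scatter explained by each component.
explained_ratio = eig_val_sorted / eig_val_sorted.sum()
cumulative = np.cumsum(explained_ratio)

for k, (ratio, total) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{k}: {ratio:.3f} of the variance, {total:.3f} cumulative")
```

With these values the first two components together retain roughly 79% of the scatter, which is the part that survives when the third component is dropped.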


In [16]:
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1), eig_pairs[1][1].reshape(3,1)))
print('Matrix W:\n', matrix_w)

Matrix W:
 [[-0.49210223 -0.64670286]
 [-0.47927902 -0.35756937]
 [-0.72672348  0.67373552]]


In [17]:
# Project the data onto the two leading eigenvectors (all_data is the 3x40 matrix built above)
transformed = matrix_w.T.dot(all_data)
assert transformed.shape == (2,40), "The matrix is not 2x40 dimensional."
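
The projection step can also be sketched independently of the notebook state. The example below is a self-contained illustration on stand-in data with assumed variable names (`w`, `projected`). Unlike the cell above, it subtracts the mean before projecting, which only shifts the projected points by a constant offset. It then cross-checks the result against a singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in data: 40 samples of a 3-dimensional standard normal, features in rows.
data = rng.multivariate_normal(np.zeros(3), np.eye(3), 40).T

centered = data - data.mean(axis=1, keepdims=True)

# Eigendecomposition of the (symmetric) scatter matrix.
eig_val, eig_vec = np.linalg.eigh(centered @ centered.T)
order = np.argsort(eig_val)[::-1]   # indices of eigenvalues, largest first
w = eig_vec[:, order[:2]]           # (3, 2): the two leading eigenvectors

projected = w.T @ centered          # (2, 40): data in the reduced space
print(projected.shape)

# Cross-check 1: squared singular values equal the scatter eigenvalues.
_, s, _ = np.linalg.svd(centered, full_matrices=False)
print(np.allclose(s**2, eig_val[order]))

# Cross-check 2: the projected coordinates are uncorrelated, so their
# scatter matrix is diagonal with the two largest eigenvalues on it.
proj_scatter = projected @ projected.T
print(np.allclose(proj_scatter, np.diag(eig_val[order[:2]])))
```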