In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
def rotation(alpha, beta, gamma):
    # Convert the angles from degrees to radians; np.deg2rad uses the full-precision value of pi
    alpha, beta, gamma = np.deg2rad([alpha, beta, gamma])
    return np.array([
    [np.cos(alpha)*np.cos(beta) , np.cos(alpha)*np.sin(beta)*np.sin(gamma) - np.sin(alpha)*np.cos(gamma), np.cos(alpha)*np.sin(beta)*np.cos(gamma) + np.sin(alpha)*np.sin(gamma)],
    [np.sin(alpha)*np.cos(beta) , np.sin(alpha)*np.sin(beta)*np.sin(gamma) + np.cos(alpha)*np.cos(gamma), np.sin(alpha)*np.sin(beta)*np.cos(gamma) - np.cos(alpha)*np.sin(gamma)],
    [-np.sin(beta),  np.cos(beta)*np.sin(gamma), np.cos(beta)*np.cos(gamma)]])
    
mean = np.array([10,3,5])
cov_variance_tilde = np.array([[9, 0 , 0 ], [0,2,0], [0, 0, 0.5]])
cov_matrix = np.matmul(np.matmul(rotation(40, 30, 20).T , cov_variance_tilde) , rotation(40, 30, 20))
data = np.random.multivariate_normal(mean, cov_matrix, size=1000)
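
As a sanity check on the construction above, the following sketch rebuilds the same rotation matrix as a product of three elementary rotations and confirms that rotating the diagonal matrix leaves its eigenvalues (the variances 9, 2 and 0.5 along the principal axes) unchanged. The helper names `rot_x`, `rot_y` and `rot_z` are illustrative assumptions and not part of the notebook.

```python
import numpy as np

def rot_z(a):
    # Rotation about the z-axis by angle a (radians)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(b):
    # Rotation about the y-axis by angle b (radians)
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_x(g):
    # Rotation about the x-axis by angle g (radians)
    c, s = np.cos(g), np.sin(g)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

alpha, beta, gamma = np.deg2rad([40, 30, 20])
R = rot_z(alpha) @ rot_y(beta) @ rot_x(gamma)

D = np.diag([9.0, 2.0, 0.5])   # variances along the principal axes
C = R.T @ D @ R                # rotated covariance matrix

is_orthogonal = np.allclose(R @ R.T, np.eye(3))
eigenvalues = np.sort(np.linalg.eigvalsh(C))[::-1]
print(is_orthogonal)
print(eigenvalues)
```

Because `R` is orthogonal, the rotated matrix stays symmetric and keeps the eigenvalues of `D`, so PCA on data drawn from it should recover variances close to these values (before standardization).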

## Dimensionality Reduction

### Principal Component Analysis (PCA)

The PCA algorithm performs **dimensionality reduction** on a data set. ***Dimensionality reduction means that we attempt to capture the essence of a multivariate data set in a smaller number of variables that still explain the result of interest***.

After the individual principal components (PCs) have been computed, only those PCs that explain most of the variation are kept, thereby capturing the essence of the data.

The PCA algorithm involves the following steps:

1. **Standardization**:
As we have already seen, this step standardizes the input variables so that they can be used in PCA. PCA is sensitive to the scale of its inputs: a variable with a larger numerical range would otherwise dominate the variance. So, in the very first step of the algorithm, we standardize all variables to the same scale (zero mean, unit variance). This builds the foundation of the PCA analysis.

2. **Covariance matrix**:
This step computes the covariance matrix, which describes the relationships between the input variables. The covariance matrix is symmetric: the diagonal elements give the variance of each variable, and the off-diagonal elements give the covariance between pairs of variables. The sign of a covariance indicates whether two variables are directly (positive) or inversely (negative) related. This step matters for dimensionality reduction because highly correlated variables carry redundant information, which PCA can then compress into fewer components.

3. **Eigen algebra**:
Eigenvalues and eigenvectors are calculated from the covariance matrix computed in the step above. The eigenvectors of the covariance matrix are the directions of the axes along which the variance, i.e. the information, is largest. These are the PCs. The eigenvalue attached to each eigenvector gives the amount of variance carried by that PC. By ranking the eigenvectors in order of their eigenvalues, from largest to smallest, we get the PCs in order of significance. Next, we keep only the top PCs that together explain a given percentage of the variance.
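
The three steps above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not the notebook's own implementation: the sample `X`, its covariance `true_cov`, and the 90% `threshold` are made up for the example.

```python
import numpy as np

# Hypothetical correlated 3-D sample (values chosen only for illustration)
rng = np.random.default_rng(0)
true_cov = np.array([[4.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
X = rng.multivariate_normal([10.0, 3.0, 5.0], true_cov, size=500)
n = X.shape[0]

# Step 1: standardize each column (population std, ddof=0, as StandardScaler does)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized variables (np.cov divides by n - 1)
C = np.cov(Z, rowvar=False)

# Step 3: eigendecomposition of the symmetric matrix, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the smallest number of PCs whose cumulative variance share reaches the threshold
ratio = eigvals / eigvals.sum()
threshold = 0.90
k = min(int(np.searchsorted(ratio.cumsum(), threshold)) + 1, len(eigvals))

# Project the standardized data onto the first k PCs
scores = Z @ eigvecs[:, :k]
print(ratio, k, scores.shape)
```

The covariance matrix of `scores` is diagonal, with the retained eigenvalues on the diagonal, which is what makes the PCs uncorrelated.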

**For further details see slides in chapter 2**

In [5]:
pd.DataFrame(data, columns=["x1", "x2", "x3"])

Unnamed: 0,x1,x2,x3
0,10.509757,2.324407,6.725287
1,14.613840,1.519780,7.718482
2,10.520990,2.186009,5.798401
3,6.976431,3.822711,3.565686
4,10.648823,1.722884,4.029829
...,...,...,...
995,9.236007,2.848346,4.633646
996,13.524193,2.376648,8.560059
997,10.609743,0.898286,7.172645
998,10.178144,1.572685,6.282383


In [15]:
%matplotlib notebook
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

ax.scatter(data[:,0], data[:,1], data[:,2])
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_zlabel('$x_3$')
plt.show()


In [16]:
np.cov(data.T)

array([[ 1.001001  , -0.49950156,  0.82460288],
       [-0.49950156,  1.001001  , -0.67235766],
       [ 0.82460288, -0.67235766,  1.001001  ]])

In [17]:
#cov_matrix

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = scaler.fit_transform(data)

%matplotlib notebook
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

ax.scatter(data[:,0], data[:,1], data[:,2])
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_zlabel('$x_3$')
plt.show()


In [19]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(data)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)

data_pca = pca.transform(data)

# Scree plot: share of variance per component (bars) and its running total (line)
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.bar([1, 2, 3], pca.explained_variance_ratio_, label="variance")
ax.scatter([1, 2, 3], pca.explained_variance_ratio_.cumsum(), c="orange")
ax.plot([1, 2, 3], pca.explained_variance_ratio_.cumsum(), c="orange", label="cumulative variance")
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["$z_1$", "$z_2$", "$z_3$"])
ax.set_ylabel("explained variance ratio")
ax.legend()
plt.show()

[0.7794727  0.17284926 0.04767805]
[48.3571927  22.77164419 11.95968793]
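
The explained variance ratios printed above can be turned into a choice of how many components to keep. The sketch below is an illustration only: the helper `n_components_for` and the two thresholds are assumptions, and the ratios are copied from the output above.

```python
import numpy as np

# Explained variance ratios copied from the printed output above
ratio = np.array([0.7794727, 0.17284926, 0.04767805])
cumulative = ratio.cumsum()

def n_components_for(threshold):
    # Smallest number of leading PCs whose cumulative share reaches the threshold
    return int(np.searchsorted(cumulative, threshold)) + 1

print(n_components_for(0.75))  # 1
print(n_components_for(0.95))  # 2
```

With a 95% target, two components suffice for this data set. scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` (with `svd_solver="full"`) to make the same selection automatically.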



In [20]:
%matplotlib notebook
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

ax.scatter(data_pca[:,0], data_pca[:,1], data_pca[:,2])
ax.set_xlabel('$z_1$')
ax.set_ylabel('$z_2$')
ax.set_zlabel('$z_3$')
plt.show()


In [21]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
data_pca = pca.transform(data)

fig, ax = plt.subplots(1,1, figsize = (8,5))
ax.scatter(data_pca[:,0], data_pca[:,1])
ax.set_xlabel('$z_1$')
ax.set_ylabel('$z_2$')
plt.show()


In [22]:
np.cov(data_pca.T)

array([[ 2.34075884e+00, -5.33440492e-16],
       [-5.33440492e-16,  5.19066846e-01]])
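
The diagonal of this matrix can be cross-checked against the singular values printed earlier: for centered data with `n_samples` rows, the variance along each PC equals the squared singular value divided by `n_samples - 1`. The off-diagonal entries of order 1e-16 are zero up to floating-point error, reflecting that the PCs are uncorrelated. The sketch below only redoes this arithmetic with the numbers copied from the outputs above; `n_samples = 1000` is taken from the `size=1000` used to generate the data.

```python
import numpy as np

n_samples = 1000
# Singular values copied from the output of pca.singular_values_
singular_values = np.array([48.3571927, 22.77164419, 11.95968793])

# Variance along each PC, and its share of the total variance
variances = singular_values**2 / (n_samples - 1)
ratio = variances / variances.sum()

print(variances)  # first two entries match the diagonal of np.cov(data_pca.T)
print(ratio)      # matches pca.explained_variance_ratio_
```

The variances sum to about 3.003, i.e. three standardized variables each with sample variance `n_samples / (n_samples - 1)`.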