### PCA


Here we generate synthetic data, apply scikit-learn's PCA API to it, and see how it works.

##### 1. Creating data

We proceed in these steps:
1. Let the data be 15-dimensional (i.e. in $R^{15}$) with $N=1000$ data points.
2. Create 5 PC vectors (challenge: how do we create 5 orthogonal vectors in $R^{15}$? We use scipy).
3. Create each data point as a linear combination of these vectors.
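
As a side note on the challenge in step 2, one alternative to scipy's `ortho_group` is a QR decomposition of a random matrix. The sketch below is only illustrative: the helper name `orthonormal_rows` and the seed are assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthonormal_rows(k, n, rng):
    """Return a (k, n) array whose rows are orthonormal vectors in R^n."""
    # QR-factorize a random (n, k) Gaussian matrix; the k columns of Q are orthonormal.
    q, _ = np.linalg.qr(rng.normal(size=(n, k)))
    return q.T

vecs = orthonormal_rows(5, 15, rng)
print(vecs.shape)                             # (5, 15)
print(np.allclose(vecs @ vecs.T, np.eye(5)))  # True: rows are orthonormal
```

The check `vecs @ vecs.T == I` is the same test one can apply to the output of `ortho_group.rvs` to confirm orthogonality.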


In [2]:
from scipy.stats import ortho_group
import numpy as np
from numpy import random as npr

npr.seed(seed=144)


def get_k_ndim_pc_vectors(k, n):
    # Sample a random n x n orthogonal matrix and keep its first k rows,
    # giving k mutually orthogonal unit vectors in R^n
    x = ortho_group.rvs(dim = n)
    return x[:k, :]

pc_vecs = get_k_ndim_pc_vectors(5, 15)
pc_vecs

array([[-3.26291933e-01, -6.64136469e-02,  2.67134455e-01,
        -1.76450505e-01,  6.73835035e-02,  6.50302768e-01,
         4.22182916e-04,  4.41506706e-01, -8.54629656e-04,
         5.28377819e-02, -1.24489436e-01, -1.70253556e-01,
         3.36465677e-02, -2.79008814e-01, -1.94961956e-01],
       [-2.32522371e-02, -5.29135073e-01, -4.98369098e-01,
         1.91045813e-01, -1.09358341e-01,  1.41556933e-01,
        -2.02376447e-01,  1.49545257e-01, -1.09663044e-01,
         3.13232660e-01, -1.18860838e-01,  2.72085259e-01,
        -3.63930774e-01, -5.44110605e-02,  7.47153527e-02],
       [ 1.76075685e-02,  7.56573807e-03, -3.27797521e-01,
         2.09019797e-01,  3.41418111e-01, -8.88363233e-02,
        -3.75123750e-01, -5.30461823e-02,  1.11062506e-01,
         2.79882508e-01, -9.20845356e-02, -4.84609731e-01,
         4.45687521e-01, -9.74416013e-02, -1.95894910e-01],
       [ 4.66118943e-01,  9.67574157e-02,  2.61439598e-01,
        -9.30293071e-02, -3.66199430e-02,  1.00422272

Let us take linear combinations of these PC vectors (in $R^{15}$) to produce $N=1000$ points.
Every point gets a different linear combination, so we draw a $5 \times 1000$ matrix `pc_loadings` of coefficients. Combining it with the PC vectors gives a $15 \times 1000$ matrix in which each column is one generated point, i.e. a vector in $R^{15}$.
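
The shape bookkeeping can be checked on a small, self-contained sketch. The names `basis` and `loadings` are illustrative stand-ins (assumptions) for the notebook's PC vectors and coefficients, with the basis stored as 5 rows of length 15.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the PC vectors: 5 orthonormal rows in R^15 (via QR, an assumption)
basis = np.linalg.qr(rng.normal(size=(15, 5)))[0].T   # (5, 15)
loadings = 10 * (rng.uniform(size=(5, 1000)) - 0.5)   # (5, 1000)

# (15, 5) @ (5, 1000) -> (15, 1000); the basis must be transposed first
points = basis.T @ loadings

# Column j is the linear combination sum_i loadings[i, j] * basis[i]
j = 0
manual = sum(loadings[i, j] * basis[i] for i in range(5))
print(points.shape)
print(np.allclose(points[:, j], manual))

# Because the rows of basis are orthonormal, projecting back recovers the loadings
print(np.allclose(basis @ points, loadings))
```

Storing the vectors as rows means the product needs the transpose; multiplying a (5, 15) array directly by a (5, 1000) array raises a shape error.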

In [None]:
# Coefficients drawn uniformly from [-5, 5): one column of 5 loadings per point
pc_loadings = 10 * (npr.uniform(size=(5,1000)) - 0.5)
print(pc_loadings.shape)
print(pc_loadings)

In [None]:
# Generate all 1000 points.
# pc_vecs is (5, 15), so transpose it to (15, 5) before multiplying by the (5, 1000) loadings
random_noise = npr.normal(size = (15,1000)) * 0.01
all_points   = pc_vecs.T.dot(pc_loadings) + random_noise

`all_points` is a $15 \times 1000$ matrix in which each column is one point, as mentioned before.

In [None]:
print(all_points.shape)
print(all_points)

##### ------------------------  Data generation part completes here ------------------------ #####

Let us now run PCA on **all_points** and see whether we can recover the components (i.e. the PCs and the coefficients **pc_loadings**) and the data.

In [None]:
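import pandas as pd
# scikit-learn expects one row per sample, so transpose the (15, 1000) matrix to (1000, 15)
in_data = pd.DataFrame(all_points.T)

In [None]:
from sklearn.decomposition import PCA
# Keep every component: one per feature (column)
pca = PCA(n_components = in_data.shape[1])
pca.fit(in_data)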

In [None]:
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))

In [None]:
# Let us try to discover how many components are needed
# var_ret is in [0, 1]: the fraction of total variance to retain
def get_num_components(in_mat, var_ret):
    pca = PCA()  # default keeps min(n_samples, n_features) components
    pca.fit(in_mat)
    expl = pca.explained_variance_ratio_
    cum_var = np.cumsum(expl / expl.sum())
    # first index whose cumulative ratio reaches var_ret, converted to a count
    needed_components = 1 + np.min(np.where(cum_var >= var_ret))
    return needed_components

In [None]:
get_num_components(in_data, 0.99)

As expected, 5 principal components are enough to explain at least 99% of the variance in the data.
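
scikit-learn can also pick the number of components itself: when `n_components` is a float in (0, 1), `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below rebuilds comparable synthetic data so that it runs on its own; the variable names and the seed are assumptions, and `svd_solver="full"` is passed explicitly to be safe across versions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Comparable synthetic data: 5 orthonormal directions in R^15, 1000 points, small noise
basis = np.linalg.qr(rng.normal(size=(15, 5)))[0].T            # (5, 15)
loadings = 10 * (rng.uniform(size=(5, 1000)) - 0.5)            # (5, 1000)
X = (basis.T @ loadings + 0.01 * rng.normal(size=(15, 1000))).T  # (1000, 15): rows are samples

# Ask for 99% of the variance instead of a fixed component count
pca_auto = PCA(n_components=0.99, svd_solver="full")
pca_auto.fit(X)
print(pca_auto.n_components_)
```

The fitted attribute `n_components_` then plays the same role as the return value of a hand-written threshold search.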

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 15)
pca.fit(in_data)
pca.components_

In [None]:
pca.components_.shape

**components_** : ndarray of shape (n_components, n_features)
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
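
A quick way to see what this layout means is to check it on a small example: each row of `components_` is one unit-length axis, the rows are mutually orthogonal, and `explained_variance_` is in non-increasing order. The data and names below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# 200 samples, 6 features with different spreads per feature
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])

pca_demo = PCA(n_components=3).fit(X_demo)
C = pca_demo.components_

print(C.shape)                                           # (n_components, n_features)
print(np.allclose(C @ C.T, np.eye(3)))                   # rows are orthonormal
print(np.all(np.diff(pca_demo.explained_variance_) <= 0))  # sorted, largest first
```

So the i-th principal axis is `components_[i]` (a row), not `components_[:, i]` (a column).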


In [None]:
# Each row of components_ is one principal axis; look at the first one
pca.components_[0]

In [None]:
pc_vecs
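
When comparing the two, note that individual recovered axes need not match the rows of the generating vectors: the sign of each axis is arbitrary, and because all five loadings here have the same variance, any rotation within the 5-dimensional subspace explains the data equally well. What PCA can recover is the subspace itself. The sketch below checks that, and also reconstructs the data from 5 components; it regenerates its own data, and all names in it are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

true_vecs = np.linalg.qr(rng.normal(size=(15, 5)))[0].T         # (5, 15) orthonormal rows
loadings = 10 * (rng.uniform(size=(5, 1000)) - 0.5)             # (5, 1000)
X = (true_vecs.T @ loadings + 0.01 * rng.normal(size=(15, 1000))).T  # (1000, 15)

pca5 = PCA(n_components=5, svd_solver="full").fit(X)
W = pca5.components_                                            # (5, 15)

# Project each true vector onto the span of the recovered axes.
# A norm close to 1 means the vector lies (almost) inside that span.
proj_norms = np.linalg.norm(true_vecs @ W.T, axis=1)
print(np.round(proj_norms, 3))

# Recover per-point coefficients and map them back to R^15
scores = pca5.transform(X)                # (1000, 5)
X_back = pca5.inverse_transform(scores)   # (1000, 15)
print(np.abs(X - X_back).max())           # small: only noise outside the subspace is lost
```

The `scores` are the coefficients with respect to the recovered axes (after centering), so they equal the original loadings only up to a rotation or sign flip within the subspace.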