### PCA


Here we generate synthetic data with a known structure, apply scikit-learn's PCA API to it, and check how well PCA recovers that structure.

##### 1. Creating data

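We build the data in these steps:
1. Let the data be 15-dimensional (i.e. in $R^{15}$) with $N=1000$ data points.
2. Create 5 PC vectors (challenge: how to create 5 orthogonal $R^{15}$ vectors? We use scipy's `ortho_group`).
3. Create each data point as a linear combination of these vectors.

Note that the code below runs a smaller instance of the same recipe: `n = 3` dimensions, `num_main_pcs = 2` main components and `tot = 25` points.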
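As a possible answer to the challenge in step 2 that does not rely on `ortho_group`, here is a minimal sketch based on a QR decomposition of a random Gaussian matrix. The helper name `orthonormal_rows` and the seed are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def orthonormal_rows(k, n, seed=0):
    """Return a (k, n) array whose rows are orthonormal vectors in R^n (k <= n)."""
    rng = np.random.default_rng(seed)
    # QR-factorise a random Gaussian n x k matrix; Q has k orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return q.T

basis = orthonormal_rows(k=5, n=15)
gram = basis @ basis.T  # should be close to the 5 x 5 identity
print(basis.shape)
print(np.allclose(gram, np.eye(5)))
```

The Gram matrix check is the same idea as the `pc_vecs.dot(pc_vecs.T)` test used in the notebook.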


In [1]:
from scipy.stats import ortho_group
from numpy import random  # scipy.random was only an alias of numpy.random and is unavailable in recent SciPy releases
import numpy as np
from numpy import random as npr

npr.seed(seed=144)


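def get_k_ndim_pc_vectors(k, n):
    # Draw a random n x n orthogonal matrix and keep its first k rows,
    # which are k orthonormal vectors in R^n.
    x = ortho_group.rvs(dim = n)
    return x[:k, :]

#x = get_k_ndim_pc_vectors(5, 15)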

n = 3    #dimensions
num_main_pcs = 2 #number of main explaining components
tot = 25 #Total number of points
pc_vecs = get_k_ndim_pc_vectors(n, n)

# pc_vecs is a full orthogonal matrix, so the rows of its transpose are orthonormal as well
pc_vecs_as_rows = pc_vecs.T

#pc_vecs
print(pc_vecs_as_rows.shape)

(3, 3)


In [2]:
# Let us make sure these are orthonormal: the product should be the identity matrix
pc_vecs.dot(pc_vecs.T).round(2)


array([[ 1., -0.,  0.],
       [-0.,  1., -0.],
       [ 0., -0.,  1.]])

Next we take linear combinations of these PC vectors (in $R^{15}$) to produce $N=1000$ points.
Every point gets its own linear combination, so **pc_loadings** is a $5 \times 1000$ matrix of coefficients and the result is a $15 \times 1000$ matrix whose columns are the generated points (each an $R^{15}$ vector). In the code the matrices are stored transposed, with one point per row.

We also add noise loadings, **random_noise**. They are small compared with the loadings on the main PCs (the first to the 5th) and act along directions orthogonal to the span of those five PCs, which therefore explain most of the variation.
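The prose describes the full-size case (15 dimensions, 5 main PCs, 1000 points), while the cells use a smaller one. A minimal sketch of the full-size construction could look like the following; the variable names, the seeds and the use of `numpy.random.default_rng` are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(144)   # assumed seed, for repeatability only
n_dims, n_main, n_points = 15, 5, 1000

# Rows of an orthogonal 15 x 15 matrix: the first 5 are the "main" directions,
# the remaining 10 span the orthogonal complement that carries the noise.
basis = ortho_group.rvs(dim=n_dims, random_state=144)

main_loadings = rng.integers(low=-4, high=6, size=(n_points, n_main))        # integers in [-4, 5]
noise_loadings = rng.uniform(-0.01, 0.01, size=(n_points, n_dims - n_main))  # tiny by comparison
loadings = np.hstack([main_loadings, noise_loadings])                        # (1000, 15)

points = loadings @ basis   # each row is one generated point in R^15
print(points.shape)
```

Storing one point per row matches the layout scikit-learn expects (`n_samples x n_features`).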

In [None]:
npr.randint(low = 1, high = 11, size=(num_main_pcs,tot))  # integers in [1, 10]; random_integers is deprecated


In [22]:
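# Main PC loading coefficients, one column per point (tot points in total)
# randint's upper bound is exclusive, so high=11 gives integers in [1, 10]; random_integers is deprecated
pc_loadings = npr.randint(low = 1, high = 11, size=(num_main_pcs,tot)) - 5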
print(pc_loadings.shape)
print(pc_loadings)

(2, 25)
[[ 0  0  2 -2 -2  3 -1 -3  0 -2  4  1 -1  1 -4  5 -1 -1 -2 -3 -4 -1  4 -4
   1]
 [-4 -3  3  5  4  2  3  4 -1 -2  5  3 -1  4  2 -1  3 -1  3 -4  5 -3  1  2
  -3]]


  


In [24]:
# Small noise loadings for the remaining (orthogonal) directions, one column per point
# Alternative: npr.normal(size = ((n-num_main_pcs),tot)) * 0.001
random_noise = npr.uniform(low = -0.01, high = 0.01, size = ((n-num_main_pcs),tot))
final_pc_loadings = np.concatenate((pc_loadings, random_noise), axis=0).T
print("final_pc_loadings shape:" + str(final_pc_loadings.shape))
np.round(final_pc_loadings,3)

final_pc_loadings shape:(25, 3)


array([[ 0.e+00, -4.e+00, -9.e-03],
       [ 0.e+00, -3.e+00, -1.e-03],
       [ 2.e+00,  3.e+00,  2.e-03],
       [-2.e+00,  5.e+00, -4.e-03],
       [-2.e+00,  4.e+00, -9.e-03],
       [ 3.e+00,  2.e+00, -6.e-03],
       [-1.e+00,  3.e+00, -9.e-03],
       [-3.e+00,  4.e+00,  2.e-03],
       [ 0.e+00, -1.e+00,  4.e-03],
       [-2.e+00, -2.e+00,  3.e-03],
       [ 4.e+00,  5.e+00,  7.e-03],
       [ 1.e+00,  3.e+00, -2.e-03],
       [-1.e+00, -1.e+00,  1.e-03],
       [ 1.e+00,  4.e+00, -6.e-03],
       [-4.e+00,  2.e+00, -7.e-03],
       [ 5.e+00, -1.e+00, -5.e-03],
       [-1.e+00,  3.e+00, -4.e-03],
       [-1.e+00, -1.e+00,  8.e-03],
       [-2.e+00,  3.e+00,  9.e-03],
       [-3.e+00, -4.e+00,  4.e-03],
       [-4.e+00,  5.e+00, -9.e-03],
       [-1.e+00, -3.e+00, -2.e-03],
       [ 4.e+00,  1.e+00,  2.e-03],
       [-4.e+00,  2.e+00, -6.e-03],
       [ 1.e+00, -3.e+00, -9.e-03]])

In [25]:
pc_vecs_as_rows

array([[-0.99602866, -0.07097906,  0.05374832],
       [-0.02515794,  0.80346097,  0.59482564],
       [ 0.08540484, -0.59111119,  0.80205584]])

In [40]:
# Generate all tot points: each row of all_points is one point,
# i.e. a linear combination of the rows of pc_vecs_as_rows
all_points   = final_pc_loadings.dot(pc_vecs_as_rows) #+ random_noise
np.savetxt("/tmp/all_points.csv", all_points, delimiter=",")
print("Shape all_points: " + str(all_points.shape))
print(all_points)

Shape all_points: (25, 3)
[[ 0.09982826 -3.20828274 -2.38684828]
 [ 0.07536229 -2.40961103 -1.78552427]
 [-2.06734955  2.26716806  1.89367878]
 [ 1.86595015  4.16146042  2.86364995]
 [ 1.89069321  3.36087089  2.26492815]
 [-3.03894021  1.39771088  1.34584042]
 [ 0.91977482  2.48676076  1.72340321]
 [ 2.88764651  3.42545027  2.21986332]
 [ 0.02553781 -0.80609015 -0.59125821]
 [ 2.04261801 -1.4666583  -1.29484876]
 [-4.10926742  3.72898042  3.1951028 ]
 [-1.0716813   2.34064161  1.83654579]
 [ 1.02128603 -0.7331701  -0.64764019]
 [-1.0971638   3.14634896  2.4283234 ]
 [ 3.93323157  1.89476389  0.96933137]
 [-4.95544374 -1.15518373 -0.33038874]
 [ 0.92017765  2.48397269  1.72718623]
 [ 1.02184339 -0.73702773 -0.64240591]
 [ 1.91737519  2.54686159  1.68441513]
 [ 3.08904834 -3.00319495 -2.53744271]
 [ 3.85752621  4.30674948  2.75163369]
 [ 1.0713106  -2.33807587 -1.84002714]
 [-4.00913734  0.51860872  0.81108896]
 [ 3.9333025   1.89427298  0.96999747]
 [-0.92136281 -2.47576986 -1.73831633]]

all_points is a **tot x n** matrix in which each row is one generated point.

##### ------------------------  Data generation part completes here ------------------------ #####

Let us now run PCA on **all_points** and see whether we can recover the components (i.e. the PCs), the coefficients **pc_loadings**, and the data itself.
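As a baseline for what "recovering" means: when the generating basis is known, the loadings can be read back exactly by projecting onto it, because the basis is orthonormal. The sketch below uses stand-in data with assumed names and seeds rather than the notebook's variables.

```python
import numpy as np
from scipy.stats import ortho_group

basis = ortho_group.rvs(dim=3, random_state=0)   # rows are orthonormal generators
rng = np.random.default_rng(0)
loadings = rng.uniform(-5, 5, size=(25, 3))       # hypothetical coefficients
points = loadings @ basis                          # same construction as all_points

# basis @ basis.T is the identity, so multiplying by basis.T undoes the mixing.
recovered = points @ basis.T
print(np.allclose(recovered, loadings))
```

PCA has to do this without knowing `basis`, so at best it can find such directions up to sign and ordering.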


In [41]:
print("Shape of all_points:" + str(all_points.shape))
# Standardising: per-column mean and population standard deviation (ddof=0).
# keepdims=True keeps the shape (1, n) so the results broadcast against all_points.
# The "_15_dims" names refer to the full-size example; here n = 3.
means_of_15_dims = np.mean(all_points, axis = 0, keepdims = True)
print("means{}:".format(means_of_15_dims.shape))
print(means_of_15_dims)

sd_of_15_dims = np.std(all_points, axis = 0, ddof = 0, keepdims = True)
print("sds {}:".format(sd_of_15_dims.shape))
print(sd_of_15_dims)

print("all_points shape:" + str(all_points.shape))

Shape of all_points:(25, 3)
means(1, 3):
[[0.37208665 0.86510229 0.59561153]]
sds (1, 3):
[[2.48448114 2.40649585 1.7648127 ]]
all_points shape:(25, 3)


In [42]:
scaled = (all_points - means_of_15_dims)/sd_of_15_dims
print("shape scaled: " + str(scaled.shape))
scaled

shape scaled: (25, 3)


array([[-0.1095836 , -1.69266239, -1.68995827],
       [-0.11943112, -1.36078079, -1.34922862],
       [-0.98186948,  0.58261716,  0.73552693],
       [ 0.60127785,  1.36977512,  1.28514399],
       [ 0.6112369 ,  1.03709658,  0.94588884],
       [-1.37293329,  0.22132122,  0.42510398],
       [ 0.22044368,  0.67386714,  0.63904328],
       [ 1.01250914,  1.06393202,  0.92035364],
       [-0.1394854 , -0.69445058, -0.67251881],
       [ 0.67238641, -0.96894436, -1.07119599],
       [-1.80373841,  1.19006153,  1.4729559 ],
       [-0.58111448,  0.6131485 ,  0.70315352],
       [ 0.26130179, -0.66414924, -0.70446667],
       [-0.59137114,  0.94795371,  1.03847388],
       [ 1.43335558,  0.4278676 ,  0.21176176],
       [-2.14432313, -0.83951361, -0.52470173],
       [ 0.22060582,  0.67270858,  0.64118686],
       [ 0.26152613, -0.66575225, -0.70150076],
       [ 0.62197636,  0.69884156,  0.61695137],
       [ 1.09357307, -1.60743981, -1.77528995],
       [ 1.40288428,  1.43014882,  1.221

In [43]:
from sklearn.preprocessing import StandardScaler
x_transformed = StandardScaler().fit_transform(all_points)
x_transformed

array([[-0.1095836 , -1.69266239, -1.68995827],
       [-0.11943112, -1.36078079, -1.34922862],
       [-0.98186948,  0.58261716,  0.73552693],
       [ 0.60127785,  1.36977512,  1.28514399],
       [ 0.6112369 ,  1.03709658,  0.94588884],
       [-1.37293329,  0.22132122,  0.42510398],
       [ 0.22044368,  0.67386714,  0.63904328],
       [ 1.01250914,  1.06393202,  0.92035364],
       [-0.1394854 , -0.69445058, -0.67251881],
       [ 0.67238641, -0.96894436, -1.07119599],
       [-1.80373841,  1.19006153,  1.4729559 ],
       [-0.58111448,  0.6131485 ,  0.70315352],
       [ 0.26130179, -0.66414924, -0.70446667],
       [-0.59137114,  0.94795371,  1.03847388],
       [ 1.43335558,  0.4278676 ,  0.21176176],
       [-2.14432313, -0.83951361, -0.52470173],
       [ 0.22060582,  0.67270858,  0.64118686],
       [ 0.26152613, -0.66575225, -0.70150076],
       [ 0.62197636,  0.69884156,  0.61695137],
       [ 1.09357307, -1.60743981, -1.77528995],
       [ 1.40288428,  1.43014882,  1.221

We see above that scikit-learn's `StandardScaler` is just z-scoring (using the population standard deviation, `ddof=0`), exactly as we did by hand in **scaled**.
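Rather than comparing the two printed arrays by eye, the equivalence can be checked numerically. This is a small sketch on stand-in data (the array `X` and the seed are assumptions); it also shows that the match depends on using `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=(25, 3))   # stand-in for all_points

manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)   # population std, as in `scaled`
sk = StandardScaler().fit_transform(X)

print(np.allclose(manual, sk))
# With the sample standard deviation (ddof=1) the two no longer agree.
manual_ddof1 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(manual_ddof1, sk))
```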

In [44]:
import pandas as pd
in_data = pd.DataFrame(x_transformed)

In [45]:
from sklearn.decomposition import PCA
pca = PCA(n_components = in_data.shape[1])
pca.fit(in_data)

PCA(n_components=3)

In [46]:
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))

[6.63577294e-01 3.36420067e-01 2.63905025e-06]
[0.66357729 0.99999736 1.        ]
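To connect these ratios to the underlying linear algebra: the explained variances are the eigenvalues of the sample covariance matrix, and the ratios are those eigenvalues divided by their sum. The sketch below checks this on stand-in data; `X` and the per-column scales are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data with very different spread along each column.
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.1])

pca = PCA().fit(X)

# Eigenvalues of the sample covariance matrix (ddof=1), sorted largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(np.allclose(pca.explained_variance_, eigvals))
print(np.allclose(pca.explained_variance_ratio_, eigvals / eigvals.sum()))
```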


In [47]:
# Let us try to discover how many components are needed
# var_ret is in [0, 1]: the fraction of total variance to retain
def get_num_components(in_mat, var_ret):
    pca = PCA(n_components = in_mat.shape[1])
    pca.fit(in_mat)
    expl = pca.explained_variance_ratio_
    # Index of the first cumulative ratio that reaches var_ret, plus one to turn it into a count
    needed_components = 1 + np.min(np.where(np.cumsum(expl/sum(expl)) >= var_ret))
    return needed_components

In [48]:
explained_by = get_num_components(in_data, 0.99)
explained_by

2
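scikit-learn can also make this choice itself: passing a float between 0 and 1 as `n_components` (with `svd_solver="full"`) keeps just enough components to exceed that fraction of variance. The sketch below uses stand-in data, so `X` and the scales are assumptions, and cross-checks the result against the cumulative-sum approach used above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data: two strong directions and one nearly flat one.
X = rng.normal(size=(200, 3)) * np.array([3.0, 2.0, 0.01])

# A float in (0, 1) asks for enough components to exceed that variance fraction.
pca_99 = PCA(n_components=0.99, svd_solver="full").fit(X)
print(pca_99.n_components_)

# Cross-check against the cumulative explained-variance ratios.
cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_by_cumsum = 1 + int(np.argmax(cum >= 0.99))
print(n_by_cumsum)
```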

In [49]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(in_data)
pca.components_

array([[-0.04461722, -0.70808837, -0.70471282],
       [-0.99342745, -0.04298386,  0.10608626],
       [ 0.10540973, -0.70481434,  0.7015166 ]])

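As expected, the first **explained_by** (here 2) principal components explain essentially all of the variance in the data (about 99.9997%, from the cumulative ratios above); the third component corresponds to the small noise direction.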

In [54]:
#from sklearn.decomposition import PCA
#pca = PCA(n_components = explained_by)
#pca.fit(in_data)
#pca.components_
pca.explained_variance_ratio_
pca.components_

array([[-0.04461722, -0.70808837, -0.70471282],
       [-0.99342745, -0.04298386,  0.10608626],
       [ 0.10540973, -0.70481434,  0.7015166 ]])

In [51]:
pca.components_.shape

(3, 3)

**components_** : ndarray of shape (n_components, n_features)
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
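To make the role of `components_` concrete, the sketch below reproduces `transform` and `inverse_transform` by hand for the default `whiten=False` setting: scores are the centred data projected onto the rows of `components_`, and the reconstruction maps them back and re-adds the mean. The data `X` is a stand-in, not the notebook's `in_data`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3)) * np.array([3.0, 2.0, 0.01])   # stand-in data

pca = PCA(n_components=2).fit(X)

# Scores: centre the data, then project onto the rows of components_.
scores_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(scores_manual, pca.transform(X)))

# Reconstruction from 2 components: map the scores back and re-add the mean.
X_back = scores_manual @ pca.components_ + pca.mean_
print(np.allclose(X_back, pca.inverse_transform(scores_manual)))

# The error is small because the dropped direction carries very little variance.
max_err = np.abs(X - X_back).max()
print(max_err)
```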


In [52]:
pca.components_[0,:]

array([-0.04461722, -0.70808837, -0.70471282])

In [18]:
## PC vectors are orthogonal to each other
print(pca.components_.shape)
np.round(pca.components_.dot(pca.components_.T),2)

(2, 3)


array([[1., 0.],
       [0., 1.]])

In [19]:
# Let us see if our components are perpendicular/parallel to our starting generator vectors pc_vecs
pca.components_  # shape (n_components, n_features)


array([[ 0.44721309,  0.64450015,  0.6201774 ],
       [ 0.8873025 , -0.23233779, -0.39838853]])

In [55]:
[ pca.components_[0,:].dot(pc_vecs_as_rows[i,:]) for i in range(n)]

[0.05682235157504861, -0.9869801505221955, -0.15047060456004563]
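A dot product of exactly ±1 with a single generator should not be expected in general. One likely reason here is that the PCA was fitted on standardised data, and rescaling each column changes the directions. Even on unscaled data, PCA only recovers the generators up to sign and up to rotation within their span. A more robust check is whether each generator lies in the span of the top components. The sketch below does this on stand-in data fitted without per-column scaling; all names and seeds are assumptions.

```python
import numpy as np
from scipy.stats import ortho_group
from sklearn.decomposition import PCA

basis = ortho_group.rvs(dim=3, random_state=0)    # rows: generator vectors
rng = np.random.default_rng(0)
main = rng.integers(-4, 6, size=(25, 2))            # strong loadings on generators 0 and 1
noise = rng.uniform(-0.01, 0.01, size=(25, 1))      # weak loading on generator 2
points = np.hstack([main, noise]) @ basis

# Fit on the raw (only centred) points, so the geometry is not rescaled per column.
top2 = PCA(n_components=2).fit(points).components_

# Length of each generator after projection onto span(top2):
# about 1 for the two main generators, about 0 for the noise generator.
proj_len = np.linalg.norm(basis @ top2.T, axis=1)
print(np.round(proj_len, 3))
```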

In [56]:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the '3d' projection on older matplotlib)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# 3D scatter of the standardised points; they should lie close to a plane,
# because only two directions carry significant variance.
ax.scatter(scaled[:,0], scaled[:,1], scaled[:,2], marker='^')

ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')

plt.show()