Suggested order of use:

1. Review the theory of PCA (Principal Component Analysis).
2. Use this notebook to apply PCA in Python.
3. Apply PCA to Customer Analytics (segmentation with KMeans).

This notebook covers material from Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (pages 213-224, up to but not including Randomized PCA). The goal is only to understand the concepts and methods in Python; the quality of the results is not a concern here.

In [17]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [18]:
X = np.random.random((20,10)) * 10

### 1. Preprocessing 
PCA assumes that the dataset is centered around the origin, so the data must be centered before decomposition. When features are on different scales, they should also be standardized: ( x - x.mean( ) ) / x.std( ).

In [19]:
X_centered = X - X.mean(axis=0) # center each feature at zero; X is uniform on [0, 10), so its column means are near 5, not 0

In [83]:
scaler = StandardScaler()
X_centered = scaler.fit_transform(X)
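
As a sanity check on the formula ( x - x.mean( ) ) / x.std( ), the sketch below compares a manual standardization against StandardScaler. The array `X_demo` and the seed are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)        # seed chosen only for reproducibility (assumption)
X_demo = rng.random((20, 10)) * 10    # hypothetical stand-in for X

# Manual standardization, column by column.
# np.std defaults to ddof=0 (population std), which is what StandardScaler uses.
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_demo)

same = np.allclose(X_manual, X_scaled)
col_means = X_scaled.mean(axis=0)     # each close to 0
col_stds = X_scaled.std(axis=0)       # each close to 1
print(same)
```

After scaling, every column has mean close to 0 and standard deviation close to 1, so the centering assumption of PCA is satisfied.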

### 2. Singular Value Decomposition (SVD)
Using NumPy.

In [84]:
U, S, V = np.linalg.svd(X_centered) # note: NumPy returns the transposed matrix, so the rows of V are the principal components

In [85]:
c1 = V.T[:,0] # unit vector that defines the first principal component; same result as V[0]
c2 = V.T[:,1] # unit vector that defines the second principal component; same result as V[1]
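
The principal components returned by SVD should be unit vectors that are orthogonal to each other. A minimal check, assuming a hypothetical standardized array `X_demo` in place of `X_centered`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)   # seed is an assumption, for reproducibility only
X_demo = StandardScaler().fit_transform(rng.random((20, 10)) * 10)

# np.linalg.svd returns the transposed right-singular matrix,
# so each row of Vt is one principal component.
_, _, Vt = np.linalg.svd(X_demo)
c1, c2 = Vt[0], Vt[1]

norm_c1 = np.linalg.norm(c1)     # length of the first component
norm_c2 = np.linalg.norm(c2)     # length of the second component
dot_c1_c2 = c1 @ c2              # zero (up to rounding) when orthogonal
print(norm_c1, norm_c2, dot_c1_c2)
```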

In [86]:
W2 = V.T[:, :2] # keep the first 2 principal components as columns
X2D = X_centered.dot(W2) # project X_centered onto the 2D plane defined by W2

In [87]:
X2D.shape, X2D

((20, 2),
 array([[-2.3965023 ,  0.18066819],
        [ 0.06207904,  0.65892585],
        [-0.0714091 , -0.81968094],
        [ 2.72532889,  0.33760876],
        [-0.98795342,  0.89464246],
        [-0.38030095, -0.06896783],
        [-1.92818995,  0.93635026],
        [ 1.66616118,  2.56742312],
        [-0.11485834, -0.13997117],
        [-1.62908634, -0.34160591],
        [-2.03710146, -2.15219237],
        [ 2.11586979, -2.38246788],
        [ 0.2484445 , -0.0405181 ],
        [ 2.03512913, -1.29778913],
        [ 0.12148528, -0.14124142],
        [ 2.30872033,  0.58622783],
        [-1.81766101,  0.08551069],
        [ 0.1692068 ,  1.28464506],
        [-0.24044141,  2.14986928],
        [ 0.15107933, -2.29743676]]))

### 3. Principal Component Analysis (PCA)
Scikit-Learn's PCA class uses SVD to implement PCA, just as above. It also centers the data automatically.

In [88]:
pca = PCA(n_components=2, random_state=42)

In [89]:
X2D = pca.fit_transform(X_centered)

In [90]:
X2D # same result as X_centered.dot(W2), where W2 = V.T[:, :2] (in general, up to a sign flip per component)

array([[-2.3965023 ,  0.18066819],
       [ 0.06207904,  0.65892585],
       [-0.0714091 , -0.81968094],
       [ 2.72532889,  0.33760876],
       [-0.98795342,  0.89464246],
       [-0.38030095, -0.06896783],
       [-1.92818995,  0.93635026],
       [ 1.66616118,  2.56742312],
       [-0.11485834, -0.13997117],
       [-1.62908634, -0.34160591],
       [-2.03710146, -2.15219237],
       [ 2.11586979, -2.38246788],
       [ 0.2484445 , -0.0405181 ],
       [ 2.03512913, -1.29778913],
       [ 0.12148528, -0.14124142],
       [ 2.30872033,  0.58622783],
       [-1.81766101,  0.08551069],
       [ 0.1692068 ,  1.28464506],
       [-0.24044141,  2.14986928],
       [ 0.15107933, -2.29743676]])
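
Rather than comparing the two printed arrays by eye, the equivalence can be checked numerically. The sketch below uses a hypothetical array `X_demo` and compares absolute values, because the sign of each component is arbitrary and may differ between the two routes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)   # seed is an assumption, for reproducibility only
X_demo = StandardScaler().fit_transform(rng.random((20, 10)) * 10)

# Manual route: project onto the first two rows of Vt.
_, _, Vt = np.linalg.svd(X_demo, full_matrices=False)
X2D_svd = X_demo @ Vt[:2].T

# Scikit-Learn route.
X2D_pca = PCA(n_components=2).fit_transform(X_demo)

# Compare column by column, ignoring sign.
match = np.allclose(np.abs(X2D_svd), np.abs(X2D_pca), atol=1e-8)
print(match)
```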

### 4. Choosing the right number of dimensions
Choose the smallest number of dimensions k that retains a sufficiently large proportion of the variance (e.g., 95%).

In [91]:
pca.explained_variance_ratio_.sum() # 2 components explain only about 39% of the variance of X_centered

0.39313782935364255
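
The explained variance ratio can also be recovered from the singular values of the SVD step: each ratio is the squared singular value divided by the sum of all squared singular values. A small sketch, assuming a hypothetical centered array `X_demo`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)   # seed is an assumption, for reproducibility only
X_demo = StandardScaler().fit_transform(rng.random((20, 10)) * 10)

# Singular values only; the data must already be centered.
S = np.linalg.svd(X_demo, compute_uv=False)
ratio_manual = S**2 / np.sum(S**2)

# Same quantity from Scikit-Learn, keeping all components.
ratio_sklearn = PCA().fit(X_demo).explained_variance_ratio_

ratios_match = np.allclose(ratio_manual, ratio_sklearn)
print(ratios_match)
```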

In [92]:
pca = PCA()

In [93]:
pca.fit(X_centered)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

In [94]:
d # at least 9 of the 10 dimensions are needed, so PCA does not help much here

9
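
Instead of computing the cumulative sum by hand, `n_components` can be set to a float between 0 and 1, in which case Scikit-Learn picks the number of components needed to reach that share of the variance. A minimal sketch on a hypothetical array `X_demo`, comparing both approaches:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)   # seed is an assumption, for reproducibility only
X_demo = StandardScaler().fit_transform(rng.random((20, 10)) * 10)

# Manual route: first index where the cumulative ratio reaches 95%.
cumsum = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
d_manual = int(np.argmax(cumsum >= 0.95)) + 1

# Direct route: let PCA choose the number of components.
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_demo)

print(d_manual, pca_95.n_components_, X_reduced.shape)
```

For uncorrelated random data like this, nearly all dimensions are kept; on real data with correlated features the reduction is usually much larger.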