In [14]:
"""
    Many ML problems involve thousands or even millions 
    of features for each training instance. Not only does this make
    training extremely slow, it can also make it much harder to find
    a good solution!
    
    Fortunately, in real-world problems it is often possible to reduce
    the number of features considerably, turning an intractable
    problem into a tractable one. For instance, the border pixels of
    MNIST images are almost always white => we could drop them from
    the training set completely without losing much information.
    Moreover, two neighbouring pixels are often highly correlated =>
    we can merge them into a single pixel, by taking the mean of the
    two pixel intensities, without losing much information.
    
    Apart from speeding up training, dimensionality reduction is also
    extremely useful for data visualization (aka DataViz). Reducing
    the number of dimensions down to 2 or 3 makes it possible to plot
    a high-dimensional training set on a graph and often gain important
    insights by visually detecting patterns, such as clusters.
    
    The two main approaches to dimensionality reduction (DR) are:
         - projection
         - Manifold Learning

    Popular DR techniques include PCA, Kernel PCA, and LLE.
   
    It turns out that many things behave very differently in
    high-dimensional space. For example, if you pick a random point in a
    unit square (a 1 × 1 square), it will have only about a 0.4% chance of
    being located less than 0.001 from a border (in other words, it is very
    unlikely that a random point will be “extreme” along any dimension).
    But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with
    ten thousand 1s), this probability is greater than 99.999999%. Most
    points in a high-dimensional hypercube are very close to the border.
    
    Projection:
    In most problems, training instances are not spread out uniformly
    across all dimensions. Many features are almost constant, while
    others are highly correlated => all training instances actually
    lie within a much lower-dimensional subspace of the high-dimensional
    space.
    
    For instance, if a 3D dataset lies close to a 2D plane, we can project
    every training instance perpendicularly onto that 2D subspace, hence
    reducing the dataset from 3D to 2D.
    
    PCA first identifies the hyperplane that lies closest to the data, 
    and then it projects the data onto it. 
    
    PCA identifies the axis that accounts for the largest amount
    of variance in the training set. It also finds a second axis,
    orthogonal to the first one, that accounts for the largest amount
    of remaining variance.
    
"""
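
# A quick check of the border-probability figures quoted in the notes above.
# This is a minimal sketch, assuming the point is drawn uniformly from the unit
# hypercube. The helper name prob_near_border is hypothetical, not a library function.
# A point stays away from every border only if each of its n coordinates falls in
# [margin, 1 - margin], so the probability of being near a border is 1 - (1 - 2*margin)**n.

```python
import math


def prob_near_border(n_dims: int, margin: float = 0.001) -> float:
    """Probability that a uniform random point in the unit hypercube lies
    less than `margin` away from at least one border."""
    # Mathematically equal to 1 - (1 - 2 * margin) ** n_dims.
    # log1p/expm1 keep precision when the result is extremely close to 0 or 1.
    return -math.expm1(n_dims * math.log1p(-2 * margin))


# Compare a few dimensionalities, including the two cases from the notes.
for d in (2, 3, 100, 10_000):
    print(f"{d:>6} dims: {prob_near_border(d):.10%}")
```

# With n_dims=2 this gives roughly 0.4%, and with n_dims=10_000 it exceeds 99.999999%,
# in line with the figures in the notes.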
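import numpy as np
from sklearn.decomposition import PCA

# Toy dataset: a constant column of ones plus one random feature in [0, 2).
X = 2 * np.random.rand(100, 1)
X_a = np.c_[np.ones((100, 1)), X]

# Fit PCA and project the data onto its principal components.
# X_a has only 2 features, so n_components=2 rotates the data
# without dropping any dimension.
pca = PCA(n_components=2)
X2D = pca.fit_transform(X_a)

X2D

# Fraction of the total variance carried by each component. The constant
# column has zero variance, so the first component explains everything.
pca.explained_variance_ratio_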

array([1., 0.])
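
The demo above uses 2-feature data, so nothing is actually dropped. To illustrate the 3D-to-2D projection described in the notes, here is a minimal NumPy-only sketch. It assumes a made-up dataset (the seed, `plane_basis`, and noise level are arbitrary choices, not from the notes) whose points lie close to a tilted plane, and it finds the principal axes with an SVD of the centered data, which is one common way to compute PCA.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Hypothetical 3D dataset: 200 points spread over a tilted plane, plus small noise.
m = 200
latent = rng.normal(size=(m, 2)) * np.array([3.0, 1.0])  # 2D coordinates in the plane
plane_basis = np.array([[1.0, 0.5, 0.2],
                        [0.0, 1.0, 0.4]])                # two directions spanning the plane
X3D = latent @ plane_basis + 0.05 * rng.normal(size=(m, 3))

# PCA assumes the data is centered on the origin.
X_centered = X3D - X3D.mean(axis=0)

# Rows of Vt are unit vectors along the principal axes,
# ordered from largest to smallest variance.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep the first two axes and project every instance onto the plane they span.
W2 = Vt[:2].T                 # shape (3, 2)
X2D_manual = X_centered @ W2  # shape (200, 2)

# Share of the total variance captured along each axis.
variance_ratio = s**2 / np.sum(s**2)
print(X2D_manual.shape)
print(variance_ratio)
```

Because the points sit near a plane, the first two entries of `variance_ratio` should account for nearly all of the variance, and the third should be close to zero. scikit-learn's `PCA(n_components=2)` should give the same projection, possibly with the sign of a component flipped.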