# Dimensionality Reduction

Many machine learning problems involve thousands of features, or more, for each training instance, which can make training extremely slow. This problem is often called _the curse of dimensionality_.

Fortunately, it is often possible to reduce the number of features. Doing so not only speeds up training but is also useful for data visualization (by reducing the number of features down to two).

There are two main approaches to dimensionality reduction: projection and Manifold Learning.

## Projection

Some features are almost constant, while others are highly correlated. As a result, training instances lie close to some lower dimensional subspace of the high dimensional space. 

Here all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space.

![source](img/subspace_PCA.png)
Source: Hands on Machine Learning with Scikit-Learn and Keras

Projecting every training instance onto this subspace gives us a new 2D dataset.

![source](img/projection.png)
Source: Hands on Machine Learning with Scikit-Learn and Keras

The axes correspond to the new features $z_1$ and $z_2$.

In [1]:
# Generate a noisy 3D dataset whose points lie close to a 2D plane
%matplotlib inline

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

### PCA in Scikit-Learn

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm.

It identifies the hyperplane that lies closest to the data, and then it projects the data onto it.

To project the training set onto a lower-dimensional hyperplane, we first need to choose the right hyperplane: the one that preserves the maximum amount of variance, because it will most likely lose less information than the other projections.

Example of selecting the subspace onto which to project:

![source](img/selecting_subspace.png)
Source: Hands on Machine Learning with Scikit-Learn and Keras

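PCA identifies the axis that accounts for the largest amount of variance in the training set. It then finds a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance, and so on, up to as many axes as the dataset has dimensions. The unit vector that defines the $i^{th}$ axis is called the $i^{th}$ principal component (PC).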

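To find the principal components of a training set, there is a standard matrix factorization technique called Singular Value Decomposition (SVD). It decomposes the training set matrix X into the product of three matrices $U \cdot \Sigma \cdot V^T$, where the rows of $V^T$ (that is, the columns of $V$) are the principal components that we are looking for. Note that PCA assumes the dataset is centered around the origin. Scikit-Learn's PCA class centers the data for us, but if we compute the SVD ourselves we must subtract the mean first.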
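As a minimal sketch of the SVD step, the snippet below rebuilds the dataset from cell `In [1]`, centers it, and reads the principal components off the rows of `Vt`. The variable names (`X_centered`, `c1`, `c2`) are illustrative choices, not something fixed by NumPy.

```python
import numpy as np

# Same synthetic 3D dataset as in cell In [1]
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

# PCA assumes the data is centered around the origin
X_centered = X - X.mean(axis=0)

# Compact SVD: U is (m, 3), s holds the 3 singular values, Vt is (3, 3)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

c1 = Vt[0]  # first principal component (unit vector)
c2 = Vt[1]  # second principal component, orthogonal to c1
```

The signs of `c1` and `c2` may differ from what Scikit-Learn reports, since a principal component and its negation define the same axis.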

After identifying all the principal components, we can reduce the dimensionality of the dataset to _d_ dimensions by projecting it onto the hyperplane defined by the first _d_ principal components.
This is done simply by computing the dot product of the training set matrix __X__ and the matrix $\textbf{W}_d$ (the matrix whose columns are the first _d_ principal components).

\begin{equation*}
\mathbf{X}_{d\text{-proj}} = \mathbf{X} \cdot \mathbf{W}_d
\end{equation*}
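The projection formula can be sketched in NumPy as follows. The small random dataset is a hypothetical stand-in (any centered matrix would do), and `W2` plays the role of $\textbf{W}_d$ with _d_ = 2.

```python
import numpy as np

# Hypothetical stand-in data: 100 instances, 3 features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])
X_centered = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# W_d: the first d principal components, stored as columns
d = 2
W2 = Vt[:d].T

# Project the (centered) training set onto the hyperplane spanned by the first d PCs
X2D = X_centered @ W2
```

Each row of `X2D` holds the coordinates of one instance along the first two principal components.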

In [2]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)

A very useful piece of information is the explained variance ratio of each principal component: the proportion of the dataset’s variance that lies along the axis of that component. Here it tells us that 84.2% of the dataset’s variance lies along the first axis and 14.6% lies along the second axis, which leaves about 1.1% for the third axis.

In [3]:
pca.explained_variance_ratio_

array([0.84248607, 0.14631839])
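To see where these ratios come from, here is a rough sketch that derives them from the singular values of the centered data. The dataset is a hypothetical stand-in, and the variable names are our own. With the same input, the result should agree with Scikit-Learn's `explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])
X_centered = X - X.mean(axis=0)

_, s, _ = np.linalg.svd(X_centered, full_matrices=False)

# Variance along each principal component (m - 1 normalization, like np.var(ddof=1))
explained_variance = s ** 2 / (len(X) - 1)

# Share of the total variance carried by each component
explained_variance_ratio = explained_variance / explained_variance.sum()
```

The ratios are sorted in decreasing order and sum to 1 when all components are kept.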

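Now we can map the projected 2D points back into the original 3D space. Because some information was lost during the projection, the recovered points lie on the plane and are close to, but not identical to, the original points.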

In [4]:
X3D_inv = pca.inverse_transform(X2D)
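As a sketch of what the inverse transformation amounts to with the default PCA settings (no whitening), we can undo the projection with the transposed components and then add back the mean. The stand-in data and the name `X3D_manual` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

# Rows of components_ are the principal components, so this maps 2D -> 3D,
# then mean_ undoes the centering applied during fit
X3D_manual = X2D @ pca.components_ + pca.mean_
```

Under these assumptions, `X3D_manual` should match `pca.inverse_transform(X2D)` up to floating-point error.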

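The output of PCA is not perfectly stable: running it on slightly different datasets may give different results. In particular, some principal components may point in the opposite direction, which flips the sign of the corresponding projected coordinates, even though they generally still lie along the same axes.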

In [5]:
X2D[:5]

array([[ 1.26203346,  0.42067648],
       [-0.08001485, -0.35272239],
       [ 1.17545763,  0.36085729],
       [ 0.89305601, -0.30862856],
       [ 0.73016287, -0.25404049]])
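One source of such differences is the sign of a principal component. The sketch below (with hypothetical stand-in data) flips the first component by hand and shows that the projected coordinates change sign while the reconstruction stays the same.

```python
import numpy as np

# Hypothetical stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])
X_centered = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt[:2].T              # first two principal components as columns
W2_flipped = W2.copy()
W2_flipped[:, 0] *= -1     # same axis, opposite direction

X2D = X_centered @ W2
X2D_flipped = X_centered @ W2_flipped

# The first coordinate changes sign, but the reconstructed 3D points are identical
same_reconstruction = np.allclose(X2D @ W2.T, X2D_flipped @ W2_flipped.T)
```

So two runs that disagree only in the sign of a component describe the same plane and lose the same information.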

We can compute the reconstruction error: the mean squared distance between the original points and the recovered points.

In [6]:
np.mean(np.sum(np.square(X3D_inv - X), axis=1))

0.010170337792848549
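The reconstruction error is tied directly to the components we dropped. In this sketch (hypothetical stand-in data, 3 features reduced to 2), it equals the squared singular value of the discarded third component divided by the number of instances.

```python
import numpy as np

# Hypothetical stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])
X_centered = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt[:2].T

# Project to 2D and map back to 3D (everything stays centered here)
X_recovered = (X_centered @ W2) @ W2.T

# Mean squared distance between original and recovered points
reconstruction_error = np.mean(np.sum(np.square(X_recovered - X_centered), axis=1))

# Contribution of the discarded third component, averaged over the instances
discarded = s[2] ** 2 / len(X)
```

This is why preserving more variance means a smaller reconstruction error.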

The PCA object gives access to the principal components that it computed:

In [7]:
pca.components_

array([[-0.93636116, -0.29854881, -0.18465208],
       [ 0.34027485, -0.90119108, -0.2684542 ]])
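Each row of this array is one principal component. As a quick sanity check, the sketch below copies the printed values and confirms that the rows are (approximately) unit vectors and orthogonal to each other. The names `norms` and `dot` are ours.

```python
import numpy as np

# Values copied from the pca.components_ output above
components = np.array([[-0.93636116, -0.29854881, -0.18465208],
                       [ 0.34027485, -0.90119108, -0.2684542 ]])

# Length of each principal component (expected to be close to 1)
norms = np.linalg.norm(components, axis=1)

# Dot product of the two components (expected to be close to 0)
dot = components[0] @ components[1]
```

Small deviations from exactly 1 and 0 come from the rounding of the printed values.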

### Choosing the Right Number of Dimensions

It is good practice to choose the number of dimensions whose explained variance adds up to a sufficiently large portion of the total variance (for example, 95%).

The code below computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set’s variance.

In [8]:
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_) 
d = np.argmax(cumsum >= 0.95) + 1
print(d)

2
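To make the `np.argmax` trick explicit, here is a small worked example that uses the two explained variance ratios printed earlier. The third ratio is taken as the remainder, which holds here because the dataset has only three features.

```python
import numpy as np

# Ratios printed earlier, plus the remainder for the third axis
ratios = np.array([0.84248607, 0.14631839, 1 - 0.84248607 - 0.14631839])

cumsum = np.cumsum(ratios)      # approximately [0.842, 0.989, 1.0]
reached = cumsum >= 0.95        # [False, True, True]

# argmax returns the index of the first True, so add 1 to get a count
d = np.argmax(reached) + 1
```

The first axis alone preserves only about 84% of the variance, so two dimensions are needed to reach 95%.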


This can be done more easily in Scikit-Learn by setting n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:

In [9]:
pca = PCA(n_components=0.95) 
X_reduced = pca.fit_transform(X)
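After fitting, the number of components that was actually selected is available as `pca.n_components_`. The sketch below, on hypothetical stand-in data, compares the manual cumulative-sum approach with this shortcut.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])

# Manual route: fit all components, then find where the cumulative ratio reaches 95%
cumsum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

# Shortcut: let Scikit-Learn pick the number of components
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

n_selected = pca.n_components_  # number of components kept by the shortcut
```

Both routes should agree on the number of dimensions, and `X_reduced` has that many columns.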

However, projection is not always the best approach to dimensionality reduction. In many cases the subspace twists and turns, as in the Swiss roll dataset below:

![source](img/swiss_roll.png)
Source: Hands on Machine Learning with Scikit-Learn and Keras


Projecting onto a plane would squash different layers of the Swiss roll together.

![source](img/projected_swiss_roll.png)
Source: Hands on Machine Learning with Scikit-Learn and Keras
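The squashing effect can be sketched numerically. The parametrization below is an assumption (a common way to generate a noiseless Swiss roll, with `t` as the position along the spiral), and `swiss_roll` is a hypothetical helper. We pick four points on different layers of the roll and compare their distances after projecting versus after unrolling.

```python
import numpy as np

def swiss_roll(t, y):
    """Assumed parametrization: t is the position along the spiral, y the width."""
    return np.column_stack([t * np.cos(t), y, t * np.sin(t)])

# Four points at the same width, one on each half-turn where cos(t) = 0
t = np.pi * np.array([1.5, 2.5, 3.5, 4.5])
y = np.full(4, 10.0)
X = swiss_roll(t, y)

projected = X[:, :2]                 # project onto the (x1, x2) plane by dropping x3
unrolled = np.column_stack([t, y])   # coordinates along the manifold itself

# Distance between consecutive points in each 2D representation
projected_gaps = np.linalg.norm(np.diff(projected, axis=0), axis=1)
unrolled_gaps = np.linalg.norm(np.diff(unrolled, axis=0), axis=1)
```

In the projection the four points collapse onto (almost) the same location, while in the unrolled coordinates they stay well separated.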

## Manifold Learning

The Swiss roll is an example of a 2D manifold, which is a 2D shape that can be bent and twisted in a higher-dimensional space. 

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie. This is called Manifold Learning. It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.

The manifold assumption is often accompanied by another implicit assumption: that the task will be simpler if expressed in the lower-dimensional space of the manifold.

![source](img/manifold.png)
Source: Hands on Machine Learning with Scikit-Learn and Keras