# 08. Dimensionality Reduction

The enormous scale of available data is a double-edged sword. Datasets with a very large number of features can be slow to process, a problem usually referred to as the **curse of dimensionality**.

In real-world applications, we call the methods used to reduce this size to a tractable scale **dimensionality reduction**.
Clear advantages are: 
1. Lower memory requirements (obviously)
2. Faster computation 
3. Ease of visualization, if we manage to get down to 2D/3D

Generally speaking, there are two main approaches to dimensionality reduction: projection and manifold learning.

#### Projection

The idea behind projection is fairly simple: in most real-world problems, training instances are not spread out uniformly across all dimensions, because many features are highly correlated. As a result, all training instances lie within (or close to) a much lower-dimensional _subspace_ of the high-dimensional space.
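
A minimal sketch of this idea, using made-up toy data (the names `X_toy`, `coords_2d` and `plane_basis` are illustrative only): points that lie almost on a 2D plane inside 3D space have very little spread along the third orthogonal direction.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 200 points on a 2D plane embedded in 3D,
# plus a small amount of noise that pushes them slightly off the plane.
m = 200
coords_2d = rng.normal(size=(m, 2))            # position within the plane
plane_basis = np.array([[1.0, 0.0, 0.5],
                        [0.0, 1.0, 0.5]])      # two 3D vectors spanning the plane
X_toy = coords_2d @ plane_basis + 0.01 * rng.normal(size=(m, 3))

# Singular values of the centered data measure the spread along each
# orthogonal direction; the third is tiny because the data is nearly flat.
singular_values = np.linalg.svd(X_toy - X_toy.mean(axis=0), compute_uv=False)
print(singular_values)
```

Dropping the direction with the smallest spread would lose almost nothing here, which is exactly what projection exploits.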

#### Manifold Learning

A _d-dimensional_ manifold is a part of an _n-dimensional_ space (where d < n) that locally resembles a _d-dimensional_ hyperplane. 

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called **Manifold Learning**. This works due to two main hypotheses:

1. Most real-world high-dimensional datasets lie close to a much lower-dimensional manifold, a hypothesis that very often holds empirically.
2. The task at hand will be easier to perform in the lower-dimensional space. This does not always hold, so we should check whether it does for our problem.
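
As an illustration, here is a hand-built Swiss-roll-like dataset (a sketch with assumed constants, not tied to any particular library generator): a 2D sheet described by two coordinates `t` and `w` is rolled up inside 3D space, and the position along the roll can be recovered from the 3D points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two intrinsic coordinates: position along the roll (t) and across its width (w).
m = 500
t = 1.5 * np.pi * (1 + 2 * rng.random(m))
w = 20 * rng.random(m)

# Embed the 2D sheet in 3D by rolling it up.
X_roll = np.column_stack([t * np.cos(t), w, t * np.sin(t)])

# "Unrolling" recovers the intrinsic coordinate: the radius in the x-z plane equals t.
t_recovered = np.hypot(X_roll[:, 0], X_roll[:, 2])
print(X_roll.shape, np.allclose(t_recovered, t))
```

The data has three features, yet two numbers per instance describe it completely. A plain projection onto a plane would squash the layers of the roll together, which is why manifold learning tries to unroll it instead.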

### Principal Component Analysis 

PCA is by far the most popular dimensionality reduction algorithm. It works by identifying the hyperplane closest to the data and projecting the data onto it.

The hyperplane is chosen as the one that **preserves the most variance**, since it will most likely lose the least information.
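
To make the role of variance concrete, the sketch below (toy data and the helper `projected_variance` are illustrative assumptions) projects a stretched 2D cloud onto two different axes and compares how much variance each projection keeps.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2D data: long spread on one axis, short on the other,
# then rotated so that the long axis lies along the diagonal.
z = rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X_demo = z @ rotation.T
X_demo = X_demo - X_demo.mean(axis=0)

def projected_variance(X, theta):
    """Variance of X after projecting it onto the unit vector at angle theta."""
    axis = np.array([np.cos(theta), np.sin(theta)])
    return np.var(X @ axis)

var_along_diagonal = projected_variance(X_demo, np.pi / 4)
var_across_diagonal = projected_variance(X_demo, 3 * np.pi / 4)
print(var_along_diagonal, var_across_diagonal)
```

Projecting onto the diagonal keeps most of the spread, while the perpendicular axis collapses the points into a narrow band. PCA picks the former.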

#### Principal components

After identifying the first axis, we find a second one, orthogonal to the first, that retains as much of the remaining variance as possible, and so on for each further axis.

The unit vector that defines the $i^{th}$ axis is called the $i^{th}$ principal component. 

We can find the PCs of a training set with a standard matrix factorization technique known as **Singular Value Decomposition** (SVD), which decomposes the training matrix $X$ into the product of three matrices, $U \Sigma V^T$, where the columns of $V$ are the PCs.

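Using NumPy:

In [None]:
import numpy as np

# Center the data, then factorize it; the rows of Vt are the principal components.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]  # first principal component
c2 = Vt.T[:, 1]  # second principal component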

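**Note**: PCA assumes that the dataset is centered around the origin. Scikit-learn's PCA classes take care of centering for us, but when implementing PCA by hand, as with the SVD approach, we must center the data first.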
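
The properties of the factorization can be checked numerically. The sketch below uses made-up data and separate variable names (`X_small`, `U_small`, `s_small`, `Vt_small`) so that it does not interfere with the cells of this notebook.

```python
import numpy as np

rng = np.random.default_rng(2)
X_small = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 3))  # hypothetical data
X_small_centered = X_small - X_small.mean(axis=0)

# full_matrices=False returns the compact factors: U (50x3), s (3,), Vt (3x3).
U_small, s_small, Vt_small = np.linalg.svd(X_small_centered, full_matrices=False)

# The three factors multiply back to the centered data.
reconstructed = U_small @ np.diag(s_small) @ Vt_small
print(np.allclose(reconstructed, X_small_centered))

# The rows of Vt (the principal components) are orthonormal unit vectors.
print(np.allclose(Vt_small @ Vt_small.T, np.eye(3)))

# Singular values come back sorted from largest to smallest.
print(np.all(np.diff(s_small) <= 0))
```

Because the singular values are sorted, the first rows of `Vt` are always the directions with the most variance.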

Once we have identified all the principal components, we can reduce the dimensionality of the dataset down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components.

To do so, we multiply the (centered) training matrix $X$ by the matrix $W_d$ whose columns are the first $d$ PCs:

$X_{d\text{-proj}} = X W_d$

In [None]:
W2 = Vt.T[:, :2]          # first two principal components as columns
X2D = X_centered.dot(W2)  # project the centered data onto the plane they define

Using Scikit-learn:

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
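
As a sanity check, the two routes should agree. The sketch below compares them on made-up data (`X_check` is a hypothetical stand-in for a real dataset). Each principal component is only defined up to its sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X_check = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # hypothetical data

# Manual projection through the SVD of the centered data.
X_check_centered = X_check - X_check.mean(axis=0)
_, _, Vt_check = np.linalg.svd(X_check_centered, full_matrices=False)
manual_2d = X_check_centered @ Vt_check[:2].T

# Scikit-learn centers the data internally.
pca_check = PCA(n_components=2)
sklearn_2d = pca_check.fit_transform(X_check)

# Compare up to the sign of each component.
print(np.allclose(np.abs(manual_2d), np.abs(sklearn_2d), atol=1e-6))
print(pca_check.components_.shape)  # one row per principal component
```

After fitting, the `components_` attribute holds the principal components as rows, playing the role of the first rows of `Vt` in the NumPy version.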

#### Explained Variance Ratio

EVR indicates the proportion of the dataset's variance that lies along each principal component. It is a good proxy for how much information we are retaining. Scikit-learn exposes it through the `explained_variance_ratio_` attribute of a fitted `PCA`.

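This concept is closely related to the choice of the number of dimensions. In most cases, we start from a target fraction of variance (say 95%) and choose the minimum number of dimensions that preserves it. In Scikit-learn, setting `n_components` to a float between 0.0 and 1.0 does exactly this: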

In [None]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
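
The same choice can be made by hand from the cumulative explained variance. This is a sketch on made-up data (`X_evr` is hypothetical), which also checks that the float shortcut selects the same number of components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical data: 10 features whose spread shrinks from one feature to the next.
X_evr = rng.normal(size=(300, 10)) * np.linspace(5.0, 0.5, 10)

# Fit without reducing dimensionality to see every component's share.
pca_full = PCA()
pca_full.fit(X_evr)

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Position of the first component at which the running total reaches 95%.
d = int(np.argmax(cumulative >= 0.95)) + 1
print(d, cumulative[d - 1])

# Passing the ratio directly should select the same number of components.
pca_95 = PCA(n_components=0.95).fit(X_evr)
print(pca_95.n_components_)
```

Plotting `cumulative` against the number of components is also a common way to spot an elbow where adding dimensions stops paying off.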

#### Randomized PCA

To make things even faster, we can rely on randomized PCA, a stochastic algorithm that quickly finds an approximation of the first $d$ principal components.

In [None]:
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)
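
To get a feel for the quality of the approximation, the sketch below compares the two solvers on made-up low-rank data (`X_rnd` is hypothetical, and results on real datasets may differ).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Hypothetical data with a clear low-rank structure plus a little noise.
X_rnd = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 50))
X_rnd += 0.01 * rng.normal(size=X_rnd.shape)

full_pca = PCA(n_components=5, svd_solver="full").fit(X_rnd)
approx_pca = PCA(n_components=5, svd_solver="randomized", random_state=42).fit(X_rnd)

# On data like this, both solvers should report nearly the same variance split.
print(full_pca.explained_variance_ratio_)
print(approx_pca.explained_variance_ratio_)
```

Setting `random_state` makes the randomized solver reproducible from run to run.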

#### Incremental PCA

Unlike randomized PCA, Incremental PCA does not require the whole training set to fit in memory: it processes the data in mini-batches, making it suitable for large datasets and online learning.
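
A minimal sketch with Scikit-learn's `IncrementalPCA`, assuming a made-up array `X_big` that stands in for a training set too large to process at once. The batch count and number of components are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(6)
X_big = rng.normal(size=(1000, 20))  # hypothetical stand-in for a large training set

n_batches = 10
inc_pca = IncrementalPCA(n_components=5)
# Feed the data one mini-batch at a time; only the current batch must be in memory.
for X_batch in np.array_split(X_big, n_batches):
    inc_pca.partial_fit(X_batch)

X_big_reduced = inc_pca.transform(X_big)
print(X_big_reduced.shape)
```

Note the call to `partial_fit()` on each mini-batch rather than `fit()` on the whole array. In a real setting the batches would be read from disk or from a stream.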

### Kernel PCA