## Chapter 8. Dimensionality Reduction

Purpose of dimensionality reduction:
 - _Curse of dimensionality_: a large number of features $\rightarrow$ training is slow and a good solution is harder to find.
 - Data visualization.

Main approaches:
 - Projection
 - Manifold Learning
 
Techniques:
 - PCA
 - Kernel PCA
 - LLE
 
### The Curse of Dimensionality

High-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. This also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has, the greater the risk of overfitting it.

In theory, one solution could be to increase the size of the training set. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
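
The sparsity argument above can be checked numerically. The sketch below is only an illustration under assumed settings (a unit hypercube, uniformly random points, and a hypothetical helper `mean_pair_distance`): it estimates how far apart two random points are as the number of dimensions grows, and prints how many instances a regular grid with 10 cells per axis would need.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_pair_distance(n_dims, n_pairs=2000):
    """Estimate the mean distance between two random points in a unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

for n_dims in (2, 3, 100, 1000):
    # A grid with 10 cells per axis needs 10**n_dims instances to keep the same density.
    print(f"n={n_dims:>4}  mean distance={mean_pair_distance(n_dims):.2f}  "
          f"grid instances needed=1e{n_dims}")
```

The mean distance keeps growing with the number of dimensions, while the number of instances needed to keep a fixed density grows as a power of ten per added dimension.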

### Main Approaches for Dimensionality Reduction

#### Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.

#### Manifold Learning

A $d$-dimensional manifold is a part of an $n$-dimensional space (where $d < n$) that locally resembles a $d$-dimensional hyperplane. (Swiss roll, $d=2, \ n=3$) 
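
As a rough illustration of the Swiss roll case, the sketch below builds a 3D dataset from two intrinsic coordinates using NumPy only. The constants (`1.5 * np.pi`, `21`) and the variable names are assumptions chosen for the example, not values taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Two intrinsic coordinates (d = 2): position along the roll and across its width.
t = 1.5 * np.pi * (1 + 2 * rng.random(n_samples))
width = 21 * rng.random(n_samples)

# Embed them in 3D (n = 3) by rolling the sheet up in the x-z plane.
X = np.column_stack([t * np.cos(t), width, t * np.sin(t)])

# Unrolling recovers the first intrinsic coordinate: t is the radius in the x-z plane.
t_recovered = np.hypot(X[:, 0], X[:, 2])
print(X.shape, np.allclose(t, t_recovered))
```

Every instance has three features, yet two numbers (`t` and `width`) are enough to locate it on the roll.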

_Manifold Learning_: modeling the manifold on which the training instances lie.

_Manifold assumption_ (_manifold hypothesis_): most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. (often empirically observed)

Another implicit assumption: the task will be simpler if expressed in the lower-dimensional space of the manifold. 

Reducing the dimensionality of your training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.

### PCA

_Principal Component Analysis_ (PCA) is the most popular dimensionality reduction algorithm. It first identifies the hyperplane that lies closest to the data, then projects the data onto it.

#### Preserving the Variance

Select the axis that
 - preserves the maximum amount of variance.
 - minimizes the mean squared distance between the original dataset and its projection onto that axis.
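
The two criteria above select the same axis. For centered data and a unit vector, the variance kept along the axis plus the mean squared distance to the axis equals the total variance, so maximizing one minimizes the other. The sketch below checks this on a hypothetical 2D dataset; the data and the candidate angles are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2D dataset, stretched along one direction, then centered.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

total = (X ** 2).sum(axis=1).mean()  # total variance of the centered data

for angle in np.linspace(0, np.pi, 7)[:-1]:
    w = np.array([np.cos(angle), np.sin(angle)])  # unit vector defining the axis
    proj = X @ w                                  # 1D coordinates along the axis
    kept = (proj ** 2).mean()                     # variance preserved by the projection
    # Mean squared distance between each instance and its projection onto the axis.
    lost = ((X - np.outer(proj, w)) ** 2).sum(axis=1).mean()
    print(f"angle={angle:.2f}  kept={kept:.3f}  lost={lost:.3f}  sum={kept + lost:.3f}")
```

The `sum` column is the same for every angle and matches `total`.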

#### Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set. It then finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance, and so on for as many axes as the dataset has dimensions.

_$i^{th}$ principal component_ (PC): unit vector of the $i^{th}$ axis.

The principal components can be found with _Singular Value Decomposition_ (SVD), which factorizes the training set matrix $\mathbf{X}$:

$$ \mathbf{X} = \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T$$

_Principal component matrix_ ($n \times n$): the columns of $\mathbf{V}$ are the principal components.

$$ \mathbf{V} = 
\begin{pmatrix}
 \vdots & \vdots  &  & \vdots \\ 
 \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \\ 
 \vdots & \vdots &  & \vdots
\end{pmatrix} $$

<font color=red>_WARNING_</font>
>PCA assumes that the dataset is centered around the origin.
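
A minimal NumPy sketch of these steps, assuming a small hypothetical dataset: center the data, run SVD, and read off the principal components. Note that `np.linalg.svd` returns the transposed matrix (named `Vt` here), so the principal components are its rows.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical dataset: 200 instances, 3 features, deliberately not centered.
X = rng.normal(size=(200, 3)) * [5.0, 2.0, 0.5] + [10.0, -4.0, 3.0]

# PCA assumes centered data, so subtract the per-feature mean first.
X_centered = X - X.mean(axis=0)

# Reduced SVD: X_centered = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt (columns of V) are the unit vectors of the principal axes.
c1 = Vt[0]
c2 = Vt[1]
print("first PC: ", c1)
print("second PC:", c2)
```

The sign of each component is arbitrary: flipping a vector gives the same axis.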

#### Projecting Down to d Dimensions

_Projecting the training set down to $d$ dimensions_

$$\mathbf{X}_{d\text{-proj}} = \mathbf{X} \cdot \mathbf{W}_d$$

where $\mathbf{W}_d$ is the $n \times d$ matrix whose columns are the first $d$ principal components.
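
The projection formula can be sketched in NumPy as follows. The dataset and the choice `d = 2` are assumptions made for the example, and the data is centered before the SVD.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3)) * [5.0, 2.0, 0.5]  # hypothetical dataset
X_centered = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 2
W_d = Vt[:d].T               # n x d matrix: first d principal components as columns
X_d_proj = X_centered @ W_d  # m x d reduced dataset

# Mapping back to n dimensions gives each instance's projection onto the hyperplane.
X_recovered = X_d_proj @ W_d.T
print(X_d_proj.shape, np.mean((X_centered - X_recovered) ** 2))
```

The printed error is the information lost by dropping the remaining component; it is small when that component carries little variance.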

#### Explained Variance Ratio

The proportion of the dataset's variance that lies along the axis of each principal component.
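
One way to compute this ratio, sketched under the assumption that the principal components come from an SVD of the centered data: the variance along the $i^{th}$ component is proportional to the square of the $i^{th}$ singular value, so the ratio is each squared singular value divided by their sum. The dataset below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3)) * [5.0, 2.0, 0.5]  # hypothetical dataset
X_centered = X - X.mean(axis=0)

_, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance along component i is s[i]**2 / (m - 1); the (m - 1) cancels in the ratio.
explained_variance_ratio = s ** 2 / np.sum(s ** 2)
print(explained_variance_ratio)

# Cross-check: measure the variance of the data projected onto each component.
proj_var = (X_centered @ Vt.T).var(axis=0, ddof=1)
print(proj_var / proj_var.sum())
```

Both printed arrays agree, and their entries sum to 1.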

#### Choosing the Right Number of Dimensions



