In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('default')
from metrics import accuracy

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Curse of Dimensionality
Many datasets have thousands or even millions of features per training instance. Using all of them will likely slow down training.     
Consider that:
* Some features may add little information and could be discarded or ignored.
* Other features may be so highly correlated that they could be merged into one with little loss of information.
   
Furthermore, most people can readily understand 3 dimensions but begin to struggle with 4 dimensions, let alone thousands.    

Other particularities of high-dimensional data:
* In a unit square (1x1), a random point is unlikely to be extreme along any dimension, whereas in a high-dimensional unit hypercube (e.g. 10,000 dimensions) almost every point lies very close to a border.
* High-dimensional datasets are at risk of being very sparse: the average distance between two random points is about 0.52 in a unit square and about 0.66 in a 3D unit cube, but about 40.8 in a 10,000-dimensional unit hypercube and about 408.25 in a 1,000,000-dimensional one (roughly sqrt(d/6) for d dimensions).
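
Both properties can be checked with a quick Monte Carlo estimate. The sketch below is illustrative only: the helper names, the sample sizes, and the 0.001 border margin are assumptions chosen for this example, and the estimates vary slightly with the random seed.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_pairwise_distance(n_dims, n_pairs=500):
    """Estimate the average distance between two uniform random points in a unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

def fraction_near_border(n_dims, n_points=500, margin=0.001):
    """Estimate the share of random points within `margin` of a border along any dimension."""
    x = rng.random((n_points, n_dims))
    near = (x < margin) | (x > 1 - margin)
    return near.any(axis=1).mean()

for d in (2, 3, 10_000):
    print(f"{d:>6} dims: mean distance {mean_pairwise_distance(d):.2f}, "
          f"near border {fraction_near_border(d):.3f}")
```

The mean distances should come out close to 0.52, 0.66 and 40.8, while the share of points near a border should go from well under 1% in two dimensions to essentially 100% in 10,000 dimensions.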

In theory, you could increase the amount of training data to counter the sparseness of high-dimensional data.
In practice this is unrealistic, because the number of training instances required to reach a given density grows exponentially with the number of features.
For example, with just 100 features, keeping training instances within 0.1 of each other on average would require more than the "number of atoms in our observable universe" (Aurélien Géron).
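
A back-of-the-envelope calculation gives a feel for this exponential growth. The sketch assumes points placed on a regular grid with spacing 0.1, which is a simplification of "average distance 0.1", and uses the commonly cited rough estimate of 10^80 atoms in the observable universe. Both are assumptions made for illustration; `grid_points_needed` is a hypothetical helper.

```python
def grid_points_needed(n_dims, spacing=0.1):
    """Number of points on a regular grid with the given spacing in a unit hypercube."""
    points_per_axis = round(1 / spacing)
    return points_per_axis ** n_dims

# Rough, commonly cited order of magnitude (assumption, not a figure from the text)
ATOMS_IN_UNIVERSE = 10 ** 80

for d in (1, 2, 3, 100):
    n = grid_points_needed(d)
    print(f"{d:>3} dims: about 10^{len(str(n)) - 1} points")

print(grid_points_needed(100) > ATOMS_IN_UNIVERSE)
```

Each extra feature multiplies the required number of points by ten, so 100 features already need about 10^100 grid points.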

## Main approaches to dimensionality reduction
### Projection
In real-world datasets, most points lie within (or close to) a lower-dimensional subspace, meaning you may not need the full set of features to get a good approximation of your data.

Projection essentially squashes higher-dimensional points down onto a lower-dimensional subspace. In some cases this works well because the higher-dimensional data already has roughly the shape it would have in the lower dimension. In other cases, such as the Swiss roll toy dataset, different layers of the roll would be projected on top of each other, losing valuable information.
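
The Swiss roll case can be illustrated with a short experiment. The sketch below uses `make_swiss_roll` from scikit-learn and compares, for each point, how far away its nearest neighbor sits along the roll, first in the original 3D space and then after dropping the third coordinate. The helper `mean_neighbor_gap`, the sample size, and the choice of which coordinate to drop are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# t is each point's position along the rolled-up sheet
X, t = make_swiss_roll(n_samples=500, noise=0.0, random_state=42)

def mean_neighbor_gap(points, t):
    """Average difference in roll position t between each point and its nearest neighbor."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself
    nearest = dist.argmin(axis=1)
    return np.abs(t - t[nearest]).mean()

gap_3d = mean_neighbor_gap(X, t)                # neighbors in the original 3D space
gap_projected = mean_neighbor_gap(X[:, :2], t)  # neighbors after dropping the third coordinate

print(f"3D: {gap_3d:.2f}, projected: {gap_projected:.2f}")
```

A larger gap after projection means that points from different layers of the roll have landed next to each other, which is exactly the information loss described above.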


### Manifold Learning
Manifold: a lower-dimensional shape that has been bent or twisted in a higher-dimensional space. (TODO put images here for better visualization)
(An example is the Swiss roll toy dataset.)

Manifold hypothesis: Most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often observed empirically.

**As a note:**\
While dimensionality reduction will usually speed up training, it will not always improve predictions. In some cases the decision boundary is simpler in the higher-dimensional space, whereas in others it is simpler in the lower-dimensional one. That is why, before applying dimensionality reduction, a model should first be trained and evaluated on the full set of features.

# PCA: Principal Component Analysis

PCA reduces dimensionality by identifying the hyperplane that lies closest to the data and then projecting the data onto this hyperplane.
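
One common way to find that hyperplane is through the singular value decomposition of the centered data. The following is a minimal sketch under that assumption, not necessarily the implementation used later in this notebook. The function name `pca_project` and the toy data are made up for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its first n_components principal components (sketch via SVD)."""
    X_centered = X - X.mean(axis=0)  # PCA assumes the data is centered on the origin
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:n_components].T          # shape (n_features, n_components)
    return X_centered @ W

# Toy data (made up for this example): 3D points lying close to a 2D plane
rng = np.random.default_rng(0)
plane = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.5, 0.2],
                                              [0.3, 1.0, 0.8]])
X = plane + 0.01 * rng.normal(size=(200, 3))

X_2d = pca_project(X, n_components=2)
print(X_2d.shape)  # (200, 2)
```

Because the toy points lie almost exactly on a plane, the two retained components keep nearly all of the variance of the original three features.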