# Dimensionality Reduction

Dimensionality reduction reduces the number of features in a dataset while trying to preserve as much of the useful information and structure as possible. It also helps with data visualization and can make the data easier to understand. Reducing dimensionality always loses some information: it usually speeds up training, but it often makes the model perform slightly worse.<br> <b>Try training with the full dataset before using any dimensionality reduction.</b>

## Curse of Dimensionality 
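As the number of dimensions grows, the volume of the space grows so quickly that a fixed number of instances becomes very sparse. In high-dimensional datasets, most instances are likely to be far from each other, so a new instance is likely to be far from any training instance and predictions rely on much larger extrapolations. <b>The more dimensions a dataset has, the greater the risk of overfitting.</b>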
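
A rough way to see this effect is to measure the average distance between pairs of random points in a unit hypercube as the number of dimensions grows. The sketch below is only an illustration: the helper `mean_pairwise_distance`, the number of sampled pairs, and the chosen dimensions are assumptions made for this example, not part of the original notebook.

```python
import numpy as np

def mean_pairwise_distance(n_dims, n_pairs=2000, seed=0):
    """Estimate the mean Euclidean distance between two random points
    drawn uniformly from the unit hypercube [0, 1]^n_dims."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return float(np.linalg.norm(a - b, axis=1).mean())

# Hypothetical choice of dimensions, just to show the trend
distances = {d: mean_pairwise_distance(d) for d in (2, 3, 100, 1000)}
for d, dist in distances.items():
    print(f"{d:>5} dims: mean distance ~ {dist:.2f}")
```

The estimate is roughly 0.52 in 2D and keeps growing with the number of dimensions, which is why a fixed-size training set covers a high-dimensional space so thinly.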

## Main Approaches for Dimensionality Reduction

- Projection
- Manifold Learning

## PCA
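Principal component analysis (PCA) is the most popular dimensionality reduction algorithm. It identifies the hyperplane that lies closest to the data and projects the data onto it. The main idea is to choose the axes that minimize the mean squared distance between the original dataset and its projection, which is equivalent to preserving as much of the dataset's variance as possible.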
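
To make the "minimize the mean squared distance" idea concrete, here is a minimal NumPy sketch that finds the principal components with a singular value decomposition (SVD) and compares the projection error with that of a random plane. The toy data (`X_demo`), the `mixing` matrix, and the helper `projection_mse` are assumptions made for this illustration; scikit-learn's `PCA` class may differ in details such as the sign of the components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 3D that lie close to a 2D plane
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0, 0.1],
                   [0.0, 1.0, 0.3]])
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

# PCA assumes the data is centered around the origin
X_centered = X_demo - X_demo.mean(axis=0)

# The rows of Vt are the principal components (unit vectors)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt[:2].T  # first two components as columns, shape (3, 2)

def projection_mse(X, W):
    """Mean squared distance between X and its projection onto the
    subspace spanned by the orthonormal columns of W."""
    X_back = (X @ W) @ W.T
    return float(np.mean(np.sum((X - X_back) ** 2, axis=1)))

mse_pca = projection_mse(X_centered, W2)

# A random 2D plane (orthonormal basis from QR) for comparison
Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))
mse_random = projection_mse(X_centered, Q)

print("PCA plane MSE:   ", mse_pca)
print("Random plane MSE:", mse_random)
```

The plane spanned by the first two principal components should never do worse than any other 2D plane through the mean, so `mse_pca` is expected to be the smaller of the two values.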

In [2]:
import numpy as np

# Build a noisy 3D dataset whose points lie close to a 2D plane
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

In [3]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # n_components is the number of dimensions to keep (here 3D -> 2D)
X2d = pca.fit_transform(X)

In [5]:
# Proportion of the dataset's variance that lies along each principal component.
pca.explained_variance_ratio_

array([0.84248607, 0.14631839])
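
Adding the two ratios above gives about 98.9% of the variance, so the projection to 2D discards roughly 1.1%. The sketch below rebuilds the same dataset, computes that lost share, and uses `inverse_transform` to map the 2D points back to 3D so the reconstruction error can be measured. The variable names `lost_variance`, `X3d_inv`, and `reconstruction_mse` are introduced here for illustration and are not from the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rebuild the same 3D dataset used in the cells above
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

pca = PCA(n_components=2)
X2d = pca.fit_transform(X)

# Share of the variance that is not captured by the two kept components
lost_variance = 1 - pca.explained_variance_ratio_.sum()

# Map the 2D points back into 3D; the result is close to X but not identical
X3d_inv = pca.inverse_transform(X2d)
reconstruction_mse = np.mean(np.sum((X - X3d_inv) ** 2, axis=1))

print("Variance lost:     ", lost_variance)
print("Reconstruction MSE:", reconstruction_mse)
```

The reconstruction error is small but not zero, which matches the point made at the start: the projection always loses some information.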