# Dimensionality reduction: Principal Component Analysis

As the dimensionality of a dataset grows, some of its features are often correlated.
Correlated features can hinder the human analyst, and they are also a source of confusion and inefficiency when training machine learning models.

A number of techniques explicitly aim at reducing the dimensionality of datasets.
They can be used by a human analyst for exploratory data analysis, as a pre- or post-processing step in a data science or machine learning pipeline, or as an initialization technique for costly machine learning methods.
We will start with a classic and intuitive method: [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).

## Principal Component Analysis
The [scikit-learn machine learning library](https://scikit-learn.org) provides an [implementation of PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), along with a number of variants. You can read more about the implementation of PCA and its variants in the [scikit-learn user guide](https://scikit-learn.org/stable/modules/decomposition.html#pca).

In [8]:
# load the iris dataset as a pandas DataFrame ("as_frame=True")
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
# assign the feature matrix and target vector to variables
X = iris.data
y = iris.target

We will reduce the dataset to a 3-dimensional space, where we can create visualizations as we did in the previous notebook.

In [9]:
from sklearn.decomposition import PCA
# create a PCA model that will map the input data to 3 dimensions
pca = PCA(n_components=3)
# compute the PCA for the feature matrix X
pca.fit(X)

The 3 principal components are stored in `pca.components_`.

In [20]:
# TODO display the 3 principal components


What is each of these 3 components, mathematically? What type of mathematical object are they?
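
One way to explore this question numerically is sketched below. It is self-contained and refits the same model; the row labels `PC1` to `PC3` and the variable names `components` and `gram` are illustrative choices, not part of scikit-learn. The sketch labels each component with the original feature names and then computes the pairwise dot products between components.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris(as_frame=True)
X = iris.data

pca = PCA(n_components=3).fit(X)

# one row per principal component, one column per original feature
# (the "PC1".."PC3" labels are illustrative names)
components = pd.DataFrame(
    pca.components_,
    columns=X.columns,
    index=["PC1", "PC2", "PC3"],
)
print(components)

# pairwise dot products between components: a matrix close to the identity
# means the rows are mutually orthogonal vectors of unit length
gram = pca.components_ @ pca.components_.T
print(np.round(gram, 6))
```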

The amount of variance explained by each of the components is stored in `pca.explained_variance_`.

In [21]:
# TODO display the explained variance
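
As a hedged illustration of what these values mean, the sketch below compares `pca.explained_variance_` with the sample variance of the data once projected onto each component. It refits the model so that it runs on its own; `projected_variance` is an illustrative name, and the comparison assumes the default, non-whitened PCA.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris(as_frame=True).data
pca = PCA(n_components=3).fit(X)
X_r = pca.transform(X)

# sample variance (ddof=1) of the projected data along each component
projected_variance = X_r.var(axis=0, ddof=1)

print(pca.explained_variance_)
print(projected_variance)
```

The two printed arrays should agree up to floating-point error, and the values should decrease from the first component to the last.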


It is usually more immediately useful to display the percentage of variance explained by each of the selected components:
- The first principal components explain the most variance. Knowing how much of the variance they explain helps decide how many principal components to keep when compressing the data while preserving a given percentage of its total variance (e.g. 2 components might explain more than 90% of the variance, which may be enough for your needs).
- The last principal components explain less variance, but they are sometimes the most interesting because they capture the variance that corresponds to subtle differences between otherwise very similar target classes.

The fraction of variance explained by each component (a value between 0 and 1) is stored in `pca.explained_variance_ratio_`.
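
The selection described in the first bullet can be sketched as follows. This is a minimal example rather than a recommended procedure: the `target` threshold of 95% is a hypothetical value chosen for illustration, and `cumulative` and `n_keep` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris(as_frame=True).data
pca = PCA(n_components=3).fit(X)

ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of the variance, {c:.1%} cumulative")

# each ratio is the component's variance divided by the total variance of X
total_variance = X.var(ddof=1).sum()
print(np.allclose(ratios, pca.explained_variance_ / total_variance))

# smallest number of components whose cumulative ratio reaches the target
target = 0.95  # hypothetical threshold, adjust to your needs
n_keep = int(np.searchsorted(cumulative, target)) + 1
print(n_keep)
```

If the target is higher than the cumulative ratio of all fitted components, `n_keep` exceeds the number of components in the model, which signals that the model should be refitted with more components.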

In [22]:
# TODO display the explained variance ratios


The Principal Component Analysis object `pca`, fit on the feature matrix `X`, can transform each feature vector `X_i` from 4 dimensions to a reduced `X_r_i` in 3 dimensions.

In [17]:
X_r = pca.transform(X)
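
To make the transformation less opaque, here is a minimal sketch of what `transform` computes for a default (non-whitened) PCA: the data is centered on the per-feature means learned during `fit`, then projected onto the principal components. `X_manual` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris(as_frame=True).data
pca = PCA(n_components=3).fit(X)
X_r = pca.transform(X)

# center on the feature means stored in pca.mean_, then project each
# sample onto the 3 principal components (rows of pca.components_)
X_manual = (X.to_numpy() - pca.mean_) @ pca.components_.T

print(X_r.shape)
print(np.allclose(X_r, X_manual))
```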

We can compare the first two samples with their PCA reduction.

In [23]:
# TODO display the first two feature vectors in X


In [24]:
# TODO display the first two feature vectors in X_r


We can see that `X_r` is quite different from `X`, but plots may make the difference clearer.

## Visualizing the PCA
We can plot the samples in the reduced feature space, just as we did in the original feature space.

Plot the samples on each pair of dimensions in the PCA reduced feature space (pair plots for `X_r` and `y`).

In [None]:
# TODO create the pair plots
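
One possible approach, assuming pandas is acceptable for plotting, uses `pandas.plotting.scatter_matrix`. The previous notebook may have used a different library, so treat this as a sketch; the column names `PC1` to `PC3` and the figure size are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_r = PCA(n_components=3).fit_transform(X)

# wrap the reduced data in a DataFrame so that the axes get readable labels
df_r = pd.DataFrame(X_r, columns=["PC1", "PC2", "PC3"])

# one scatter plot per pair of components, colored by target class;
# histograms of each component on the diagonal
axes = pd.plotting.scatter_matrix(
    df_r, c=y, figsize=(8, 8), diagonal="hist", alpha=0.8
)
```

In a notebook the figure renders inline; in a script, a call to `matplotlib.pyplot.show()` would be needed.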


Plot the samples in the 3 dimensions of the PCA reduced feature space, in a 3d plot.

In [3]:
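import matplotlib.pyplot as plt
# importing Axes3D registers the "3d" projection; this is only required for matplotlib < 3.2
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401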

In [25]:
# TODO create the 3d plot
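
A minimal sketch of such a plot with matplotlib is given below. The axis labels and figure size are illustrative, and the default colormap is used to distinguish the target classes.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed for matplotlib < 3.2)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_r = PCA(n_components=3).fit_transform(X)

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")

# one point per sample, positioned by its 3 PCA coordinates, colored by class
ax.scatter(X_r[:, 0], X_r[:, 1], X_r[:, 2], c=y)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
```

In an interactive backend the view can be rotated with the mouse, which helps judge how well the classes separate along each component.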


How do the plots in the reduced feature space compare to the plots in the original feature space?