# Dimensionality Reduction Techniques

In this module we will learn ways to reduce the number of dimensions (features) in a problem. Reducing dimensionality can speed up the algorithms we use and sometimes improves model performance. We will look at how to select the most "optimal" features (in various senses), and at how to use PCA (Principal Component Analysis) to project the features onto a lower-dimensional subspace.

<b>Functions and attributes in this lecture:</b>
- `sklearn.datasets` - Submodule for pre-built datasets
  - `fetch_covtype` - Fetches the forest cover type dataset
- `sklearn.feature_selection` - Submodule for selecting features
  - `VarianceThreshold` - Removes features whose variance does not exceed a threshold
- `sklearn.decomposition` - Submodule for decomposition techniques
  - `PCA` - Implementation of Principal Component Analysis

In [None]:
# Non-sklearn packages
import numpy as np
import pandas as pd
import seaborn as sns

# Sklearn packages
from sklearn.datasets import fetch_covtype
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

In [None]:
# Importing the dataset (loaded once, so the description is available too)
covtype = fetch_covtype(as_frame=True)
X, y = covtype.data, covtype.target

# Printing the description for the dataset
print(covtype.DESCR)

## Checking Out a Multiclass Dataset

In [None]:
# Looking at the data


In [None]:
# The problem is a multi-class problem


In [None]:
# Some tree species are more common than others


In [None]:
# There are not any missing values


Check the memory usage above! It is quite a large dataset.
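
The inspection steps in this section can be sketched on a small synthetic stand-in for the covertype data. The names `X_demo` and `y_demo`, the column names, and the values are made up for illustration; on the real data the same calls would be applied to `X` and `y`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the covertype features and target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "elevation": rng.normal(2900, 300, size=1000),
    "slope": rng.integers(0, 60, size=1000),
    "wilderness_area": rng.integers(0, 2, size=1000),
})
y_demo = pd.Series(
    rng.choice([1, 2, 3], size=1000, p=[0.5, 0.4, 0.1]), name="cover_type"
)

# Looking at the data
print(X_demo.head())

# The problem is multiclass: the target takes more than two distinct values
n_classes = y_demo.nunique()
print(n_classes)

# Some classes are more common than others
class_counts = y_demo.value_counts()
print(class_counts)

# Number of missing values per column
missing = X_demo.isna().sum()
print(missing)

# Total memory usage in bytes (index included)
memory_bytes = X_demo.memory_usage(deep=True).sum()
print(memory_bytes)
```

`X_demo.info()` reports the dtypes, non-null counts, and memory usage in one call, which is a convenient alternative to the separate checks.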

## Reduction Based on Correlation

A very simple way to reduce the number of features is to keep only the features that have the strongest (linear) correlation with the target, measured by the absolute value of the correlation coefficient.
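
As a minimal sketch of this idea, the example below ranks features by their absolute Pearson correlation with the target and keeps the top `k`. It uses synthetic data; the column names, the value `k = 2`, and the use of `DataFrame.corrwith` (rather than a full correlation matrix of features and target combined) are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one feature tracks the target closely, one weakly, one not at all
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X_demo = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=n),
    "weak": signal + 3.0 * rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = pd.Series(signal, name="target")

# Absolute Pearson correlation of every feature with the target, largest first
corr_with_target = X_demo.corrwith(y_demo).abs().sort_values(ascending=False)
print(corr_with_target)

# Keep the k features most correlated with the target
k = 2
top_features = corr_with_target.head(k).index.tolist()
X_reduced = X_demo[top_features]
print(top_features)
```

Taking the absolute value matters because a strong negative correlation is just as informative as a strong positive one.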

In [None]:
# Combine the features and the target


In [None]:
# Very hard to see with a heatmap


In [None]:
# Just get the features that are most correlated with the target


## Using Variance Threshold to Remove Features

The idea behind a variance threshold is that features that vary little are assumed to have little effect on the outcome. In the extreme case, a feature that always has the same value gives no information about the target.
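
The extreme case can be illustrated with a tiny hand-made array in which one column is constant. The array `X_demo` is hypothetical; `VarianceThreshold` is used here with its default settings.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: the middle column always has the same value
X_demo = np.array([
    [1.0, 7.0, 0.2],
    [2.0, 7.0, 0.1],
    [3.0, 7.0, 0.4],
    [4.0, 7.0, 0.3],
])

# Default threshold is 0.0, so only zero-variance columns are dropped
selector = VarianceThreshold()
X_trimmed = selector.fit_transform(X_demo)

# Boolean mask telling which input columns were kept
kept_mask = selector.get_support()
print(kept_mask)
print(X_trimmed.shape)
```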

In [None]:
# Initiating the class


In [None]:
# Using the transformer


In [None]:
# Get the trimmed features


In [None]:
# Seems the same as before


The reason no features have been trimmed away is that the default `threshold` parameter for `VarianceThreshold` is 0, which only removes features with zero variance (constant columns), and this dataset has none. Let's try to change that!
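
A sketch of a non-zero threshold on synthetic data follows. The column names and the value `threshold=0.5` are arbitrary choices for this toy example; a sensible value for the covertype data depends on the scale of its features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features with very different spreads
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "wide": rng.normal(0, 10, size=1000),     # variance near 100
    "medium": rng.normal(0, 1, size=1000),    # variance near 1
    "narrow": rng.normal(0, 0.1, size=1000),  # variance near 0.01
})

# Features whose variance does not exceed the threshold are removed
selector = VarianceThreshold(threshold=0.5)
selector.fit(X_demo)

# Use the boolean mask to keep the DataFrame's column labels
kept_columns = X_demo.columns[selector.get_support()]
X_trimmed = X_demo[kept_columns]
print(list(kept_columns))
```

Selecting columns through `get_support()` keeps the result as a labelled DataFrame, which makes it easy to see which features survived.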

In [None]:
# Initiating the class with a threshold


In [None]:
# Using the transformer


In [None]:
# Get the trimmed features


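Only the features that vary a lot remain. Keep in mind that the variance is computed without looking at the target, so this is a heuristic for retaining features that are likely to have an effect on the target, not a guarantee that they do.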

## Implementing PCA

Implementing PCA is super simple in scikit-learn: create a `PCA` instance with the desired number of components, then fit it and transform the data.
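
As a minimal sketch, the example below projects synthetic data onto 5 principal components and combines the result with a target column. The random data, the names `X_demo` and `y_demo`, and the `PC1` to `PC5` column labels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 10 features, and a 3-class target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = pd.Series(rng.integers(0, 3, size=200), name="target")

# Initiating a PCA instance and projecting onto 5 components
pca = PCA(n_components=5)
X_projected = pca.fit_transform(X_demo)

# Wrap the projection in a DataFrame and combine it with the target
components = pd.DataFrame(
    X_projected, columns=[f"PC{i + 1}" for i in range(5)]
)
combined = pd.concat([components, y_demo], axis=1)
print(combined.shape)

# Share of the total variance captured by each component, largest first
print(pca.explained_variance_ratio_)

# Variance of each new feature; later components vary less
print(components.var())
```

The components are ordered by explained variance, so the last ones may carry little variance compared with the first.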

In [None]:
# Initiating a PCA instance


In [None]:
# Transform the data


In [None]:
# The projection into 5 dimensions that best explain the variance


In [None]:
# Combine again with the targets


In [None]:
# Some of the new features might have low variance
