# Dimensionality Reduction Techniques

In this module we look at ways to reduce the number of dimensions (features) in a problem. Fewer features can speed up the algorithms we use and sometimes improve their performance. We cover how to select a subset of the most useful features (by correlation with the target and by variance), as well as how to use PCA (Principal Component Analysis) to project the features onto a lower-dimensional subspace.

<b>Functions and attributes in this lecture: </b>
- `sklearn.datasets` - Submodule for pre-built datasets
 - `fetch_covtype` - Fetches the forest covertypes dataset
- `sklearn.feature_selection` - Submodule for selecting features
 - `VarianceThreshold` - Select features based on having high variance
- `sklearn.decomposition` - Submodule for decomposition techniques
 - `PCA` - Implementation of Principal Component Analysis

In [51]:
# Non-sklearn packages
import numpy as np
import pandas as pd
import seaborn as sns
import ssl

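# Workaround for SSL certificate errors when downloading the dataset.
# Note that this disables certificate verification for HTTPS requests.
ssl._create_default_https_context = ssl._create_unverified_context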

# Sklearn packages
from sklearn.datasets import fetch_covtype
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

In [53]:
# Importing the dataset
X, y = fetch_covtype(return_X_y=True, as_frame=True)

# Printing the description for the dataset
print(fetch_covtype()["DESCR"])

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

Classes                        7
Samples total             581012
Dimensionality                54
Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`` as pandas
data

## Checking out a multiclass dataset

In [54]:
# Looking at the data
X.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type_30,Soil_Type_31,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
# The problem is a multi-class problem
y.head()

0    5
1    5
2    2
3    2
4    5
Name: Cover_Type, dtype: int32

In [56]:
# Some tree species are more common than others
y.value_counts()

Cover_Type
2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: count, dtype: int64

In [57]:
# There are not any missing values
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 54 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Elevation                           581012 non-null  float64
 1   Aspect                              581012 non-null  float64
 2   Slope                               581012 non-null  float64
 3   Horizontal_Distance_To_Hydrology    581012 non-null  float64
 4   Vertical_Distance_To_Hydrology      581012 non-null  float64
 5   Horizontal_Distance_To_Roadways     581012 non-null  float64
 6   Hillshade_9am                       581012 non-null  float64
 7   Hillshade_Noon                      581012 non-null  float64
 8   Hillshade_3pm                       581012 non-null  float64
 9   Horizontal_Distance_To_Fire_Points  581012 non-null  float64
 10  Wilderness_Area_0                   581012 non-null  float64
 11  Wilderness_Area_1         

Check the memory usage reported by `X.info()`! It's quite a large dataset, with 581,012 rows and 54 columns.

## Reduction Based on Correlation

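A very simple way to reduce the number of features is to keep only the features that have the strongest (linear) correlation with the target. Note that `Cover_Type` is a categorical label encoded as integers, so the correlation with it is only a rough heuristic here.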

In [58]:
# Combine the features and the target
combined = pd.concat([X,y],axis=1)

In [59]:
# The full correlation matrix (too many columns to read easily, even as a heatmap)
combined.corr()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type_31,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39,Cover_Type
Elevation,1.0,0.015735,-0.242697,0.306229,0.093306,0.365559,0.112179,0.205887,0.059148,0.148022,...,0.167077,0.070633,0.011731,0.083005,0.021107,0.035433,0.217179,0.193595,0.212612,-0.269554
Aspect,0.015735,1.0,0.078728,0.017376,0.070305,0.025121,-0.579273,0.336103,0.646944,-0.109172,...,0.056233,0.019163,0.010861,-0.021991,0.002281,-0.020398,0.017706,0.008294,-0.005866,0.01708
Slope,-0.242697,0.078728,1.0,-0.010607,0.274976,-0.215914,-0.327199,-0.526911,-0.175854,-0.185662,...,-0.133504,0.208942,-0.011002,-0.022228,0.002918,0.007848,-0.072208,0.093602,0.025637,0.148285
Horizontal_Distance_To_Hydrology,0.306229,0.017376,-0.010607,1.0,0.606236,0.07203,-0.027088,0.04679,0.05233,0.051874,...,0.127217,0.101195,0.070268,-0.005231,0.033421,-0.006802,0.043031,0.031922,0.14702,-0.020317
Vertical_Distance_To_Hydrology,0.093306,0.070305,0.274976,0.606236,1.0,-0.046372,-0.166333,-0.110957,0.034902,-0.069913,...,0.039762,0.167091,0.060274,-0.006092,0.012955,-0.00752,-0.008629,0.043859,0.179006,0.081664
Horizontal_Distance_To_Roadways,0.365559,0.025121,-0.215914,0.07203,-0.046372,1.0,0.034349,0.189461,0.106119,0.33158,...,-0.089019,-0.082779,0.00639,-0.003,0.00755,0.016313,0.079778,0.033762,0.016052,-0.15345
Hillshade_9am,0.112179,-0.579273,-0.327199,-0.027088,-0.166333,0.034349,1.0,0.010037,-0.780296,0.132669,...,0.006494,-0.064381,0.007154,0.02787,0.007865,0.010332,0.015108,-0.02962,-1.6e-05,-0.035415
Hillshade_Noon,0.205887,0.336103,-0.526911,0.04679,-0.110957,0.189461,0.010037,1.0,0.594274,0.057329,...,0.125395,-0.086164,0.043061,0.005863,0.016239,-0.022707,0.042952,-0.071961,-0.040176,-0.096426
Hillshade_3pm,0.059148,0.646944,-0.175854,0.05233,0.034902,0.106119,-0.780296,0.594274,1.0,-0.047981,...,0.083066,-0.024393,0.017757,-0.016482,0.00133,-0.022064,0.022187,-0.02904,-0.024254,-0.04829
Horizontal_Distance_To_Fire_Points,0.148022,-0.109172,-0.185662,0.051874,-0.069913,0.33158,0.132669,0.057329,-0.047981,1.0,...,-0.089977,-0.059067,-0.035067,-8.1e-05,-0.010595,0.00418,-0.01974,-0.003301,0.008915,-0.108936


In [60]:
# Just get the five features that are most correlated with the target
# (position 0 is Cover_Type itself, so we skip it)
most_correlated_feature_names = (combined.corr()["Cover_Type"].abs()
                                 .sort_values(ascending=False).iloc[1:6].index.to_list())
correlated_features = X[most_correlated_feature_names]
correlated_features.head()

Unnamed: 0,Wilderness_Area_3,Elevation,Soil_Type_9,Wilderness_Area_0,Soil_Type_37
0,0.0,2596.0,0.0,1.0,0.0
1,0.0,2590.0,0.0,1.0,0.0
2,0.0,2804.0,0.0,1.0,0.0
3,0.0,2785.0,0.0,1.0,0.0
4,0.0,2595.0,0.0,1.0,0.0
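
The ranking step above can be wrapped in a small helper. The following is a minimal sketch on synthetic data, not part of the original notebook: the function name `top_correlated_features` and the toy columns `signal`, `weak` and `noise` are hypothetical, and `DataFrame.corrwith` is used so that the target does not have to be concatenated to the features first.

```python
import numpy as np
import pandas as pd


def top_correlated_features(features, target, k=5):
    """Return the names of the k features with the largest |Pearson r| to the target."""
    # corrwith computes the correlation of every feature column with the target
    correlations = features.corrwith(target).abs()
    return correlations.sort_values(ascending=False).head(k).index.to_list()


# Toy data (hypothetical): "signal" drives the target strongly, "weak" a little,
# and "noise" not at all
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "signal": rng.normal(size=1000),
    "weak": rng.normal(size=1000),
    "noise": rng.normal(size=1000),
})
toy_target = 3 * toy["signal"] + 1.0 * toy["weak"] + rng.normal(scale=0.1, size=1000)

top_two = top_correlated_features(toy, toy_target, k=2)
print(top_two)
```

Taking the absolute value before sorting matters: a strongly negative correlation is just as useful for prediction as a strongly positive one.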


## Using Variance Threshold to Remove Features

The idea behind variance thresholding is that features that vary little carry little information about the outcome. In the extreme case, a feature that always has the same value gives no information at all about the values of the target.

In [61]:
# Instantiating the class
variance_threshold = VarianceThreshold()

In [62]:
# Using the transformer
outcome = variance_threshold.fit_transform(X)

In [63]:
# Get the trimmed features
outcome = pd.DataFrame(outcome, columns=variance_threshold.get_feature_names_out())

In [64]:
# Seems the same as before
outcome.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type_30,Soil_Type_31,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


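The reason no features have been trimmed away is that the default `threshold` parameter for `VarianceThreshold` is 0. With that default, only features with zero variance (features that have the same value in every sample) are removed, and none of the columns here are constant. Let's try to change that!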

In [65]:
# Number of features the fitted transformer saw as input
variance_threshold.n_features_in_

54

In [66]:
# Instantiating the class with a threshold and using the transformer
variance_threshold_1 = VarianceThreshold(threshold=1.0)
outcome_1 = variance_threshold_1.fit_transform(X)

In [67]:
# Get the trimmed features
outcome_1 = pd.DataFrame(outcome_1, columns=variance_threshold_1.get_feature_names_out())
outcome_1.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0


In [68]:
# Get new column length
len(outcome_1.columns)

10

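Only the features with a variance above 1.0 remain: the ten continuous measurements. All of the 0/1 indicator columns (wilderness areas and soil types) were dropped, because the variance of a binary feature can never exceed 0.25. Variance thresholding is another way of trying to retain features that have an effect on the target, but keep in mind that variance depends on the scale of a feature, so a high variance does not by itself guarantee that a feature is informative.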
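
Why a threshold of 1.0 removes every indicator column can be checked with a little arithmetic. A 0/1 column in which a share p of the values is 1 has population variance p(1 - p), which peaks at 0.25 when p = 0.5. The sketch below uses hypothetical proportions, not values from the covertype data.

```python
import numpy as np

# Share of ones in a hypothetical 0/1 indicator column
proportions = np.array([0.01, 0.1, 0.25, 0.5, 0.75, 0.99])

# Population variance of a 0/1 column with a share p of ones is p * (1 - p)
variances = proportions * (1 - proportions)

for p, v in zip(proportions, variances):
    print(f"p = {p:.2f} -> variance = {v:.4f}")

max_variance = variances.max()

# Empirical check on a tiny column with p = 0.25 (one 1 out of four values)
column = np.array([0, 0, 0, 1])
empirical_variance = np.var(column)
```

For binary features, a threshold of the form p(1 - p) is therefore more meaningful than an absolute value such as 1.0.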

## Implementing PCA

Applying PCA takes only a few lines in scikit-learn.

In [86]:
# Creating a PCA instance that keeps 5 components
pca = PCA(n_components=5)

In [87]:
# Transform the data
pca_transformed = pca.fit_transform(X)

In [88]:
# The projection onto the 5 directions that explain the most variance
pca_transformed = pd.DataFrame(pca_transformed)
pca_transformed.head()

Unnamed: 0,0,1,2,3,4
0,674.821965,4634.599374,-244.289792,107.119699,-39.260429
1,543.787831,4651.724081,-263.848362,66.62687,-32.446703
2,2870.252696,3092.562604,-216.452213,91.788291,25.567561
3,2839.580217,3216.681905,-236.661476,85.251145,40.991986
4,516.489772,4606.119618,-286.846634,14.147064,-42.333874
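
PCA picks the directions with the most variance, so features measured on a large scale tend to dominate the components when the data is not standardized first. The following sketch illustrates this on synthetic data; it is an assumption-laden illustration, not a result for the covertype dataset. It uses the `explained_variance_ratio_` attribute of a fitted `PCA` and a `StandardScaler` in a pipeline, neither of which appears in the notebook above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three independent columns, one of them on a much larger scale
toy = np.column_stack([
    rng.normal(scale=1000.0, size=1000),
    rng.normal(scale=1.0, size=1000),
    rng.normal(scale=1.0, size=1000),
])

# PCA on the raw data: the large-scale column dominates the first component
raw_pca = PCA(n_components=2).fit(toy)
raw_ratio = raw_pca.explained_variance_ratio_

# PCA after standardizing every column to zero mean and unit variance
scaled_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(toy)
scaled_ratio = scaled_pipeline.named_steps["pca"].explained_variance_ratio_

print("Unscaled:", raw_ratio)
print("Scaled:  ", scaled_ratio)
```

Whether to standardize depends on the problem: if the original units are meaningful and comparable, the unscaled version may be what you want.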


In [91]:
# Combine again with the targets
combined = pd.concat([pca_transformed,y], axis=1)
combined

Unnamed: 0,0,1,2,3,4,Cover_Type
0,674.821965,4634.599374,-244.289792,107.119699,-39.260429,5
1,543.787831,4651.724081,-263.848362,66.626870,-32.446703,5
2,2870.252696,3092.562604,-216.452213,91.788291,25.567561,2
3,2839.580217,3216.681905,-236.661476,85.251145,40.991986,2
4,516.489772,4606.119618,-286.846634,14.147064,-42.333874,5
...,...,...,...,...,...,...
581007,-2538.746554,220.208732,-435.328761,52.315953,-9.432613,3
581008,-2546.043798,234.007959,-447.837154,38.032431,-9.687357,3
581009,-2545.907082,244.302744,-455.505134,33.642014,0.556144,3
581010,-2540.765016,252.666167,-457.305806,34.575046,15.280406,3


In [92]:
# The new features are uncorrelated with each other, and some of them
# have a low correlation with the target
combined.corr()

Unnamed: 0,0,1,2,3,4,Cover_Type
0,1.0,-2.515722e-15,5.211018e-15,1.912452e-15,-4.382593e-16,-0.167383
1,-2.515722e-15,1.0,8.091854e-15,2.9918750000000004e-17,3.9654320000000006e-17,0.004308
2,5.211018e-15,8.091854e-15,1.0,5.941837e-16,1.839235e-16,-0.187592
3,1.912452e-15,2.9918750000000004e-17,5.941837e-16,1.0,3.9137880000000005e-17,0.14325
4,-4.382593e-16,3.9654320000000006e-17,1.839235e-16,3.9137880000000005e-17,1.0,0.010209
Cover_Type,-0.1673827,0.00430796,-0.1875918,0.1432505,0.01020885,1.0
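
The off-diagonal values of order 1e-15 in the matrix above are floating-point noise: the components produced by PCA are uncorrelated with each other by construction. A minimal sketch on synthetic data (the `latent` and `mixing` arrays are hypothetical) checks this numerically.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical correlated data: mix three independent columns together
latent = rng.normal(size=(500, 3))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.7],
                   [0.0, 0.0, 1.0]])
toy = latent @ mixing

# Project onto all three principal components
components = PCA(n_components=3).fit_transform(toy)

# Correlation matrix of the components (columns are variables)
component_corr = np.corrcoef(components, rowvar=False)

# Largest absolute correlation between two different components
max_off_diagonal = np.abs(component_corr - np.eye(3)).max()
print(max_off_diagonal)
```

Being uncorrelated with each other says nothing about how well a component predicts the target, which is why the last row and column of the matrix above are still worth inspecting.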
