## CHAPTER 10
---
# DIMENSIONALITY REDUCTION USING FEATURE SELECTION

---
In Chapter 9, we discussed how to reduce the dimensionality of our feature matrix by creating new features with (ideally) similar ability to train quality models but with significantly fewer dimensions. This is called `feature extraction`. In this chapter we will cover an alternative approach: selecting high-quality, informative features and dropping less useful features. This is called `feature selection`.

There are three types of feature selection methods:
- `Filter`: select the best features by examining their statistical properties
- `Wrapper`: use trial and error to find the subset of features that produces models with the highest quality predictions
- `Embedded`: select the best feature subset as part or as an extension of a learning algorithm’s training process

In this chapter we cover only filter and wrapper feature selection methods.

## 10.1 Thresholding Numerical Feature Variance

- You have a set of numerical features and want to remove those with low variance (i.e., likely containing little information).
- Select a subset of features with variances above a given threshold.

In [1]:
# Load libraries
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

# import some data to play with
iris = datasets.load_iris()

# Create features and target
features = iris.data
target = iris.target

# Create thresholder
thresholder = VarianceThreshold(threshold=.5)

# Create high variance feature matrix
features_high_variance = thresholder.fit_transform(features)

# View high variance feature matrix
features_high_variance[0:3]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2]])

#### Discussion:
Variance thresholding (VT) is one of the most basic approaches to feature selection. It is motivated by the idea that features with low variance are likely less interesting (and less useful) than features with high variance. VT first calculates the variance of each feature, then drops every feature whose variance does not exceed a given threshold:
$$
\operatorname{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2
$$
where
- $x$ is the feature vector,
- $x_i$ is an individual feature value,
- $n$ is the number of observations, and
- $\mu$ is that feature’s mean value.

In [2]:
# View variances
thresholder.fit(features).variances_

array([0.68112222, 0.18871289, 3.09550267, 0.57713289])
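As a quick check on the formula, the variances reported by `variances_` can be reproduced by hand. The sketch below assumes the same iris data and the threshold of 0.5 used in the solution. `manual_variances` and `keep_mask` are illustrative names, not part of scikit-learn.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

# Load the same iris feature matrix used in the solution
features = datasets.load_iris().data

# Population variance: mean squared deviation from each column's mean
manual_variances = ((features - features.mean(axis=0)) ** 2).mean(axis=0)

# Fit the thresholder with the same threshold as the solution
thresholder = VarianceThreshold(threshold=.5).fit(features)

# The hand-computed values should agree with scikit-learn's
print(np.allclose(manual_variances, thresholder.variances_))

# Boolean mask of the features whose variance exceeds the threshold
keep_mask = manual_variances > .5
print(keep_mask)
print(thresholder.get_support())
```

The mask shows that only the second feature (variance of about 0.19) falls below the threshold, which is why the solution's output has three columns.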

If the features have been standardized (to mean zero and unit variance), variance thresholding will not work correctly, because every feature then has the same variance of 1 and the threshold can no longer distinguish between them:

In [3]:
# Load library
from sklearn.preprocessing import StandardScaler

# Standardize feature matrix
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# Calculate variance of each feature
selector = VarianceThreshold()
selector.fit(features_std).variances_

array([1., 1., 1., 1.])
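One possible way around this, offered as a sketch rather than a prescription, is to apply the variance threshold to the raw features first and standardize only the features that survive. The example assumes the iris data and the threshold of 0.5 from Recipe 10.1. The ordering of the two steps is the point being illustrated, and `features_selected_std` is an illustrative name.

```python
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the raw (unstandardized) iris features
features = datasets.load_iris().data

# Threshold on the raw variances first, then standardize what remains
pipeline = make_pipeline(VarianceThreshold(threshold=.5), StandardScaler())
features_selected_std = pipeline.fit_transform(features)

# Three of the four features pass the threshold
print(features_selected_std.shape)
```

Because the threshold is compared against raw variances, its value still depends on the units of each feature, so it should be chosen with the original scales in mind.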

## 10.2 Thresholding Binary Feature Variance

- You have a set of binary categorical features and want to remove those with low variance (i.e., likely containing little information).
- Select a subset of features with a Bernoulli random variable variance above a given threshold:

In [1]:
# Load library
from sklearn.feature_selection import VarianceThreshold

# Create feature matrix with:
# Feature 0: 80% class 0
# Feature 1: 80% class 1
# Feature 2: 60% class 0, 40% class 1
features = [[0, 1, 0],
            [0, 1, 1],
            [0, 1, 0],
            [0, 1, 1],
            [1, 0, 0]]

# Run threshold by variance
thresholder = VarianceThreshold(threshold=(.75 * (1 - .75)))
thresholder.fit_transform(features)

array([[0],
       [1],
       [0],
       [1],
       [0]])

#### Discussion:
Just like with numerical features, one strategy for selecting highly informative categorical features is to examine their variances. In binary features (i.e., Bernoulli random variables), variance is calculated as:
$$
\operatorname{Var}(x) = p(1 - p)
$$
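where $p$ is the proportion of observations of class 1. Therefore, by choosing a value of $p$ and setting the threshold to $p(1 - p)$, we can remove features where the vast majority of observations are one class. In the solution, $p = 0.75$ removes any feature in which more than 75% of observations share the same class.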
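To make the arithmetic concrete, the following sketch computes the Bernoulli variance of each column by hand and compares it with the threshold from the solution. It assumes the same five-row feature matrix. `bernoulli_variances` and `keep_mask` are illustrative names.

```python
import numpy as np

# Same binary feature matrix as in the solution
features = np.array([[0, 1, 0],
                     [0, 1, 1],
                     [0, 1, 0],
                     [0, 1, 1],
                     [1, 0, 0]])

# Proportion of class 1 in each feature
p = features.mean(axis=0)

# Bernoulli variance p(1 - p) for each feature
bernoulli_variances = p * (1 - p)

# Threshold used in the solution: .75 * (1 - .75) = 0.1875
threshold = .75 * (1 - .75)

# Features kept are those whose variance exceeds the threshold
keep_mask = bernoulli_variances > threshold
print(bernoulli_variances)
print(keep_mask)
```

The first two features have variance 0.8 × 0.2 = 0.16, which is below 0.1875, so only the third feature (0.4 × 0.6 = 0.24) is kept.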

## 10.3 Handling Highly Correlated Features

- You have a feature matrix and suspect some features are highly correlated.
- Use a correlation matrix to check for highly correlated features. If highly correlated features exist, consider dropping one of the correlated features:

In [2]:
# Load libraries
import pandas as pd
import numpy as np

# Create feature matrix with two highly correlated features
features = np.array([[1, 1, 1],
                     [2, 2, 0],
                     [3, 3, 1],
                     [4, 4, 0],
                     [5, 5, 1],
                     [6, 6, 0],
                     [7, 7, 1],
                     [8, 7, 0],
                     [9, 7, 1]])

# Convert feature matrix into DataFrame
dataframe = pd.DataFrame(features)

# Create correlation matrix
corr_matrix = dataframe.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape),
                          k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features
dataframe.drop(dataframe.columns[to_drop], axis=1).head(3)

|   | 0 | 2 |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 0 |
| 2 | 3 | 1 |


#### Discussion:
One problem we often run into in machine learning is highly correlated features. If two features are highly correlated, then the information they contain is very similar, and it is likely redundant to include both features. The solution to highly correlated features is simple: remove one of them from the feature set.

In our solution, first we create a correlation matrix of all features:

In [3]:
# Correlation matrix
dataframe.corr()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.000000 | 0.976103 | 0.000000 |
| 1 | 0.976103 | 1.000000 | -0.034503 |
| 2 | 0.000000 | -0.034503 | 1.000000 |


Second, we look at the upper triangle of the correlation matrix to identify pairs of highly correlated features:

In [4]:
# Upper triangle of correlation matrix
upper

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | NaN | 0.976103 | 0.000000 |
| 1 | NaN | NaN | 0.034503 |
| 2 | NaN | NaN | NaN |


Third, we remove one feature from each of those pairs from the feature set.
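The three steps can be collected into a small helper. This is a minimal sketch under the same assumptions as the solution (absolute Pearson correlation and a cutoff of 0.95). The function name `drop_highly_correlated` and its `threshold` parameter are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(dataframe, threshold=0.95):
    """Hypothetical helper: drop one feature from each highly correlated pair."""
    # Step 1: absolute correlation matrix of all features
    corr_matrix = dataframe.corr().abs()

    # Step 2: keep only the upper triangle (excluding the diagonal)
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)

    # Step 3: drop every column that is highly correlated with an earlier one
    to_drop = [column for column in upper.columns
               if (upper[column] > threshold).any()]
    return dataframe.drop(columns=to_drop)


# Same feature matrix as in the solution
features = np.array([[1, 1, 1],
                     [2, 2, 0],
                     [3, 3, 1],
                     [4, 4, 0],
                     [5, 5, 1],
                     [6, 6, 0],
                     [7, 7, 1],
                     [8, 7, 0],
                     [9, 7, 1]])

reduced = drop_highly_correlated(pd.DataFrame(features))
print(reduced.columns.tolist())
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one removed. Here that is column 1, leaving columns 0 and 2.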

## 10.4 Removing Irrelevant Features for Classification

- You have a categorical target vector and want to remove uninformative features.
- If the features are categorical, calculate a chi-square ($\chi^2$) statistic between each feature and the target vector.
- If the features are quantitative, compute the ANOVA F-value between each feature and the target vector.

In [5]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif

# Load data
iris = load_iris()
features = iris.data
target = iris.target

# Convert to categorical data by converting data to integers
features = features.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)

# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2
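For quantitative features, the same selector can be paired with `f_classif`, which the solution imports but does not use. The sketch below assumes the original, unrounded iris measurements and keeps `k=2` to mirror the chi-squared example. `fvalue_selector` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load data, keeping the features as continuous values
iris = load_iris()
features = iris.data
target = iris.target

# Select the two features with the highest ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)

# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])
```

After fitting, the selector's `scores_` attribute holds the F-value of each feature, which can help when deciding on a value for `k`.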
