# Univariate Feature Selection

## Removing features with low variance

* We want to remove all features whose variance doesn’t meet some threshold
* For example, we should remove all zero-variance features, i.e. features that have the same value in all samples
* As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples
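* Boolean features are Bernoulli random variables with variance $\mathrm{Var}[x] = p(1-p)$, where $p$ is the probability of a one, so the 80% cutoff corresponds to a variance threshold of $0.8 \times (1 - 0.8) = 0.16$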
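The variance formula in the list above can be checked numerically. This is a minimal sketch in plain NumPy; the `feature` array is a made-up column that is "on" in exactly 80% of ten samples, not data from this notebook.

```python
import numpy as np

p = 0.8
threshold = p * (1 - p)  # variance of a Bernoulli variable with p = 0.8

# hypothetical boolean feature: one in 8 of 10 samples, zero in the rest
feature = np.array([1] * 8 + [0] * 2)
empirical = feature.var()  # NumPy's default is the population variance (ddof=0)

print(threshold, empirical)
```

Both values should agree up to floating-point rounding, which is why `.8 * (1 - .8)` can be passed directly as the threshold.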

In [9]:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
X

array([[0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 1],
       [0, 1, 0],
       [0, 1, 1]])

In [10]:
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

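As expected, the first column is gone: it is zero in five of the six samples, so $p = 5/6 > 0.8$ and its variance, $5/36 \approx 0.14$, falls below the threshold of $0.16$. The other two columns have variances of about $0.22$ and $0.25$ and are kept.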
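To see why a column was dropped, the fitted selector can be inspected. The sketch below rebuilds the same `X` so that it runs on its own, then prints the per-column variances and the boolean mask of retained columns.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8)).fit(X)

print(sel.variances_)     # per-column variance computed during fit
print(sel.get_support())  # True for columns whose variance exceeds the threshold
```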

# Multivariate Feature Elimination

## Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a feature selection approach that repeatedly fits a model, discards the weakest attributes, and refits on those that remain until the requested number of attributes is left. In scikit-learn, `RFE` ranks attributes by the fitted estimator's `coef_` or `feature_importances_` rather than by model accuracy directly, so the base model must expose one of those attributes.

In [1]:
# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# load the iris dataset
dataset = datasets.load_iris()

# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression()

# create the RFE model and select 3 attributes
# (n_features_to_select must be passed by keyword in recent scikit-learn releases)
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(dataset.data, dataset.target)

# summarize the selection of the attributes
print(rfe.support_)

print(rfe.ranking_)

[False  True  True  True]
[2 1 1 1]
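The elimination loop itself can be sketched by hand. This is a simplified illustration, not scikit-learn's actual implementation. It assumes that a feature's importance is the sum of its squared logistic-regression coefficients across classes, and it sets `max_iter=1000` only to help the solver converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))  # indices of features still in play
n_keep = 3

while len(remaining) > n_keep:
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # coef_ holds one weight per (class, feature); score each feature
    # by its squared weights summed over the classes
    scores = (model.coef_ ** 2).sum(axis=0)
    weakest = int(np.argmin(scores))
    del remaining[weakest]  # drop the lowest-scoring feature and refit

print(remaining)  # column indices of the surviving features
```

Each pass removes one feature, which corresponds to the default `step=1` behaviour of `RFE`.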


## Feature Importance

Methods that use ensembles of decision trees (like Random Forest and Extra Trees) can also compute the relative importance of each attribute. These importance values can be used to inform a feature selection process. 

In [12]:
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

# load the iris dataset
dataset = datasets.load_iris()

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(dataset.data, dataset.target)

# display the relative importance of each attribute
print(model.feature_importances_)

[ 0.07268384  0.07023344  0.38643323  0.47064949]
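One way to turn these importances into an actual selection is `SelectFromModel`, which keeps the features whose importance reaches a threshold. The sketch below assumes the mean importance as the cutoff. The `n_estimators` and `random_state` values are illustrative choices that make the run repeatable, not settings from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# random_state is fixed only so that the sketch gives the same result each run
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="mean").fit(X, y)

print(selector.get_support())  # True for features at or above the mean importance
X_reduced = selector.transform(X)
print(X_reduced.shape)
```

Because Extra Trees is randomized, the importances (and possibly the selected set) can vary between runs unless the seed is fixed.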


## Using Cutoff

* SelectKBest: removes all but the k highest scoring features
* SelectPercentile: removes all but a user-specified highest scoring percentage of features

In [13]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
X.shape

(150, 4)

In [14]:
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape

(150, 2)
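`SelectPercentile` is used the same way, except that the cutoff is a percentage of the features rather than a count. A minimal sketch, assuming the top 50% of features by chi-squared score should be kept:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2

X, y = load_iris(return_X_y=True)

# keep the highest-scoring half of the features
selector = SelectPercentile(chi2, percentile=50).fit(X, y)

print(selector.scores_)  # chi-squared score of each feature
X_half = selector.transform(X)
print(X_half.shape)
```

With four input features, a 50% cutoff keeps two columns, so the result has the same shape as the `SelectKBest(chi2, k=2)` example.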

In [17]:
# Linear Regression on all four features (baseline)
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

dataset = datasets.load_iris()

model = LinearRegression()
model.fit(dataset.data, dataset.target)
expected = dataset.target
predicted = model.predict(dataset.data)

# summarize the fit of the model
mse = np.mean((predicted-expected)**2)
print(mse)
print(model.score(dataset.data, dataset.target))

0.0463850883112
0.930422367533


In [20]:
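# Linear Regression on the 3 best features chosen by SelectKBest
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression

dataset = datasets.load_iris()

# keep the 3 features with the highest chi-squared scores
X_new = SelectKBest(chi2, k=3).fit_transform(dataset.data, dataset.target)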

model = LinearRegression()
model.fit(X_new, dataset.target)
expected = dataset.target
predicted = model.predict(X_new)

# summarize the fit of the model
mse = np.mean((predicted-expected)**2)
print(mse)
print(model.score(X_new, dataset.target))

0.0465592211731
0.93016116824
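The comparison between the two regression fits can be extended to every value of `k`. The sketch below repeats the same steps in a loop and records the in-sample $R^2$ for each feature count. The name `r2_by_k` is introduced here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)

r2_by_k = {}
for k in range(1, X.shape[1] + 1):
    # keep the k features with the highest chi-squared scores
    X_k = SelectKBest(chi2, k=k).fit_transform(X, y)
    model = LinearRegression().fit(X_k, y)
    r2_by_k[k] = model.score(X_k, y)  # R^2 on the training data

for k, r2 in r2_by_k.items():
    print(k, round(r2, 4))
```

These scores are computed on the same data the model was fitted on, so adding features cannot lower them. A small gap between `k=3` and `k=4` suggests that the dropped feature contributes little, but it says nothing about performance on unseen data.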
