# Recursive Feature Elimination (RFE)

RFE is a technique for iteratively removing the features that contribute least to a model, i.e. those with the smallest absolute coefficients or the lowest feature importances.

From the docs:

> Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
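
The procedure in the quote can be sketched by hand. This is a minimal illustration, not scikit-learn's implementation: it assumes synthetic data from `make_classification`, a hypothetical helper named `manual_rfe`, and the absolute value of `coef_` as the importance measure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: drop the weakest feature until the target count remains."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    while len(remaining) > n_features_to_select:
        # Refit on the surviving columns only
        model = LogisticRegression().fit(X[:, remaining], y)
        importances = np.abs(model.coef_[0])
        weakest = int(np.argmin(importances))  # position within `remaining`
        del remaining[weakest]
    return remaining


# Synthetic data stands in for a real dataset (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

selected = manual_rfe(X_demo, y_demo, n_features_to_select=3)
print("Selected column indices:", selected)
```

The model is refit after every removal because coefficients can shift once a feature is gone, which is what separates RFE from a single pass that ranks all features once.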

Resources:

* [Dimensionality Reduction in Python](https://campus.datacamp.com/courses/dimensionality-reduction-in-python/feature-selection-ii-selecting-for-model-accuracy)
* [RFE Docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)

In [76]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score

# Using coefficients

In [91]:
df = pd.read_csv("data/titantic-train.csv").dropna()
y = df["Survived"]

## Default score and coefficients

In [92]:
X = df[["Age", "Fare", "Pclass", "SibSp", "Parch"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

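# Standardize the features so that coefficient magnitudes are comparable across features;
# the scaler is fit on the training split only to avoid leaking test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)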

lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)

print("Accuracy score: {:.2f}".format(accuracy_score(y_test, lr.predict(X_test_scaled))))
print("Coefficients:", dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

Accuracy score: 0.69
Coefficients: {'Age': 0.75, 'Fare': 0.47, 'Pclass': 0.23, 'SibSp': 0.12, 'Parch': 0.38}


## Score after removing feature with lowest coefficient

In [84]:
X = df[["Age", "Fare", "Pclass", "Parch"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)

print("Accuracy score: {:.2f}".format(accuracy_score(y_test, lr.predict(X_test_scaled))))
print("Coefficients:", dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

Accuracy score: 0.69
Coefficients: {'Age': 0.76, 'Fare': 0.51, 'Pclass': 0.24, 'Parch': 0.38}


## Using RFE to remove low importance features automatically

In [105]:
X = df[["Age", "Fare", "Pclass", "SibSp", "Parch"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LogisticRegression()

# The optional step parameter removes several features per iteration instead of 1,
# which means fewer refits and a faster search, at the cost of coarser rankings
rfe = RFE(lr, n_features_to_select=4, verbose=1)
rfe.fit(X_train_scaled, y_train)

# Print each feature's ranking (1 = selected; higher = dropped earlier)
print("\nRankings:", dict(zip(X.columns, rfe.ranking_)))
print("Selected features:", X.columns[rfe.support_].values)
print("Accuracy score: {:.2f}".format(accuracy_score(y_test, rfe.predict(X_test_scaled))))

Fitting estimator with 5 features.

Rankings: {'Age': 1, 'Fare': 1, 'Pclass': 1, 'SibSp': 2, 'Parch': 1}
Selected features: ['Age' 'Fare' 'Pclass' 'Parch']
Accuracy score: 0.69
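
To illustrate the `step` parameter mentioned in the comment above, here is a small sketch on synthetic data (the dataset and the variable names `X_demo`, `y_demo`, and `rfe_step` are assumptions for illustration, not part of the Titanic example). With 8 features, `step=2`, and a target of 2, RFE should need three elimination rounds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with 8 features (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# step=2 removes two features per iteration: 8 -> 6 -> 4 -> 2
rfe_step = RFE(LogisticRegression(), n_features_to_select=2, step=2)
rfe_step.fit(X_demo, y_demo)

print("Rankings:", rfe_step.ranking_)
print("Number selected:", rfe_step.support_.sum())
```

Features removed in the same round share a rank, so a larger `step` trades ranking resolution for fewer model fits.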


# Using feature importances

In [102]:
X = df[["Age", "Fare", "Pclass", "SibSp", "Parch"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Tree-based models are insensitive to feature scale, so the unscaled features are used
rf = RandomForestClassifier(random_state=0, max_depth=3)
rf.fit(X_train, y_train)

print("Accuracy score: {:.2f}".format(accuracy_score(y_test, rf.predict(X_test))))
print("Feature importances:", dict(zip(X.columns, rf.feature_importances_.round(2))))

mask = rf.feature_importances_ > 0.10
reduced_features = X.loc[:, mask]
print("Reduced features:", reduced_features.columns.values)

Accuracy score: 0.80
Feature importances: {'Age': 0.45, 'Fare': 0.39, 'Pclass': 0.04, 'SibSp': 0.05, 'Parch': 0.07}
Reduced features: ['Age' 'Fare']
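
The mask above only identifies which features to keep; to see the effect on accuracy, the model has to be refit on the reduced set. The following sketch shows one way to do that on synthetic data (the dataset and the names `rf_full`, `rf_reduced`, and `keep` are assumptions for illustration; the 0.10 threshold is carried over from the cell above).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the Titanic features (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rf_full = RandomForestClassifier(random_state=0, max_depth=3).fit(X_tr, y_tr)
keep = rf_full.feature_importances_ > 0.10  # same threshold as above

# Refit on the kept columns only and score on the matching test columns
rf_reduced = RandomForestClassifier(random_state=0, max_depth=3).fit(X_tr[:, keep], y_tr)
acc_full = accuracy_score(y_te, rf_full.predict(X_te))
acc_reduced = accuracy_score(y_te, rf_reduced.predict(X_te[:, keep]))

print("Kept {} of {} features".format(keep.sum(), keep.size))
print("Accuracy full: {:.2f}, reduced: {:.2f}".format(acc_full, acc_reduced))
```

Unlike RFE, this is a single pass: importances are computed once on the full feature set and never re-evaluated after features are dropped.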


## Using RFE to remove low importance features automatically

In [106]:
X = df[["Age", "Fare", "Pclass", "SibSp", "Parch"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(random_state=0, max_depth=3)
rfe = RFE(rf, n_features_to_select=2, verbose=1)
rfe.fit(X_train, y_train)

print("\nRankings:", dict(zip(X.columns, rfe.ranking_)))
print("Selected features:", X.columns[rfe.support_].values)
print("Accuracy score: {:.2f}".format(accuracy_score(y_test, rfe.predict(X_test))))


Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.

Rankings: {'Age': 1, 'Fare': 1, 'Pclass': 4, 'SibSp': 2, 'Parch': 3}
Selected features: ['Age' 'Fare']
Accuracy score: 0.78
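
Calling `rfe.predict` on the full feature matrix works because the fitted selector reduces the input to the selected columns before passing it to the final estimator. A minimal sketch of that equivalence, assuming synthetic data and the illustrative names `rfe_demo`, `direct`, and `two_step`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

rfe_demo = RFE(RandomForestClassifier(random_state=0, max_depth=3), n_features_to_select=2)
rfe_demo.fit(X_demo, y_demo)

# transform() keeps only the selected columns;
# estimator_ is the model fitted on those columns
X_reduced = rfe_demo.transform(X_demo)
direct = rfe_demo.predict(X_demo)
two_step = rfe_demo.estimator_.predict(X_reduced)

print("Reduced shape:", X_reduced.shape)
print("Predictions match:", np.array_equal(direct, two_step))
```

The `transform` output is also useful on its own, for example to feed the selected columns into a different model.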
