Step Forward Selection (SFS), also called Forward Selection or Sequential Forward Selection,
is a wrapper method for feature selection in machine learning. Its goal is to select a subset
of relevant features that improves model performance while reducing dimensionality and
removing irrelevant or redundant features.
How Step Forward Selection Works:
1. Start with an empty set of features: Initially, no features are selected.
2. Evaluate each feature individually: Train the model on each feature by itself and
evaluate the model's performance (using metrics like accuracy, precision, etc.).
3. Select the best feature: The feature that gives the best performance (e.g., highest
accuracy) is selected and added to the feature set.
4. Repeat the process: Try each remaining feature together with the already selected
feature set, evaluate the model on each candidate set, and add the feature that gives
the best performance.
5. Stop when a criterion is met: Continue adding features until you meet a stopping
condition, such as a specified number of features or no further improvement in
performance.
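
The five steps above can be sketched as a short greedy loop. This is a minimal illustration, not how mlxtend implements it: `forward_select` and `toy_score` are hypothetical names, and the toy scoring function stands in for a cross-validated metric such as accuracy.

```python
def forward_select(candidates, score_fn, k_features):
    """Greedy forward selection (illustrative sketch).

    candidates : list of feature names
    score_fn   : callable taking a list of features, returning a score
                 (higher is better); hypothetical stand-in for CV accuracy
    k_features : maximum number of features to select
    """
    selected = []                 # step 1: start with an empty set
    remaining = list(candidates)
    best_score = float("-inf")

    while remaining and len(selected) < k_features:
        # steps 2 and 4: score every set formed by adding one remaining feature
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        round_score, round_feature = max(scored, key=lambda pair: pair[0])

        # step 5: stop early if no candidate improves the score
        if round_score <= best_score:
            break

        # step 3: keep the best candidate of this round
        selected.append(round_feature)
        remaining.remove(round_feature)
        best_score = round_score

    return selected, best_score


def toy_score(subset):
    # Hypothetical scores in percentage points: "b" is largely redundant
    # once "a" is present, and "d" carries no information.
    gains = {"a": 50, "b": 30, "c": 5, "d": 0}
    score = sum(gains[f] for f in subset)
    if "a" in subset and "b" in subset:
        score -= 28
    return score


print(forward_select(["a", "b", "c", "d"], toy_score, k_features=3))
# (['a', 'c', 'b'], 57)
```

Note that "b" looks like the second-best feature on its own, yet "c" is picked before it because "b" adds little once "a" is already selected. Asking for four features returns the same three, since adding "d" brings no improvement.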

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.metrics import accuracy_score

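# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the classifier (Logistic Regression in this case)
clf = LogisticRegression(max_iter=200)

# Initialize Sequential Forward Selection
sfs = SFS(clf,
          k_features=3,        # Stop once 3 features are selected
          forward=True,        # Forward selection
          floating=False,      # Set to False for simple step forward
          scoring='accuracy',
          cv=5)                # 5-fold cross-validation

# Perform SFS on the training data only
sfs = sfs.fit(X_train, y_train)

# Get selected feature indices
selected_features = sfs.k_feature_idx_
print(f"Selected Features: {selected_features}")

# Subset the dataset with selected features
X_train_sfs = sfs.transform(X_train)
X_test_sfs = sfs.transform(X_test)

# Train classifier with selected features
clf.fit(X_train_sfs, y_train)

# Predict and evaluate model performance on the held-out test set
y_pred = clf.predict(X_test_sfs)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy with selected features: {accuracy}")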

Selected Features: (0, 2, 3)
Model accuracy with selected features: 1.0
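
The printed tuple holds column indices, not names. A small sketch of mapping them back, assuming the standard Iris column order returned by `load_iris().feature_names` (hardcoded here so the snippet runs on its own):

```python
# Column order of the Iris dataset as exposed by scikit-learn
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Indices printed by the cell above
selected_idx = (0, 2, 3)

# Translate positional indices into readable column names
selected_names = [feature_names[i] for i in selected_idx]
print(selected_names)
# ['sepal length (cm)', 'petal length (cm)', 'petal width (cm)']
```

In the notebook itself, `X.columns[list(selected_features)]` gives the same result without hardcoding the names.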


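Step Backward Selection (SBS), also called Backward Elimination, is the reverse of forward
selection: it starts from the full feature set and removes one feature at a time.
How Step Backward Feature Selection Works: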
1. Start with all features: Begin with a model that includes all the features.
2. Evaluate each feature: For each feature, calculate how much removing that feature
affects the model's performance (using metrics like accuracy, precision, etc.).
3. Remove the least important feature: Remove the feature that results in the smallest
decrease (or the largest increase) in model performance.
4. Repeat the process: Rebuild the model with the remaining features, and again remove
the feature whose absence impacts performance the least.
5. Stop when a criterion is met: Continue the process until a stopping condition is met,
such as when a predefined number of features is left or removing any remaining feature
would reduce performance.
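
The backward procedure above can be sketched the same way. This is a minimal illustration under stated assumptions, not mlxtend's implementation: `backward_select` and `toy_score` are hypothetical names, the toy scores stand in for a cross-validated metric, and the loop uses only the "predefined number of features" stopping rule.

```python
def backward_select(features, score_fn, k_features):
    """Greedy backward elimination (illustrative sketch).

    features   : list of feature names to start from
    score_fn   : callable taking a list of features, returning a score
                 (higher is better); hypothetical stand-in for CV accuracy
    k_features : number of features to keep
    """
    selected = list(features)          # step 1: start with all features
    best_score = score_fn(selected)

    while len(selected) > k_features:  # step 5: stop at the target size
        # step 2: score each subset obtained by dropping exactly one feature
        scored = [
            (score_fn([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        # step 3: drop the feature whose removal leaves the highest score
        round_score, dropped = max(scored, key=lambda pair: pair[0])
        selected.remove(dropped)
        best_score = round_score       # step 4: repeat with the reduced set

    return selected, best_score


def toy_score(subset):
    # Hypothetical scores in percentage points: "b" is largely redundant
    # once "a" is present, and "d" is noise that slightly hurts the model.
    gains = {"a": 50, "b": 30, "c": 5, "d": -3}
    score = sum(gains[f] for f in subset)
    if "a" in subset and "b" in subset:
        score -= 28
    return score


print(backward_select(["a", "b", "c", "d"], toy_score, k_features=2))
# (['a', 'c'], 55)
```

Here the noisy feature "d" is removed first (the score rises from 54 to 57), then the redundant feature "b". Because the loop always shrinks to `k_features`, the final score can be lower than an intermediate one.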


In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initializing the classifier (Logistic Regression in this case)
clf = LogisticRegression(max_iter=200)
# Initializing the Sequential Backward Selection (SBS)
sbs = SFS(clf,
          k_features=3,        # Stop once 3 features remain
          forward=False,       # Backward selection
          floating=False,      # Set to False for simple step backward
          scoring='accuracy',
          cv=5)                # 5-fold cross-validation
# Perform SBS
sbs = sbs.fit(X_train, y_train)
# Get selected feature indices
selected_features = sbs.k_feature_idx_
print(f"Selected Features: {selected_features}")
# Subset the dataset with selected features
X_train_sbs = sbs.transform(X_train)
X_test_sbs = sbs.transform(X_test)
# Train classifier with selected features
clf.fit(X_train_sbs, y_train)
# Predict and evaluate model performance
y_pred = clf.predict(X_test_sbs)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy with selected features: {accuracy}")

Selected Features: (0, 2, 3)
Model accuracy with selected features: 1.0
