Stepwise Regression with statsmodels

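    * Forward Selection: Starts with no features and adds them one at a time, each time choosing the candidate that best meets a criterion (e.g., the lowest p-value), until no remaining candidate qualifies.
    * Backward Elimination: Starts with all features and removes them one at a time, each time dropping the feature that fares worst on the criterion (e.g., the highest p-value), until every remaining feature qualifies.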

Detailed Steps

    1. Prepare the Data: Load or generate your dataset.
    2. Define the Forward Selection Function: Implement the logic to add features.
    3. Define the Backward Elimination Function: Implement the logic to remove features.
    4. Evaluate the Results: Check the final model summary.

In [1]:
# Step 1: Prepare the Data
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
from statsmodels.api import OLS, add_constant
from sklearn.model_selection import train_test_split

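# Generate synthetic data. make_regression defaults to n_informative=10,
# so with n_features=10 every feature contributes to the target.
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)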
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)


In [5]:
# Step 2: Define Forward Selection
# Start with no features.
# Iteratively add features based on the lowest p-value.

def forward_selection(X, y, threshold_in=0.05):
    """Greedily add the feature with the lowest p-value while it is below threshold_in."""
    best_features = []

    while True:
        # Preserve column order so ties are broken deterministically
        remaining_features = [col for col in X.columns if col not in best_features]
        best_pvalue = float('inf')
        best_feature = None

        # Refit with each candidate added and track the lowest p-value
        for feature in remaining_features:
            model = OLS(y, add_constant(X[best_features + [feature]])).fit()
            pvalue = model.pvalues[feature]

            if pvalue < best_pvalue:
                best_pvalue = pvalue
                best_feature = feature

        # Stop when no candidates are left or the best one is not significant
        if best_feature is not None and best_pvalue < threshold_in:
            best_features.append(best_feature)
        else:
            break

    return best_features


In [6]:
# Step 3: Define Backward Elimination
# Start with all features.
# Remove the feature with the highest p-value if it exceeds the threshold.
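def backward_elimination(X, y, threshold_out=0.05):
    """Drop the feature with the highest p-value until all are at or below threshold_out."""
    features = list(X.columns)

    # Stop early if every feature has been removed
    while features:
        model = OLS(y, add_constant(X[features])).fit()
        pvalues = model.pvalues.drop('const')  # Exclude the intercept by name

        worst_pvalue = pvalues.max()
        if worst_pvalue > threshold_out:
            worst_feature = pvalues.idxmax()
            features.remove(worst_feature)
        else:
            break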

    return features


In [7]:
# Step 4: Execute and Evaluate
# Forward Selection
selected_features_forward = forward_selection(X_train, y_train)
print("Selected features after forward selection:", selected_features_forward)

# Backward Elimination
selected_features_backward = backward_elimination(X_train, y_train)
print("Selected features after backward elimination:", selected_features_backward)

# Fit the final model on the features chosen by forward selection
final_model = OLS(y_train, add_constant(X_train[selected_features_forward])).fit()
print(final_model.summary())


Selected features after forward selection: ['feature_6', 'feature_4', 'feature_9', 'feature_5', 'feature_3', 'feature_1', 'feature_0', 'feature_7', 'feature_2', 'feature_8']
Selected features after backward elimination: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9']
                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.259e+07
Date:                Thu, 19 Sep 2024   Prob (F-statistic):          1.33e-220
Time:                        16:26:53   Log-Likelihood:                 73.260
No. Observations:                  80   AIC:                            -124.5
Df Residuals:                      69   BIC:                            -98.32
Df Model:        