In this activity, you will implement backward elimination, a feature selection technique that helps you identify the most important features for your ML model. Backward elimination starts with a model that includes every feature, then repeatedly removes the feature with the highest p-value and refits, stopping once every remaining feature is significant at a chosen threshold. The result is a simpler model with fewer inputs. The goal of this activity is to help you apply this technique to a dataset and refine your model by eliminating irrelevant features.

By the end of this activity, you'll be able to:

Implement backward elimination: identify and remove the least significant features from a dataset.

Apply statistical modeling: fit a linear regression model using the statsmodels library and interpret the p-values to determine feature significance.

Refine and simplify models: analyze the impact of removing irrelevant features on model performance and interpret the results to improve model efficiency.

In [1]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

In [6]:
# Sample dataset
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'PrevExamScore': [30, 40, 45, 50, 60, 65, 70, 75, 80, 85],
    'Pass': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = Fail, 1 = Pass
}

df = pd.DataFrame(data)

# Features and target variable
X = df[['StudyHours', 'PrevExamScore']]
y = df['Pass']

In [7]:
# Add a constant to the model (for the intercept)
X = sm.add_constant(X)

In [8]:
# Fit an Ordinary Least Squares (OLS) regression; because Pass is 0/1, this is a linear probability model
model = sm.OLS(y, X).fit()

# Display the summary, including p-values for each feature
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   Pass   R-squared:                       0.758
Model:                            OLS   Adj. R-squared:                  0.688
Method:                 Least Squares   F-statistic:                     10.94
Date:                Wed, 04 Jun 2025   Prob (F-statistic):            0.00701
Time:                        14:32:42   Log-Likelihood:               -0.17258
No. Observations:                  10   AIC:                             6.345
Df Residuals:                       7   BIC:                             7.253
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.3333      1.464     -0.228


In [9]:
# Define a significance level
significance_level = 0.05

# Perform backward elimination
while True:
    # Fit the model
    model = sm.OLS(y, X).fit()
    # Get the highest p-value in the model
    # (model.pvalues includes 'const', so the intercept is also a candidate for removal)
    max_p_value = model.pvalues.max()
    
    # Check if the highest p-value is greater than the significance level
    if max_p_value > significance_level:
        # Identify the feature with the highest p-value
        feature_to_remove = model.pvalues.idxmax()
        print(f"Removing feature: {feature_to_remove} with p-value: {max_p_value}")
        
        # Drop the feature
        X = X.drop(columns=[feature_to_remove])
    else:
        break

# Display the final model summary
print(model.summary())

Removing feature: PrevExamScore with p-value: 0.9999999999999999
Removing feature: const with p-value: 0.11419580126842226
                                 OLS Regression Results                                
Dep. Variable:                   Pass   R-squared (uncentered):                   0.831
Model:                            OLS   Adj. R-squared (uncentered):              0.812
Method:                 Least Squares   F-statistic:                              44.31
Date:                Wed, 04 Jun 2025   Prob (F-statistic):                    9.31e-05
Time:                        14:32:42   Log-Likelihood:                         -1.8294
No. Observations:                  10   AIC:                                      5.659
Df Residuals:                       9   BIC:                                      5.961
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                

