# Tutorial: Using Forward Stepwise for Feature Selection
In this tutorial, we will demonstrate how to use Forward Stepwise methods for feature selection with the Ozone dataset. This dataset records ozone levels alongside various weather conditions, and it will help us understand how feature selection techniques can improve environmental models.

## Setting Up the Environment
Before starting, ensure you have all the required libraries installed:

In [None]:
# Install the necessary libraries
%pip install numpy pandas scikit-learn statsmodels faraway

## Importing the Required Libraries
Let's import the libraries we need. These libraries include tools for data manipulation, statistical modeling, and machine learning.

In [1]:
# Import essential libraries
from sklearn import metrics  # provides mean_squared_error, used later
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# Import dataset
import faraway.datasets.ozone as ozone

## Loading and Preparing the Dataset
We will focus on the Ozone dataset, which contains multiple weather-related variables that we will use to predict ozone levels. We will separate the features X from the target variable y, then split the data into training and testing sets.

In [2]:
# Load the Ozone dataset
data = ozone.load()

# Separate features (X) and the target variable (y)
X = data.drop(columns=['O3'])
y = data['O3']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
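
As a quick sanity check on the split, the sketch below reproduces the row counts by hand. It assumes that a fractional `test_size` is rounded up to a whole number of test rows, which is our understanding of `train_test_split`, and it assumes the dataset has 330 rows. That figure is not stated in this tutorial; it is inferred from the 264 training observations that the regression summary reports for an 80/20 split. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Assumption: 330 rows in total, inferred from 264 training rows at an 80/20 split
n_train, n_test = split_sizes(330, 0.2)
print("train rows:", n_train, "test rows:", n_test)
```

If the printed training count differs from `len(X_train)` in your session, the dataset version you loaded has a different number of rows than assumed here.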

## Fitting a Full Model and Analyzing Significance
Now that we have our training data, we will start by fitting a multiple linear regression model using all available features and then evaluate the significance of each variable.

In [3]:
# Add an intercept (constant) to the model
X_train = sm.add_constant(X_train)

# Fit a multiple linear regression model using all features
model_all_features_OLS = sm.OLS(y_train, X_train).fit()

# Display the summary of the model
print(model_all_features_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                     O3   R-squared:                       0.694
Model:                            OLS   Adj. R-squared:                  0.683
Method:                 Least Squares   F-statistic:                     63.96
Date:                Sat, 28 Sep 2024   Prob (F-statistic):           3.08e-60
Time:                        15:39:59   Log-Likelihood:                -767.04
No. Observations:                 264   AIC:                             1554.
Df Residuals:                     254   BIC:                             1590.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         30.3074     33.267      0.911      0.3

* The R-squared value of 0.694 tells us that about 69.4% of the variance in ozone levels (O3) is explained by the variables we included in the model. This means our model does an OK job, but 30.6% of the ozone variability is still not captured, so there is room to make the model better.

* The Adjusted R-squared is slightly lower at 0.683 because it penalizes every additional variable. The small gap between the two values shows that the penalty for using nine predictors is modest.

* The F-statistic of 63.96, along with a very small p-value (3.08e-60), shows that our overall model is statistically significant, meaning that at least one of the variables is meaningfully linked to O3.

This confirms that our model is useful for predicting ozone levels, but there's still potential to refine it further.
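
The Adjusted R-squared and F-statistic in the summary can be recomputed by hand from the R-squared, the number of observations (264), and the number of predictors (9, the `Df Model` entry). The sketch below uses the standard textbook formulas and the unrounded R-squared printed in the next section. The function names are ours and are not part of statsmodels.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of a regression with an intercept, derived from R-squared."""
    explained = r2 / n_predictors
    unexplained = (1 - r2) / (n_obs - n_predictors - 1)
    return explained / unexplained


r2_full = 0.6938475174333434  # unrounded R-squared of the full model
adj = adjusted_r2(r2_full, n_obs=264, n_predictors=9)
f_stat = f_statistic(r2_full, n_obs=264, n_predictors=9)
print("Adjusted R-squared:", round(adj, 3))
print("F-statistic:", round(f_stat, 2))
```

Both values agree with the summary table (0.683 and 63.96), which is a useful check that we are reading the output correctly.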

## Performance of Our Model
Let's focus on the R-squared value of the full model, which measures how well the model explains the variability in the target variable (O3).

Then let's predict the ozone levels (O3) on the training data and evaluate the model's accuracy by calculating the Mean Squared Error (MSE).

In [10]:
# Calculate the R-squared value of the full model
R2 = model_all_features_OLS.rsquared
print("R-squared =", R2)

# Predict values using the training set
y_pred_train = model_all_features_OLS.predict(X_train)

# Calculate the Mean Squared Error (MSE)
MSE = metrics.mean_squared_error(y_train, y_pred_train)
print("MSE =", MSE)

R-squared = 0.6938475174333434
MSE = 19.55081161581369


The R-squared value indicates the proportion of the variance explained by the model, and the MSE measures the average squared prediction error on the training data. Both suggest there is still room for improvement in the features we are using.
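
To make the MSE concrete, here is a minimal hand-rolled version of the calculation on toy numbers, followed by the root of the training MSE reported above. Taking the square root (RMSE) puts the error back on the scale of O3 itself. The toy values are invented for illustration, and `mse` is our own helper, not the scikit-learn function.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values invented for illustration (not taken from the Ozone data)
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 9.0]
toy_mse = mse(toy_true, toy_pred)  # squared errors are 1, 0, 1, 0

# RMSE of the full model on the training data, from the MSE printed above
train_rmse = math.sqrt(19.55081161581369)
print("toy MSE:", toy_mse)
print("training RMSE:", round(train_rmse, 2))
```

An RMSE of about 4.42 means the full model's training predictions are typically off by roughly 4.4 units of O3.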

## Identifying the Most Important Features
The p-value is a statistical measure that helps us determine whether a variable is significantly associated with the target variable (O3). A low p-value (typically less than 0.05) suggests that the variable's coefficient is distinguishable from zero, given the other variables in the model. Let's extract the variables that meet this threshold:

In [11]:
# Extract p-values of the features
p_values = model_all_features_OLS.pvalues

# Check which variables are significant at the 5% level (one boolean per column)
sigVars = (p_values < 0.05).tolist()
variable_names = X_train.columns.tolist()
significant_variable_names = [name for name, is_sig in zip(variable_names, sigVars) if is_sig]

# Display significant variables
print("Significant Variables:", significant_variable_names)

Significant Variables: ['humidity', 'temp', 'ibt', 'doy']


## Forward Stepwise Selection
As defined previously, forward stepwise selection starts with a small model and adds one variable at a time, each time choosing the variable that improves the model the most. Here we start from the significant variables identified above rather than from an empty model.

Let's use this along with the Adjusted R-squared to find the combination of features that best represents our dataset:

In [7]:
# Define a function to calculate Adjusted R-squared for a given set of features
def calculate_adjusted_r2(X, y):
    model_OLS = sm.OLS(y, X).fit()
    return model_OLS.rsquared_adj, model_OLS

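# Select only the significant features for a new model.
# Note: 'const' is not among the significant columns, so the models below are fit
# without an intercept, and statsmodels then reports an uncentered R-squared.
significant_features = X_train.columns[sigVars]
X_train_significant_features = X_train[significant_features]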

# Initialize with significant features
selected_vars = list(X_train_significant_features.columns)
remaining_vars = [var for var in X_train.columns if var not in selected_vars and var != 'const']

# Best model using initial significant features
best_adj_r2, best_model = calculate_adjusted_r2(X_train_significant_features, y_train)

# Forward Stepwise process: Add variables one at a time
while remaining_vars:
    adj_r2_with_candidates = []
    for candidate in remaining_vars:
        # Add candidate variable to the current model
        X_candidate = X_train_significant_features.join(X_train[candidate])
        adj_r2, model = calculate_adjusted_r2(X_candidate, y_train)
        adj_r2_with_candidates.append((adj_r2, candidate, model))
    
    # Select the candidate with the highest Adjusted R-squared.
    # Comparing on the score only avoids comparing model objects if two scores tie.
    best_new_adj_r2, best_new_var, best_new_model = max(adj_r2_with_candidates, key=lambda item: item[0])

    # Update model if new Adjusted R-squared is higher
    if best_new_adj_r2 > best_adj_r2:
        selected_vars.append(best_new_var)
        X_train_significant_features = X_train_significant_features.join(X_train[best_new_var])
        best_adj_r2 = best_new_adj_r2
        best_model = best_new_model
        remaining_vars.remove(best_new_var)
    else:
        break

# Display final selected variables and Adjusted R-squared
print("Adjusted R-squared of Forward Stepwise Model:", best_adj_r2)
print("Selected Variables:", selected_vars)

Adjusted R-squared of Forward Stepwise Model: 0.9030375117616768
Selected Variables: ['humidity', 'temp', 'ibt', 'doy', 'vh', 'vis']
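
For comparison, the textbook variant of forward stepwise selection starts from an intercept-only model and always keeps the intercept. The sketch below illustrates that variant on a small invented dataset. It is not the procedure used in the cell above: to stay dependency-free it replaces `sm.OLS` with a tiny normal-equations solver, and all names (`solve`, `adjusted_r2`, `forward_stepwise`, `x1` to `x3`) are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def adjusted_r2(columns, y):
    """Fit OLS with an intercept via the normal equations; return Adjusted R-squared."""
    n = len(y)
    design = [[1.0] * n] + [list(col) for col in columns]  # intercept column first
    k = len(design)
    XtX = [[sum(a * b for a, b in zip(design[i], design[j])) for j in range(k)]
           for i in range(k)]
    Xty = [sum(a * b for a, b in zip(design[i], y)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(beta[j] * design[j][i] for j in range(k)) for i in range(n)]
    mean_y = sum(y) / n
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ssr / tss
    return 1 - (1 - r2) * (n - 1) / (n - k)  # n - k equals n - p - 1


def forward_stepwise(features, y):
    """Greedily add the feature that raises Adjusted R-squared the most."""
    selected, remaining = [], list(features)
    best_score = adjusted_r2([], y)  # intercept-only starting point
    while remaining:
        scores = [(adjusted_r2([features[v] for v in selected + [name]], y), name)
                  for name in remaining]
        score, name = max(scores)
        if score <= best_score:
            break  # no candidate improves the model, so stop
        selected.append(name)
        remaining.remove(name)
        best_score = score
    return selected, best_score


# Toy data invented for illustration: y depends on x1 and x2 but not on x3
features = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [4.5, 5.5, 9.5, 8.5, 11.5, 10.5, 14.5, 15.5]
selected, best_adj_r2 = forward_stepwise(features, y)
print("Selected:", selected, "Adjusted R-squared:", round(best_adj_r2, 4))
```

On this toy data the procedure picks `x1`, then `x2`, and stops because adding `x3` lowers the Adjusted R-squared. Keeping the intercept in every candidate model keeps the scores on the usual centered R-squared scale, so they can be compared with the full model's Adjusted R-squared.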


## Testing the Final Model
Finally, we test our model on unseen data (test set) to evaluate its performance.

In [12]:
# Prepare the test set with the selected features
X_test = sm.add_constant(X_test)
X_test_best_forward_features = X_test[selected_vars]

# Predict on the test set
pred = best_model.predict(X_test_best_forward_features)

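# Note: best_model.rsquared is computed on the training data, not on the test set
R2 = best_model.rsquared
print("Training R-squared of selected model =", R2)

# Calculate the MSE on the test set
MSEpred = metrics.mean_squared_error(y_test, pred)

print(f"Test Set MSE = {MSEpred}\n")

Training R-squared of selected model = 0.9052412046761842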
Test Set MSE = 17.658748987607208
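
One caveat when reading the R-squared of the selected model: it was fit without an intercept column, and as far as we understand the statsmodels documentation, `rsquared` is then computed against the uncentered total sum of squares (the sum of `y**2`) instead of the centered one. The sketch below shows how much the two definitions can differ for the same residuals. The numbers are invented for illustration and both function names are ours.

```python
def r2_centered(y, fitted):
    """Usual R-squared: residual variation relative to variation around the mean."""
    mean_y = sum(y) / len(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ssr / tss


def r2_uncentered(y, fitted):
    """Uncentered R-squared: residual variation relative to variation around zero."""
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum(yi ** 2 for yi in y)
    return 1 - ssr / tss


# Toy values invented for illustration; note that y sits far from zero
y_toy = [10.0, 12.0, 11.0, 13.0]
fitted_toy = [10.5, 11.5, 11.5, 12.5]
centered = r2_centered(y_toy, fitted_toy)
uncentered = r2_uncentered(y_toy, fitted_toy)
print("centered:", round(centered, 3), "uncentered:", round(uncentered, 3))
```

With identical predictions, the uncentered value is much closer to 1 simply because the target is far from zero. For that reason the MSE is the safer number to compare between the full model and the selected model.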



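Our forward stepwise model uses six of the nine predictors and reaches a test-set MSE of 17.66, compared with a training MSE of 19.55 for the initial model that included all variables. The two R-squared values are not directly comparable, because the stepwise model was fit without an intercept and statsmodels reports an uncentered R-squared in that case.

By focusing on key predictors, we've simplified the model's structure while keeping prediction error low, highlighting the effectiveness of systematic feature selection.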