### Linear Model Selection Example 2.5

To find the best among all models produced by forward stepwise selection, that is, to identify the most appropriate number of predictors on the basis of the AIC, we only have to repeat the known procedure once for each possible number of predictors, adding one predictor per iteration. Note that the parameter *scoreby* is now set to AIC. The model with the lowest AIC along the resulting path is the one selected.

In [1]:
import pandas as pd
import numpy as np
from LMS_def import *

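# Load data
# Note: the CSV's unnamed first column is read in as 'Unnamed: 0' and
# therefore appears among the candidate predictors below.
df = pd.read_csv('./data/Credit.csv')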

# Convert categorical variables to dummy variables
df = pd.get_dummies(data=df, drop_first=True,
                    prefix=('Gender_', 'Student_', 
                            'Married_', 'Ethnicity_'))

x_full = df.drop(columns='Balance')
y = df['Balance']

results = pd.DataFrame(data={'Best_Pred': [], 'AIC':[]})

# Define the empty predictor
x0 = [np.zeros(len(y))]

x = x0
x_red = x_full.copy()

for i in range(x_full.shape[1]):
    results_i, best_i = add_one(x_red, x, y, scoreby='AIC')
    
    # Append the best predictor to the set of chosen predictors
    x = np.concatenate((x, [df[best_i]]))

    # Remove the chosen predictor from the list of options
    x_red = x_red.drop(columns=best_i)

    # Save results 
    results.loc[i, 'Best_Pred'] = best_i
    results.loc[i, 'AIC'] = results_i['AIC'].min()
    
print('Best Predictors and corresponding AIC:\n', results, 
      '\n\nThe best model thus contains', 
      results['AIC'].argmin() + 1, ' predictors')

Best Predictors and corresponding AIC:
                Best_Pred          AIC
0                 Rating  5494.781548
1                 Income  5212.557085
2           Student__Yes  4849.386992
3                  Limit  4832.524008
4                  Cards  4817.666820
5                    Age  4815.038963
6         Gender__Female  4815.900560
7             Unnamed: 0  4817.004037
8       Ethnicity__Asian  4818.286694
9           Married__Yes  4819.562597
10  Ethnicity__Caucasian  4820.935924
11             Education  4822.448073 

The best model thus contains 6  predictors
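
The helper `add_one` comes from `LMS_def` and is not shown in this example. As a rough illustration of the procedure, the following self-contained sketch performs forward selection by AIC on synthetic data. The functions `gaussian_aic` and `forward_select` are hypothetical stand-ins and not the actual contents of `LMS_def`. The AIC is computed here as `2k - 2 log L` for a Gaussian linear model, with `k` counting the regression coefficients including the intercept. Other conventions also count the noise variance, so absolute values may differ from those printed above.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC = 2k - 2 log L of a least-squares fit with intercept (assumed convention)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
    rss = float(np.sum((y - A @ beta) ** 2))
    k = A.shape[1]                                # number of fitted coefficients
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * k - 2 * loglik

def forward_select(X, y, names):
    """Greedily add the predictor that yields the lowest AIC at each step."""
    remaining = list(range(X.shape[1]))
    chosen, path = [], []
    while remaining:
        scores = {j: gaussian_aic(X[:, chosen + [j]], y) for j in remaining}
        best = min(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
        path.append((names[best], scores[best]))
    return path

# Synthetic data (illustration only): y depends on x0 and x2
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)
names = ['x0', 'x1', 'x2', 'x3', 'x4']

path = forward_select(X, y, names)
aics = [aic for _, aic in path]
best_size = int(np.argmin(aics)) + 1  # model size with the lowest AIC

for name, aic in path:
    print(f'{name:>3}  {aic:.2f}')
print('Best model size:', best_size)
```

As in the example above, the selected model size is the position of the minimum AIC along the selection path, plus one because positions start at zero.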
