### Linear Model Selection Example 2.6

The right-hand panel displays the BIC for the **Credit** data set. The BIC values shown there are obtained by subtracting the BIC of the null model.
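To make the comparison with the null model concrete, the following minimal sketch computes the BIC of a Gaussian linear model and its difference to the intercept-only model. It assumes the common definition $BIC = -2\log L + k\log n$, where $k$ counts the regression coefficients including the intercept. The helper name `gaussian_bic` and the synthetic data are illustrative assumptions; they are not taken from `LMS_def` or from the Credit data.

```python
import numpy as np

def gaussian_bic(X, y):
    """BIC = -2*logL + k*log(n) for an OLS fit with intercept.

    Assumption: k counts the intercept and the slopes. Other conventions
    (e.g. also counting the error variance) shift all values by a constant.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n = len(y)
    # Design matrix: intercept column plus the given predictors
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    # Maximised Gaussian log-likelihood
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return -2 * loglik + A.shape[1] * np.log(n)

# Synthetic data (illustrative only, not the Credit data)
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

bic_null = gaussian_bic(np.empty((n, 0)), y)  # intercept-only model
bic_x = gaussian_bic(x, y)                    # model with one predictor

# A negative difference means the model improves on the null model
print('BIC relative to the null model:', bic_x - bic_null)
```

Subtracting the BIC of the null model changes every value by the same constant, so the ranking of the candidate models is unaffected.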

Based on the BIC values displayed in the **Python**-output, we conclude that the best model produced by forward stepwise selection contains five predictor variables. 


The **Python**-output in the previous example shows that the best model produced by forward stepwise selection and containing five predictors is given by
\begin{align*}
 \text{balance}
&=\beta_{0}+\beta_{1}\cdot \text{income} +\beta_{2}\cdot \text{limit} +\beta_{3}\cdot \text{rating} +\beta_{4}\cdot \text{cards} \\
&\quad+\beta_{5}\cdot \text{student} +\epsilon
\end{align*}

In [1]:
import pandas as pd
import numpy as np
from LMS_def import *

# Load data
df = pd.read_csv('./data/Credit.csv')

# Convert Categorical variables
df = pd.get_dummies(data=df, drop_first=True, 
                    prefix=('Gender_', 'Student_', 
                            'Married_', 'Ethnicity_'))

x_full = df.drop(columns='Balance')
y = df['Balance']

results = pd.DataFrame(data={'Best_Pred': [], 'BIC':[]})

# Define the empty predictor
x0 = [np.zeros(len(y))]

x = x0
x_red = x_full.copy()

for i in range(x_full.shape[1]):
    results_i, best_i = add_one(x_red, x, y, scoreby='BIC')
    
    # Append the best predictor to the current model
    x = np.concatenate((x, [df[best_i]]))

    # Remove the chosen predictor from the list of options
    x_red = x_red.drop(columns=best_i)

    # Save results 
    results.loc[i, 'Best_Pred'] = best_i
    results.loc[i, 'BIC'] = results_i['BIC'].min()
    
print('Best Predictors and corresponding BIC:\n', results, 
      '\n\nThe best model thus contains', 
      results['BIC'].argmin() + 1, ' predictors')

Best Predictors and corresponding BIC:
                Best_Pred          BIC
0                 Rating  5502.764477
1                 Income  5224.531479
2           Student__Yes  4865.352851
3                  Limit  4852.481331
4                  Cards  4841.615607
5                    Age  4842.979215
6         Gender__Female  4847.832276
7             Unnamed: 0  4852.927218
8       Ethnicity__Asian  4858.201340
9           Married__Yes  4863.468707
10  Ethnicity__Caucasian  4868.833498
11             Education  4874.337112 

The best model thus contains 5  predictors
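
The helper `add_one` is imported from `LMS_def`, and its source is not shown here. As a rough sketch of what a single forward step might look like, the hypothetical function `add_one_sketch` below fits one model per remaining candidate and returns the candidate with the smallest BIC. The names `bic_ols` and `add_one_sketch`, the dictionary-based interface and the synthetic data are all assumptions made for illustration; the actual implementation in `LMS_def` may differ.

```python
import numpy as np

def bic_ols(columns, y):
    """BIC of an OLS fit with intercept on a list of 1-D columns."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + list(columns))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return -2 * loglik + A.shape[1] * np.log(n)

def add_one_sketch(candidates, selected, y):
    """Hypothetical stand-in for add_one: score every remaining candidate
    when added to the selected columns; return (scores, best name)."""
    scores = {name: bic_ols(list(selected.values()) + [col], y)
              for name, col in candidates.items()}
    best = min(scores, key=scores.get)
    return scores, best

# Synthetic data: y depends on x1 and x2 only, x3 is pure noise
rng = np.random.default_rng(1)
n = 300
data = {name: rng.normal(size=n) for name in ('x1', 'x2', 'x3')}
y = 1.0 + 4.0 * data['x1'] - 2.0 * data['x2'] + rng.normal(size=n)

candidates, selected, path = dict(data), {}, []
while candidates:
    scores, best = add_one_sketch(candidates, selected, y)
    selected[best] = candidates.pop(best)   # move best candidate into the model
    path.append((best, scores[best]))

# Row i of the path describes a model with i + 1 predictors
best_size = int(np.argmin([b for _, b in path])) + 1
print(path)
print('Selected model size:', best_size)
```

As in the cell above, the predictors are added greedily, and the model size is chosen afterwards by comparing the BIC values along the whole path.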


To identify the best model by *backward stepwise selection*, we proceed analogously: starting from the full model, we remove one predictor at a time.

In [2]:
results = pd.DataFrame(data={'Worst_Pred': [], 'BIC':[]})

# Define the full predictor
x = x_full.copy()

for i in range(x_full.shape[1]):
    results_i, worst_i = drop_one(x, y, scoreby='BIC')
    
    # Remove the worst predictor from the current model
    x = x.drop(columns=worst_i)

    # Save results 
    results.loc[i, 'Worst_Pred'] = worst_i
    results.loc[i, 'BIC'] = results_i['BIC'].min()
    
print('Worst Predictors and corresponding BIC:\n', results, 
      '\n\nThe best model thus contains', 
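      # Row i is scored after i + 1 predictors have been removed
      x_full.shape[1] - (results['BIC'].argmin() + 1), ' predictors')

Worst Predictors and corresponding BIC:
               Worst_Pred          BIC
0              Education  4868.833498
1   Ethnicity__Caucasian  4863.468707
2           Married__Yes  4858.201340
3       Ethnicity__Asian  4852.927218
4             Unnamed: 0  4847.832276
5         Gender__Female  4842.979215
6                    Age  4841.615607
7                 Rating  4840.658660
8                  Cards  4873.759072
9           Student__Yes  5237.176925
10                Income  5507.965562
11                 Limit  6044.702777 

The best model thus contains 4  predictors

The smallest BIC (4840.66) is reached in row 7, that is, after eight of the twelve predictors have been removed. The model selected by backward stepwise selection therefore contains the four predictors Cards, Student, Income and Limit. Unlike the five-predictor model found by forward stepwise selection, it does not include Rating.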
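
The helper `drop_one` also comes from `LMS_def` and is not shown. The sketch below illustrates one possible backward step and, in particular, how a row of the elimination path maps to a model size. The functions `bic_ols` and `drop_one_sketch`, the dictionary-based interface and the synthetic data are hypothetical and only serve as an illustration.

```python
import numpy as np

def bic_ols(columns, y):
    """BIC of an OLS fit with intercept on a list of 1-D columns."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + list(columns))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return -2 * loglik + A.shape[1] * np.log(n)

def drop_one_sketch(selected, y):
    """Hypothetical stand-in for drop_one: score the model obtained by
    removing each column in turn; return (scores, name to remove)."""
    scores = {name: bic_ols([c for k, c in selected.items() if k != name], y)
              for name in selected}
    worst = min(scores, key=scores.get)
    return scores, worst

# Synthetic data: y depends on x1 and x2 only, x3 is pure noise
rng = np.random.default_rng(1)
n = 300
data = {name: rng.normal(size=n) for name in ('x1', 'x2', 'x3')}
y = 1.0 + 4.0 * data['x1'] - 2.0 * data['x2'] + rng.normal(size=n)

selected, path = dict(data), []
p = len(selected)
while selected:
    scores, worst = drop_one_sketch(selected, y)
    del selected[worst]                     # remove the least useful predictor
    path.append((worst, scores[worst]))

# Row i is scored after i + 1 removals, so it has p - (i + 1) predictors.
# As in the loop above, the full model itself is not scored.
i_min = int(np.argmin([b for _, b in path]))
best_size = p - (i_min + 1)
kept = [name for name, _ in path[i_min + 1:]]  # predictors removed later remain
print('Selected model size:', best_size, 'with predictors', kept)
```

The predictors of the selected model are exactly those that are removed in the rows after the minimum, which is how the retained predictors can be read off a table of removed predictors.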
