<div >
<img src = "../banner.jpg" />
</div>

<a target="_blank" href="https://colab.research.google.com/github/ignaciomsarmiento/BDML_202402/blob/main/Lecture05/Notebook_ModelSelection.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


# Introduction

The process of model selection involves choosing the most appropriate model from a set of potential candidates based on their predictive performance. Model selection techniques are essential to identify the model that best balances complexity and accuracy without overfitting or underfitting the data.

Among the various strategies for model selection, three popular methods are Best Subset Selection, Forward Selection, and Backward Selection. Each method has a distinct approach to selecting variables.

In the following sections, we will explore each of these methods in detail, highlighting their procedures, advantages, and when to use them.



# Running Example: Predicting Wages

Our objective today is to construct a model of individual wages:

$$
w = f(X) + u 
$$

where $w$ is the wage and $X$ is a matrix of potential explanatory variables (predictors). In this notebook, we will focus on a linear model of the form

\begin{align}
 \ln(w) & = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p  + u 
\end{align}

where $\ln(w)$ is the logarithm of the wage.

To illustrate, I'm going to use a sample of the NLSY97. The NLSY97 is a nationally representative sample of 8,984 men and women born during the years 1980 through 1984 and living in the United States at the time of the initial survey in 1997. Participants were ages 12 to 16 as of December 31, 1996. Interviews were conducted annually from 1997 to 2011 and biennially since then.

Let's load the packages and the data set:

In [None]:
#install.packages("pacman") #for google colab

In [None]:
#packages
require("pacman")
p_load("tidyverse")

nlsy <- read_csv('https://raw.githubusercontent.com/ignaciomsarmiento/datasets/main/nlsy97.csv')

nlsy <- nlsy %>% drop_na(educ) # drop rows with missing values (NA) in educ

In [None]:
colnames(nlsy[1:21])

In [None]:
table(nlsy$yhea_100_1997)

Let's keep a subset of these predictors

In [None]:
nlsy <- nlsy[1:15] # keep the first 15 columns

In [None]:
dim(nlsy)

## Best Subset Selection


1.  Let $M_0$ denote the null model, which contains no predictors. This
    model simply predicts the sample mean for each observation.

2.  For $k=1,2,\dots,p$:

    1.  Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.

    2.  Pick the best among these $\binom{p}{k}$ models, and call it
        $M_k$. Here *best* means the one with the smallest training $MSE$.

3.  Select a single best model from among $M_0,\dots, M_p$, using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$. Training $MSE$ cannot be used here because it never increases as predictors are added, so it would always favor the largest model.
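
The three steps above can be sketched in base R without `leaps`. This is only an illustrative sketch on simulated data, not the NLSY97 sample: the variable names `x1` to `x4`, the data-generating process, and the use of BIC in step 3 are all assumptions made for the example.

```r
# Minimal best-subset sketch on simulated data (hypothetical, not the NLSY97 sample).
set.seed(1)
n <- 200
p <- 4
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
# Assumed data-generating process: only x1 and x3 matter.
y <- 1 + 2 * X$x1 - 1.5 * X$x3 + rnorm(n)
dat <- cbind(y = y, X)

# Residual sum of squares of a fitted model.
rss <- function(fit) sum(resid(fit)^2)

# Step 1: the null model M_0 predicts the sample mean.
models <- list(lm(y ~ 1, data = dat))

# Step 2: for each size k, fit all choose(p, k) models and keep the one with the lowest RSS.
for (k in 1:p) {
  combos <- combn(names(X), k, simplify = FALSE)
  fits <- lapply(combos, function(v) lm(reformulate(v, response = "y"), data = dat))
  models[[k + 1]] <- fits[[which.min(sapply(fits, rss))]]
}

# Step 3: compare M_0, ..., M_p with a criterion that penalizes model size (BIC here).
bic <- sapply(models, BIC)
best_size <- which.min(bic) - 1  # number of predictors in the chosen model
best_size
names(coef(models[[best_size + 1]]))
```

Because every subset is fitted, the loop runs $2^p$ regressions in total, which is why this brute-force version is only practical for small $p$.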

In [None]:
p_load("leaps")

# Step 2: exhaustive search for the best model of each size (up to 14 predictors)
subset <- regsubsets(lnw_2016 ~ ., method = "exhaustive", nvmax = 14, data = nlsy)

summary(subset)

In [None]:
# Step 3: model size preferred by each criterion
best_subset <- summary(subset)


data.frame(
  Adj.R2 = which.max(best_subset$adjr2),
  CP = which.min(best_subset$cp),
  BIC = which.min(best_subset$bic)
)
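
For reference, the three criteria compared in the table above can be written as follows for a least-squares model with $d$ predictors. These are common textbook forms, where $\hat{\sigma}^2$ is an estimate of the error variance (typically from the full model) and $TSS$ is the total sum of squares; the values reported by `regsubsets` may use a different scaling, so treat the formulas as a guide to what each criterion penalizes rather than as the exact quantities computed by the package.

$$
\begin{aligned}
C_p &= \frac{1}{n}\left(RSS + 2\, d\, \hat{\sigma}^2\right) \\
BIC &= \frac{1}{n}\left(RSS + \log(n)\, d\, \hat{\sigma}^2\right) \\
\text{Adjusted } R^2 &= 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}
\end{aligned}
$$

Since $\log(n) > 2$ whenever $n > 7$, BIC penalizes additional predictors more heavily than $C_p$ and tends to select smaller models.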

In [None]:
coef(subset,6)

##  Stepwise Selection

-   For computational reasons, best subset selection cannot be applied
    when $p$ is very large.

-   Best subset selection may also suffer from statistical problems when
    $p$ is large.

-   An enormous search space can lead to overfitting and high variance
    of the coefficient estimates.

-   For both of these reasons, stepwise methods, which explore a far
    more restricted set of models, are attractive alternatives to best
    subset selection.



###  Forward Stepwise Selection

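-   Start with the null model $M_0$, which contains no predictors.

-   Fit all $p$ models with one predictor and keep the best one, $M_1$.

-   Add one predictor at a time, choosing at each step the one that
    improves the fit the most. Predictors already in the model are never removed.

-   Of the resulting $p+1$ models $M_0,\dots,M_p$, choose the one with
    the smallest prediction error, estimated by cross-validation.

-   In total we fit $1 + p(p+1)/2$ models, whereas best subset selection fits $2^p$.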
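
To get a feel for the difference between the two counts, the short sketch below evaluates both formulas. The helper names are hypothetical, and $p = 14$ assumes that each of the 14 non-outcome columns kept in `nlsy` enters as a single predictor.

```r
# Number of candidate models fitted by each strategy for p predictors.
n_best_subset <- function(p) 2^p
n_stepwise    <- function(p) 1 + p * (p + 1) / 2

p <- 14  # assumed: 15 columns kept in nlsy, one of which is the outcome
n_best_subset(p)  # 16384
n_stepwise(p)     # 106
```

The gap widens quickly: doubling $p$ squares the best subset count, while the stepwise count grows only quadratically.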

### Backward Stepwise Selection

-   Same idea, but start with the full model containing all $p$
    predictors and work backwards, removing one predictor at a time.


### Forward Selection

-   Computational advantage over best subset selection is clear.

-   It is not guaranteed to find the best possible model out of all
    $2^p$ models containing subsets of the p predictors.

-   Drawback: once a predictor enters, it cannot leave.
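
The greedy nature of forward selection is easier to see in a hand-written loop. The sketch below uses simulated data and base R only; the variable names, the data-generating process, and the helper `rss()` are assumptions for illustration and are not how `regsubsets` is implemented internally.

```r
# Minimal forward stepwise sketch on simulated data (hypothetical example).
set.seed(2)
n <- 200
p <- 5
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
y <- 1 + 2 * X$x2 + X$x4 + rnorm(n)  # assumed truth: only x2 and x4 matter
dat <- cbind(y = y, X)

# RSS of the linear model that uses the predictors in `vars` (intercept only if empty).
rss <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else vars
  sum(resid(lm(reformulate(rhs, response = "y"), data = dat))^2)
}

selected  <- character(0)
remaining <- names(X)
path <- list(selected)  # path[[k + 1]] holds the predictors in M_k

while (length(remaining) > 0) {
  # Try adding each remaining predictor; keep the one that lowers RSS the most.
  cand_rss <- sapply(remaining, function(v) rss(c(selected, v)))
  best <- remaining[which.min(cand_rss)]
  selected  <- c(selected, best)
  remaining <- setdiff(remaining, best)
  path[[length(path) + 1]] <- selected
}

path
```

Each model on the path contains the previous one, which is exactly the restriction that makes the search cheap and also the reason it can miss the best subset.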

In [None]:
forward<-regsubsets(lnw_2016 ~ ., method="forward", nvmax=14,data = nlsy)

summary(forward)

In [None]:
# Step 3: model size preferred by each criterion
best_forward <- summary(forward)


data.frame(
  Adj.R2 = which.max(best_forward$adjr2),
  CP = which.min(best_forward$cp),
  BIC = which.min(best_forward$bic)
)
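
The table above relies on adjusted $R^2$, $C_p$, and BIC, but step 3 can also use cross-validated prediction error. The sketch below shows one way to do this with 5-fold cross-validation on simulated data. The data, the number of folds, and the fixed entry order `entry_order` are assumptions for illustration; a more careful version would rerun the selection inside every training fold instead of reusing one path.

```r
# Minimal k-fold cross-validation sketch for choosing the model size (hypothetical data).
set.seed(4)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(X) <- paste0("x", 1:5)
y <- 1 + 2 * X$x2 + X$x4 + rnorm(n)  # assumed truth: only x2 and x4 matter
dat <- cbind(y = y, X)

# Assumed forward path: the order in which predictors entered the model.
entry_order <- c("x2", "x4", "x1", "x3", "x5")

K <- 5
fold <- sample(rep(1:K, length.out = n))  # random fold label for each observation

cv_mse <- sapply(0:length(entry_order), function(k) {
  rhs <- if (k == 0) "1" else entry_order[1:k]
  f <- reformulate(rhs, response = "y")
  fold_mse <- sapply(1:K, function(j) {
    fit  <- lm(f, data = dat[fold != j, ])              # train on the other folds
    pred <- predict(fit, newdata = dat[fold == j, ])    # predict the held-out fold
    mean((dat$y[fold == j] - pred)^2)
  })
  mean(fold_mse)
})
names(cv_mse) <- 0:length(entry_order)

cv_mse
which.min(cv_mse)  # model size with the smallest estimated prediction error
```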

In [None]:
coef(forward, 6)

In [None]:
coef(subset,6)

### Backward Selection

-   Like forward stepwise selection, the backward selection approach
    searches through only $1 + p(p + 1)/2$ models, so it can be applied
    when $p$ is too large for best subset selection.

-   However, unlike forward stepwise selection, it begins with the model
    containing all p predictors, and then iteratively removes the least
    useful predictor, one-at-a-time.

-   Like forward stepwise selection, backward stepwise selection is not
    guaranteed to yield the best model containing a subset of the p
    predictors.

-   Backward selection requires that the number of observations
    (samples) $n$ is larger than the number of variables $p$ (so that
    the full model can be fit).

-   In contrast, forward stepwise can be used even when $n < p$, and so
    is the only viable subset method when p is very large.
    
Not much has to change to implement backward selection: with `regsubsets`, we only need to set `method="backward"`.
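
For comparison with the forward procedure, backward elimination can be written as a short base R loop. This is a sketch on simulated data with hypothetical variable names, and it uses RSS to decide which predictor to drop; it is not the internal algorithm of `regsubsets`.

```r
# Minimal backward stepwise sketch on simulated data (hypothetical example).
set.seed(3)
n <- 200
p <- 5
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
y <- 1 + 2 * X$x2 + X$x4 + rnorm(n)  # assumed truth: only x2 and x4 matter
dat <- cbind(y = y, X)

# RSS of the linear model that uses the predictors in `vars` (intercept only if empty).
rss <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else vars
  sum(resid(lm(reformulate(rhs, response = "y"), data = dat))^2)
}

selected <- names(X)    # start from the full model M_p
path <- list(selected)  # path[[1]] is M_p, path[[p + 1]] is M_0

while (length(selected) > 0) {
  # Try dropping each predictor; remove the one whose removal raises RSS the least.
  cand_rss <- sapply(selected, function(v) rss(setdiff(selected, v)))
  worst <- selected[which.min(cand_rss)]
  selected <- setdiff(selected, worst)
  path[[length(path) + 1]] <- selected
}

path
```

The first fit uses all $p$ predictors at once, which is why the procedure needs $n > p$.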

In [None]:
backwards<-regsubsets(lnw_2016 ~ ., method="backward", nvmax=14,data = nlsy)

summary(backwards)

In [None]:
# Step 3: model size preferred by each criterion
best_backwards <- summary(backwards)


data.frame(
  Adj.R2 = which.max(best_backwards$adjr2),
  CP = which.min(best_backwards$cp),
  BIC = which.min(best_backwards$bic)
)

In [None]:
coef(backwards, 6)

In [None]:
coef(forward, 6)

In [None]:
coef(subset,6)