<div >
    <img src = "../banner/banner_ML_UNLP_1900_200.png" />
</div>

<a target="_blank" href="https://colab.research.google.com/github/ignaciomsarmiento/ML_UNLP_Lectures/blob/main/Week02/Notebook_SS02_ModelSelection.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



# Introduction

The concept behind resampling techniques for evaluating model performance is straightforward: a portion of the data is used to train the model, while the remaining data is used to assess the model's accuracy. 

This process is repeated several times with different subsets of the data, and the results are averaged and summarized. The primary differences between resampling techniques lie in the method by which the subsets of data are selected. 
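The repeated train-and-evaluate loop described above can be sketched with 5-fold cross-validation. This is only an illustration on simulated data: the names `X_sim`, `y_sim`, and `kf_sim` and the data-generating process are assumptions made for this example, not part of the wage data used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Simulated data (hypothetical): y depends linearly on two predictors plus noise
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(200, 2))
y_sim = 1.0 + 2.0 * X_sim[:, 0] - 1.0 * X_sim[:, 1] + rng.normal(scale=0.5, size=200)

kf_sim = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_idx, test_idx in kf_sim.split(X_sim):
    # Train on four folds, then evaluate on the held-out fold
    fit = LinearRegression().fit(X_sim[train_idx], y_sim[train_idx])
    resid = y_sim[test_idx] - fit.predict(X_sim[test_idx])
    fold_mse.append(np.mean(resid ** 2))

# Average the per-fold errors into a single performance estimate
cv_mse = float(np.mean(fold_mse))
print("MSE per fold:", np.round(fold_mse, 3))
print("Cross-validated MSE:", round(cv_mse, 3))
```

Each observation is used for testing exactly once, so the averaged MSE estimates out-of-sample error without setting aside a separate test set.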

In the following sections, we use one such technique, k-fold cross-validation, to compare the models produced by different subset selection methods.




# Predicting Wages

Our objective today is to construct a model of individual wages:

$$
w = f(X) + u 
$$

where $w$ is the wage and $X$ is a matrix of potential explanatory variables (predictors). In this problem set, we will focus on a linear model of the form

\begin{align}
 \ln(w) & = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + u
\end{align}

where $\ln(w)$ is the natural logarithm of the wage.


Let's load the modules:

In [1]:
import numpy as np
import pandas as pd

Next, we load the data set, which is a sample from the NLSY97. The NLSY97 is a nationally representative sample of 8,984 men and women born between 1980 and 1984 and living in the United States at the time of the initial survey in 1997. Participants were ages 12 to 16 as of December 31, 1996. Interviews were conducted annually from 1997 to 2011 and biennially since then.

In [2]:
nlsy=pd.read_csv('https://raw.githubusercontent.com/ignaciomsarmiento/datasets/main/nlsy97.csv')

Which predictors are available?

In [None]:
nlsy.head()

Let's keep a subset of these predictors and drop the rows with missing values:

In [4]:
nlsy_subset = nlsy[["lnw_2016", "educ", "exp", "afqt", "mom_educ", "dad_educ"]]
nlsy_subset = nlsy_subset.dropna()

In [None]:
# Confirm that no missing values remain in any column
nlsy_subset.isna().sum()

In [6]:
X = nlsy_subset[["educ", "exp", "afqt", "mom_educ", "dad_educ"]]

y = nlsy_subset[["lnw_2016"]]

In [None]:
X

## Best Subset Selection


1.  Let $M_0$ denote the null model, which contains no predictors. This
    model simply predicts the sample mean for each observation.

2.  For $k=1,2,\dots,p$:

    1.  Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.

    2.  Pick the best among these $\binom{p}{k}$ models and call it
        $M_k$. Here *best* means the smallest cross-validated $MSE$.

3.  Select a single best model from among $M_0,\dots, M_p$, again
    using the cross-validated $MSE$.
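Step 3 compares the null model $M_0$ against the models with predictors. A minimal sketch of how $M_0$ could be scored on the same footing uses scikit-learn's `DummyRegressor`, which predicts the training-fold mean. The simulated data and the helper name `cv_mse` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data (hypothetical): only the first column is related to y
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(300, 3))
y_sim = 0.5 + 1.5 * X_sim[:, 0] + rng.normal(size=300)

kf_sim = KFold(n_splits=5, shuffle=True, random_state=1)

def cv_mse(estimator, X_mat):
    # scikit-learn reports negative MSE, so flip the sign before averaging
    scores = cross_val_score(estimator, X_mat, y_sim, cv=kf_sim,
                             scoring="neg_mean_squared_error")
    return float(-scores.mean())

# M_0: ignores the predictors and predicts the mean of the training fold
mse_null = cv_mse(DummyRegressor(strategy="mean"), X_sim)
# A one-predictor model that uses the relevant column
mse_one = cv_mse(LinearRegression(), X_sim[:, [0]])

print("Null model CV MSE:   ", round(mse_null, 3))
print("One-predictor CV MSE:", round(mse_one, 3))
```

If no model with predictors beats the null model's cross-validated MSE, the predictors add no out-of-sample value.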

In [8]:
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)


In [9]:
from sklearn.linear_model import LinearRegression

# Model
model = LinearRegression()

In [10]:
def processSubset(feature_set):
    # Fit the model on feature_set and compute its cross-validated MSE
    scores = cross_val_score(model, X[list(feature_set)], y, cv=kf, scoring='neg_mean_squared_error')
    # Results
    mse_scores = -scores  # Convert to positive values
    return {"model":feature_set, "MSE":mse_scores.mean()}

In [11]:
import itertools
import time

# Try every pair of predictors (k = 2) and record its cross-validated MSE
results = []

for combo in itertools.combinations(X.columns, 2):
    results.append(processSubset(combo))

In [None]:
models = pd.DataFrame(results)

models

In [None]:
# Pick the row with the lowest cross-validated MSE
best_model = models.loc[models['MSE'].idxmin()]
best_model

In [14]:
def getBest(k):
    
    tic = time.time()
    
    results = []
    
    for combo in itertools.combinations(X.columns, k):
        results.append(processSubset(combo))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest MSE
    best_model = models.loc[models['MSE'].idxmin()]
    
    toc = time.time()
    print("Processed", models.shape[0], "models on", k, "predictors in", (toc-tic), "seconds.")
    
    # Return the best model, along with some other useful information about the model
    return best_model

In [None]:
import time

# Exhaustive search over every subset size; this can be slow when p is large

models_best = pd.DataFrame(columns=["MSE", "model"])

tic = time.time()
# Cover every subset size from 1 up to p (all available predictors)
for i in range(1, len(X.columns) + 1):
    models_best.loc[i] = getBest(i)

toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")


Now we have one big DataFrame that contains the best models we've generated along with their MSE:

In [None]:
models_best

In [None]:
# Best subset selection considers 2^p candidate models in total
2 ** len(X.columns)

## Stepwise Selection

-   For computational reasons, best subset selection cannot be applied
    when $p$ is very large.

-   Best subset selection may also suffer from statistical problems when
    $p$ is large: an enormous search space can lead to overfitting and
    high variance of the coefficient estimates.

-   For both of these reasons, stepwise methods, which explore a far
    more restricted set of models, are attractive alternatives to best
    subset selection.



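### Forward Stepwise Selection

-   Start with the null model, which contains no predictors.

-   Fit all $p$ models with one predictor and keep the best one.

-   Add one predictor at a time, never removing one that is already in
    the model, until all $p$ predictors are included.

-   Of the resulting $p+1$ models, choose the one with the smallest
    prediction error, estimated by cross-validation.

-   This fits $1 + p(p+1)/2$ models in total, whereas best subset
    selection considers $2^p$.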
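The two model counts can be checked with a few lines of arithmetic. The function names `n_best_subset` and `n_stepwise` are hypothetical helpers introduced only for this comparison.

```python
from math import comb

def n_best_subset(p):
    # Sum over k = 0..p of C(p, k) models, which equals 2**p
    return sum(comb(p, k) for k in range(p + 1))

def n_stepwise(p):
    # Null model plus (p - k) candidate models at step k = 0..p-1,
    # which equals 1 + p(p + 1)/2
    return 1 + sum(p - k for k in range(p))

for p in (5, 10, 20):
    print(p, n_best_subset(p), n_stepwise(p))
# 5 32 16
# 10 1024 56
# 20 1048576 211
```

With the five predictors used in this notebook the difference is small (32 versus 16 models), but it grows very quickly with $p$.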

### Backward Stepwise Selection

-   Same idea, but start with the complete model and go backwards,
    removing one predictor at a time.


### Forward Selection

-   Computational advantage over best subset selection is clear.

-   It is not guaranteed to find the best possible model out of all
    $2^p$ models containing subsets of the p predictors.

-   Drawback: once a predictor enters, it cannot leave.
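As a point of comparison, scikit-learn ships a greedy selector, `SequentialFeatureSelector`, that follows the same add-one-at-a-time logic. The sketch below runs it on simulated data; the data-generating process and the choice of `n_features_to_select=2` are assumptions for illustration, and this is not the implementation developed in this notebook.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Simulated data (hypothetical): only columns 0 and 3 affect y
rng = np.random.default_rng(2)
X_sim = rng.normal(size=(400, 5))
y_sim = 2.0 * X_sim[:, 0] - 3.0 * X_sim[:, 3] + rng.normal(size=400)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,            # stop after two predictors have entered
    direction="forward",               # "backward" starts from the full model
    scoring="neg_mean_squared_error",  # same criterion as the manual version
    cv=KFold(n_splits=5, shuffle=True, random_state=2),
)
sfs.fit(X_sim, y_sim)

# Boolean mask of retained columns, converted to column positions
selected = np.flatnonzero(sfs.get_support())
print("Selected column indices:", selected)
```

Like the manual procedure, the selector never revisits a predictor once it has entered, so it inherits the same drawback.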

In [17]:
def forward(predictors):

    # Pull out predictors we still need to process
    remaining_predictors = [p for p in X.columns if p not in predictors]
    
    tic = time.time()
    
    results = []
    
    for p in remaining_predictors:
        results.append(processSubset(predictors+[p]))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest MSE
    best_model = models.loc[models['MSE'].idxmin()]
    
    toc = time.time()
    print("Processed ", models.shape[0], "models on", len(predictors)+1, "predictors in", (toc-tic), "seconds.")
    
    # Return the best model, along with some other useful information about the model
    return best_model

In [None]:
models_fwd = pd.DataFrame(columns=["MSE", "model"])

tic = time.time()
predictors = []

for i in range(1,len(X.columns)+1):    
    models_fwd.loc[i] = forward(predictors)
    predictors = models_fwd.loc[i]["model"]

toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")

In [None]:
models_fwd


### Backward Selection

-   Like forward stepwise selection, the backward selection approach
    searches through only $1 + p(p + 1)/2$ models

-   However, unlike forward stepwise selection, it begins with the model
    containing all p predictors, and then iteratively removes the least
    useful predictor, one-at-a-time.

-   Like forward stepwise selection, backward stepwise selection is not
    guaranteed to yield the best model containing a subset of the p
    predictors.

-   Backward selection requires that the number of observations
    (samples) $n$ is larger than the number of variables $p$ (so that
    the full model can be fit).

-   In contrast, forward stepwise can be used even when $n < p$, and so
    is the only viable subset method when p is very large.
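The requirement $n > p$ can be illustrated with a small simulation. With more predictors than observations, least squares has no unique solution; `LinearRegression` returns one of the many solutions, and it reproduces the training data exactly even when the outcome is pure noise. The dimensions and variable names below are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data (hypothetical): more predictors than observations
rng = np.random.default_rng(3)
n, p = 10, 20
X_sim = rng.normal(size=(n, p))
y_sim = rng.normal(size=n)  # pure noise, unrelated to X_sim

# Full model: all 20 predictors with only 10 observations
full = LinearRegression().fit(X_sim, y_sim)
train_mse = float(np.mean((y_sim - full.predict(X_sim)) ** 2))
print("Training MSE of the full model:", train_mse)

# A one-predictor model, the first step of forward selection, is well defined
single = LinearRegression().fit(X_sim[:, [0]], y_sim)
single_mse = float(np.mean((y_sim - single.predict(X_sim[:, [0]])) ** 2))
print("Training MSE with one predictor:", single_mse)
```

The full model's training error is numerically zero, so it gives backward selection no meaningful starting point, while forward selection can still proceed from small models.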
    
Not much has to change to implement backward selection: at each step we score every model that drops one of the current predictors and keep the best one.

In [20]:
def backward(predictors):
    
    tic = time.time()
    
    results = []
    
    for combo in itertools.combinations(predictors, len(predictors)-1):
        results.append(processSubset(combo))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest MSE
    best_model = models.loc[models['MSE'].idxmin()]
    
    toc = time.time()
    print("Processed ", models.shape[0], "models on", len(predictors)-1, "predictors in", (toc-tic), "seconds.")
    
    # Return the best model, along with some other useful information about the model
    return best_model

In [None]:
models_bwd = pd.DataFrame(columns=["MSE", "model"], index = range(1,len(X.columns)))

tic = time.time()
predictors = X.columns

while len(predictors) > 1:
    models_bwd.loc[len(predictors)-1] = backward(predictors)
    predictors = models_bwd.loc[len(predictors)-1]["model"]

toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")

In [None]:
models_bwd

In [None]:
models_fwd

In [None]:
models_best