<a target="_blank" href="https://colab.research.google.com/github/ignaciomsarmiento/ML_UNLP_Lectures/blob/main/Week01/Notebook_SS01_CV.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



# Introduction

The concept behind resampling techniques for evaluating model performance is straightforward: a portion of the data is used to train the model, while the remaining data is used to assess the model's accuracy. 

This process is repeated several times with different subsets of the data, and the results are averaged and summarized. The primary differences between resampling techniques lie in the method by which the subsets of data are selected. 
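To make the point about subset selection concrete, here is a minimal sketch that prints the train/test indices produced by three scikit-learn splitters. The six-observation array `toy` and the dictionary names are assumptions made only for illustration; they are not part of the wage data used later.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

# Six hypothetical observations (illustration only)
toy = np.arange(6).reshape(-1, 1)

# Each splitter selects the train/test subsets in a different way
splitters = {
    "Validation set (single random split)": ShuffleSplit(n_splits=1, test_size=2, random_state=0),
    "3-fold cross-validation": KFold(n_splits=3),
    "Leave-one-out": LeaveOneOut(),
}

n_splits_by_method = {}
for name, splitter in splitters.items():
    splits = list(splitter.split(toy))
    n_splits_by_method[name] = len(splits)
    print(name)
    for train_idx, test_idx in splits:
        print(f"  train={train_idx.tolist()}  test={test_idx.tolist()}")
```

The model-fitting step is the same in every case; only the way the indices are drawn changes.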

In the following sections, we will discuss the main types of resampling techniques.




# Predicting Wages

Our objective today is to construct a model of individual wages:

$$
w = f(X) + u 
$$

where $w$ is the wage, and $X$ is a matrix that includes potential explanatory variables/predictors. In this problem set, we will focus on a linear model of the form

\begin{align}
 ln(w) & = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p  + u 
\end{align}

where $ln(w)$ is the logarithm of the wage.

To illustrate, I'm going to use a sample of the NLSY97. The NLSY97 is a nationally representative sample of 8,984 men and women born during the years 1980 through 1984 and living in the United States at the time of the initial survey in 1997. Participants were ages 12 to 16 as of December 31, 1996. Interviews were conducted annually from 1997 to 2011 and biennially since then.

Let's load the modules and the data set:

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
nlsy=pd.read_csv('https://raw.githubusercontent.com/ignaciomsarmiento/datasets/main/nlsy97.csv')

What predictors do I have available?

In [None]:
nlsy.head()

Let's keep a few of these predictors:

In [4]:
X = nlsy[[ "educ", "black", "hispanic", "other", "exp", "afqt", "mom_educ", "dad_educ"]]

y=nlsy[["lnw_2016"]]

The descriptive statistics?

In [None]:
nlsy.describe()

# Validation Set  Approach

The first method to evaluate out-of-sample performance is the validation set approach. In this approach, a fixed portion of the data is designated as the validation set, and the model is trained on the remaining data. The model's performance is then evaluated on the validation set. These partitions are usually called:

   - Training sample: to build/estimate/train the model
   - Testing (validation, hold-out) sample:  to evaluate its performance 

Partitions can be of any size. Usually, 70%-30% or 80%-20% are used. Graphically, a 70%-30% partition looks like:     
    
<div>
<img src="figs_notebook/30-70.png" width="500"/>
</div>

Let's implement this in `Python`.

We begin by randomly assigning 70% of the observations to the training data set and the remaining 30% to the testing data set. The function `train_test_split` does this in one step, and `random_state` fixes the seed so that the partition is reproducible.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
                                        X,
                                        y,
                                        test_size=0.3,
                                        train_size=0.7,
                                        random_state = 123
                                    )

We can check the partition:

In [None]:
X_train.head()
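Beyond looking at the first rows, one can check the sizes of the two partitions and confirm that no observation appears in both. The sketch below does this on simulated data; the names `X_sim`, `y_sim`, `share_train`, `share_test`, and `overlap` are hypothetical and stand in for the objects created above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Simulated stand-in for the predictors and outcome (an assumption)
rng = np.random.default_rng(2)
X_sim = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.normal(size=1000)})
y_sim = pd.Series(rng.normal(size=1000), name="y")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_sim, y_sim, test_size=0.3, train_size=0.7, random_state=123
)

# Shares of observations in each partition
share_train = len(X_tr) / len(X_sim)
share_test = len(X_te) / len(X_sim)

# Row labels that appear in both partitions (should be none)
overlap = X_tr.index.intersection(X_te.index)

print(f"train share: {share_train:.2f}, test share: {share_test:.2f}")
print(f"observations in both partitions: {len(overlap)}")
```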

## Predicting wages

With these partitions in place, we can start building our predictive models. We begin by using a simple model with no covariates, just a constant:

In [None]:
X0 = np.ones((len(y_train), 1))
model1=  LinearRegression().fit(X0,y_train)
model1.intercept_

In this case, the prediction for the log wage is the training-sample average:

$$
\hat{y}=\hat{\beta}_0=\frac{1}{n}\sum_{i=1}^{n} y_i=\bar{y}
$$

where $n$ is the number of observations in the training sample.

In [None]:
y_train.mean()
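The equality between the fitted constant and the sample mean can be verified numerically. This is a small sketch on simulated log wages; `y_sim`, `ones`, and `const_model` are hypothetical names used only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated log wages (an assumption, not the NLSY97 data)
rng = np.random.default_rng(1)
y_sim = rng.normal(loc=2.5, scale=0.6, size=50)

# Regress on a column of ones: the only thing to estimate is the constant
ones = np.ones((len(y_sim), 1))
const_model = LinearRegression().fit(ones, y_sim)

print(f"fitted intercept: {const_model.intercept_:.6f}")
print(f"sample mean:      {y_sim.mean():.6f}")
```

Because `LinearRegression` already fits an intercept by default, the column of ones carries no extra information and the intercept absorbs the mean.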

Since we are concerned with predicting well out-of-sample, we need to evaluate our model in the testing data. For that, we use the coefficient estimated in the training data and use it as a predictor in the testing data:

In [10]:
#prediction on new data
X0_test = np.ones((len(y_test), 1))
y_hat_model1 = model1.predict(X0_test)

Then we can calculate the out-of-sample performance using the MSE:

$$
test\,MSE=E((y-\hat{y})^2)
$$ 
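In practice the expectation is replaced by its sample analog, the average squared prediction error over the test observations. A minimal sketch with four made-up values (the arrays `y_true` and `y_pred` are purely illustrative) shows that the hand computation and `mean_squared_error` agree:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up outcomes and predictions (illustration only)
y_true = np.array([2.0, 3.0, 2.5, 3.5])
y_pred = np.array([2.2, 2.7, 2.5, 3.9])

# Sample analog of E[(y - y_hat)^2]: the mean of the squared errors
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(f"manual:  {mse_manual:.4f}")
print(f"sklearn: {mse_sklearn:.4f}")
```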


In [None]:
from sklearn.metrics import mean_squared_error

# Calculate Mean Squared Error
mse1 = mean_squared_error(y_test, y_hat_model1)

print(f'Mean Squared Error: {mse1}')

This is quite a naive model that uses the sample average as a prediction. 

Let's see if we can improve on this model, that is, reduce its prediction error.

To improve our prediction, we can start adding explanatory variables. Let's begin by adding only one variable, education (`educ`):

In [None]:

model2=  LinearRegression().fit(X_train[['educ']],y_train)
model2.coef_

In [13]:
#prediction on new data
y_hat_model2 = model2.predict(X_test[['educ']])

and evaluate the out-of-sample performance:

In [None]:
# Calculate Mean Squared Error
mse2 = mean_squared_error(y_test, y_hat_model2)

print(f'Mean Squared Error: {mse2}')

There's a clear reduction in MSE. Let's add complexity by adding more variables:


In [None]:
model3=  LinearRegression().fit(X_train[['educ','exp','afqt','mom_educ','dad_educ']],y_train)
model3.coef_


The performance:

In [None]:
#prediction on new data
y_hat_model3 = model3.predict(X_test[['educ','exp','afqt','mom_educ','dad_educ']])

# Calculate Mean Squared Error
mse3 = mean_squared_error(y_test, y_hat_model3)

print(f'Mean Squared Error: {mse3}')

In this case, the MSE keeps improving. Is there a limit to this improvement? Can we keep adding features and complexity? What about an extremely complex model that includes polynomials and interactions?

In [None]:
from sklearn.preprocessing import PolynomialFeatures

?PolynomialFeatures

In [18]:

poly = PolynomialFeatures(degree=2)

X_train_poly = poly.fit_transform(X_train)


model4 =  LinearRegression().fit(X_train_poly,y_train)


The performance:

In [None]:
#prediction on new data
# Reuse the transformer fitted on the training data (transform, not fit_transform)
X_test_poly = poly.transform(X_test)
y_hat_model4 = model4.predict(X_test_poly)

# Calculate Mean Squared Error
mse4 = mean_squared_error(y_test, y_hat_model4)

print(f'Mean Squared Error: {mse4}')

It is clear that as complexity increases, performance improves until a point where too much complexity results in inferior performance. 

This is an illustration of the bias-variance trade-off.
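The pattern can be reproduced in a controlled setting. The sketch below simulates data from a quadratic relationship (an assumption chosen for illustration, unrelated to the wage data) and fits polynomials of increasing degree, recording both the training and the test MSE. The names `x_sim`, `y_sim`, `degrees`, `train_mse`, and `test_mse` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated data: the true relationship is quadratic plus noise
rng = np.random.default_rng(0)
x_sim = rng.uniform(-2, 2, size=(120, 1))
y_sim = 1.0 + 0.5 * x_sim[:, 0] - 0.7 * x_sim[:, 0] ** 2 + rng.normal(scale=0.5, size=120)

x_tr, x_te, y_tr, y_te = train_test_split(x_sim, y_sim, test_size=0.3, random_state=123)

degrees = [1, 2, 4, 8]
train_mse, test_mse = [], []
for d in degrees:
    # The pipeline fits the polynomial expansion on the training data only
    pipe = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    pipe.fit(x_tr, y_tr)
    train_mse.append(mean_squared_error(y_tr, pipe.predict(x_tr)))
    test_mse.append(mean_squared_error(y_te, pipe.predict(x_te)))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

With nested least-squares specifications the training MSE should not increase as terms are added, whereas the test MSE typically stops improving once the extra terms mostly fit noise.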

Although the validation set approach is quite nice, there are at least two problems with it:
  
  1. The first one is that, given an original data set, if part of it is left aside to test the model, less data is left for estimation (leading to less efficiency).
  2. A second problem is deciding which data will be used to train the model and which to test it, since the estimated test MSE can change from one random split to another:
  
  <div>
<img src="figs_notebook/fig52.png" width="800"/>
</div>
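The second problem can be seen by repeating the 70%-30% split with different seeds and recording the test MSE each time. This is a sketch on simulated data; `X_sim`, `y_sim`, and `split_mses` are hypothetical names, and the data-generating process is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data (an assumption for illustration)
rng = np.random.default_rng(4)
X_sim = rng.normal(size=(150, 2))
y_sim = 1.0 + X_sim @ np.array([0.5, -0.3]) + rng.normal(scale=0.5, size=150)

split_mses = []
for seed in range(20):
    # Same model, same data, different random partition
    X_tr, X_te, y_tr, y_te = train_test_split(X_sim, y_sim, test_size=0.3, random_state=seed)
    fitted = LinearRegression().fit(X_tr, y_tr)
    split_mses.append(mean_squared_error(y_te, fitted.predict(X_te)))

print(f"lowest test MSE:  {min(split_mses):.3f}")
print(f"highest test MSE: {max(split_mses):.3f}")
```

The spread between the lowest and highest values reflects only the choice of partition, since neither the model nor the data changes across iterations.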




# Leave-One-Out Cross-Validation (LOOCV) 

This method is similar to the Validation Set Approach, but it tries to address the latter's disadvantages. Leave-One-Out Cross-Validation (LOOCV) is a resampling technique for evaluating model performance. Each sample in the data is used once as the validation set, and the model is trained on the remaining samples. 

Graphically the LOOCV looks like this: 


<div>
<img src="figs_notebook/1.png" width="500"/>
</div>

<div>
<img src="figs_notebook/2.png" width="500"/>
</div>


<div>
<img src="figs_notebook/3.png" width="500"/>
</div>

.

.

.

.

.

.

.

.

<div>
<img src="figs_notebook/20.png" width="500"/>
</div>


LOOCV is computationally expensive, as a separate model has to be fit `n` times, where `n` is the number of samples in the data. However, LOOCV is more thorough in its model evaluation, as each sample is used as the validation set exactly once, giving a more comprehensive assessment of the model's performance.

The LOOCV estimate for the test MSE is

\begin{align}
LOOCV(n) &= \frac{1}{n}\sum_{i=1}^{n} MSE_{-i} \\ 
      &= \frac{1}{n}\sum_{i=1}^{n}(y_i -\hat{y}_{-i})^2
\end{align}

where $-i$ indicates that the model used to obtain the prediction was trained on all observations except $i$.
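The formula translates almost line by line into a loop: drop observation $i$, fit on the rest, predict $i$, and average the squared errors. The sketch below does this on a small simulated sample (an assumption; `X_sim`, `y_sim`, and the other names are hypothetical) and compares the result with scikit-learn's `LeaveOneOut` splitter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small simulated sample (illustration only)
rng = np.random.default_rng(3)
n = 30
X_sim = rng.normal(size=(n, 1))
y_sim = 2.0 + 0.8 * X_sim[:, 0] + rng.normal(scale=0.5, size=n)

# Manual LOOCV: one fit per left-out observation
sq_errors = []
for i in range(n):
    keep = np.arange(n) != i                      # all observations except i
    fitted = LinearRegression().fit(X_sim[keep], y_sim[keep])
    y_hat_i = fitted.predict(X_sim[[i]])[0]       # prediction for observation i
    sq_errors.append((y_sim[i] - y_hat_i) ** 2)
loocv_manual = np.mean(sq_errors)

# Same quantity using the LeaveOneOut splitter
loo_scores = cross_val_score(
    LinearRegression(), X_sim, y_sim, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
loocv_sklearn = -loo_scores.mean()

print(f"manual LOOCV:  {loocv_manual:.4f}")
print(f"sklearn LOOCV: {loocv_sklearn:.4f}")
```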

LOOCV is particularly useful in cases where the number of samples in the data is small, and the risk of overfitting is high.


LOOCV is a special case of k-fold cross-validation, where k is equal to the number of samples in the data. Given that it's a particular case, we will implement k-fold cross-validation instead.
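The equivalence can be checked directly on the split indices. In this sketch, `toy` is a hypothetical array of eight observations, and `KFold` is used without shuffling so the folds come out in the original order.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Eight hypothetical observations (illustration only)
toy = np.arange(8).reshape(-1, 1)
n = len(toy)

# k-fold with k = n versus leave-one-out
kfold_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=n).split(toy)]
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(toy)]

same_splits = kfold_splits == loo_splits
print(f"identical train/test splits: {same_splits}")
```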





# K-Fold Cross-Validation

K-Fold Cross-Validation is a widely used resampling technique for evaluating model performance. It involves dividing the data into k folds of approximately equal size, where k is a user-defined constant. The model is then fit k times, with each fold used once as the validation set and the remaining k-1 folds used as the training set. This process results in k estimates of the model's performance, which can then be averaged to obtain an overall estimate. Graphically it looks like this:




<div>
<img src="figs_notebook/fold.png" width="500"/>
</div>

K-Fold Cross-Validation is a trade-off between the computational efficiency of the validation set approach and the thoroughness of LOOCV. On the one hand, K-Fold Cross-Validation is more computationally efficient than LOOCV, as the model is fit k times instead of n times, where n is the number of samples in the data. On the other hand, K-Fold Cross-Validation is less thorough than LOOCV: each model is trained on only about (k-1)/k of the data rather than on n-1 observations, and the resulting estimate depends on how the observations happen to be assigned to folds. However, K-Fold Cross-Validation is widely used in practice. It provides a good balance between computational efficiency and thoroughness while allowing the user to control the number of times the model is fit.

K-Fold Cross-Validation provides a more robust evaluation of the model's performance than the validation set approach. In the validation set approach, a fixed portion of the data is used as the validation set, which can result in a misleading estimate of the model's performance if the validation set is not representative of the data. In contrast, K-Fold Cross-Validation ensures that each sample is used in the validation set exactly once, providing a more comprehensive assessment of the model's performance.

To sum up, to implement K-Fold Cross-Validation, we need to:

- Split the data into $k$ folds of sizes $n_1,\dots,n_k$, with $n=\sum_{j=1}^k n_j$

- Fit the model leaving out fold $j$ and predict each observation $i$ in that fold $\rightarrow$ $\hat{y}_{i,-j}$
  
- Cycle through all $k$ folds
 
- The CV(k) estimate for the test MSE is


\begin{align}
CV_{(k)} &= \frac{1}{k}\sum_{j=1}^k MSE_j \\
         &= \frac{1}{k}\sum_{j=1}^k \frac{1}{n_j}\sum_{i \in \text{fold } j} (y_i-\hat{y}_{i,-j})^{2}
\end{align}
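The steps above can be sketched as an explicit loop over folds. The example uses simulated data (an assumption; `X_sim`, `y_sim`, `kf_sim`, and `fold_mse` are hypothetical names) and checks the result against `cross_val_score` with the same splitter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Simulated data (illustration only)
rng = np.random.default_rng(5)
X_sim = rng.normal(size=(100, 3))
y_sim = 1.0 + X_sim @ np.array([0.4, 0.0, -0.2]) + rng.normal(scale=0.5, size=100)

# Step 1: split the data into k = 5 folds
kf_sim = KFold(n_splits=5, shuffle=True, random_state=42)

fold_mse = []
for train_idx, val_idx in kf_sim.split(X_sim):
    # Step 2: fit the model leaving out the current fold
    fitted = LinearRegression().fit(X_sim[train_idx], y_sim[train_idx])
    # Predict the left-out fold and store its MSE
    y_hat_val = fitted.predict(X_sim[val_idx])
    fold_mse.append(mean_squared_error(y_sim[val_idx], y_hat_val))
    # Step 3: the loop cycles through all k folds

# Step 4: average the k fold-level MSEs
cv_manual = np.mean(fold_mse)

# Same estimate through cross_val_score
cv_sklearn = -cross_val_score(
    LinearRegression(), X_sim, y_sim, cv=kf_sim, scoring="neg_mean_squared_error"
).mean()

print(f"manual CV(5):  {cv_manual:.4f}")
print(f"sklearn CV(5): {cv_sklearn:.4f}")
```

Fixing `random_state` in `KFold` is what makes the two computations use the same folds.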




<div>
<img src="figs_notebook/fig54.png" width="800"/>
</div>




In [20]:
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)



In [21]:
# Model
model = LinearRegression()


## Calculating the MSE


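Finally, we need to calculate the CV(k) estimate for the test MSE. We know that it takes the form:

\begin{align}
CV_{(k)} &= \frac{1}{k}\sum_{j=1}^k MSE_j \\
         &= \frac{1}{k}\sum_{j=1}^k \frac{1}{n_j}\sum_{i \in \text{fold } j} (y_i-\hat{y}_{i,-j})^{2}
\end{align}

where $n_j$ is the number of observations in fold $j$ and $\hat{y}_{i,-j}$ is the prediction for observation $i$ from the model trained without fold $j$.

To implement this formula, we first need to calculate the MSE for each fold: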

In [None]:
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')

# Results
mse_scores = -scores  # Convert to positive values
print(f'MSE per fold: {mse_scores}')
print(f'Mean MSE: {mse_scores.mean()}')

In [None]:
scores
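The same call can be repeated for several specifications to compare them on a common set of folds, much as the validation set was used earlier to compare models. The following is a sketch on simulated data; the column names `x1`, `x2`, `x3` and the `candidates` dictionary are hypothetical, and `DummyRegressor` stands in for the constant-only model.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data: y depends on x1 and x2 but not on x3 (an assumption)
rng = np.random.default_rng(7)
n = 300
df_sim = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
y_sim = 2.0 + 1.0 * df_sim["x1"] + 0.5 * df_sim["x2"] + rng.normal(scale=0.5, size=n)

kf_sim = KFold(n_splits=5, shuffle=True, random_state=42)

# Each candidate: (estimator, columns passed to it)
candidates = {
    "constant only": (DummyRegressor(strategy="mean"), ["x1"]),  # features are ignored
    "x1": (LinearRegression(), ["x1"]),
    "x1 + x2 + x3": (LinearRegression(), ["x1", "x2", "x3"]),
}

cv_mse = {}
for name, (estimator, cols) in candidates.items():
    fold_scores = cross_val_score(
        estimator, df_sim[cols], y_sim, cv=kf_sim, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -fold_scores.mean()

for name, value in cv_mse.items():
    print(f"{name:15s} CV MSE = {value:.3f}")
```

Using one `KFold` object with a fixed seed means every specification is evaluated on exactly the same partitions.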


# If we have enough data


In some cases, where there's enough data, researchers may use both K-Fold Cross-Validation and the validation set approach in combination to evaluate the performance of a machine learning model. This can be useful when a researcher wants to obtain a more robust evaluation of the model's performance while maintaining computational efficiency.

The following figure shows the strategy followed by Kleinberg et al. (2017) in their paper "Human decisions and machine predictions":



<div>
<img src="figs_notebook/human_decisions.png" width="500"/>
</div>


This strategy prevents the machine learning algorithm from appearing to do well simply because it is being evaluated on data it has already seen. Moreover, they add a "pure hold-out" as an extra layer of protection, to ensure that the results are not an artifact of unhelpful "human data mining."

By combining K-Fold Cross-Validation and the validation set approach, researchers can obtain a more comprehensive evaluation of the model's performance while maintaining computational efficiency. The specific combination of K-Fold Cross-Validation and the validation set approach will depend on the researcher's goals and the particular constraints of the study. When choosing a resampling technique, it is essential to carefully consider the trade-offs between computational efficiency and thoroughness.
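One simple way to combine the two approaches is sketched below: set aside a hold-out sample, use K-Fold Cross-Validation on the remaining data to choose among candidate models, and evaluate the chosen model once on the hold-out. This is a generic illustration on simulated data, not the exact design used by Kleinberg et al.; the candidate polynomial degrees and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated data with a quadratic relationship (illustration only)
rng = np.random.default_rng(11)
x_sim = rng.uniform(-2, 2, size=(400, 1))
y_sim = 1.0 + 0.5 * x_sim[:, 0] - 0.7 * x_sim[:, 0] ** 2 + rng.normal(scale=0.5, size=400)

# 1. Set aside a hold-out sample that is not used for model selection
x_work, x_hold, y_work, y_hold = train_test_split(
    x_sim, y_sim, test_size=0.2, random_state=123
)

# 2. Choose the polynomial degree by 5-fold CV on the working sample only
kf_sim = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse_by_degree = {}
for d in [1, 2, 3, 4]:
    pipe = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    fold_scores = cross_val_score(
        pipe, x_work, y_work, cv=kf_sim, scoring="neg_mean_squared_error"
    )
    cv_mse_by_degree[d] = -fold_scores.mean()
best_degree = min(cv_mse_by_degree, key=cv_mse_by_degree.get)

# 3. Refit the chosen model on the full working sample and evaluate it once
final_model = make_pipeline(
    PolynomialFeatures(degree=best_degree), LinearRegression()
).fit(x_work, y_work)
holdout_mse = mean_squared_error(y_hold, final_model.predict(x_hold))

print(f"degree chosen by CV: {best_degree}")
print(f"hold-out MSE:        {holdout_mse:.3f}")
```

Because the hold-out sample plays no role in choosing the degree, its MSE is not affected by the model search carried out in step 2.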


