<div >
<img src = "../banner.jpg" />
</div>

<a target="_blank" href="https://colab.research.google.com/github/ignaciomsarmiento/BDML_202302/blob/main/Lecture05/Notebook_Ridge.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



# Regularization: Ridge

## Predicting Wages

Our objective today is to construct a model of individual wages

$$
w = f(X) + u 
$$

where $w$ is the wage and $X$ is a matrix of potential explanatory variables (predictors). In this problem set, we will focus on a linear model of the form

\begin{align}
 \ln(w) & = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p  + u 
\end{align}

where $\ln(w)$ is the logarithm of the wage.

To illustrate, I'm going to use a sample of the NLSY97. The NLSY97 is a nationally representative sample of 8,984 men and women born during the years 1980 through 1984 and living in the United States at the time of the initial survey in 1997. Participants were ages 12 to 16 as of December 31, 1996. Interviews were conducted annually from 1997 to 2011 and biennially since then.

Let's load the packages and the data set:

In [None]:
#install.packages("pacman") #for google colab

In [None]:
#packages
require("pacman")
p_load("tidyverse","stargazer")

nlsy <- read_csv('https://raw.githubusercontent.com/ignaciomsarmiento/datasets/main/nlsy97.csv')

nlsy <- nlsy %>% drop_na(educ) # drop rows with a missing value (NA) in educ

We want to construct a model that predicts well out of sample, and we have potentially 994 regressors. We are going to regularize this regression using Ridge.

## Ridge

We first illustrate ridge regression, which can be fit using `glmnet()` with `alpha = 0`. Ridge seeks to minimize

$$
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}    \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 .
$$

Notice that the intercept is not penalized. 


Ridge penalizes the squares of the coefficients. As a result, ridge shrinks coefficients toward zero, but does not set them exactly to zero.
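
To see this shrinkage numerically, here is a minimal base-R sketch on simulated data. The data, the helper `ridge_closed_form()`, and the $\lambda$ grid are assumptions made for illustration and are not part of the NLSY97 exercise. The sketch uses the closed-form solution $\hat\beta(\lambda) = (X'X + \lambda I)^{-1} X'y$ on centered data, which minimizes the objective above with an unpenalized intercept. `glmnet` scales its objective differently, so the $\lambda$ values here are not directly comparable to the ones it reports.

```r
set.seed(123)
n <- 200
p <- 3
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta_true <- c(1, -0.5, 0.25)
y_sim <- 2 + drop(X_sim %*% beta_true) + rnorm(n)

# Closed-form ridge slopes on centered data (the intercept is not penalized)
ridge_closed_form <- function(X, y, lambda) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  drop(solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc)))
}

# One column of slopes per value of lambda (lambda = 0 is OLS)
lambdas <- c(0, 1, 10, 100, 1000)
coefs <- sapply(lambdas, function(l) ridge_closed_form(X_sim, y_sim, l))
colnames(coefs) <- paste0("lambda=", lambdas)
round(coefs, 3)

# Euclidean norm of the slope vector for each lambda
coef_norms <- sqrt(colSums(coefs^2))
coef_norms
```

The norm of the slope vector falls as $\lambda$ grows, yet every individual coefficient stays different from zero.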

We are going to use `glmnet`, a package that fits generalized linear and similar models via penalized maximum likelihood. The regularization path is computed for the lasso or elastic net penalty at a grid of values (on the log scale) for the regularization parameter $\lambda$. The algorithm is extremely fast!

## Glmnet

To apply a regularized model we can use the `glmnet::glmnet()` function. The `alpha` parameter tells glmnet to perform a ridge (`alpha` = 0), lasso (`alpha` = 1), or elastic net (0 < `alpha` < 1) model. 

By default, `glmnet` will do two things that you should be aware of:

1. Since regularized methods apply a penalty to the coefficients, we need to ensure our predictors are on a common scale. If not, the penalty depends on the units of measurement: a predictor measured on a larger scale has a smaller coefficient and is therefore penalized less than one measured on a smaller scale. By default, `glmnet` automatically standardizes your features. If you standardize your predictors prior to `glmnet`, you can turn this off with `standardize = FALSE`.

2. `glmnet` will fit ridge models across a wide range of $\lambda$ values, as the coefficient-path plots later in this notebook illustrate.

`glmnet` has some drawbacks; the main one is that we need to specify the arguments in terms of matrices and vectors:

In [None]:
p_load("glmnet")

#Vector that needs predicting
y <- nlsy$lnw_2016


# Matrix of predictors (only educ, mother's and father's education)
X <- as.matrix(nlsy  %>% select(educ,mom_educ,dad_educ))
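
`as.matrix()` is enough here because the selected predictors are numeric. When a data frame also contains categorical variables, one common option is `model.matrix()`, which expands factors into dummy columns. The sketch below uses a made-up data frame; `toy` and its column names are hypothetical and are not NLSY97 variables.

```r
toy <- data.frame(
  lnw    = c(2.1, 2.8, 3.0, 2.5, 3.2, 2.7),
  educ   = c(12, 16, 18, 12, 16, 14),
  region = factor(c("north", "south", "west", "south", "north", "west"))
)

# Response vector
y_toy <- toy$lnw

# Numeric predictor matrix; drop the intercept column because glmnet adds its own
X_toy <- model.matrix(lnw ~ educ + region, data = toy)[, -1]

colnames(X_toy)
dim(X_toy)
```

The resulting `X_toy` and `y_toy` have the shapes that `glmnet()` expects for its `x` and `y` arguments.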



Let's run the ridge regression (we need to set the parameter `alpha` to zero)

In [None]:

ridge <- glmnet(
  x = X,
  y = y,
  alpha = 0 #ridge
)

Let's see how much the coefficients are penalized for different values of $\lambda$. Notice that none of the coefficients are forced to be zero, although they get close to it.

In [None]:
plot(ridge, xvar = "lambda")

### All the predictors

In [None]:

# Matrix of predictors (all but lnw_2016)
X <- as.matrix(nlsy  %>% select(-lnw_2016))

ridge <- glmnet(
  x = X,
  y = y,
  alpha = 0 #ridge
)

plot(ridge, xvar = "lambda")

## Scale Equivariance

We are going to illustrate the scale problem using just education and AFQT scores (`educ` and `afqt`).


In [None]:

#Vector that needs predicting
y <- nlsy$lnw_2016

# Matrix of predictors (only educ and afqt)
X <- as.matrix(nlsy  %>% select(educ,afqt))



In [None]:
stargazer(data.frame(X),type="text")

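Let's run the ridge regression again (`alpha = 0`), this time with a fixed penalty (`lambda = 20`) and without standardization (`standardize = FALSE`), so that the scale of the predictors matters.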

In [None]:
ridge <- glmnet(
  x = X,
  y = y,
  alpha = 0, #ridge
  lambda = 20,
  standardize = FALSE
)

Let's see the coefficients we obtained


In [None]:
coef(ridge)

Compare to OLS

In [None]:
ols<-lm(y~X)
summary(ols)

### What happens if we change the scale for education?

In [None]:
X[,1]<-X[,1]*1000 #multiply first column by 1000

In [None]:
ridge_1000 <- glmnet(
  x = X,
  y = y,
  alpha = 0, #ridge
  lambda = 20,
  standardize = FALSE
)

In [None]:
# Ridge coefficient on education after rescaling
coef(ridge_1000)[2]

In [None]:
# Convert back to the original units to compare with coef(ridge)[2]
coef(ridge_1000)[2]*1000

In [None]:
ols_1000<-lm(y~X)
summary(ols_1000)

In [None]:
# OLS coefficient converted back to the original units; compare with the first OLS fit
ols_1000$coefficients[2]*1000
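
The same comparison can be reproduced without the NLSY97 data. The following is a minimal base-R sketch under stated assumptions: the simulated variables `x1` and `x2` are stand-ins for education and a test score, the helper `ridge_slopes()` is hypothetical, and the penalty value is arbitrary and not on `glmnet`'s scale. It shows that OLS is scale equivariant (rescaling a predictor and converting the coefficient back gives the same number), while ridge with a fixed $\lambda$ is not.

```r
set.seed(42)
n <- 500
x1 <- rnorm(n, mean = 12, sd = 2)   # stand-in for years of education
x2 <- rnorm(n, mean = 50, sd = 25)  # stand-in for a test score
y_sim <- 1 + 0.08 * x1 + 0.004 * x2 + rnorm(n, sd = 0.5)
X_sim <- cbind(x1, x2)

# Ridge slopes via the closed form on centered data (lambda = 0 gives OLS)
ridge_slopes <- function(X, y, lambda) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  unname(drop(solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc))))
}

# Same data with the first predictor multiplied by 1000
X_big <- X_sim
X_big[, 1] <- X_big[, 1] * 1000

lambda_fixed <- 200  # arbitrary penalty for illustration

# Coefficient on x1 in original units, and the rescaled fit converted back
ols_orig   <- ridge_slopes(X_sim, y_sim, 0)[1]
ols_back   <- ridge_slopes(X_big, y_sim, 0)[1] * 1000
ridge_orig <- ridge_slopes(X_sim, y_sim, lambda_fixed)[1]
ridge_back <- ridge_slopes(X_big, y_sim, lambda_fixed)[1] * 1000

c(ols_orig = ols_orig, ols_back = ols_back)
c(ridge_orig = ridge_orig, ridge_back = ridge_back)
```

After rescaling, the coefficient on the first predictor is so small that the same penalty barely affects it, which is why the converted ridge coefficient moves toward the OLS value.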

## Penalty selection

In [None]:
# Matrix of predictors (all but lnw_2016)
X <- as.matrix(nlsy  %>% select(-lnw_2016))

#Vector that needs predicting
y <- nlsy$lnw_2016

In [None]:

ridge <- glmnet(
  x = X,
  y = y,
  alpha = 0 #ridge
)

In [None]:
ridge

### K-fold cross-validation

In [None]:
?cv.glmnet

In [None]:
cv.ridge <- cv.glmnet(
  x = X,
  y = y,
  alpha = 0 #ridge
)

In [None]:
cv.ridge

We can plot:

In [None]:
plot(cv.ridge)

This plots the cross-validation curve (red dotted line) along with upper and lower standard deviation curves
along the $\lambda$ sequence (error bars). 

Two special values along the $\lambda$ sequence are indicated by the vertical dotted lines:
 - `lambda.min` is the value of $\lambda$ that gives minimum mean cross-validated error, while 
 - `lambda.1se` is the value of $\lambda$ that gives the most regularized model such that the cross-validated error is within one standard error of the minimum.
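
The two definitions above can be checked by hand from the components stored in a `cv.glmnet` object: `lambda` (the grid), `cvm` (mean cross-validated error), and `cvsd` (its estimated standard error). The sketch below is an illustration on simulated data; the data and the `*_manual` variable names are assumptions, and separate object names are used so the notebook's `X`, `y`, and `cv.ridge` are not overwritten.

```r
library(glmnet)

set.seed(2023)
n <- 200
p <- 10
X_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(X_sim[, 1:3] %*% c(1, -1, 0.5)) + rnorm(n)

cv_fit <- cv.glmnet(x = X_sim, y = y_sim, alpha = 0)

# Index of the smallest mean cross-validated error
i_min <- which.min(cv_fit$cvm)
lambda_min_manual <- cv_fit$lambda[i_min]

# Largest lambda whose CV error is within one standard error of the minimum
threshold <- cv_fit$cvm[i_min] + cv_fit$cvsd[i_min]
lambda_1se_manual <- max(cv_fit$lambda[cv_fit$cvm <= threshold])

c(manual = lambda_min_manual, glmnet = cv_fit$lambda.min)
c(manual = lambda_1se_manual, glmnet = cv_fit$lambda.1se)
```

Because a larger $\lambda$ means more shrinkage, `lambda.1se` is never smaller than `lambda.min`.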

We can use the following code to get the value of `lambda.min` 

In [None]:
log(cv.ridge$lambda.min)

In [None]:
cv.ridge$lambda.min

and the model coefficients at that value of $\lambda$:

In [None]:
coef(cv.ridge, s = "lambda.min")

Predictions can be made based on the fitted `cv.glmnet` object as well. The code below gives predictions for the new input matrix `newx` (here, the first five rows of `X`) at `lambda.min`:

In [None]:
predict(cv.ridge, newx = X[1:5,], s = "lambda.min")
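
Since the goal is to predict well out of sample, a natural next step is to evaluate predictions on observations that were not used for fitting. The following is a minimal sketch on simulated data, not the NLSY97 sample; the 70/30 split, the variable names, and the mean-only benchmark are assumptions chosen for illustration.

```r
library(glmnet)

set.seed(101)
n <- 300
p <- 20
X_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(X_sim[, 1:5] %*% rep(0.5, 5)) + rnorm(n)

# Hold out 30% of the rows as a test set
test_id <- sample(seq_len(n), size = round(0.3 * n))
X_train <- X_sim[-test_id, ]
y_train <- y_sim[-test_id]
X_test  <- X_sim[test_id, ]
y_test  <- y_sim[test_id]

# Choose lambda by cross-validation on the training rows only
cv_train <- cv.glmnet(x = X_train, y = y_train, alpha = 0)

# Predict on the held-out rows and compute the root mean squared error
y_hat <- predict(cv_train, newx = X_test, s = "lambda.min")
rmse_ridge <- sqrt(mean((y_test - y_hat)^2))

# Benchmark: predict the training mean for every held-out observation
rmse_mean <- sqrt(mean((y_test - mean(y_train))^2))

c(ridge = rmse_ridge, mean_only = rmse_mean)
```

Keeping the test rows out of `cv.glmnet()` means the reported error is not contaminated by the choice of $\lambda$.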