<div >
<img src = "banner.jpg" />
</div>

# Ridge and Lasso Basics

## Data: House Prices



We will predict house prices using the `matchdata` data set, included in the `McSpatial` package for `R`, to illustrate how the different strategies are used. The data contain 3,204 sales of single-family homes on the Far North Side of Chicago in 1995 and 2005.

<div>
<img src="chicago.png" width="250"/>
</div>

This data set includes 18 variables (features) describing each home, including the price at which the house was sold, the number of bathrooms and bedrooms, and its latitude and longitude.

Let's load the packages and the data set:

In [None]:
#packages
require("pacman")
p_load("tidyverse","stargazer")

load(url("https://github.com/ignaciomsarmiento/datasets/blob/main/matchdata.rda?raw=true"))

The `stargazer` package gives us a table of descriptive statistics:

In [None]:
stargazer(matchdata, header=FALSE, type='text',title="Variables Included in the Matched Data Set")

Note that the price variable, `lnprice`, is in logs. We transform it back to levels:

In [None]:
matchdata <- matchdata %>% 
                      mutate(price=exp(lnprice) #transforms log prices to standard prices
                             ) 

## Ridge

In [None]:
p_load("glmnet")
X<-model.matrix(~rooms+bedrooms+bathrooms-1,matchdata)
y<-matchdata$price


grid=10^seq(10,-2,length=100)


ridge1<-glmnet(x=X,
               y=y,
               alpha=0, #0 is ridge, 1 is lasso
               lambda=grid)
head(coef(ridge1))
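
To see why the ridge coefficients shrink as `lambda` grows, here is a minimal base-R sketch on simulated data. The data, the helper `ridge_closed_form`, and the chosen `lambda` values are illustrative assumptions and are not part of `matchdata`. The sketch uses the textbook closed-form solution $(X'X + \lambda I)^{-1}X'y$ on standardized predictors; `glmnet` scales the penalty differently, so the numbers will not match `ridge1`.

```r
set.seed(1)
n <- 200
X_sim <- scale(matrix(rnorm(n * 3), n, 3))         # standardized predictors
y_sim <- drop(X_sim %*% c(3, -2, 0.5) + rnorm(n))  # simulated outcome
y_sim <- y_sim - mean(y_sim)                       # center y so no intercept is needed

# Hypothetical helper: textbook ridge solution (X'X + lambda * I)^{-1} X'y
ridge_closed_form <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

lambdas <- c(0, 10, 100, 1000)
betas <- sapply(lambdas, function(l) ridge_closed_form(X_sim, y_sim, l))
colnames(betas) <- paste0("lambda=", lambdas)
round(betas, 3)

# The Euclidean norm of the coefficient vector falls as lambda rises
norms <- sqrt(colSums(betas^2))
norms
```

With `lambda = 0` the helper reproduces OLS; larger values pull every coefficient toward zero without setting any of them exactly to zero.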

In [None]:
#Put coefficients in a data frame, except the intercept
coefs_ridge<-data.frame(t(as.matrix(coef(ridge1)))) %>% select(-X.Intercept.)
#add the lambda values to the data frame (taken from the fit, so the order matches the coefficients)
coefs_ridge<- coefs_ridge %>% mutate(lambda=ridge1$lambda)

#ggplot friendly format
coefs_ridge<- coefs_ridge %>% pivot_longer(cols=!lambda,
                                          names_to="variables",
                                          values_to="coefficients")



ggplot(data=coefs_ridge, aes(x = lambda, y = coefficients, color = variables)) +
  geom_line() +
  scale_x_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10",
                                  scales::math_format(10^.x))
  ) +
  labs(title = "Ridge Coefficients", x = "Lambda", y = "Coefficients") +
  theme_bw() +
  theme(legend.position="bottom")

## Lasso

In [None]:
#Same grid
grid=10^seq(10,-2,length=100)

lasso1<-glmnet(x=X,
               y=y,
               alpha=1, #0 is ridge, 1 is lasso
               lambda=grid)
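
Unlike ridge, the lasso can set coefficients exactly to zero. A minimal sketch of why, under a simplifying assumption: with orthonormal predictors and the loss $\tfrac{1}{2}\lVert y - X\beta\rVert^2$, an $\ell_1$ penalty $\gamma\lVert\beta\rVert_1$ soft-thresholds each OLS coefficient, while an $\ell_2$ penalty $\tfrac{\gamma}{2}\lVert\beta\rVert_2^2$ rescales it by $1/(1+\gamma)$. The helper `soft_threshold`, the coefficient values, and the `gamma` values below are made up for illustration and do not come from `lasso1`.

```r
# Soft-thresholding operator: shrink toward zero by gamma, then clip at zero
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

beta_ols <- c(3, -2, 0.5)   # made-up OLS coefficients for illustration
gammas   <- c(0, 1, 2.5)    # made-up penalty levels

lasso_like <- sapply(gammas, function(g) soft_threshold(beta_ols, g))
ridge_like <- sapply(gammas, function(g) beta_ols / (1 + g))
colnames(lasso_like) <- colnames(ridge_like) <- paste0("gamma=", gammas)

lasso_like  # a coefficient becomes exactly zero once gamma reaches its absolute value
ridge_like  # coefficients shrink proportionally but never reach zero
```

This is the pattern to look for in the coefficient paths: lasso lines hit the horizontal axis one by one, whereas ridge lines only approach it.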

In [None]:
#Put coefficients in a data frame, except the intercept
coefs_lasso<-data.frame(t(as.matrix(coef(lasso1)))) %>% select(-X.Intercept.)
#add the lambda values to the data frame (taken from the fit, so the order matches the coefficients)
coefs_lasso<- coefs_lasso %>% mutate(lambda=lasso1$lambda)

#ggplot friendly format
coefs_lasso<- coefs_lasso %>% pivot_longer(cols=!lambda,
                                          names_to="variables",
                                          values_to="coefficients")



ggplot(data=coefs_lasso, aes(x = lambda, y = coefficients, color = variables)) +
  geom_line() +
  scale_x_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10",
                                  scales::math_format(10^.x))
  ) +
  labs(title = "Lasso Coefficients", x = "Lambda", y = "Coefficients") +
  theme_bw() +
  theme(legend.position="bottom")

# Choosing the Penalty Parameter for Out-of-Sample Performance

## Benchmark: OLS

In [None]:
p_load("caret")

set.seed(123)
fitControl <- trainControl(# 5-fold CV; 10 folds would be better
                           method = "cv",
                           number = 5)


fmla<-formula(lnprice~rooms*bedrooms*bathrooms+lnland*lnbldg*poly(dcbd,3,raw=TRUE))
linear_reg<-train(fmla,
                  data=matchdata,
                  method = 'lm', 
                  trControl = fitControl,
                  preProcess = c("center", "scale")
) 


linear_reg
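
To make the cross-validated RMSE less of a black box, here is a minimal base-R sketch of 5-fold cross-validation done by hand. The simulated data frame `sim`, the fold assignment, and the model formula are assumptions for illustration only. The sketch fits on four folds, predicts the held-out fold, and averages the per-fold RMSE.

```r
set.seed(123)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))   # simulated predictors
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)       # simulated outcome, noise sd = 1

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

# For each fold: train on the other folds, evaluate on the held-out fold
fold_rmse <- sapply(seq_len(k), function(f) {
  train_set <- sim[fold_id != f, ]
  test_set  <- sim[fold_id == f, ]
  fit  <- lm(y ~ x1 + x2, data = train_set)
  pred <- predict(fit, newdata = test_set)
  sqrt(mean((test_set$y - pred)^2))
})

fold_rmse
cv_rmse <- mean(fold_rmse)  # cross-validated RMSE
cv_rmse
```

Because every observation is predicted by a model that never saw it, `cv_rmse` estimates out-of-sample error rather than in-sample fit.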

In [None]:
summary(linear_reg)

## Ridge

In [None]:
ridge<-train(fmla,
             data=matchdata,
             method = 'glmnet', 
             trControl = fitControl,
             tuneGrid = expand.grid(alpha = 0, #Ridge
                                    lambda = seq(0.001,0.02,by = 0.001)),
             preProcess = c("center", "scale")
              ) 

plot(ridge$results$lambda,
     ridge$results$RMSE,
     xlab="lambda",
     ylab="Root Mean-Squared Error (RMSE)"
     )

In [None]:
ridge$bestTune

In [None]:
coef_ridge<-coef(ridge$finalModel, ridge$bestTune$lambda)
coef_ridge

## Lasso

In [None]:
lasso<-train(fmla,
             data=matchdata,
             method = 'glmnet', 
             trControl = fitControl,
             tuneGrid = expand.grid(alpha = 1, #lasso
                                    lambda = seq(0.001,0.02,by = 0.001)),
             preProcess = c("center", "scale")
              ) 

plot(lasso$results$lambda,
     lasso$results$RMSE,
     xlab="lambda",
     ylab="Root Mean-Squared Error (RMSE)"
     )

In [None]:
lasso$bestTune

In [None]:
coef_lasso<-coef(lasso$finalModel, lasso$bestTune$lambda)
coef_lasso

## Elastic Net

In [None]:

EN<-train(fmla,
             data=matchdata,
             method = 'glmnet', 
             trControl = fitControl,
             tuneGrid = expand.grid(alpha = seq(0,1,by = 0.1), #mix between ridge (0) and lasso (1)
                                    lambda = seq(0.001,0.02,by = 0.001)),
             preProcess = c("center", "scale")
              ) 
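
For reference, the grid over `alpha` and `lambda` indexes the penalized least-squares problem that `glmnet` documents for a Gaussian response, written here with $N$ observations and predictors $x_i$:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

Setting $\alpha = 0$ leaves only the squared $\ell_2$ term (ridge), $\alpha = 1$ leaves only the $\ell_1$ term (lasso), and intermediate values blend the two, which is what the elastic net grid searches over.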

In [None]:
EN$bestTune

In [None]:
coef_EN<-coef(EN$finalModel,EN$bestTune$lambda)
coef_EN

In [None]:
coefs_df<-cbind(coef(linear_reg$finalModel),as.matrix(coef_ridge),as.matrix(coef_lasso),as.matrix(coef_EN))
colnames(coefs_df)<-c("OLS","RIDGE","LASSO","ELASTIC_NET")
round(coefs_df,4)

In [None]:
#Report each model's best (lowest) cross-validated RMSE, not the RMSE at the smallest lambda
RMSE_df<-cbind(linear_reg$results$RMSE,
               min(ridge$results$RMSE),
               min(lasso$results$RMSE),
               min(EN$results$RMSE))
colnames(RMSE_df)<-c("OLS","RIDGE","LASSO","EN")
RMSE_df
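
When reading a single number out of a `results` table, keep in mind that the row with the smallest `lambda` is not necessarily the row with the lowest RMSE. A small sketch with a toy table (the values in `res` are made up for illustration and are not output from the models above):

```r
# Toy stand-in for a caret `results` table (made-up values)
res <- data.frame(lambda = c(0.001, 0.005, 0.010, 0.020),
                  RMSE   = c(0.310, 0.295, 0.301, 0.320))

rmse_at_min_lambda <- res$RMSE[which.min(res$lambda)]  # RMSE at the smallest lambda
best_rmse          <- res$RMSE[which.min(res$RMSE)]    # lowest cross-validated RMSE
best_lambda        <- res$lambda[which.min(res$RMSE)]  # lambda that attains it

c(rmse_at_min_lambda = rmse_at_min_lambda,
  best_rmse = best_rmse,
  best_lambda = best_lambda)
```

Indexing by `which.min(res$RMSE)` picks the tuning value with the lowest cross-validated error in this toy table.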