<div >
<img src = "banner.jpg" />
</div>

# Resampling Methods for Out of Sample Performance


# Introduction



For this tutorial we will use the data set `matchdata`, included in the
`McSpatial` package for R. The data set contains 3,204 sales of single-family homes on the Far North Side of Chicago in 1995 and 2005.


<div>
<img src="chicago.png" width="250"/>
</div>



This data set includes 18 variables/features about the home, including the price at which the home was sold, the number of bathrooms and bedrooms, the latitude and longitude, etc.



In [None]:
#packages
require("pacman")
p_load("tidyverse","stargazer")

In [None]:
#Data 
load(url("https://github.com/ignaciomsarmiento/datasets/blob/main/matchdata.rda?raw=true"))

In [None]:
stargazer(matchdata, header=FALSE, type='text',title="Variables Included in the Matched Data Set")

In [None]:
matchdata <- matchdata %>% 
                      mutate(price=exp(lnprice) #transforms log prices to standard prices
                             ) 

## Validation Set  Approach


<div>
<img src="30-70.png" width="500"/>
</div>


In [None]:
#make this example reproducible
set.seed(123)

#assign each row to the training set with probability 0.7 (roughly 70% train, 30% test)
sample <- sample(c(TRUE, FALSE), nrow(matchdata), replace=TRUE, prob=c(0.7,0.3))
head(sample)

In [None]:
sum(sample)/nrow(matchdata)

In [None]:
train  <- matchdata[sample, ]
test   <- matchdata[!sample, ]
dim(train)

In [None]:
# Another way to build the split (left commented out)
# make this example reproducible
# set.seed(123)
#
# create ID column
# df$id <- 1:nrow(df)
#
# use 70% of dataset as training set and 30% as test set
# train <- df %>% dplyr::sample_frac(0.70)
# test  <- dplyr::anti_join(df, train, by = 'id')

---

The objective, then, is to get the best possible prediction of house prices. We begin with a simple model with no covariates, just a constant.

In [None]:
model1<-lm(price~1,data=train)
summary(model1)

In this case our prediction for the price is the training sample average:

$$
\hat{y}=\hat{\beta}_0=\frac{\sum_{i} y_i}{n}=m
$$

In [None]:
coef(model1)


In [None]:
paste("Coef:", mean(train$price))

---

But we are concerned with predicting well out of sample, so we need to evaluate our model on the test data.

In [None]:
test$model1<-predict(model1,newdata = test)

Then the test MSE, $E((y-\hat{y})^2)=E((y-m)^2)$, is:

In [None]:
with(test,mean((price-model1)^2))

This is our starting point; the question is how we can improve on it.
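
Since the same test MSE calculation is repeated for every model that follows, it can help to wrap it in a small helper. The block below is a minimal sketch: the function name `calc_mse` and the toy vectors are illustrative assumptions, not part of the tutorial's data.

```r
# Hypothetical helper: mean squared error between observed and predicted values
calc_mse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean((actual - predicted)^2)
}

# Toy vectors (made up for illustration)
y_obs  <- c(100, 150, 200)
y_pred <- c(110, 140, 230)

calc_mse(y_obs, y_pred)
```

With the objects defined earlier, `calc_mse(test$price, test$model1)` would compute the same quantity as `with(test, mean((price - model1)^2))`.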

---

To improve our prediction we can start adding variables and thus *building* $f$. The standard approach to building $f$ is to use a hedonic house price function. In its basic form, the hedonic price function is linear in the explanatory characteristics:

$$
Price=\beta_0+\beta_1 x_1+\beta_2 x_2 + \dots + \beta_p x_p +u
$$

where $Price$ is usually the sale price, and $x_1, \dots, x_p$ are attributes of the house, such as its structural characteristics and its location. Estimating a hedonic price function therefore seems a good place to start.
However, the theory says little about which attributes of the house are relevant, so we are going to explore how adding house characteristics affects our out-of-sample MSE.

We begin by showing that the simple inclusion of a single covariate reduces the MSE with respect to the *naive* model that used the sample mean.

In [None]:
model2<-lm(price~bedrooms,data=train)
test$model2<-predict(model2,newdata = test)
with(test,mean((price-model2)^2))

---

What if we include more variables?

In [None]:
model3<-lm(price~bedrooms+bathrooms+centair+fireplace+brick,data=train)
test$model3<-predict(model3,newdata = test)
with(test,mean((price-model3)^2))

Note that the MSE is reduced once more. What if we include all of them?

In [None]:

model4<-lm(price~bedrooms+bathrooms+centair+fireplace+brick+
                lnland+lnbldg+rooms+garage1+garage2+dcbd+rr+
                yrbuilt+factor(carea)+latitude+longitude,data=train)
test$model4<-predict(model4,newdata = test)
with(test,mean((price-model4)^2))

In this case the MSE keeps improving. Is there a limit to this improvement? Can we keep adding features and complexity?

In [None]:
model5<-lm(price~poly(bedrooms,2,raw=TRUE):poly(bathrooms,3,raw=TRUE):centair:fireplace:brick:lnland:lnbldg+garage1+garage2+rr+
                yrbuilt+factor(carea)+poly(latitude,8,raw=TRUE):poly(longitude,8,raw=TRUE),data=train)
test$model5<-predict(model5,newdata = test)

In [None]:
with(test,mean((price-model5)^2))

We can now compare the test MSE of all five models:

In [None]:
mse1<-with(test,round(mean((price-model1)^2),3))
mse2<-with(test,round(mean((price-model2)^2),3))
mse3<-with(test,round(mean((price-model3)^2),3))
mse4<-with(test,round(mean((price-model4)^2),3))
mse5<-with(test,round(mean((price-model5)^2),3))

In [None]:
mse<-c(mse1,mse2,mse3,mse4,mse5)

db<-data.frame(model=factor(c("model1","model2","model3","model4","model5"),ordered=TRUE),MSE=mse)

db
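
To see why adding complexity does not guarantee a lower test MSE even though the training MSE keeps falling, here is a minimal sketch on simulated data. Everything in it (the data-generating process, the sample size, and the polynomial degrees) is an assumption made for illustration, not something taken from `matchdata`.

```r
set.seed(123)

# Simulated data (assumption): a quadratic signal plus noise
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 1.5 * x^2 + rnorm(n, sd = 2)
sim <- data.frame(x = x, y = y)

# 70/30 split by row index
train_id  <- sample(n, size = round(0.7 * n))
sim_train <- sim[train_id, ]
sim_test  <- sim[-train_id, ]

# Fit polynomials of increasing degree and record both MSEs
degrees <- 1:10
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = sim_train)
  c(degree    = d,
    train_mse = mean(residuals(fit)^2),
    test_mse  = mean((sim_test$y - predict(fit, newdata = sim_test))^2))
}))

as.data.frame(results)
```

Because the models are nested, the training MSE cannot go up as the degree increases. The test MSE carries no such guarantee, which is why it is the relevant yardstick for choosing among models.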

## LOOCV


<div>
<img src="1.png" width="500"/>
</div>

<div>
<img src="2.png" width="500"/>
</div>


<div>
<img src="3.png" width="500"/>
</div>

.

.

.

.

.

.

.

.

<div>
<img src="20.png" width="500"/>
</div>
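
Leave-one-out cross-validation (LOOCV) fits the model $n$ times, each time leaving out a single observation and predicting it with the model fitted on the other $n-1$. A minimal sketch of that procedure follows. The data are simulated (an assumption, so that the block runs on its own) and the variable names are illustrative. For linear models fitted by least squares, the same quantity can be obtained from a single fit using the leverage values, which the second part of the block checks.

```r
set.seed(123)

# Simulated data (assumption, for illustration only)
n <- 50
sim <- data.frame(x = rnorm(n))
sim$y <- 3 + 2 * sim$x + rnorm(n)

# Explicit loop: leave observation i out, fit on the rest, predict i
sq_err <- sapply(1:n, function(i) {
  fit <- lm(y ~ x, data = sim[-i, ])
  (sim$y[i] - predict(fit, newdata = sim[i, ]))^2
})
loocv_loop <- mean(sq_err)

# Shortcut for least-squares fits: one fit plus the leverages h_i
fit_all <- lm(y ~ x, data = sim)
loocv_short <- mean((residuals(fit_all) / (1 - hatvalues(fit_all)))^2)

c(loop = loocv_loop, shortcut = loocv_short)
```

The two numbers agree up to floating-point error. The loop generalizes to any model, while the shortcut applies only to least-squares fits.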



## K-Fold Cross-Validation

<div>
<img src="fold.png" width="500"/>
</div>


In [None]:

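# Specify the number of folds for
# K-fold cross-validation
set.seed(123)
folds <- 5

# Randomly assign every row of matchdata to one of the folds
n <- nrow(matchdata)
index <- split(1:n, sample(rep(1:folds, length.out = n)))
splt <- lapply(1:folds, function(ind) matchdata[index[[ind]], ])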


In [None]:
View(head(splt))

In [None]:
head(splt[[2]])

In [None]:
head(splt[[5]])

In [None]:
p_load(data.table)
m1 <- lapply(1:folds, function(ii) lm(price~bedrooms, data = rbindlist(splt[-ii]))) 

In [None]:
#lm(price~bedrooms, data = rbindlist(splt[-1]))

In [None]:
p1 <- lapply(1:folds, function(ii) data.frame(predict(m1[[ii]], newdata = rbindlist(splt[ii]))))

In [None]:
#p1[1]

In [None]:

for (i in 1:folds) {
  colnames(p1[[i]])<-"yhat"
  splt[[i]] <- cbind(splt[[i]], p1[[i]])

}

In [None]:

MSE2_k <- lapply(1:folds, function(ii) mean((splt[[ii]]$price - splt[[ii]]$yhat)^2))
MSE2_k

In [None]:
mean(unlist(MSE2_k))
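
In formula form, the number computed in the last cell is the K-fold cross-validation estimate: the average of the fold-level MSEs, where each fold $k$ is predicted by a model fitted on the other $K-1$ folds. The notation below ($CV_{(K)}$, $MSE_k$, $n_k$, $\hat{y}_i^{(-k)}$) is introduced here for illustration.

$$
CV_{(K)}=\frac{1}{K}\sum_{k=1}^{K}MSE_k,\qquad MSE_k=\frac{1}{n_k}\sum_{i\in \text{fold } k}\left(y_i-\hat{y}_i^{(-k)}\right)^2
$$

Here $n_k$ is the number of observations in fold $k$ and $\hat{y}_i^{(-k)}$ is the prediction for observation $i$ from the model fitted without fold $k$. With $K=n$ this reduces to LOOCV.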

In [None]:
db$MSE[db$model=="model2"]
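
The steps above (split the data, fit on $K-1$ folds, predict the held-out fold, average the fold MSEs) can be gathered into one function. This is a minimal sketch under stated assumptions: `kfold_mse` is a hypothetical helper, not a function from any package, and the data are simulated so that the block runs by itself.

```r
# Hypothetical helper: K-fold cross-validated MSE for a linear model
kfold_mse <- function(formula, data, k = 5) {
  n <- nrow(data)
  # Randomly assign each row to one of k folds of (nearly) equal size
  fold_id <- sample(rep(1:k, length.out = n))
  response <- all.vars(formula)[1]  # name of the outcome variable

  fold_mse <- sapply(1:k, function(j) {
    fit  <- lm(formula, data = data[fold_id != j, ])
    held <- data[fold_id == j, ]
    mean((held[[response]] - predict(fit, newdata = held))^2)
  })
  mean(fold_mse)
}

set.seed(123)
# Simulated data (assumption, for illustration only)
sim <- data.frame(x = rnorm(100))
sim$y <- 1 + 2 * sim$x + rnorm(100)

kfold_mse(y ~ x, data = sim, k = 5)
```

Applied to this tutorial's data, a call such as `kfold_mse(price ~ bedrooms, matchdata, k = 5)` would play the role of the cells above, though the exact number depends on how rows are assigned to folds.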