<div >
<img src = "../banner.jpg" />
</div>

# The Bootstrap


## Introduction


The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. 

As a simple example, the bootstrap can be used to estimate the standard errors of the coefficients from a linear regression fit. 

In the specific case of linear regression, this is not particularly useful, since standard statistical software outputs such standard errors automatically.

However, the power of the bootstrap lies in the fact that it can be easily applied to a wide range of statistical learning methods, including some for which a measure of variability is otherwise difficult to obtain and is not automatically output by statistical software.

Let's illustrate this with an example. The objective is to estimate the price elasticity of gasoline demand and quantify its associated uncertainty.


Suppose we have the following data to estimate the elasticity where all the variables are in logs:

In [None]:
# install.packages("pacman")  # uncomment when running on Google Colab

In [None]:
gas <- read.csv("https://raw.githubusercontent.com/ignaciomsarmiento/datasets/main/gas.csv", header = TRUE)
head(gas)



Before estimating different models, let's recall that the elasticity of demand, usually denoted by $\eta_{qp}$, is the percentage change in the quantity demanded divided by the percentage change in the price.  It gives the percentage change in quantity demanded when there is a one percent increase in price, holding everything else constant:

\begin{align}
\eta_{qp} &=\frac{\frac{\partial Q}{Q}}{\frac{\partial P}{P}} \\
          &=\frac{\partial Q}{\partial P}\frac{P}{Q}
\end{align}
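
One step worth making explicit, as a short sketch: because $\partial \ln Q = \partial Q / Q$ and $\partial \ln P = \partial P / P$, the same elasticity can be written as a derivative in logs. This is why a model with all variables in logs is convenient for estimating elasticities:

$$
\eta_{qp} =\frac{\partial Q}{\partial P}\frac{P}{Q} = \frac{\partial \ln Q}{\partial \ln P}
$$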

## Case 1



To begin with, let's suppose that the demand model takes the form:

$$
\ln{Quantity}_{t}= \alpha + \beta_1 \ln{Price}_{t} +\beta_2 \ln{Income}_{t} +u_{t}
$$

Given this specification, the elasticity of demand is the coefficient on $\ln{Price}_{t}$ (the `price` column in the data, which is already in logs):



$$
\eta_{qp} =\beta_1
$$

Let's estimate this model:

In [None]:
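library(pacman)  # provides p_load(), which installs (if needed) and loads packages
p_load("tidyverse", "stargazer")

# log-log demand model: the coefficient on price is the elasticity of demand
mod1 <- lm(consumption ~ price + income, data = gas)
stargazer(mod1, type = "text", omit.stat = c("ser", "f", "adj.rsq"))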


Thus the elasticity of demand, given the above specification, is -0.838, with a standard error of 0.025.

The estimate can also be extracted directly from the fitted object:


In [None]:
str(mod1)

In [None]:
mod1$coefficients

In [None]:
round(mod1$coefficients[2],3)



We can also find the standard error using the *bootstrap*. In general terms, if $\theta$ is the quantity of interest (in our case, $\eta_{qp} =\beta_1$), we need to perform the following steps:

  1. Take a sample of size $n$ with replacement (a *bootstrap sample*).
  2. Compute the estimate $\hat{\theta}_j$ on that sample; here it is $\hat{\eta}_{qp} =\hat{\beta}_1$.
  3. Repeat steps 1 and 2 $B$ times to obtain $\hat{\theta}_1,\dots,\hat{\theta}_B$.
  4. Calculate the standard error as the standard deviation of the $B$ estimates.
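
The last step can be written out explicitly. A common form of the bootstrap standard error is the sample standard deviation of the $B$ estimates, where $\bar{\theta}$ is shorthand introduced here for their mean:

$$
\widehat{SE}_B(\hat{\theta}) = \sqrt{\frac{1}{B-1}\sum_{j=1}^{B}\left(\hat{\theta}_j-\bar{\theta}\right)^2}, \qquad \bar{\theta}=\frac{1}{B}\sum_{j=1}^{B}\hat{\theta}_j
$$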


Let's implement the bootstrap in two ways: "by hand", and using the package `boot`.

The first step is to set a seed in `R`, so that the results are reproducible, and to choose the number of bootstrap samples $B$:

In [None]:
set.seed(123)

B <- 1000 # number of bootstrap repetitions

eta_mod1 <- rep(NA, B) # empty vector where we are going to save our elasticity estimates
length(eta_mod1)

Next, we have to create a loop that takes a sample of size $n$ with replacement, estimates the coefficient of interest, and saves it to the above empty vector.

In [None]:
for (i in 1:B) {
  # take a sample with replacement of the same size as the original sample (size = 1 means 100%)
  db_sample <- sample_frac(gas, size = 1, replace = TRUE)

  # estimate the model on the bootstrap sample
  f <- lm(consumption ~ price + income, data = db_sample)

  # get the coefficient of interest, which coincides with the elasticity of demand
  coefs <- f$coefficients[2]

  # save it in the vector created above
  eta_mod1[i] <- coefs
}

We can check that we have $B=1000$ estimates of the elasticity of demand:

In [None]:
length(eta_mod1)

We can plot the bootstrap distribution of the estimated elasticity of demand:

In [None]:
hist(eta_mod1) # hist() already draws the plot, so wrapping it in plot() is unnecessary

Next, obtain the mean of the bootstrap estimates:

In [None]:
mean(eta_mod1)

and, finally, obtain the standard error, which is the standard deviation of the bootstrap estimates:

In [None]:
sqrt(var(eta_mod1)) # equivalent to sd(eta_mod1)

We can also compute other summaries of the bootstrap distribution, for example the 2.5th and 97.5th percentiles, which together give a 95% percentile interval for the elasticity:

In [None]:
quantile(eta_mod1,c(0.025,0.975))
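
To see the whole by-hand procedure in one place, here is a minimal sketch on simulated data. The data frame `sim`, its coefficients, and the names `eta_sim`, `ci_percentile`, and `ci_normal` are assumptions made for illustration and are not part of the gas data set. It resamples row indices with base `R`'s `sample()` and compares the percentile interval with a normal-approximation interval built from the bootstrap standard error.

```r
# Hypothetical simulated log-log data (NOT the gas data set)
set.seed(1)
n <- 200
sim <- data.frame(price = rnorm(n), income = rnorm(n))
sim$consumption <- 1 - 0.8 * sim$price + 0.5 * sim$income + rnorm(n, sd = 0.3)

# Point estimate on the full simulated sample
eta_hat <- coef(lm(consumption ~ price + income, data = sim))["price"]

B <- 500
eta_sim <- numeric(B)
for (j in seq_len(B)) {
  rows <- sample(n, size = n, replace = TRUE)  # resample row indices with replacement
  fit <- lm(consumption ~ price + income, data = sim[rows, ])
  eta_sim[j] <- coef(fit)["price"]             # store the coefficient on price
}

se_boot <- sd(eta_sim)                                    # bootstrap standard error
ci_percentile <- quantile(eta_sim, c(0.025, 0.975))       # percentile interval
ci_normal <- unname(eta_hat) + c(-1, 1) * qnorm(0.975) * se_boot  # normal approximation
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals should be close; larger gaps suggest skewness in the distribution of the estimator.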

### Bootstrap with the boot package

Because `R` is heavily used by statisticians and econometricians, it already has a package, `boot`, that simplifies and speeds up obtaining standard errors with the bootstrap:

In [None]:
p_load("boot")

This package contains the function `boot`, which takes three main arguments:

`boot(data, statistic, R)`

These are the data set, a function that, when applied to the data, returns a vector containing the statistic(s) of interest, and the number of desired bootstrap replicates. The first and the third arguments are straightforward. The second, however, needs more explanation.



The function passed as `statistic` needs at least two arguments: a data set and an index vector, which tells `R` which observations it should use for its estimation. For each replicate, `boot` supplies an index of resampled row numbers. This indexing strategy speeds up the computation of the bootstrap samples.

With that, we can tell the function what to return. In this case, we care about the second coefficient of the linear regression:  `consumption~price+income` 

In [None]:
eta_fn<-function(data,index){
  
  coef(lm(consumption~price+income, data = data, subset = index))[2] #returns the second coefficient of the linear regression
}

Let's check that it works. We give the function our `gas` data and tell it to use all the observations from 1 to the last row:


In [None]:
eta_fn(gas,1:nrow(gas))

We get the same coefficient shown in our results table. So we know that the function is working. With that, we can estimate the standard error using the `boot` function:

In [None]:
set.seed(123)
#call the boot function
boot(gas, eta_fn, R = 1000)

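We get results similar to our estimates by hand. The small differences arise because `boot` generates its resamples differently from our loop, so the two procedures do not use the same bootstrap samples even with the same seed. The sampling strategy implemented by `boot` is also more efficient than ours.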
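
The object returned by `boot` can also be stored and inspected. The sketch below uses a toy vector `x` and a helper `mean_fn`, both hypothetical and chosen only to keep the example self-contained. It shows where the original estimate (`t0`) and the replicates (`t`) live, and how `boot.ci` from the same package returns a percentile interval.

```r
library(boot)

# Hypothetical toy data, used only to show the structure of a boot object
set.seed(1)
x <- rnorm(100, mean = 5)

# statistic(data, index): the mean of the resampled observations
mean_fn <- function(data, index) mean(data[index])

b <- boot(x, mean_fn, R = 500)

b$t0                            # statistic computed on the original data
dim(b$t)                        # R x 1 matrix holding the bootstrap replicates
se <- sd(b$t[, 1])              # bootstrap standard error
bias <- mean(b$t[, 1]) - b$t0   # bootstrap estimate of the bias
ci <- boot.ci(b, type = "perc") # percentile confidence interval
ci$percent[4:5]                 # lower and upper endpoints
```

Storing the result this way makes it easy to reuse the replicates for plots or for other summaries without rerunning the bootstrap.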

Note also that the estimates obtained by bootstrapping differ from those returned by `lm`. **Why?**

## Case 2


The previous case was not particularly useful since the specification resulted in the elasticity of demand coinciding with a coefficient, and `R` outputs standard errors automatically. However, the power of the bootstrap lies in the fact that it can be easily applied to a wide range of statistical learning methods, including some for which a measure of variability is otherwise challenging to obtain and is not automatically output by statistical software.

For example, let's suppose that the model is now the following

$$\ln{Quantity}_{t}= \beta_0 + \beta_1 \ln{Price}_{t}  + \beta_2 \ln{Price}^2_{t} +\beta_3 \ln{Income}_{t}+ \beta_4 \ln{Price}_{t}\times \ln{Income}_{t} +u_{t}$$

Given this specification, the elasticity of demand implied by the model is:


$$
\eta_{qp} =\beta_1 + 2\beta_2 \ln{Price}_{t} + \beta_4  \ln{Income}_{t}
$$

The elasticity is now a combination of three coefficients, evaluated at a given price and income. Thus the uncertainty associated with this quantity should account for the uncertainty coming from all of them.
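
For completeness, the expression for $\eta_{qp}$ comes from differentiating the model with respect to $\ln{Price}_{t}$ term by term, holding income fixed. The intercept, the income term, and the error do not depend on price, so only three terms survive:

\begin{align}
\eta_{qp} &=\frac{\partial \ln{Quantity}_{t}}{\partial \ln{Price}_{t}} \\
          &=\beta_1 + 2\beta_2 \ln{Price}_{t} + \beta_4 \ln{Income}_{t}
\end{align}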


Let's then estimate the model and calculate the elasticity of demand. First, we need to generate the quadratic and interaction terms:

In [None]:
gas<- gas %>% mutate(price2=price^2, 
                     price_income=price*income )

In [None]:
head(gas)

Then we run the regression:

In [None]:
mod2 <- lm(consumption ~ price + price2 + income + price_income, data = gas)
stargazer(mod1, mod2, type = "text")

To calculate the elasticity of demand, you will need to:

i) Obtain the regression coefficients:

In [None]:
coefs <- mod2$coefficients # full component name, instead of relying on partial matching of $coef
coefs

ii) Extract the coefficients to scalars: 

In [None]:
b0 <- coefs[1] # intercept
b1 <- coefs[2] # price
b2 <- coefs[3] # price2
b3 <- coefs[4] # income
b4 <- coefs[5] # price_income

iii) Choose a point at which to evaluate the elasticity, since it depends on price and income ($\eta_{qp} =\beta_1 + 2\beta_2 \ln{Price}_{t} + \beta_4  \ln{Income}_{t}$). Here we are going to choose the sample means, but you can evaluate it at any point:

In [None]:
price_bar<-mean(gas$price)
income_bar<-mean(gas$income)


elastpt<-b1+2*b2*price_bar+b4*income_bar
    
elastpt

Note that the elasticity of demand implied by the second model is smaller than the previous one. Next, we need to calculate its standard error to characterize the uncertainty of this estimate.

Let's turn to the boot package and construct the function that will return the elasticity of interest.


In [None]:

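eta_mod2_fn <- function(data, index,
                        price_bar = mean(gas$price),   # default: sample mean of price
                        income_bar = mean(gas$income)) { # default: sample mean of income

  # estimate the model on the observations selected by index
  f <- lm(consumption ~ price + price2 + income + price_income,
          data = data, subset = index)

  coefs <- f$coefficients

  b1 <- coefs[2] # price
  b2 <- coefs[3] # price2
  b4 <- coefs[5] # price_income

  # elasticity evaluated at (price_bar, income_bar)
  elastpt <- b1 + 2 * b2 * price_bar + b4 * income_bar

  return(elastpt)
}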


  

The function is similar to the one in the first case, but we have added two arguments: the values of price and income at which we want the elasticity to be evaluated, which default to the sample means. Let's check that it works:

In [None]:
eta_mod2_fn(gas,1:nrow(gas))

We get the same result as above. Let's evaluate it at a different point of the distribution of price and income:


In [None]:
eta_mod2_fn(gas,1:nrow(gas),price_bar=-1,income_bar=2)  

With our function working, we can run `boot` and obtain the standard error for our elasticity:

In [None]:
# boot() requires the data (gas),
# a statistic function,
# and the number of replications
results <- boot(data = gas, statistic = eta_mod2_fn, R = 1000)
results
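
Because the elasticity depends on where it is evaluated, its standard error can differ across evaluation points as well. The sketch below shows one way to obtain it at more than one point, relying on the fact that `boot` passes extra named arguments on to the statistic function. The simulated data frame `sim`, its coefficients, and the function name `eta_at` are assumptions made for illustration, not part of the gas example.

```r
library(boot)

# Hypothetical simulated data (variables treated as logs), NOT the gas data set
set.seed(1)
n <- 200
sim <- data.frame(price = rnorm(n), income = rnorm(n))
sim$price2 <- sim$price^2
sim$price_income <- sim$price * sim$income
sim$consumption <- 1 - 0.8 * sim$price + 0.1 * sim$price2 +
  0.5 * sim$income + 0.2 * sim$price_income + rnorm(n, sd = 0.3)

# Elasticity evaluated at a user-supplied point (price_bar, income_bar)
eta_at <- function(data, index, price_bar, income_bar) {
  fit <- lm(consumption ~ price + price2 + income + price_income,
            data = data, subset = index)
  b <- coef(fit)
  unname(b["price"] + 2 * b["price2"] * price_bar + b["price_income"] * income_bar)
}

# Extra named arguments to boot() are passed unchanged to the statistic
res_mean <- boot(sim, eta_at, R = 300,
                 price_bar = mean(sim$price), income_bar = mean(sim$income))
res_high <- boot(sim, eta_at, R = 300,
                 price_bar = quantile(sim$price, 0.9), income_bar = mean(sim$income))

sd(res_mean$t[, 1]) # standard error at the sample means
sd(res_high$t[, 1]) # standard error at the 90th percentile of price
```

Keeping the evaluation point as an explicit argument, rather than a default tied to one data set, makes the same function reusable at any point of interest.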