In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns

# Bootstrap Tutorial

This notebook shows how to use the bootstrap functionality provided by estimagic. We work with the "exercise" example dataset from the seaborn library.

The working example will be a linear regression to investigate the effects of exercise time on pulse.

In [2]:
df = sns.load_dataset("exercise")

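# Map the time labels to minutes (named so that the built-in `dict` is not shadowed).
time_to_minutes = {"1 min": 1, "15 min": 15, "30 min": 30}

df = df.replace({"time": time_to_minutes})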
df["constant"] = 1


print(df.head())

   Unnamed: 0  id     diet  pulse  time  kind  constant
0           0   1  low fat     85     1  rest         1
1           1   1  low fat     85    15  rest         1
2           2   1  low fat     88    30  rest         1
3           3   2  low fat     90     1  rest         1
4           4   2  low fat     92    15  rest         1


In general, there are two ways to use the bootstrap functionality in estimagic. We can either call the main bootstrap() function, which draws the bootstrap samples, computes the estimates, and computes their standard errors and confidence intervals in a single call, or we can use the fact that these steps are implemented separately and go through them one by one.

Either way, the user first has to specify a wrapper function that calculates the statistic of interest and takes the data on which to compute the statistic as its only input. In our case, we want to regress "pulse" on "time" and a constant, and the function looks as follows:

In [3]:
def ols_fit(data):
    
    y = data["pulse"]
    x = data[["constant", "time"]]
    
    params = pd.Series(
        sm.OLS(y, x).fit().params, index=["constant", "time"]
    )
    
    return params

By returning a pd.Series with the coefficient names as index, we make sure that we eventually get a readable output table. This is not required, however. We could also return the bare parameter vector, in which case we would have to keep track of the meaning of each parameter ourselves.

## The Easy Way

After defining our function of interest, we can use estimagic's bootstrap functionality with a single call to the bootstrap() function.

In [4]:
from estimagic.inference.bootstrap import bootstrap

results_without_cluster = bootstrap(data=df, f=ols_fit)

print(results_without_cluster)

               mean       std   lower_ci   upper_ci
constant  93.800212  1.444054  91.092999  96.653470
time       0.384128  0.123372   0.142345   0.644022


The above function call represents the minimum that a user has to specify. It makes full use of the default options, such as drawing 1000 bootstrap samples, using the "percentile" bootstrap confidence interval, and not using parallelization.
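
To make the default output columns concrete, here is a minimal hand-rolled sketch of a percentile bootstrap for a one-dimensional sample. It is not estimagic's implementation: the function name `percentile_bootstrap`, its arguments, the use of `ddof=1`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def percentile_bootstrap(sample, statistic, ndraws=1000, alpha=0.05, seed=0):
    """Illustrative percentile bootstrap (hypothetical helper, not estimagic code)."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    # Each draw resamples n observations uniformly with replacement.
    estimates = np.array(
        [statistic(sample[rng.integers(0, n, size=n)]) for _ in range(ndraws)]
    )
    # The percentile interval takes the empirical quantiles of the estimates.
    lower, upper = np.percentile(
        estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return {
        "mean": estimates.mean(),
        "std": estimates.std(ddof=1),
        "lower_ci": lower,
        "upper_ci": upper,
    }


# Synthetic data standing in for a column such as "pulse".
data_rng = np.random.default_rng(42)
sample = data_rng.normal(loc=100, scale=15, size=90)

summary = percentile_bootstrap(sample, np.mean)
print(summary)
```

The returned keys mirror the columns of the results table above, but the numbers will of course differ from the estimagic output because the data are synthetic.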

If, for example, we wanted to make 10000 draws, parallelize over 4 threads, and use a "bca" type confidence interval, we would simply call the following.

In [5]:
results_without_cluster2 = bootstrap(
    data=df, f=ols_fit, ndraws=10000, ci_method="bca", num_threads=4
)

print(results_without_cluster2)

               mean       std   lower_ci   upper_ci
constant  93.756518  1.512834  91.850845  97.373046
time       0.387780  0.124949   0.126304   0.556214


We can compare our estimates to simply running a full sample OLS estimation, and see that they are (naturally) quite similar:

In [6]:
y = df["pulse"]
x = df[["constant", "time"]]

regular_ols = sm.OLS(y, x).fit()

regular_ols.summary()

Dep. Variable:      pulse              R-squared:            0.096
Model:              OLS                Adj. R-squared:       0.086
Method:             Least Squares      F-statistic:          9.383
Date:               Fri, 28 Feb 2020   Prob (F-statistic):   0.00291
Time:               17:39:55           Log-Likelihood:       -365.51
No. Observations:   90                 AIC:                  735.0
Df Residuals:       88                 BIC:                  740.0
Df Model:           1
Covariance Type:    nonrobust

             coef  std err       t  P>|t|  [0.025  0.975]
constant  93.7611    2.450  38.275  0.000  88.893  98.629
time       0.3873    0.126   3.063  0.003   0.136   0.639

Omnibus:          20.828   Durbin-Watson:      0.827
Prob(Omnibus):    0.0      Jarque-Bera (JB):   26.313
Skew:             1.173    Prob(JB):           1.93e-06
Kurtosis:         4.231    Cond. No.           31.7


Estimagic implements two different types of bootstrap. In the standard bootstrap that we called in the two examples above, observations from the original dataset are drawn uniformly with replacement in order to create the different bootstrap samples.

In the cluster robust variant of the bootstrap, the original dataset is divided into clusters according to the values of some user-specified variable, and then clusters are drawn uniformly with replacement in order to create the different bootstrap samples. 
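
The cluster resampling step described above can be sketched as follows. This is a conceptual illustration only: `draw_cluster_sample` is a hypothetical helper, the toy values are made up, and estimagic's actual implementation may differ.

```python
import numpy as np
import pandas as pd


def draw_cluster_sample(data, cluster_by, rng):
    """Draw one bootstrap sample by resampling whole clusters with replacement.

    Hypothetical helper for illustration, not part of estimagic.
    """
    cluster_ids = data[cluster_by].unique()
    drawn = rng.choice(cluster_ids, size=len(cluster_ids), replace=True)
    # A cluster that is drawn k times contributes all of its rows k times.
    pieces = [data[data[cluster_by] == c] for c in drawn]
    return pd.concat(pieces, ignore_index=True)


# Made-up toy data: three individuals, each observed at three points in time.
toy = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "time": [1, 15, 30] * 3,
        "pulse": [85, 85, 88, 90, 92, 93, 97, 97, 94],
    }
)

cluster_sample = draw_cluster_sample(toy, "id", np.random.default_rng(0))
print(cluster_sample)
```

Because whole clusters are drawn, the observations of one individual always enter a bootstrap sample together, which preserves the within-individual dependence.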

In order to use the cluster robust bootstrap, we simply specify which variable to cluster by. In the example we are working with, it seems sensible to cluster on individuals, i.e. on the column "id" of our dataset.

In [7]:
results_with_cluster = bootstrap(data=df, f=ols_fit, cluster_by="id")

print(results_with_cluster)

               mean       std   lower_ci   upper_ci
constant  93.752154  1.165025  91.546191  96.048796
time       0.387964  0.103310   0.189072   0.598999


We can see that the estimated standard errors are smaller when we use the cluster robust bootstrap. This is specific to the example and not a general rule: ignoring clustering can bias standard errors in either direction, and in many applications it leads to standard errors that are too small.

Finally, we can compare the results to full sample regressions using the statsmodels regression function and see that the cluster robust bootstrap gives standard error estimates very close to the cluster robust regression.

In [8]:
cluster_robust_ols = sm.OLS(y, x).fit(
    cov_type="cluster", cov_kwds={"groups": df["id"]}
)

cluster_robust_ols.summary()

Dep. Variable:      pulse              R-squared:            0.096
Model:              OLS                Adj. R-squared:       0.086
Method:             Least Squares      F-statistic:          13.75
Date:               Fri, 28 Feb 2020   Prob (F-statistic):   0.000879
Time:               17:39:58           Log-Likelihood:       -365.51
No. Observations:   90                 AIC:                  735.0
Df Residuals:       88                 BIC:                  740.0
Df Model:           1
Covariance Type:    cluster

             coef  std err       z  P>|z|  [0.025  0.975]
constant  93.7611    1.205  77.837  0.000  91.400  96.122
time       0.3873    0.104   3.708  0.000   0.183   0.592

Omnibus:          20.828   Durbin-Watson:      0.827
Prob(Omnibus):    0.0      Jarque-Bera (JB):   26.313
Skew:             1.173    Prob(JB):           1.93e-06
Kurtosis:         4.231    Cond. No.           31.7


## The More Sophisticated Way

In some analyses, we might be interested in calculating not just one statistic of our sample, but several statistics one after another. This section shows how this can be done with the bootstrap in estimagic.

As a first step, we fix the seeds so that every statistic is computed on the same bootstrap samples. The seeds returned are simply an np.array of ndraws valid random seeds.

In [9]:
from estimagic.inference.bootstrap_samples import get_seeds

my_seeds = get_seeds(2000)

print(my_seeds)

[ 323501399  588426255 1045046778 ... 1356650334 1868719583 1991488640]
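
The reason fixed seeds matter can be sketched with plain NumPy: if each bootstrap sample is fully determined by its seed, two different statistics computed with the same seeds see exactly the same resampled rows. How estimagic consumes the seeds internally is not shown here; `resample_indices`, the seed values, and the toy data are assumptions for illustration.

```python
import numpy as np


def resample_indices(n_obs, seed):
    """Row positions of one bootstrap sample, fully determined by the seed.

    Hypothetical helper for illustration, not part of estimagic.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_obs, size=n_obs)


seeds = [11, 22, 33]  # made-up stand-ins for the array returned by get_seeds
pulse = np.array([85.0, 85.0, 88.0, 90.0, 92.0, 93.0])

means = []
medians = []
for seed in seeds:
    idx_for_mean = resample_indices(len(pulse), seed)
    idx_for_median = resample_indices(len(pulse), seed)
    # Same seed, same rows: both statistics use the identical bootstrap sample.
    assert np.array_equal(idx_for_mean, idx_for_median)
    means.append(pulse[idx_for_mean].mean())
    medians.append(np.median(pulse[idx_for_median]))

print(means)
print(medians)
```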


Next, we can call the bootstrap function for different statistics, e.g. the function defined above and a simple correlation coefficient of "pulse" and "time":

In [10]:
def corr(data):
    
    return pd.Series(data["pulse"].corr(data["time"]), index=["corr(pulse, time)"])

In [11]:
results_ols = bootstrap(data=df, f=ols_fit, seeds=my_seeds)
print(results_ols)

               mean       std   lower_ci   upper_ci
constant  93.818926  1.527447  90.970883  96.963398
time       0.386825  0.122389   0.161589   0.636060


In [12]:
results_corr = bootstrap(data=df, f=corr, seeds=my_seeds)
print(results_corr)

                       mean       std  lower_ci  upper_ci
corr(pulse, time)  0.307829  0.081068  0.148877  0.467313


## A Complete Separation of Code

There may also be situations where we want the collection of bootstrap estimates itself rather than the streamlined results table. The following example makes use of the complete separation of code.

In [13]:
more_seeds = get_seeds(2000)

Next, we can use get_bootstrap_estimates() to get the pd.DataFrame containing all of the different bootstrap estimates for f.

In [14]:
from estimagic.inference.bootstrap_estimates import get_bootstrap_estimates

my_estimates = get_bootstrap_estimates(data=df, f=ols_fit, seeds=more_seeds)

print(my_estimates.head())

    constant      time
0  93.782043  0.515684
1  93.420674  0.403365
2  93.422063  0.354121
3  97.334865  0.148656
4  91.635526  0.440604


We can also compute confidence intervals using estimates as an input.

In [15]:
from estimagic.inference.bootstrap_ci import compute_ci

confidence_intervals = compute_ci(data=df, f=ols_fit, estimates=my_estimates, ci_method="bc")

print(confidence_intervals)

           lower_ci   upper_ci
constant  91.328959  96.398915
time       0.197747   0.619950
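
For intuition, the textbook bias-corrected ("bc") percentile interval can be sketched as below. It shifts the percentile levels according to the share of bootstrap estimates that fall below the full-sample estimate, which is one plausible reason why a function computing it would need the original data and f in addition to the estimates. This is a sketch of the standard formula, not necessarily what compute_ci does internally; `bc_interval` and the synthetic estimates are assumptions.

```python
from statistics import NormalDist

import numpy as np


def bc_interval(estimates, full_sample_estimate, alpha=0.05):
    """Textbook bias-corrected percentile interval (illustrative sketch)."""
    norm = NormalDist()
    estimates = np.asarray(estimates, dtype=float)
    # Share of bootstrap estimates below the full-sample estimate.
    # It must lie strictly between 0 and 1 for inv_cdf to be defined.
    share_below = float(np.mean(estimates < full_sample_estimate))
    z0 = norm.inv_cdf(share_below)
    # Shift the nominal percentile levels by twice the bias-correction constant.
    lower_level = norm.cdf(2 * z0 + norm.inv_cdf(alpha / 2))
    upper_level = norm.cdf(2 * z0 + norm.inv_cdf(1 - alpha / 2))
    lower, upper = np.quantile(estimates, [lower_level, upper_level])
    return lower, upper


# Synthetic bootstrap estimates standing in for the "time" column of my_estimates.
rng = np.random.default_rng(0)
fake_estimates = rng.normal(loc=0.39, scale=0.12, size=2000)

lower, upper = bc_interval(fake_estimates, full_sample_estimate=0.39)
print(lower, upper)
```

If exactly half of the bootstrap estimates lie below the full-sample estimate, the correction constant is zero and the interval coincides with the plain percentile interval.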


Alternatively, we can compute the whole results table using specific estimates as an input.

In [16]:
from estimagic.inference.bootstrap import get_results_table

my_results = get_results_table(data=df, f=ols_fit, estimates=my_estimates, ci_method="t")

print(my_results)

               mean       std   lower_ci   upper_ci
constant  93.754326  1.552568  90.615657  96.795346
time       0.389216  0.127902   0.116775   0.614301


Finally, note that it is also possible to return the list of bootstrap samples using the get_bootstrap_samples() function. It returns a list of ndraws bootstrap samples obtained by drawing from the original sample with replacement. Holding all samples in memory at once can be very memory inefficient, so this is not recommended unless the samples themselves are really needed.

In [17]:
from estimagic.inference.bootstrap_samples import get_bootstrap_samples

my_samples = get_bootstrap_samples(data=df, seeds=my_seeds)

print(type(my_samples))

print(my_samples[4].head())

<class 'list'>
    Unnamed: 0  id     diet  pulse  time     kind  constant
57          57  20   no fat    102     1  walking         1
32          32  11  low fat     84    30  walking         1
0            0   1  low fat     85     1     rest         1
66          66  23  low fat     98     1  running         1
18          18   7   no fat     87     1     rest         1
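
If the samples are only needed one at a time, a generator avoids holding all of them in memory. The following is a sketch under the assumption that each sample is determined by one seed; `iter_bootstrap_samples` is a hypothetical helper and not part of estimagic, and the toy data are made up.

```python
import numpy as np
import pandas as pd


def iter_bootstrap_samples(data, seeds):
    """Yield bootstrap samples lazily, one per seed (hypothetical helper)."""
    n = len(data)
    for seed in seeds:
        rng = np.random.default_rng(seed)
        # Only the current sample exists in memory at any point in time.
        yield data.iloc[rng.integers(0, n, size=n)]


# Made-up toy data.
toy = pd.DataFrame({"pulse": [85, 85, 88, 90, 92], "time": [1, 15, 30, 1, 15]})

sample_means = [
    s["pulse"].mean() for s in iter_bootstrap_samples(toy, seeds=[1, 2, 3])
]
print(sample_means)
```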
