# Traditional vs. Bootstrapping Confidence Intervals

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

full_data = pd.read_csv('coffee_dataset.csv')
sample_data = full_data.sample(200)  # In practice, a sample like this is all the data you would have.

### Bootstrapping Confidence Intervals


In [2]:
diff = []

for _ in range(10000):
    bootsample = sample_data.sample(200, replace=True)
    mean_coff = bootsample[bootsample.drinks_coffee == True].height.mean()
    mean_non_coff = bootsample[bootsample.drinks_coffee == False].height.mean()
    diff.append(mean_coff - mean_non_coff)

# Build a 95% confidence interval using your sampling distribution.
np.percentile(diff, 2.5), np.percentile(diff, 100-2.5)

(0.3965686790909317, 2.243258868112464)
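The resampling loop above can be wrapped in a reusable function. The following is a minimal sketch, not the notebook's own code: the name `bootstrap_diff_ci` and the synthetic `demo` frame are hypothetical, and the heights are simulated rather than read from `coffee_dataset.csv`.

```python
import numpy as np
import pandas as pd


def bootstrap_diff_ci(df, group_col, value_col, n_boot=10000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for mean(group True) - mean(group False).

    Hypothetical helper: resamples whole rows with replacement, as the loop
    above does, so the group sizes vary from one resample to the next.
    """
    rng = np.random.default_rng(seed)
    groups = df[group_col].to_numpy(dtype=bool)
    values = df[value_col].to_numpy(dtype=float)
    n = len(df)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        g, v = groups[idx], values[idx]
        # Assumes both groups appear in every resample; with very small or
        # very unbalanced samples one group can be empty and the mean is NaN.
        diffs[i] = v[g].mean() - v[~g].mean()
    lower, upper = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


# Synthetic stand-in for sample_data (assumed values, for illustration only).
rng = np.random.default_rng(0)
is_coffee = rng.random(200) < 0.6
demo = pd.DataFrame({
    "drinks_coffee": is_coffee,
    "height": np.where(is_coffee, 68.0, 66.5) + rng.normal(0, 3, 200),
})

lower, upper = bootstrap_diff_ci(demo, "drinks_coffee", "height", n_boot=2000)
print(lower, upper)
```

Working on NumPy arrays inside the loop avoids building a new DataFrame on every iteration, which is usually the slow part of the original loop.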

### Traditional Confidence Intervals


In [3]:
import statsmodels.stats.api as sms

In [4]:
X1 = sample_data[sample_data.drinks_coffee == True].height
X2 = sample_data[sample_data.drinks_coffee == False].height

cm = sms.CompareMeans(sms.DescrStatsW(X1), sms.DescrStatsW(X2))
cm.tconfint_diff(usevar='unequal')

(0.39600106159185644, 2.273413157022891)
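To make the traditional interval less of a black box, here is a sketch of the unequal-variance (Welch) t-interval written out by hand, which is the interval `usevar='unequal'` is generally understood to request. The function name `welch_confint_diff` and the simulated heights are assumptions for illustration; `scipy.stats.t.ppf` supplies the critical value.

```python
import numpy as np
from scipy import stats


def welch_confint_diff(x1, x2, alpha=0.05):
    """Two-sided Welch t-interval for mean(x1) - mean(x2) (hypothetical helper)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    v1 = x1.var(ddof=1) / n1  # squared standard error of mean(x1)
    v2 = x2.var(ddof=1) / n2  # squared standard error of mean(x2)
    se = np.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    t_crit = stats.t.ppf(1 - alpha / 2, dof)
    diff = x1.mean() - x2.mean()
    return diff - t_crit * se, diff + t_crit * se


# Simulated heights standing in for X1 and X2 (assumed values).
rng = np.random.default_rng(1)
x1_demo = rng.normal(68.0, 3.0, 120)
x2_demo = rng.normal(66.5, 3.0, 80)

ci_low, ci_high = welch_confint_diff(x1_demo, x2_demo)
print(ci_low, ci_high)
```

Unlike the percentile bootstrap interval, this interval is symmetric around the observed difference by construction, because it relies on the t-distribution rather than on the resampled values.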

In this notebook you saw a comparison of two ways to build a confidence interval for a difference of means: the traditional method, computed here with a built-in function from the `statsmodels` Python library, and the bootstrapping method you have been using throughout this lesson.

With large sample sizes, the two intervals end up looking very similar. With smaller sample sizes, traditional methods rely on assumptions, such as an approximately normal sampling distribution, that may not hold for your data, so the resulting interval can be unreliable. Small sample sizes are not ideal for bootstrapping either: a small sample may not represent the population well, so the bootstrap results can be misleading.
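One way to explore the effect of sample size is a small simulation that counts how often each interval captures a known true difference. This is only a sketch under stated assumptions: the two populations are skewed exponential distributions chosen for illustration, the helper names are hypothetical, and the bootstrap here resamples each group separately (unlike the row-level resampling above) so that no group can end up empty.

```python
import numpy as np
from scipy import stats


def welch_ci(x1, x2, alpha=0.05):
    """Welch t-interval for the difference of means."""
    v1 = x1.var(ddof=1) / len(x1)
    v2 = x2.var(ddof=1) / len(x2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (len(x1) - 1) + v2 ** 2 / (len(x2) - 1))
    half = stats.t.ppf(1 - alpha / 2, dof) * np.sqrt(v1 + v2)
    diff = x1.mean() - x2.mean()
    return diff - half, diff + half


def boot_ci(x1, x2, rng, n_boot=500, alpha=0.05):
    """Percentile bootstrap interval, resampling within each group."""
    b1 = rng.choice(x1, size=(n_boot, len(x1)), replace=True).mean(axis=1)
    b2 = rng.choice(x2, size=(n_boot, len(x2)), replace=True).mean(axis=1)
    lo, hi = np.percentile(b1 - b2, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi


def coverage(n, n_trials=200, seed=0):
    """Fraction of simulated samples whose interval contains the true difference."""
    rng = np.random.default_rng(seed)
    true_diff = 2.0 - 1.0  # difference of the two assumed population means
    hits = {"welch": 0, "bootstrap": 0}
    for _ in range(n_trials):
        x1 = rng.exponential(scale=2.0, size=n)  # skewed population, mean 2
        x2 = rng.exponential(scale=1.0, size=n)  # skewed population, mean 1
        lo, hi = welch_ci(x1, x2)
        hits["welch"] += lo <= true_diff <= hi
        lo, hi = boot_ci(x1, x2, rng)
        hits["bootstrap"] += lo <= true_diff <= hi
    return {name: count / n_trials for name, count in hits.items()}


# Compare a small and a larger per-group sample size.
results = {n: coverage(n) for n in (8, 100)}
for n, cov in results.items():
    print(n, cov)
```

A nominal 95% interval should capture the true difference in roughly 95% of trials, so coverage noticeably below that level at a given sample size suggests the method is struggling there. With only 200 trials the estimates are noisy, so treat the numbers as indicative.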

