### Confidence Interval - Difference In Means

Here you will look through the example from the last video, but you will also go a couple of steps further into what might actually be going on with this data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('coffee_dataset.csv')
sample_data = full_data.sample(200)

In [2]:
sample_data.head()

      user_id   age  drinks_coffee     height
2402     2874   <21           True  64.357154
2864     3670  >=21           True  66.859636
2167     7441   <21          False  66.659561
507      2781  >=21           True  70.166241
1817     2875  >=21           True  71.369120


`1.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the difference in average height between coffee drinkers and non-coffee drinkers.  Build a 99% confidence interval from the resulting sampling distribution.  Use your interval to start answering the first quiz question below.

In [3]:
diffs = []

for i in range(10000):
    boot_sample = sample_data.sample(200, replace=True)
    coff_mean = boot_sample[boot_sample['drinks_coffee']==True].height.mean()
    nocoff_mean = boot_sample[boot_sample['drinks_coffee']==False].height.mean()
    diff = coff_mean - nocoff_mean
    diffs.append(diff)

In [5]:
np.percentile(diffs, 0.5), np.percentile(diffs, 99.5)

(0.10258900080919674, 2.5388333707966284)
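The loop and the percentile call above follow a pattern that repeats in every step of this notebook, so it can help to see it wrapped in one function. The sketch below is only an illustration: the names `percentile_bounds` and `bootstrap_mean_diff_ci` are hypothetical helpers, the `toy` DataFrame is randomly generated rather than taken from `coffee_dataset.csv`, and it runs 2,000 iterations instead of 10,000 to keep it quick.

```python
import numpy as np
import pandas as pd


def percentile_bounds(level):
    """Percentiles that cut off equal tails for a two-sided interval.

    A 99% interval leaves 0.5% in each tail, hence (0.5, 99.5).
    """
    tail = (100 - level) / 2
    return tail, 100 - tail


def bootstrap_mean_diff_ci(df, group_col, a, b, value_col="height",
                           level=99, n_boot=2000, seed=42):
    """Percentile bootstrap interval for mean(group a) - mean(group b)."""
    rng = np.random.default_rng(seed)
    groups = df[group_col].to_numpy()
    values = df[value_col].to_numpy()
    n = len(df)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample row positions with replacement, same size as the sample
        idx = rng.integers(0, n, size=n)
        g, v = groups[idx], values[idx]
        diffs[i] = v[g == a].mean() - v[g == b].mean()
    low_pct, high_pct = percentile_bounds(level)
    return np.percentile(diffs, low_pct), np.percentile(diffs, high_pct)


# Made-up data with the same column names as the notebook (an assumption,
# used only so the sketch runs without the CSV file)
data_rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "drinks_coffee": data_rng.random(200) < 0.5,
    "height": data_rng.normal(67, 3, size=200),
})

lower, upper = bootstrap_mean_diff_ci(toy, "drinks_coffee", True, False)
print(lower, upper)
```

With the real data, a call such as `bootstrap_mean_diff_ci(sample_data, 'age', '>=21', '<21')` would play the role of the loop in the next step, though the exact numbers would differ from the notebook's because the random draws differ.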

`2.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the difference in average height between those 21 or older and those under 21.  Build a 99% confidence interval from the resulting sampling distribution.  Use your interval to finish answering the first quiz question below.

In [10]:
diffs_21 = []

for i in range(10000):
    boot_sample = sample_data.sample(200, replace=True)
    under21_mean = boot_sample[boot_sample['age']=='<21'].height.mean()
    over21_mean = boot_sample[boot_sample['age']=='>=21'].height.mean()
    diff = over21_mean - under21_mean
    diffs_21.append(diff)

In [11]:
np.percentile(diffs_21, 0.5), np.percentile(diffs_21, 99.5)

(3.3846249718386421, 5.1051788925372721)

`3.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the **difference** between the average height of coffee drinkers and the average height of non-coffee drinkers among individuals **under** 21 years old.  Using your sampling distribution, build a 99% confidence interval (the 0.5th and 99.5th percentiles).  Use your interval to start answering question 2 below.

In [20]:
diffs_under21_coffee = []

for i in range(10000):
    boot_sample = sample_data.sample(200, replace=True)
    # Keep only the under-21 rows, then split them by coffee status
    under21 = boot_sample[boot_sample['age'] == '<21']
    under21_coff_mean = under21[under21['drinks_coffee'] == True].height.mean()
    under21_nocoff_mean = under21[under21['drinks_coffee'] == False].height.mean()
    diff = under21_coff_mean - under21_nocoff_mean
    diffs_under21_coffee.append(diff)

In [21]:
np.percentile(diffs_under21_coffee, 0.5), np.percentile(diffs_under21_coffee, 99.5)

(-2.836049427829249, -0.84017752205872265)

`4.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the **difference** between the average height of coffee drinkers and the average height of non-coffee drinkers among individuals **21 or older**.  Using your sampling distribution, build a 99% confidence interval (the 0.5th and 99.5th percentiles).  Use your interval to finish answering the second quiz question below, as well as the questions that follow.

In [18]:
diffs_over21_coffee = []

for i in range(10000):
    boot_sample = sample_data.sample(200, replace=True)
    # Keep only the rows aged 21 or older, then split them by coffee status
    over21 = boot_sample[boot_sample['age'] == '>=21']
    over21_coff_mean = over21[over21['drinks_coffee'] == True].height.mean()
    over21_nocoff_mean = over21[over21['drinks_coffee'] == False].height.mean()
    diff = over21_coff_mean - over21_nocoff_mean
    diffs_over21_coffee.append(diff)

In [19]:
np.percentile(diffs_over21_coffee, 0.5), np.percentile(diffs_over21_coffee, 99.5)

(-4.7694094420344788, -1.2720102460389362)
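To read the four intervals side by side, it can help to check on which side of zero each one falls. The snippet below copies the bounds printed in the output cells above; the `direction` helper and the dictionary labels are illustrative names, not part of the original notebook.

```python
# (lower, upper) bounds copied from the output cells above
intervals = {
    "coffee - no coffee (everyone)": (0.10258900080919674, 2.5388333707966284),
    ">=21 - <21": (3.3846249718386421, 5.1051788925372721),
    "coffee - no coffee (<21)": (-2.836049427829249, -0.84017752205872265),
    "coffee - no coffee (>=21)": (-4.7694094420344788, -1.2720102460389362),
}


def direction(lower, upper):
    """Say whether an interval lies entirely above or below zero."""
    if lower > 0:
        return "positive"
    if upper < 0:
        return "negative"
    return "contains zero"


summary = {name: direction(*bounds) for name, bounds in intervals.items()}
for name, verdict in summary.items():
    print(f"{name}: {verdict}")
```

The overall coffee interval sits above zero, while both within-age-group coffee intervals sit below it.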

Within both the under-21 group and the 21-and-over group, non-coffee drinkers were taller on average.  But with the two groups combined, coffee drinkers were taller on average.  This is again **Simpson's paradox**: the coffee drinkers in this dataset are disproportionately adults, and adults are taller than those under 21.  The coffee-drinking group is therefore pulled upward by its age mix, which makes coffee drinkers look taller overall.  That combined result is misleading.
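The reversal is easiest to see with small round numbers. The sketch below uses entirely hypothetical counts and heights (not the coffee dataset), chosen so that non-coffee drinkers are 2 inches taller within each age group while most coffee drinkers are adults. `coffee_gap` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical cells: (age, drinks_coffee, number of people, height of each)
cells = [
    ("<21", True, 20, 64.0),
    ("<21", False, 80, 66.0),
    (">=21", True, 80, 69.0),
    (">=21", False, 20, 71.0),
]
rows = [
    {"age": age, "drinks_coffee": coffee, "height": height}
    for age, coffee, count, height in cells
    for _ in range(count)
]
toy = pd.DataFrame(rows)


def coffee_gap(df):
    """Mean height of coffee drinkers minus mean height of non-coffee drinkers."""
    coffee = df["drinks_coffee"]
    return df.loc[coffee, "height"].mean() - df.loc[~coffee, "height"].mean()


gap_under21 = coffee_gap(toy[toy["age"] == "<21"])
gap_over21 = coffee_gap(toy[toy["age"] == ">=21"])
gap_overall = coffee_gap(toy)

# The crosstab shows the imbalance that drives the reversal
print(pd.crosstab(toy["age"], toy["drinks_coffee"]))
print(gap_under21, gap_over21, gap_overall)
```

Here the gap is -2 inches inside each age group, yet +1 inch overall, because the coffee average is weighted 80/20 toward the taller adults and the non-coffee average is weighted 80/20 toward the shorter under-21 group. Running `pd.crosstab(sample_data['age'], sample_data['drinks_coffee'])` on the real sample is one way to check for the same kind of imbalance.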

More broadly, this is an example of a confounding variable: here, age is related to both coffee drinking and height.  You will learn even more about confounding variables in the regression section of the course.