### Confidence Interval - Difference In Means

Here you will look through the example from the last video, but you will also go a couple of steps further into what might actually be going on with this data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('data/coffee_dataset.csv')
sample_data = full_data.sample(200)

In [2]:
sample_data.head()

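|      | user_id | age  | drinks_coffee | height    |
|------|---------|------|---------------|-----------|
| 2402 | 2874    | <21  | True          | 64.357154 |
| 2864 | 3670    | >=21 | True          | 66.859636 |
| 2167 | 7441    | <21  | False         | 66.659561 |
| 507  | 2781    | >=21 | True          | 70.166241 |
| 1817 | 2875    | >=21 | True          | 71.36912  |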


`1.` For 10,000 iterations, bootstrap sample your sample data, compute the difference in the average heights for coffee and non-coffee drinkers.  Build a 99% confidence interval using your sampling distribution.  Use your interval to start answering the first quiz question below.

In [3]:
boot_diffs = []
for _ in range(10000):
    bootsample = sample_data.sample(200, replace = True)
    coff_mean_height = bootsample[bootsample.drinks_coffee == True]['height'].mean()
    nocoff_mean_height = bootsample[bootsample.drinks_coffee == False]['height'].mean()
    boot_diffs.append(coff_mean_height - nocoff_mean_height)

np.percentile(boot_diffs, 0.5), np.percentile(boot_diffs, 99.5)

(0.10258900080921117, 2.538833370796657)

This is statistical evidence that coffee drinkers are on average taller than non-coffee drinkers: the entire 99% interval for the difference lies above zero.
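The bootstrap loop above can be wrapped in a reusable helper. The sketch below is one way to do that using NumPy only; the function name `bootstrap_diff_ci`, its parameters, and the synthetic heights are assumptions made for illustration, not part of the original notebook or the coffee dataset. Like the notebook, it resamples whole rows with replacement and takes percentiles of the resulting differences.

```python
import numpy as np

def bootstrap_diff_ci(values, is_group_a, n_boot=10_000, level=0.99, seed=42):
    """Percentile bootstrap interval for mean(group A) - mean(group B).

    Hypothetical helper: rows are resampled jointly, so group sizes vary
    from one bootstrap sample to the next, as in the notebook's loop.
    """
    values = np.asarray(values, dtype=float)
    is_group_a = np.asarray(is_group_a, dtype=bool)
    rng = np.random.default_rng(seed)
    n = len(values)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        v, g = values[idx], is_group_a[idx]
        diffs[i] = v[g].mean() - v[~g].mean()
    tail = (1 - level) / 2 * 100  # e.g. 0.5 for a 99% interval
    lower, upper = np.percentile(diffs, [tail, 100 - tail])
    return lower, upper, diffs

# Synthetic stand-in data (assumed values, NOT the coffee dataset).
data_rng = np.random.default_rng(0)
heights = np.concatenate([data_rng.normal(68, 3, 120), data_rng.normal(66, 3, 80)])
drinks_coffee = np.array([True] * 120 + [False] * 80)

lower, upper, diffs = bootstrap_diff_ci(heights, drinks_coffee, n_boot=2000)
print(lower, upper)
```

If the whole interval sits on one side of zero, the sign of the difference is supported at the chosen confidence level, which is the reading applied to the coffee data above.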

In every bootstrapped instance from the first question, did the difference in averages suggest that coffee drinkers are on average taller than non-coffee drinkers?

In [4]:
count = 0
for diff in boot_diffs:
    if diff < 0:
        count += 1
print(count)

30


No. In 30 of the 10,000 bootstrapped instances, the difference in averages suggests that coffee drinkers are on average shorter than non-coffee drinkers.
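The counting loop can also be written as a vectorized comparison, which gives the share of negative differences as well as the count. This is a minimal sketch; the short list `boot_diffs_example` is made-up data standing in for the 10,000 bootstrapped differences.

```python
import numpy as np

# Hypothetical stand-in for the bootstrapped differences.
boot_diffs_example = [0.8, -0.2, 1.5, 0.3, -0.1, 2.0, 0.9, 1.1]

diffs = np.asarray(boot_diffs_example)
n_negative = int((diffs < 0).sum())           # how many differences fall below zero
share_negative = float((diffs < 0).mean())    # the same count as a proportion
print(n_negative, share_negative)
```

In the notebook, 30 negative differences out of 10,000 is a share of 0.3%. That is below the 0.5% cut off by the lower end of a 99% percentile interval, which is consistent with the interval's lower bound being above zero.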

`2.` For 10,000 iterations, bootstrap sample your sample data, compute the difference in the average heights for those 21 and older and those younger than 21.  Build a 99% confidence interval using your sampling distribution.  Use your interval to finish answering the first quiz question below.

In [5]:
boot_diffs = []
for _ in range(10000):
    bootsample = sample_data.sample(200, replace = True)
    older_mean_height = bootsample[bootsample.age == '>=21']['height'].mean()
    younger_mean_height = bootsample[bootsample.age == '<21']['height'].mean()
    boot_diffs.append(older_mean_height - younger_mean_height)

np.percentile(boot_diffs, 0.5), np.percentile(boot_diffs, 99.5)

(3.3652749452554938, 5.0932450670661495)

In [6]:
count = 0
for diff in boot_diffs:
    if diff < 0:
        count += 1
print(count)

0


In every bootstrapped instance in the second question, the difference in our averages suggested that those older than 21 are on average taller than those younger than 21!

`3.` For 10,000 iterations bootstrap your sample data, compute the **difference** in the average height for coffee drinkers and the average height for non-coffee drinkers for individuals **under** 21 years old.  Using your sampling distribution, build a 95% confidence interval.  Use your interval to start answering question 2 below.

In [7]:
boot_diffs = []
for _ in range(10000):
    bootsample = sample_data.sample(200, replace = True)
    coff_mean_height = bootsample[(bootsample.age == '<21') & (bootsample.drinks_coffee == True)]['height'].mean()
    nocoff_mean_height = bootsample[(bootsample.age == '<21') & (bootsample.drinks_coffee == False)]['height'].mean()
    # Order is non-coffee minus coffee: a positive value means coffee drinkers are shorter
    boot_diffs.append(nocoff_mean_height - coff_mean_height)

np.percentile(boot_diffs, 2.5), np.percentile(boot_diffs, 97.5)

(1.0593651244624271, 2.5931557940679184)

`4.` For 10,000 iterations bootstrap your sample data, compute the **difference** in the average height for coffee drinkers and the average height for non-coffee drinkers for individuals **21 or older**.  Using your sampling distribution, build a 95% confidence interval. Use your interval to finish answering the second quiz question below, as well as the questions that follow.

In [8]:
boot_diffs = []
for _ in range(10000):
    bootsample = sample_data.sample(200, replace = True)
    coff_mean_height = bootsample[(bootsample.age != '<21') & (bootsample.drinks_coffee == True)]['height'].mean()
    nocoff_mean_height = bootsample[(bootsample.age != '<21') & (bootsample.drinks_coffee == False)]['height'].mean()
    # Order is non-coffee minus coffee: a positive value means coffee drinkers are shorter
    boot_diffs.append(nocoff_mean_height - coff_mean_height)

np.percentile(boot_diffs, 2.5), np.percentile(boot_diffs, 97.5)

(1.8278953970883662, 4.40263296547742)

For each age group, we have evidence that coffee drinkers are on average shorter than non-coffee drinkers.

The intervals in the last two questions provide statistical evidence that on average coffee drinkers are shorter than non-coffee drinkers for both age ranges.

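The intervals in the last two parts are not uniformly narrower than those from the first two parts: the interval in part 3 (width about 1.53) is the narrowest of the four, but the interval in part 4 (width about 2.57) is the widest. The first two are also 99% intervals while the last two are 95% intervals, so their widths are not directly comparable.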
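Interval widths can be checked directly from the four intervals reported above. The snippet below is a small sketch of that arithmetic; the dictionary labels are descriptive names chosen here, and the numbers are copied from the notebook outputs. Keep in mind that parts 1 and 2 are 99% intervals and parts 3 and 4 are 95% intervals, so the widths are not measured at the same confidence level.

```python
# (lower, upper) bounds copied from the notebook outputs; labels are illustrative.
intervals = {
    "part 1: coffee - non-coffee, all ages (99%)": (0.10258900080921117, 2.538833370796657),
    "part 2: 21 and older - under 21 (99%)": (3.3652749452554938, 5.0932450670661495),
    "part 3: non-coffee - coffee, under 21 (95%)": (1.0593651244624271, 2.5931557940679184),
    "part 4: non-coffee - coffee, 21 and older (95%)": (1.8278953970883662, 4.40263296547742),
}

# Width of each interval is upper bound minus lower bound.
widths = {name: upper - lower for name, (lower, upper) in intervals.items()}

for name, width in widths.items():
    print(f"{name}: width = {width:.2f}")
```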

In the first interval, we had evidence that coffee drinkers were on average taller, but in the final two intervals, we had evidence that coffee drinkers in each age group were actually shorter. This is an example of Simpson's Paradox!

We always need to watch for confounding variables, such as age in this example, because they can lead to misleading conclusions.
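A tiny made-up dataset shows how this reversal can arise. In the sketch below, the records and the helper `coffee_minus_nocoffee` are hypothetical and chosen only so the arithmetic is easy to follow: the older group is taller and contains most of the coffee drinkers, so the overall comparison points the opposite way from the comparison within each age group.

```python
from statistics import mean

# Hypothetical toy records: (age_group, drinks_coffee, height in inches).
records = (
    [("<21", True, h) for h in (63, 64)]
    + [("<21", False, h) for h in (65, 66, 65, 66, 65, 66)]
    + [(">=21", True, h) for h in (69, 70, 69, 70, 69, 70)]
    + [(">=21", False, h) for h in (71, 72)]
)

def coffee_minus_nocoffee(rows):
    """Mean height of coffee drinkers minus mean height of non-coffee drinkers."""
    coffee = [h for _, drinks, h in rows if drinks]
    nocoffee = [h for _, drinks, h in rows if not drinks]
    return mean(coffee) - mean(nocoffee)

# Comparison ignoring age, then the same comparison within each age group.
overall = coffee_minus_nocoffee(records)
by_age = {
    group: coffee_minus_nocoffee([r for r in records if r[0] == group])
    for group in ("<21", ">=21")
}
print(overall, by_age)
```

In this toy data, coffee drinkers are 1 inch taller overall yet 2 inches shorter within each age group, which is the same pattern the bootstrap intervals showed.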