### Confidence Interval - Difference In Means

Here you will look through the example from the last video, but you will also go a couple of steps further into what might actually be going on with this data.

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('coffee_dataset.csv')
sample_data = full_data.sample(200)

sample_data.head()

Unnamed: 0,user_id,age,drinks_coffee,height
2402,2874,<21,True,64.357154
2864,3670,>=21,True,66.859636
2167,7441,<21,False,66.659561
507,2781,>=21,True,70.166241
1817,2875,>=21,True,71.36912


`1.` For 10,000 iterations, bootstrap sample your sample data, compute the difference in the average heights for coffee and non-coffee drinkers.  Build a 99% confidence interval using your sampling distribution.  Use your interval to start answering the first quiz question below.

In [4]:
diff =[]
for i in range(10000):
    c = sample_data.sample(200, replace = True)
    mean1 = c[c.drinks_coffee == True]['height'].mean()
    mean2 = c[c.drinks_coffee == False]['height'].mean()
    diff.append(mean1 - mean2)
    

In [5]:
np.percentile(diff, 1), np.percentile(diff, 99)

(0.23034449860735282, 2.4056774798496261)

In [None]:
diff =[]
for i in range(10000):
    c = sample_data.sample(200, replace = True)
    mean1 = c[c.age = '>=21' ]['height'].mean()
    mean2 = c[c.drinks_coffee == False]['height'].mean()
    diff.append(mean1 - mean2)

`2.` For 10,000 iterations, bootstrap sample your sample data, compute the difference in the average heights for those older than 21 and those younger than 21.  Build a 99% confidence interval using your sampling distribution.  Use your interval to finish answering the first quiz question below.  

In [10]:
diff =[]
for i in range(10000):
    c = sample_data.sample(200, replace = True)
    mean1 = c[c.age.str.contains('>=21') ]['height'].mean()
    mean2 = c[c.age.str.contains('<21')]['height'].mean()
    diff.append(mean1 - mean2)

In [11]:
np.percentile(diff, 1), np.percentile(diff, 99)

(3.4474176175626794, 5.0334842841937624)

`3.` For 10,000 iterations bootstrap your sample data, compute the **difference** in the average height for coffee drinkers and the average height for non-coffee drinkers for individuals **under** 21 years old.  Using your sampling distribution, build a 95% confidence interval.  Use your interval to start answering question 2 below.

In [14]:
diff =[]
for i in range(10000):
    c = sample_data.sample(200, replace = True)
    mean1 = c[c.age.str.contains('<21') & c.drinks_coffee == True ]['height'].mean()
    mean2 = c[c.age.str.contains('<21')& c.drinks_coffee == False]['height'].mean()
    diff.append(mean2 - mean1)
    
np.percentile(diff, 1), np.percentile(diff, 99)

(3.4459551396569745, 4.8051405400504157)

In [16]:
np.percentile(diff, 2.5), np.percentile(diff, 97.5)

(-3.8150919329591799, -2.3385501253452854)

`4.` For 10,000 iterations bootstrap your sample data, compute the **difference** in the average height for coffee drinkers and the average height for non-coffee drinkers for individuals **over** 21 years old.  Using your sampling distribution, build a 95% confidence interval. Use your interval to finish answering the second quiz question below. As well as the following questions. 

In [15]:
diff =[]
for i in range(10000):
    c = sample_data.sample(200, replace = True)
    mean1 = c[c.age.str.contains('>=21') & c.drinks_coffee == True ]['height'].mean()
    mean2 = c[c.age.str.contains('>=21')& c.drinks_coffee == False]['height'].mean()
    diff.append(mean2 - mean1)
    
np.percentile(diff, 1), np.percentile(diff, 99)

(-3.944117113699475, -2.1797059017439482)

In [17]:
np.percentile(diff, 2.5), np.percentile(diff, 97.5)

(-3.8150919329591799, -2.3385501253452854)

In [18]:
diffs_coff_over21 = []
for _ in range(10000):
    bootsamp = sample_data.sample(200, replace = True)
    over21_coff_mean = bootsamp.query("age != '<21' and drinks_coffee == True")['height'].mean()
    over21_nocoff_mean = bootsamp.query("age != '<21' and drinks_coffee == False")['height'].mean()
    diffs_coff_over21.append(over21_nocoff_mean - over21_coff_mean)
    
np.percentile(diffs_coff_over21, 2.5), np.percentile(diffs_coff_over21, 97.5)

(1.8482176570248883, 4.4235646586174928)

In [20]:
diffs_coff_under21 = []
for _ in range(10000):
    bootsamp = sample_data.sample(200, replace = True)
    over21_coff_mean = bootsamp.query("age == '<21' and drinks_coffee == True")['height'].mean()
    over21_nocoff_mean = bootsamp.query("age == '<21' and drinks_coffee == False")['height'].mean()
    diffs_coff_over21.append(ovTrueer21_nocoff_mean - over21_coff_mean)
    
np.percentile(diffs_coff_over21, 2.5), np.percentile(diffs_coff_over21, 97.5)

(1.0825564833083945, 2.6173678470272996)

In [56]:
c = sample_data.sample(200, replace = True)
c[(c.age.str.contains('<21')) & (c.drinks_coffee == True)]

Unnamed: 0,user_id,age,drinks_coffee,height
2402,2874,<21,True,64.357154
557,6101,<21,True,64.054247
557,6101,<21,True,64.054247
2333,7277,<21,True,64.004553
368,5182,<21,True,63.973306
1593,4434,<21,True,63.938056
2545,5094,<21,True,63.89838
594,6143,<21,True,62.782455
1093,3752,<21,True,63.459866
1739,5282,<21,True,62.864208


In [29]:
bootsamp = sample_data.sample(200, replace = True)
bootsamp.query("age == '>=21' and drinks_coffee == True")['height'].mean()


69.627423923066928

In [50]:
bootsamp = sample_data.sample(200, replace = True)
bootsamp.query("age == '>=21' and drinks_coffee == False")

Unnamed: 0,user_id,age,drinks_coffee,height
2622,2958,>=21,False,72.245746
2837,4327,>=21,False,72.593112
2682,5483,>=21,False,71.145025
1334,7348,>=21,False,71.289814
1334,7348,>=21,False,71.289814
2622,2958,>=21,False,72.245746
2837,4327,>=21,False,72.593112
1253,8059,>=21,False,71.010834
2682,5483,>=21,False,71.145025
2682,5483,>=21,False,71.145025
