# Practice notebook for confidence intervals using NHANES data

This notebook will give you the opportunity to practice working with confidence intervals using the NHANES data.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm

da = pd.read_csv("nhanes_2015_2016.csv")

## Question 1

Restrict the sample to women between 35 and 50, then use the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to partition this sample into two groups - women who are currently married, and women who are not currently married.  Within each of these groups, calculate the proportion of women who have completed college.  Calculate 95% confidence intervals for each of these proportions.

In [16]:

da2 = da[['RIDAGEYR','DMDMARTL','DMDEDUC2','RIAGENDR']]
da2 = da2[da2['RIAGENDR']==2]
da2 = da2[da2['RIDAGEYR']>=35]
da2 = da2[da2['RIDAGEYR']<=50]
total = da2.size

da2['DMDMARTL'] = da2.DMDMARTL.replace({1:"Married", 2:"Not Married",3:"Not Married",4:"Not Married", 5:"Not Married",6:"Not Married",77:"Not Married",99:"Not Married"})
da2.rename(columns = {'RIDAGEYR':'Age','DMDMARTL':'Married?','DMDEDUC2':'Collage?'}, inplace = True)

total_married = da2[da2['Married?']=="Married"]
total_married_collage_count = total_married[total_married['Collage?']==5].size
# 95% CI for the proportion of married women who completed collage
ci_collage = sm.stats.proportion_confint(total_married_collage_count, total_married.size)
# 95% CI for the proportion of married women who did not complete collage
ci_no_collage = sm.stats.proportion_confint(total_married.size - total_married_collage_count, total_married.size)

total_not_married = da2[da2['Married?']!= "Married"]
total_not_married_collage_count = total_not_married[total_not_married['Collage?']==5].size
# 95% CI for the proportion of married women who completed collage
ci_not_married_collage = sm.stats.proportion_confint(total_not_married_collage_count, total_not_married.size)
# 95% CI for the proportion of married women who did not complete collage
ci_not_married_no_collage = sm.stats.proportion_confint(total_not_married.size - total_not_married_collage_count, total_not_married.size)


print("CI for married collage", ci_collage, "total married is", total_married.size)
print("CI for married not going to collage", ci_no_collage)
print("CI for unmarried collage", ci_not_married_collage, "total unmarried is", total_not_married.size)
print("CI for unmarried not going to collage", ci_not_married_no_collage)

CI for married collage (0.338591859422365, 0.38301170405202245) total married is 1796
CI for married not going to collage (0.6169882959479774, 0.661408140577635)
CI for unmarried collage (0.1911929595750938, 0.2348425433834861) total unmarried is 1352
CI for unmarried not going to collage (0.7651574566165139, 0.8088070404249063)


__Q1a.__ Identify which of the two confidence intervals is wider, and explain why this is the case. 

The condidence interval for Married women who atttend collage is wider. This is likely due to the sample sizes used

__Q1b.__ Write 1-2 sentences summarizing these findings for an audience that does not know what a confidence interval is (the goal here is to report the substance of what you learned about how marital status and educational attainment are related, not to teach a person what a confidence interval is).

From the data above we have a 95% confidence that between 34% and 38% of married women complted collage, while between 19% and 23% of currenlty unmarried women did not complete collage. 
We don't have enough dat to draw any conculsions regarding the underlying reasons for this disparrity. 

## Question 2

Construct 95% confidence intervals for the proportion of smokers who are female, and for the proportion of smokers who are male.  Then construct a 95% confidence interval for the difference between these proportions.

In [20]:
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})  # np.nan represents a missing value
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
dx = da[["SMQ020x", "RIAGENDRx"]].dropna()  # dropna drops cases where either variable is missing
pd.crosstab(dx.SMQ020x, dx.RIAGENDRx)
male_smoke = sm.stats.proportion_confint(1413, 1340+1413)
female_smoke = sm.stats.proportion_confint(906, 906+2066)  
print("Confidence Interval Females ", female_smoke)
print("Confidence Interval Males", male_smoke)
#work out confidence interval of 2 proportions 
dz = dx.groupby(dx.RIAGENDRx).agg({"SMQ020x": [lambda x: np.mean(x=="Yes"), np.size]})
dz.columns = ["Proportion", "Total_n"] # The default column names are unclear, so we replace them here
p = dz.Proportion.Female # Female proportion
n = dz.Total_n.Female # Total number of females
se_female = np.sqrt(p * (1 - p) / n)
print(se_female)
p = dz.Proportion.Male # Male proportion
n = dz["Total_n"].Male # Total number of males
se_male = np.sqrt(p * (1 - p) / n)
print(se_male)
se_diff = np.sqrt(se_female**2 + se_male**2)
d = dz.Proportion.Female - dz.Proportion.Male
lcb = d - 2*se_diff
ucb = d + 2*se_diff
print(lcb, ucb)
print(d)

Confidence Interval Females  (0.2882949879861214, 0.32139545615923526)
Confidence Interval Males (0.49458749263718593, 0.5319290347874418)
0.008444152146214435
0.009526078653689868
-0.2338728044024504 -0.18295327887682067
-0.20841304163963553


__Q2a.__ Discuss why it may be relevant to report the proportions of smokers who are female and male, and contrast this to reporting the proportions of males and females who smoke.

It seems there is a clear differance betweent he 2 genders it therefore makes more sense to seprate them.  

__Q2b.__ How does the width of the confidence interval for the difference of the two proportions compare to the widths of the confidence intervals for each proportion separately?

The width is similar which make sense and it is representing the same orginal data in a new form. 

## Question 3

Construct a 95% interval for height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)) in centimeters.  Then convert height from centimeters to inches by dividing by 2.54, and construct a 95% confidence interval for height in inches.  Finally, convert the endpoints (the lower and upper confidence limits) of the confidence interval from inches to back to centimeters   

In [37]:
height = da['BMXHT']
height_sample = height.loc[:49]
se = np.std(height_sample)/np.sqrt(50)
mean = np.mean(height_sample)
lcb_height = mean - 1.96 * se
ucb_height = mean + 1.96 * se
print(lcb_height, ucb_height, " cm")
height_sample_inches = height_sample / 2.54

se_inch = np.std(height_sample_inches)/np.sqrt(50)
mean_inch = np.mean(height_sample_inches)
lcb_height_inch = mean_inch - 1.96 * se
ucb_height_inch = mean_inch + 1.96 * se
print(lcb_height_inch, ucb_height_inch, " Inch")

back_to_cm_lcb = lcb_height_inch * 2.54
back_to_cm_ucb = ucb_height_inch * 2.54
print(back_to_cm_lcb, back_to_cm_ucb, " back to cm")

163.44722726353805 169.540772736462  cm
62.50204616117581 68.59559163409976  Inch
158.75519724938655 174.2328027506134  back to cm


__Q3a.__ Describe how the confidence interval constructed in centimeters relates to the confidence interval constructed in inches.

The data above indicates that the accuarcy is less when using inchs for the calculation. When orginally calculated using cms the width of the confidece interval was approximatly 6 cm while it was approximately 16 cm when the clauations where done in inches and converted back to cm's   

## Question 4

Partition the sample based on 10-year age bands, i.e. the resulting groups will consist of people with ages from 18-28, 29-38, etc. Construct 95% confidence intervals for the difference between the mean BMI for females and for males within each age band.

In [46]:
# enter your code here
# Calculate the mean, SD, and sample size for BMI within age/gender groups
da["agegrp"] = pd.cut(da.RIDAGEYR, [18, 28, 38, 48, 58, 68, 78])
pr = da.groupby(["agegrp", "RIAGENDRx"]).agg({"BMXBMI": [np.mean, np.std, np.size]}).unstack()
# Calculate the SEM for females and for males within each age band
pr["BMXBMI", "sem", "Female"] = pr["BMXBMI", "std", "Female"] / np.sqrt(pr["BMXBMI", "size", "Female"]) 
pr["BMXBMI", "sem", "Male"] = pr["BMXBMI", "std", "Male"] / np.sqrt(pr["BMXBMI", "size", "Male"]) 

# Calculate the mean difference of BMI between females and males within each age band, also  calculate
# its SE and the lower and upper limits of its 95% CI.
pr["BMXBMI", "mean_diff", ""] = pr["BMXBMI", "mean", "Female"] - pr["BMXBMI", "mean", "Male"]
pr["BMXBMI", "sem_diff", ""] = np.sqrt(pr["BMXBMI", "sem", "Female"]**2 + pr["BMXBMI", "sem", "Male"]**2) 
pr["BMXBMI", "lcb_diff", ""] = pr["BMXBMI", "mean_diff", ""] - 1.96 * pr["BMXBMI", "sem_diff", ""] 
pr["BMXBMI", "ucb_diff", ""] = pr["BMXBMI", "mean_diff", ""] + 1.96 * pr["BMXBMI", "sem_diff", ""]

pr["BMXBMI", "width", ""] = pr["BMXBMI", "lcb_diff", ""] -  pr["BMXBMI", "ucb_diff", ""]

pr

Unnamed: 0_level_0,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI
Unnamed: 0_level_1,mean,mean,std,std,size,size,sem,sem,mean_diff,sem_diff,lcb_diff,ucb_diff,width
RIAGENDRx,Female,Male,Female,Male,Female,Male,Female,Male,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
agegrp,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3
"(18, 28]",28.019433,27.058186,8.048854,6.679515,498.0,458.0,0.360678,0.312113,0.961247,0.476973,0.026381,1.896114,-1.869734
"(28, 38]",29.943443,29.69718,7.959097,6.72669,494.0,467.0,0.358097,0.311274,0.246263,0.474473,-0.683705,1.17623,-1.859935
"(38, 48]",31.003733,29.514646,8.044642,6.10495,514.0,398.0,0.354834,0.306014,1.489086,0.468563,0.570703,2.40747,-1.836767
"(48, 58]",30.787361,29.385132,7.64759,6.151534,454.0,419.0,0.358919,0.300522,1.40223,0.46812,0.484714,2.319745,-1.835031
"(58, 68]",31.054664,29.232462,7.779502,5.959024,466.0,470.0,0.360378,0.274869,1.822202,0.453239,0.933853,2.710551,-1.776698
"(68, 78]",30.537818,28.72027,6.780588,5.336652,279.0,307.0,0.405943,0.304579,1.817548,0.507502,0.822844,2.812252,-1.989407


__Q4a.__ How do the widths of these confidence intervals differ?  Provide an explanation for any substantial diferences in the confidence interval widths that you see.

Between the ages 18 to 58 the widths remain fairly similar but once you past 58 the ranges are much bigger suggesting BMI differances between male and female are much greater in older people. 

## Question 5

Construct a 95% confidence interval for the first and second systolic blood pressure measures, and for the difference between the first and second systolic blood pressure measurements within a subject.

In [54]:
# enter code here 'BPXSY1' 'BPXSY2'
sys_1 = da['BPXSY1']
sys_2 = da['BPXSY2']
#grab a sample of 50 
sys_1 = sys_1.loc[:49]
sys_2 = sys_2.loc[:49]
se_sys_1 = np.std(sys_1)/np.sqrt(50)
se_sys_2 = np.std(sys_2)/np.sqrt(50)
mean_sys_1 = np.mean(sys_1)
mean_sys_2 = np.mean(sys_2)

lcb_sys_1 = mean_sys_1 - 1.96 * se_sys_1
ucb_sys_1 = mean_sys_1 + 1.96 * se_sys_1
lcb_sys_2 = mean_sys_2 - 1.96 * se_sys_2
ucb_sys_2 = mean_sys_2 + 1.96 * se_sys_2

print("Confidence interval 1st Systolic preasure", " (", lcb_sys_1, ucb_sys_1,")")
print("Confidence interval 2nd Systolic preasure", " (", lcb_sys_2, ucb_sys_2,")")

se_diff = se_sys_1 - se_sys_2
mean_diff = mean_sys_1 - mean_sys_2
lcb_sys_diff = mean_diff - 1.96 * se_diff
ucb_sys_diff = mean_diff + 1.96 * se_diff

print("Confidence interval between the 2 Systolic preasures"," (", lcb_sys_diff, ucb_sys_diff,")")


Confidence interval 1st Systolic preasure  ( 122.94855793766261 132.7249114500925 )
Confidence interval 2nd Systolic preasure  ( 122.51324015181243 132.2622700522692 )
Confidence interval between the 2 Systolic preasures  ( 0.4353177858501752 0.46264139782328895 )


__Q5a.__ Based on these confidence intervals, would you say that a difference of zero between the population mean values of the first and second systolic blood pressure measures is consistent with the data?

no, there is a samll differance which may be accounatble in testing erorrs but is still there. 


__Q5b.__ Discuss how the width of the confidence interval for the within-subject difference compares to the widths of the confidence intervals for the first and second measures.

The width between the first and second systolic preasures is very similar and the width of the interveral between the 2 data sets in much smaller which confirms that the orginal 2 are very close in the informaiton they show. 

## Question 6

Construct a 95% confidence interval for the mean difference between the average age of a smoker, and the average age of a non-smoker.

In [64]:
# insert your code here SMQ020x
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
dx = da[["SMQ020x", "RIDAGEYR"]].dropna()
#get a sample 
dx = dx.loc[:49]
total_count = dx['SMQ020x'].size
smokers = dx[dx['SMQ020x']=="Yes"]
non_smokers = dx[dx['SMQ020x']=="No"]
ave_age_smokers = np.mean(smokers['RIDAGEYR'])
se_smokers = np.std(smokers['RIDAGEYR'])/np.sqrt(50)
ave_age_non_smokers = np.mean(non_smokers['RIDAGEYR'])
se_non_smokers = np.std(non_smokers["RIDAGEYR"])/np.sqrt(50)

lcb_ci_mean_diff = (ave_age_non_smokers - ave_age_smokers) - 1.96 * (se_smokers - se_non_smokers)
ucb_ci_mean_diff = (ave_age_non_smokers - ave_age_smokers) + 1.96 * (se_smokers - se_non_smokers)
print("Confidence interval for mean differance", lcb_ci_mean_diff, ucb_ci_mean_diff)

Confidence interval for mean differance -6.064396378778839 -6.123103621221161


__Q6a.__ Use graphical and numerical techniques to compare the variation in the ages of smokers to the variation in the ages of non-smokers.  

In [1]:
# insert your code here


__Q6b.__ Does it appear that uncertainty about the mean age of smokers, or uncertainty about the mean age of non-smokers contributed more to the uncertainty for the mean difference that we are focusing on here?

It seems the variation in non smokers contributed more 