# Practice notebook for confidence intervals using NHANES data

[github link](https://github.com/karimkmafifi/Inferential-Statistical-Analysis-with-Python---Coursera/blob/master/nhanes_confidence_intervals_practice.ipynb)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

da = pd.read_csv("nhanes_2015_2016.csv")

## Question 1

Restrict the sample to women between 35 and 50, then use the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to partition this sample into two groups - women who are currently married, and women who are not currently married.  Within each of these groups, calculate the proportion of women who have completed college.  Calculate 95% confidence intervals for each of these proportions.

In [2]:
da['DMDMARTL'] = da['DMDMARTL'].replace({1:'married', 2:'widowed', 
                                         3:'divorced', 4:'separated',
                                        5:'never married', 6:'living with partner',
                                        77:'refused', 99:"don't know", '.':'missing'})

da['DMDMARTL'].unique()

array(['married', 'divorced', 'living with partner', 'separated',
       'never married', nan, 'widowed', 'refused'], dtype=object)

In [3]:
da['RIAGENDR'] = da['RIAGENDR'].replace({1:'male', 2:'female', '.':'missing'})
da['RIAGENDR'].unique()

array(['male', 'female'], dtype=object)

In [4]:
da.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210'],
      dtype='object')

In [5]:
da_female_married = da[(da['RIDAGEYR'].between(35, 50)) & (da['RIAGENDR']=='female') & (da['DMDMARTL']=='married')]
da_female_married.head()

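| | SEQN | ALQ101 | ALQ110 | ALQ130 | SMQ020 | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDCITZN | DMDEDUC2 | ... | BPXSY2 | BPXDI2 | BMXWT | BMXHT | BMXBMI | BMXLEG | BMXARML | BMXARMC | BMXWAIST | HIQ210 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | 83799 | NaN | NaN | NaN | 2 | female | 37 | 2 | 1.0 | 4.0 | ... | 110.0 | 72.0 | 66.6 | 161.6 | 25.5 | NaN | NaN | NaN | NaN | 2.0 |
| 50 | 83828 | 1.0 | NaN | 2.0 | 2 | female | 39 | 1 | 2.0 | 3.0 | ... | 100.0 | 62.0 | 71.3 | 162.0 | 27.2 | 36.8 | 34.6 | 29.1 | 94.6 | NaN |
| 55 | 83837 | 2.0 | 2.0 | NaN | 2 | female | 45 | 1 | 1.0 | 2.0 | ... | 114.0 | 68.0 | 77.5 | 148.3 | 35.2 | 30.5 | 34.0 | 34.4 | 107.6 | 2.0 |
| 61 | 83851 | 2.0 | 1.0 | 1.0 | 1 | female | 37 | 3 | 1.0 | 3.0 | ... | 122.0 | 74.0 | 85.1 | 155.3 | 35.3 | 32.5 | 33.6 | 36.1 | 106.5 | 1.0 |
| 62 | 83853 | NaN | NaN | NaN | 2 | female | 49 | 3 | 1.0 | 3.0 | ... | 116.0 | 84.0 | 76.1 | 166.7 | 27.4 | 39.2 | 38.6 | 32.6 | 88.7 | 2.0 |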


In [6]:
# Women aged 35 to 50 who are not currently married
# (the != filter also keeps any rows where marital status is missing or refused)
da_female_unmarried = da[(da['RIDAGEYR'].between(35, 50)) & (da['RIAGENDR']=='female') & (da['DMDMARTL']!='married')]
da_female_unmarried.head()

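| | SEQN | ALQ101 | ALQ110 | ALQ130 | SMQ020 | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDCITZN | DMDEDUC2 | ... | BPXSY2 | BPXDI2 | BMXWT | BMXHT | BMXBMI | BMXLEG | BMXARML | BMXARMC | BMXWAIST | HIQ210 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 83736 | 2.0 | 1.0 | 1.0 | 2 | female | 42 | 4 | 1.0 | 4.0 | ... | 114.0 | 54.0 | 55.2 | 164.9 | 20.3 | 37.4 | 36.0 | 27.2 | 80.4 | 2.0 |
| 52 | 83832 | 2.0 | 1.0 | 4.0 | 2 | female | 50 | 1 | 2.0 | 1.0 | ... | NaN | NaN | 105.9 | 157.7 | 42.6 | 29.2 | 35.0 | 40.7 | 129.1 | NaN |
| 58 | 83845 | 1.0 | NaN | NaN | 1 | female | 44 | 4 | 1.0 | 1.0 | ... | 116.0 | 78.0 | 133.3 | 171.5 | 45.3 | 37.3 | 35.7 | 48.7 | NaN | 2.0 |
| 100 | 83911 | 1.0 | NaN | 1.0 | 2 | female | 43 | 4 | 1.0 | 4.0 | ... | 126.0 | 76.0 | 91.1 | 172.3 | 30.7 | 40.4 | 38.2 | 34.6 | 101.6 | NaN |
| 127 | 83958 | 2.0 | 1.0 | 2.0 | 1 | female | 47 | 4 | 1.0 | 3.0 | ... | 148.0 | 76.0 | 58.6 | 160.6 | 22.7 | 36.8 | 35.2 | 28.0 | 77.9 | NaN |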


In [7]:
# Proportion of college graduates (DMDEDUC2 == 5) among married and unmarried women
n_married = da_female_married.shape[0]
n_unmarried = da_female_unmarried.shape[0]

married_after_college = len(da_female_married[da_female_married['DMDEDUC2']==5]) 
unmarried_after_college = len(da_female_unmarried[da_female_unmarried['DMDEDUC2']==5])

married_proportion = married_after_college / n_married
unmarried_proportion = unmarried_after_college / n_unmarried

married_after_college, married_proportion, unmarried_after_college, unmarried_proportion

(162, 0.36080178173719374, 72, 0.21301775147928995)

In [8]:
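# Note: despite the names, these hold p * (1 - p), the variance term.
# The standard error is sqrt(p * (1 - p) / n), which is computed in the next cells.
se_married = married_proportion * (1 - married_proportion)
se_unmarried = unmarried_proportion * (1 - unmarried_proportion)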

In [9]:
# married
print("LCB (married)", married_proportion - 1.96 * np.sqrt(se_married/n_married))
print("UCB (married)", married_proportion + 1.96 * np.sqrt(se_married/n_married))

LCB (married) 0.3163811208674688
UCB (married) 0.4052224426069187


In [10]:
# unmarried
print("LCB (unmarried)", unmarried_proportion - 1.96 * np.sqrt(se_unmarried/n_unmarried))
print("UCB (unmarried)", unmarried_proportion + 1.96 * np.sqrt(se_unmarried/n_unmarried))

LCB (unmarried) 0.1693673655848136
UCB (unmarried) 0.25666813737376626
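
The manual calculation in the two cells above can be wrapped in a small helper. This is a minimal sketch rather than part of the original notebook: the function name `wald_ci` is hypothetical, and the group sizes 449 and 338 are inferred from the counts and proportions printed earlier (162 / 0.3608 and 72 / 0.2130), since the notebook does not print them directly.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(successes, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Group sizes are inferred from the printed counts and proportions (assumption).
married_ci = wald_ci(162, 449)
unmarried_ci = wald_ci(72, 338)
print("married:", married_ci)
print("unmarried:", unmarried_ci)
```

Because the helper uses the exact normal quantile instead of the rounded 1.96, its bounds differ from the manual ones only from about the sixth decimal place onward.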


In [11]:
# for married female
lcb, ucb = sm.stats.proportion_confint(count=married_after_college, nobs=n_married)
print(f"LCB (married) = {lcb} and UCB (married) = {ucb}")
ucb - lcb

LCB (married) = 0.31638193710753626 and UCB (married) = 0.4052216263668512


0.08883968925931496

In [12]:
# for unmarried female
lcb, ucb = sm.stats.proportion_confint(count=unmarried_after_college, nobs=n_unmarried)
print(f"LCB (unmarried) = {lcb} and UCB (unmarried) = {ucb}")
ucb - lcb

LCB (unmarried) = 0.16936816767089768 and UCB (unmarried) = 0.2566673352876822


0.08729916761678452

__Q1a.__ Identify which of the two confidence intervals is wider, and explain why this is the case. 

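The first confidence interval (for married women) is slightly wider: about 0.089 versus 0.087. The width of this interval is proportional to `sqrt(p * (1 - p) / n)`. The married group's proportion (0.36) is closer to 0.5 than the unmarried group's (0.21), which makes `p * (1 - p)` larger, and this effect outweighs the married group's larger sample size.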
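
The width of a normal-approximation interval is `2 * z * sqrt(p * (1 - p) / n)`, so it depends on both the proportion and the sample size. The sketch below separates those two ingredients for each group. The group sizes are an assumption inferred from the printed counts and proportions, not values the notebook prints.

```python
from math import sqrt

# Counts printed earlier; group sizes are inferred from count / proportion (assumption).
groups = {"married": (162, 449), "unmarried": (72, 338)}

widths = {}
for name, (successes, n) in groups.items():
    p = successes / n
    variance_term = p * (1 - p)  # largest when p is near 0.5
    widths[name] = 2 * 1.96 * sqrt(variance_term / n)
    print(f"{name}: p(1-p) = {variance_term:.4f}, n = {n}, width = {widths[name]:.4f}")
```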

__Q1b.__ Write 1-2 sentences summarizing these findings for an audience that does not know what a confidence interval is (the goal here is to report the substance of what you learned about how marital status and educational attainment are related, not to teach a person what a confidence interval is).

## Question 2

Construct a 95% confidence interval for the proportion of smokers who are female. Construct a 95% confidence interval for the proportion of smokers who are male. Construct a 95% confidence interval for the **difference** between those two gender proportions.

In [13]:
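da["SMQ020x"]= da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
# RIAGENDR was already recoded to 'male'/'female' in cell 3, so this replace
# finds no 1/2 codes and RIAGENDRx keeps the lowercase labels.
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})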

In [14]:
dx = da[["SMQ020x", "RIAGENDRx"]].dropna()
dx

| | SMQ020x | RIAGENDRx |
|---|---|---|
| 0 | Yes | male |
| 1 | Yes | male |
| 2 | Yes | male |
| 3 | No | female |
| 4 | No | female |
| ... | ... | ... |
| 5730 | Yes | female |
| 5731 | No | male |
| 5732 | Yes | female |
| 5733 | Yes | male |


In [15]:
dy = pd.crosstab(columns=dx.RIAGENDRx, index=dx.SMQ020x)
dy

| SMQ020x \ RIAGENDRx | female | male |
|---|---|---|
| No | 2066 | 1340 |
| Yes | 906 | 1413 |


In [16]:
dz = dx.groupby(dx.RIAGENDRx).agg({"SMQ020x": [lambda x: np.mean(x=="Yes"), np.size]})
dz.columns = ["Proportion", "Count"]
dz

| RIAGENDRx | Proportion | Count |
|---|---|---|
| female | 0.304845 | 2972 |
| male | 0.513258 | 2753 |


In [45]:
female_p = dz.Proportion.female
male_p = dz.Proportion.male
female_n = dz.Count.female
male_n = dz.Count.male
print(female_n, male_n)

2972 2753


In [46]:
# standard error (female)
female_se = np.sqrt(female_p * (1 - female_p) / female_n)
print("Standard error (female) =", female_se)

# standard error (male)
male_se = np.sqrt(male_p * (1 - male_p) / male_n)
print("Standard error (male) =", male_se)

Standard error (female) = 0.008444152146214435
Standard error (male) = 0.009526078653689868


In [52]:
# for female
female_lcb = female_p - 1.96 * female_se
female_ucb = female_p + 1.96 * female_se
print(f"LCB (female) = {female_lcb} and UCB (female) = {female_ucb}")

# for male
male_lcb = male_p - 1.96 * male_se
male_ucb = male_p + 1.96 * male_se
print(f"LCB (male) = {male_lcb} and UCB (male) = {male_ucb}")

LCB (female) = 0.288294683866098 and UCB (female) = 0.32139576027925865
LCB (male) = 0.49458714955108174 and UCB (male) = 0.531929377873546


**Using Statsmodels**

In [48]:
# for female smoker
lcb, ucb = sm.stats.proportion_confint(count=dy.female.Yes , nobs=dy.female.Yes + dy.female.No)
lcb, ucb

(0.2882949879861214, 0.32139545615923526)

In [49]:
# for male smoker
lcb, ucb = sm.stats.proportion_confint(count=dy.male.Yes , nobs=dy.male.Yes + dy.male.No)
lcb, ucb

(0.49458749263718593, 0.5319290347874418)

In [53]:
diff_se = np.sqrt(female_se**2 + male_se**2)
print(diff_se)

0.012729881381407434


In [55]:
# male proportion - female proportion
# (2 is used here as a rounded stand-in for the 1.96 critical value used above)
diff_p = dz.Proportion.male - dz.Proportion.female
lcb = diff_p - 2  * diff_se
ucb = diff_p + 2 * diff_se
print(f"LCB = {lcb} and UCB = {ucb}")

LCB = 0.18295327887682067 and UCB = 0.2338728044024504


In [56]:
# female proportion - male proportion: same interval with the signs flipped
# (again using 2 as a rounded stand-in for 1.96)
diff_p = dz.Proportion.female - dz.Proportion.male
lcb = diff_p - 2  * diff_se
ucb = diff_p + 2 * diff_se
print(f"LCB = {lcb} and UCB = {ucb}")

LCB = -0.2338728044024504 and UCB = -0.18295327887682067
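
The difference-of-proportions interval can also be computed directly from the crosstab counts. The following is a minimal sketch with a hypothetical helper name, `diff_proportion_ci`. It assumes the two gender groups are independent samples and uses the exact normal quantile (about 1.96) rather than the rounded value 2, so its bounds are slightly narrower than those printed above.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, level=0.95):
    """Wald interval for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    # Variances of independent estimates add, so the SEs combine in quadrature.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Smoker counts and group sizes taken from the crosstab above.
male_minus_female = diff_proportion_ci(1413, 2753, 906, 2972)
female_minus_male = diff_proportion_ci(906, 2972, 1413, 2753)
print(male_minus_female)
print(female_minus_male)
```

Swapping the order of the groups only flips the signs of the bounds, which is why the two intervals above mirror each other.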


__Q2a.__ Why might it be relevant to report the separate gender proportions **and** the difference between the gender proportions?

__Q2b.__ How does the **width** of the confidence interval for the difference between the gender proportions compare to the widths of the confidence intervals for the separate gender proportions?
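
One way to explore this question numerically is to compare the three interval widths side by side. The sketch below reuses the standard errors printed earlier and assumes the 1.96 critical value for all three intervals. The dictionary name `widths` is illustrative.

```python
from math import sqrt

# Standard errors printed earlier in the notebook.
female_se = 0.008444152146214435
male_se = 0.009526078653689868

# For independent groups, the SE of the difference combines both SEs in quadrature.
diff_se = sqrt(female_se**2 + male_se**2)

z = 1.96
widths = {
    "female": 2 * z * female_se,
    "male": 2 * z * male_se,
    "difference": 2 * z * diff_se,
}
for name, width in widths.items():
    print(f"{name}: width = {width:.4f}")
```

The difference interval is wider than either single-group interval, because it carries the sampling uncertainty of both estimates, but narrower than the sum of the two widths.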

## Question 3

Construct a 95% confidence interval for height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)) in centimeters.  Then convert height from centimeters to inches by dividing by 2.54, and construct a 95% confidence interval for height in inches.  Finally, convert the endpoints (the lower and upper confidence limits) of the confidence interval from inches back to centimeters.

In [58]:
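da["BMXHTi"] = da["BMXHT"] / 2.54
# Keep only the non-missing heights. Calling dropna(inplace=True) on a selected
# column does not reliably change `da`, so the result is stored in its own variable.
height_in = da["BMXHTi"].dropna()

In [60]:
mean_h = np.mean(height_in)
std_h = np.std(height_in)  # np.std defaults to ddof=0
count_h = len(height_in)   # number of non-missing heights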

lcb = mean_h - 1.96 * (std_h / np.sqrt(count_h))
ucb = mean_h + 1.96 * (std_h / np.sqrt(count_h))
print(f"LCB (inches) = {lcb} and UCB (inches) = {ucb}")

LCB (inches) = 65.30787049524722 and UCB (inches) = 65.51325901586426


In [61]:
print(f"LCB (cm) = {lcb * 2.54} and UCB (cm) = {ucb * 2.54}")

LCB (cm) = 165.88199105792793 and UCB (cm) = 166.40367790029524


__Q3a.__ Describe how the confidence interval constructed in centimeters relates to the confidence interval constructed in inches.
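
The relationship can be checked numerically. The sketch below uses synthetic heights (illustrative only, not NHANES data) and a hypothetical helper `mean_ci`. Unlike the notebook cell above, it uses the sample standard deviation (`ddof=1`). It builds the interval once in centimeters and once in inches, then converts the inch endpoints back to centimeters.

```python
import numpy as np


def mean_ci(values, z=1.96):
    """Normal-approximation 95% confidence interval for a mean."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop missing measurements
    se = values.std(ddof=1) / np.sqrt(values.size)
    return values.mean() - z * se, values.mean() + z * se


# Synthetic heights in centimeters (illustrative only, not NHANES data).
rng = np.random.default_rng(0)
height_cm = rng.normal(loc=166.0, scale=10.0, size=500)

ci_cm = mean_ci(height_cm)
ci_in = mean_ci(height_cm / 2.54)
ci_in_as_cm = tuple(bound * 2.54 for bound in ci_in)

print(ci_cm)
print(ci_in_as_cm)
```

Both the mean and the standard error scale by the same factor when the unit changes, so the two printed intervals agree up to floating-point rounding.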

## Question 4

Partition the sample based on 10-year age bands, i.e. the resulting groups will consist of people with ages from 18-28, 29-38, etc. Construct 95% confidence intervals for the difference between the mean BMI for females and for males within each age band.
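
One possible approach is sketched below, under several assumptions: the helper name `bmi_diff_ci_by_age` is hypothetical, the bin edges are one reading of the age bands described above, `RIAGENDR` is assumed to be already recoded to `'female'`/`'male'` as in cell 3, and the demonstration runs on synthetic data rather than the NHANES file. It uses `pd.cut` to form the age bands and the unpooled standard error for the difference of two means.

```python
import numpy as np
import pandas as pd


def bmi_diff_ci_by_age(df, bins, z=1.96):
    """CI for mean BMI (female) minus mean BMI (male) within each age band."""
    data = df[["RIDAGEYR", "RIAGENDR", "BMXBMI"]].dropna()
    data = data.assign(age_band=pd.cut(data["RIDAGEYR"], bins))
    stats = (
        data.groupby(["age_band", "RIAGENDR"], observed=True)["BMXBMI"]
        .agg(["mean", "std", "count"])
        .unstack("RIAGENDR")
    )
    se_sq = stats["std"] ** 2 / stats["count"]  # squared SE of each group mean
    diff = stats["mean"]["female"] - stats["mean"]["male"]
    se_diff = np.sqrt(se_sq["female"] + se_sq["male"])
    return pd.DataFrame({"diff": diff, "lcb": diff - z * se_diff, "ucb": diff + z * se_diff})


# Synthetic stand-in for the NHANES columns (illustrative only).
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "RIDAGEYR": rng.integers(18, 81, size=n),
    "RIAGENDR": rng.choice(["female", "male"], size=n),
    "BMXBMI": rng.normal(29.0, 6.0, size=n),
})

# Right-closed bins: (17, 28] covers ages 18-28, (28, 38] covers 29-38, and so on.
result = bmi_diff_ci_by_age(demo, bins=[17, 28, 38, 48, 58, 68, 80])
print(result)
```

With the real data, the same call would take `da` in place of `demo`. Each band has fewer observations than the full sample, so these intervals will generally be wider than an interval computed from all ages combined.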