# Practice notebook for confidence intervals using NHANES data

[github link](https://github.com/karimkmafifi/Inferential-Statistical-Analysis-with-Python---Coursera/blob/master/nhanes_confidence_intervals_practice.ipynb)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

da = pd.read_csv("nhanes_2015_2016.csv")

## Question 1

Restrict the sample to women between 35 and 50, then use the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to partition this sample into two groups - women who are currently married, and women who are not currently married.  Within each of these groups, calculate the proportion of women who have completed college.  Calculate 95% confidence intervals for each of these proportions.

In [2]:
da['DMDMARTL'] = da['DMDMARTL'].replace({1:'married', 2:'widowed', 
                                         3:'divorced', 4:'separated',
                                        5:'never married', 6:'living with partner',
                                        77:'refused', 99:"don't know", '.':'missing'})

da['DMDMARTL'].unique()

array(['married', 'divorced', 'living with partner', 'separated',
       'never married', nan, 'widowed', 'refused'], dtype=object)

In [3]:
da['RIAGENDR'] = da['RIAGENDR'].replace({1:'male', 2:'female', '.':'missing'})
da['RIAGENDR'].unique()

array(['male', 'female'], dtype=object)

In [4]:
da.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210'],
      dtype='object')

In [19]:
da_female_married = da[(da['RIDAGEYR'].between(35, 50)) & (da['RIAGENDR']=='female') & (da['DMDMARTL']=='married')]
da_female_married.head()

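| | SEQN | ALQ101 | ALQ110 | ALQ130 | SMQ020 | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDCITZN | DMDEDUC2 | ... | BPXDI2 | BMXWT | BMXHT | BMXBMI | BMXLEG | BMXARML | BMXARMC | BMXWAIST | HIQ210 | female_gp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | 83799 | NaN | NaN | NaN | 2 | female | 37 | 2 | 1.0 | 4.0 | ... | 72.0 | 66.6 | 161.6 | 25.5 | NaN | NaN | NaN | NaN | 2.0 | f_married |
| 50 | 83828 | 1.0 | NaN | 2.0 | 2 | female | 39 | 1 | 2.0 | 3.0 | ... | 62.0 | 71.3 | 162.0 | 27.2 | 36.8 | 34.6 | 29.1 | 94.6 | NaN | f_married |
| 55 | 83837 | 2.0 | 2.0 | NaN | 2 | female | 45 | 1 | 1.0 | 2.0 | ... | 68.0 | 77.5 | 148.3 | 35.2 | 30.5 | 34.0 | 34.4 | 107.6 | 2.0 | f_married |
| 61 | 83851 | 2.0 | 1.0 | 1.0 | 1 | female | 37 | 3 | 1.0 | 3.0 | ... | 74.0 | 85.1 | 155.3 | 35.3 | 32.5 | 33.6 | 36.1 | 106.5 | 1.0 | f_married |
| 62 | 83853 | NaN | NaN | NaN | 2 | female | 49 | 3 | 1.0 | 3.0 | ... | 84.0 | 76.1 | 166.7 | 27.4 | 39.2 | 38.6 | 32.6 | 88.7 | 2.0 | f_married |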


In [20]:
# Women aged 35 to 50 who are not currently married
# (note: != 'married' also keeps rows coded 'refused' or missing)
da_female_unmarried = da[(da['RIDAGEYR'].between(35, 50)) & (da['RIAGENDR']=='female') & (da['DMDMARTL']!='married')]
da_female_unmarried.head()

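| | SEQN | ALQ101 | ALQ110 | ALQ130 | SMQ020 | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDCITZN | DMDEDUC2 | ... | BPXDI2 | BMXWT | BMXHT | BMXBMI | BMXLEG | BMXARML | BMXARMC | BMXWAIST | HIQ210 | female_gp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 83736 | 2.0 | 1.0 | 1.0 | 2 | female | 42 | 4 | 1.0 | 4.0 | ... | 54.0 | 55.2 | 164.9 | 20.3 | 37.4 | 36.0 | 27.2 | 80.4 | 2.0 | f_married |
| 52 | 83832 | 2.0 | 1.0 | 4.0 | 2 | female | 50 | 1 | 2.0 | 1.0 | ... | NaN | 105.9 | 157.7 | 42.6 | 29.2 | 35.0 | 40.7 | 129.1 | NaN | f_married |
| 58 | 83845 | 1.0 | NaN | NaN | 1 | female | 44 | 4 | 1.0 | 1.0 | ... | 78.0 | 133.3 | 171.5 | 45.3 | 37.3 | 35.7 | 48.7 | NaN | 2.0 | f_married |
| 100 | 83911 | 1.0 | NaN | 1.0 | 2 | female | 43 | 4 | 1.0 | 4.0 | ... | 76.0 | 91.1 | 172.3 | 30.7 | 40.4 | 38.2 | 34.6 | 101.6 | NaN | f_married |
| 127 | 83958 | 2.0 | 1.0 | 2.0 | 1 | female | 47 | 4 | 1.0 | 3.0 | ... | 76.0 | 58.6 | 160.6 | 22.7 | 36.8 | 35.2 | 28.0 | 77.9 | NaN | f_married |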
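
One caveat about the `!= 'married'` filter: in pandas, a missing value compares as "not equal" to any string, so rows with a missing or refused marital status end up in the "not married" group. The sketch below shows this on a tiny made-up frame (the `toy` data and the `known_unmarried` list are illustrative assumptions, not part of the notebook) and contrasts it with an explicit `isin` filter.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these rows are NOT from NHANES.
toy = pd.DataFrame(
    {"DMDMARTL": pd.Series(
        ["married", "divorced", np.nan, "refused", "never married"],
        dtype=object,
    )}
)

# Naive filter: keeps everything that is not the string 'married',
# including the missing value and the 'refused' response.
naive = toy[toy["DMDMARTL"] != "married"]

# Stricter filter (an assumed definition of "not currently married"):
# keep only the explicitly known non-married categories.
known_unmarried = ["widowed", "divorced", "separated",
                   "never married", "living with partner"]
strict = toy[toy["DMDMARTL"].isin(known_unmarried)]

print(len(naive), len(strict))
```

Whether the stricter definition is the right one depends on how the question is read; applying it to the real data may change the counts reported in the cells that follow.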


In [31]:
# Proportion who completed college (DMDEDUC2 == 5) among married and unmarried women
n_married = da_female_married.shape[0]
n_unmarried = da_female_unmarried.shape[0]

married_after_college = len(da_female_married[da_female_married['DMDEDUC2']==5]) 
unmarried_after_college = len(da_female_unmarried[da_female_unmarried['DMDEDUC2']==5])

married_proportion = married_after_college / n_married
unmarried_proportion = unmarried_after_college / n_unmarried

married_after_college, married_proportion, unmarried_after_college, unmarried_proportion

(162, 0.36080178173719374, 72, 0.21301775147928995)

In [26]:
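# Note: despite the se_ prefix, these hold the variance term p * (1 - p).
# The standard error itself is sqrt(p * (1 - p) / n), computed in the next cells.
se_married = married_proportion * (1 - married_proportion)
se_unmarried = unmarried_proportion * (1 - unmarried_proportion)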

In [30]:
# married
print("LCB (married)", married_proportion - 1.96 * np.sqrt(se_married/n_married))
print("UCB (married)", married_proportion + 1.96 * np.sqrt(se_married/n_married))

LCB (married) 0.3163811208674688
UCB (married) 0.4052224426069187


In [37]:
# unmarried
print("LCB (unmarried)", unmarried_proportion - 1.96 * np.sqrt(se_unmarried/n_unmarried))
print("UCB (unmarried)", unmarried_proportion + 1.96 * np.sqrt(se_unmarried/n_unmarried))

LCB (unmarried) 0.1693673655848136
UCB (unmarried) 0.25666813737376626


In [38]:
# Married women: same interval computed with statsmodels
lcb, ucb = sm.stats.proportion_confint(count=married_after_college, nobs=n_married)
print(f"LCB (married) = {lcb} and UCB (married) = {ucb}")
ucb - lcb

LCB (married) = 0.31638193710753626 and UCB (married) = 0.4052216263668512


0.08883968925931496

In [39]:
# Unmarried women: same interval computed with statsmodels
lcb, ucb = sm.stats.proportion_confint(count=unmarried_after_college, nobs=n_unmarried)
print(f"LCB (unmarried) = {lcb} and UCB (unmarried) = {ucb}")
ucb - lcb

LCB (unmarried) = 0.16936816767089768 and UCB (unmarried) = 0.2566673352876822


0.08729916761678452
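
The hand-computed bounds and the `statsmodels` bounds above agree only to about five decimal places. A likely explanation is that the manual cells use the rounded multiplier 1.96, while the default normal-approximation method uses the exact 97.5% normal quantile (about 1.959964). The sketch below is a minimal standard-library check of that idea; `wald_ci` is a hypothetical helper, and the sample sizes 449 and 338 are the ones implied by the counts and proportions printed earlier (162/449 and 72/338).

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(count, nobs, level=0.95, z=None):
    """Normal-approximation (Wald) interval for a single proportion."""
    p = count / nobs
    if z is None:
        # Exact two-sided normal quantile, e.g. about 1.959964 for 95%.
        z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(p * (1 - p) / nobs)
    return p - half_width, p + half_width


# Rounded multiplier, as in the manual cells.
married_rounded = wald_ci(162, 449, z=1.96)
# Exact quantile, as in the statsmodels cells.
married_exact = wald_ci(162, 449)
unmarried_exact = wald_ci(72, 338)

print("married, z = 1.96 :", married_rounded)
print("married, exact z  :", married_exact)
print("unmarried, exact z:", unmarried_exact)
```

For these sample sizes the difference between the two multipliers is far too small to matter in practice.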

__Q1a.__ Identify which of the two confidence intervals is wider, and explain why this is the case. 

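The first confidence interval (married women) is slightly wider: about 0.0888 versus 0.0873 for unmarried women. The width of a normal-approximation interval is 2 × 1.96 × sqrt(p(1 − p)/n). The married group has the larger sample (n = 449 versus n = 338), which on its own would narrow its interval, but its proportion (about 0.36) is closer to 0.5 than the unmarried proportion (about 0.21), so its variance term p(1 − p) is larger (about 0.231 versus 0.168). That larger variance term slightly outweighs the benefit of the larger sample.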
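
The width comparison can be checked numerically by separating the two ingredients of the standard error: the variance term p(1 − p) and the sample size n. The following is a minimal sketch using the counts from the cells above (162 of 449 married women and 72 of 338 unmarried women, with the sample sizes implied by the printed proportions); the variable names are illustrative.

```python
from math import sqrt

# (college graduates, group size) for each group, from the cells above.
groups = {"married": (162, 449), "unmarried": (72, 338)}

variance_terms = {}
widths = {}
for name, (count, nobs) in groups.items():
    p = count / nobs
    variance_terms[name] = p * (1 - p)           # grows as p approaches 0.5
    se = sqrt(variance_terms[name] / nobs)       # shrinks as nobs grows
    widths[name] = 2 * 1.96 * se                 # full width of the 95% interval
    print(f"{name:>9}: p = {p:.4f}, p(1-p) = {variance_terms[name]:.4f}, "
          f"n = {nobs}, width = {widths[name]:.4f}")
```

The married group has both the larger n and the larger p(1 − p); the second effect is slightly stronger, so its interval comes out a little wider.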

__Q1b.__ Write 1-2 sentences summarizing these findings for an audience that does not know what a confidence interval is (the goal here is to report the substance of what you learned about how marital status and educational attainment are related, not to teach a person what a confidence interval is).

## Question 2

Construct a 95% confidence interval for the proportion of smokers who are female. Construct a 95% confidence interval for the proportion of smokers who are male. Construct a 95% confidence interval for the **difference** between those two gender proportions.
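
One possible approach to the difference, sketched here under stated assumptions: treat the two gender groups as independent samples and use the unpooled standard error sqrt(p1(1 − p1)/n1 + p2(1 − p2)/n2). The helper `diff_proportion_ci` and the counts below are hypothetical and chosen only for illustration. In the notebook, the counts would presumably come from the smoking variable `SMQ020` (coded 1 for yes) within each `RIAGENDR` group, but that mapping is an assumption and should be checked against the codebook.

```python
from math import sqrt


def diff_proportion_ci(count1, nobs1, count2, nobs2, z=1.96):
    """Unpooled normal-approximation interval for p1 - p2."""
    p1 = count1 / nobs1
    p2 = count2 / nobs2
    # Variances of independent sample proportions add.
    se_diff = sqrt(p1 * (1 - p1) / nobs1 + p2 * (1 - p2) / nobs2)
    diff = p1 - p2
    return diff - z * se_diff, diff + z * se_diff


# Hypothetical counts for illustration only; NOT taken from NHANES.
female_smokers, n_female = 300, 1000
male_smokers, n_male = 500, 1000

lcb, ucb = diff_proportion_ci(female_smokers, n_female, male_smokers, n_male)
print(f"95% CI for (female - male) proportion: ({lcb:.4f}, {ucb:.4f})")
```

If the resulting interval excludes zero, the data are consistent with a real difference between the two proportions at the 95% level.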