# Confidence Intervals

## Confidence Intervals Using NHANES Data
--- 
#### Confidence Intervals: $Best\ Estimate\ \pm Margin\ of\ Error$
---
##### Formulas:
**Population Proportions**

$\hat{p} \pm z^* \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

**Population Mean**

$\bar{x} \pm t^* \cdot \frac{S}{\sqrt{n}}$

**Standard Error For Difference of Two Proportions or Means** 

$\sqrt{(SE_{1})^2+(SE_{2})^2}$

---

$\hat{p}$: sample proportion

$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$: estimated standard error of the sample proportion

$z^*$: critical value (multiplier) from the standard normal distribution, e.g. 1.96 for 95% confidence

$\bar{x}$: sample mean

$S$: sample standard deviation

$\frac{S}{\sqrt{n}}$: estimated standard error of the sample mean

$n$: sample size

$t^*$: critical value from the *t-distribution* table with *n-1* degrees of freedom
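As a rough illustration of where the multiplier $z^*$ comes from, the sketch below looks up the standard normal quantile with Python's standard-library `statistics.NormalDist`. The helper name `z_star` is hypothetical and not part of the original notebook. The multiplier $t^*$ would need a t-distribution quantile instead (for example `scipy.stats.t.ppf`), which is not shown here.

```python
from statistics import NormalDist

def z_star(confidence: float) -> float:
    """Two-sided critical value of the standard normal for a confidence level.

    Hypothetical helper for illustration only.
    """
    # Split the leftover probability evenly between the two tails
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

print(round(z_star(0.95), 2))  # 1.96
```

A higher confidence level pushes the quantile further into the tail, so the multiplier and the resulting interval both grow.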

In [1]:
# Initial Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import statsmodels.api as sm
from pathlib import Path
%matplotlib inline
pd.options.display.max_columns=50

In [2]:
df = pd.read_csv(Path('data/nhanes_2015_2016.csv'))
display(df.shape)
df.head()

(5735, 28)

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,DMDMARTL,DMDHHSIZ,WTINT2YR,SDMVPSU,SDMVSTRA,INDFMPIR,BPXSY1,BPXDI1,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,1.0,2,134671.37,1,125,4.39,128.0,70.0,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,3.0,1,24328.56,1,125,1.32,146.0,88.0,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,1.0,2,12400.01,1,131,1.51,138.0,46.0,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,6.0,1,102718.0,1,131,5.0,132.0,72.0,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,3.0,5,17627.67,2,126,1.23,100.0,70.0,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [3]:
df.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210'],
      dtype='object')

### Question 1
Restrict the sample to women between 35 and 50, then use the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to partition this sample into two groups - women who are currently married, and women who are not currently married.  Within each of these groups, calculate the proportion of women who have completed college.  Calculate 95% confidence intervals for each of these proportions.

In [4]:
# check for missing values in our columns of interest
df[['RIAGENDR','RIDAGEYR','DMDMARTL','DMDEDUC2']].isna().sum()

RIAGENDR      0
RIDAGEYR      0
DMDMARTL    261
DMDEDUC2    261
dtype: int64

In [5]:
# drop missing values 
df.dropna(subset=['DMDMARTL','DMDEDUC2'], inplace=True)
df[['RIAGENDR','RIDAGEYR','DMDMARTL','DMDEDUC2']].isna().sum()

RIAGENDR    0
RIDAGEYR    0
DMDMARTL    0
DMDEDUC2    0
dtype: int64

In [6]:
def myfunct(x):
    """ This function will be used to generalize marital status as either married or unmarried """
    if x == 1:
        return 'Married'
    else:
        return 'Not Married'    

In [7]:
# Replace gender code for male and female
df['RIAGENDR'] = df['RIAGENDR'].replace({1:'Male',2:'Female'})

# Replace code for Adults 20+, DMDEDUC2 column
df['DMDEDUC2'] = df['DMDEDUC2'].replace(
    {
        1:'Less than 9th grade',
        2:'9-11th grade',
        3:'Highschool graduate',
        4:'Some college',
        5:'College graduate',
        7:'Refused',
        9:"Don't Know"        
    })

# Classify DMDMARTL column as either married or unmarried.
df['DMDMARTL'] = df['DMDMARTL'].apply(myfunct)

In [8]:
# Restrict the sample to women between 35 and 50 years old
sample_df = df[(df['RIDAGEYR']>=35)&(df['RIDAGEYR']<=50)&(df['RIAGENDR']=='Female')]
print("Shape of Sample Dataframe")
display(sample_df.shape)
print("Unique Values  Count for Education Level")
display(sample_df['DMDEDUC2'].value_counts())
print("Unique Value Count for Marital Status")
display(sample_df['DMDMARTL'].value_counts())
sample_df.sample(5)

Shape of Sample Dataframe


(787, 28)

Unique Values  Count for Education Level


Some college           257
College graduate       234
Highschool graduate    133
Less than 9th grade     83
9-11th grade            80
Name: DMDEDUC2, dtype: int64

Unique Value Count for Marital Status


Married        449
Not Married    338
Name: DMDMARTL, dtype: int64

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,DMDMARTL,DMDHHSIZ,WTINT2YR,SDMVPSU,SDMVSTRA,INDFMPIR,BPXSY1,BPXDI1,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
4432,91417,2.0,2.0,,2,Female,39,4,1.0,College graduate,Not Married,1,32503.24,1,132,5.0,120.0,78.0,120.0,76.0,79.6,157.9,31.9,37.4,38.0,34.0,108.5,2.0
1559,86364,1.0,,2.0,1,Female,42,3,1.0,Highschool graduate,Not Married,4,114675.72,1,122,4.95,118.0,78.0,116.0,78.0,74.5,158.8,29.5,36.8,37.2,32.8,89.7,2.0
2129,87330,2.0,2.0,,2,Female,36,3,1.0,Highschool graduate,Married,6,45235.25,1,126,0.74,104.0,66.0,96.0,64.0,66.6,158.2,26.6,37.5,37.2,31.5,86.6,2.0
2759,88496,2.0,2.0,,2,Female,37,3,1.0,Some college,Married,5,73245.77,1,131,1.83,108.0,70.0,112.0,62.0,109.8,164.8,40.4,40.5,38.5,37.6,118.0,2.0
3500,89743,2.0,1.0,1.0,2,Female,41,3,1.0,College graduate,Married,3,146360.49,1,127,5.0,112.0,74.0,108.0,78.0,56.1,159.5,22.1,37.0,33.0,26.8,78.4,2.0


In [9]:
pd.crosstab(sample_df['DMDMARTL'],sample_df['DMDEDUC2'])

| DMDMARTL | 9-11th grade | College graduate | Highschool graduate | Less than 9th grade | Some college |
|---|---|---|---|---|---|
| Married | 43 | 162 | 64 | 41 | 139 |
| Not Married | 37 | 72 | 69 | 42 | 118 |


In [10]:
print('Sample proportion for married females that have finished college\n',pd.crosstab(sample_df['DMDMARTL'],sample_df['DMDEDUC2']).iloc[0,1]/449)

Sample proportion for married females that have finished college
 0.36080178173719374


In [11]:
# The lambda computes the proportion of college graduates in each group (mean of a boolean mask); np.size gives each group's sample size
sample_df = sample_df.groupby('DMDMARTL').agg({'DMDEDUC2': [lambda x: np.mean(x=='College graduate'),np.size]})
sample_df.columns = ['proportion','n']
sample_df

| DMDMARTL | proportion | n |
|---|---|---|
| Married | 0.360802 | 449 |
| Not Married | 0.213018 | 338 |


In [19]:
# Calculate 95% confidence interval for married female population
# z-score at 95% = 1.96
p = sample_df.loc['Married','proportion']
n = sample_df.loc['Married','n']

# calculate standard error
se_married = np.sqrt((p*(1-p))/n)

# calculate confidence interval
lower_bound, upper_bound =  round(p - 1.96 * se_married,4), round(p + 1.96 * se_married,4)

print(f"Standard error for population proportion for married females who have completed college: {round(se_married,4)}")
print(f"Estimated population proportion for married females who have completed college: {round(p,4)}")
print(f"95% confidence interval for married female population: {lower_bound, upper_bound}")
print(f"Range of interval: {round(upper_bound-lower_bound,4)}")
#sm.stats.proportion_confint(162, 449)  

Standard error for population proportion for married females who have completed college: 0.0227
Estimated population proportion for married females who have completed college: 0.3608
95% confidence interval for married female population: (0.3164, 0.4052)
Range of interval: 0.0888


In [22]:
# Calculate 95% confidence interval for unmarried female population
p = sample_df.loc['Not Married','proportion']
n = sample_df.loc['Not Married','n']

# calculate standard error
se_unmarried = np.sqrt((p*(1-p))/n)

# calculate confidence interval
lower_bound, upper_bound =  round(p - 1.96 * se_unmarried,4), round(p + 1.96 * se_unmarried,4)

print(f"Standard error for population proportion for unmarried females who have completed college: {round(se_unmarried,4)}")
print(f"Estimated population proportion for unmarried females who have completed college: {round(p,4)}")
print(f"95% confidence interval for unmarried female population: {lower_bound, upper_bound}")
print(f"Range of interval: {round(upper_bound-lower_bound,4)}")
#sm.stats.proportion_confint(72, 338)

Standard error for population proportion for unmarried females who have completed college: 0.0223
Estimated population proportion for unmarried females who have completed college: 0.213
95% confidence interval for unmarried female population: (0.1694, 0.2567)
Range of interval: 0.0873
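
The two cells above repeat the same steps, so they could be wrapped in one small function. This is only a sketch: the name `proportion_ci` is hypothetical, the counts 162/449 and 72/338 are taken from the crosstab above, and 1.96 is the same rounded multiplier used in the notebook.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion: p_hat +/- z * SE.

    Hypothetical helper for illustration only.
    """
    p_hat = successes / n
    # Estimated standard error of the sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

married_ci = proportion_ci(162, 449)    # college graduates / married women
unmarried_ci = proportion_ci(72, 338)   # college graduates / not married women
print([round(b, 4) for b in married_ci])
print([round(b, 4) for b in unmarried_ci])
```

The printed bounds should reproduce the intervals reported by the two cells. The call `sm.stats.proportion_confint(162, 449)`, left commented out in the notebook, uses the same normal approximation by default and should agree up to the rounding of the multiplier.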


The confidence interval for currently married women who have graduated from college is slightly wider (0.0888) than the interval for women who are not currently married (0.0873). The width of an interval is twice its margin of error, and the margin of error is $z^*$ times the standard error. Although the married group is larger (449 vs. 338), its sample proportion (0.3608) is closer to 0.5, which makes $\hat{p}(1-\hat{p})$ larger. The net effect is a slightly larger standard error (0.0227 vs. 0.0223) and therefore a slightly wider interval.
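
To make the comparison concrete, the two standard errors can be worked out by hand from the sample proportions and group sizes reported above (values rounded to four decimals):

$SE_{married} = \sqrt{\frac{0.3608(1-0.3608)}{449}} \approx 0.0227$

$SE_{unmarried} = \sqrt{\frac{0.2130(1-0.2130)}{338}} \approx 0.0223$

Multiplying each by 1.96 gives margins of error of about 0.0444 and 0.0437, which are half of the interval ranges 0.0888 and 0.0873 printed above.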

Based on our findings, we are 95% confident that the population proportion of currently married women between the ages of 35 and 50 who have completed college lies between 0.3164 and 0.4052.

With 95% confidence, the population proportion of women between the ages of 35 and 50 who are not currently married and have completed college is estimated to lie between 0.1694 and 0.2567.
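
The standard error formula for a difference of two proportions, listed with the formulas at the top of this section, was not used in Question 1, but it can be applied to the same two groups. The sketch below is an illustration only: `diff_ci` is a hypothetical helper, the counts come from the crosstab above, and the two groups are treated as independent simple random samples, which ignores the NHANES survey design.

```python
import math

def diff_ci(x1, n1, x2, n2, z=1.96):
    """Wald interval for p1 - p2 using SE_diff = sqrt(SE1^2 + SE2^2).

    Hypothetical helper for illustration only.
    """
    p1, p2 = x1 / n1, x2 / n2
    # Squared standard errors of the two sample proportions
    se1_sq = p1 * (1 - p1) / n1
    se2_sq = p2 * (1 - p2) / n2
    se_diff = math.sqrt(se1_sq + se2_sq)
    diff = p1 - p2
    return diff - z * se_diff, diff + z * se_diff

# Married (162 of 449) minus not married (72 of 338)
lower, upper = diff_ci(162, 449, 72, 338)
print(round(lower, 4), round(upper, 4))
```

An interval for the difference that lies entirely above zero would suggest a higher college-completion proportion among currently married women in this age range, under the stated assumptions.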