## Confidence Intervals for Differences in Population Parameters

### Overview

We will use the 2015-2016 wave of the NHANES data for our analysis.

For population proportions, we will analyze the difference in the proportion of smokers between females and males.  The column that identifies smokers and non-smokers is "SMQ020" in our dataset.

For population means, we will analyze the difference in mean body mass index (BMI) between females and males.  The column that holds the BMI value is "BMXBMI".

Gender is specified in the column "RIAGENDR".

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Cleaning Dataset First

In [2]:
cols = ['SMQ020', 'BMXBMI', 'RIAGENDR']
df = pd.read_csv('../data/nhanes_2015_2016.csv', usecols=cols)
df.head()

|   | SMQ020 | RIAGENDR | BMXBMI |
|---|--------|----------|--------|
| 0 | 1      | 1        | 27.8   |
| 1 | 1      | 1        | 30.8   |
| 2 | 1      | 1        | 28.8   |
| 3 | 2      | 2        | 42.4   |
| 4 | 2      | 2        | 20.3   |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5735 entries, 0 to 5734
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SMQ020    5735 non-null   int64  
 1   RIAGENDR  5735 non-null   int64  
 2   BMXBMI    5662 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 134.5 KB


In [None]:
# df[df['SMQ020'].isnull()] #no null values

In [15]:
# map 'SMQ020' codes to readable labels; NHANES codes 7 (refused) and 9 (don't know)
# become NaN (any code missing from the dict also maps to NaN)
smq_category = {1:'yes', 2:'no', 7:np.nan, 9:np.nan}
df.SMQ020 = df.SMQ020.map(smq_category)

In [16]:
# map 'RIAGENDR' codes to readable labels
gendr_category = {1:'male', 2:'female'}
df.RIAGENDR = df.RIAGENDR.map(gendr_category)

In [17]:
# confirm changes
df.head()

|   | SMQ020 | RIAGENDR | BMXBMI |
|---|--------|----------|--------|
| 0 | yes    | male     | 27.8   |
| 1 | yes    | male     | 30.8   |
| 2 | yes    | male     | 28.8   |
| 3 | no     | female   | 42.4   |
| 4 | no     | female   | 20.3   |


In [22]:
# cross-tabulate gender by smoking status (missing responses are not counted)
gendr_smq = pd.crosstab(index=df.RIAGENDR, columns=df.SMQ020, dropna=True)
print(gendr_smq)

SMQ020      no   yes
RIAGENDR            
female    2066   906
male      1340  1413


`BMXBMI` and `RIAGENDR`

In [None]:
# df[df['BMXBMI'].isnull()]

In [44]:
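df_bmi = df[df['BMXBMI'].notnull()] # 5662 rows, matching the non-null count from df.info()
# select BMXBMI explicitly so the string-valued SMQ020 column is not aggregated
bmi_gendr = df_bmi.groupby('RIAGENDR')[['BMXBMI']].agg(['count', 'mean', 'std'])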
print(bmi_gendr)

         BMXBMI                     
          count       mean       std
RIAGENDR                            
female     2944  29.939946  7.753319
male       2718  28.778072  6.252568


---

### Constructing Confidence Intervals

Now that we have the smoker counts and the BMI summary statistics for each gender, we can begin to calculate confidence intervals.  The general form is:

$$Best\ Estimate \pm Margin\ of\ Error$$

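Where the *Best Estimate* is the **observed population proportion or mean** from the sample and the *Margin of Error* is the **t-multiplier times the standard error**.

The equation to create a 95% confidence interval can therefore be written as:

$$Population\ Proportion\ or\ Mean\ \pm (t\text{-}multiplier \times Standard\ Error)$$

For samples as large as these, the multiplier for 95% confidence is approximately 1.96, which is the value used in the calculations below.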

The Standard Error (SE) is calculated differently for a population proportion and a mean:

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$
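As a rough sketch of the two standard error formulas above, the helper functions below use only the standard library. The names `se_proportion` and `se_mean` are illustrative and not part of the notebook; the inputs are the counts and rounded summary statistics reported in the cleaning section.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)

def se_mean(sd, n):
    """Standard error of a sample mean, given the sample standard deviation sd."""
    return sd / math.sqrt(n)

# Female smokers: 906 of 2972 respondents (from the crosstab above)
print(se_proportion(906 / 2972, 2972))
# Female BMI: standard deviation 7.753319 over 2944 observations (from the groupby above)
print(se_mean(7.753319, 2944))
```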

Lastly, the standard error for the difference of two population proportions or means, assuming the two samples are independent, is:

$$Standard\ Error\ for\ Difference\ of\ Two\ Population\ Proportions\ Or\ Means = \sqrt{(SE_{\ 1})^2 + (SE_{\ 2})^2}$$
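The combination step can be sketched as a small function. The names `se_difference` and `diff_confidence_interval` are hypothetical, the inputs are made-up values chosen only for illustration, and the default multiplier of 1.96 assumes two independent groups that are large enough for the normal approximation.

```python
import math

def se_difference(se_1, se_2):
    """Standard error of the difference between two independent estimates."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2)

def diff_confidence_interval(est_1, se_1, est_2, se_2, multiplier=1.96):
    """Return (lower, upper) bounds for est_1 - est_2."""
    diff = est_1 - est_2
    margin = multiplier * se_difference(se_1, se_2)
    return diff - margin, diff + margin

# Made-up estimates and standard errors, for illustration only
lcb, ucb = diff_confidence_interval(0.40, 0.03, 0.25, 0.04)
print(lcb, ucb)
```

Passing the estimates and standard errors of the two groups separately keeps the function usable for both proportions and means.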

### Confidence interval for difference of population **proportion** - `SMQ020` and `RIAGENDR`

In [33]:
# number of observations : female and male
num_f = gendr_smq.loc['female'].sum()
num_m = gendr_smq.loc['male'].sum()
print(num_f, num_m)

2972 2753


In [48]:
# proportions : female and male
prop_f = gendr_smq.loc['female','yes'] / num_f
prop_m = gendr_smq.loc['male', 'yes'] / num_m
print(prop_f, prop_m)

# difference in proportions (female - male)
prop_diff = prop_f - prop_m
print(prop_diff)

0.30484522207267833 0.5132582637123139
-0.20841304163963553


In [47]:
# estimated standard error
stdErr_f = np.sqrt( prop_f*(1-prop_f) / num_f )
stdErr_m = np.sqrt( prop_m*(1-prop_m) / num_m )
print(stdErr_f, stdErr_m)

# standard error for difference
stdErr_diff = np.sqrt( stdErr_f**2 + stdErr_m**2 )
print(stdErr_diff)

0.008444152146214435 0.009526078653689868
0.012729881381407434


In [49]:
# 95% confidence interval (1.96 is the normal-approximation multiplier)
lcb = prop_diff - 1.96 * stdErr_diff
ucb = prop_diff + 1.96 * stdErr_diff
conf_diff = (lcb, ucb)
print(conf_diff)

(-0.2333636091471941, -0.18346247413207697)
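The hard-coded 1.96 is the 97.5th percentile of the standard normal distribution. One way to derive it instead of typing it in is the standard library's `statistics.NormalDist`. The `z_multiplier` helper is an illustrative name, and the approach assumes the samples are large enough for the normal approximation to hold.

```python
from statistics import NormalDist

def z_multiplier(confidence=0.95):
    """Two-sided standard normal multiplier for the given confidence level."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

print(round(z_multiplier(0.95), 2))  # 1.96
print(round(z_multiplier(0.99), 2))  # 2.58
```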


### Confidence interval for difference of population **mean** - `BMXBMI` and `RIAGENDR`

In [50]:
bmi_gendr

| RIAGENDR | BMXBMI count | BMXBMI mean | BMXBMI std |
|----------|--------------|-------------|------------|
| female   | 2944         | 29.939946   | 7.753319   |
| male     | 2718         | 28.778072   | 6.252568   |


In [76]:
bmi_f = bmi_gendr.loc['female']
#print(bmi_f)
# unpack count, mean and std by position
num_f, mean_f, std_f = bmi_f.to_numpy()
print(num_f, mean_f, std_f)

2944.0 29.939945652173996 7.753318809545676


In [77]:
bmi_m = bmi_gendr.loc['male']
#print(bmi_m)
# unpack count, mean and std by position
num_m, mean_m, std_m = bmi_m.to_numpy()
print(num_m, mean_m, std_m)

2718.0 28.778072111846985 6.252567616801466


In [79]:
# mean difference
mean_diff = mean_f - mean_m
print(mean_diff)

1.1618735403270115


In [85]:
# standard error for difference in mean
stdErr_f = std_f / np.sqrt(num_f)
stdErr_m = std_m / np.sqrt(num_m)
print(stdErr_f, stdErr_m)

stdErr_diff = np.sqrt( stdErr_f**2 + stdErr_m**2 )
print(stdErr_diff)

0.1428955614964966 0.11993161192479428
0.18655490621872828


In [86]:
# confidence interval for difference in mean
lcb = mean_diff - 1.96 * stdErr_diff
ucb = mean_diff + 1.96 * stdErr_diff
conf_diff = (lcb, ucb)
print(conf_diff)

(0.7962259241383041, 1.527521156515719)
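The formulas earlier refer to a t-multiplier, while the code uses 1.96. A quick way to see why the two agree here is to estimate the degrees of freedom with the Welch-Satterthwaite approximation. This is an addition of this sketch, not something the notebook computes; `welch_df` is an illustrative name, and the inputs are the rounded summary statistics from `bmi_gendr`.

```python
def welch_df(sd_1, n_1, sd_2, n_2):
    """Approximate degrees of freedom for a difference of two independent means."""
    v_1 = sd_1 ** 2 / n_1  # squared standard error of the first mean
    v_2 = sd_2 ** 2 / n_2  # squared standard error of the second mean
    return (v_1 + v_2) ** 2 / (v_1 ** 2 / (n_1 - 1) + v_2 ** 2 / (n_2 - 1))

# female and male BMI: standard deviation and count
df_welch = welch_df(7.753319, 2944, 6.252568, 2718)
print(df_welch)
```

With degrees of freedom in the thousands, the t distribution is practically indistinguishable from the standard normal, so 1.96 is a reasonable multiplier for this interval.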
