# Confidence Intervals


This tutorial demonstrates how to load data, clean and manipulate a dataset, and construct confidence intervals for the difference between two population proportions and between two population means.

We will use the 2015-2016 wave of the NHANES data for our analysis.

*Note: We have provided a notebook that includes more analysis, with examples of confidence intervals for a single population proportion and a single population mean, in addition to the analysis I will show you in this tutorial. I highly recommend checking it out!*

For our population proportions, we will analyze the difference in the proportion of smokers between females and males. The column that distinguishes smokers from non-smokers is "SMQ020" in our dataset.

For our population means, we will analyze the difference in mean body mass index between our female and male populations. The column that contains the body mass index value is "BMXBMI".

Additionally, the gender is specified in the column "RIAGENDR".

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm
import matplotlib
matplotlib.use('Agg')

In [3]:
da = pd.read_csv("nhanes_2015_2016.csv")

## Investigating and Cleaning Data

In [4]:
# Recode SMQ020 from 1/2 to Yes/No into new variable SMQ020x
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
da["SMQ020x"]

0       Yes
1       Yes
2       Yes
3        No
4        No
       ... 
5730    Yes
5731     No
5732    Yes
5733    Yes
5734     No
Name: SMQ020x, Length: 5735, dtype: object

In [5]:
# Recode RIAGENDR from 1/2 to Male/Female into new variable RIAGENDRx
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
da["RIAGENDRx"]

0         Male
1         Male
2         Male
3       Female
4       Female
         ...  
5730    Female
5731      Male
5732    Female
5733      Male
5734    Female
Name: RIAGENDRx, Length: 5735, dtype: object

In [10]:
dx = da[["SMQ020x", "RIAGENDRx"]].dropna()
dx

     SMQ020x RIAGENDRx
0        Yes      Male
1        Yes      Male
2        Yes      Male
3         No    Female
4         No    Female
...      ...       ...
5730     Yes    Female
5731      No      Male
5732     Yes    Female
5733     Yes      Male


In [9]:
# Cross tabulation
pd.crosstab(index=dx.SMQ020x, columns=dx.RIAGENDRx)

RIAGENDRx  Female  Male
SMQ020x
No           2066  1340
Yes           906  1413


In [7]:
# Same cross tabulation with rows and columns swapped (gender as rows)
pd.crosstab(index=dx.RIAGENDRx, columns=dx.SMQ020x)

SMQ020x      No   Yes
RIAGENDRx
Female     2066   906
Male       1340  1413


In [11]:
# Recode SMQ020x from Yes/No to 1/0 into existing variable SMQ020x
dx["SMQ020x"] = dx.SMQ020x.replace({"Yes": 1, "No": 0})
dx["SMQ020x"][0:5]

0    1
1    1
2    1
3    0
4    0
Name: SMQ020x, dtype: int64

In [14]:
# Proportion of smokers (mean of the 0/1 indicator) and group size, by gender
dz = dx.groupby("RIAGENDRx").agg({"SMQ020x": ["mean", "size"]})
dz

            SMQ020x
               mean  size
RIAGENDRx
Female     0.304845  2972
Male       0.513258  2753


In [19]:
# Check the female proportion by counting smokers (1) and non-smokers (0)
dx[dx["RIAGENDRx"] == 'Female']["SMQ020x"].value_counts()

SMQ020x
0    2066
1     906
Name: count, dtype: int64

In [21]:
906/(2066+906)

0.30484522207267833

In [22]:
dz.columns = ["Proportion", "Total n"]
dz

           Proportion  Total n
RIAGENDRx
Female       0.304845     2972
Male         0.513258     2753


### Constructing Confidence Intervals

Now that we have the sample proportions of female and male smokers, we can begin to calculate confidence intervals. From lecture, we know that the general form is as follows:

$$Best\ Estimate \pm Margin\ of\ Error$$

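Where the *Best Estimate* is the **observed population proportion or mean** from the sample and the *Margin of Error* is the **t-multiplier** times the **standard error**.

Written out, the confidence interval is:

$$Population\ Proportion\ or\ Mean\ \pm (t\text{-}multiplier *\ Standard\ Error)$$

For a 95% confidence interval with large samples, the multiplier is approximately 1.96, which is the value used in the calculations in this tutorial.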
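
As a rough check on where a multiplier of about 1.96 comes from, the sketch below looks up the two-sided cutoff of the standard normal distribution using only the Python standard library. It assumes a normal approximation, which is reasonable for samples in the thousands; `normal_multiplier` is a hypothetical helper name and is not part of the original notebook.

```python
from statistics import NormalDist


def normal_multiplier(confidence=0.95):
    """Two-sided multiplier from the standard normal distribution.

    Hypothetical helper for illustration; assumes a normal approximation.
    """
    tail = (1 - confidence) / 2            # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)  # e.g. the 97.5th percentile for 95%


multiplier_95 = normal_multiplier(0.95)
print(round(multiplier_95, 2))  # 1.96
```

A higher confidence level gives a larger multiplier and therefore a wider interval.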

The Standard Error (SE) is calculated differently for a population proportion and a mean:

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$

Lastly, the standard error for the difference of two population proportions or means is:

$$Standard\ Error\ for\ Difference\ of\ Two\ Population\ Proportions\ Or\ Means = \sqrt{(SE_{\ 1})^2 + (SE_{\ 2})^2}$$
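
The three standard error formulas above can be sketched as small Python functions. The function names are hypothetical (they do not appear in the notebook), and the inputs in the usage lines are toy values chosen so the results are easy to verify by hand.

```python
import math


def se_proportion(p, n):
    """Standard error of a proportion p estimated from n observations."""
    return math.sqrt(p * (1 - p) / n)


def se_mean(sd, n):
    """Standard error of a mean, given the standard deviation and n."""
    return sd / math.sqrt(n)


def se_difference(se_1, se_2):
    """Standard error of the difference of two independent estimates."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2)


# Toy values, not taken from the NHANES data
print(se_proportion(0.5, 100))    # about 0.05
print(se_mean(10.0, 100))         # about 1.0
print(se_difference(0.03, 0.04))  # about 0.05
```

Note that the combined standard error is always at least as large as either individual standard error, because the squared terms add.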



### Difference of Two Population Proportions

In [34]:
p = .304845                                     # Population proportion for female
n = 2972                                        # Number of observations for female
se_female = np.sqrt(p * (1 - p)/n)              # Standard Error for female
se_female

0.00844415041930423

In [35]:
p = .513258                                    # Population proportion for male
n = 2753                                       # Number of observations for male
se_male = np.sqrt(p * (1 - p)/ n)              # Standard Error for male
se_male

0.009526078787008965

In [36]:
se_diff = np.sqrt(se_female**2 + se_male**2)  # Standard Error for Difference of Two Population Proportions
se_diff

0.012729880335656654

In [37]:
d = .304845 - .513258                        # Difference between the 2 proportions (female - male)
lcb = d - 1.96 * se_diff                     # Lower Confidence Bound
ucb = d + 1.96 * se_diff                     # Upper Confidence Bound
(lcb, ucb)

(-0.23336356545788706, -0.18346243454211297)

In [38]:
# Same interval with the order reversed: male proportion - female proportion
d = .513258  - .304845
lcb = d - 1.96 * se_diff
ucb = d + 1.96 * se_diff
(lcb, ucb)

(0.18346243454211297, 0.23336356545788706)
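
The two cells above can be folded into one function that starts from the raw counts in the cross tabulation (906 of 2972 females and 1413 of 2753 males answered "Yes"). This is a minimal sketch: `diff_proportion_ci` is a hypothetical name, and the default multiplier of 1.96 is the one used above. Because it works from counts rather than the rounded proportions, its results may differ from the cell outputs in the last few decimal places.

```python
import math


def diff_proportion_ci(x1, n1, x2, n2, multiplier=1.96):
    """Confidence interval for p1 - p2 from success counts and group sizes.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    p1, p2 = x1 / n1, x2 / n2
    se_1 = math.sqrt(p1 * (1 - p1) / n1)
    se_2 = math.sqrt(p2 * (1 - p2) / n2)
    se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2)
    d = p1 - p2
    return d - multiplier * se_diff, d + multiplier * se_diff


female_minus_male = diff_proportion_ci(906, 2972, 1413, 2753)
male_minus_female = diff_proportion_ci(1413, 2753, 906, 2972)
print(female_minus_male)  # roughly (-0.2334, -0.1835)
print(male_minus_female)  # roughly (0.1835, 0.2334)
```

Reversing the order of subtraction only flips the sign of the bounds and swaps them; the width of the interval is unchanged.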

### Difference of Two Population Means

In [27]:
da["BMXBMI"].head()

0    27.8
1    30.8
2    28.8
3    42.4
4    20.3
Name: BMXBMI, dtype: float64

In [28]:
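# Mean, standard deviation, and group size of BMI by gender
# Note: size counts every row in the group, including rows where BMXBMI is missing
da.groupby("RIAGENDRx").agg({"BMXBMI": ["mean", "std", "size"]})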

              BMXBMI
                mean       std  size
RIAGENDRx
Female     29.939946  7.753319  2976
Male       28.778072  6.252568  2759


In [29]:
sem_female = 7.753319 / np.sqrt(2976)           # Standard Error for mean (female)
sem_male = 6.252568 / np.sqrt(2759)             # Standard Error for mean (male)
(sem_female, sem_male)

(0.14212523289878048, 0.11903716451870151)

In [30]:
sem_diff = np.sqrt(sem_female**2 + sem_male**2) # Standard Error for Difference of Two Population Means
sem_diff

0.18538993598139303

In [32]:
d = 29.939946 - 28.778072                      # Difference between the 2 mean values (female - male)

In [33]:
lcb = d - 1.96 * sem_diff                      # Lower Confidence Bound
ucb = d + 1.96 * sem_diff                      # Upper Confidence Bound
(lcb, ucb)

(0.798509725476467, 1.5252382745235278)
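
The same calculation can be sketched as one function that takes the summary statistics from the table above. `diff_mean_ci` is a hypothetical name, and the group sizes are used exactly as reported there, so this reproduces the notebook's arithmetic rather than a refined estimate.

```python
import math


def diff_mean_ci(mean_1, sd_1, n_1, mean_2, sd_2, n_2, multiplier=1.96):
    """Confidence interval for mean_1 - mean_2 from summary statistics.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    sem_1 = sd_1 / math.sqrt(n_1)
    sem_2 = sd_2 / math.sqrt(n_2)
    sem_diff = math.sqrt(sem_1 ** 2 + sem_2 ** 2)
    d = mean_1 - mean_2
    return d - multiplier * sem_diff, d + multiplier * sem_diff


# Female minus male BMI, using the values from the groupby table
lcb, ucb = diff_mean_ci(29.939946, 7.753319, 2976, 28.778072, 6.252568, 2759)
print((lcb, ucb))   # roughly (0.7985, 1.5252)
print(lcb > 0)      # True: the interval lies entirely above zero
```

Since the whole interval is above zero, the data are consistent with a higher mean BMI among females than among males in this sample.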