# Confidence intervals case study using NHANES data


This notebook demonstrates how to use Python and its statistical libraries to construct confidence intervals for proportions and means.  We will also cover some important points relating to the properties of confidence intervals, and discuss how to use and interpret confidence intervals in practice.  We will use the 2015-2016 wave of the [NHANES](https://www.cdc.gov/nchs/nhanes/index.htm) data for all the analyses below.

It is important to note that the NHANES data come from a "complex survey".  The data are not an independent and representative sample from the target population.  Proper analysis of complex survey data should make use of additional information about the manner in which the data were collected.  Since complex survey analysis is a somewhat specialized topic, we ignore this aspect of the data here, and analyze the NHANES data as if they were an independent and identically distributed sample from a population.

The following module import statements and data reading statement are identical to what we have used previously in course 1.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline


In [2]:
da = pd.read_csv("nhanes_2015_2016.csv")

## Confidence intervals for one proportion

In this section, we demonstrate the construction of confidence intervals for the proportion of people who smoke.  The specific definition of "smoker" used here ([SMQ020](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/SMQ_I.htm#SMQ020)) identifies a person as being a smoker if they self-report as having smoked 100 or more cigarettes in their lifetime.  It is more accurate to refer to this as a measure of "lifetime smoking" rather than "current smoking".  Recall that the definitions of these and other NHANES variables can be found using the NHANES code books, or by searching using the link below.

https://wwwn.cdc.gov/nchs/nhanes/search/default.aspx

We will calculate the proportions of smokers separately for females and for males.  Initially we can compare these two proportions and their corresponding confidence intervals informally, but later we will discuss methods to compare two proportions formally using confidence intervals.

First we replace the numeric codes in the variables of interest with text labels, and set the rare answers other than "yes" and "no" to be missing (so they will automatically be omitted from all the analyses below).

In [3]:
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})  # np.nan represents a missing value
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

In [9]:
dx = da[["SMQ020x", "RIAGENDRx"]].dropna()
pd.crosstab(index=dx.SMQ020x, columns=dx.RIAGENDRx)

| SMQ020x (rows) / RIAGENDRx (columns) | Female | Male |
|---|---|---|
| No | 2066 | 1340 |
| Yes | 906 | 1413 |


The confidence interval (CI) is constructed using two inputs: the sample proportion of smokers, and the total sample size for smokers and non-smokers combined.  We calculate these values next.

In [12]:
dz = dx.groupby(dx.RIAGENDRx).agg({"SMQ020x": [lambda x: np.mean(x=="Yes"), np.size]})
dz

| RIAGENDRx | SMQ020x: `<lambda_0>` | SMQ020x: size |
|---|---|---|
| Female | 0.304845 | 2972 |
| Male | 0.513258 | 2753 |


In [13]:
dz.columns = ["Proportion", "Total_n"]
dz

| RIAGENDRx | Proportion | Total_n |
|---|---|---|
| Female | 0.304845 | 2972 |
| Male | 0.513258 | 2753 |


Confidence intervals are closely connected to standard errors.  Recall that the standard error essentially tells you how far you should expect an estimate to fall from the truth.  A confidence interval is an interval that under repeated sampling covers the truth a defined proportion of the time.  In most settings, this "coverage probability" is set to 95%.

It turns out that in many settings, a 95% confidence interval can be constructed as the interval consisting of all points that are within two (or 1.96) standard errors of the point estimate.  More concisely, the confidence interval approximately spans from e - 2•SE to e + 2•SE, where e is the point estimate and SE is the standard error.
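To make the idea of coverage concrete, the following sketch simulates repeated sampling from a population with a known proportion, and counts how often the interval spanning 1.96 standard errors on either side of the point estimate contains the truth.  The function name `coverage_rate`, the true proportion of 0.3, the sample size of 3000, and the number of replications are illustrative assumptions, not values taken from the NHANES data.

```python
import numpy as np

def coverage_rate(p_true, n, n_reps=10_000, z=1.96, seed=0):
    """Fraction of simulated samples whose interval contains p_true.

    All arguments are hypothetical settings chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    # Each simulated sample of size n yields one estimated proportion.
    p_hat = rng.binomial(n, p_true, size=n_reps) / n
    # Estimated standard error of each sample proportion.
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    lower = p_hat - z * se
    upper = p_hat + z * se
    # Proportion of intervals that cover the true value.
    return np.mean((lower <= p_true) & (p_true <= upper))

rate = coverage_rate(p_true=0.3, n=3000)
print("Simulated coverage:", rate)
```

With settings like these, the simulated coverage should land close to the nominal 0.95.  It will not match exactly, because the simulation is random and the interval relies on a normal approximation.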

Since the standard error plays such an important role here, we calculate it separately first.

In [16]:
p = dz.Proportion.Female                  # Value in junction of index "Female" and column "Proportion"
n = dz.Total_n.Female                     # Value in junction of index "Female" and column "Total_n"
se_female = np.sqrt(p * (1 - p) / n)      # Standard Error for Female
print("Standard Error (Female):", se_female)

p = dz.Proportion.Male                    # Value in junction of index "Male" and column "Proportion"
n = dz.Total_n.Male                       # Value in junction of index "Male" and column "Total_n"
se_male = np.sqrt(p * (1 - p) / n)        # Standard Error for Male
print("Standard Error (Male):", se_male)

Standard Error (Female): 0.008444152146214435
Standard Error (Male): 0.009526078653689868


We can see that the standard errors for the estimated proportions of females and males who smoke are similar, and are each around 1% (since we are studying a proportion here, 0.01 corresponds to a 1 percentage point change in the smoking rate).  

The standard error for a proportion is maximized when the true proportion is exactly 1/2, and gets smaller as the true proportion approaches either 0 or 1.  It also shrinks as the sample size grows.  The estimated male smoking proportion is closer to 1/2 than the estimated female smoking proportion, and the male sample size is smaller than the female sample size.  Both of these factors lead to the male standard error being larger than the female standard error, although the difference is very small in this case.
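As a quick check of these two effects, the sketch below evaluates the standard error formula over a grid of proportions at a fixed sample size, and then at two different sample sizes for a fixed proportion.  The helper name `se_proportion` and the grid of proportions are assumptions made for illustration; the sample sizes and proportions passed in at the end are the rounded values from the table above.

```python
import numpy as np

def se_proportion(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return np.sqrt(p * (1 - p) / n)

# Effect of the proportion: hold n fixed and vary p over a grid.
p_grid = np.linspace(0.05, 0.95, 19)          # 0.05, 0.10, ..., 0.95
se_grid = se_proportion(p_grid, 2753)
p_at_max = p_grid[np.argmax(se_grid)]
print("Proportion with the largest standard error:", p_at_max)

# Effect of the sample size: hold p fixed and compare the two group sizes.
print("SE at p=0.5, n=2753:", se_proportion(0.5, 2753))
print("SE at p=0.5, n=2972:", se_proportion(0.5, 2972))

# Recompute the two standard errors from the rounded table values.
se_female_check = se_proportion(0.304845, 2972)
se_male_check = se_proportion(0.513258, 2753)
print("Female:", se_female_check, "Male:", se_male_check)
```

The recomputed values should agree with the standard errors printed earlier, up to the rounding of the proportions shown in the table.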

Next we calculate the 95% confidence intervals for the proportions of female and male smokers using the formula for the one-sample confidence interval for a proportion: