# Hypothesis Testing Using [NHANES](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2) Data

[Hypothesis Testing:](https://www.investopedia.com/terms/h/hypothesistesting.asp) "is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis." The analyst establishes a null hypothesis and uses a test statistic computed from sample data to either reject or fail to reject it. If the data provide enough evidence against the null hypothesis, it is rejected in favor of the alternative; otherwise the null hypothesis is retained, which does not prove it true.

**Null Hypothesis: $H_0$**

**Alternative Hypothesis: $H_a$**

$test\ statistic = \frac{Best\ Estimate - Hypothesized\ Estimate}{Standard\ Error\ of\ Estimate}$

The statsmodels library will be used throughout this notebook.

[Quick Reference](https://online.stat.psu.edu/stat200/book/export/html/196)
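
The test statistic formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's method: the helper names `compute_test_statistic` and `two_sided_p_value` and the input numbers are hypothetical, and the p-value assumes a standard normal reference distribution.

```python
import math

def compute_test_statistic(best_estimate, hypothesized, standard_error):
    """(best estimate - hypothesized value) / standard error of the estimate."""
    return (best_estimate - hypothesized) / standard_error

def two_sided_p_value(z):
    """Two-sided p-value, assuming a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers for illustration only (not taken from NHANES)
z = compute_test_statistic(best_estimate=0.52, hypothesized=0.50, standard_error=0.01)
p = two_sided_p_value(z)

alpha = 0.05  # significance level
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p:.4f}, decision: {decision}")
```

The decision rule compares the p-value with the chosen significance level; a p-value above that level means only that the data do not contradict $H_0$.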

In [1]:
# Initial Imports
import pandas as pd 
import numpy as np 
import statsmodels.api as sm
import scipy.stats
import matplotlib.pyplot as plt
from pathlib import Path

%matplotlib inline

In [2]:
# import csv dataframe 
df = pd.read_csv(Path("data/nhanes_2015_2016.csv"))
df.head()

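| | SEQN | ALQ101 | ALQ110 | ALQ130 | SMQ020 | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDCITZN | DMDEDUC2 | ... | BPXSY2 | BPXDI2 | BMXWT | BMXHT | BMXBMI | BMXLEG | BMXARML | BMXARMC | BMXWAIST | HIQ210 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83732 | 1.0 | NaN | 1.0 | 1 | 1 | 62 | 3 | 1.0 | 5.0 | ... | 124.0 | 64.0 | 94.8 | 184.5 | 27.8 | 43.3 | 43.6 | 35.9 | 101.1 | 2.0 |
| 1 | 83733 | 1.0 | NaN | 6.0 | 1 | 1 | 53 | 3 | 2.0 | 3.0 | ... | 140.0 | 88.0 | 90.4 | 171.4 | 30.8 | 38.0 | 40.0 | 33.2 | 107.9 | NaN |
| 2 | 83734 | 1.0 | NaN | NaN | 1 | 1 | 78 | 3 | 1.0 | 3.0 | ... | 132.0 | 44.0 | 83.4 | 170.1 | 28.8 | 35.6 | 37.0 | 31.0 | 116.5 | 2.0 |
| 3 | 83735 | 2.0 | 1.0 | 1.0 | 2 | 2 | 56 | 3 | 1.0 | 5.0 | ... | 134.0 | 68.0 | 109.8 | 160.9 | 42.4 | 38.5 | 37.7 | 38.3 | 110.1 | 2.0 |
| 4 | 83736 | 2.0 | 1.0 | 1.0 | 2 | 2 | 42 | 4 | 1.0 | 4.0 | ... | 114.0 | 54.0 | 55.2 | 164.9 | 20.3 | 37.4 | 36.0 | 27.2 | 80.4 | 2.0 |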


### Question 1
Conduct a hypothesis test (at the 0.05 level) for the null hypothesis that the proportion of women who smoke is equal to the proportion of men who smoke.

This is a *difference in population proportions test*. Our **populations** are the *women and men represented by the NHANES 2015-2016 sample*, and our **parameter of interest** is the *difference between the proportion of women who smoke ($p_1$) and the proportion of men who smoke ($p_2$)*.

$H_0: p_1-p_2 = 0$

$H_a: p_1-p_2 \neq 0$
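
Under $H_0$ the two groups share a single proportion, so one common way to standardize the observed difference is with a pooled estimate. The following is a sketch of that statistic, where $\hat{p}_1, \hat{p}_2$ are the sample proportions, $n_1, n_2$ the group sizes, and $x_1, x_2$ the numbers of "yes" responses (symbols introduced here for illustration):

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

This has the same shape as the general test statistic: the best estimate is $\hat{p}_1 - \hat{p}_2$, the hypothesized value is $0$, and the denominator is the standard error of the difference under $H_0$.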

In [3]:
df.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210'],
      dtype='object')

In [4]:
# create a df with the required columns for analysis
# SMQ020: 1=yes, 2=no
# RIAGENDR: 1=Male, 2 = Female
population_df = df[['RIAGENDR','SMQ020']]

# check for possible missing data or integer codes that may influence analysis
display(population_df.RIAGENDR.value_counts().to_frame(),population_df.SMQ020.value_counts().to_frame())

| | RIAGENDR |
|---|---|
| 2 | 2976 |
| 1 | 2759 |

| | SMQ020 |
|---|---|
| 2 | 3406 |
| 1 | 2319 |
| 9 | 8 |
| 7 | 2 |


In [5]:
# SMQ020 has multiple integer codes, but we are only interested in 1 (yes) or 2 (no)
# .copy() avoids a SettingWithCopyWarning when the columns are modified below
population_df = population_df[population_df['SMQ020'].isin([1, 2])].copy()
# Rename columns, replace integer codes
population_df.rename(columns={'RIAGENDR':'GENDER', 'SMQ020':'SMOKE'}, inplace = True)
population_df['GENDER'] = population_df['GENDER'].replace({1:'Male', 2:'Female'})
population_df['SMOKE'] = population_df['SMOKE'].replace({1:'Yes', 2:'No'})

population_df.value_counts().unstack()

| GENDER | SMOKE = No | SMOKE = Yes |
|---|---|---|
| Female | 2066 | 906 |
| Male | 1340 | 1413 |


In [64]:
# compute proportion of smokers for male and female
table = pd.crosstab(population_df.GENDER, population_df.SMOKE)
display(table)

proportions_df = population_df.groupby('GENDER').agg({'SMOKE': [lambda x: np.mean(x=='Yes'),'count']})
proportions_df.columns = ['Proportion','n']
display(proportions_df)
# calculate the combined (pooled) proportion of "Yes" responses for both males and females
p_combined = (population_df.SMOKE=='Yes').mean()

# The null hypothesis states that there is no difference in proportions,
# so we estimate a single variance from the combined sample
# (pandas .var() uses ddof=1, i.e. p*(1-p)*n/(n-1))
var = (population_df.SMOKE=='Yes').var()
# calculate standard error for the estimated difference in proportions
se = np.sqrt(var * (1/proportions_df.loc['Female','n'] + 1/proportions_df.loc['Male','n']))

# calculate test statistic and p-value
t_stat = (proportions_df.loc['Female','Proportion']-proportions_df.loc['Male','Proportion'])/se

# degrees of freedom for a two-sample test: n1 + n2 - 2
# (named dof so the NHANES DataFrame `df` is not overwritten)
dof = proportions_df.n.sum()-2
pvalue = scipy.stats.t.sf(abs(t_stat), df=dof)*2 # two-sided p-value; https://www.statology.org/p-value-from-t-score-python/
print(f'combined proportion: {p_combined}')
print(f'population variance: {var}')
print(f'standard error: {se}')
print(f'degrees of freedom: {dof}')
print(f'test statistic: {t_stat}')
print(f'p-value: {pvalue}')

# run the same hypothesis test with statsmodels (a faster approach)
print()
print('*'*50)
print('HYPOTHESIS TESTING USING FASTER APPROACH')
# encode the responses as 1 (Yes) / 0 (No) so each group mean is its smoking proportion
females = (population_df.loc[population_df.GENDER=='Female','SMOKE'] == 'Yes').astype(int)
males = (population_df.loc[population_df.GENDER=='Male','SMOKE'] == 'Yes').astype(int)
# ttest_ind pools the within-group variances, so its statistic differs slightly
# from the manual calculation, which uses the variance of the combined sample
t_stat, pvalue, dof = sm.stats.ttest_ind(females, males)
print(f'degrees of freedom: {dof}')
print(f'test statistic: {t_stat}')
print(f'p_value: {pvalue}')
print(f'p_value using spicy library: {scipy.stats.t.sf(abs(t_stat), df=dof)*2}')
print('*'*50)
print()

# calculate 95% confidence interval using statsmodels library for female and male smoker populations
# create frequency table for the proportion of smokers in the female and male population
table = pd.crosstab(population_df.GENDER, population_df.SMOKE)
fem_lower_bound, fem_upper_bound = sm.stats.proportion_confint(table.loc['Female','Yes'], proportions_df.loc['Female','n'])
male_lower_bound, male_upper_bound =  sm.stats.proportion_confint(table.loc['Male','Yes'], proportions_df.loc['Male','n'])

# calculate the standard error of each group's sample proportion: sqrt(p*(1-p)/n)
# female population
p = proportions_df.loc['Female','Proportion']
n = proportions_df.loc['Female','n']
se_female = np.sqrt(p*(1-p)/n)

# male  population
p = proportions_df.loc['Male','Proportion']
n = proportions_df.loc['Male','n']
se_male = np.sqrt(p*(1-p)/n)

# calculate the 95% confidence interval for the difference in proportions (female - male)
# CI = (p1-p2) +/- 1.96*se_diff
# standard error of the difference of two independent proportions
se_diff= np.sqrt(se_female**2 + se_male**2)
p_female = proportions_df.loc['Female','Proportion']
p_male = proportions_df.loc['Male','Proportion']
lower_bound, upper_bound = (p_female - p_male) - (1.96*se_diff), (p_female - p_male) + (1.96*se_diff)
print(f"95% Confidence Interval for Female Smoker Population: ({fem_lower_bound:.2%},{fem_upper_bound:.2%})")
print(f"95% Confidence Interval for Male Smoker Population: ({male_lower_bound:.2%},{male_upper_bound:.2%})")
print(f"Difference in population proportions for the two populations: {p_female-p_male:.2%}")
print(f"Confidence interval for difference in population proportion: ({lower_bound:.2%},{upper_bound:.2%})")

| GENDER | SMOKE = No | SMOKE = Yes |
|---|---|---|
| Female | 2066 | 906 |
| Male | 1340 | 1413 |

| GENDER | Proportion | n |
|---|---|---|
| Female | 0.304845 | 2972 |
| Male | 0.513258 | 2753 |


combined proportion: 0.4050655021834061
population variance: 0.2410295423544167
standard error: 0.012986597347748223
degrees of freedom: 5723
test statistic: -16.048317820200435
p-value: 1.0007122332903874e-56

**************************************************
HYPOTHESIS TESTING USING FASTER APPROACH
degrees of freedom: 5723.0
test statistic: -16.420585558984445
p_value: 3.0320887866906843e-59
p_value using spicy library: 3.0320887866906843e-59
**************************************************

95% Confidence Interval for Female Smoker Population: (28.83%,32.14%)
95% Confidence Interval for Male Smoker Population: (49.46%,53.19%)
Difference in population proportions for the two populations: -20.84%
Confidence interval for difference in population proportion: (-23.34%,-18.35%)
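
The two test statistics printed above differ slightly (about -16.05 versus -16.42). One plausible explanation, assuming `sm.stats.ttest_ind` ran with its default pooled-variance setting, is that the manual calculation uses the variance of the combined sample, while the pooled t-test uses the within-group variance. The sketch below recomputes both from the crosstab counts using only the standard library; the variable names are illustrative.

```python
import math

# Counts taken from the crosstab printed above
yes_f, n_f = 906, 2972    # female smokers, female respondents
yes_m, n_m = 1413, 2753   # male smokers, male respondents

p_f, p_m = yes_f / n_f, yes_m / n_m
diff = p_f - p_m
n = n_f + n_m
scale = 1 / n_f + 1 / n_m

# (a) variance of the combined 0/1 sample with ddof=1, as in the manual calculation
p_all = (yes_f + yes_m) / n
var_total = p_all * (1 - p_all) * n / (n - 1)
t_total = diff / math.sqrt(var_total * scale)

# (b) pooled within-group variance, as in a classic pooled two-sample t-test
ss_within = n_f * p_f * (1 - p_f) + n_m * p_m * (1 - p_m)
var_within = ss_within / (n - 2)
t_within = diff / math.sqrt(var_within * scale)

print(f"total-variance statistic:  {t_total:.4f}")
print(f"within-variance statistic: {t_within:.4f}")
```

The combined-sample variance also contains the between-group spread, so it is a little larger and yields a statistic slightly smaller in magnitude. With a sample this large, both versions lead to the same conclusion.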


#### Analysis
Based on the estimated difference in sample proportions, we have enough evidence to reject our `null hypothesis`. Our `p-value` is extremely close to zero (far below 0.05), therefore we can infer that there is a significant difference between the proportion of women who smoke and the proportion of men who smoke. At the 5% significance level, this difference is significantly different from zero. This claim is further supported by the 95% confidence interval for the difference in population proportions (-23.34%, -18.35%): the entire range of plausible values falls below zero. Because the difference is computed as female minus male, we can also infer that a larger proportion of the male population smokes than of the female population.
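
The link between the confidence interval and the test decision can be checked directly from the counts. The following is a minimal standard-library sketch, assuming the normal approximation with the critical value 1.96 used above; the variable names are illustrative.

```python
import math

# Counts taken from the crosstab in the notebook output
yes_f, n_f = 906, 2972    # female smokers, female respondents
yes_m, n_m = 1413, 2753   # male smokers, male respondents
p_f, p_m = yes_f / n_f, yes_m / n_m

# Unpooled standard error of the difference of two independent proportions
se_diff = math.sqrt(p_f * (1 - p_f) / n_f + p_m * (1 - p_m) / n_m)

z_crit = 1.96  # approximate 97.5th percentile of the standard normal
diff = p_f - p_m
ci_lower = diff - z_crit * se_diff
ci_upper = diff + z_crit * se_diff

# If zero lies outside the 95% interval, a two-sided test at the 5% level rejects H0
excludes_zero = not (ci_lower <= 0 <= ci_upper)
print(f"95% CI for p_female - p_male: ({ci_lower:.2%}, {ci_upper:.2%})")
print(f"interval excludes zero: {excludes_zero}")
```

Because both bounds are negative, the interval points in the same direction as the test: the proportion of smokers is lower among women than among men.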

https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/SMQ_I.htm#SMQ020
https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIAGENDR