# Hypothesis Testing Using [NHANES](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2) Data

[Hypothesis Testing:](https://www.investopedia.com/terms/h/hypothesistesting.asp) "is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis." The analyst establishes a null hypothesis and uses a test statistic to decide whether to reject it or fail to reject it. Rejecting the null hypothesis means the data favor the alternative; failing to reject it does not prove that the null hypothesis is true.

**Null Hypothesis: $H_0$**

**Alternative Hypothesis: $H_a$**

$test\ statistic = \frac{Best\ Estimate - Hypothesized\ Estimate}{Standard\ Error\ of\ Estimate}$

The statsmodels library will be used throughout this notebook.

[Quick Reference](https://online.stat.psu.edu/stat200/book/export/html/196)

In [1]:
# Initial Imports
import pandas as pd 
import numpy as np 
import statsmodels.api as sm
import scipy.stats
import matplotlib.pyplot as plt
from pathlib import Path

%matplotlib inline

In [2]:
# import csv dataframe 
df = pd.read_csv(Path("data/nhanes_2015_2016.csv"))
df.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


### Question 1
Conduct a hypothesis test (at the 0.05 level) for the null hypothesis that the proportion of women who smoke is equal to the proportion of men who smoke.

This is a *difference in population proportions test*. Our **population** is *the women and men represented by the NHANES sample*, and our **parameter of interest** is the *difference between the proportion of women who smoke and the proportion of men who smoke*.

$H_0: p_1-p_2 = 0$

$H_a: p_1-p_2 \neq 0$

In [3]:
df.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210'],
      dtype='object')

In [4]:
# create a df with the required columns for analysis
# SMQ020: 1=yes, 2=no
# RIAGENDR: 1=Male, 2 = Female
population_df = df[['RIAGENDR','SMQ020']]

# check for possible missing data or integer codes that may influence analysis
display(population_df.RIAGENDR.value_counts().to_frame(),population_df.SMQ020.value_counts().to_frame())

Unnamed: 0,RIAGENDR
2,2976
1,2759


Unnamed: 0,SMQ020
2,3406
1,2319
9,8
7,2


In [5]:
# SMQ020 has multiple integer codes, but we are only interested in 1 (yes) or 2 (no)
# .copy() avoids pandas' SettingWithCopyWarning when the columns are modified below
population_df = population_df[population_df['SMQ020'].isin([1, 2])].copy()
# Rename columns, replace integer codes
population_df = population_df.rename(columns={'RIAGENDR':'GENDER', 'SMQ020':'SMOKE'})
population_df['GENDER'] = population_df['GENDER'].replace({1:'Male', 2:'Female'})
population_df['SMOKE'] = population_df['SMOKE'].replace({1:'Yes', 2:'No'})

population_df.value_counts().unstack()

GENDER,SMOKE=No,SMOKE=Yes
Female,2066,906
Male,1340,1413


In [6]:
# compute proportion of smokers for male and female
table = pd.crosstab(population_df.GENDER, population_df.SMOKE)
display(table)

proportions_df = population_df.groupby('GENDER').agg({'SMOKE': [lambda x: np.mean(x=='Yes'),'count']})
proportions_df.columns = ['Proportion','n']
display(proportions_df)
# calculate the combined proportion of 'Yes' responses for both males and females
p_combined = (population_df.SMOKE=='Yes').mean()

# Null hypothesis states that there is no difference in proportions,
# so we assume both groups share the same variance
# Estimate that common variance from the combined sample
var = (population_df.SMOKE=='Yes').var()
# calculate standard error for estimated difference in proportions
se = np.sqrt(var * (1/proportions_df.loc['Female','n'] + 1/proportions_df.loc['Male','n']))

# calculate test statistic and p-value
t_stat = (proportions_df.loc['Female','Proportion']-proportions_df.loc['Male','Proportion'])/se

# degrees of freedom for a two-sample test: n1 + n2 - 2
d_freedom = proportions_df.n.sum()-2
# two-sided p-value: double the upper-tail probability of |t|
pvalue = scipy.stats.t.sf(abs(t_stat), df=d_freedom)*2 # https://www.statology.org/p-value-from-t-score-python/
print(f'combined proportion: {p_combined}')
print(f'population variance: {var}')
print(f'standard error: {se}')
print(f'degrees of freedom: {d_freedom}')
print(f'test statistic: {t_stat}')
print(f'p-value: {pvalue}')

# run the hypothesis test using a different (faster) approach
print()
print('*'*50)
print('HYPOTHESIS TESTING USING FASTER APPROACH')
females = population_df.loc[population_df.GENDER=='Female','SMOKE'].replace({'Yes':1, 'No':0})
males =  population_df.loc[population_df.GENDER=='Male','SMOKE'].replace({'Yes':1, 'No':0})
t_stat, pvalue, d_freedom = sm.stats.ttest_ind(females, males)
print(f'degrees of freedom: {d_freedom}')
print(f'test statistic: {t_stat}')
print(f'p_value: {pvalue}')
print(f'p_value using spicy library: {scipy.stats.t.sf(abs(t_stat), df=d_freedom)*2}')
print('*'*50)
print()

# calculate 95% confidence interval using statsmodels library for female and male smoker populations
# create frequency table for the proportion of smokers in the female and male population
table = pd.crosstab(population_df.GENDER, population_df.SMOKE)
fem_lower_bound, fem_upper_bound = sm.stats.proportion_confint(table.loc['Female','Yes'], proportions_df.loc['Female','n'])
male_lower_bound, male_upper_bound =  sm.stats.proportion_confint(table.loc['Male','Yes'], proportions_df.loc['Male','n'])

# calculate the standard error of each group's sample proportion
# female population
p = proportions_df.loc['Female','Proportion']
n = proportions_df.loc['Female','n']
se_female = np.sqrt(p*(1-p)/n)

# male  population
p = proportions_df.loc['Male','Proportion']
n = proportions_df.loc['Male','n']
se_male = np.sqrt(p*(1-p)/n)

# calculate confidence interval for the difference in proportion between female and male smokers
# CI = (p1 - p2) +/- 1.96 * se_diff
# standard error for the difference of two proportions (unpooled)
se_diff= np.sqrt(se_female**2 + se_male**2)
p_female = proportions_df.loc['Female','Proportion']
p_male = proportions_df.loc['Male','Proportion']
lower_bound, upper_bound = (p_female - p_male) - (1.96*se_diff), (p_female - p_male) + (1.96*se_diff)
print(f"95% Confidence Interval for Female Smoker Population: ({fem_lower_bound:.2%},{fem_upper_bound:.2%})")
print(f"95% Confidence Interval for Male Smoker Population: ({male_lower_bound:.2%},{male_upper_bound:.2%})")
print(f"Difference in population proportions for the two populations: {p_female-p_male:.2%}")
print(f"Confidence interval for difference in population proportion: ({lower_bound:.2%},{upper_bound:.2%})")

GENDER,SMOKE=No,SMOKE=Yes
Female,2066,906
Male,1340,1413


GENDER,Proportion,n
Female,0.304845,2972
Male,0.513258,2753


combined proportion: 0.4050655021834061
population variance: 0.2410295423544167
standard error: 0.012986597347748223
degrees of freedom: 5723
test statistic: -16.048317820200435
p-value: 1.0007122332903874e-56

**************************************************
HYPOTHESIS TESTING USING FASTER APPROACH
degrees of freedom: 5723.0
test statistic: -16.420585558984445
p_value: 3.0320887866906843e-59
p_value using spicy library: 3.0320887866906843e-59
**************************************************

95% Confidence Interval for Female Smoker Population: (28.83%,32.14%)
95% Confidence Interval for Male Smoker Population: (49.46%,53.19%)
Difference in population proportions for the two populations: -20.84%
Confidence interval for difference in population proportion: (-23.34%,-18.35%)
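
The last interval printed above can be reproduced from the four cell counts alone. The sketch below uses only the standard library and assumes the same unpooled standard error and the 1.96 multiplier used in the cell; the variable names are illustrative.

```python
import math

# Counts taken from the crosstab above
yes_f, n_f = 906, 2972    # women: "Yes" responses, group size
yes_m, n_m = 1413, 2753   # men: "Yes" responses, group size

p_f, p_m = yes_f / n_f, yes_m / n_m

# Unpooled standard error: each group keeps its own variance p(1-p)/n
se_diff = math.sqrt(p_f * (1 - p_f) / n_f + p_m * (1 - p_m) / n_m)

diff = p_f - p_m
z_crit = 1.96  # approximate 97.5th percentile of the standard normal
lower, upper = diff - z_crit * se_diff, diff + z_crit * se_diff
print(f"difference: {diff:.2%}, 95% CI: ({lower:.2%},{upper:.2%})")
```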


#### Analysis

Based on the estimated difference in sample proportions, we have enough evidence to reject the `null hypothesis`. The `p-value` is effectively zero (far below 0.05), so at the 5% significance level the difference between the proportion of women who smoke and the proportion of men who smoke is significantly different from zero. This conclusion is further supported by the 95% confidence interval for the difference in population proportions (-23.34%, -18.35%): the whole range of plausible values falls below zero. Because the difference is computed as female minus male, we can also infer that a larger proportion of men than women smoke.
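
The two test statistics reported above (about -16.05 and -16.42) differ because they use different standard errors. The manual calculation uses the variance of the combined sample, which is very close to a pooled-proportion z-test, whereas `ttest_ind` with its default settings pools the within-group variances. The sketch below rebuilds both from the counts under that reading; it is an illustration with made-up variable names, not the library's internal code.

```python
import math

yes_f, n_f = 906, 2972    # women: "Yes" responses, group size
yes_m, n_m = 1413, 2753   # men: "Yes" responses, group size
p_f, p_m = yes_f / n_f, yes_m / n_m
diff = p_f - p_m

# (a) Pooled-proportion z statistic: one common proportion under H0
p_pool = (yes_f + yes_m) / (n_f + n_m)
se_z = math.sqrt(p_pool * (1 - p_pool) * (1 / n_f + 1 / n_m))
z_stat = diff / se_z

# (b) Pooled two-sample t statistic: within-group variances are pooled
ss_f = n_f * p_f * (1 - p_f)   # sum of squared deviations for 0/1 data
ss_m = n_m * p_m * (1 - p_m)
var_pooled = (ss_f + ss_m) / (n_f + n_m - 2)
se_t = math.sqrt(var_pooled * (1 / n_f + 1 / n_m))
t_stat = diff / se_t

print(f"pooled-proportion z: {z_stat:.2f}")
print(f"pooled two-sample t: {t_stat:.2f}")
```

Either way the statistic is far out in the tail, so the conclusion does not depend on which standard error is used.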

---
---

### Question 2

Partition the population into two groups based on whether a person has graduated college or not, using the educational attainment variable [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2).  Then conduct a test of the null hypothesis that the average heights (in centimeters) of the two groups are equal.  Next, convert the heights from centimeters to inches, and conduct a test of the null hypothesis that the average heights (in inches) of the two groups are equal.

This is a *difference in population means test*. Our **population** is *people grouped by whether or not they have graduated from college*, and our **parameter of interest** is the *difference in mean height between people who have graduated from college and people who have not*.

$H_0: \mu_1-\mu_2 = 0$

$H_a: \mu_1-\mu_2 \neq 0$

In [7]:
# prepare population dataframe for people that have graduated from college and those who have not
population_df = df[['DMDEDUC2',"BMXHT"]].copy()
population_df.columns = ['graduated', 'standing_height']

# integer code 5 = college graduate or above; every other value (including missing) is classified as not graduated
population_df.graduated = population_df.graduated.apply(lambda x: 'yes' if x == 5 else 'no')

# separate population into two groups
# It is important to drop any null values; missing data produces NaN results in the calculations
graduated = population_df[population_df.graduated =='yes']['standing_height'].dropna()
not_graduated = population_df[population_df.graduated =='no']['standing_height'].dropna()

# calculate test statistic and p-value using z-test
# https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ztest.html
print(f'Graduated person average height: {graduated.mean()}')
print(f'Not graduated person average height: {not_graduated.mean()}')
print(f'sample mean difference: {graduated.mean()-not_graduated.mean()}')
print('(test statistic, p-value)')
print(sm.stats.ztest(graduated, not_graduated))

# calculate test statistic and p-value using t-distribution
print('(test statistic, p-value, degrees of freedom)')
print(sm.stats.ttest_ind(graduated, not_graduated))

# convert height to inches and rerun the hypothesis tests
# 1 cm = 0.393701 inches 
graduated = graduated * 0.393701
not_graduated = not_graduated * 0.393701

print('*'*50)
print('RESULTS AFTER CONVERTING TO INCHES')
print(f'Graduated person average height: {graduated.mean()}')
print(f'Not graduated person average height: {not_graduated.mean()}')
print(f'sample mean difference: {graduated.mean()-not_graduated.mean()}')
print('(test statistic, p-value)')
print(sm.stats.ztest(graduated, not_graduated))

# calculate test statistic and p-value using t-distribution
print('(test statistic, p-value, degrees of freedom)')
print(sm.stats.ttest_ind(graduated, not_graduated))



Graduated person average height: 167.85535055350562
Not graduated person average height: 165.60544233441394
sample mean difference: 2.2499082190916795
(test statistic, p-value)
(7.2007656698012505, 5.987537169067422e-13)
(test statistic, p-value, degrees of freedom)
(7.200765669801341, 6.766583030851776e-13, 5671.0)
**************************************************
RESULTS AFTER CONVERTING TO INCHES
Graduated person average height: 66.08481936826571
Not graduated person average height: 65.19902825250139
sample mean difference: 0.8857911157643201
(test statistic, p-value)
(7.200765669801326, 5.987537169064107e-13)
(test statistic, p-value, degrees of freedom)
(7.200765669801211, 6.766583030858192e-13, 5671.0)


#### Analysis
Based on the estimated difference in sample means and a `p-value < 0.05`, we have enough evidence to reject the null hypothesis. The test statistic measures how far the estimate is from the hypothesized population parameter, in units of standard errors. The further away it is, the less plausible the null hypothesis becomes. In this study the estimated mean difference is 2.25 cm and the test statistic is 7.2. From the p-value we can infer that the difference is significantly different from zero.
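
The z-test and the t-test above report the same statistic but slightly different p-values, because the t distribution with 5671 degrees of freedom has marginally heavier tails than the standard normal. As a rough check, the normal-based p-value can be recomputed from the statistic with the standard library alone; the variable names here are illustrative.

```python
import math

z = 7.2007656698012505  # test statistic reported above

# Two-sided p-value under the standard normal: P(|Z| >= z) = erfc(z / sqrt(2))
p_normal = math.erfc(z / math.sqrt(2))
print(f"two-sided normal p-value: {p_normal:.4e}")
```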

We can also confirm that changing the units of measurement from centimeters to inches has no effect on the test statistic or the p-value, because the conversion rescales the mean difference and its standard error by the same factor.
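
The unit invariance can be checked on a small example. The sketch below uses hypothetical heights (not NHANES data) and a hand-written pooled t statistic, `pooled_t`, which is an illustrative helper rather than a statsmodels function.

```python
import math
import statistics

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    var_pooled = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    se = math.sqrt(var_pooled * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical heights in centimeters (not NHANES data)
group_a_cm = [170.2, 165.4, 180.1, 158.9, 172.5, 168.0]
group_b_cm = [160.3, 166.8, 155.2, 171.0, 162.4, 159.7]

CM_TO_IN = 0.393701
t_cm = pooled_t(group_a_cm, group_b_cm)
t_in = pooled_t([x * CM_TO_IN for x in group_a_cm],
                [x * CM_TO_IN for x in group_b_cm])
print(f"t (cm): {t_cm:.6f}, t (inches): {t_in:.6f}")
```

The conversion factor multiplies both the numerator and the denominator of the statistic, so it cancels.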

---
---
### Question 3

Conduct a hypothesis test of the null hypothesis that the average BMI for men between 30 and 40 is equal to the average BMI for men between 50 and 60.  Then carry out this test again after log transforming the BMI values.

This is a *difference in population means test*. Our **population** is *men aged 30 to 40 and men aged 50 to 60*, and our **parameter of interest** is the *difference in mean body mass index between men aged 30 to 40 and men aged 50 to 60*.

We will use a 5% significance level, meaning that if the p-value > 0.05, we will fail to reject the null hypothesis.
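
The decision rule can be written as a tiny helper. This is only a sketch: `decide` is a hypothetical function name and the two p-values passed to it are illustrative.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a given significance level."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.369))  # illustrative p-value above alpha
print(decide(0.001))  # illustrative p-value below alpha
```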


$H_0: \mu_1-\mu_2 = 0$

$H_a: \mu_1-\mu_2 \neq 0$

In [32]:
# prepare population sample dataset
# separate into two groups: male age 30-40, and male age 50-60
population_df = df[['RIAGENDR','RIDAGEYR','BMXBMI']].copy()
population_df.columns = ['gender','age','bmi']
population_df.gender = population_df.gender.replace({1:'male',2:'female'})
population_df = population_df[population_df.gender == 'male']

# grab bmi column of the populations
male_30_40 = population_df[(population_df.age >= 30) & (population_df.age <= 40)]['bmi'].dropna()
male_50_60 = population_df[(population_df.age >= 50) & (population_df.age <= 60)]['bmi'].dropna()

# calculate avg bmi for each group and difference in means
print(f'average bmi of men between 30 and 40: {male_30_40.mean()}')
print(f'average bmi of men between 50 and 60: {male_50_60.mean()}')
print(f'sample mean difference: {male_30_40.mean()-male_50_60.mean()}')

# calculate test statistic and p-value using z-test
# https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ztest.html
print('(test statistic, p-value)')
print(sm.stats.ztest(male_30_40, male_50_60))

# calculate test statistic and p-value using t-distribution
print('(test statistic, p-value, degrees of freedom)')
print(sm.stats.ttest_ind(male_30_40, male_50_60))

# Log transform BMI values and run hypothesis test again 
male_30_40 = np.log(male_30_40)
male_50_60 = np.log(male_50_60)

# calculate avg log(bmi) for each group and difference in means
print('*'*50)
print('RESULTS AFTER LOG TRANSFORMING BMI VALUES')
print(f'average bmi of men between 30 and 40: {male_30_40.mean()}')
print(f'average bmi of men between 50 and 60: {male_50_60.mean()}')
print(f'sample mean difference: {male_30_40.mean()-male_50_60.mean()}')

# calculate test statistic and p-value using z-test
# https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ztest.html
print('(test statistic, p-value)')
print(sm.stats.ztest(male_30_40, male_50_60))

# calculate test statistic and p-value using t-distribution
print('(test statistic, p-value, degrees of freedom)')
print(sm.stats.ttest_ind(male_30_40, male_50_60))

average bmi of men between 30 and 40: 29.613729508196734
average bmi of men between 50 and 60: 29.249999999999996
sample mean difference: 0.36372950819673733
(test statistic, p-value)
(0.8984008016755222, 0.36897190924214873)
(test statistic, p-value, degrees of freedom)
(0.898400801675522, 0.36919303123271296, 978.0)
**************************************************
RESULTS AFTER LOG TRANSFORMING BMI VALUES
average bmi of men between 30 and 40: 3.3653907870271893
average bmi of men between 50 and 60: 3.35623020612263
sample mean difference: 0.009160580904559179
(test statistic, p-value)
(0.7057844184100666, 0.4803222133688403)
(test statistic, p-value, degrees of freedom)
(0.7057844184100324, 0.4804902912676029, 978.0)


#### Analysis
Based on this study, we do not have enough evidence to reject the null hypothesis. The `p-value` is greater than 0.05, so we fail to reject the hypothesis that the average BMI of men between 30 and 40 and the average BMI of men between 50 and 60 are equal. At a 5% significance level, the difference in means is not statistically significant.

Log transforming BMI did not change our conclusion; it resulted in a larger p-value (0.48 versus 0.37).
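
One way to read the log-scale result: a difference in mean log values is the log of a ratio of geometric means, so exponentiating it gives a multiplicative comparison between the two groups. The sketch below applies that to the difference reported above; the variable names are illustrative.

```python
import math

log_diff = 0.009160580904559179  # difference in mean log(BMI) reported above

# exp of a difference in mean logs is a ratio of geometric means
ratio = math.exp(log_diff)
print(f"geometric mean ratio (30-40 vs 50-60): {ratio:.4f}")
```

A ratio this close to 1 (under 1% apart) is consistent with failing to reject the null hypothesis.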

---
---

### Question 4

Suppose we wish to compare the mean BMI between college graduates and people who have not graduated from college, focusing on women between the ages of 30 and 40.  First, consider the variance of BMI within each of these subpopulations using graphical techniques, and through the estimated subpopulation variances.  Then, calculate pooled and unpooled estimates of the standard error for the difference between the mean BMI in the two populations being compared.  Finally, test the null hypothesis that the two population means are equal, using each of the two different standard errors.
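
The two standard errors named in the question can be sketched as follows. This is only an illustration on hypothetical BMI values (not NHANES data), and `pooled_se` and `unpooled_se` are made-up helper names. The pooled version assumes both groups share one variance; the unpooled version lets each group keep its own.

```python
import math
import statistics

def pooled_se(a, b):
    """Standard error of the mean difference assuming a common variance."""
    n_a, n_b = len(a), len(b)
    var_pooled = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    return math.sqrt(var_pooled * (1 / n_a + 1 / n_b))

def unpooled_se(a, b):
    """Standard error of the mean difference with separate group variances."""
    return math.sqrt(statistics.variance(a) / len(a)
                     + statistics.variance(b) / len(b))

# Hypothetical BMI values (not NHANES data)
grads = [22.1, 24.8, 27.3, 25.0, 23.6, 29.4, 26.2]
non_grads = [31.5, 24.9, 35.2, 28.7, 22.3, 38.1, 27.4, 33.0]

diff = statistics.mean(grads) - statistics.mean(non_grads)
for name, se in [("pooled", pooled_se(grads, non_grads)),
                 ("unpooled", unpooled_se(grads, non_grads))]:
    print(f"{name}: se={se:.3f}, test statistic={diff / se:.3f}")
```

With equal group sizes the two standard errors coincide; they diverge when both the sizes and the variances differ. In statsmodels, `sm.stats.ttest_ind` accepts `usevar='pooled'` or `usevar='unequal'` to select between the two.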



[SMQ020](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/SMQ_I.htm#SMQ020)

[RIAGENDR](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIAGENDR)