# Practice notebook for hypothesis tests using NHANES data

This notebook will give you the opportunity to perform some hypothesis tests with the NHANES data that are similar to
what was done in the week 3 case study notebook.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import scipy.stats
import numpy as np

In [2]:
da = pd.read_csv("nhanes_2015_2016.csv")

In [3]:
da.head()

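|   | SEQN | ALQ101 | ALQ110 | ALQ130 | SMQ020 | RIAGENDR | RIDAGEYR | RIDRETH1 | DMDCITZN | DMDEDUC2 | ... | BPXSY2 | BPXDI2 | BMXWT | BMXHT | BMXBMI | BMXLEG | BMXARML | BMXARMC | BMXWAIST | HIQ210 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83732 | 1.0 | NaN | 1.0 | 1 | 1 | 62 | 3 | 1.0 | 5.0 | ... | 124.0 | 64.0 | 94.8 | 184.5 | 27.8 | 43.3 | 43.6 | 35.9 | 101.1 | 2.0 |
| 1 | 83733 | 1.0 | NaN | 6.0 | 1 | 1 | 53 | 3 | 2.0 | 3.0 | ... | 140.0 | 88.0 | 90.4 | 171.4 | 30.8 | 38.0 | 40.0 | 33.2 | 107.9 | NaN |
| 2 | 83734 | 1.0 | NaN | NaN | 1 | 1 | 78 | 3 | 1.0 | 3.0 | ... | 132.0 | 44.0 | 83.4 | 170.1 | 28.8 | 35.6 | 37.0 | 31.0 | 116.5 | 2.0 |
| 3 | 83735 | 2.0 | 1.0 | 1.0 | 2 | 2 | 56 | 3 | 1.0 | 5.0 | ... | 134.0 | 68.0 | 109.8 | 160.9 | 42.4 | 38.5 | 37.7 | 38.3 | 110.1 | 2.0 |
| 4 | 83736 | 2.0 | 1.0 | 1.0 | 2 | 2 | 42 | 4 | 1.0 | 4.0 | ... | 114.0 | 54.0 | 55.2 | 164.9 | 20.3 | 37.4 | 36.0 | 27.2 | 80.4 | 2.0 |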


## Question 1

Conduct a hypothesis test (at the 0.05 level) for the null hypothesis that the proportion of women who smoke is equal to the proportion of men who smoke.

In [4]:
da.RIAGENDR.unique()

array([1, 2])

In [5]:
# 1 = men, 2 = women
gender = da.RIAGENDR
men = da[gender == 1]
women = da[gender == 2]

In [6]:
# men smoker information
men_smoker_count = np.sum(men.SMQ020 == 1)
men_count = men.size
"There are {} men, of whom {} are smokers".format(men_count, men_smoker_count)

'There are 77252 men, of whom 1413 are smokers'

In [7]:
# women smoker information
women_smoker_count = np.sum(women.SMQ020 == 1)
women_count = women.size
"There are {} women, of whom {} are smokers".format(women_count, women_smoker_count)

'There are 83328 women, of whom 906 are smokers'
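One caveat about the two counts above: `DataFrame.size` returns the number of cells (rows times columns), not the number of rows, so 77252 and 83328 are probably cell counts rather than head counts. The sketch below illustrates the difference on a small made-up frame; the column names are borrowed from the NHANES file only for familiarity, and every value is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: 4 respondents, 3 columns (all values are made up).
toy = pd.DataFrame({
    "RIAGENDR": [1, 1, 2, 2],
    "SMQ020": [1, 2, 1, 2],
    "BMXHT": [180.0, 175.5, 162.3, 158.9],
})
men_toy = toy[toy.RIAGENDR == 1]

cells = men_toy.size       # rows * columns
rows = men_toy.shape[0]    # number of rows, i.e. number of respondents
print("cells:", cells, "rows:", rows)
```

If the recorded counts are indeed cell counts, the proportions, standard error, z statistic, and confidence intervals computed from them below would all change once `shape[0]` (or `len`) is used instead.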

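$\displaystyle z = \frac{\hat p_1 - \hat p_2}{\text{standard error}(\hat p)}$

$\displaystyle \text{standard error}(\hat p) = \sqrt{\hat p (1-\hat p) \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

$\displaystyle \hat p = \frac{n_1 \hat p_1 + n_2 \hat p_2}{n_1+n_2}$

$\hat p_1$ and $\hat p_2$ are the sample proportions of smokers among men and women, and $\hat p$ is the pooled proportion.

$n_1$ and $n_2$ are the corresponding sample sizes (the number of men and the number of women).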
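The formulas above can be collected into one small function. This is only a sketch using the standard library; the function name `two_proportion_z_test` and the counts passed to it are hypothetical and are not NHANES values.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for equality of two proportions.

    x1, x2 are the success counts and n1, n2 the group sizes (numbers of people).
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled proportion under the null hypothesis that both groups share one rate
    p_hat = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Two-sided p-value: 2 * P(Z > |z|) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 50 of 100 in group 1 versus 30 of 100 in group 2
z_demo, p_demo = two_proportion_z_test(50, 100, 30, 100)
print("z = {:.3f}, p-value = {:.4f}".format(z_demo, p_demo))
```

Note that `(x1 + x2) / (n1 + n2)` is the same quantity as the weighted average in the formula for $\hat p$, since $n_1 \hat p_1$ is just the count of smokers in group 1.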

In [8]:
p1 = men_smoker_count / men_count
p2 = women_smoker_count / women_count
p1 - p2

0.00741809273546316

In [9]:
phat = (men_count * p1 + women_count * p2) / (men_count + women_count)
print("p-hat is {:.5f}".format(phat))

se = np.sqrt(phat * (1-phat) * (1/men_count + 1/women_count))
print("The standard error of our measurement is {:.5f}".format(se))

z = (p1 - p2) / se
print("The z-value is {:.2f}".format(z))

p-hat is 0.01444
The standard error of our measurement is 0.00060
The z-value is 12.45


In [10]:
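# The survival function sf(x) = 1 - cdf(x) keeps precision in the far tail;
# abs(z) makes the two-sided p-value correct whichever group has the larger proportion
p_one_tail = scipy.stats.norm.sf(np.abs(z))
p_two_tail = 2 * p_one_tail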
print("p-value is {:.30f}".format(p_two_tail))

p-value is 0.000000000000000000000000000000
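A p-value printed as exactly zero deserves a second look. For a z statistic this large, the normal CDF is rounded to 1.0 in floating point, so subtracting it from 1 gives exactly zero, and even an accurate tail probability is smaller than 30 decimal places can show. The sketch below uses only the standard library and the rounded value 12.45 from the output above; `scipy.stats.norm.sf` is the SciPy counterpart of the stable computation.

```python
import math
from statistics import NormalDist

z_big = 12.45  # the z statistic reported above, rounded to two decimals

# Naive upper tail: cdf(z_big) is already rounded to 1.0, so the difference is exactly 0.0
naive_tail = 1 - NormalDist().cdf(z_big)

# Upper tail via the complementary error function, which stays accurate far into the tail
stable_tail = 0.5 * math.erfc(z_big / math.sqrt(2))

# Scientific notation shows the magnitude that a fixed-point format hides
print("naive two-sided p-value:  {:.3e}".format(2 * naive_tail))
print("stable two-sided p-value: {:.3e}".format(2 * stable_tail))
```

Either way the conclusion at the 0.05 level is the same; the point is only that the true p-value is extremely small rather than literally zero.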


__Q1a.__ Write 1-2 sentences explaining the substance of your findings to someone who does not know anything about statistical hypothesis tests.

__Ans:__ The smoking rates of men and women in this sample differ by far more than chance alone would plausibly produce, so we have strong evidence that the proportion of men who smoke is different from the proportion of women who smoke in the population.

__Q1b.__ Construct three 95% confidence intervals: one for the proportion of women who smoke, one for the proportion of men who smoke, and one for the difference in the rates of smoking between women and men.

In [11]:
# It will be useful to have our z-multiplier handy beforehand for a 95% two-sided interval 
z_multiplier = scipy.stats.norm.ppf(0.975)

**Formula**

$\displaystyle CI = \hat p \pm z \sqrt{\frac{\hat p (1-\hat p)}{n}}$

In [12]:
# CI for men
lower_bound_males = p1 - z_multiplier * np.sqrt(p1 * (1 - p1) / men_count)
upper_bound_males = p1 + z_multiplier * np.sqrt(p1 * (1 - p1) / men_count)
"A confidence interval for the population proportion of males who smoke is from {:.6f} to {:.6f}".format(
    lower_bound_males, 
    upper_bound_males
)

'A confidence interval for the population proportion of males who smoke is from 0.017346 to 0.019236'

In [13]:
# CI for women
lower_bound_females = p2 - z_multiplier * np.sqrt(p2 * (1 - p2) / women_count)
upper_bound_females = p2 + z_multiplier * np.sqrt(p2 * (1 - p2) / women_count)
"A confidence interval for the population proportion of females who smoke is from {:.6f} to {:.6f}".format(
    lower_bound_females, 
    upper_bound_females
)

'A confidence interval for the population proportion of females who smoke is from 0.010169 to 0.011577'

**Formula**

$\displaystyle CI = (\hat p_1 - \hat p_2) \pm z \cdot \text{standard error}$

$\displaystyle \text{standard error} = \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}}$

In [14]:
se_diff = np.sqrt(p1 * (1-p1)/men_count + p2 * (1-p2)/women_count)
lower_bound_difference = (p1 - p2) - z_multiplier * se_diff
upper_bound_difference = (p1 - p2) + z_multiplier * se_diff
"A confidence interval for the difference in the population proportion of males and females who smoke is from {:.6f} to {:.6f}".format(
    lower_bound_difference, 
    upper_bound_difference
)

'A confidence interval for the difference in the population proportion of males and females who smoke is from 0.006240 to 0.008597'

__Q1c.__ Comment on any ways in which the confidence intervals that you found in part b reinforce, contradict, or add support to the hypothesis test conducted in part a.

The confidence interval for the difference between $p_1$ and $p_2$, $(0.006240, 0.008597)$, does not contain zero. This agrees with the hypothesis test in part a, which rejected the null hypothesis that the difference is zero: the two proportions differ by a statistically significant amount.

It would be incorrect to use the separate confidence intervals for $p_1$ and $p_2$ to assess the significance of their difference. While it is reassuring that they do not overlap, the confidence interval that directly addresses whether $p_1$ and $p_2$ differ is the one for their difference.
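To see why the separate intervals are the wrong tool, here is a small sketch with made-up numbers in which the two individual 95% intervals overlap and yet the interval for the difference excludes zero. The helper names `proportion_ci` and `difference_ci`, the proportions, and the sample sizes are all hypothetical.

```python
import math
from statistics import NormalDist

z_mult = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% two-sided interval

def proportion_ci(p_hat, n):
    """Normal-approximation interval for a single proportion."""
    margin = z_mult * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def difference_ci(p1_hat, n1, p2_hat, n2):
    """Normal-approximation interval for p1 - p2 using the unpooled standard error."""
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_mult * se, diff + z_mult * se

# Hypothetical groups: 50% versus 45%, with 1000 people in each
ci_a = proportion_ci(0.50, 1000)
ci_b = proportion_ci(0.45, 1000)
ci_diff = difference_ci(0.50, 1000, 0.45, 1000)

intervals_overlap = ci_a[0] < ci_b[1] and ci_b[0] < ci_a[1]
diff_excludes_zero = ci_diff[0] > 0 or ci_diff[1] < 0
print("individual intervals overlap:", intervals_overlap)
print("difference interval excludes zero:", diff_excludes_zero)
```

The standard error of the difference is smaller than the sum of the two individual standard errors, which is why checking for overlap is more conservative than the interval for the difference.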

## Question 2

Partition the population into two groups based on whether a person has graduated college or not, using the educational attainment variable [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2).  Then conduct a test of the null hypothesis that the average heights (in centimeters) of the two groups are equal.  Next, convert the heights from centimeters to inches, and conduct a test of the null hypothesis that the average heights (in inches) of the two groups are equal.

In [15]:
graduated = da[da['DMDEDUC2'] == 5]
not_graduated = da[da['DMDEDUC2'].isin([1, 2, 3, 4])]

In [16]:
"We have {} people who graduated college and {} who did not graduate college in our sample".format(
    graduated.shape[0], 
    not_graduated.shape[0])

'We have 1366 people who graduated college and 4105 who did not graduate college in our sample'

In [17]:
height_graduated = graduated['BMXHT'].dropna()
height_not_graduated = not_graduated['BMXHT'].dropna()

In [18]:
# Make sure that we cleaned the dataset properly
assert pd.notnull(height_graduated).all()
assert pd.notnull(height_not_graduated).all()

As a rule of thumb, if the larger sample standard deviation is less than twice the smaller one, we can pool the variances. Otherwise, we should not conduct a pooled test.

In [19]:
sd_grad_heights = np.std(height_graduated, ddof=1)       # N - ddof = degree of freedom
sd_ngrad_heights = np.std(height_not_graduated, ddof=1)  # N - ddof = degree of freedom

display("The standard deviation of college grad heights is: {:.3f}".format(sd_grad_heights))
display("The standard deviation of non college grad heights is: {:.3f}".format(sd_ngrad_heights))

'The standard deviation of college grad heights is: 9.705'

'The standard deviation of non college grad heights is: 10.174'
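The pooling rule of thumb and the pooled t statistic it leads to can be sketched with the standard library alone. The function name `pooled_t` and the two tiny height samples are hypothetical; they are not drawn from NHANES. The last lines also check that rescaling from centimeters to inches leaves the t statistic unchanged, which is relevant to the second half of this question.

```python
import math
from statistics import mean, stdev

def pooled_t(a, b):
    """Pooled two-sample t statistic, its degrees of freedom, and the SD ratio."""
    n1, n2 = len(a), len(b)
    s1, s2 = stdev(a), stdev(b)  # sample SDs (n - 1 in the denominator)
    ratio = max(s1, s2) / min(s1, s2)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(a) - mean(b)) / se
    return t, n1 + n2 - 2, ratio

# Hypothetical heights in centimeters
heights_a_cm = [170.0, 165.0, 180.0, 175.0]
heights_b_cm = [160.0, 172.0, 168.0, 164.0]

t_cm, df, sd_ratio = pooled_t(heights_a_cm, heights_b_cm)
print("SD ratio = {:.2f}, t = {:.4f}, df = {}".format(sd_ratio, t_cm, df))

# Same data in inches: numerator and standard error shrink by the same factor
t_in, _, _ = pooled_t([h / 2.54 for h in heights_a_cm],
                      [h / 2.54 for h in heights_b_cm])
print("t in inches = {:.4f}".format(t_in))
```

Here `len(a)` and `len(b)` count only the values actually passed in, so any missing heights should be dropped before the sample sizes are taken.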

In [20]:
# Since the standard deviations are so close (10.174 / 9.705 < 2), we can conduct a pooled 
# t-test for the differences in the means

# It is handy to have some general things computed about the two groups before we begin
n_college = graduated.shape[0]
n_no_college = not_graduated.shape[0]