# Hypothesis Testing (one sample)

In [18]:
import numpy as np
import pandas as pd
from scipy import stats

## One-sided hypothesis testing

#### Example: Pharmaceutical Company

A pharmaceutical company is trying out a medication for lowering blood sugar and managing diabetes. It is known that any level of Hemoglobin A1c below 5.7% is considered normal. The drug company has treated 100 study volunteers with this medication and would like to prove that after treatment their mean A1c is below 5.7%.

In [3]:
# H_0: mean A1c >= 5.7   vs.   H_1: mean A1c < 5.7
# (rejecting H_0 supports the claim that the mean A1c after treatment is below 5.7)

In [4]:
pop_mean = 5.7
sample_mean = 5.1
sample_std = 1.6
n = 100

# t statistic: (sample mean - hypothesised mean) / standard error, the same formula we learned in class
statistic = (sample_mean - pop_mean)/(sample_std/np.sqrt(n))

# one-sided (lower-tail) p-value with n-1 degrees of freedom.
# sf is the survival function, 1 - cdf. The statistic is negative here, so by symmetry
# sf(|t|) equals cdf(t) = P(T <= t), which is the tail that H_1 points to.
pval = stats.t.sf(np.abs(statistic), n-1)
print(statistic)
print(pval)

# p-value = probability, assuming H_0 is true, of observing a statistic at least as extreme as this one.
# This p-value is smaller than 5%, THEREFORE we REJECT H_0.

-3.750000000000003
0.0001489332089038242
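
The calculation above can be wrapped in a small function so the same steps are reusable for other summary statistics. This is only a sketch: the name `t_test_from_summary` and its `alternative` argument are hypothetical and not part of the original notebook or of SciPy.

```python
import numpy as np
from scipy import stats

def t_test_from_summary(sample_mean, sample_std, n, mu0, alternative="less"):
    """One-sample t-test from summary statistics (hypothetical helper)."""
    se = sample_std / np.sqrt(n)          # standard error of the mean
    t_stat = (sample_mean - mu0) / se     # test statistic
    df = n - 1                            # degrees of freedom
    if alternative == "less":             # H_1: mean < mu0
        p = stats.t.cdf(t_stat, df)
    elif alternative == "greater":        # H_1: mean > mu0
        p = stats.t.sf(t_stat, df)
    elif alternative == "two-sided":      # H_1: mean != mu0
        p = 2 * stats.t.sf(abs(t_stat), df)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return t_stat, p

# A1c example: H_0: mean >= 5.7 vs. H_1: mean < 5.7
t_stat, p_less = t_test_from_summary(5.1, 1.6, 100, 5.7, alternative="less")
print(t_stat, p_less)
```

Using `cdf` for the lower tail gives the one-sided p-value directly, without relying on the sign of the statistic.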


In [5]:
# Confidence Interval
stats.t.interval(0.95, df=n-1, loc=sample_mean, scale=(sample_std/np.sqrt(n)))

# this is a 95% confidence interval: the procedure that builds it captures the population mean in 95% of repeated samples
# the interval does NOT include 5.7, so 5.7 is not a plausible value for the population mean (consistent with rejecting H_0)

(4.78252528775861, 5.417474712241389)

#### Example: Municipal Children's Home

Boys of a certain age are known to have a mean weight of μ = 85 pounds. A complaint is made that the boys living in a municipal children's home are underfed and thus underweight (a one-sided test). As one piece of evidence, n = 25 boys (of the same age) are weighed and found to have a mean weight of 80.94 pounds. It is known that the population standard deviation σ is 11.6 pounds (the unrealistic part of this example).  
Based on the available data, what should be concluded concerning the complaint?

In [7]:
pop_m = 85
sample_m = 80.94
pop_std = 11.6
n = 25

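# H_0: mean weight >= 85   vs.   H_1: mean weight < 85
statistic = (sample_m - pop_m)/(pop_std/np.sqrt(n))

# one-sided (lower-tail) p-value. sigma is known here, so a z-test (stats.norm) would be the
# textbook choice; the t distribution with n-1 df used below is slightly more conservative.
pval = stats.t.sf(np.abs(statistic), n-1)
print(statistic)
print(pval)

# REJECT H_0 (pval < 0.05):
# at the 5% level, the data support the complaint that the boys' mean weight is below 85 pounds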

-1.750000000000001
0.046447544473094286
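
Because σ is stated as known in this example, the test can also be run against the standard normal distribution instead of the t distribution. The sketch below recomputes the statistic and compares the two one-sided p-values; the variable names `z`, `p_z` and `p_t` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

pop_m, sample_m, pop_std, n = 85, 80.94, 11.6, 25

# Same statistic as above; with a known sigma it is a z statistic
z = (sample_m - pop_m) / (pop_std / np.sqrt(n))

p_z = stats.norm.cdf(z)        # lower-tail p-value from the standard normal
p_t = stats.t.cdf(z, n - 1)    # lower-tail p-value from t with n-1 df, as in the cell above

print(z, p_z, p_t)
```

The normal p-value is a little smaller than the t-based one, and both fall below 0.05, so the conclusion does not change.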


In [None]:
# Confidence Interval
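
The cell above is left empty in the notebook. One way it might be filled, assuming a normal-based interval because σ is treated as known, is sketched below; `ci_two_sided` and `upper_bound` are names chosen for this example.

```python
import numpy as np
from scipy import stats

sample_m, pop_std, n = 80.94, 11.6, 25
se = pop_std / np.sqrt(n)   # standard error of the mean with known sigma

# Two-sided 95% confidence interval for the population mean
ci_two_sided = stats.norm.interval(0.95, loc=sample_m, scale=se)

# One-sided 95% upper confidence bound, the counterpart of the one-sided test
upper_bound = sample_m + stats.norm.ppf(0.95) * se

print(ci_two_sided)
print(upper_bound)
```

The two-sided interval contains 85 while the one-sided upper bound lies below it. A one-sided test at the 5% level corresponds to the one-sided bound, not to the two-sided 95% interval.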


## Two-sided Hypothesis Tests

#### Example: Honolulu Heart Study

It is assumed that the mean systolic blood pressure is μ = 120 mm Hg. In the Honolulu Heart Study, a sample of n = 100 people had an average systolic blood pressure of 130.1 mm Hg with a standard deviation of 21.21 mm Hg. Is the group significantly different (with respect to systolic blood pressure!) from the regular population?
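
The notebook gives no code for this example. A minimal sketch, reusing the summary-statistic approach from the one-sided examples and assuming a t-test with n-1 degrees of freedom, could look like this:

```python
import numpy as np
from scipy import stats

# H_0: mean systolic blood pressure = 120   vs.   H_1: mean != 120
pop_mean = 120
sample_mean = 130.1
sample_std = 21.21
n = 100

statistic = (sample_mean - pop_mean) / (sample_std / np.sqrt(n))

# Two-sided p-value: both tails count as "extreme", so the one-tail area is doubled
pval = 2 * stats.t.sf(np.abs(statistic), n - 1)

print(statistic)
print(pval)
```

The statistic is about 4.76 and the p-value is far below 0.05, so under these assumptions H_0 would be rejected: the group differs significantly from the assumed population mean.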

## Using data arrays

#### Generating 10 draws from a standard normal random variable

In [12]:
X = stats.norm(0, 1).rvs(size = 10)
print(X)

[-0.80972493  1.25784511  0.79757186  0.23381243 -1.492823   -0.36542652
  0.12052578 -0.36799808  1.05340191  0.29123896]


#### Test if the sample average of X is equal to 0

In [13]:
stats.ttest_1samp(X, 0) # H_0: the population mean is 0 (or whatever value you state in H_0)

# pval = 0.797 is far above 0.05, so we FAIL to reject H_0: the sample is consistent with a mean of 0

Ttest_1sampResult(statistic=0.26562797843337727, pvalue=0.7965101944968229)
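
To see what `ttest_1samp` computes, the same result can be reproduced by hand with the formula used for the summary-statistic examples. The sketch below hard-codes the ten draws printed above (to the displayed precision), so the last digits may differ slightly from the notebook output.

```python
import numpy as np
from scipy import stats

# The ten draws printed above, copied to the displayed precision
X = np.array([-0.80972493, 1.25784511, 0.79757186, 0.23381243, -1.492823,
              -0.36542652, 0.12052578, -0.36799808, 1.05340191, 0.29123896])
mu0 = 0  # value of the mean under H_0

# Standard error uses the sample standard deviation (ddof=1)
se = X.std(ddof=1) / np.sqrt(len(X))
t_manual = (X.mean() - mu0) / se
p_manual = 2 * stats.t.sf(abs(t_manual), len(X) - 1)  # two-sided p-value

result = stats.ttest_1samp(X, mu0)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```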

In [16]:
data = pd.read_csv('Fitbit2.csv') 
data.head()

Unnamed: 0,Date,Calorie burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories,...,Distance_miles,Days,Days_encoded,Work_or_Weekend,Hours Sleep,Sleep efficiency,Yesterday_sleep,Yesterday_sleep_efficiency,Months,Months_encoded
0,2015-05-08,1934,905,0.65,0,1.355,46,0,0,1680,...,0.403891,Friday,4.0,1,6.4,92.086331,0.0,0.0,May,5
1,2015-05-09,3631,18925,14.11,4,611.0,316,61,60,2248,...,8.767545,Saturday,5.0,0,7.566667,92.464358,6.4,92.086331,May,5
2,2015-05-10,3204,14228,10.57,1,602.0,226,14,77,1719,...,6.567891,Sunday,6.0,0,6.45,88.761468,7.566667,92.464358,May,5
3,2015-05-11,2673,6756,5.02,8,749.0,190,23,4,9620,...,3.119282,Monday,0.0,1,5.183333,88.857143,6.45,88.761468,May,5
4,2015-05-12,2495,502,3.73,1,876.0,171,0,0,7360,...,2.317714,Tuesday,1.0,1,6.783333,82.892057,5.183333,88.857143,May,5


In [20]:
stats.ttest_1samp(data['Distance'], 8.1)

# note: ttest_1samp is a two-sided test (see help(stats.ttest_1samp))
# H_0: mean Distance = 8.1; pval = 0.012 < 0.05, so we reject H_0 at the 5% level

Ttest_1sampResult(statistic=2.5232718732480763, pvalue=0.012049635797895152)
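
For a one-sided question on a data array, recent SciPy versions (1.6 and later) accept an `alternative` argument in `ttest_1samp`. The sketch below uses synthetic data as a stand-in, since `Fitbit2.csv` is not available here; the array `distance` and its parameters are hypothetical and are not the notebook's data.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for data['Distance']; NOT the Fitbit data
rng = np.random.default_rng(0)
distance = rng.normal(loc=8.5, scale=3.0, size=200)

two_sided = stats.ttest_1samp(distance, 8.1)                         # H_1: mean != 8.1
greater = stats.ttest_1samp(distance, 8.1, alternative="greater")    # H_1: mean > 8.1
less = stats.ttest_1samp(distance, 8.1, alternative="less")          # H_1: mean < 8.1

print(two_sided.pvalue, greater.pvalue, less.pvalue)
```

The statistic is the same in all three calls; only the tail area changes. The two-sided p-value is twice the smaller of the two one-sided p-values.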