# Null Hypothesis for Data Analysis

## What is the null hypothesis and why is it important?

#### Null hypothesis testing is a formal method of reaching conclusions or making decisions on the basis of data

## Learning the null hypothesis by example

#### Example: A neurologist is testing the effect of a drug on response time by injecting 100 rats with a unit dose of the drug, subjecting each to a neurological stimulus and recording its response time. The neurologist knows that the mean response time for rats not injected with the drug is 1.2 seconds. The mean response time of the 100 injected rats is 1.05 seconds, with a sample standard deviation of 0.5 seconds. Do you think the drug has an effect on response time?

$H_o$ : Drug has no effect => $\mu$ = 1.2 even with drug


$H_a$ : Drug has an effect => $\mu$ $\neq$ 1.2 when the drug is given

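Assumption: the sampling distribution of the sample mean is Normal (reasonable here because N = 100 is large). Notation: $\mu$ and $\sigma$ are the population mean and standard deviation, $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the sampling distribution of the sample mean, and $S$ is the sample standard deviation.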

## Procedure to calculate the z-score in order to accept or reject the null hypothesis

If $H_o$ is true, then $\mu_x = \mu$ and $\sigma_x = \frac{\sigma}{\sqrt{N}} \approx \frac{S}{\sqrt{N}}$, where $S$ is the sample standard deviation and $N$ is the sample size

For this example if $H_o$ is true, $\mu_x$ = 1.2 and $\sigma_x \approx \frac{0.5}{10}$, where $N$ = 100

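Under these assumptions, with observed sample mean $\bar{x}$ = 1.05, the z-score is $z_{score} = \frac{\bar{x}-\mu_x}{\sigma_x} = \frac{1.05 - 1.2}{0.05} = -3$, so $|z_{score}| = 3$: the observed mean lies 3 standard errors away from the mean expected under $H_o$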
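The arithmetic for the rat example can be checked with a few lines of Python. This is a minimal sketch: the variable names (`mu_0`, `x_bar`, `s`, `n`) are chosen here for illustration, and the statistic is computed as sample mean minus hypothesized mean, so it comes out negative. Only its magnitude matters for a two-tailed test.

```python
import math

# Summary statistics from the rat example
mu_0 = 1.2    # mean response time without the drug (value under H_o)
x_bar = 1.05  # observed mean response time of the injected rats
s = 0.5       # sample standard deviation
n = 100       # number of rats

# Standard error of the mean, using s as an estimate of sigma
standard_error = s / math.sqrt(n)

# z-score: how many standard errors the sample mean is from mu_0
z = (x_bar - mu_0) / standard_error

print("standard error:", standard_error)
print("z-score:", z)
```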

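#### Under $H_o$, Z has a standard Normal distribution (zero mean, unit variance). The two-tailed p-value is the area under the curve below $-|z|$ plus the area above $+|z|$, i.e. the probability of a result at least as extreme as the one observed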


The p-value for the example above (from a table or a statistical module) is about 0.0027

#### When the p-value is smaller than the significance level $\alpha$ (usually 0.05), we reject the null hypothesis at the 1 - 0.05 = 0.95 confidence level

#### If p-value < $\alpha$, reject the null hypothesis

#### If p-value $\geq$ $\alpha$, accept (more precisely, fail to reject) the null hypothesis

## How do we calculate the p-value from a z-score?

In [2]:
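# Task: write code to calculate the two-tailed p-value from a z-score
import numpy as np
import scipy.stats
from scipy.stats import norm

z_value = 3
# Two-tailed p-value: twice the area in the lower tail below -|z|
p_value = 2*norm.cdf(-np.abs(z_value))

print(p_value)
# Same result via the survival function, sf(z) = 1 - cdf(z)
print(scipy.stats.norm.sf(abs(z_value))*2)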


0.0026997960632601866
0.0026997960632601866


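Since the p-value (about 0.0027) is well below $\alpha$ = 0.05, we reject the null hypothesis: the data indicate that the drug has an effect on response time

## Possible errors when accepting or rejecting the null hypothesis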

#### Type I error: we reject the null hypothesis when it is true

$\alpha = P(\text{rejecting } H_o \mid H_o \text{ is true})$

#### Type II error: we accept the null hypothesis when it is false

$\beta = P(\text{accepting } H_o \mid H_o \text{ is false})$
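The meaning of $\alpha$ as a Type I error rate can be illustrated by simulation. The sketch below is not part of the original exercise: it generates many synthetic experiments in which $H_o$ is true by construction (Normal data with mean 1.2 and standard deviation 0.5, borrowed from the rat example as assumed simulation settings), runs the two-tailed z-test on each, and counts how often the null hypothesis is wrongly rejected. The fraction should land close to $\alpha$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

alpha = 0.05     # significance level
mu_0 = 1.2       # true mean, equal to the hypothesized mean (H_o holds)
n_trials = 2000  # number of simulated experiments (assumed setting)
n = 100          # sample size per experiment

# Each row is one synthetic experiment drawn under H_o
samples = rng.normal(loc=mu_0, scale=0.5, size=(n_trials, n))

# Two-tailed z-test for every row at once
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)
z_scores = (samples.mean(axis=1) - mu_0) / standard_errors
p_values = 2 * norm.sf(np.abs(z_scores))

# Fraction of experiments that reject a true null hypothesis
type_i_rate = np.mean(p_values < alpha)
print("estimated Type I error rate:", type_i_rate)
```

Estimating $\beta$ works the same way, except that the data are drawn with a true mean different from `mu_0` and the fraction of experiments that fail to reject is counted.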

## In-class activity I

In [26]:
# Task: write a function that takes the population mean and the samples as input arguments,
# then decides whether to reject or accept the null hypothesis

def accept_or_reject_null_hypothesis(data_sample, mu, significant_level):
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    standard_error = np.std(data_sample, ddof=1)/np.sqrt(len(data_sample))
    z = (np.mean(data_sample)-mu)/standard_error
    # Two-tailed p-value from the standard Normal distribution
    p_value = 2*norm.cdf(-np.abs(z))
    if p_value < significant_level:
        print('reject null hypothesis')
    else:
        print('accept null hypothesis')
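For reuse outside a notebook, it can help to return the result instead of printing it. The following is one possible variant, not the activity's reference solution: the function name `two_sided_z_test` is hypothetical, and the sample it is applied to is synthetic data generated only to show the call.

```python
import numpy as np
from scipy.stats import norm

def two_sided_z_test(data_sample, mu, significance_level=0.05):
    """Return (p_value, reject) for a two-tailed large-sample z-test."""
    data = np.asarray(data_sample, dtype=float)
    # Standard error of the mean from the sample standard deviation
    standard_error = data.std(ddof=1) / np.sqrt(data.size)
    z = (data.mean() - mu) / standard_error
    p_value = 2 * norm.sf(abs(z))
    return p_value, bool(p_value < significance_level)

# Synthetic response times, generated for illustration only
rng = np.random.default_rng(42)
sample = rng.normal(loc=1.05, scale=0.5, size=100)

p_value, reject = two_sided_z_test(sample, mu=1.2)
print("p-value:", p_value)
print("reject null hypothesis" if reject else "accept null hypothesis")
```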
            

## In-class activity II

Example: The average British man is 175.3 cm tall. A survey recorded the heights of 10 UK men and we want to know whether the mean of the sample is different from the population mean.

In [10]:
# Task: should we accept or reject the null hypothesis for the example above?
import numpy as np
from scipy import stats

x = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]
mu = 175.3
x_bar = np.array(x).mean()
s = np.array(x).std(ddof=1) # ddof=1 divides by N-1, the usual sample standard deviation
N = len(x)
SE = s/np.sqrt(N) # standard error of the mean
t = (x_bar - mu)/SE
print("t-statistic: ",t)

# a one-sample t-test that also returns the p-value can be done with scipy as follows:
t, p = stats.ttest_1samp(x, mu)
print("t = ", t, ", p = ", p)

t-statistic:  2.295568968083183
t =  2.295568968083183 , p =  0.04734137339747034


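Since p is about 0.047, which is below $\alpha$ = 0.05, we reject the null hypothesis at the 0.05 significance level

#### More precisely, when the standard deviation of the population is not known and N is small (a common rule of thumb is N less than 30), we use the t-statistic instead of the z-statistic to test the null hypothesis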
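To see why the choice matters for small samples, the sketch below converts the same statistic for the ten heights into a p-value twice: once with the standard Normal distribution and once with Student's t distribution with N - 1 degrees of freedom. The variable names are chosen here for illustration. The t distribution has heavier tails, so its p-value is larger (about 0.047, matching `stats.ttest_1samp`, versus roughly 0.02 from the Normal), and treating a small sample as if it were a z-test would overstate the evidence.

```python
import numpy as np
from scipy import stats

x = np.array([177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5])
mu = 175.3
n = x.size

# The statistic is computed the same way for both tests
statistic = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(n))

# Two-tailed p-value if we (wrongly, for N = 10) use the Normal distribution
p_normal = 2 * stats.norm.sf(abs(statistic))

# Two-tailed p-value from Student's t with N - 1 degrees of freedom
p_t = 2 * stats.t.sf(abs(statistic), df=n - 1)

print("p-value (Normal):", p_normal)
print("p-value (t, df = N - 1):", p_t)
```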

## In-class activity III

In [25]:
# Task: write a function that decides whether to use a z-test or a t-test to accept or reject the null hypothesis
# Pass sigma=None when the population standard deviation is unknown

def z_t_null_hypothesis(data_sample, mu, sigma, significant_level):
    n = len(data_sample)
    if sigma is not None:
        # Population standard deviation is known: z-test
        z_score = (np.mean(data_sample)-mu)/(sigma/np.sqrt(n))
        p = scipy.stats.norm.sf(abs(z_score))*2
    elif n > 30:
        # Large sample: z-test with the sample standard deviation (ddof=1)
        z_score = (np.mean(data_sample)-mu)/(np.std(data_sample, ddof=1)/np.sqrt(n))
        p = scipy.stats.norm.sf(abs(z_score))*2
    else:
        # Small sample and unknown sigma: one-sample t-test
        t, p = stats.ttest_1samp(data_sample, mu)

    if p < significant_level:
        print('reject null hypothesis')
    else:
        print('accept null hypothesis')

## What is a one-tailed or two-tailed calculation of the p-value?

If the alternative hypothesis says that the mean is different from the hypothesized population mean (in either direction), we compute the p-value from both tails. If it says that the mean is specifically greater than, or specifically lower than, the hypothesized population mean, we compute the p-value from one tail only
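As an illustration, the heights sample from in-class activity II can be tested against each of the three alternatives. This is a sketch with illustrative variable names. It computes the tail areas directly from the t distribution, and because the statistic is positive here, the "greater" one-tailed p-value is exactly half of the two-tailed one.

```python
import numpy as np
from scipy import stats

x = np.array([177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5])
mu = 175.3
n = x.size
t_stat = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(n))

# H_a: mean != mu  (two-tailed: area in both tails beyond |t|)
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# H_a: mean > mu   (one-tailed: area above t)
p_greater = stats.t.sf(t_stat, df=n - 1)

# H_a: mean < mu   (one-tailed: area below t)
p_less = stats.t.cdf(t_stat, df=n - 1)

print("two-tailed:", p_two_tailed)
print("one-tailed (greater):", p_greater)
print("one-tailed (less):", p_less)
```

The direction of a one-tailed test should be fixed before looking at the data. Choosing it afterwards to match the observed sign halves the p-value without justification.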

## Please mention other examples of hypothesis testing applications

#### Other statistical tests we can do

http://iaingallagher.tumblr.com/post/50980987285/t-tests-in-python


#### Other Resources 

https://www.kaggle.com/jgroff/unit-3-hypothesis-testing

http://jukebox.esc13.net/untdeveloper/RM/Stats_Module_4/mobile_pages/Stats_Module_48.html

## Homework

https://docs.google.com/document/d/1ITryiXU_VoyBvtZY4deehk4PmlieSlF7rSNc8sBU3Sw/edit