# Introduction to Statistical Hypothesis Testing
 
### Examples: 

### 1. An engineer has to decide, on the basis of sample data, whether the true average lifetime of a certain kind of tire is 22,000 miles.

### 2. An agronomist has to decide, on the basis of experiments, whether one kind of fertilizer produces a higher yield of soybeans than another.

 
## These problems can all be translated into the language of **statistical tests of hypotheses**

## Example 1. 
### Null Hypothesis: $$ \mu = 22000$$ vs Alternative Hypothesis: $$ \mu \neq 22000 $$




## Example 2. 
### Null Hypothesis: $$ \mu_1 = \mu_2 $$ vs Alternative Hypothesis: $$ \mu_1 > \mu_2 $$




## Sample product-moment correlation coefficient

Reference:
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient


### Suppose you are a chocolate manufacturer interested in the effect of advertising on sales.
### You select a random sample of eight products. For each product, the value Y of sales in a certain period, in tens of thousands of pounds, and the amount of money X spent on advertising, in thousands of pounds, were recorded.

### Using the following information, calculate the sample product-moment correlation coefficient between the variables 'sales' and 'advertising costs'. 


#### Reference: http://www.hkss.org.hk/images/exam/papers/Past/2016/HC4%202016%20-%20HKSS.PDF
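For reference, the cell that follows evaluates the usual computational form of the sample product-moment correlation coefficient from the summary sums. The symbols $S_{xx}$, $S_{yy}$ and $S_{xy}$ match the variable names `Sxx`, `Syy` and `Sxy` used in the code:

$$ S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \qquad S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} $$

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} $$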



In [4]:
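import math

# Summary statistics given in the question
sumofx = 386      # sum of x (advertising costs)
sumofy = 460      # sum of y (sales)
sumofx2 = 25426   # sum of x squared
sumofy2 = 28867   # sum of y squared
sumofxy = 26161   # sum of x*y
n = 8             # number of products sampled

# Corrected sums of squares and cross-products
Sxx = sumofx2 - (sumofx)**2/n
Syy = sumofy2 - (sumofy)**2/n
Sxy = sumofxy - (sumofx*sumofy)/n

# Sample product-moment correlation coefficient
r = Sxy/math.sqrt(Sxx*Syy)
r 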



0.9781648070201413

In [1]:
import math

x = [2.3,5.4,6,3,10,12]
y = [5,12,14,7,22.4,25.9]

sumofx = sum(x)
sumofy = sum(y)

# Sums of element-wise products of the paired lists; reference: 
# https://stackoverflow.com/questions/10271484/how-to-perform-element-wise-multiplication-of-two-lists-in-python
sumofx2 = sum([a*b for a,b in zip(x,x)])
sumofy2 = sum([a*b for a,b in zip(y,y)])
sumofxy = sum([a*b for a,b in zip(x,y)])

# Corrected sums of squares and cross-products (x and y have the same length)
Sxx = sumofx2 - (sumofx)**2/len(x)
Syy = sumofy2 - (sumofy)**2/len(y)
Sxy = sumofxy - (sumofx*sumofy)/len(x)

# Sample product-moment correlation coefficient
r = Sxy/math.sqrt(Sxx*Syy)
r 


0.998718938316758
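The same calculation can be wrapped in a small reusable function. The sketch below is one possible way to do it; the name `pearson_r` is hypothetical (not from any library), and it uses the centred form of the sums rather than the shortcut form used in the cell above, which gives the same coefficient up to floating-point rounding.

```python
import math

def pearson_r(x, y):
    """Sample product-moment correlation of two equal-length sequences (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Corrected sums of squares and cross-products, in centred form
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

x = [2.3, 5.4, 6, 3, 10, 12]
y = [5, 12, 14, 7, 22.4, 25.9]
print(round(pearson_r(x, y), 4))  # prints 0.9987, matching the result above
```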

In [2]:
# pearsonr returns the correlation coefficient and the two-sided p-value
from scipy.stats import pearsonr

pearsonr(x,y)

(0.9987189383167582, 2.460627367048062e-06)

In [9]:
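import pandas as pd

# Note: pd.DataFrame(x, y) uses x as the data (column 0) and y as the index
df = pd.DataFrame(x,y)
df.corr()

# Move the index (y) into an ordinary column so that corr() sees both variables
df = df.reset_index()
# Reference: https://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python

df.corr()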

|       | index    | 0        |
|-------|----------|----------|
| index | 1.0      | 0.998719 |
| 0     | 0.998719 | 1.0      |


### Examine, at the 1% significance level, whether there is evidence of positive correlation between 'sales' and 'advertising costs', stating the critical value and your conclusion. 

### Check the table of critical values for Pearson correlation coefficient

#### Reference: http://www.radford.edu/~jaspelme/statsbook/Chapter%20files/Table_of_Critical_Values_for_r.pdf



### The critical value at the 1% significance level, with degrees of freedom = 8 - 2 = 6, is 0.789.
### Our observed value is 0.978 > 0.789.

### Therefore, we reject the null hypothesis that the Pearson correlation coefficient = 0 in favor of the alternative hypothesis that the Pearson correlation coefficient > 0.
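The table lookup is equivalent to a t-test on the statistic t = r·sqrt((n-2)/(1-r²)), which follows Student's t distribution with n-2 degrees of freedom under the null hypothesis when the data are bivariate normal. The sketch below illustrates this with the sales and advertising result; the value 3.143 for the upper 1% point of t with 6 degrees of freedom is taken from standard t tables and is an assumption of the example, not something computed here.

```python
import math

r = 0.9781648070201413   # observed coefficient for the sales/advertising data
n = 8                    # number of products
df = n - 2               # degrees of freedom

# Test statistic for H0: correlation = 0
t_stat = r * math.sqrt(df / (1 - r**2))

# Assumption: upper 1% point of Student's t with 6 degrees of freedom (standard tables)
t_crit = 3.143

# The same critical value expressed on the r scale
r_crit = t_crit / math.sqrt(t_crit**2 + df)

print(round(t_stat, 2), round(r_crit, 3))
print("reject H0" if t_stat > t_crit else "do not reject H0")
```

Comparing `t_stat` with `t_crit` and comparing `r` with `r_crit` lead to the same decision, because the transformation from r to t is increasing on the interval (-1, 1).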

## How to interpret hypothesis testing

1. https://onlinecourses.science.psu.edu/statprogram/node/137
    
2. http://www.statisticshowto.com/probability-and-statistics/find-critical-values/    

### Calculate Spearman's rank correlation coefficient between the sets of ranks.  

### Examine, at the 1% level, whether there is evidence of a positive association between the two rank lists.

In [10]:
ranky = [4,1,2,5,3,6,7,8]
rankx = [6,2,1,4,3,5,7,8]

n = len(ranky)

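# Sum of squared differences between the paired ranks
d = sum([(a-b)**2 for a,b in zip(rankx,ranky)])

# Spearman rank correlation coefficient (this formula assumes there are no tied ranks)
rs = 1- (6*d)/(n*(n**2-1))

rs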


0.9047619047619048

#### Reference:

https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

https://www.york.ac.uk/depts/maths/tables/spearman.pdf
    
### The critical value for n = 8 at the 1% level is 0.833. Since 0.9048 > 0.833, we reject the null hypothesis (rs = 0) in favor of the alternative hypothesis rs > 0.
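When there are no tied ranks, the shortcut formula based on squared rank differences gives the same number as the product-moment correlation coefficient applied directly to the ranks. The following sketch checks this for the two rank lists used above; the variable names `rs_shortcut` and `rs_on_ranks` are illustrative only.

```python
import math

rankx = [6, 2, 1, 4, 3, 5, 7, 8]
ranky = [4, 1, 2, 5, 3, 6, 7, 8]
n = len(rankx)

# Shortcut formula based on the sum of squared rank differences
sum_d2 = sum((a - b) ** 2 for a, b in zip(rankx, ranky))
rs_shortcut = 1 - 6 * sum_d2 / (n * (n**2 - 1))

# Product-moment correlation coefficient computed on the ranks themselves
mean_x = sum(rankx) / n
mean_y = sum(ranky) / n
sxx = sum((a - mean_x) ** 2 for a in rankx)
syy = sum((b - mean_y) ** 2 for b in ranky)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rankx, ranky))
rs_on_ranks = sxy / math.sqrt(sxx * syy)

print(rs_shortcut, rs_on_ranks)  # the two values agree up to rounding
```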

## Advantage: The Spearman’s rank correlation coefficient does not rely on any distributional assumptions.
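One way to see why: the coefficient depends only on the ordering of the observations, so any increasing transformation of a variable leaves it unchanged. The sketch below illustrates this using the small x and y lists from earlier in this notebook; the helpers `ranks` and `spearman_rs` are hypothetical names written for this example and assume there are no tied values.

```python
def ranks(values):
    """Ranks 1..n of a sequence, assuming no tied values (hypothetical helper)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rs(x, y):
    """Spearman's coefficient via the squared rank difference formula (no ties assumed)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))

x = [2.3, 5.4, 6, 3, 10, 12]
y = [5, 12, 14, 7, 22.4, 25.9]
x_cubed = [v**3 for v in x]   # an increasing transformation of x

print(spearman_rs(x, y), spearman_rs(x_cubed, y))  # identical values
```

The product-moment coefficient, by contrast, generally changes under such a transformation because it measures linear association.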



#### Reference:

http://www.hkss.org.hk/index.php/prof/exam/doc
        

## Exercise 1

In [13]:
y=[14.4,14.7,15.0,15.4,15.5,15.7,16.0,16.1,16.3,17.0,17.1,17.2,18.4,19.8,20.0]
x = [76.3, 69.7, 79.6, 69.4, 75.2, 71.5, 71.6, 80.5, 83.3, 83.5, 80.6, 82.6, 84.3, 93.3, 88.6]

sumofx = sum(x)
sumofy = sum(y)

# reference: 
# https://stackoverflow.com/questions/10271484/how-to-perform-element-wise-multiplication-of-two-lists-in-python
sumofx2 = sum([a*b for a,b in zip(x,x)])
sumofy2 = sum([a*b for a,b in zip(y,y)])
sumofxy = sum([a*b for a,b in zip(x,y)])


Sxx = sumofx2 - (sumofx)**2/len(x)
Syy = sumofy2 - (sumofy)**2/len(x)
Sxy = sumofxy - (sumofx*sumofy)/len(x)

import math

r = Sxy/math.sqrt(Sxx*Syy)
r 


0.8338494448140588

### The 1% point is 0.5923.

### Since 0.8338 > 0.5923, there is strong evidence against the null hypothesis at the 1% significance level, indicating a positive correlation between number of chirps and temperature.

In [14]:
import scipy.stats as ss

rankx = list(ss.rankdata(x))
ranky = list(ss.rankdata(y))

In [22]:
print(x)
print(rankx)

[76.3, 69.7, 79.6, 69.4, 75.2, 71.5, 71.6, 80.5, 83.3, 83.5, 80.6, 82.6, 84.3, 93.3, 88.6]
[6.0, 2.0, 7.0, 1.0, 5.0, 3.0, 4.0, 8.0, 11.0, 12.0, 9.0, 10.0, 13.0, 15.0, 14.0]


In [21]:
print(y)
print(ranky)

[14.4, 14.7, 15.0, 15.4, 15.5, 15.7, 16.0, 16.1, 16.3, 17.0, 17.1, 17.2, 18.4, 19.8, 20.0]
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]


In [23]:
sumofx

1189.9999999999998

In [24]:
sumofy

248.60000000000002

In [25]:
n = len(ranky)

# Spearman rank correlation coefficient

d = sum([(a-b)**2 for a,b in zip(rankx,ranky)])

rs = 1- (6*d)/(n*(n**2-1))

rs

0.8464285714285714

### The 1% point for Spearman's rank correlation coefficient is 0.6036.

### Since 0.8464 > 0.6036, there is strong evidence against H0 at the 1% significance level, indicating a positive association between the number of chirps and temperature.

## Exercise 2:

In [29]:
x = [29, 27, 17, 25, 8, 19, 11, 9, 4]
y = [29, 22, 19, 33, 7, 14, 12, 11, 5]

sumofx = sum(x)
sumofy = sum(y)

# reference: 
# https://stackoverflow.com/questions/10271484/how-to-perform-element-wise-multiplication-of-two-lists-in-python
sumofx2 = sum([a*b for a,b in zip(x,x)])
sumofy2 = sum([a*b for a,b in zip(y,y)])
sumofxy = sum([a*b for a,b in zip(x,y)])


Sxx = sumofx2 - (sumofx)**2/len(x)
Syy = sumofy2 - (sumofy)**2/len(x)
Sxy = sumofxy - (sumofx*sumofy)/len(x)

import math

r = Sxy/math.sqrt(Sxx*Syy)
r 


0.9132112971140768

In [32]:
len(x)

9

In [38]:
(0.6664+0.7498)/2

0.7081

### The approximate critical value at the 2% significance level is 0.7081, taken as the average of the two tabulated values 0.6664 and 0.7498. Since 0.913 > 0.7081, we reject the null hypothesis in favor of the alternative hypothesis.

In [30]:
import scipy.stats as ss

rankx = list(ss.rankdata(x))
ranky = list(ss.rankdata(y))

In [31]:
n = len(ranky)

# Spearman rank correlation coefficient

d = sum([(a-b)**2 for a,b in zip(rankx,ranky)])

rs = 1- (6*d)/(n*(n**2-1))

rs

0.9333333333333333

In [39]:
(0.7000+0.783)/2

0.7415

### The approximate critical value at the 2% significance level is 0.7415, taken as the average of the two tabulated values 0.7000 and 0.783. Since 0.933 > 0.7415, we reject the null hypothesis in favor of the alternative hypothesis.