# Hypothesis test

In this notebook we review a statistical topic: hypothesis testing. As a guide, we follow an article from _Towards Data Science_, which you can find at the following link:
    
    https://towardsdatascience.com/three-common-hypothesis-tests-all-data-scientists-should-know-6204067a9ced
 
Before any code, here is the process for testing a hypothesis:

1. Assume the inference is not true for the population. This is the null hypothesis.
2. Calculate the statistic of the inference from the sample, such as a mean or a proportion.
3. Understand the expected distribution of the sampling error around the statistic. This determines which formula is used to test the hypothesis.
4. Use that distribution to find how likely it is to observe a statistic at least as extreme as yours if the null hypothesis were true (the p-value).
5. Use _alpha_ to make a binary decision: reject or fail to reject the null hypothesis.
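
The five steps above can be sketched on a small example. This is a minimal illustration, not part of the original article: the sample values, the hypothesized mean `mu0` and the choice `alpha = 0.05` are all made up, and a one-sample t-test is used only because it is the simplest case.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (made-up values, for illustration only)
sample = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2, 5.9, 5.5])
mu0 = 5.0      # step 1: null hypothesis, the population mean equals mu0
alpha = 0.05   # assumed significance level, chosen before looking at the data

# step 2: statistic of interest, computed from the sample
sample_mean = sample.mean()

# step 3: sampling error around the statistic (standard error); for a mean
# with an estimated variance, the standardized statistic follows a t distribution
n = sample.size
std_err = sample.std(ddof=1) / np.sqrt(n)
t_stat = (sample_mean - mu0) / std_err

# step 4: probability of a statistic at least this extreme if the null holds
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# step 5: binary decision against alpha
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject null: {reject_null}")
```

The same numbers should come out of `stats.ttest_1samp(sample, mu0)`, which wraps steps 2 to 4 in one call.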
 
 
 

In [1]:
import pandas as pd
from scipy import stats
# get data
url = "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople = pd.read_csv(url)

In [2]:
salespeople.head()

,promoted,sales,customer_rate,performance
0,0,594.0,3.94,2.0
1,0,446.0,4.06,3.0
2,1,674.0,3.83,4.0
3,0,525.0,3.62,2.0
4,1,657.0,4.4,3.0


### Welch's test

* Used to determine whether two populations have different means. Here we look at the two-sample version.
* We assume that the populations are normally distributed, so we use a t-test. The t-distribution approximates the normal distribution while accounting for the variances being estimated from the samples.
* To use the t-distribution we need the degrees of freedom, which can be determined with the Welch-Satterthwaite equation. We also need the standard error of the difference in means.
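
To make the last bullet concrete, the following sketch computes the standard error, the Welch-Satterthwaite degrees of freedom and the one-sided p-value by hand, then cross-checks them against SciPy. The two groups are made-up numbers, not the `salespeople` data, and the variable names are only illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (made-up values) standing in for two groups
group_a = np.array([620.0, 580.0, 710.0, 655.0, 690.0, 600.0, 735.0])
group_b = np.array([410.0, 520.0, 455.0, 390.0, 480.0, 505.0, 430.0, 470.0, 445.0])

n_a, n_b = group_a.size, group_b.size
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# standard error of the difference in means (variances are not pooled)
std_err = np.sqrt(var_a / n_a + var_b / n_b)
t_stat = (group_a.mean() - group_b.mean()) / std_err

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

# one-sided p-value: is the mean of group_a greater than the mean of group_b?
p_value = stats.t.sf(t_stat, df=dof)

# cross-check against SciPy's implementation of Welch's test
check = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")
print(t_stat, dof, p_value)
print(check.statistic, check.pvalue)
```

Note that the degrees of freedom are generally not an integer, which is why they are passed to the t distribution as a float.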

In [3]:
# get sales for bottom (performance 1) and top (performance 4) performers
perf1 = salespeople[salespeople["performance"] == 1].sales
perf4 = salespeople[salespeople["performance"] == 4].sales

# Welch's independent t-test (unequal variances), one-sided:
# is the mean of perf4 greater than the mean of perf1?
ttest = stats.ttest_ind(perf4, perf1, equal_var=False, alternative="greater")

print(ttest)

Ttest_indResult(statistic=4.629477606844271, pvalue=5.466221730788519e-06)


### Correlation test
* Used to determine whether two numeric variables are correlated.
* For example, we could ask whether there is a non-zero correlation between `sales` and `customer_rate` in our `salespeople` dataset. As usual, we start from the null hypothesis: that the correlation between these variables is zero. We then calculate the sample correlation:

In [5]:
# keep only rows where both values are present, so that each sales figure
# stays paired with the customer rating of the same salesperson
pairs = salespeople[["sales", "customer_rate"]].dropna()

# calculate correlation and p-value
cor = stats.pearsonr(pairs["sales"], pairs["customer_rate"])
print(cor)


(0.3378050448586781, 8.64795221209082e-11)
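
As a rough illustration of where such a p-value comes from: under the null hypothesis of zero correlation (and assuming normally distributed data), the sample correlation `r` can be turned into a t statistic with `n - 2` degrees of freedom. The sketch below uses synthetic data generated with a fixed seed, not the `salespeople` dataset, and compares the hand-computed p-value with the one reported by `stats.pearsonr`.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption: made up for illustration, weakly correlated)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)

# sample correlation and p-value as reported by SciPy
r, p_scipy = stats.pearsonr(x, y)

# the same test by hand: transform r into a t statistic with n - 2 df
n = x.size
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(r, p_scipy, p_manual)
```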


### Chi-square test 
* Similar to the correlation test, but for categorical variables.
* In this case we can ask whether the proportion of salespeople who are promoted differs between the performance categories.
* The null hypothesis is that the proportion of salespeople who are promoted is the same across all the performance categories.

First, we have to create a contingency table for `promoted` and `performance`.


In [6]:
contingency = pd.crosstab(salespeople.promoted,salespeople.performance)
contingency

promoted \ performance,1.0,2.0,3.0,4.0
0,50,85,77,25
1,10,25,48,30


Now we assume perfect equality across all the categories and compute the expected values under that assumption. These are calculated by taking the overall proportion of promoted salespeople and applying it to the number of salespeople in each category, then doing the same for the salespeople who were not promoted.
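
The expected values can be reproduced by hand as a sanity check. The sketch below copies the observed counts from the contingency table above into a NumPy array (the variable names are only illustrative) and also computes the chi-square statistic as the sum of scaled squared gaps between observed and expected counts.

```python
import numpy as np

# observed counts copied from the contingency table above
# rows: promoted = 0, 1; columns: performance = 1, 2, 3, 4
observed = np.array([[50, 85, 77, 25],
                     [10, 25, 48, 30]])

row_totals = observed.sum(axis=1, keepdims=True)  # not promoted / promoted
col_totals = observed.sum(axis=0, keepdims=True)  # salespeople per category
grand_total = observed.sum()

# overall proportion of each promotion status, applied to each category size
expected = row_totals * col_totals / grand_total

# chi-square statistic: squared gaps between observed and expected, scaled
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected.round(2))
print(chi2_stat)
```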

In Python, we only need the `stats.chi2_contingency` function:

In [7]:
chi2_test = stats.chi2_contingency(contingency)
print(chi2_test)

(25.895405268094862, 1.0030629464566802e-05, 3, array([[40.62857143, 74.48571429, 84.64285714, 37.24285714],
       [19.37142857, 35.51428571, 40.35714286, 17.75714286]]))
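
The result contains, in order, the chi-square statistic, the p-value, the degrees of freedom and the expected counts. A minimal sketch of the final decision step follows, using the observed counts from the contingency table and an assumed significance level of `alpha = 0.05` (the notebook itself does not fix a value).

```python
import numpy as np
from scipy import stats

# observed counts copied from the contingency table
observed = np.array([[50, 85, 77, 25],
                     [10, 25, 48, 30]])
alpha = 0.05  # assumed significance level, not specified in the notebook

# unpack the four values returned by chi2_contingency
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# degrees of freedom are (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
assert dof == (n_rows - 1) * (n_cols - 1)

# reject the null hypothesis of equal promotion proportions if p < alpha
reject_null = p_value < alpha
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.2e}, reject null: {reject_null}")
```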
