## <font color='darkblue'>Preface</font>
([article source](https://towardsdatascience.com/three-common-hypothesis-tests-all-data-scientists-should-know-6204067a9ced)) <font size='3ptx'>**<font color='darkblue'>Hypothesis testing</font> is one of the most fundamental elements of inferential statistics. In modern languages like Python and R, these tests are easy to conduct — often with a single line of code.**</font>

But it never fails to puzzle me how few people use them or understand how they work. **In this article I want to use an example to show three common hypothesis tests and how they work under the hood, as well as showing how to run them in R and Python and to understand the results.**
* <font size='3ptx'>[**Example 1 — Welch’s t-test**](#sect1)</font>
* <font size='3ptx'>[**Example 2— Correlation test**](#sect2)</font>
* <font size='3ptx'>[**Example 3— Chi-square test of difference in proportion**](#sect3)</font>

### <font color='darkgreen'>The general principles and process of hypothesis testing</font>
**Hypothesis testing exists because it is almost never the case that we can observe an entire population when trying to make a conclusion or inference about it**. Almost always, we are trying to make that inference on the basis of a sample of data from that population.

Given that we only ever have a sample, **we can never be 100% certain about the inference we want to make. We can be 90%, 95%, 99%, 99.999% certain, but never 100%.**

**Hypothesis testing is essentially about calculating how certain we can be about an inference based on our sample.** The most common process for calculating this has several steps:
1. **Assume the inference is not true on the population** — this is called the <font color='darkblue'>**null hypothesis**</font>
2. Calculate the **statistic of the inference on the sample**
3. **Understand the expected distribution of the sampling error around that statistic**
4. Use that distribution to understand the **maximum likelihood of your sample statistic being consistent with the null hypothesis**
5. **Use a chosen ‘<font color='darkblue'>likelihood cutoff</font>’, known as alpha, to make a binary decision on whether to reject the null hypothesis or fail to reject it.** The most commonly used value of alpha is 0.05. That is, we usually reject a null hypothesis if it renders the maximum likelihood of our sample statistic to be less than 1 in 20.
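
The five steps above can be sketched on a toy example. This is only an illustration under stated assumptions: the `sample` values and the null mean of 5.0 are made up and are not part of the salespeople data, and a one-sample, one-sided t-test stands in for "the inference".

```python
import numpy as np
from scipy import stats

# Hypothetical sample, invented for illustration only
sample = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.2, 5.9, 5.5])

# Step 1: null hypothesis, "the population mean is 5.0 or less"
null_mean = 5.0

# Step 2: the statistic of the inference on the sample
sample_mean = sample.mean()

# Step 3: sampling error around that statistic
# (standard error, with a t-distribution on n - 1 degrees of freedom)
n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)

# Step 4: probability of a statistic at least this large under the null
t_statistic = (sample_mean - null_mean) / standard_error
p_value = stats.t.sf(t_statistic, df=n - 1)

# Step 5: binary decision against the chosen alpha
alpha = 0.05
reject_null = p_value < alpha
print(f"mean={sample_mean:.2f}, t={t_statistic:.2f}, p={p_value:.5f}, reject={reject_null}")
```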

### <font color='darkgreen'>The salespeople data set</font>
To illustrate some common hypothesis tests in this article I will use the [salespeople dataset which can be obtained here](http://peopleanalytics-regression-book.org/data/salespeople.csv). Let’s download it and load it:

In [19]:
import pandas as pd
import numpy as np
from scipy import stats

In [5]:
salespeople_df = pd.read_csv("../../datas/salespeople.csv")
salespeople_df.head()

| | promoted | sales | customer_rate | performance |
|---|---|---|---|---|
| 0 | 0 | 594.0 | 3.94 | 2.0 |
| 1 | 0 | 446.0 | 4.06 | 3.0 |
| 2 | 1 | 674.0 | 3.83 | 4.0 |
| 3 | 0 | 525.0 | 3.62 | 2.0 |
| 4 | 1 | 657.0 | 4.4 | 3.0 |


We see four columns of data:
* **promoted** — a binary value indicating if the salesperson was promoted or not in the recent promotion round
* **sales** — the recent sales made by the salesperson in thousands of dollars
* **customer_rate** — the recent average rating by customers of the salesperson on a scale of 1 to 5
* **performance** — the most recent performance rating of the salesperson where a rating of 1 is the lowest and 4 is the highest.

<a id='sect1'></a>
## <font color='darkblue'>Example 1 — Welch’s t-test</font>
**[Welch’s t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test) is a hypothesis test for determining if two populations have different means**. There are a number of varieties of this test, but we will look at the two sample version and we will ask:
> if high performing salespeople generate higher sales than low performing salespeople in the population.
<br/>

We start by assuming **our null hypothesis which is that the difference in mean sales between high performers and low performers in the population is zero or less**. Now we calculate our difference in means statistic for our sample.

In [12]:
# get sales for top and bottom performers
perf1_sales_series = salespeople_df[salespeople_df.performance == 1].sales
perf4_sales_series = salespeople_df[salespeople_df.performance == 4].sales

# difference
sales_mean_diff = perf4_sales_series.mean() - perf1_sales_series.mean()
sales_mean_diff

154.9742424242424

So we see that in our sample, high performers generate around `$155k` more in sales than low performers.

**Now, we are assuming that `sales` is a random variable, that is, that the sales of one salesperson are independent of those of another**. Therefore we expect the difference in mean sales between the two groups to also be a random variable. So we expect the true population difference to be on a t-distribution centered around our sample statistic, which is an estimate of a normal distribution based on our sample. To get the precise t-distribution, we need the [**degrees of freedom**](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)), which can be determined based on the [**Welch-Satterthwaite equation**](https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation) (<font color='brown'>100.98 in this case</font>). We also need to know the standard deviation of the mean difference, known as the standard error, which we can calculate to be 33.48. See [here](http://peopleanalytics-regression-book.org/found-stats.html) for more details on these calculations.
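
For readers who want to see where numbers like the standard error and the degrees of freedom come from, here is a minimal sketch of the standard Welch calculation. The helper name `welch_se_and_df` and the two small samples are hypothetical, chosen only for illustration. Applied to `perf4_sales_series` and `perf1_sales_series` (with any missing values dropped first), it should give values close to the 33.48 and 100.98 quoted above.

```python
import numpy as np

def welch_se_and_df(a, b):
    """Standard error of the mean difference and Welch-Satterthwaite
    degrees of freedom for two independent samples (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # variance of each sample mean: s^2 / n
    var_mean_a = a.var(ddof=1) / len(a)
    var_mean_b = b.var(ddof=1) / len(b)
    # the standard error combines both without assuming equal variances
    se = np.sqrt(var_mean_a + var_mean_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (var_mean_a + var_mean_b) ** 2 / (
        var_mean_a ** 2 / (len(a) - 1) + var_mean_b ** 2 / (len(b) - 1)
    )
    return se, df

# Made-up numbers, not taken from the salespeople data
high = [620.0, 710.0, 655.0, 580.0, 690.0, 640.0]
low = [450.0, 520.0, 480.0, 610.0, 430.0]
se, df = welch_se_and_df(high, low)
print(f"standard error={se:.2f}, degrees of freedom={df:.2f}")
```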

Knowing these parameters, we can create **a graph of the t-distribution around our sample statistic**:
![1.png](images/1.png)
<br/>

We can now see the expected probability distribution for our true population statistic. We can also mark the maximum position on this distribution that represents a difference of zero or less, which is our null hypothesis statement. By taking the area under this distribution to the left of the red line, **we calculate the maximum probability of this sample statistic occurring if the null hypothesis were true.** Usually this is calculated by working out the number of standard errors that are needed to get to the red line, known as the t-statistic. In this case it would be:

In [14]:
se = 33.48
t_statistic = round((0 - sales_mean_diff)/se, 2)
t_statistic

-4.63

So our red line is 4.63 standard errors away from the sample statistic. We can use built-in functions in Python's SciPy library to calculate the **associated area under the curve for this t-statistic on a t-distribution with 100.98 degrees of freedom. This represents the maximum probability of our sample statistic occurring under the null hypothesis, and is known as the <font color='darkblue'>p-value</font> of the hypothesis test.**

In [15]:
ttest = stats.ttest_ind(perf4_sales_series, perf1_sales_series, equal_var=False, alternative = "greater")
print(ttest)

Ttest_indResult(statistic=4.629477606844271, pvalue=5.466221730788519e-06)
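
As a cross-check, the same area can be read directly off the t-distribution. This is a small sketch that hard-codes the rounded t-statistic (-4.63) and the degrees of freedom (100.98) quoted in the text, so the result should be close to, but not exactly, the p-value reported by `ttest_ind`.

```python
from scipy import stats

t_statistic = -4.63        # rounded value from the text
degrees_of_freedom = 100.98

# area under the t-distribution to the left of the red line
p_value = stats.t.cdf(t_statistic, df=degrees_of_freedom)
print(f"p-value={p_value:.2e}")
```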


**So we determine that the maximum probability of our sample statistic occurring under the null hypothesis is 0.000005 — much less than even a very stringent alpha**. In most cases this would be considered too unlikely to accept the null hypothesis and **we will reject it in favour of the alternative hypothesis — that high performing salespeople generate higher sales than low performing salespeople.**

<a id='sect2'></a>
## <font color='darkblue'>Example 2— Correlation test</font>
**Another common hypothesis test is a test that two numeric variables have a non-zero correlation.**

Let’s ask if there is a non-zero correlation between `sales` and `customer_rate` in our [**salespeople data set**](https://drive.google.com/file/d/1JF_8jFJRULnI_LWkO6g48C2sMeijTOIP/view?usp=sharing). As usual we assume the null hypothesis: 
> that there is a zero correlation between variables `sales` and `customer_rate`. 

We then calculate the sample correlation & p-value by [**scipy.stats.pearsonr**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html):

In [25]:
# calculate correlation and p-value
# keep only rows where both values are present so the two series stay aligned
paired_df = salespeople_df[["sales", "customer_rate"]].dropna()
cor, pv = stats.pearsonr(paired_df.sales, paired_df.customer_rate)
print(f"correlation={cor}; p-value={pv:.012f}")

correlation=0.3378050448586781; p-value=0.000000000086


Again, we expect the true population correlation to lie in a distribution around this sample statistic. A simple correlation like this is expected to follow a t-distribution with n-2 degrees of freedom (<font color='brown'>348 in this case</font>) and the standard error is approximately 0.05. As before we can graph this and position our null hypothesis red line:
![2.png](images/2.png)
<br/>

We see that the red line lies more than 6 standard errors away from the observed statistic and we have `p-value = 0.00000000008` which is extremely small. Thus we can again reject the null hypothesis.
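
The standard error and the "more than 6 standard errors" claim can be reproduced from the sample correlation alone. This sketch uses the standard t-transformation of a Pearson correlation. The value of `n` is an assumption inferred from the 348 degrees of freedom quoted above, and `r` is copied from the printed output.

```python
import numpy as np
from scipy import stats

r = 0.3378050448586781   # sample correlation from the output above
n = 350                  # assumed: implied by n - 2 = 348 degrees of freedom
degrees_of_freedom = n - 2

# standard error of the correlation and the resulting t-statistic
standard_error = np.sqrt((1 - r ** 2) / degrees_of_freedom)
t_statistic = r / standard_error

# two-sided p-value, since the null hypothesis is "correlation equals zero"
p_value = 2 * stats.t.sf(abs(t_statistic), df=degrees_of_freedom)
print(f"se={standard_error:.3f}, t={t_statistic:.2f}, p-value={p_value:.2e}")
```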

<a id='sect3'></a>
## <font color='darkblue'>Example 3— Chi-square test of difference in proportion</font>
The previous two examples involved numeric variables, but data scientists often have to deal with categorical variables too. **A common question is whether there is a difference in proportion across the different categories of such a variable. A [Chi-square test](https://en.wikipedia.org/wiki/Chi-squared_test) is a hypothesis test designed for this purpose.**

Let’s ask the question: is there a difference in the proportion of salespeople who are promoted between the different performance categories? Again, we assume the null hypothesis:
> the proportion of salespeople who are `promoted` is the same across all the `performance` categories.

Let’s look at the proportion of salespeople who were promoted in each `performance` category by creating a contingency table or cross table for `performance` and `promoted`.

In [27]:
contingency = pd.crosstab(salespeople_df.promoted, salespeople_df.performance)
contingency

| promoted \ performance | 1.0 | 2.0 | 3.0 | 4.0 |
|---|---|---|---|---|
| 0 | 50 | 85 | 77 | 25 |
| 1 | 10 | 25 | 48 | 30 |


**Now let’s assume that there was perfect equality across the categories**. 

We do this by calculating the overall proportion of promoted salespeople and then applying this proportion to the number of salespeople in each category. This gives us the following expected (theoretical) contingency table, which [**scipy.stats.chi2_contingency**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) returns alongside the test results:

In [29]:
# perform chi-square test
chi2, pv, dof, expected_contingency_table = stats.chi2_contingency(contingency)
print(expected_contingency_table)

[[40.62857143 74.48571429 84.64285714 37.24285714]
 [19.37142857 35.51428571 40.35714286 17.75714286]]
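
To make the "apply the overall proportion to each category" step concrete, here is a small sketch that rebuilds the expected table by hand. The `observed` array is typed in from the contingency table shown earlier, and the variable names are my own.

```python
import numpy as np

# observed counts: rows are promoted = 0 / 1, columns are performance 1 to 4
observed = np.array([[50, 85, 77, 25],
                     [10, 25, 48, 30]])

row_totals = observed.sum(axis=1)   # not promoted / promoted
col_totals = observed.sum(axis=0)   # salespeople per performance category
grand_total = observed.sum()

# expected count = (row total / grand total) * column total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))
```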


We then apply the following formula to each pair of corresponding entries in the observed and expected contingency tables and sum the results to form a statistic known as the chi-square statistic.
![3.png](images/3.png)
<br/>
In this case the chi-square statistic is calculated to be 25.895.

In [30]:
print(chi2)

25.895405268094862
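
The statistic can also be computed by hand as the sum of (observed - expected)² / expected over every cell. The sketch below retypes the observed counts from the contingency table above and should land on the same value that `chi2_contingency` reports. No continuity correction is involved here, on the assumption that SciPy only applies one when there is a single degree of freedom.

```python
import numpy as np

# observed counts: rows are promoted = 0 / 1, columns are performance 1 to 4
observed = np.array([[50, 85, 77, 25],
                     [10, 25, 48, 30]])

# expected counts under the null hypothesis of equal promotion rates
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# chi-square statistic: sum over cells of (O - E)^2 / E
chi2_statistic = ((observed - expected) ** 2 / expected).sum()
print(chi2_statistic)
```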


As with our t-statistic earlier, the chi-square statistic has an expected distribution which is dependent on the degrees of freedom. **The degrees of freedom are calculated by subtracting one from the number of rows and the number of columns of the contingency table and multiplying them together** — in this case the degrees of freedom is 3.

In [31]:
print(dof)

3


So, as before, we can graph our chi-square distribution with 3 degrees of freedom, mark where our chi-square statistic falls in that distribution and calculate the area under the distribution curve to the right of that point to find the associated p-value.
![4.png](images/4.png)
<br/>

In [32]:
print(f"p-value={pv}")

p-value=1.0030629464566802e-05
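
The degrees of freedom and the area to the right of the statistic can be checked in a couple of lines. This sketch hard-codes the statistic printed above and the shape of the contingency table (2 rows, 4 columns).

```python
from scipy import stats

chi2_statistic = 25.895405268094862   # value printed above
n_rows, n_cols = 2, 4                  # shape of the contingency table

# degrees of freedom: (rows - 1) * (columns - 1)
degrees_of_freedom = (n_rows - 1) * (n_cols - 1)

# area under the chi-square distribution to the right of the statistic
p_value = stats.chi2.sf(chi2_statistic, df=degrees_of_freedom)
print(f"dof={degrees_of_freedom}, p-value={p_value:.3e}")
```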


Again, we can see that this area is extremely small (<font color='brown'>p-value=0.00001</font>), so we reject the null hypothesis in favour of the alternative hypothesis that **there is a difference in promotion rates between performance categories.**