# Beginner Python and Math for Data Science
## Lecture 9 - Hypothesis Testing
### Statistics

__Purpose:__ The purpose of this lecture is to offer an overview of hypothesis testing. 

__At the end of this lecture you will be able to:__
> 1. Perform Hypothesis Testing 
> 2. Calculate Confidence Intervals 

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import math 
import random
import matplotlib.pyplot as plt
%matplotlib inline

## 1.1 Hypothesis Testing 

### 1.1.1 What is Hypothesis Testing? 

__Overview:__
- __[Hypothesis Testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing):__ Hypothesis Testing uses sample data to decide whether there is enough evidence to reject a default claim about a population
> - Some terminology:
>> a. __Null Hypothesis:__ The Null Hypothesis is the default claim that you would like to test and is usually denoted by $H_0$<br> 
>> b. __Alternative Hypothesis:__ The Alternative Hypothesis is the claim that contradicts the Null Hypothesis and is usually denoted by $H_a$<br> 
>> c. __One-Sided/One-Tailed Hypothesis Test:__ A one-sided/one-tailed hypothesis test checks only if a value is less than a number OR only if a value is greater than a number, but not both. For example:

<center> $H_0$: the mean height of men in the United States is equal to 160 cm ($H_0: \mu = 160$ cm) </center>  

<center> $H_a$: the mean height of men in the United States is greater than 160 cm ($H_a: \mu > 160$ cm) </center> 

>> d. __Two-Sided/Two-Tailed Hypothesis Test:__ A two-sided/two-tailed hypothesis test checks if a value differs from a number in either direction, that is, if it is either less than OR greater than that number. For example:

<center> $H_0$: the mean height of men in the United States is equal to 160 cm ($H_0: \mu = 160$ cm) </center>  

<center> $H_a$: the mean height of men in the United States is either greater than or less than 160 cm ($H_a: \mu \neq 160$ cm) </center> 

__Helpful Points:__ 
1. The goal of the Hypothesis Test is to see if we should reject the Null Hypothesis or fail to reject the Null Hypothesis
2. Failing to reject the Null Hypothesis is NOT the same as accepting the Null Hypothesis
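
The difference between a one-sided and a two-sided test can be sketched numerically. The example below is only an illustration: the `heights` values are invented for this sketch (they are not data from the lecture), and it uses the one-sample t statistic that Section 1.1.3 develops step by step.

```python
import numpy as np
from scipy import stats

# Hypothetical heights in cm, invented only to illustrate the two kinds of test.
heights = np.array([158.0, 163.5, 161.2, 166.8, 159.4, 164.1, 162.7, 160.9])
mu_0 = 160.0  # value of the mean under the Null Hypothesis

n = len(heights)
# One-sample t statistic: (sample mean - mu_0) / standard error of the mean
t_stat = (heights.mean() - mu_0) / (heights.std(ddof=1) / np.sqrt(n))

# One-sided alternative (mean > 160): only the right tail counts as evidence.
p_greater = stats.t.sf(t_stat, df=n - 1)

# Two-sided alternative (mean != 160): both tails count as evidence.
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(f"t = {t_stat:.3f}")
print(f"one-sided p = {p_greater:.4f}, two-sided p = {p_two_sided:.4f}")
```

Because the t distribution is symmetric, the two-sided probability is twice the one-sided probability whenever the statistic points in the direction of the one-sided alternative.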

### 1.1.2 Outcomes of Hypothesis Testing:

__Overview:__ 
- Now that we have defined the Null and Alternative Hypotheses, we can consider the possible outcomes which are best summarized as 4 possible results in the following table:
<img src="img36.png">

__Helpful Points:__ 
1. The Type-I Error is typically described by $\alpha$ = P(Type-I error) = P(Reject $H_0 \mid H_0$ is true)
2. The Type-II Error is typically described by $\beta$ = P(Type-II error) = P(Fail to reject $H_0 \mid H_a$ is true)
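
One way to build intuition for $\alpha$ and $\beta$ is to estimate them by simulation. The sketch below makes several assumptions that are not part of the lecture: a sample size of 7, a Null Hypothesis mean of 80, a population standard deviation of 5, and an alternative where the true mean is 85. It repeatedly draws normal samples and applies a one-sided t test at the 5% level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

alpha = 0.05
n = 7          # sample size (assumed)
mu_0 = 80.0    # mean under H0 (assumed)
sigma = 5.0    # population standard deviation (assumed)
n_sims = 5000  # number of simulated samples per scenario


def rejection_rate(true_mean):
    """Fraction of simulated samples for which a one-sided t test rejects H0."""
    samples = rng.normal(true_mean, sigma, size=(n_sims, n))
    means = samples.mean(axis=1)
    std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)
    t_stats = (means - mu_0) / std_errs
    critical_val = stats.t.ppf(1 - alpha, df=n - 1)
    return np.mean(t_stats >= critical_val)


# H0 is true: every rejection is a Type-I error, so this estimates alpha.
type_1_rate = rejection_rate(true_mean=mu_0)

# Ha is true (assumed true mean of 85): rejections are correct decisions.
power = rejection_rate(true_mean=85.0)

# Failing to reject when Ha is true is a Type-II error, so this estimates beta.
type_2_rate = 1 - power

print(f"estimated alpha = {type_1_rate:.3f}")
print(f"estimated beta  = {type_2_rate:.3f}")
```

The estimated Type-I error rate should land close to the chosen 5% level, while the Type-II error rate depends on how far the assumed true mean is from 80.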

### 1.1.3 Performing Hypothesis Tests:

__Overview:__ 
- There are well-established statistical tests for deciding whether to reject or fail to reject the Null Hypothesis
- Performing a Hypothesis Test involves 3 decisions: 
> 1. What is the Null Hypothesis? 
> 2. Do you want to control the Type-I error rate ($\alpha$) or the Type-II error rate ($\beta$)?
> 3. At what level do you wish to set the selected error rate (either $\alpha$ or $\beta$, depending on the choice made in Step 2)?

__Helpful Points:__
1. Typically we control the Type-I error rate ($\alpha$). It is common for a problem to state "with 95% confidence" or "at the 5% level", both of which refer to a Type-I error rate of $\alpha = 0.05$ 

__Practice:__ Examples of Performing Hypothesis Tests in Python 

### Example 1 (One-Sided Hypothesis Test):

Suppose a teacher takes a sample of their class exam marks and wants to determine if the mean of the exam marks is greater than 80%, with $\alpha = 0.05$. Assume that the class's exam marks are normally distributed so we are sampling from an underlying normal distribution. 

In [None]:
exam_marks = [85, 91, 75, 79, 80, 81, 83]
np.mean(exam_marks)

__Step 1:__ The Null Hypothesis should be stated as $H_0: \mu = 80$ and the Alternative Hypothesis as $H_a: \mu > 80$ since this is a One-Tailed Test 

__Step 2:__ Control the Type-I error rate $\alpha$ at the 5% level 

__Step 3:__ Perform the Hypothesis Test, which essentially involves calculating how likely it is to observe a sample with a mean of 82 or more if the Null Hypothesis is true. Could we have observed this sample mean by chance? 

__Step 3a:__ 
- We have to calculate a __critical region__ and then evaluate if the sample mean falls inside or outside the critical region
- If the observed sample falls into the critical region, we would reject the Null Hypothesis with 1 - $\alpha = 95$ percent confidence 
- If the observed sample falls outside the critical region, we would fail to reject the Null Hypothesis 

__Step 3b:__ 
- To construct a critical region, we must calculate a __test statistic__ which summarizes the sample. There are many test statistics, but we will use the following statistic:

<center> $T = \frac{\bar X - 80}{\sqrt{S^2/n}}$ </center> 

- $\bar X$ = Estimate of the mean of X (the distribution that the sample was drawn from) 
- $S^2$ = Estimate of the variance of X (the distribution that the sample was drawn from) 
- $n$ = Number of observations in the sample 

In [None]:
S_squared = (1/(len(exam_marks)-1))*np.sum((exam_marks - np.mean(exam_marks))**2) # unbiased sample variance 
test_stat = (np.mean(exam_marks) - 80)/(math.sqrt(S_squared / len(exam_marks))) # using formula above 
test_stat

See the figure below to visualize where this test statistic lies
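
As a cross-check on the manual formula, the same statistic can be sketched with NumPy's and SciPy's built-in helpers. Here `np.var(..., ddof=1)` gives the unbiased sample variance and `scipy.stats.sem` gives the standard error of the mean; the variable names are chosen for this sketch only.

```python
import numpy as np
from scipy import stats

exam_marks = np.array([85, 91, 75, 79, 80, 81, 83])
mu_0 = 80  # mean under the Null Hypothesis

# ddof=1 divides by (n - 1), which matches the unbiased sample variance S^2.
sample_var = np.var(exam_marks, ddof=1)

# Test statistic written exactly as in the formula: (X_bar - 80) / sqrt(S^2 / n)
t_manual = (exam_marks.mean() - mu_0) / np.sqrt(sample_var / len(exam_marks))

# stats.sem returns sqrt(S^2 / n) directly (it also uses ddof=1 by default).
t_sem = (exam_marks.mean() - mu_0) / stats.sem(exam_marks)

print(sample_var, t_manual, t_sem)
```

Both routes should agree, since the denominator of the formula is the standard error of the sample mean.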

__Step 3c:__ 
- To determine if the test statistic lies inside or outside the critical region, we need to calculate a __critical value__ corresponding to an alpha level of $\alpha = 0.05$. We will then compare the test statistic with the critical value
- The critical region is defined by the area in which the following holds:
<center> $T \geq t_{n-1}(\alpha)$ </center>
- The critical value $t_{n-1}(\alpha)$ is chosen so that the probability of the test statistic being greater than or equal to it, given that the Null Hypothesis is true, is equal to alpha:
<center> $P(T \geq t_{n-1}(\alpha) \mid H_0) = \alpha$ </center>

In [None]:
deg_freedom = len(exam_marks) - 1 # degrees of freedom (parameter of t distribution)
deg_freedom

In [None]:
critical_val = stats.t.ppf(0.95, deg_freedom) # inverse cdf 
critical_val

Since the test statistic lies to the left of the critical value, the observed sample mean falls outside the critical region and we fail to reject the Null Hypothesis at the 5% significance level. 
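
The decision rule from Step 3c can be sketched as a single comparison. The snippet below recomputes the pieces so it runs on its own; `reject_h0` and `tail_area` are names introduced for this sketch, and `tail_area` confirms that the area to the right of the critical value equals $\alpha$.

```python
import numpy as np
from scipy import stats

exam_marks = np.array([85, 91, 75, 79, 80, 81, 83])
mu_0 = 80
alpha = 0.05

n = len(exam_marks)
t_stat = (exam_marks.mean() - mu_0) / np.sqrt(exam_marks.var(ddof=1) / n)

# Critical value: the point with probability alpha to its right under H0.
critical_val = stats.t.ppf(1 - alpha, df=n - 1)

# One-sided rule: reject H0 only if the statistic falls in the right tail.
reject_h0 = t_stat >= critical_val

# Sanity check: the right-tail area beyond the critical value is alpha.
tail_area = stats.t.sf(critical_val, df=n - 1)

print(f"T = {t_stat:.4f}, critical value = {critical_val:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```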

See the figure below to visualize where this critical value lies

__Step 3d:__
- We can also calculate the probability, assuming the Null Hypothesis is true, of observing a test statistic at least as large as the one observed (this probability is known as the p-value)
- If this probability is less than the specified alpha level, we reject the Null Hypothesis; otherwise, we fail to reject it 

In [None]:
1 - stats.t.cdf(test_stat, deg_freedom) # probability to the right of the test statistic (must be less than 0.05 to reject H0)

In [None]:
stats.ttest_1samp(exam_marks, 80) # one-sample t test (two-sided)

In [None]:
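stats.ttest_1samp(exam_marks, 80).pvalue / 2 # one-tailed probability (half of the two-sided p-value, valid here because the test statistic is positive)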

Here is a helpful figure to understand the above steps. We computed all the numbers in the figure manually above:
<img src="img_37.png">

Note: The above drawing is a rough sketch and is not drawn to scale.
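
The two ways of reaching a decision in Example 1 (comparing the test statistic with the critical value, or comparing the tail probability with $\alpha$) always agree. A minimal sketch of that equivalence, using `scipy.stats.t.sf` for the right-tail probability:

```python
import numpy as np
from scipy import stats

exam_marks = np.array([85, 91, 75, 79, 80, 81, 83])
mu_0 = 80
alpha = 0.05

n = len(exam_marks)
deg_freedom = n - 1
t_stat = (exam_marks.mean() - mu_0) / np.sqrt(exam_marks.var(ddof=1) / n)

# Right-tail probability of the observed statistic (sf = 1 - cdf).
p_one_sided = stats.t.sf(t_stat, deg_freedom)

# SciPy's one-sample t test reports a two-sided probability by default.
p_two_sided = stats.ttest_1samp(exam_marks, mu_0).pvalue

# Decision by p-value and decision by critical value.
reject_by_p = p_one_sided < alpha
reject_by_critical = t_stat >= stats.t.ppf(1 - alpha, deg_freedom)

print(p_one_sided, p_two_sided / 2)
print(reject_by_p, reject_by_critical)
```

Halving the two-sided probability only matches the one-sided probability because the test statistic is positive here, that is, it points in the direction of the Alternative Hypothesis.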

### Problem 1:

An experiment was run at a university to test the efficacy of a new drug. Participants in the experiment were randomly assigned to a "test" group and a "control" group, where the first group was given the actual drug and the second group was given a placebo. The purpose of the drug was to help individuals concentrate when taking standardized tests. The following outlines the results of the standardized tests after taking the respective drugs:

Group 1 (test): `[91, 76, 85, 88, 79, 90]`<br>
Group 2 (control): `[75, 95, 76, 81, 80, 93]`

Determine if the means of the two groups are different at a confidence level of 95% by performing a Hypothesis Test according to the step-by-step method outlined above. Your answer must contain your Null Hypothesis, Alternative Hypothesis, test statistic, and critical value. You can check your answer by using the SciPy function `scipy.stats.ttest_ind`, but make sure you include a manual solution.

NOTE: Since we have two samples now, we can't use the T Statistic mentioned above. Instead, use the following test statistic:

<center> $T = \frac{(\bar x_{group1} - \bar x_{group2}) - 0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$ </center> 

In [None]:
group_1 = [91, 76, 85, 88, 79, 90]
group_2 = [75, 95, 76, 81, 80, 93]

In [None]:
np.mean(group_1)

In [None]:
np.mean(group_2)

In [None]:
# write your code here 





### SOLUTIONS

### Problem 1:

An experiment was run at a university to test the efficacy of a new drug. Participants in the experiment were randomly assigned to a "test" group and a "control" group, where the first group was given the actual drug and the second group was given a placebo. The purpose of the drug was to help individuals concentrate when taking standardized tests. The following outlines the results of the standardized tests after taking the respective drugs:

Group 1 (test): `[91, 76, 85, 88, 79, 90]`<br>
Group 2 (control): `[75, 95, 76, 81, 80, 93]`

Determine if the means of the two groups are different at a confidence level of 95% by performing a Hypothesis Test according to the step-by-step method outlined above. Your answer must contain your Null Hypothesis, Alternative Hypothesis, test statistic, and critical value. You can check your answer by using the SciPy function `scipy.stats.ttest_ind`, but make sure you include a manual solution.

NOTE: Since we have two samples now, we can't use the T Statistic mentioned above. Instead, use the following test statistic:

<center> $T = \frac{(\bar x_{group1} - \bar x_{group2}) - 0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$ </center> 

In [None]:
group_1 = [91, 76, 85, 88, 79, 90]
group_2 = [75, 95, 76, 81, 80, 93]

In [None]:
np.mean(group_1)

In [None]:
np.mean(group_2)

### Part 1 - Manually:

__Step 1:__ The Null Hypothesis should be stated as $H_0 : \mu_{group1} = \mu_{group2}$, which is the same as $H_0 : \mu_{group1} - \mu_{group2} = 0$. The Alternative Hypothesis is $H_a : \mu_{group1} \neq \mu_{group2}$, which is the same as $H_a : \mu_{group1} - \mu_{group2} \neq 0$. This is a Two-Tailed Test.

In [None]:
np.mean(group_1) - np.mean(group_2)

We are essentially saying the Null Hypothesis is that the difference between the population mean of group 1 and the population mean of group 2 is equal to 0. Notice that we observe a difference in sample means of 1.5, but the question is - is this number significantly different from 0?

__Step 2:__ Control the Type-I error rate $\alpha$ at the 5% level 

__Step 3:__ Perform the Hypothesis Test, which essentially involves calculating how likely it is to observe a difference in sample means at least as extreme as 1.5 if the Null Hypothesis is true. Could we have observed this difference in sample means by chance? 

__Step 3a:__ 
- We have to calculate a __critical region__ and then evaluate if the difference between the sample means falls inside or outside the critical region
- If the observed difference in sample means falls into the critical region, we would reject the Null Hypothesis with 1 - $\alpha = 95$ percent confidence 
- If the observed difference in sample means falls outside the critical region, we would fail to reject the Null Hypothesis 

__Step 3b:__ 
- To construct a critical region, we must calculate a __test statistic__ which summarizes the samples. There are many test statistics, but we will use the following statistic:

<center> $T = \frac{(\bar x_{group1} - \bar x_{group2}) - 0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$ </center> 

- $\bar x_i$ = Estimate of the mean of group i  
- $S_i^2$ = Estimate of the variance of group i 
- $n_i$ = Number of observations in group i 

In [None]:
S_squared_1 = (1/(len(group_1)-1))*np.sum((group_1 - np.mean(group_1))**2)
S_squared_2 = (1/(len(group_2)-1))*np.sum((group_2 - np.mean(group_2))**2)
test_stat = (np.mean(group_1) - np.mean(group_2))/(math.sqrt((S_squared_1 / len(group_1)) + (S_squared_2 / len(group_2))))  
test_stat

__Step 3c:__ 
- To determine if the test statistic lies inside or outside the critical region, we need to calculate a __critical value__ corresponding to an alpha level of $\alpha = 0.05$. We will then compare the absolute value of the test statistic with the critical value
- Since this is a Two-Tailed Test, the alpha level is split equally between the two tails and the critical region is defined by the area in which the following holds:
<center> $|T| \geq t_{n_1+n_2-2}(\alpha/2)$ </center>
- The critical value is chosen so that the probability of the test statistic falling in either tail, given that the Null Hypothesis is true, is equal to alpha:
<center> $P(|T| \geq t_{n_1+n_2-2}(\alpha/2) \mid H_0) = \alpha$ </center>

In [None]:
deg_freedom = len(group_1) + len(group_2) - 2
deg_freedom

In [None]:
critical_val = stats.t.ppf(1 - 0.05 / 2, deg_freedom) # inverse cdf (two-tailed test, so alpha/2 in each tail)
critical_val

Since the absolute value of the test statistic is smaller than the critical value, the observed difference in sample means falls outside the critical region and we fail to reject the Null Hypothesis at the 5% significance level. 
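
Because the Alternative Hypothesis here is two-sided, the decision rule can be sketched with $\alpha/2$ in each tail and the absolute value of the test statistic. The snippet recomputes everything so it runs on its own; `std_err` and `reject_h0` are names introduced for this sketch.

```python
import numpy as np
from scipy import stats

group_1 = np.array([91, 76, 85, 88, 79, 90])
group_2 = np.array([75, 95, 76, 81, 80, 93])
alpha = 0.05

# Standard error of the difference in sample means: sqrt(S1^2/n1 + S2^2/n2)
std_err = np.sqrt(group_1.var(ddof=1) / len(group_1)
                  + group_2.var(ddof=1) / len(group_2))
t_stat = (group_1.mean() - group_2.mean()) / std_err
deg_freedom = len(group_1) + len(group_2) - 2

# Two-sided test: alpha is split across both tails, so use 1 - alpha/2.
critical_val = stats.t.ppf(1 - alpha / 2, deg_freedom)

# Reject H0 if the statistic is extreme in either direction.
reject_h0 = abs(t_stat) >= critical_val

print(f"T = {t_stat:.4f}, critical value = {critical_val:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

The two-sided critical value is larger than the one-sided value at the same $\alpha$, so using the one-sided value in a two-sided test would make rejection too easy.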

__Step 3d:__
- We can also calculate the probability, assuming the Null Hypothesis is true, of observing a difference in sample means at least as extreme as the difference observed (this probability is known as the p-value) 
- If this probability is less than the specified alpha level, we reject the Null Hypothesis; otherwise, we fail to reject it 

In [None]:
1 - stats.t.cdf(test_stat, deg_freedom) # one-tailed probability

In [None]:
(1 - stats.t.cdf(test_stat, deg_freedom)) * 2 # two-tailed probability

We fail to reject the Null Hypothesis since this probability is greater than alpha 
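
One caveat about the two-tailed probability is worth a sketch: doubling `1 - cdf(T)` only works when the test statistic is positive. If the groups were subtracted in the other order, the statistic would be negative and that expression would exceed 1. Taking the absolute value first avoids the problem. The helper `two_sided_p` is a name introduced for this sketch, not a SciPy function.

```python
import numpy as np
from scipy import stats

group_1 = np.array([91, 76, 85, 88, 79, 90])
group_2 = np.array([75, 95, 76, 81, 80, 93])

std_err = np.sqrt(group_1.var(ddof=1) / len(group_1)
                  + group_2.var(ddof=1) / len(group_2))
t_stat = (group_1.mean() - group_2.mean()) / std_err
deg_freedom = len(group_1) + len(group_2) - 2


def two_sided_p(t, df):
    """Two-tailed probability that is valid for positive or negative t."""
    return 2 * stats.t.sf(abs(t), df)


p_forward = two_sided_p(t_stat, deg_freedom)    # group 1 minus group 2
p_swapped = two_sided_p(-t_stat, deg_freedom)   # group 2 minus group 1

# Doubling the right tail of a negative statistic is not a probability.
naive_swapped = 2 * (1 - stats.t.cdf(-t_stat, deg_freedom))

print(p_forward, p_swapped, naive_swapped)
```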

### Part 2 - Programmatically:

In [None]:
# 2 sample t test in Python (assumes equal variance which is ok for now)
stats.ttest_ind(group_1, group_2)

Therefore, we fail to reject the Null Hypothesis that the two group mean values are equal. 
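
The equal-variance assumption mentioned in the code comment can be relaxed by passing `equal_var=False` to `scipy.stats.ttest_ind`, which runs Welch's t test instead. The sketch below compares the two; it is an illustration rather than part of the required solution.

```python
from scipy import stats

group_1 = [91, 76, 85, 88, 79, 90]
group_2 = [75, 95, 76, 81, 80, 93]

# Default: pooled-variance t test with n1 + n2 - 2 degrees of freedom.
pooled = stats.ttest_ind(group_1, group_2)

# Welch's t test: no equal-variance assumption, fewer effective degrees of freedom.
welch = stats.ttest_ind(group_1, group_2, equal_var=False)

print(f"pooled: T = {pooled.statistic:.4f}, p = {pooled.pvalue:.4f}")
print(f"welch:  T = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
```

Because both groups have the same number of observations, the two test statistics coincide and match the manual formula above; only the degrees of freedom, and therefore the probabilities, differ slightly. Either way, we fail to reject the Null Hypothesis.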