# Hypothesis Test
---

1.   **[Definition](#1.-Definition)**
1.   **[Notebook Preparation](#2.-Notebook-Preparation)**
1.   **[Data Preparation](#3.-Data-Preparation)**
1.   **[Hypothesis Test](#4.-Hypothesis-Test)**
1.   **[Code Cells](#5.-Code-Cells)**

<a name="1.-Definition"></a>
### 1. Definition

Hypothesis testing uses sample data to evaluate an assumption about a population parameter. 

Data professionals conduct a hypothesis test to decide whether sample data provide enough evidence to reject the null hypothesis in favor of the alternative hypothesis.

**Steps for conducting a hypothesis test:**
1.   [State the null and alternative hypotheses](#4.1:-State-the-hypotheses)
2.   [Choose a significance level](#4.2:-Specify-the-Significance-Level)
3.   [Find the p-value](#4.3:-Find-the-p-value)
4.   [Reject or fail to reject the null hypothesis](#4.4:-Reject-or-fail-to-reject-the-null-hypothesis)

<a name="2.-Notebook-Preparation"></a>
### 2. Notebook Preparation

In [2]:
# Import libraries and packages 
# import pandas as pd
# from scipy import stats

In [7]:
# Load data
# data = pd.read_csv("file_name.csv")
# data = data.dropna()

<a name="3.-Data-Preparation"></a>
### 3. Data Preparation

Instantiate the required data:
- name variables 
- apply any conditional filters

#### 3.1 Simulate random sampling

**Sample 1:** 

After data prep, use the pandas `sample()` method to take a random sample. 
- First, name a new variable, e.g. `x1`. 
- Then, enter the arguments of `sample()`: 
    *   `n`: sample size 
    *   `replace`: `True` to sample with replacement
    *   `random_state`: seed for the random number generator, so the sample is reproducible

In [None]:
# x1 = x1.sample(n= , replace= True, random_state= 42)

**Sample 2:**
 
For the second sample:
- Follow the same procedure as for sample 1.
- Choose a different number for the random seed.

In [None]:
# x2 = x2.sample(n= , replace= True, random_state= 420)
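The two sampling cells above can be sketched end to end as follows. This is only an illustration: the data frame, its `column_name` column, and the sample size of 30 are made-up assumptions, not values from a real data set.

```python
import pandas as pd

# Hypothetical data set: the column name and values are made up for illustration.
data = pd.DataFrame({"column_name": range(1, 101)})

# Two random samples of 30 rows each, drawn with replacement.
# Different seeds give different (but reproducible) samples.
x1 = data.sample(n=30, replace=True, random_state=42)
x2 = data.sample(n=30, replace=True, random_state=420)

print(x1.shape, x2.shape)
```

Re-running the cell with the same `random_state` returns the same rows, which keeps the rest of the analysis reproducible.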

#### 3.2 Compute sample means

Use `mean()` to compute the mean for both samples.

In [6]:
# x1['column_name'].mean()

In [5]:
# x2['column_name'].mean()

**Note**: At this point, one might be tempted to conclude that `sample 1` has a higher or lower mean than `sample 2`. However, because of sampling variability, the observed difference might be due to chance rather than an actual difference in the corresponding population means. 
- A hypothesis test is used to determine whether the observed difference is statistically significant. 
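To see sampling variability in isolation, the sketch below draws two samples from the *same* simulated population, so any difference between their means is due to chance alone. The population parameters (mean 50, standard deviation 10) and the sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated population (assumed parameters, for illustration only).
population = rng.normal(loc=50, scale=10, size=10_000)

# Two samples from the same population.
sample_a = rng.choice(population, size=30, replace=True)
sample_b = rng.choice(population, size=30, replace=True)

# The means differ even though both samples share one population mean.
diff = sample_a.mean() - sample_b.mean()
print(sample_a.mean(), sample_b.mean(), diff)
```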

<a name="4.-Hypothesis-Test"></a>
### 4. Hypothesis Test

**Definitions:**
- One-Sample
    - Compares a sample mean or proportion to a known population mean or proportion. This type of test is used when we want to know whether a sample is statistically different from a population, or if an intervention or treatment has a significant effect on a population. For example, a one-sample t-test can be used to determine if the average weight of a sample of apples is different from a known population mean weight.
- Two-Sample
    - Compares two sample means or proportions to each other. This type of test is used when we want to compare two groups to see if there is a statistically significant difference between them. For example, a two-sample t-test can be used to determine if there is a statistically significant difference in the average test scores between two different classes of students.
- T-test
    - Used when the population variance is unknown and must be estimated from the sample, especially when the sample size is small (typically, less than 30). The test statistic is compared against the t-distribution, which accounts for the added uncertainty that comes with estimating the population variance from a small sample. For this reason, the t-test is more reliable than the z-test when the population variance is unknown and the sample is small.
- Z-test
    - Used when the population variance is known, or when the sample size is large (typically, greater than 30) so that the sample variance is a close estimate of it. The test statistic is compared against the standard normal distribution. When the population variance is known, the z-test avoids the extra uncertainty of estimating it from the sample.
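One way to see the practical difference between the two tests is to compare their two-sided critical values. The sketch below uses `scipy.stats.norm.ppf` and `scipy.stats.t.ppf`; the significance level and the sample sizes are illustrative choices.

```python
from scipy.stats import norm, t

alpha = 0.05  # illustrative significance level

# Two-sided critical value of the standard normal distribution.
z_crit = norm.ppf(1 - alpha / 2)

# Two-sided critical values of the t-distribution with n - 1 degrees of freedom.
for n in (5, 30, 1000):
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n={n}: t critical={t_crit:.3f}, z critical={z_crit:.3f}")
```

For small samples the t critical value is noticeably larger than the z critical value, so stronger evidence is needed to reject the null hypothesis; as the sample size grows, the two converge.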

#### 4.1: State the hypotheses


- The **null hypothesis** is a statement that is assumed to be true unless there is convincing evidence to the contrary. 
- The **alternative hypothesis** is a statement that contradicts the null hypothesis, and is accepted as true only if there is convincing evidence for it. 

In a two-sample t-test, the null hypothesis states that there is no difference between the means of your two groups. The alternative hypothesis states the contrary claim: there is a difference between the means of your two groups. 

*   $H_0$: There is no difference in the mean `['chosen field']` between `x1` and `x2`
*   $H_A$: There is a difference in the mean `['chosen field']` between `x1` and `x2`
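In symbols, writing $\mu_1$ and $\mu_2$ for the population means behind `x1` and `x2` (this notation is introduced here for illustration), the two-sided hypotheses are:

$$
H_0: \mu_1 = \mu_2 \qquad \text{versus} \qquad H_A: \mu_1 \neq \mu_2
$$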

#### 4.2: Specify the Significance Level



The **significance level** is the threshold at which a result is considered statistically significant. It is the probability of rejecting the null hypothesis when it is actually true (a Type I error). 
- **The conventional choice is 5%, or 0.05.**
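The interpretation of the significance level as an error rate can be checked with a small simulation. In the sketch below the null hypothesis is true by construction (every sample comes from a population with mean 0), so each rejection is a false positive. The number of trials, sample size, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
alpha = 0.05
n_trials = 2000

# Each row is one sample of size 20 from a population whose true mean is 0.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, 20))

# One-sample t-test per row against the (true) null value 0.
p_values = ttest_1samp(samples, 0.0, axis=1).pvalue

# Share of tests that wrongly reject the true null hypothesis.
false_positive_rate = np.mean(p_values < alpha)
print(false_positive_rate)
```

The observed rate should land close to `alpha`, which is what "5% significance level" means in practice.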

#### 4.3: Find the p-value

**P-value:** the probability of observing results as extreme as, or more extreme than, those observed, assuming the null hypothesis is true.

- Based on sample data, the difference between the means of `x1` and `x2` is `n%`. The null hypothesis claims that this difference is due to chance. The p-value is the probability of observing an absolute difference in sample means of `n%` or greater *if* the null hypothesis is true. 
- If this outcome is very unlikely, in particular if the p-value is *less than* the significance level of 5%, then reject the null hypothesis.
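To make the definition concrete, the following sketch computes a two-sided p-value by hand and checks it against `scipy.stats.ttest_1samp`. For simplicity it uses a one-sample t-test rather than the two-sample case described above; the data and the null value `mu_0` are illustrative.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

sample = np.array([1.2, 1.8, 0.9, 1.3, 1.5, 1.4, 1.6, 1.1, 1.7, 1.2])
mu_0 = 1.5  # population mean under the null hypothesis (illustrative)

n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_statistic = (sample.mean() - mu_0) / standard_error

# Two-sided p-value: probability of a t statistic at least this extreme
# in either tail, if the null hypothesis is true.
p_manual = 2 * t.sf(abs(t_statistic), df=n - 1)

# Same test computed by SciPy, for comparison.
_, p_scipy = ttest_1samp(sample, mu_0)
print(p_manual, p_scipy)
```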

#### 4.4: Reject or fail to reject the null hypothesis


To draw a conclusion, compare the p-value with the significance level.

*   If the **p-value < significance level**, conclude that there is a statistically significant difference between the sample means and ***reject the null hypothesis*** $H_0$.
*   If the **p-value ≥ significance level**, conclude there is *not* a statistically significant difference between the sample means and therefore ***fail to reject the null hypothesis*** $H_0$.
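The decision rule can be wrapped in a small helper. The function name `decide` and its return strings are hypothetical, chosen only for this sketch.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))  # p-value below alpha
print(decide(0.20))  # p-value above alpha
```

A p-value exactly equal to the significance level falls on the "fail to reject" side, because the rejection condition is a strict inequality.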


<a name="5.-Code-Cells"></a>
### 5. Code Cells

##### 5.1: One-Sample $t$-test:

In [None]:
# import numpy as np
# from scipy.stats import ttest_1samp

# sample = np.array([1.2, 1.8, 0.9, 1.3, 1.5, 1.4, 1.6, 1.1, 1.7, 1.2])

# null_hypothesis = 1.5   # The null hypothesis is that the population mean is 1.5 for this example
# alpha =  0.05           # The significance level is 5%

# t_statistic, p_value = ttest_1samp(sample, null_hypothesis)

# # Check the p-value against the significance level to determine whether to reject or fail to reject the null hypothesis:
# if p_value < alpha:
#     print("The p-value is", p_value, "which is less than the significance level of", alpha)
#     print("We reject the null hypothesis that the population mean is", null_hypothesis)
# else:
#     print("The p-value is", p_value, "which is greater than or equal to the significance level of", alpha)
#     print("We fail to reject the null hypothesis that the population mean is", null_hypothesis)

##### 5.2: One-Sample $z$-test:

In [None]:
# import numpy as np
# from scipy.stats import norm

# data = np.array([1.2, 1.8, 0.9, 1.3, 1.5, 1.4, 1.6, 1.1, 1.7, 1.2])

# null_hypothesis = 1.5     # The null hypothesis is that the population mean is 1.5 for this example
# sigma = 0.3               # The population standard deviation has to be known for z-tests
# alpha = 0.05              # The significance level is 5%

# # Calculate the z-score
# z_score = (np.mean(data) - null_hypothesis) / (sigma / np.sqrt(len(data)))

# # Calculate the p-value
# p_value = 2 * norm.sf(np.abs(z_score))  # two-sided test

# # Check the p-value against the significance level to determine whether to reject or fail to reject the null hypothesis:
# if p_value < alpha:
#     print("The p-value is", p_value, "which is less than the significance level of", alpha)
#     print("We reject the null hypothesis that the population mean is", null_hypothesis)
# else:
#     print("The p-value is", p_value, "which is greater than or equal to the significance level of", alpha)
#     print("We fail to reject the null hypothesis that the population mean is", null_hypothesis)

##### 5.3: Two-Sample $t$-test:

Use `scipy.stats.ttest_ind(a, b, equal_var= )` to compute the p-value. 

This function includes the arguments:

*   `a`: Observations from the first sample. 
*   `b`: Observations from the second sample.
*   `equal_var`: A boolean that indicates whether the population variances of the two samples are assumed to be equal.
    - Without access to data for the entire population, avoid making a wrong assumption and set this argument to `False`. This performs Welch's t-test, which does not assume equal population variances. 

Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
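The effect of `equal_var` is easiest to see when the two groups differ in both size and spread. The sketch below runs the pooled test and Welch's test on the same simulated data; the group parameters and seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# Simulated groups with unequal sizes and unequal spreads (assumed values).
group_a = rng.normal(loc=10.0, scale=1.0, size=40)
group_b = rng.normal(loc=10.5, scale=4.0, size=10)

# Pooled-variance (Student's) t-test versus Welch's t-test.
pooled = ttest_ind(group_a, group_b, equal_var=True)
welch = ttest_ind(group_a, group_b, equal_var=False)

print("equal_var=True :", pooled.statistic, pooled.pvalue)
print("equal_var=False:", welch.statistic, welch.pvalue)
```

The two calls generally return different statistics and p-values on data like this, which is why the equal-variance assumption should not be made casually.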

In [None]:
# import numpy as np
# from scipy.stats import ttest_ind

# sample1 = np.array([1.2, 1.8, 0.9, 1.3, 1.5])
# sample2 = np.array([1.4, 1.6, 1.1, 1.7, 1.2])

# # Specify the null hypothesis and the significance level (alpha)
# null_hypothesis = 0  # The null hypothesis is that the population means are equal
# alpha = 0.05  # The significance level is 5%

# # Use the ttest_ind() function to perform the two-sample t-test
# t_statistic, p_value = ttest_ind(sample1, sample2, equal_var= False) 

# # Check the p-value against the significance level to determine whether to reject or fail to reject the null hypothesis
# if p_value < alpha:
#     print("The p-value is", p_value, "which is less than the significance level of", alpha)
#     print("We reject the null hypothesis that the population means are equal")
# else:
#     print("The p-value is", p_value, "which is greater than or equal to the significance level of", alpha)
#     print("We fail to reject the null hypothesis that the population means are equal")

##### 5.4: Two-Sample $z$-test:

**Notes:** 
- The z-test assumes that the population standard deviations are known. If they are unknown, a two-sample t-test is usually more appropriate, especially for small samples.
- Setting the `ddof` parameter to 1 divides by `n - 1` instead of `n` when computing the sample standard deviation, which accounts for the one degree of freedom lost by estimating the mean from the sample. This is the usual choice when estimating from a sample: it gives an unbiased estimator of the population variance (the resulting standard deviation is still slightly biased, but less so than with `ddof=0`).
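The effect of `ddof` can be checked directly with `numpy.std`. This short sketch reuses the first sample from the cell below; the variable names are chosen for illustration.

```python
import numpy as np

sample1 = np.array([1.2, 1.8, 0.9, 1.3, 1.5])
n = len(sample1)

# ddof=0 divides the sum of squared deviations by n (population formula).
sd_ddof0 = np.std(sample1, ddof=0)

# ddof=1 divides by n - 1 (sample formula, Bessel's correction).
sd_ddof1 = np.std(sample1, ddof=1)

print(sd_ddof0, sd_ddof1)
```

The two values differ by a factor of `sqrt(n / (n - 1))`, so the choice matters most for small samples.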

In [None]:
# import numpy as np
# from scipy.stats import norm

# sample1 = np.array([1.2, 1.8, 0.9, 1.3, 1.5])
# sample2 = np.array([1.4, 1.6, 1.1, 1.7, 1.2])

# # Specify the null hypothesis, the population standard deviation (sigma), and the significance level (alpha)
# null_hypothesis = 0  # The null hypothesis is that the population means are equal
# sigma1 = np.std(sample1, ddof=1)  # The sample standard deviations are used to estimate the population standard deviations
# sigma2 = np.std(sample2, ddof=1)
# n1 = len(sample1)  # The sample sizes
# n2 = len(sample2)
# alpha = 0.05  # The significance level is 5%

# # Calculate the z-score
# z_score = ((np.mean(sample1) - np.mean(sample2)) - null_hypothesis) / np.sqrt((sigma1**2/n1) + (sigma2**2/n2))

# # Calculate the two-sided p-value from the standard normal distribution (the alternative hypothesis is that the means differ in either direction)
# p_value = 2 * norm.sf(np.abs(z_score))  # two-sided test

# # Check the p-value against the significance level to determine whether to reject or fail to reject the null hypothesis
# if p_value < alpha:
#     print("The p-value is", p_value, "which is less than the significance level of", alpha)
#     print("We reject the null hypothesis that the population means are equal")
# else:
#     print("The p-value is", p_value, "which is greater than or equal to the significance level of", alpha)
#     print("We fail to reject the null hypothesis that the population means are equal")
