# Hypothesis Testing


## Review

* Hypothesis tests are used to compare datasets and assess whether the differences between them could plausibly be the result of chance
* Hypothesis tests return a **p-value**
* The "p" stands for "probability"
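* The p-value is the probability of seeing results at least as extreme as ours if the null hypothesis were true; a small p-value means the data would be unlikely under the null hypothesis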
* To choose the right test from `scipy.stats`, think in terms of your dataframe:
    * To compare the mean of **1 column** to a predetermined mean: `ttest_1samp`
        * *Null hypothesis: The column mean is equal to the predetermined mean*
    * To compare the means of **two related columns**: `ttest_rel`
        * *Null hypothesis: The mean of the row-by-row differences between the two columns is zero*
    * To compare the means of **two independent columns**: `ttest_ind`
        * *Null hypothesis: The columns' means are equal*
    * To compare the means of **three or more independent columns**: `f_oneway`
        * *Null hypothesis: The columns' means are equal*
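
The list above can be sketched as one call per test. This is a minimal illustration on made-up data: the column names (`before`, `after`, `group_b`, `group_c`), the seed, and the distribution parameters are all assumptions chosen for the example, not part of the original notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical data: 50 measurements per column
before = rng.normal(loc=10.0, scale=2.0, size=50)
after = before + rng.normal(loc=0.5, scale=1.0, size=50)  # same subjects, measured again
group_b = rng.normal(loc=10.0, scale=2.0, size=50)        # unrelated subjects
group_c = rng.normal(loc=11.0, scale=2.0, size=50)        # unrelated subjects

one_sample = stats.ttest_1samp(before, popmean=10.0)      # 1 column vs. a predetermined mean
paired = stats.ttest_rel(before, after)                   # two related columns
independent = stats.ttest_ind(before, group_b)            # two independent columns
anova = stats.f_oneway(before, group_b, group_c)          # three or more independent columns

for name, result in [
    ("ttest_1samp", one_sample),
    ("ttest_rel", paired),
    ("ttest_ind", independent),
    ("f_oneway", anova),
]:
    print(f"{name}: p = {result.pvalue:.4f}")
```

Each call returns a result object with a test statistic and a `pvalue` attribute, so the same comparison against alpha works for all four tests.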
    

## ANOVA

* <strong>AN</strong>alysis <strong>O</strong>f <strong>VA</strong>riance
* We'll be doing a "1-way ANOVA" test
* Used to compare means of three or more groups
* Groups do not necessarily have to be equal in length
* Null hypothesis: groups' means are equal because they have all been drawn from the same population
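
To see why the test is called an analysis of variance, the F statistic can be computed by hand as the ratio of between-group variance to within-group variance. The sketch below uses three small made-up groups of unequal length (the values are assumptions for illustration) and compares the manual result with `f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical groups; lengths differ on purpose
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([5.0, 9.0, 7.0, 8.0, 6.0]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # between-group mean square
ms_within = ss_within / (n_total - k)    # within-group mean square
f_manual = ms_between / ms_within

result = f_oneway(*groups)
print(f"manual F = {f_manual:.6f}, scipy F = {result.statistic:.6f}")
```

A large F means the group means are spread out more than the noise inside each group would explain, which is what produces a small p-value.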

In [1]:
import pandas as pd
import numpy as np

np.random.seed(42)

In [2]:
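def decimal_str(x: float, decimals: int = 50) -> str:
    # Fixed-point string (no scientific notation) with trailing zeros removed
    return format(x, f".{decimals}f").rstrip('0')

def interpret(alpha, p_val):
    # The p-value is the probability of results at least this extreme
    # if the null hypothesis were true
    print(f"p-value: {decimal_str(p_val, 30)}")
    if p_val < alpha:
        print("Is our p-value less than the alpha? Yes. We reject the null hypothesis! Kill it with fire!")
    else:
        # A large p-value does not prove the null hypothesis; we only fail to reject it
        print("Is our p-value less than the alpha? No. We fail to reject the null hypothesis.")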

def roll(x):
    fair = [1, 2, 3, 4, 5, 6]
    unfair = [6, 2, 3, 4, 5, 6]   # the 1 is replaced by a second 6, so 6 comes up twice as often
    return np.random.choice(fair if x == 'fair' else unfair)

In [3]:
r = 1000

df = pd.DataFrame({
    'A_1d': [roll('fair') for _ in range(r)],
    'A_2d': [roll('fair') for _ in range(r)],
    'A_3d': [roll('fair') for _ in range(r)],
    
    'B_1d': [roll('fair') for _ in range(r)],
    'B_2d': [roll('fair') for _ in range(r)],
    'B_3d': [roll('unfair') for _ in range(r)]
})

df.head(5)

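|   | A_1d | A_2d | A_3d | B_1d | B_2d | B_3d |
|---|------|------|------|------|------|------|
| 0 | 4 | 6 | 3 | 1 | 5 | 4 |
| 1 | 5 | 5 | 6 | 5 | 4 | 6 |
| 2 | 3 | 6 | 4 | 2 | 3 | 4 |
| 3 | 5 | 6 | 5 | 5 | 3 | 4 |
| 4 | 5 | 4 | 5 | 4 | 4 | 4 |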


In [4]:
df.mean()

A_1d    3.457
A_2d    3.500
A_3d    3.568
B_1d    3.502
B_2d    3.439
B_3d    4.333
dtype: float64

In [7]:
from scipy.stats import f_oneway   # one-way ANOVA

# f_oneway returns (statistic, pvalue); index 1 selects the p-value
p_val = f_oneway(
    df['A_1d'],
    df['A_2d'],
    df['A_3d'],
    df['B_1d'],
    df['B_2d'],
    df['B_3d']
)[1]

decimal_str(p_val)

'0.00000000000000000000000000000000000000000042671472'

In [8]:
p_val < 0.05

True

In [12]:
f_oneway(*[df[col] for col in df])

F_onewayResult(statistic=42.42720671287208, pvalue=4.267147242924766e-43)
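
A significant one-way ANOVA says that at least one group mean differs, not which one. The sketch below rebuilds a smaller version of the dice experiment to illustrate this. The column names (`fair_1`, `fair_2`, `fair_3`, `loaded`) and the use of `np.random.default_rng` are assumptions for this example, so the numbers will not match the cells above.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(42)  # separate generator; results differ from the cells above
fair = [1, 2, 3, 4, 5, 6]
unfair = [6, 2, 3, 4, 5, 6]      # same loaded die as in roll()
n_rolls = 1000

dice = pd.DataFrame({
    'fair_1': rng.choice(fair, size=n_rolls),
    'fair_2': rng.choice(fair, size=n_rolls),
    'fair_3': rng.choice(fair, size=n_rolls),
    'loaded': rng.choice(unfair, size=n_rolls),
})

# All four dice, including the loaded one
all_dice = f_oneway(*[dice[col] for col in dice])
# Only the three fair dice
fair_only = f_oneway(dice['fair_1'], dice['fair_2'], dice['fair_3'])

# The result is a named tuple, so fields can be read by name instead of by index
print(f"all dice:  F = {all_dice.statistic:.2f}, p = {all_dice.pvalue:.3g}")
print(f"fair only: F = {fair_only.statistic:.2f}, p = {fair_only.pvalue:.3g}")
```

Comparing the two results shows how much of the overall effect comes from the loaded die. Even with only fair dice, a p-value below 0.05 will still occur by chance in roughly 5% of experiments.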

In [10]:
def arg(*args):       # *args collects any number of positional arguments into a tuple
    for value in args:
        print(value)

arg(1, 2)

1
2


In [13]:
arg(*[1, 2]) # the * unpacks the list, so each element is passed as a separate argument

1
2
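
The same unpacking is what makes `f_oneway(*[...])` work: passing `*groups` is equivalent to writing each group out as its own argument. A minimal sketch with made-up groups (the values are assumptions for illustration):

```python
from scipy.stats import f_oneway

# Hypothetical groups held in a single list
groups = [
    [4, 5, 6, 5],
    [6, 7, 8, 7],
    [5, 9, 7, 8],
]

# Writing every group out by hand ...
explicit = f_oneway(groups[0], groups[1], groups[2])
# ... is the same call as unpacking the list with *
unpacked = f_oneway(*groups)

print(explicit.statistic == unpacked.statistic)
print(explicit.pvalue == unpacked.pvalue)
```

Unpacking is convenient when the number of groups is not known in advance, for example when looping over the columns of a dataframe.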
