# Hypothesis Testing
Yang Xi <br>
22 Aug, 2021

In [1]:
import numpy as np
import scipy.stats as ss

# Shapiro-Wilk Test: Examining Normality
* **Null hypothesis**: the sample comes from a normally distributed population

In [2]:
print("Large p-value ==> normally distributed:")
np.random.seed(1)
ss.shapiro(np.random.normal(loc=5, scale=3, size=100))

Large p-value ==> normally distributed:


ShapiroResult(statistic=0.9920048713684082, pvalue=0.8215786218643188)

In [3]:
print("Small p-value ==> significantly NOT normally distributed")
np.random.seed(1)
ss.shapiro(np.random.uniform(low=2, high=4, size=100))

Small p-value ==> significantly NOT normally distributed


ShapiroResult(statistic=0.9471390247344971, pvalue=0.0005401436355896294)

# T-test of Means of Two Samples
Refer to **AB Testing** for
- t-test of two sample means
- z-test of two sample proportions
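
For quick orientation, here is a minimal sketch of a two-sample t-test using `scipy.stats.ttest_ind`. The simulated samples and the choice of Welch's variant (`equal_var=False`) are assumptions made for illustration; they are not necessarily what the AB Testing notebook uses.

```python
import numpy as np
import scipy.stats as ss

np.random.seed(1)
# Illustrative samples (assumed): different means and different spreads
a = np.random.normal(loc=5.0, scale=1.0, size=200)
b = np.random.normal(loc=6.0, scale=1.5, size=200)

# Welch's t-test: does not assume equal variances
t_stat, p_value = ss.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat}, p value = {p_value}")
```

A small p-value here suggests the two population means differ.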

# Analysis of Variance (ANOVA) of Means of Multiple Samples
- ANOVA generalizes the t-test beyond two means.
- It is based on the **Law of Total Variance**, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. *Var(Y) = E[Var(Y|X)] + Var(E[Y|X])*
- There are three classes of models used in ANOVA: fixed-effects, random-effects and mixed-effects.
- The most common approach to ANOVA is a linear model that relates the response to the treatments and blocks.
    - **Assumptions**:
        - **Independent** samples from **normally distributed** population
        - **homoscedasticity** (equal variance) of samples.
        - Normality and homoscedasticity of residuals.
    - **Notes**:
        - The model is linear in parameters, but may be **nonlinear across factor levels**.
        - Interpretation is easy for **balanced data**, but complicated for unbalanced data.
        - The ANOVA result is **independent of constant bias and scaling errors**: adding a constant to all observations, or multiplying all observations by a nonzero constant, does not alter the significance.
    - Number of Factors:
        - **One-Way ANOVA**: Single factor
        - **N-Way ANOVA**: Generalized to multiple factors. **Factorial** experiments include observations at all combinations of levels of each factor, and they are more efficient than a series of single-factor experiments. Factorial experiments can also detect interactions, and the interaction terms should be tested first.
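
The invariance to constant bias and scaling noted above can be checked numerically. The sketch below reuses the three groups from the One-Way ANOVA example in this notebook; the shift of 10 and the scale factor of 2.5 are arbitrary illustrative choices.

```python
import numpy as np
import scipy.stats as ss

g1 = np.array([6, 8, 4, 5, 3, 4], dtype=float)
g2 = np.array([8, 12, 9, 11, 6, 8], dtype=float)
g3 = np.array([13, 9, 11, 8, 7, 12], dtype=float)

base = ss.f_oneway(g1, g2, g3)
# Add the same constant to every observation
shifted = ss.f_oneway(g1 + 10, g2 + 10, g3 + 10)
# Multiply every observation by the same nonzero constant
scaled = ss.f_oneway(g1 * 2.5, g2 * 2.5, g3 * 2.5)

print(base.statistic, shifted.statistic, scaled.statistic)
print(base.pvalue, shifted.pvalue, scaled.pvalue)
```

The F statistic is a ratio of mean squares, so a common shift cancels in every deviation from a mean and a common scale factor cancels between numerator and denominator.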

Reference:
- *https://en.wikipedia.org/wiki/Analysis_of_variance*

## One-Way ANOVA
- **Null hypothesis**: all group means are equal

Benchmark: *https://en.wikipedia.org/wiki/One-way_analysis_of_variance*

In [2]:
ar1 = np.array([6,8,4,5,3,4])
ar2 = np.array([8,12,9,11,6,8])
ar3 = np.array([13,9,11,8,7,12])

ss.f_oneway(ar1, ar2, ar3) # low p-value: significant differences between the means

F_onewayResult(statistic=9.264705882352942, pvalue=0.0023987773293929083)

### Formulation (manual calculation)

In [3]:
n_group = 3 # number of groups
n = 6 # number of observations in each group
m1, m2, m3 = ar1.mean(), ar2.mean(), ar3.mean()
m = (m1 + m2 + m3) / 3 # grand mean (equals the mean of the group means because the group sizes are equal)

SB = n*(m1-m)**2 + n*(m2-m)**2 + n*(m3-m)**2 # between-group sum of squared differences
fB = n_group - 1 # between-group degrees of freedom
MSB = SB / fB # between-group mean square value

ar1_center, ar2_center, ar3_center = ar1-m1, ar2-m2, ar3-m3
SW = (ar1_center**2).sum() + (ar2_center**2).sum() + (ar3_center**2).sum() # within-group sum of squares
fW = n_group * (n-1) # within-group degrees of freedom
MSW = SW / fW # within-group mean square value

F = MSB / MSW
p = ss.f.sf(F, fB, fW) # the ANOVA F-test always uses the upper tail of F(fB, fW), even when F < 1

F, p

(9.264705882352942, 0.0023987773293929083)
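
The manual calculation rests on partitioning the total sum of squares into between-group and within-group parts, which is the sample analogue of the Law of Total Variance. A minimal sketch that checks the identity on the same three groups (the variable names here are illustrative, not from the cells above):

```python
import numpy as np

groups = [
    np.array([6, 8, 4, 5, 3, 4]),
    np.array([8, 12, 9, 11, 6, 8]),
    np.array([13, 9, 11, 8, 7, 12]),
]
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# Total variation around the grand mean
ss_total = ((all_obs - grand_mean) ** 2).sum()
# Variation of group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

print(ss_total, ss_between, ss_within)
```

The first printed value should equal the sum of the other two.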

## Two-Way ANOVA
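- **Null hypotheses**: (1) the mean response is equal across the levels of the first factor, (2) the mean response is equal across the levels of the second factor, and (3) there is no interaction between the two factors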

Benchmark: *https://www.statology.org/two-way-anova-python/*

In [4]:
import pandas as pd

# water: watering frequency
# sun: sunlight exposure
# height: height of each plant after two months
df = pd.DataFrame({
    'water': np.repeat(['daily', 'weekly'], 15),
    'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
    'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
               6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
               4, 4, 4, 4, 4, 5, 6, 6, 7, 8],
})

df.shape

(30, 3)

In [5]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Perform two-way ANOVA with an interaction term (typ=2 requests Type II sums of squares)
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(water) | 8.533333 | 1.0 | 16.0 | 0.000527 |
| C(sun) | 24.866667 | 2.0 | 23.3125 | 2e-06 |
| C(water):C(sun) | 2.466667 | 2.0 | 2.3125 | 0.120667 |
| Residual | 12.8 | 24.0 | NaN | NaN |


Interpretation
- water:sun: the large p-value (0.12) gives no evidence of an interaction between watering frequency and sunlight exposure.
- water: the small p-value indicates a significant effect on plant height.
- sun: the small p-value indicates a significant effect on plant height.
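
To see what the interaction term is testing, it can help to look at the cell means. The sketch below rebuilds the same data frame and tabulates mean height per water/sun combination; `cell_means` and `gap` are illustrative names. With no interaction at all, the daily-minus-weekly gap would be the same in every column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'water': np.repeat(['daily', 'weekly'], 15),
    'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
    'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
               6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
               4, 4, 4, 4, 4, 5, 6, 6, 7, 8],
})

# Mean height for each combination of watering frequency and sunlight exposure
cell_means = df.pivot_table(index='water', columns='sun', values='height', aggfunc='mean')
print(cell_means)

# Effect of watering at each sunlight level
gap = cell_means.loc['daily'] - cell_means.loc['weekly']
print(gap)
```

The gaps are not identical across sunlight levels, but the ANOVA table suggests that the differences are within what residual noise could produce.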

# F-test of Variances of Two Samples
* **Null hypothesis**: the two variances are equal
* **Assumption**: both samples are drawn from normally distributed populations

The F-test is extremely sensitive to non-normality. **Levene's test** is a more robust alternative; **Bartlett's test** handles more than two groups but is itself sensitive to departures from normality.
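
A minimal sketch of calling both alternatives through SciPy (`scipy.stats.levene` and `scipy.stats.bartlett`). The uniform samples, the sample size and `center='median'` are illustrative assumptions.

```python
import numpy as np
import scipy.stats as ss

np.random.seed(1)
# Two non-normal samples with clearly different spreads (illustrative data)
x = np.random.uniform(low=0, high=1, size=1000)
y = np.random.uniform(low=0, high=10, size=1000)

# Levene's test centred on the median (the Brown-Forsythe variant)
lev_stat, lev_p = ss.levene(x, y, center='median')
# Bartlett's test, which assumes normality
bart_stat, bart_p = ss.bartlett(x, y)

print(f"Levene: statistic = {lev_stat}, p value = {lev_p}")
print(f"Bartlett: statistic = {bart_stat}, p value = {bart_p}")
```

Both tests take the null hypothesis that the variances are equal, so small p-values point to unequal variances.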

In [4]:
# This function mirrors the behaviour of var.test in R
def f_test(ar1, ar2, alternative="two_sided"):
    df1, df2 = len(ar1) - 1, len(ar2) - 1
    var1, var2 = ar1.var(ddof=1), ar2.var(ddof=1)
    f = var1 / var2
    if alternative == "two_sided":
        # double the smaller of the two tail probabilities
        p = 2 * min(ss.f.cdf(f, df1, df2), ss.f.sf(f, df1, df2))
    elif alternative == "less": # significant if var1 < var2
        p = ss.f.cdf(f, df1, df2)
    elif alternative == "greater": # significant if var1 > var2
        p = ss.f.sf(f, df1, df2)
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return f, p

In [5]:
print("Two-sided: large p-value ==> equal variances")
np.random.seed(1)
n = 10000
ar1 = np.random.uniform(low=0, high=1, size=n)
ar2 = np.random.uniform(low=0, high=1, size=n)
f, p = f_test(ar1, ar2, alternative="two_sided")
print(f"f statistics = {f}, p value = {p}")

Two-sided: large p-value ==> equal variances
f statistics = 0.9894128128015265, p value = 0.5946288001782598


In [6]:
print("Two-sided: small p-value ==> significantly different variances")
np.random.seed(1)
n = 10000
v1 = np.random.uniform(low=0, high=1, size=n)
v2 = np.random.uniform(low=0, high=10, size=n)
f, p = f_test(v1, v2)
print(f"f statistics = {f}, p value = {p}")

Two-sided: small p-value ==> significantly different variances
f statistics = 0.009894128128015265, p value = 0.0


In [7]:
print("One-sided: small p-value ==> var1 significantly less than var2")
f, p = f_test(v1, v2, alternative="less")
print(f"f statistics = {f}, p value = {p}")

One-sided: small p-value ==> var1 significantly less than var2
f statistics = 0.009894128128015265, p value = 0.0


In [8]:
print("One-sided: large p-value ==> no evidence that var1 is greater than var2")
f, p = f_test(v1, v2, alternative="greater")
print(f"f statistics = {f}, p value = {p}")

One-sided: large p-value ==> no evidence that var1 is greater than var2
f statistics = 0.009894128128015265, p value = 0.9999999999999999


# Chi-Square Test of Independence between Two Categorical Variables

* **Null hypothesis**: the two variables are independent

Reference: *https://towardsdatascience.com/chi-square-test-for-independence-in-python-with-examples-from-the-ibm-hr-analytics-dataset-97b9ec9bb80a*

In [9]:
import pandas as pd

np.random.seed(0)
p1 = 0.2
p21, p22 = 0.2, 0.3
q1, q2 = 1-p1, 1-p21-p22
ar1 = np.random.choice([0,1], size=10000, p=[p1,q1])
ar2 = np.random.choice([0,1,2], size=10000, p=[p21,p22,q2])

dfContingency = pd.crosstab(ar1, ar2)
dfContingency

| row_0 / col_0 | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 422 | 627 | 1011 |
| 1 | 1661 | 2355 | 3924 |

In [10]:
# Large p-value ==> independent
arContingency = np.array(dfContingency)
chi2, p, dof, expected = ss.chi2_contingency(arContingency)
print(f"chi2 = {chi2}, pvalue = {p}, dof = {dof}")

chi2 = 0.517964767425645, pvalue = 0.7718366198194897, dof = 2


In [11]:
# Formulation (manual calculation)
print("manual calculation:")
seRowSum = dfContingency.sum(axis=1)
seColSum = dfContingency.sum(axis=0)
itotal = seRowSum.sum()
arExpected = np.array(seRowSum.to_frame().dot(seColSum.to_frame().T) / itotal)
print(f"Expected: {arExpected}")

chi2_manual = ((dfContingency - arExpected)**2 / arExpected).sum().sum()
dof_manual = (dfContingency.shape[0]-1) * (dfContingency.shape[1]-1)
p_manual = ss.chi2.sf(chi2_manual, dof_manual)
print(f"chi2 = {chi2_manual}, pvalue = {p_manual}, dof = {dof_manual}")

manual calculation:
Expected: [[ 429.098  614.292 1016.61 ]
 [1653.902 2367.708 3918.39 ]]
chi2 = 0.517964767425645, pvalue = 0.7718366198194897, dof = 2
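
For contrast with the independent case above, the sketch below constructs two variables that are dependent by design and runs the same test. The conditional probabilities and the names `x`, `y` and `table` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

np.random.seed(0)
n = 10000
x = np.random.choice([0, 1], size=n, p=[0.2, 0.8])
# The distribution of y depends on x, so the two variables are not independent
y_if_x0 = np.random.choice([0, 1, 2], size=n, p=[0.5, 0.3, 0.2])
y_if_x1 = np.random.choice([0, 1, 2], size=n, p=[0.2, 0.3, 0.5])
y = np.where(x == 0, y_if_x0, y_if_x1)

table = pd.crosstab(x, y)
chi2, p, dof, expected = ss.chi2_contingency(table)
print(f"chi2 = {chi2}, pvalue = {p}, dof = {dof}")
```

A small p-value here leads to rejecting the null hypothesis of independence.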
