
# ANOVA
## Distribution of F
We have learned how to calculate the F-ratio, the key statistic used in ANOVA.

We have discussed that F is expected to be close to 1 when the null hypothesis is true. But what does its distribution look like?

Recall that $F = \frac{s^2_{between\ treatments}}{s^2_{within\ treatments}}$

Both $s^2_{between\ treatments}$ and $s^2_{within\ treatments}$ have their own degrees of freedom, $df_{between}$ and $df_{within}$. You might have guessed that the distribution of the F-ratio depends on both of them. That's true!

Let's see how the shape changes as we vary each of the degrees of freedom.

In [None]:
from scipy.stats import f

import matplotlib.pyplot as plt
import numpy as np



# Range of F-ratio values we want to plot
F = np.arange(0, 10, 0.1)

# First, hold df_within fixed and look at the distribution of F as df_between varies
df_within = 20

varying_df_between = [1, 2, 5, 10]

fig, ax = plt.subplots(nrows=1, ncols=4, sharey=True, figsize=[12,2])
# fig refers to the whole figure. ax refers to each subplot
# fig.set_figwidth=8
# fig.set_figheight=2
for idx in range(len(varying_df_between)):
  ax[idx].plot(F, f.pdf(F, varying_df_between[idx], df_within))
  ax[idx].set_xlabel('F ratio')
  if idx == 0:
    ax[idx].set_ylabel('pdf')
  ax[idx].set_title('$df_{between}$='+str(varying_df_between[idx]))



plt.show()

In [None]:
# Range of F-ratio values we want to plot
F = np.arange(0, 10, 0.1)

# Now hold df_between fixed and look at the distribution of F as df_within varies
df_between = 3

varying_df_within = [3, 10, 100, 1000]

fig, ax = plt.subplots(nrows=1, ncols=4, sharey=True, figsize=[12,2])
# fig refers to the whole figure. ax refers to each subplot
# fig.set_figwidth=8
# fig.set_figheight=2
for idx in range(len(varying_df_within)):
  ax[idx].plot(F, f.pdf(F, df_between, varying_df_within[idx]))
  ax[idx].set_xlabel('F ratio')
  if idx == 0:
    ax[idx].set_ylabel('pdf')
  ax[idx].set_title('$df_{within}$='+str(varying_df_within[idx]))



plt.show()
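
The curves above are theoretical densities. As a rough sanity check, the sketch below simulates many experiments in which the null hypothesis is true (every group is drawn from the same normal population), computes the F-ratio for each one, and compares the simulated values with `scipy.stats.f`. The design (3 groups of 8), the random seed, and the number of simulations are arbitrary choices made for illustration, not values from the lecture.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)               # arbitrary seed, for reproducibility
n_groups, n_per_group, n_sims = 3, 8, 5000   # hypothetical design

df_between = n_groups - 1
df_within = n_groups * (n_per_group - 1)

# Every group comes from the same population, so the null hypothesis is true
data = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_groups, n_per_group))

group_means = data.mean(axis=2)        # shape (n_sims, n_groups)
grand_means = data.mean(axis=(1, 2))   # shape (n_sims,)

# Sums of squares for each simulated experiment
ss_between = n_per_group * ((group_means - grand_means[:, None]) ** 2).sum(axis=1)
ss_within = ((data - group_means[:, :, None]) ** 2).sum(axis=(1, 2))

# F = variance between treatments / variance within treatments
sim_F = (ss_between / df_between) / (ss_within / df_within)

critical_F = f.ppf(0.95, df_between, df_within)
sim_mean = sim_F.mean()
theory_mean = f.mean(df_between, df_within)
frac_above = (sim_F > critical_F).mean()

print('mean of simulated F-ratios:', sim_mean)
print('mean of the theoretical F distribution:', theory_mean)
print('fraction of simulated F-ratios above the 95th percentile:', frac_above)
```

The simulated mean should land near the theoretical mean $df_{within}/(df_{within}-2)$, which is slightly above 1, and roughly 5% of the simulated F-ratios should exceed the 95th percentile.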

To find the decision boundary (the critical value of F), we can again use the `ppf` function, which is the inverse of the cdf.


In [None]:
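alpha = 0.01  # significance level
df_between = 2
df_within = 20
# ppf returns the F value that has an area of 1 - alpha to its left
decision_boundary = f.ppf(1 - alpha, df_between, df_within)

print(f'the decision criterion for F using ANOVA with degrees of freedom ({df_between},{df_within}) is ', decision_boundary)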
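
To make the role of this boundary concrete, here is a minimal sketch of the decision rule: reject the null hypothesis when the observed F-ratio is larger than the critical value. The value `observed_F = 4.2` is made up for illustration. The sketch also checks that the area to the right of the critical value, given by the survival function `f.sf`, equals the significance level.

```python
from scipy.stats import f

alpha = 0.01                       # significance level
df_between, df_within = 2, 20

critical_F = f.ppf(1 - alpha, df_between, df_within)

# The area to the right of the critical value should be alpha
tail_area = f.sf(critical_F, df_between, df_within)

observed_F = 4.2                   # hypothetical F-ratio, not from real data
reject_null = observed_F > critical_F

print('critical F:', critical_F)
print('area to the right of critical F:', tail_area)
print('reject the null hypothesis?', reject_null)
```

With these degrees of freedom the critical value is a little below 6, so the hypothetical F-ratio of 4.2 would not be enough to reject the null hypothesis at the 0.01 level.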

## Running ANOVA in one line of code

Of course, in practice, we can directly feed data to a function in Python: `scipy.stats.f_oneway`

In [None]:
from scipy.stats import f_oneway

# Again, we can use the toy data from the previous lecture
X1 = np.asarray([4, 3, 6, 3, 4])
X2 = np.asarray([0, 1, 3, 1, 0])
X3 = np.asarray([1, 2, 2, 0, 0])

# We simply pass all the arrays X1,X2,X3 to the function
result = f_oneway(X1, X2, X3)
print(result)

# Let's double-check that the p-value is consistent with what the cdf function gives us
# We calculated the F-ratio as 11.25 last time, with df = (2, 12)
df_between = 3 - 1
df_within = (5 - 1) * 3
p = 1 - f.cdf(11.25, df_between, df_within)
print('p-value', p)
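
For reference, the F-ratio of 11.25 can be reproduced step by step from the sums of squares. This is a minimal sketch using the same toy data; the variable names (`ss_between`, `ms_within`, and so on) are chosen here for illustration.

```python
import numpy as np
from scipy.stats import f, f_oneway

groups = [np.asarray([4, 3, 6, 3, 4]),
          np.asarray([0, 1, 3, 1, 0]),
          np.asarray([1, 2, 2, 0, 0])]

k = len(groups)                      # number of treatments
N = sum(len(g) for g in groups)      # total number of scores
grand_mean = np.concatenate(groups).mean()

# Variability of the treatment means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability of the scores around their own treatment mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = N - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within

F_manual = ms_between / ms_within
p_manual = f.sf(F_manual, df_between, df_within)   # same as 1 - cdf

result = f_oneway(*groups)
print('manual F:', F_manual, ' f_oneway F:', result.statistic)
print('manual p:', p_manual, ' f_oneway p:', result.pvalue)
```

Here `ss_between` is 30 and `ss_within` is 16, so the F-ratio is $(30/2)/(16/12) = 11.25$.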

## Post-hoc tests
### Tukey's honestly significant difference (HSD) test

In [None]:
from scipy.stats import tukey_hsd
result = tukey_hsd(X1, X2, X3)
print(result)
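
Besides printing the summary table, the result object can be inspected directly. The sketch below assumes a SciPy version that provides `tukey_hsd` (as the cell above already does) and reads the matrix of pairwise p-values from `result.pvalue` and the confidence intervals for the mean differences from `result.confidence_interval()`. The group labels are added here for readability only.

```python
import numpy as np
from scipy.stats import tukey_hsd

X1 = np.asarray([4, 3, 6, 3, 4])
X2 = np.asarray([0, 1, 3, 1, 0])
X3 = np.asarray([1, 2, 2, 0, 0])
labels = ['X1', 'X2', 'X3']          # labels chosen for this printout

result = tukey_hsd(X1, X2, X3)

pvalues = result.pvalue              # pvalues[i, j] compares group i with group j
ci = result.confidence_interval(confidence_level=0.95)

for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f'{labels[i]} vs {labels[j]}: p = {pvalues[i, j]:.4f}, '
              f'95% CI for the mean difference = '
              f'[{ci.low[i, j]:.2f}, {ci.high[i, j]:.2f}]')
```

A pair differs significantly at the 0.05 level when its p-value is below 0.05, or equivalently when its 95% confidence interval does not contain 0.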

### Scheffé test

We need to install another package, `scikit-posthocs`, which is dedicated to post-hoc tests. Its `posthoc_scheffe` function returns the p-values for all pairs of groups using the Scheffé test.

In [None]:
!pip install scikit-posthocs

from scikit_posthocs import posthoc_scheffe

result = posthoc_scheffe([X1, X2, X3])
print(result)
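
To see what the Scheffé test does, the pairwise p-values can also be computed by hand. The sketch below follows the textbook form of the test: for each pair, the squared mean difference is scaled by $MS_{within}$ from the full ANOVA, divided by the full $df_{between} = k - 1$, and the resulting F-ratio is compared with the $F(k-1, N-k)$ distribution. It is meant to illustrate the idea and makes no claim about how `posthoc_scheffe` is implemented internally.

```python
import numpy as np
from itertools import combinations
from scipy.stats import f

groups = {'X1': np.asarray([4, 3, 6, 3, 4]),
          'X2': np.asarray([0, 1, 3, 1, 0]),
          'X3': np.asarray([1, 2, 2, 0, 0])}

k = len(groups)
N = sum(len(g) for g in groups.values())

# MS_within comes from the full ANOVA on all groups
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())
ms_within = ss_within / (N - k)

scheffe_p = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    diff = a.mean() - b.mean()
    # Pairwise F-ratio, using the conservative df_between = k - 1
    F_pair = diff ** 2 / (ms_within * (1 / len(a) + 1 / len(b)) * (k - 1))
    scheffe_p[(name_a, name_b)] = f.sf(F_pair, k - 1, N - k)

for pair, p in scheffe_p.items():
    print(pair, 'p =', p)
```

Because the test keeps $df_{between} = k - 1$ even though only two groups are compared, it is more conservative than running a separate ANOVA on each pair.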

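## Relationship between ANOVA and t-test
When there are only two independent samples, one-way ANOVA and the independent-samples t-test (with equal variances assumed, which is the default of `ttest_ind`) are equivalent: they give the same p-value, and $F = t^2$.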


In [None]:

result_anova = f_oneway(X1, X2)
print(result_anova)

from scipy.stats import ttest_ind
result_ttest = ttest_ind(X1, X2)
print(result_ttest)


# The F-ratio should equal the square of the t statistic
print('F:', result_anova.statistic)
print('t^2:', result_ttest.statistic**2)
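
The equivalence covers the p-values as well as the statistics. A minimal, self-contained check with the same two samples, using `np.isclose` to allow for floating-point rounding:

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

X1 = np.asarray([4, 3, 6, 3, 4])
X2 = np.asarray([0, 1, 3, 1, 0])

result_anova = f_oneway(X1, X2)
result_ttest = ttest_ind(X1, X2)   # equal_var=True by default (pooled variance)

same_statistic = np.isclose(result_anova.statistic, result_ttest.statistic ** 2)
same_pvalue = np.isclose(result_anova.pvalue, result_ttest.pvalue)

print('F equals t^2:', same_statistic)
print('p-values match:', same_pvalue)
```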
