<a href="https://colab.research.google.com/github/lcnature/PSY291/blob/main/PSY291_Ch13_repmeas_ANOVA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Repeated-measures ANOVA

## Step 1: removing mean scores of each participant

Because all scores from the same participant are related, repeated-measures ANOVA first shifts each participant's scores so that their mean is zero, and then subjects the re-centered data to a standard ANOVA.

Re-centering removes the effect of individual differences in average scores.

In [None]:
import numpy as np
import pandas as pd



data = np.asarray([[3,5,8,8],
                   [3,3,5,9],
                   [4,5,8,7],
                   [6,7,9,10],
                   [6,8,8,10],
                   [8,8,10,10]])

print('data:')
print(data)
# Here, data is a 2-dimensional matrix. Each column is one treatment.
# Each row is one participant

participant_mean = np.mean(data, axis=1, keepdims=True)
print('mean score of each participant:')
print(participant_mean)
# Here, with an argument of axis=1, np.mean() calculates the mean along the second
# dimension of the matrix (in Python we count from 0, 0 is the first dimension).
# For a matrix, the vertical direction is axis 0, the horizontal direction is axis 1
# keepdims=True tells np.mean() to keep its output as a 2-dimensional matrix (shape 6 x 1).

recentered_data = data - participant_mean
print('re-centered data')
print(recentered_data)
# Along axis 1, `data` has 4 numbers while `participant_mean` has only 1 number.
# The subtraction "broadcasts" `participant_mean` along axis 1, so the same
# participant mean is subtracted from all 4 numbers in that row of `data`.
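
As a quick sanity check on the re-centering step, the sketch below rebuilds the same `data` matrix and confirms two things: every participant's mean becomes zero, and the treatment means only shift by the grand mean, so the differences between treatments are untouched. The names `row_means_after`, `treatment_means_before`, and `treatment_means_after` are introduced here for illustration only.

```python
import numpy as np

# Same scores as above: rows are participants, columns are treatments.
data = np.array([[3, 5, 8, 8],
                 [3, 3, 5, 9],
                 [4, 5, 8, 7],
                 [6, 7, 9, 10],
                 [6, 8, 8, 10],
                 [8, 8, 10, 10]])

participant_mean = np.mean(data, axis=1, keepdims=True)
recentered_data = data - participant_mean

# Shapes involved in the broadcast: (6, 4) minus (6, 1).
print('shapes:', data.shape, participant_mean.shape)

# After re-centering, each participant's mean should be zero.
row_means_after = recentered_data.mean(axis=1)
print('participant means after re-centering:', row_means_after)

# Treatment means before and after re-centering.
treatment_means_before = data.mean(axis=0)
treatment_means_after = recentered_data.mean(axis=0)
print('treatment means before:', treatment_means_before)
print('treatment means after: ', treatment_means_after)
```

The treatment means after re-centering equal the original treatment means minus the grand mean of `data`, which is why the treatment effect survives this step.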



## Step 2: subject the demeaned (re-centered) data to standard ANOVA


### calculating SS

In [None]:
SS_total_recentered = np.sum((recentered_data - np.mean(recentered_data)) ** 2)
# here, without any extra argument, np.mean() calculates the mean over the entire matrix

SS_error = np.sum((recentered_data - np.mean(recentered_data, axis=0)) ** 2)
# here, with an argument of axis=0, np.mean() calculates the mean along the vertical direction

SS_between = SS_total_recentered - SS_error
# We can also calculate SS_between with the formula we learnt in Chapter 12
n, k = recentered_data.shape
SS_between_alternative = n * np.sum((np.mean(recentered_data, axis=0) - np.mean(recentered_data)) ** 2)


print('SS_total after re-centering:', SS_total_recentered)
print('SS_between_treatments:', SS_between)
print('SS_between_treatments calculated in a different way:', SS_between_alternative)
print('SS_error:', SS_error)
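
To see where the variability removed by re-centering went, the following sketch partitions the total SS of the original (not re-centered) data into three parts. It assumes the same `data` matrix as above; the names `SS_total_original` and `SS_between_participants` are introduced here for illustration and do not appear elsewhere in this chapter.

```python
import numpy as np

data = np.array([[3, 5, 8, 8],
                 [3, 3, 5, 9],
                 [4, 5, 8, 7],
                 [6, 7, 9, 10],
                 [6, 8, 8, 10],
                 [8, 8, 10, 10]])
n, k = data.shape
grand_mean = data.mean()

# Total variability of the original scores around the grand mean.
SS_total_original = np.sum((data - grand_mean) ** 2)

# Variability due to differences between participants' mean scores.
# This is the part that re-centering removes.
SS_between_participants = k * np.sum((data.mean(axis=1) - grand_mean) ** 2)

# The two parts computed from the re-centered data, as in the cell above.
recentered = data - data.mean(axis=1, keepdims=True)
SS_between_treatments = n * np.sum((recentered.mean(axis=0) - recentered.mean()) ** 2)
SS_error = np.sum((recentered - recentered.mean(axis=0)) ** 2)

print('SS_total (original data):', SS_total_original)
print('SS_between_participants:', SS_between_participants)
print('SS_between_treatments:', SS_between_treatments)
print('SS_error:', SS_error)
print('sum of the three parts:',
      SS_between_participants + SS_between_treatments + SS_error)
```

The three parts add up to the total SS of the original data, so re-centering can be read as setting aside the between-participants part before the standard ANOVA is run.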

### calculate degrees of freedom and obtain F-ratio
Next, to calculate the F-ratio, we need the degrees of freedom for the between-treatments variance (the numerator) and the error variance (the denominator).

The original data contain $n \cdot k$ scores, so they start with $df_{original} = n \cdot k$ degrees of freedom.

After removing each participant's mean, the degrees of freedom in each row of data are reduced by 1 (becoming $k-1$). Re-centering the entire dataset therefore reduces the degrees of freedom by $n$: $df_{total_{recentered}} = nk - n$.

Calculating $SS_{error}$ requires calculating one mean for each treatment. Because each row has only $k-1$ degrees of freedom, the $k$ treatment means of the re-centered data must sum to zero, so estimating them removes only $k-1$ degrees of freedom (one row's worth). Therefore, the degrees of freedom for $SS_{error}$ are $df_{error} = df_{total_{recentered}} - (k-1) = nk-n-k+1 = (n-1)(k-1)$.

Finally, the between-treatments SS is computed from the $k$ treatment means and their overall mean, so it has $df_{between} = k-1$ degrees of freedom.

We can also see that $df_{total_{recentered}} = df_{error} + df_{between}$
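
For the dataset in this chapter ($n = 6$ participants, $k = 4$ treatments), the bookkeeping above can be checked with a few lines of arithmetic. This is only a worked instance of the formulas, with the values of `n` and `k` typed in by hand.

```python
# Worked instance of the degrees-of-freedom formulas for n = 6, k = 4.
n, k = 6, 4

df_original = n * k                    # one per score
df_total_recentered = n * k - n        # one lost per participant mean
df_between = k - 1                     # k treatment means, one overall mean
df_error = df_total_recentered - df_between

print('df_original:', df_original)
print('df_total_recentered:', df_total_recentered)
print('df_between:', df_between)
print('df_error:', df_error)

# The error df agrees with the closed form (n - 1)(k - 1).
print('df_error equals (n-1)(k-1):', df_error == (n - 1) * (k - 1))
```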

In [None]:
df_between = data.shape[1] - 1
df_error = (data.shape[0] - 1) * (data.shape[1] - 1)

print('df_between:', df_between)
print('df_error:', df_error)

s2_between = SS_between / df_between
s2_error = SS_error / df_error

F = s2_between / s2_error

print('between-treatment variance:', s2_between)
print('error variance:', s2_error)
print('F-ratio:', F)


### draw conclusion
We can look up the p-value that corresponds to this F statistic, given the two degrees of freedom:

In [None]:
from scipy.stats import f
p_anova = 1 - f.cdf(F, df_between, df_error)
print('p-value', p_anova)

p-value 1.116385231048067e-05


Because the p-value is far below $\alpha = 0.05$, we can reject the null hypothesis that all treatment means are equal.
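
An equivalent way to reach the same decision is to compare the F-ratio with the critical value at $\alpha = 0.05$. The sketch below assumes the values obtained above ($SS_{between} = 60$, $SS_{error} = 14$, with 3 and 15 degrees of freedom) and uses `f.ppf` for the critical value and `f.sf` (the survival function, equal to `1 - f.cdf`) for the p-value. The name `F_critical` is introduced here for illustration.

```python
from scipy.stats import f

# Values carried over from the cells above.
df_between, df_error = 3, 15
F = (60 / df_between) / (14 / df_error)

alpha = 0.05

# Critical value: the F value that leaves a probability of alpha in the upper tail.
F_critical = f.ppf(1 - alpha, df_between, df_error)

# Survival function: upper-tail probability, same quantity as 1 - f.cdf(...).
p_value = f.sf(F, df_between, df_error)

print('F-ratio:', F)
print('critical F at alpha = 0.05:', F_critical)
print('p-value:', p_value)
print('reject the null hypothesis:', F > F_critical)
```

Comparing `F` with `F_critical` and comparing `p_value` with `alpha` always lead to the same decision.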

## Doing it with a pre-built function

In [None]:
from statsmodels.stats.anova import AnovaRM
import pandas as pd
print('data:',data)
flattened_data = np.reshape(data, data.size)
# data.size tells us how many elements the matrix `data` contains.
print('flattened data:', flattened_data)

# now we need to create indices indicating which participant each data point comes from
# and which treatment it is
subj_id = np.repeat(np.arange(data.shape[0]), data.shape[1])
print('participant ID:', subj_id)

treatment_id = np.tile(np.arange(data.shape[1]), data.shape[0])
print('treatment ID:', treatment_id)

dataframe = pd.DataFrame({'score':flattened_data, 'subj_id':subj_id, 'treatment':treatment_id})

anova = AnovaRM(dataframe, depvar='score', subject='subj_id', within=['treatment'])

result = anova.fit()
print(result)
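
Because `AnovaRM` relies on the participant and treatment labels being lined up correctly with the flattened scores, it can be worth checking that the long-format table maps back to the original matrix. The sketch below rebuilds the table the same way as above and pivots it back to a wide layout; the names `long_df` and `wide` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([[3, 5, 8, 8],
                 [3, 3, 5, 9],
                 [4, 5, 8, 7],
                 [6, 7, 9, 10],
                 [6, 8, 8, 10],
                 [8, 8, 10, 10]])
n, k = data.shape

# np.reshape flattens row by row, so each participant's k scores stay together.
# np.repeat gives 0,0,0,0,1,1,1,1,... and np.tile gives 0,1,2,3,0,1,2,3,...
long_df = pd.DataFrame({
    'score': np.reshape(data, data.size),
    'subj_id': np.repeat(np.arange(n), k),
    'treatment': np.tile(np.arange(k), n),
})

# Pivot back to one row per participant and one column per treatment.
wide = long_df.pivot(index='subj_id', columns='treatment', values='score')
print(wide)
print('matches the original matrix:', np.array_equal(wide.to_numpy(), data))
```

If `np.repeat` and `np.tile` were swapped, the pivoted table would no longer match `data`, which is an easy mistake to catch with this kind of check.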


## Practice:

If we pass the re-centered data directly to `f_oneway` from the `scipy.stats` package, will the result be correct?

Try it and think about why.