## ANOVA

Notes

The classic ANOVA is very powerful when the groups are normally distributed and have equal variances. When the groups have unequal variances, however, the Welch ANOVA (pingouin.welch_anova()) is preferable because it better controls the type I error rate (Liu 2015). Homogeneity of variances can be tested with the pingouin.homoscedasticity() function.
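To make the difference from the classic ANOVA concrete, here is a minimal pure-Python sketch of the Welch F statistic. It is not Pingouin's implementation; the helper name `welch_anova_f` and the sample values are made up for illustration. Each group is weighted by n_i / s_i^2, so noisy groups count less.

```python
from statistics import mean, variance


def welch_anova_f(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA on a list of samples."""
    r = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Precision weights: group size divided by the sample variance (n - 1 denominator).
    w = [ni / variance(g) for ni, g in zip(n, groups)]
    w_total = sum(w)
    # Variance-weighted grand mean.
    grand = sum(wi * mi for wi, mi in zip(w, m)) / w_total
    between = sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, m)) / (r - 1)
    lam = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    f_stat = between / (1 + 2 * (r - 2) * lam / (r ** 2 - 1))
    df2 = (r ** 2 - 1) / (3 * lam)  # generally not an integer
    return f_stat, r - 1, df2


# Illustrative samples with clearly different spreads (hypothetical values).
samples = [[10.0, 11.0, 9.5, 10.5], [12.0, 18.0, 7.0, 15.0, 20.0], [9.0, 9.2, 8.8, 9.1]]
f_stat, df1, df2 = welch_anova_f(samples)
print(f"F = {f_stat:.3f}, df1 = {df1}, df2 = {df2:.2f}")
```

With only two groups, this statistic reduces to the square of Welch's t, which is a convenient sanity check.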

The main idea of ANOVA is to partition the variance (sums of squares) into several components. For example, in one-way ANOVA:

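$$SS_{\text{total}} = SS_{\text{effect}} + SS_{\text{error}}$$

$$SS_{\text{total}} = \sum_i \sum_j (Y_{ij} - \overline{Y})^2$$

$$SS_{\text{effect}} = \sum_i n_i (\overline{Y}_i - \overline{Y})^2$$

$$SS_{\text{error}} = \sum_i \sum_j (Y_{ij} - \overline{Y}_i)^2$$

where $i = 1, \ldots, r$; $j = 1, \ldots, n_i$; $r$ is the number of groups, and $n_i$ is the number of observations in the $i$-th group.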

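The F-statistic is then defined as:

$$F^* = \frac{MS_{\text{effect}}}{MS_{\text{error}}} = \frac{SS_{\text{effect}} / (r - 1)}{SS_{\text{error}} / (n_t - r)}$$

where $n_t$ is the total number of observations. The p-value can be calculated using an F-distribution with $r - 1$ and $n_t - r$ degrees of freedom.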
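As a worked example of this partition, the sketch below computes the sums of squares and the F statistic in plain Python. The scores are hard-coded for illustration and are assumed to mirror the `anova` example dataset used in this notebook; in practice, load the data with `pg.read_dataset('anova')`.

```python
# Pain-threshold scores per hair color (assumed to match the `anova` dataset).
groups = {
    "Light Blond": [62, 60, 71, 55, 48],
    "Dark Blond": [63, 57, 52, 41, 43],
    "Light Brunette": [42, 50, 41, 37],
    "Dark Brunette": [32, 39, 51, 30, 35],
}

values = [y for g in groups.values() for y in g]
n_t, r = len(values), len(groups)  # total observations, number of groups
grand_mean = sum(values) / n_t

# Between-group (effect), within-group (error) and total sums of squares.
ss_effect = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
ss_error = sum((y - sum(g) / len(g)) ** 2 for g in groups.values() for y in g)
ss_total = sum((y - grand_mean) ** 2 for y in values)

ms_effect = ss_effect / (r - 1)    # numerator degrees of freedom: r - 1
ms_error = ss_error / (n_t - r)    # denominator degrees of freedom: n_t - r
f_stat = ms_effect / ms_error

print(f"SS_effect = {ss_effect:.3f}, SS_error = {ss_error:.3f}, SS_total = {ss_total:.3f}")
print(f"F({r - 1}, {n_t - r}) = {f_stat:.3f}")
```

With these values, the script reproduces the SS, MS and F columns of the detailed one-way table shown later in this notebook. The p-value is not computed here because it needs the F-distribution survival function, for example `scipy.stats.f.sf(f_stat, r - 1, n_t - r)`.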

When the groups are balanced and have equal variances, the optimal post-hoc test is the Tukey-HSD test (pingouin.pairwise_tukey()). If the groups have unequal variances, the Games-Howell test (pingouin.pairwise_gameshowell()) is more appropriate.

The default effect size reported in Pingouin is the partial eta-squared, which, for one-way ANOVA, is the same as eta-squared and generalized eta-squared.

$$\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}$$
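A quick arithmetic check of this formula, using the sums of squares from the detailed one-way ANOVA table in this notebook (the "Hair color" and "Within" rows). The variable names are illustrative only.

```python
ss_effect = 1360.726  # SS of the "Hair color" row
ss_error = 1001.8     # SS of the "Within" row

# In a one-way design the total is just effect + error, so partial
# eta-squared and eta-squared are the same ratio.
ss_total = ss_effect + ss_error
partial_eta_sq = ss_effect / (ss_effect + ss_error)
eta_sq = ss_effect / ss_total

print(round(partial_eta_sq, 3))  # 0.576
```

In designs with more than one factor, the total also contains the other effects, so the two measures generally differ.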

Missing values are automatically removed. Results have been tested against R, Matlab and JASP.

## Flowchart
https://pingouin-stats.org/guidelines.html#id5

## Example

In [2]:
!pip install pingouin
import pingouin as pg

# Load an example dataset comparing pain threshold as a function of hair color
df = pg.read_dataset('anova')

# 1. This is a between-subjects design, so the first step is to test for equality of variances
pg.homoscedasticity(data=df, dv='Pain threshold', group='Hair color')

# 2. If the groups have equal variances, we can use a regular one-way ANOVA
pg.anova(data=df, dv='Pain threshold', between='Hair color')

# 3. If there is a main effect, we can proceed to post-hoc Tukey test
pg.pairwise_tukey(data=df, dv='Pain threshold', between='Hair color')

| | A | B | mean(A) | mean(B) | diff | se | T | p-tukey | hedges |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Dark Blond | Dark Brunette | 51.2 | 37.4 | 13.8 | 5.168623 | 2.669957 | 0.074068 | 1.413596 |
| 1 | Dark Blond | Light Blond | 51.2 | 59.2 | -8.0 | 5.168623 | -1.547801 | 0.435577 | -0.810661 |
| 2 | Dark Blond | Light Brunette | 51.2 | 42.5 | 8.7 | 5.482153 | 1.586968 | 0.414728 | 0.982361 |
| 3 | Dark Brunette | Light Blond | 37.4 | 59.2 | -21.8 | 5.168623 | -4.217758 | 0.003708 | -2.336811 |
| 4 | Dark Brunette | Light Brunette | 37.4 | 42.5 | -5.1 | 5.482153 | -0.930291 | 0.789321 | -0.626769 |
| 5 | Light Blond | Light Brunette | 59.2 | 42.5 | 16.7 | 5.482153 | 3.046249 | 0.036647 | 2.01528 |
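The diff, se and T columns of this output can be reproduced by hand from the group means and the within-group mean square. The sketch below assumes group sizes of 5, 5, 4 and 5 (Light Brunette being the group of 4) and takes MS error from the one-way ANOVA table (SS = 1001.8, DF = 15). The helper `tukey_row` is hypothetical, not a Pingouin function.

```python
from math import sqrt

ms_error = 1001.8 / 15  # within-group mean square (SS / DF)


def tukey_row(mean_a, n_a, mean_b, n_b, ms_error):
    """Return (diff, se, T) for one pairwise comparison."""
    diff = mean_a - mean_b
    # Pooled standard error; unequal group sizes are handled by 1/n_a + 1/n_b.
    se = sqrt(ms_error * (1 / n_a + 1 / n_b))
    return diff, se, diff / se


# Dark Blond vs Dark Brunette (both groups assumed to have 5 observations).
diff, se, t_stat = tukey_row(51.2, 5, 37.4, 5, ms_error)
print(f"diff = {diff:.1f}, se = {se:.6f}, T = {t_stat:.6f}")

# Dark Blond (n = 5) vs Light Brunette (n = 4): the smaller group inflates se.
diff2, se2, t_stat2 = tukey_row(51.2, 5, 42.5, 4, ms_error)
print(f"diff = {diff2:.1f}, se = {se2:.6f}, T = {t_stat2:.6f}")
```

The p-tukey column is not reproduced here because it requires the studentized range distribution.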


## Example: One Way

In [1]:
import pingouin as pg
df = pg.read_dataset('anova')
aov = pg.anova(dv='Pain threshold', between='Hair color', data=df,
               detailed=True)
aov.round(3)

| | Source | SS | DF | MS | F | p-unc | np2 |
|---|---|---|---|---|---|---|---|
| 0 | Hair color | 1360.726 | 3 | 453.575 | 6.791 | 0.004 | 0.576 |
| 1 | Within | 1001.8 | 15 | 66.787 | | | |


The same analysis, but reporting a standard eta-squared instead of the partial eta-squared effect size. Note that here the anova function is called directly as a method of the pandas DataFrame, so the data argument no longer needs to be specified.



In [2]:
df.anova(dv='Pain threshold', between='Hair color', detailed=False,
         effsize='n2')

| | Source | ddof1 | ddof2 | F | p-unc | n2 |
|---|---|---|---|---|---|---|
| 0 | Hair color | 3 | 15 | 6.791407 | 0.004114 | 0.575962 |


Two-way ANOVA with balanced design



In [3]:
data = pg.read_dataset('anova2')
data.anova(dv="Yield", between=["Blend", "Crop"]).round(3)

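| | Source | SS | DF | MS | F | p-unc | np2 |
|---|---|---|---|---|---|---|---|
| 0 | Blend | 2.042 | 1 | 2.042 | 0.004 | 0.952 | 0.0 |
| 1 | Crop | 2736.583 | 2 | 1368.292 | 2.525 | 0.108 | 0.219 |
| 2 | Blend * Crop | 2360.083 | 2 | 1180.042 | 2.178 | 0.142 | 0.195 |
| 3 | Residual | 9753.25 | 18 | 541.847 | | | |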


Two-way ANOVA with unbalanced design (requires statsmodels)



In [4]:
data = pg.read_dataset('anova2_unbalanced')
data.anova(dv="Scores", between=["Diet", "Exercise"],
           effsize="n2").round(3)

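| | Source | SS | DF | MS | F | p-unc | n2 |
|---|---|---|---|---|---|---|---|
| 0 | Diet | 390.625 | 1.0 | 390.625 | 7.423 | 0.034 | 0.433 |
| 1 | Exercise | 180.625 | 1.0 | 180.625 | 3.432 | 0.113 | 0.2 |
| 2 | Diet * Exercise | 15.625 | 1.0 | 15.625 | 0.297 | 0.605 | 0.017 |
| 3 | Residual | 315.75 | 6.0 | 52.625 | | | |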


Three-way ANOVA, type 3 sums of squares (requires statsmodels)



In [5]:
data = pg.read_dataset('anova3')
data.anova(dv='Cholesterol', between=['Sex', 'Risk', 'Drug'],
           ss_type=3).round(3)

| | Source | SS | DF | MS | F | p-unc | np2 |
|---|---|---|---|---|---|---|---|
| 0 | Sex | 2.075 | 1.0 | 2.075 | 2.462 | 0.123 | 0.049 |
| 1 | Risk | 11.332 | 1.0 | 11.332 | 13.449 | 0.001 | 0.219 |
| 2 | Drug | 0.816 | 2.0 | 0.408 | 0.484 | 0.619 | 0.02 |
| 3 | Sex * Risk | 0.117 | 1.0 | 0.117 | 0.139 | 0.711 | 0.003 |
| 4 | Sex * Drug | 2.564 | 2.0 | 1.282 | 1.522 | 0.229 | 0.06 |
| 5 | Risk * Drug | 2.438 | 2.0 | 1.219 | 1.446 | 0.245 | 0.057 |
| 6 | Sex * Risk * Drug | 1.844 | 2.0 | 0.922 | 1.094 | 0.343 | 0.044 |
| 7 | Residual | 40.445 | 48.0 | 0.843 | | | |


## Non-Parametric Guideline
https://pingouin-stats.org/guidelines.html#id7

In [7]:
import pingouin as pg

# Load an example dataset comparing pain threshold as a function of hair color
df = pg.read_dataset('anova')

# There are 4 independent groups in our dataset, we'll therefore use the Kruskal-Wallis test:
pg.kruskal(data=df, dv='Pain threshold', between='Hair color')

| | Source | ddof1 | H | p-unc |
|---|---|---|---|---|
| Kruskal | Hair color | 3 | 10.58863 | 0.014172 |
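To show what the H statistic measures, here is a minimal pure-Python sketch of the tie-corrected Kruskal-Wallis H. It is not Pingouin's implementation; the helper name `kruskal_h` is made up, and the scores are hard-coded under the assumption that they mirror the `anova` dataset.

```python
from collections import Counter


def kruskal_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of samples."""
    pooled = [y for g in groups for y in g]
    n_total = len(pooled)
    # Assign each distinct value the average of the ranks it occupies.
    counts = Counter(pooled)
    rank_of, start = {}, 1
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = start + (t - 1) / 2
        start += t
    # Squared rank sum of each group, scaled by its size.
    rank_term = sum(sum(rank_of[y] for y in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    # Correction factor for tied values (equals 1 when there are no ties).
    tie_correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n_total ** 3 - n_total)
    return h / tie_correction


# Pain-threshold scores per hair color (assumed to match the `anova` dataset).
samples = [
    [62, 60, 71, 55, 48],  # Light Blond
    [63, 57, 52, 41, 43],  # Dark Blond
    [42, 50, 41, 37],      # Light Brunette
    [32, 39, 51, 30, 35],  # Dark Brunette
]
h_stat = kruskal_h(samples)
print(f"H = {h_stat:.5f}, ddof1 = {len(samples) - 1}")
```

Because the test works on ranks rather than raw scores, it does not assume normally distributed groups. The p-value comes from a chi-squared distribution with ddof1 degrees of freedom and is not computed here.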
