## ANOVA

Notes

The classic ANOVA is very powerful when the groups are normally distributed and have equal variances. However, when the groups have unequal variances, it is best to use the Welch ANOVA (pingouin.welch_anova()), which better controls the type I error rate (Liu 2015). The homogeneity of variances can be assessed with the pingouin.homoscedasticity() function.

The main idea of ANOVA is to partition the variance (sums of squares) into several components. For example, in one-way ANOVA:

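$$SS_\text{total} = SS_\text{effect} + SS_\text{error}$$

$$SS_\text{total} = \sum_i \sum_j (Y_{ij} - \overline{Y})^2$$

$$SS_\text{effect} = \sum_i n_i (\overline{Y}_i - \overline{Y})^2$$

$$SS_\text{error} = \sum_i \sum_j (Y_{ij} - \overline{Y}_i)^2$$

where $i = 1, ..., r$; $j = 1, ..., n_i$; $r$ is the number of groups, and $n_i$ is the number of observations for the $i$-th group.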
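To make the partition concrete, here is a minimal pure-Python sketch that computes the three sums of squares for a small made-up dataset (the three groups below are illustrative values, not taken from any Pingouin dataset) and checks that they add up.

```python
# Toy data: three groups with three observations each (made-up values)
groups = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],
    "c": [5.0, 7.0, 9.0],
}

all_values = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_values) / len(all_values)
group_means = {k: sum(ys) / len(ys) for k, ys in groups.items()}

# SS_total: squared deviations of every observation from the grand mean
ss_total = sum((y - grand_mean) ** 2 for y in all_values)

# SS_effect: squared deviations of group means from the grand mean,
# weighted by group size n_i
ss_effect = sum(
    len(ys) * (group_means[k] - grand_mean) ** 2 for k, ys in groups.items()
)

# SS_error: squared deviations of observations from their own group mean
ss_error = sum(
    (y - group_means[k]) ** 2 for k, ys in groups.items() for y in ys
)

print(ss_total, ss_effect, ss_error)
```

For these values the between-group and within-group components sum to the total, up to floating-point rounding.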

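The F-statistic is then defined as:

$$F^* = \frac{MS_\text{effect}}{MS_\text{error}} = \frac{SS_\text{effect} / (r - 1)}{SS_\text{error} / (n_t - r)}$$

and the p-value can be calculated using an F-distribution with $r - 1$ and $n_t - r$ degrees of freedom, where $n_t$ is the total number of observations.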
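The F-test can be sketched by hand with NumPy and SciPy. The example below uses made-up toy groups, takes the upper tail of an F-distribution with r - 1 and n_t - r degrees of freedom via scipy.stats.f.sf, and cross-checks the result against scipy.stats.f_oneway. This is only an illustration of the formula, not how Pingouin computes it internally.

```python
import numpy as np
from scipy import stats

# Toy data: three groups (made-up values)
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 4.0, 6.0]),
    np.array([5.0, 7.0, 9.0]),
]

r = len(groups)                      # number of groups
n_t = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

ss_effect = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares are sums of squares divided by their degrees of freedom
ms_effect = ss_effect / (r - 1)
ms_error = ss_error / (n_t - r)
f_stat = ms_effect / ms_error

# p-value: upper tail of the F-distribution with (r - 1, n_t - r) df
p_value = stats.f.sf(f_stat, r - 1, n_t - r)

# Cross-check against SciPy's one-way ANOVA
f_ref, p_ref = stats.f_oneway(*groups)
print(f_stat, p_value)
```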

When the groups are balanced and have equal variances, the optimal post-hoc test is the Tukey-HSD test (pingouin.pairwise_tukey()). If the groups have unequal variances, the Games-Howell test is more appropriate (pingouin.pairwise_gameshowell()).

The default effect size reported in Pingouin is the partial eta-squared, which, for one-way ANOVA, is the same as eta-squared and generalized eta-squared.

$$\eta_p^2 = \frac{SS_\text{effect}}{SS_\text{effect} + SS_\text{error}}$$

Missing values are automatically removed. Results have been tested against R, Matlab and JASP.

## Flowchart
https://pingouin-stats.org/guidelines.html#id5

## Example

In [None]:
import pingouin as pg

# Load an example dataset comparing pain threshold as a function of hair color
df = pg.read_dataset('anova')

# 1. This is a between-subjects design, so the first step is to test for equality of variances
pg.homoscedasticity(data=df, dv='Pain threshold', group='Hair color')

# 2. If the groups have equal variances, we can use a regular one-way ANOVA
pg.anova(data=df, dv='Pain threshold', between='Hair color')

# 3. If there is a main effect, we can proceed to post-hoc Tukey test
pg.pairwise_tukey(data=df, dv='Pain threshold', between='Hair color')

## Example: One Way

In [None]:
import pingouin as pg
df = pg.read_dataset('anova')
aov = pg.anova(dv='Pain threshold', between='Hair color', data=df,
               detailed=True)
aov.round(3)

Same analysis, but reporting a standard eta-squared instead of a partial eta-squared effect size. Also note that here we call anova directly as a method of our pandas DataFrame. In that case, we no longer have to specify data.



In [None]:
df.anova(dv='Pain threshold', between='Hair color', detailed=False,
         effsize='n2')

Two-way ANOVA with balanced design



In [None]:
data = pg.read_dataset('anova2')
data.anova(dv="Yield", between=["Blend", "Crop"]).round(3)

Two-way ANOVA with unbalanced design (requires statsmodels)



In [None]:
data = pg.read_dataset('anova2_unbalanced')
data.anova(dv="Scores", between=["Diet", "Exercise"],
           effsize="n2").round(3)

Three-way ANOVA, type 3 sums of squares (requires statsmodels)



In [None]:
data = pg.read_dataset('anova3')
data.anova(dv='Cholesterol', between=['Sex', 'Risk', 'Drug'],
           ss_type=3).round(3)

## Non-Parametric Guideline
https://pingouin-stats.org/guidelines.html#id7

In [None]:
import pingouin as pg

# Load an example dataset comparing pain threshold as a function of hair color
df = pg.read_dataset('anova')

# There are 4 independent groups in our dataset, so we use the Kruskal-Wallis test:
pg.kruskal(data=df, dv='Pain threshold', between='Hair color')