# ANOVA
ANOVA, short for analysis of variance, is a statistical test.

ANOVA was developed by statistician and evolutionary biologist Ronald Fisher. The idea behind ANOVA is to compare different groups of samples to determine whether there is a significant difference between the groups.

ANOVA extends the t-test and the z-test, which compare two groups, to comparisons of more than two groups.

The null hypothesis of ANOVA is that all group means are equal. The alternative hypothesis is that at least one group mean differs from the others.
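
Written out for a one-way design, the hypotheses and the test statistic take the standard textbook form below. The symbols are notation introduced here for illustration: $k$ is the number of groups, $N$ the total number of observations, $n_i$ and $\bar{x}_i$ the size and mean of group $i$, and $\bar{x}$ the grand mean.

```latex
H_0:\ \mu_1 = \mu_2 = \dots = \mu_k
\qquad
H_1:\ \mu_i \neq \mu_j \ \text{for at least one pair } (i, j)

F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)},
\qquad
SS_{\text{between}} = \sum_{i=1}^{k} n_i \,(\bar{x}_i - \bar{x})^2,
\qquad
SS_{\text{within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
```

A large $F$ means the variation between group means is large relative to the variation within groups, which is evidence against $H_0$.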

ANOVA is an omnibus test, meaning it tests the data as a whole. In other words, it does not tell you which specific groups were significantly different from each other; it only tells you that at least two groups were different.

## Types of ANOVA

There are three main types of ANOVA:
1. One-way ANOVA
2. Two-way ANOVA
3. N-way ANOVA

### One-way ANOVA

One-way ANOVA is used to compare the means of two or more groups defined by one categorical independent variable (factor). The dependent variable is continuous.

For example, you could use a one-way ANOVA to compare the height of people living in different cities.

### Two-way ANOVA

Two-way ANOVA is used to compare group means across two categorical independent variables, and it can also test whether the two variables interact.

For example, you could use a two-way ANOVA to compare the height of people grouped by both city and gender.

### N-way ANOVA

N-way ANOVA is used to compare group means across N categorical independent variables.


## Assumptions of ANOVA

ANOVA has three main assumptions:

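1. The observations are independent of each other.
2. The dependent variable is approximately normally distributed within each group (equivalently, the residuals are normally distributed).
3. The variance of each group is equal (homogeneity of variance).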

If these assumptions are not met, you may not be able to trust the results of your ANOVA.
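
As a rough sketch of how the normality and equal-variance assumptions might be checked, the snippet below applies the Shapiro-Wilk test (`scipy.stats.shapiro`) to each group and Levene's test (`scipy.stats.levene`) across groups, using the small fertilizer data set from the examples in this notebook. The choice of these two tests and the dictionary name `groups` are assumptions made for illustration. Independence cannot be tested this way; it depends on how the data were collected.

```python
from scipy import stats

# Fertilizer growth data reused from the examples in this notebook
groups = {
    "fertilizer1": [20, 22, 19, 24, 25],
    "fertilizer2": [28, 30, 27, 26, 29],
    "fertilizer3": [18, 20, 22, 19, 24],
}

# Normality within each group (H0: the group comes from a normal distribution)
shapiro_p = {}
for name, values in groups.items():
    w_stat, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p:.3f}")

# Equal variances across groups (H0: all groups have the same variance)
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene statistic = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may be violated. With only five observations per group, both tests have little power, so they are best read alongside plots of the data.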

## 1. One-Way ANOVA

In [2]:
import scipy.stats as stats

In [7]:

# Sample data: Growth of plants with three types of fertilizers
fertilizer1 = [20, 22, 19, 24, 25]
fertilizer2 = [28, 30, 27, 26, 29]
fertilizer3 = [18, 20, 22, 19, 24]

# Perform the one-way ANOVA
f_stat, p_val = stats.f_oneway(fertilizer1, fertilizer2, fertilizer3)

print("F-statistic:", f_stat)
print("p-value:", p_val)

# print the result based on whether the p-value is less than 0.05

if p_val < 0.05:
    print(f"Reject null hypothesis: The means are not equal, as the p-value: {p_val} is less than 0.05")
else:
    print(f"Fail to reject null hypothesis: no significant difference between the means, as the p-value: {p_val} is not less than 0.05")



F-statistic: 15.662162162162158
p-value: 0.0004515404760997282
Reject null hypothesis: The means are not equal, as the p-value: 0.0004515404760997282 is less than 0.05
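
To make the F-statistic less of a black box, the sketch below recomputes it by hand from the between-group and within-group sums of squares, using plain Python and the same three fertilizer samples. The variable names are illustrative only.

```python
# Same samples as in the fertilizer example
groups = [
    [20, 22, 19, 24, 25],  # fertilizer1
    [28, 30, 27, 26, 29],  # fertilizer2
    [18, 20, 22, 19, 24],  # fertilizer3
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = sum(sum(g) for g in groups) / n_total
group_means = [sum(g) / len(g) for g in groups]

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

# F is the ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print(f"SS between = {ss_between:.3f}, SS within = {ss_within:.3f}, F = {f_manual:.3f}")
# SS between = 154.533, SS within = 59.200, F = 15.662
```

The result agrees with the F-statistic reported by `stats.f_oneway`.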


## Using Statsmodels

##### Performing one-way ANOVA

In [8]:
!pip install statsmodels --quiet

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols 

In [10]:
# create a dataframe
df=pd.DataFrame({"fertilizer":["fertilizer1"]*5 + ["fertilizer2"]*5 + ["fertilizer3"]*5 ,
                 "growth": fertilizer1 + fertilizer2 + fertilizer3})
df.head(15)

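| | fertilizer | growth |
|---|---|---|
| 0 | fertilizer1 | 20 |
| 1 | fertilizer1 | 22 |
| 2 | fertilizer1 | 19 |
| 3 | fertilizer1 | 24 |
| 4 | fertilizer1 | 25 |
| 5 | fertilizer2 | 28 |
| 6 | fertilizer2 | 30 |
| 7 | fertilizer2 | 27 |
| 8 | fertilizer2 | 26 |
| 9 | fertilizer2 | 29 |
| 10 | fertilizer3 | 18 |
| 11 | fertilizer3 | 20 |
| 12 | fertilizer3 | 22 |
| 13 | fertilizer3 | 19 |
| 14 | fertilizer3 | 24 |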


In [11]:
# fit the model
model=ols("growth ~ fertilizer",data=df).fit()

In [12]:
# perform anova
anova_table=sm.stats.anova_lm(model,typ=2)
print(anova_table)

                sum_sq    df          F    PR(>F)
fertilizer  154.533333   2.0  15.662162  0.000452
Residual     59.200000  12.0        NaN       NaN


In [17]:
# print the result based on whether the p-value is less than 0.05
# look up the p-value by row label; positional [0] on a label-indexed Series is deprecated in pandas
p_val = anova_table.loc["fertilizer", "PR(>F)"]
if p_val < 0.05:
    print(f"Reject null hypothesis: The means are not equal, as the p-value: {p_val:.3g} is less than 0.05")
else:
    print(f"Fail to reject null hypothesis: no significant difference between the means, as the p-value: {p_val:.3g} is not less than 0.05")

Reject null hypothesis: The means are not equal, as the p-value: 0.000452 is less than 0.05




## 2. Two-Way ANOVA

In [18]:
df=pd.DataFrame({
    "growth":    [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
                 21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25],
   "fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1", 
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"],
    "sunlight":  ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low", 
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low"]
})
df.head()

| | growth | fertilizer | sunlight |
|---|---|---|---|
| 0 | 20 | F1 | High |
| 1 | 22 | F1 | High |
| 2 | 19 | F1 | High |
| 3 | 24 | F1 | High |
| 4 | 25 | F1 | High |


In [23]:
# perform two-way anova
model=ols("growth ~ C(fertilizer) + C(sunlight) + C(fertilizer):C(sunlight)",data=df).fit()
anova_table=sm.stats.anova_lm(model,typ=2)
print(anova_table)
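# report each main effect and the interaction separately, using its own p-value
for effect, p in anova_table["PR(>F)"].drop("Residual").items():
    if p < 0.05:
        print(f"{effect}: reject null hypothesis, as p-value: {p:.3g} is less than 0.05")
    else:
        print(f"{effect}: fail to reject null hypothesis, as p-value: {p:.3g} is not less than 0.05")

                                 sum_sq    df             F        PR(>F)
C(fertilizer)              3.090667e+02   2.0  3.132432e+01  2.038888e-07
C(sunlight)                7.500000e+00   1.0  1.520270e+00  2.295198e-01
C(fertilizer):C(sunlight)  6.441240e-28   2.0  6.528284e-29  1.000000e+00
Residual                   1.184000e+02  24.0           NaN           NaN
C(fertilizer): reject null hypothesis, as p-value: 2.04e-07 is less than 0.05
C(sunlight): fail to reject null hypothesis, as p-value: 0.23 is not less than 0.05
C(fertilizer):C(sunlight): fail to reject null hypothesis, as p-value: 1 is not less than 0.05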
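
The near-zero interaction sum of squares in the table above can be understood by looking at the cell means. The sketch below rebuilds the same 30 observations in a compact form (the name `df_two_way` and the `Low - High` column are illustrative) and compares the Low and High sunlight means for each fertilizer.

```python
import pandas as pd

# Same 30 observations as the two-way example, built compactly
growth = [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
          21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25]
fertilizer = (["F1"] * 5 + ["F2"] * 5 + ["F3"] * 5) * 2
sunlight = ["High"] * 15 + ["Low"] * 15
df_two_way = pd.DataFrame({"growth": growth, "fertilizer": fertilizer, "sunlight": sunlight})

# Mean growth for every fertilizer x sunlight combination
cell_means = df_two_way.pivot_table(values="growth", index="fertilizer",
                                    columns="sunlight", aggfunc="mean")

# If the sunlight effect is the same for every fertilizer, there is no interaction
cell_means["Low - High"] = cell_means["Low"] - cell_means["High"]
print(cell_means)
```

In this sample the Low minus High difference is the same for all three fertilizers, which is what an absence of interaction looks like: the effect of sunlight does not depend on the fertilizer.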




## N-Way ANOVA
N-way ANOVA, a form of factorial ANOVA, is used when you have more than two independent variables. It allows you to analyze the effect of each factor on the dependent variable as well as the interaction effects between factors.

## Example: Three-Way ANOVA
Suppose we have an experimental data set with three factors:

1. Fertilizer Type (3 levels: F1, F2, F3)
2. Sunlight Exposure (2 levels: High, Low)
3. Watering Frequency (2 levels: Regular, Sparse)

We want to study the impact of these factors and their interactions on plant growth.

In [27]:
df=pd.DataFrame({
     "Growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
               21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25,
               20, 22, 21, 23, 24, 26, 28, 25, 27, 29, 17, 19, 21, 18, 20],
    "Fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1", 
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3",
                   "F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3"],
    "Sunlight": ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low", 
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low",
                 "High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High"],
    "Watering": ["Regular", "Regular", "Regular", "Regular", "Regular", 
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse", 
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Regular", "Regular", "Regular", "Regular", "Regular", 
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular"]
})
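# fit the model with all main effects, two-way interactions, and the three-way interaction
# note: in this sample, Sunlight and Watering are perfectly confounded (High is always Regular,
# Low is always Sparse), so their separate effects cannot be estimated reliably
model=ols("Growth ~ C(Fertilizer) + C(Sunlight) + C(Watering) + C(Fertilizer):C(Sunlight) + C(Fertilizer):C(Watering) + C(Sunlight):C(Watering) + C(Fertilizer):C(Sunlight):C(Watering)",data=df).fit()
# perform three-way ANOVA
anova_table=sm.stats.anova_lm(model,typ=2)
print(anova_table)

# report each effect separately, using its own p-value
for effect, p in anova_table["PR(>F)"].drop("Residual").items():
    if p < 0.05:
        print(f"{effect}: reject null hypothesis, as p-value: {p:.3g} is less than 0.05")
    else:
        print(f"{effect}: fail to reject null hypothesis, as p-value: {p:.3g} is not less than 0.05")

                                             sum_sq  ...        PR(>F)
C(Fertilizer)                          4.680444e+02  ...  2.050614e-12
C(Sunlight)                            6.024959e-17  ...  1.000000e+00
C(Watering)                            2.864840e-12  ...  9.999993e-01
C(Fertilizer):C(Sunlight)              2.493785e-13  ...  1.000000e+00
C(Fertilizer):C(Watering)              5.461610e-13  ...  1.000000e+00
C(Sunlight):C(Watering)                2.405455e+01  ...  9.211091e-02
C(Fertilizer):C(Sunlight):C(Watering)  2.163333e+01  ...  1.654275e-01
Residual                               1.573000e+02  ...           NaN

[8 rows x 4 columns]
C(Fertilizer): reject null hypothesis, as p-value: 2.05e-12 is less than 0.05
C(Sunlight): fail to reject null hypothesis, as p-value: 1 is not less than 0.05
C(Watering): fail to reject null hypothesis, as p-value: 1 is not less than 0.05
C(Fertilizer):C(Sunlight): fail to reject null hypothesis, as p-value: 1 is not less than 0.05
C(Fertilizer):C(Watering): fail to reject null hypothesis, as p-value: 1 is not less than 0.05
C(Sunlight):C(Watering): fail to reject null hypothesis, as p-value: 0.0921 is not less than 0.05
C(Fertilizer):C(Sunlight):C(Watering): fail to reject null hypothesis, as p-value: 0.165 is not less than 0.05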
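
Before interpreting a factorial ANOVA, it can help to check that every combination of factor levels actually occurs in the data. The sketch below cross-tabulates the Sunlight and Watering columns of the sample data above, rebuilt compactly under the illustrative name `design`.

```python
import pandas as pd

# Sunlight and Watering columns of the three-way sample data, built compactly
sunlight = ["High"] * 15 + ["Low"] * 15 + ["High"] * 15
watering = ["Regular"] * 15 + ["Sparse"] * 15 + ["Regular"] * 15
design = pd.DataFrame({"Sunlight": sunlight, "Watering": watering})

# Count observations for each Sunlight x Watering combination
counts = pd.crosstab(design["Sunlight"], design["Watering"])
print(counts)
```

In this sample the High/Sparse and Low/Regular cells are empty, so the two factors vary together and their individual effects cannot be separated. That is a likely reason for the near-zero sums of squares and p-values of 1 in the table. A balanced design, with observations in every cell, avoids this problem.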




## Post-hoc Tests for One-Way ANOVA

In [3]:

# Sample data: Growth of plants with three types of fertilizers
fertilizer1 = [20, 22, 19, 24, 25]
fertilizer2 = [28, 30, 27, 26, 29]
fertilizer3 = [18, 20, 22, 19, 24]

from statsmodels.stats.multicomp import pairwise_tukeyhsd
import numpy as np
data={
    "Growth" : np.concatenate([fertilizer1,fertilizer2,fertilizer3]),
    "fertilizer":["F1"]*len(fertilizer1)+["F2"]*len(fertilizer2)+["F3"]*len(fertilizer3)
}
df=pd.DataFrame(data)
tukey=pairwise_tukeyhsd(endog=df["Growth"],groups=df["fertilizer"],alpha=0.05)
print(tukey)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
    F1     F2      6.0 0.0029   2.2523  9.7477   True
    F1     F3     -1.4 0.5928  -5.1477  2.3477  False
    F2     F3     -7.4 0.0005 -11.1477 -3.6523   True
-----------------------------------------------------
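
As a cross-check, SciPy also offers a Tukey HSD implementation, `scipy.stats.tukey_hsd`. The sketch below assumes SciPy 1.8 or newer, where this function is available, and takes the three samples directly rather than a DataFrame.

```python
from scipy import stats

fertilizer1 = [20, 22, 19, 24, 25]
fertilizer2 = [28, 30, 27, 26, 29]
fertilizer3 = [18, 20, 22, 19, 24]

# Pairwise Tukey HSD comparisons; groups are indexed 0, 1, 2 in argument order
res = stats.tukey_hsd(fertilizer1, fertilizer2, fertilizer3)
print(res)

# res.pvalue[i, j] is the adjusted p-value for comparing group i with group j
for i, j in [(0, 1), (0, 2), (1, 2)]:
    significant = res.pvalue[i, j] < 0.05
    print(f"group {i} vs group {j}: p-adj = {res.pvalue[i, j]:.4f}, significant = {significant}")
```

The pairwise decisions should match the statsmodels table: fertilizer2 differs from both fertilizer1 and fertilizer3, while fertilizer1 and fertilizer3 do not differ significantly. The sign of the reported mean difference may differ between the two libraries, depending on which group is subtracted from which.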


## Post-hoc Tests for Two-Way ANOVA

In [5]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = pd.DataFrame({
    "Growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
               21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25],
    "Fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1", 
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"],
    "Sunlight": ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low", 
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low"]
})

tukey=pairwise_tukeyhsd(data["Growth"],data["Fertilizer"]+data["Sunlight"],alpha=0.05)
print(tukey)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
F1High  F1Low      1.0 0.9786  -3.3434  5.3434  False
F1High F2High      6.0 0.0032   1.6566 10.3434   True
F1High  F2Low      7.0 0.0006   2.6566 11.3434   True
F1High F3High     -1.4 0.9145  -5.7434  2.9434  False
F1High  F3Low     -0.4 0.9997  -4.7434  3.9434  False
 F1Low F2High      5.0 0.0176   0.6566  9.3434   True
 F1Low  F2Low      6.0 0.0032   1.6566 10.3434   True
 F1Low F3High     -2.4 0.5396  -6.7434  1.9434  False
 F1Low  F3Low     -1.4 0.9145  -5.7434  2.9434  False
F2High  F2Low      1.0 0.9786  -3.3434  5.3434  False
F2High F3High     -7.4 0.0003 -11.7434 -3.0566   True
F2High  F3Low     -6.4 0.0016 -10.7434 -2.0566   True
 F2Low F3High     -8.4    0.0 -12.7434 -4.0566   True
 F2Low  F3Low     -7.4 0.0003 -11.7434 -3.0566   True
F3High  F3Low      1.0 0.9786  -3.3434  5.3434  False
-----------------------------------------------------
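
The table has 15 rows because concatenating the Fertilizer and Sunlight labels produces 6 combined groups, and Tukey's test compares every pair of them. A short sketch of that count, using only the standard library (variable names are illustrative):

```python
from itertools import combinations

# Combined group labels, formed the same way as data["Fertilizer"] + data["Sunlight"]
groups = sorted(f + s for f in ("F1", "F2", "F3") for s in ("High", "Low"))

# Every unordered pair of groups is one row of the Tukey table
pairs = list(combinations(groups, 2))
print(len(groups), "groups ->", len(pairs), "pairwise comparisons")
# 6 groups -> 15 pairwise comparisons
```

The number of comparisons grows as k(k-1)/2 for k groups, so adding more factors or levels quickly produces long tables that are harder to interpret.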

## Post-hoc Tests for N-Way ANOVA (Factorial ANOVA)

In [7]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Sample data
data = pd.DataFrame({
    "Growth": [20, 22, 19, 24, 25, 28, 30, 27, 26, 29, 18, 20, 22, 19, 24,
               21, 23, 20, 25, 26, 29, 31, 28, 27, 30, 19, 21, 23, 20, 25,
               20, 22, 21, 23, 24, 26, 28, 25, 27, 29, 17, 19, 21, 18, 20],
    "Fertilizer": ["F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3", "F1", "F1", "F1", "F1", "F1", 
                   "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3",
                   "F1", "F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2", "F2", 
                   "F3", "F3", "F3", "F3", "F3"],
    "Sunlight": ["High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High", "Low", "Low", "Low", "Low", "Low", 
                 "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low",
                 "High", "High", "High", "High", "High", "High", "High", "High", "High", "High", 
                 "High", "High", "High", "High", "High"],
    "Watering": ["Regular", "Regular", "Regular", "Regular", "Regular", 
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse", 
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Sparse", "Sparse", "Sparse", "Sparse", "Sparse",
                 "Regular", "Regular", "Regular", "Regular", "Regular", 
                 "Regular", "Regular", "Regular", "Regular", "Regular",
                 "Regular", "Regular", "Regular", "Regular", "Regular"]
})
tukey=pairwise_tukeyhsd(data["Growth"],data["Fertilizer"]+data["Sunlight"]+data["Watering"],alpha=0.05)
print(tukey)

        Multiple Comparison of Means - Tukey HSD, FWER=0.05        
    group1        group2    meandiff p-adj   lower    upper  reject
-------------------------------------------------------------------
F1HighRegular   F1LowSparse      1.0 0.9419  -2.2956  4.2956  False
F1HighRegular F2HighRegular      5.5    0.0   2.8092  8.1908   True
F1HighRegular   F2LowSparse      7.0    0.0   3.7044 10.2956   True
F1HighRegular F3HighRegular     -2.2 0.1647  -4.8908  0.4908  False
F1HighRegular   F3LowSparse     -0.4 0.9991  -3.6956  2.8956  False
  F1LowSparse F2HighRegular      4.5 0.0027   1.2044  7.7956   True
  F1LowSparse   F2LowSparse      6.0 0.0004   2.1946  9.8054   True
  F1LowSparse F3HighRegular     -3.2 0.0613  -6.4956  0.0956  False
  F1LowSparse   F3LowSparse     -1.4 0.8775  -5.2054  2.4054  False
F2HighRegular   F2LowSparse      1.5 0.7478  -1.7956  4.7956  False
F2HighRegular F3HighRegular     -7.7    0.0 -10.3908 -5.0092   True
F2HighRegular   F3LowSparse     -5.9 0.0001  -9.