# ANOVA Test
ANOVA (ANalysis Of VAriance) is a technique for comparing the means of three or more independent samples. For example, when testing several UI designs at once on an e-commerce website, ANOVA can tell us whether mean sales differ between the designs.

The ANOVA test has **important assumptions that must be satisfied** in order for the associated p-value to be valid.
   1. The samples are independent.
   2. Each sample is from a normally distributed population.
   3. The population standard deviations of the groups are all equal. This property is known as homoscedasticity.
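
As a rough illustration of how the second and third assumptions might be checked, the sketch below runs a Shapiro-Wilk test on each group and a Levene test across groups. The data is synthetic and the names (`groups`, `shapiro_pvalues`, `levene_pvalue`) are hypothetical, not part of the original notebook. Independence cannot be tested this way; it has to come from how the data was collected.

```python
import numpy as np
from scipy.stats import levene, shapiro

# Hypothetical data: three groups drawn from normal populations with equal spread.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (13.0, 13.5, 14.0)]

# Shapiro-Wilk: the null hypothesis is that a sample comes from a normal population.
shapiro_pvalues = [shapiro(g)[1] for g in groups]

# Levene: the null hypothesis is that all groups have equal population variances.
_, levene_pvalue = levene(*groups)

for i, p in enumerate(shapiro_pvalues, start=1):
    print(f"Group {i} Shapiro-Wilk p-value: {p:.3f}")
print(f"Levene p-value: {levene_pvalue:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption. A large p-value does not prove the assumption holds, particularly with small samples.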

If these assumptions do not hold for a given set of data, it may still be possible to use the **Kruskal-Wallis H-test (scipy.stats.kruskal)**, although with some loss of power.
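
A minimal sketch of the Kruskal-Wallis fallback, assuming skewed data for which the normality assumption is doubtful. The samples are made up for illustration and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical skewed (non-normal) samples, one array per group.
rng = np.random.default_rng(1)
groups = [rng.exponential(scale=s, size=25) for s in (1.0, 1.2, 3.0)]

# kruskal takes each group as a separate positional argument, like f_oneway.
h_stat, p_value = kruskal(*groups)
print(f"H = {h_stat:.3f}, p-value = {p_value:.4f}")
```

The test works on ranks rather than raw values, which is why it does not need normality and also why it gives up some power when the ANOVA assumptions do hold.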


In [4]:
## ANOVA Test example
from scipy.stats import f_oneway
import pandas as pd

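# Long-format data: one row per observation, with 'City' and 'Rate' columns
rate = pd.read_csv('data/rate_by_city.csv')
# Number the observations within each city so they can serve as the pivot index
rate['city_count'] = rate.groupby('City').cumcount()
# Wide format: one column per city, one row per observation number
rate_pivot = rate.pivot(index='city_count', columns='City', values='Rate')
rate_pivot.columns = ['City_'+str(x) for x in rate_pivot.columns.values]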

display(rate_pivot.head())

# One sample per city; f_oneway expects each group as a separate argument
analysis_vars = [
    rate_pivot.City_1,
    rate_pivot.City_2,
    rate_pivot.City_3,
    rate_pivot.City_4,
    rate_pivot.City_5,
    rate_pivot.City_6
]

f_oneway(*analysis_vars)

| city_count | City_1 | City_2 | City_3 | City_4 | City_5 | City_6 |
|---|---|---|---|---|---|---|
| 0 | 13.75 | 14.25 | 14.0 | 15.0 | 14.5 | 13.5 |
| 1 | 13.75 | 13.0 | 14.0 | 14.0 | 14.0 | 12.25 |
| 2 | 13.5 | 12.75 | 13.51 | 13.75 | 14.0 | 12.25 |
| 3 | 13.5 | 12.5 | 13.5 | 13.59 | 13.9 | 12.0 |
| 4 | 13.0 | 12.5 | 13.5 | 13.25 | 13.75 | 12.0 |


F_onewayResult(statistic=4.8293848737024, pvalue=0.001174551414504048)

In [7]:
## EXAMPLE 2 - Same test as the previous example, using statsmodels
# Here we do not need to pivot the data: the formula interface works directly on the long-format frame.
# C(City) in the formula marks City as a categorical variable, so each city is treated as its own group.

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = ols('Rate ~ C(City)', data=rate).fit()

# The F statistic and p-value match the f_oneway result, so our conclusion to reject the null hypothesis remains the same.
sm.stats.anova_lm(model, typ=2)

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(City) | 10.945667 | 5.0 | 4.829385 | 0.001175 |
| Residual | 21.758133 | 48.0 | NaN | NaN |
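
To make the link between the table columns explicit, the sketch below recomputes the F statistic and p-value from the sums of squares and degrees of freedom printed above. The variable names are hypothetical. The inputs are the rounded values from the table, so the results agree with it only up to rounding.

```python
from scipy.stats import f

# Values copied from the ANOVA table (rounded as printed).
ss_between, df_between = 10.945667, 5   # C(City) row
ss_within, df_within = 21.758133, 48    # Residual row

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of between-group to within-group mean squares.
f_stat = ms_between / ms_within

# p-value: upper-tail probability of the F distribution with (5, 48) degrees of freedom.
p_value = f.sf(f_stat, df_between, df_within)

print(f"F = {f_stat:.6f}, p-value = {p_value:.6f}")
```

This is the same ratio that `f_oneway` and `anova_lm` report: a large F means the variation between city means is large relative to the variation within cities.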
