# One-way ANOVA

This notebook shows how one way ANOVA can be implemented in Python.
> 1 way **ANOVA** compares the mean of a numeric variable across the categories of a single categorical variable; it is typically used when that variable has more than 2 categories.
> The Shapiro test is used to check the normality of each group in the dataset.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import levene, shapiro, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd, MultiComparison
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

In [2]:
market=pd.read_excel('Book1.xlsx')

In [3]:
market.head()

Unnamed: 0,Country,Region,exportcost,importcost
0,Cambodia,East Asia & Pacific,795,930
1,China,East Asia & Pacific,620,615
2,"Hong Kong SAR, China",East Asia & Pacific,590,565
3,Indonesia,East Asia & Pacific,615,660
4,Malaysia,East Asia & Pacific,450,485


We start with a Shapiro test to check whether this **ANOVA** can be done or not. A test "passes" here when its p-value is above the chosen significance level (0.05 in this notebook).

The conditions for doing a 1 way ANOVA are:
1. First we run a Shapiro test on each group. If it passes we go to ANOVA; if it fails we go to step 2.
2. We run **Levene's test**. If it passes we go to ANOVA; otherwise we use the **Kruskal-Wallis test**.
3. When ANOVA rejects the hypothesis of equal means, we run **Tukey HSD** to find which groups caused the rejection.

Tukey HSD is not strictly necessary; it is done to see which pairs of groups differ. It can also be performed when ANOVA passes.
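The steps above can be sketched as a single helper. This is only an illustration of the flow as this notebook describes it: the function name `choose_test`, the `alpha` parameter and the synthetic groups are assumptions, not part of the original analysis.

```python
import numpy as np
from scipy.stats import shapiro, levene, f_oneway, kruskal

def choose_test(groups, alpha=0.05):
    """Hypothetical helper: Shapiro, then Levene, then ANOVA or Kruskal-Wallis."""
    # Step 1: Shapiro test per group; "pass" means every p-value is above alpha
    if all(shapiro(g)[1] > alpha for g in groups):
        res = f_oneway(*groups)
        return "anova", res[0], res[1]
    # Step 2: normality failed, so check equality of variances with Levene's test
    if levene(*groups)[1] > alpha:
        res = f_oneway(*groups)
        return "anova", res[0], res[1]
    # Both checks failed: fall back to the rank-based Kruskal-Wallis test
    res = kruskal(*groups)
    return "kruskal", res[0], res[1]

# Synthetic groups with clearly different means (made-up data)
rng = np.random.default_rng(42)
groups = [rng.normal(loc=m, scale=1.0, size=20) for m in (0.0, 5.0, 10.0)]
name, stat, p = choose_test(groups)
print(name, stat, p)
```

A small p-value from either test is the signal to continue with Tukey HSD (step 3).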

In [18]:
shapiro(market[market['Region'] == 'East Asia & Pacific'].importcost)

(0.9657883048057556, 0.8492884635925293)

In [19]:
shapiro(market[market['Region'] == 'Eastern Europe & Central Asia'].importcost)

(0.8764522075653076, 0.1187758594751358)

In [20]:
shapiro(market[market['Region'] == 'Latin American & Carribbean'].importcost)

(0.9156326055526733, 0.32196322083473206)

In [21]:
shapiro(market[market['Region'] == 'Middle East & North Africa'].importcost)

(0.6830025315284729, 0.000548569078091532)
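The four cells above repeat the same call with a different region each time. A possible shortcut, shown here on a synthetic stand-in for `market` (the frame `demo` and its values are made up), is to loop over a `groupby` and collect one Shapiro result per group.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Synthetic stand-in for the notebook's `market` frame (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Region": np.repeat(["A", "B", "C"], 10),
    "importcost": rng.normal(loc=1000, scale=200, size=30),
})

# One Shapiro test per region; keep the W statistic and the p-value
rows = []
for region, values in demo.groupby("Region")["importcost"]:
    w, p = shapiro(values)
    rows.append({"Region": region, "W": w, "p_value": p, "passes": p > 0.05})

shapiro_table = pd.DataFrame(rows)
print(shapiro_table)
```

With the real data, replacing `demo` by `market` should give the same four results as the individual cells.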

Here the complete dataframe is split into one dataframe per region, so that means and variances are compared group against group (apples to apples).

In [34]:
df1=market[market['Region']=='East Asia & Pacific']
df2=market[market['Region'] == 'Eastern Europe & Central Asia']
df3=market[market['Region'] == 'Latin American & Carribbean']
df4=market[market['Region'] == 'Middle East & North Africa']

In [35]:
x1=df1['importcost'].mean()
y1=df2['importcost'].mean()
z1=df3['importcost'].mean()
z2=df4['importcost'].mean()

In [36]:
print(x1,y1,z1,z2)

643.5 1590.0 1916.0 1482.9


In [37]:

# Levene's test

levene(df1['importcost'],df2['importcost'],df3['importcost'],df4['importcost'])

LeveneResult(statistic=2.036620755449027, pvalue=0.1260602158520989)

###### 1 way ANOVA

In [41]:
stats.f_oneway(df1['importcost'],df2['importcost'],df3['importcost'],df4['importcost'])

F_onewayResult(statistic=6.137096049804877, pvalue=0.0017602994423940278)
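The flow described at the start names the Kruskal-Wallis test as the fallback when the checks fail, but the notebook never runs it. A minimal sketch of the call, using `scipy.stats.kruskal` on made-up samples (not the notebook's data):

```python
from scipy.stats import kruskal

# Made-up samples for three groups; they do not come from Book1.xlsx
g1 = [450, 590, 615, 620, 795]
g2 = [1480, 1500, 1620, 1650, 1710]
g3 = [1850, 1900, 1950, 2000, 2100]

# Kruskal-Wallis works on ranks, so it does not assume normality
h_stat, p_value = kruskal(g1, g2, g3)
print(h_stat, p_value)
```

On the real data the call would take the same four `importcost` columns that are passed to `f_oneway` above.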

###### Tukey HSD

In [51]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd, MultiComparison
mc = MultiComparison(market['importcost'],market['Region'])
result=mc.tukeyhsd()

In [52]:
print(result)

                       Multiple Comparison of Means - Tukey HSD,FWER=0.05                       
            group1                        group2            meandiff   lower      upper   reject
------------------------------------------------------------------------------------------------
     East Asia & Pacific      Eastern Europe & Central Asia  946.5    113.1991  1779.8009  True 
     East Asia & Pacific       Latin American & Carribbean   1272.5   439.1991  2105.8009  True 
     East Asia & Pacific        Middle East & North Africa   839.4     6.0991   1672.7009  True 
Eastern Europe & Central Asia  Latin American & Carribbean   326.0   -507.3009  1159.3009 False 
Eastern Europe & Central Asia   Middle East & North Africa   -107.1  -940.4009   726.2009 False 
 Latin American & Carribbean    Middle East & North Africa   -433.1  -1266.4009  400.2009 False 
------------------------------------------------------------------------------------------------


The same 1 way ANOVA procedure is now applied to two more datasets, "Golfball.xlsx" and "Granule.xlsx".

In [2]:
golf=pd.read_excel('Golfball.xlsx')

In [3]:
golf.head()

Unnamed: 0,Design,Value
0,Design1,206.32
1,Design1,207.94
2,Design1,206.19
3,Design1,204.45
4,Design1,209.65


In [4]:
shapiro(golf[golf['Design']=='Design1'].Value)

(0.9313421845436096, 0.4612235128879547)

In [5]:
shapiro(golf[golf['Design']=='Design2'].Value)

(0.9357394576072693, 0.5066492557525635)

In [6]:
shapiro(golf[golf['Design']=='Design3'].Value)

(0.965823769569397, 0.8496574759483337)

In [7]:
shapiro(golf[golf['Design']=='Design4'].Value)

(0.9821692705154419, 0.9757122993469238)

In [8]:
levene(golf[golf['Design']=='Design1'].Value,golf[golf['Design']=='Design2'].Value,golf[golf['Design']=='Design3'].Value,golf[golf['Design']=='Design4'].Value)

LeveneResult(statistic=2.0932275915882848, pvalue=0.11827579172462642)

In [9]:
stats.f_oneway(golf[golf['Design']=='Design1'].Value,golf[golf['Design']=='Design2'].Value,golf[golf['Design']=='Design3'].Value,golf[golf['Design']=='Design4'].Value)

F_onewayResult(statistic=53.0298222109013, pvalue=2.7319908034125896e-13)

In [10]:
formula = 'Value~Design'
model = ols(formula, golf).fit()
aov_table = anova_lm(model, typ=1)
print(aov_table)

            df     sum_sq     mean_sq          F        PR(>F)
Design     3.0  2990.9898  996.996600  53.029822  2.731991e-13
Residual  36.0   676.8244   18.800678        NaN           NaN
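The F value in this table is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. The short check below recomputes it from the printed numbers; the variable names are illustrative only.

```python
# Values copied from the anova_lm table printed above
ss_design, df_design = 2990.9898, 3.0
ss_resid, df_resid = 676.8244, 36.0

ms_design = ss_design / df_design  # mean square between designs
ms_resid = ss_resid / df_resid     # mean square within designs (residual)
f_stat = ms_design / ms_resid      # F = MS_between / MS_within

print(ms_design, ms_resid, f_stat)
```

The result agrees with both the `anova_lm` table and the `f_oneway` statistic from the previous cell.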


In [70]:
mc = MultiComparison(golf['Value'],golf['Design'])
result = mc.tukeyhsd()
print(result)
print(mc.groupsunique)

Multiple Comparison of Means - Tukey HSD,FWER=0.05
 group1  group2 meandiff  lower   upper  reject
-----------------------------------------------
Design1 Design2  11.902   6.6795 17.1245  True 
Design1 Design3  19.974  14.7515 25.1965  True 
Design1 Design4  22.008  16.7855 27.2305  True 
Design2 Design3  8.072    2.8495 13.2945  True 
Design2 Design4  10.106   4.8835 15.3285  True 
Design3 Design4  2.034   -3.1885  7.2565 False 
-----------------------------------------------
['Design1' 'Design2' 'Design3' 'Design4']


In [71]:
print(mc.groupsunique)

['Design1' 'Design2' 'Design3' 'Design4']


# One Way ANOVA

In [75]:
gran=pd.read_excel('Granule.xlsx')

In [76]:
gran.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 2 columns):
manu    310 non-null object
wt      310 non-null float64
dtypes: float64(1), object(1)
memory usage: 4.9+ KB


In [82]:
gran.head()

Unnamed: 0,manu,wt
0,Boston,0.14
1,Boston,0.31
2,Boston,0.28
3,Boston,0.14
4,Boston,0.17


In [83]:
shapiro(gran[gran['manu']=='Boston'].wt)

(0.8896019458770752, 6.349862857746302e-10)

In [84]:
shapiro(gran[gran['manu']=='Vermont'].wt)

(0.9112821817398071, 1.376009919340504e-07)

In [85]:
levene(gran[gran['manu']=='Boston'].wt,gran[gran['manu']=='Vermont'].wt)

LeveneResult(statistic=0.8648567105557542, pvalue=0.35311143576235615)

In [86]:
formula='wt~manu'
model=ols(formula,gran).fit()
anova=anova_lm(model,typ=1)
print(anova)

             df    sum_sq   mean_sq         F    PR(>F)
manu        1.0  0.162870  0.162870  9.089747  0.002784
Residual  308.0  5.518739  0.017918       NaN       NaN
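The Granule data has only two groups (Boston and Vermont). In that case 1 way ANOVA is equivalent to an independent two-sample t-test with pooled variance: the F statistic equals the squared t statistic and the p-values match. A small check on synthetic data (the arrays `a` and `b` are made up, not the Granule values):

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Two synthetic groups standing in for two manufacturers
rng = np.random.default_rng(1)
a = rng.normal(0.20, 0.1, size=40)
b = rng.normal(0.25, 0.1, size=40)

f_stat, p_anova = f_oneway(a, b)
t_stat, p_ttest = ttest_ind(a, b)  # equal_var=True by default (pooled variance)

print(np.isclose(f_stat, t_stat ** 2), np.isclose(p_anova, p_ttest))
```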
