Bivariate analysis - Categorical column vs numerical column

ANOVA - An ANOVA (analysis of variance) test checks whether the differences between the means of several groups are statistically significant. In other words, it helps you decide whether to reject the null hypothesis (all group means are equal) or fail to reject it.
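
To make the idea concrete, the F statistic behind a one-way ANOVA can be computed by hand as the ratio of between-group variance to within-group variance. The sketch below is a minimal illustration in plain Python; the function name `one_way_f` and the three small groups are made up for this example and are not part of the notebook's data.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    # F = mean square between / mean square within
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical toy groups (assumed values, for illustration only)
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_f(toy_groups)
print(f"F = {f_stat:.3f}")
```

A large F means the group means differ by more than the spread inside each group would explain, which is what leads to a small p-value.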

In [3]:
# We will use the built-in chickwts dataset from pydataset

from pydataset import data

df = data('chickwts')


In [6]:
# Let's check the sample size of each feed

df.groupby(['feed']).size().reset_index()

Unnamed: 0,feed,0
0,casein,12
1,horsebean,10
2,linseed,12
3,meatmeal,11
4,soybean,14
5,sunflower,12


In [7]:
# The sample sizes are roughly equal across feeds

# Let's also check the mean weight for each feed

df.groupby(['feed'])['weight'].mean().reset_index()

Unnamed: 0,feed,weight
0,casein,323.583333
1,horsebean,160.2
2,linseed,218.75
3,meatmeal,276.909091
4,soybean,246.428571
5,sunflower,328.916667


In [34]:
# Let's state the null and alternative hypotheses

# H0 : feed has no impact on weight (all feed groups have the same mean weight)
# H1 : feed does impact weight (at least one group mean differs)

# The ANOVA function takes one pandas Series of weights per group

# Let's create a pandas Series of weight for each feed group

casein = df['weight'][df['feed'] == 'casein']
horsebean = df['weight'][df['feed'] == 'horsebean']
linseed = df['weight'][df['feed'] == 'linseed']
meatmeal = df['weight'][df['feed'] == 'meatmeal']
soybean = df['weight'][df['feed'] == 'soybean']
sunflower = df['weight'][df['feed'] == 'sunflower']

In [35]:
# Perform the ANOVA

import scipy.stats as stats

stats.f_oneway(casein, horsebean, linseed, meatmeal, soybean, sunflower)

F_onewayResult(statistic=15.364799774712534, pvalue=5.936419853471331e-10)

As the p-value (about 5.9e-10) is far below alpha = 0.05, we reject the null hypothesis, which means that feed has an impact on weight.
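
The decision rule above can also be written out explicitly, and the per-group Series can be built in one pass with `groupby` instead of one variable per feed. This is only a sketch under assumptions: the `toy` DataFrame and its values are hypothetical stand-ins, not the chickwts data, and alpha = 0.05 is assumed as the significance level.

```python
import pandas as pd
from scipy import stats

# Hypothetical long-format data: one row per observation (not the chickwts data)
toy = pd.DataFrame({
    "feed": ["a"] * 3 + ["b"] * 3 + ["c"] * 3,
    "weight": [1, 2, 3, 2, 3, 4, 5, 6, 7],
})

# Build one Series of weights per feed group in a single pass
groups = [weights for _, weights in toy.groupby("feed")["weight"]]

# Run the one-way ANOVA and compare the p-value with the chosen alpha
alpha = 0.05
result = stats.f_oneway(*groups)
reject_h0 = result.pvalue < alpha

print(f"F = {result.statistic:.3f}, p = {result.pvalue:.4f}, reject H0: {reject_h0}")
```

The same pattern should work on `df` by swapping it in for `toy`, since it has the same `feed` and `weight` columns.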

In [37]:
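# ANOVA only tells us that some group means differ
# Tukey's HSD test compares every pair of feeds to show which ones differ

from statsmodels.stats.multicomp import pairwise_tukeyhsd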

tukey = pairwise_tukeyhsd(df['weight'], df['feed'], alpha=0.05)

tukey.summary()

group1,group2,meandiff,lower,upper,reject
casein,horsebean,-163.3833,-232.3445,-94.4222,True
casein,linseed,-104.8333,-170.5852,-39.0814,True
casein,meatmeal,-46.6742,-113.9039,20.5554,False
casein,soybean,-77.1548,-140.5149,-13.7947,True
casein,sunflower,5.3333,-60.4186,71.0852,False
horsebean,linseed,58.55,-10.4112,127.5112,False
horsebean,meatmeal,116.7091,46.3375,187.0806,True
horsebean,soybean,86.2286,19.544,152.9132,True
horsebean,sunflower,168.7167,99.7555,237.6778,True
linseed,meatmeal,58.1591,-9.0705,125.3887,False


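Two way ANOVA - without replication - We cannot use stats.f_oneway here, as scipy.stats does not have a function for a two-way test. Instead we will use the statsmodels library. Its AnovaRM class fits a repeated-measures ANOVA, which with a single within-subject factor is equivalent to a two-way ANOVA without replication (the students act as blocks).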

In [12]:
# libraries

import statsmodels
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Data set

df_marks = pd.read_excel('E:\\Training data\\PGDS- EDA\\Batch 6\\Students Marks.xlsx')

df_marks.head(2)

Unnamed: 0,Students,Exam,Marks
0,student1,IA,96
1,student1,FA,36


In [13]:
# Let's state the null and alternative hypotheses

# H0 : mean marks are the same in both exams (IA and FA)
# H1 : mean marks differ between the exams

# Let's fit the two-way ANOVA using statsmodels

model = AnovaRM(df_marks, 'Marks', 'Students', within=['Exam'])

# Fitting the model

fit = model.fit()

# Printing the summary output

fit.summary()

,Num DF,Den DF,F Value,Pr > F
Exam,1.0000,9.0000,3.8012,0.0830


As the p-value (0.0830) is greater than 0.05, we fail to reject the null hypothesis, so the difference in marks between the exams is not statistically significant.
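
As a sanity check on this kind of result: when the within-subject factor has only two levels, the repeated-measures F value should equal the square of the paired t statistic on the same subjects. The sketch below illustrates this on made-up data. The marks, student labels and the `toy_marks` name are hypothetical and not taken from the Excel file, and the paired t-test (`scipy.stats.ttest_rel`) is brought in here only for comparison.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Hypothetical marks for four students (not the data from the Excel file)
ia = [80, 70, 90, 60]
fa = [75, 72, 80, 58]
students = ["s1", "s2", "s3", "s4"]

# Long format: one row per student per exam, as AnovaRM expects
toy_marks = pd.DataFrame({
    "Students": students * 2,
    "Exam": ["IA"] * 4 + ["FA"] * 4,
    "Marks": ia + fa,
})

# Repeated-measures ANOVA with Exam as the within-subject factor
table = AnovaRM(toy_marks, "Marks", "Students", within=["Exam"]).fit().anova_table
f_value = table.loc["Exam", "F Value"]

# Paired t-test on the same students
t_stat, p_value = stats.ttest_rel(ia, fa)

print(f"F = {f_value:.4f}, t squared = {t_stat ** 2:.4f}")
print(f"ANOVA p = {table.loc['Exam', 'Pr > F']:.4f}, paired t-test p = {p_value:.4f}")
```

With more than two exams the paired t-test no longer applies, while the AnovaRM call stays the same.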