# Two way ANOVA

## Two way ANOVA - Introduction

A __Randomized Block Design__ is one common setting that is analyzed with a __Two Way ANOVA__.

In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one continuous dependent variable. It compares the mean differences between groups that have been split on two independent variables (called factors).

Assumptions:
- The populations from which the samples are drawn must be normally or approximately normally distributed.
- The samples must be independent.
- The variances of the populations must be equal.
- The groups must have the same sample size (a balanced design); the manual calculation in this section relies on this.
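
The normality and equal-variance assumptions can be checked roughly before running the analysis. The following is a minimal sketch, assuming SciPy is available; the `toy` data frame is hypothetical and is not the agency's data. It uses Levene's test across the teacher and gender cells and the Shapiro-Wilk test on the residuals, which is one common choice rather than the only one.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data (NOT the agency's data): 2 teachers x 2 genders x 3 students
toy = pd.DataFrame({
    "Teacher": ["t1"] * 6 + ["t2"] * 6,
    "Gender": (["Male"] * 3 + ["Female"] * 3) * 2,
    "Mark": [70, 65, 72, 80, 78, 85, 82, 79, 88, 90, 86, 93],
})

# Equal variances: Levene's test across the teacher x gender cells
cells = [g["Mark"].to_numpy() for _, g in toy.groupby(["Teacher", "Gender"])]
levene_stat, levene_p = stats.levene(*cells)

# Normality: Shapiro-Wilk test on the residuals (mark minus its cell mean)
cell_mean = toy.groupby(["Teacher", "Gender"])["Mark"].transform("mean")
residuals = toy["Mark"] - cell_mean
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print("Levene p-value:", levene_p)
print("Shapiro-Wilk p-value:", shapiro_p)
```

A small p-value in either test would suggest that the corresponding assumption is doubtful. With only three observations per cell, as in this case study, such tests have little power, so they are a rough check at best.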

### Two way ANOVA - Scenario

Two way ANOVA is performed to analyze the effect of two categorical independent variables on the dependent variable which is quantitative in nature. 

By performing two way ANOVA we can find out the following:
- Effect of 1st independent variable on the dependent variable. 
- Effect of 2nd independent variable on the dependent variable.
- Combined effect of both the independent variables on the dependent variable.

In this case study, a surveying agency performs two way ANOVA to find out:

- How the performance of a student is affected by the teacher under whom he/she has studied.
- Whether the gender of a student has a significant effect on his/her performance.
- How the attributes teacher and gender together affect the performance of a student.

### Two way ANOVA - The data to be analyzed

The marks scored by the 18 students, along with their gender and the teacher under whom they studied, are given below. The agency analyzes this data to find the effect of teacher, gender and the combination of both on the performance of students.



In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

In [2]:
mark_scored = pd.read_csv('data/Mark_teacher_gender.csv')
display(mark_scored)

Unnamed: 0,Teacher,Gender,Mark
0,teacher1,Female,90
1,teacher1,Male,76
2,teacher1,Male,65
3,teacher1,Male,67
4,teacher1,Female,73
5,teacher1,Female,81
6,teacher2,Male,81
7,teacher2,Female,86
8,teacher2,Male,89
9,teacher2,Male,76
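
Because the analysis assumes equal group sizes, it can help to count the observations in every teacher and gender cell before going further. This is a minimal sketch using `pd.crosstab`; the `toy` data frame is a hypothetical stand-in with the same column names as `mark_scored`.

```python
import pandas as pd

# Hypothetical toy data with the same columns as mark_scored
toy = pd.DataFrame({
    "Teacher": ["t1"] * 6 + ["t2"] * 6,
    "Gender": (["Male"] * 3 + ["Female"] * 3) * 2,
    "Mark": [70, 65, 72, 80, 78, 85, 82, 79, 88, 90, 86, 93],
})

# Number of students in each teacher x gender cell
cell_counts = pd.crosstab(toy["Teacher"], toy["Gender"])

# The design is balanced when every cell holds the same number of students
is_balanced = bool(cell_counts.to_numpy().min() == cell_counts.to_numpy().max())

print(cell_counts)
print("Balanced design:", is_balanced)
```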


### Steps to perform two way ANOVA

__Step 1:__ Set the null hypothesis and the alternate hypothesis.

__Step 2:__ Calculate various means and create the mean table.

__Step 3:__ Calculate the sum of squares. 

__Step 4:__ Calculate the degrees of freedom for each sum of squares.

__Step 5:__ Calculate the mean square values.

__Step 6:__ Calculate the F statistic.

__Step 7:__ Find the critical F value and compare it with the calculated F statistic.

__Step 8:__  Draw and state the conclusion. 

***

#### Step 1: Setting the hypotheses

The agency is analyzing the effect of two independent attributes, teacher and gender, on the performance of a student.

Here, the agency has to set three sets of hypotheses: one for the effect of the first independent variable (teacher), one for the effect of the second independent variable (gender), and one for the effect of the combination of both independent variables.

The null and alternate hypotheses are set as given below.

Set 1:
- $H_0$: Teacher has no significant effect on the performance of a student, i.e. $\mu_{teacher1}=\mu_{teacher2}=\mu_{teacher3}$.
- $H_a$: Teacher has a significant effect on the performance of a student, i.e. at least one of $\mu_{teacher1}$, $\mu_{teacher2}$, $\mu_{teacher3}$ differs from the others.


Set 2:
- $H_0$: Gender has no significant effect on the performance of a student, i.e. $\mu_{male}=\mu_{female}$.
- $H_a$: Gender has a significant effect on the performance of a student, i.e. $\mu_{male} \neq \mu_{female}$.


Set 3:
- $H_0$: The combination of gender and teacher has no significant effect on a student's performance.
- $H_a$: The combination of gender and teacher has a significant effect on a student's performance.

After setting the hypothesis, the agency proceeds to step 2 which involves calculating various means. 

#### Step 2: The means to be calculated

The agency needs to calculate the following means to perform two way ANOVA:
- Overall mean (mean of the marks scored by all 18 students)
- Mean of the marks scored by:
     - all the students who studied under teacher1
     - all the students who studied under teacher2
     - all the students who studied under teacher3
     - all male students
     - all female students
     - male students who studied under teacher1
     - female students who studied under teacher1
     - male students who studied under teacher2
     - female students who studied under teacher2
     - male students who studied under teacher3
     - female students who studied under teacher3
     
#### Step 2: Calculating the means

The overall mean (grand mean) is the mean of the marks scored by all 18 students.

In [3]:
overall_mean = mark_scored.Mark.mean()  # Series.mean() already returns the grand mean
print("Overall mean: ",overall_mean)

Overall mean:  76.83333333333333


Means with respect to teacher

In [4]:
mean_teacher1 = np.mean(mark_scored.loc[mark_scored.Teacher=='teacher1', 'Mark'])
mean_teacher2 = np.mean(mark_scored.loc[mark_scored.Teacher=='teacher2', 'Mark'])
mean_teacher3 = np.mean(mark_scored.loc[mark_scored.Teacher=='teacher3', 'Mark'])

In [5]:
print("Mean_teacher_1: ",mean_teacher1)
print("Mean_teacher_2: ",mean_teacher2)
print("Mean_teacher_3: ",mean_teacher3)

Mean_teacher_1:  75.33333333333333
Mean_teacher_2:  85.5
Mean_teacher_3:  69.66666666666667


Means with respect to gender

The means with respect to the second independent attribute, i.e. gender.

In [6]:
mean_male = np.mean(mark_scored.loc[mark_scored.Gender=='Male', "Mark"])
mean_female = np.mean(mark_scored.loc[mark_scored.Gender=='Female', "Mark"])

In [7]:
print('Mean_male: ',mean_male)
print('Mean_female: ',mean_female)

Mean_male:  72.55555555555556
Mean_female:  81.11111111111111


Means with respect to combination of teacher and gender 

The means with respect to the combination of teacher and gender, i.e. the mean mark scored by all the males who were taught by teacher 1, the mean mark scored by all the females who were taught by teacher 1, and so on.

In [9]:
mean_male_teacher1 = np.mean(mark_scored.loc[(mark_scored.Teacher=='teacher1') & \
                                             (mark_scored.Gender == 'Male'), "Mark"])
print("mean_male_teacher1: ",mean_male_teacher1)

mean_male_teacher1:  69.33333333333333


In [10]:
mean_female_teacher1 = np.mean(mark_scored.loc[(mark_scored.Teacher=='teacher1') & \
                                             (mark_scored.Gender == 'Female'), "Mark"])
print("mean_female_teacher1: ",mean_female_teacher1)

mean_female_teacher1:  81.33333333333333


In [11]:
mean_male_teacher2 = np.mean(mark_scored.loc[(mark_scored.Teacher=='teacher2') & \
                                             (mark_scored.Gender == 'Male'), "Mark"])
print("mean_male_teacher2: ",mean_male_teacher2)

mean_male_teacher2:  82.0


In [12]:
mean_female_teacher2 = np.mean(mark_scored.loc[(mark_scored.Teacher=='teacher2') & \
                                             (mark_scored.Gender == 'Female'), "Mark"])
print("mean_female_teacher2: ",mean_female_teacher2)

mean_female_teacher2:  89.0


In [13]:
mean_male_teacher3 = np.mean(mark_scored.loc[(mark_scored.Teacher=='teacher3') & \
                                             (mark_scored.Gender == 'Male'), "Mark"])
print("mean_male_teacher3: ",mean_male_teacher3)

mean_male_teacher3:  66.33333333333333


In [14]:
mean_female_teacher3 = np.mean(mark_scored.loc[(mark_scored.Teacher=='teacher3') & \
                                             (mark_scored.Gender == 'Female'), "Mark"])
print("mean_female_teacher3: ",mean_female_teacher3)

mean_female_teacher3:  73.0


#### Step 2: The mean table

All the calculated means are listed in the table below.

![](8_11.PNG)
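
A table of this shape can also be built in one call with `pivot_table`. The sketch below is illustrative only: the `toy` data frame is hypothetical, and the margin label `"Overall"` is a name chosen here, not one used by the agency.

```python
import pandas as pd

# Hypothetical toy data with the same columns as mark_scored
toy = pd.DataFrame({
    "Teacher": ["t1"] * 6 + ["t2"] * 6,
    "Gender": (["Male"] * 3 + ["Female"] * 3) * 2,
    "Mark": [70, 65, 72, 80, 78, 85, 82, 79, 88, 90, 86, 93],
})

# Cell means in the body, teacher and gender means in the margins,
# and the overall mean in the bottom-right corner
mean_table = toy.pivot_table(
    values="Mark",
    index="Teacher",
    columns="Gender",
    aggfunc="mean",
    margins=True,
    margins_name="Overall",
)
print(mean_table)
```

Applied to `mark_scored`, the same call should give the teacher, gender and combined means that were computed cell by cell above.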

#### Step 3: Sum of squares to be calculated

To perform two way ANOVA, the following sums of squares are calculated:
- Sum of Squares Total (SST)
- Sum of Squares between due to Factor 1 ($SSB_{factor1}$)
- Sum of Squares between due to Factor 2 ($SSB_{factor2}$)
- Sum of Squares due to the combination of Factor 1 and Factor 2 ($SSC_{factor1 + factor2}$)
- Sum of Squares within (SSW)

The relationship between the various sums of squares is:

![](8_12.PNG)

In our case study, the surveying agency needs to calculate the following sums of squares:
- Sum of Squares Total (SST)
- Sum of Squares between due to gender ($SSB_{gender}$)
- Sum of Squares between due to teacher ($SSB_{teacher}$)
- Sum of Squares between due to combination of gender and teacher ($SSB_{gender+teacher}$)
- Sum of Squares within (SSW)

The equation can be written as:

![](8_13.PNG)
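
In symbols, the decomposition for a balanced design can be sketched as follows. The index notation is an assumption introduced here for illustration: $y_{ijk}$ is the mark of student $k$ under teacher $i$ with gender $j$, $a$ and $b$ are the numbers of teachers and genders, $n$ is the number of students per cell, and a bar with a dot denotes a mean over the dotted index.

$$SST = SSB_{teacher} + SSB_{gender} + SSB_{gender+teacher} + SSW$$

$$SST = \sum_{i}\sum_{j}\sum_{k}\left(y_{ijk}-\bar{y}_{\cdot\cdot\cdot}\right)^2$$

$$SSB_{teacher} = b\,n\sum_{i}\left(\bar{y}_{i\cdot\cdot}-\bar{y}_{\cdot\cdot\cdot}\right)^2 \qquad SSB_{gender} = a\,n\sum_{j}\left(\bar{y}_{\cdot j\cdot}-\bar{y}_{\cdot\cdot\cdot}\right)^2$$

$$SSB_{gender+teacher} = n\sum_{i}\sum_{j}\left(\bar{y}_{ij\cdot}-\bar{y}_{i\cdot\cdot}-\bar{y}_{\cdot j\cdot}+\bar{y}_{\cdot\cdot\cdot}\right)^2 \qquad SSW = \sum_{i}\sum_{j}\sum_{k}\left(y_{ijk}-\bar{y}_{ij\cdot}\right)^2$$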

#### Step 3: Calculating sum of squares total (SST)

__Sum of Squares total__ is the sum of the squared distances of all the records from the overall mean.


In [15]:
SST = np.sum((mark_scored.Mark - overall_mean)**2)
print("SST: ",SST)

SST:  1504.5


#### Step 3: Calculating sum of squares between, due to teacher

__Sum of Squares between__ due to the factor teacher ($SSB_{teacher}$) can be calculated as shown below:

![](8_14.PNG)

> $SSB_{teacher} = SSB_{teacher1} + SSB_{teacher2} + SSB_{teacher3} = 13.5 + 450.6667 + 308.1667 = 772.3333$

In [16]:
SSB_teacher1 = len(mark_scored.loc[mark_scored.Teacher=='teacher1', "Mark"]) * ((mean_teacher1 - overall_mean)**2)
SSB_teacher2 = len(mark_scored.loc[mark_scored.Teacher=='teacher2', "Mark"]) * ((mean_teacher2 - overall_mean)**2)
SSB_teacher3 = len(mark_scored.loc[mark_scored.Teacher=='teacher3', "Mark"]) * ((mean_teacher3 - overall_mean)**2)
SSB_teacher = SSB_teacher1+SSB_teacher2+SSB_teacher3
print('SSB_teacher: ', SSB_teacher)

SSB_teacher:  772.333333333333


#### Step 3: Calculating sum of squares between, due to gender

__Sum of Squares between__ due to the factor gender ($SSB_{gender}$) can be calculated as shown below:

![](8_15.PNG)  
  
![](8_16.PNG) 

> $SSB_{gender} = SSB_{male} + SSB_{female} = 164.6944 + 164.6944 = 329.3889$

In [17]:
SSB_male = len(mark_scored.loc[mark_scored.Gender=='Male', "Mark"]) * ((mean_male - overall_mean)**2)
SSB_female = len(mark_scored.loc[mark_scored.Gender=='Female', "Mark"]) * ((mean_female - overall_mean)**2)
SSB_gender = SSB_male + SSB_female
print('SSB_gender: ', SSB_gender)

SSB_gender:  329.388888888889
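
Both between-group sums of squares follow the same pattern: group size times the squared distance of the group mean from the overall mean, summed over the groups. A minimal sketch of a reusable helper is given below; `ss_between` is a hypothetical name introduced here, and the `toy` data frame is illustrative, not the agency's data.

```python
import pandas as pd

def ss_between(df, factor, value="Mark"):
    """Sum over groups of: group size * (group mean - overall mean)^2."""
    overall_mean = df[value].mean()
    groups = df.groupby(factor)[value].agg(["mean", "count"])
    return float((groups["count"] * (groups["mean"] - overall_mean) ** 2).sum())

# Hypothetical toy data with the same columns as mark_scored
toy = pd.DataFrame({
    "Teacher": ["t1"] * 6 + ["t2"] * 6,
    "Gender": (["Male"] * 3 + ["Female"] * 3) * 2,
    "Mark": [70, 65, 72, 80, 78, 85, 82, 79, 88, 90, 86, 93],
})

ssb_teacher_toy = ss_between(toy, "Teacher")
ssb_gender_toy = ss_between(toy, "Gender")
print("SSB_teacher (toy):", ssb_teacher_toy)
print("SSB_gender (toy):", ssb_gender_toy)
```

Calling `ss_between(mark_scored, "Teacher")` and `ss_between(mark_scored, "Gender")` should reproduce the two values computed above.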


#### Step 3: Calculating sum of squares within (SSW) i.e. error

__Sum of Squares within (SSW)__ is calculated as shown below:

![](8_17.PNG)

![](8_18.PNG)

> $SSW = SSW_{male} + SSW_{female} = 179.3333 + 196.6667 = 376$

In [18]:
SSW1 = np.sum((mark_scored.loc[(mark_scored.Teacher=="teacher1") & (mark_scored.Gender=='Male'), 'Mark'] \
              - mean_male_teacher1)**2)
print('SSW1: ', SSW1)

SSW1:  68.66666666666667


In [21]:
SSW2 = np.sum((mark_scored.loc[(mark_scored.Teacher=="teacher1") & (mark_scored.Gender=='Female'), 'Mark'] \
              - mean_female_teacher1)**2)
print('SSW2: ', SSW2)

SSW2:  144.66666666666669


In [22]:
SSW3 = np.sum((mark_scored.loc[(mark_scored.Teacher=="teacher2") & (mark_scored.Gender=='Male'), 'Mark'] \
              - mean_male_teacher2)**2)
print('SSW3: ', SSW3)

SSW3:  86.0


In [24]:
SSW4 = np.sum((mark_scored.loc[(mark_scored.Teacher=="teacher2") & (mark_scored.Gender=='Female'), 'Mark'] \
              - mean_female_teacher2)**2)
print('SSW4: ', SSW4)

SSW4:  38.0


In [25]:
SSW5 = np.sum((mark_scored.loc[(mark_scored.Teacher=="teacher3") & (mark_scored.Gender=='Male'), 'Mark'] \
              - mean_male_teacher3)**2)
print('SSW5: ', SSW5)

SSW5:  24.666666666666664


In [27]:
SSW6 = np.sum((mark_scored.loc[(mark_scored.Teacher=="teacher3") & (mark_scored.Gender=='Female'), 'Mark'] \
              - mean_female_teacher3)**2)
print('SSW6: ', SSW6)

SSW6:  14.0


In [29]:
SSW = SSW1+SSW2+SSW3+SSW4+SSW5+SSW6
print("SSW: ",SSW)

SSW:  376.00000000000006
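
The six cell-by-cell calculations can be collapsed into one expression by subtracting each student's cell mean with `groupby(...).transform("mean")`. This is a sketch on hypothetical toy data, not the agency's data.

```python
import pandas as pd

# Hypothetical toy data with the same columns as mark_scored
toy = pd.DataFrame({
    "Teacher": ["t1"] * 6 + ["t2"] * 6,
    "Gender": (["Male"] * 3 + ["Female"] * 3) * 2,
    "Mark": [70, 65, 72, 80, 78, 85, 82, 79, 88, 90, 86, 93],
})

# Mean of the teacher x gender cell each student belongs to, aligned row by row
cell_mean = toy.groupby(["Teacher", "Gender"])["Mark"].transform("mean")

# Sum of squared distances of every mark from its own cell mean
ssw_toy = float(((toy["Mark"] - cell_mean) ** 2).sum())
print("SSW (toy):", ssw_toy)
```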


#### Step 3: Calculating sum of squares combined (SSC)

Sum of Squares combined (SSC), i.e. the Sum of Squares due to the combination of gender and teacher, is what remains of SST after the other sums of squares are subtracted:

![](8_19.PNG)

In [30]:
SSC = SST - SSB_teacher - SSB_gender - SSW
SSB_combined = SSC
print('SSB_combined: ', SSB_combined)

SSB_combined:  26.777777777777885
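
As a cross-check, the combined sum of squares can also be computed directly from the cell means instead of by subtraction. The sketch below uses the cell means printed in Step 2 and assumes the balanced design with 3 students per cell; the formula $n\sum(\bar{y}_{cell}-\bar{y}_{teacher}-\bar{y}_{gender}+\bar{y}_{overall})^2$ is the standard one for balanced data and is not taken from the agency's notebook.

```python
import pandas as pd

# Cell means printed in Step 2 (rows: teacher, columns: gender)
cell_means = pd.DataFrame(
    {
        "Male": [69.33333333333333, 82.0, 66.33333333333333],
        "Female": [81.33333333333333, 89.0, 73.0],
    },
    index=["teacher1", "teacher2", "teacher3"],
)
n_per_cell = 3  # students per teacher x gender cell (balanced design)

# With equal cell sizes, marginal means are plain averages of the cell means
teacher_means = cell_means.mean(axis=1)
gender_means = cell_means.mean(axis=0)
overall_mean = cell_means.to_numpy().mean()

# Part of each cell mean not explained by teacher and gender separately
interaction = (
    cell_means.sub(teacher_means, axis=0).sub(gender_means, axis=1) + overall_mean
)
ssc_direct = n_per_cell * float((interaction ** 2).to_numpy().sum())
print("SSC (direct):", ssc_direct)
```

The result agrees with the value obtained by subtraction above, up to floating-point rounding.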


#### Step 4: Calculating degrees of freedom of sum of squares
> Degrees of Freedom: Sum of Squares total

  No. of observations (N)=18

  Degrees of freedom=N-1= 18-1=17

> Degrees of Freedom: Sum of Squares between (Gender)

  No. of groups (k1)=2 (male and female)

  Degrees of freedom= k1-1= 2-1=1

> Degrees of Freedom: Sum of Squares between (Teacher)

  No. of groups (k2)=3 (teacher1,teacher2,teacher3)

  Degrees of freedom= k2-1=3-1=2

> Degrees of Freedom: Sum of Squares within

  No. of distinct combinations (x) = k1 × k2 = 6

  No. of elements in each combination (n) = 3

  Degrees of freedom = x × (n-1) = 6 × 2 = 12

> Degrees of Freedom: Sum of Squares combined

  Degrees of freedom = degrees of freedom between (gender) × degrees of freedom between (teacher)

  Degrees of freedom = 1 × 2 = 2

In [31]:
df_total = len(mark_scored.Mark) - 1
print("df_total: ", df_total)

df_total:  17


In [32]:
cat_teach = pd.Series(pd.Categorical(mark_scored.Teacher))
df_between_teacher = len(cat_teach.cat.categories)-1
print("df_between_teacher: ",df_between_teacher)

df_between_teacher:  2


In [33]:
cat_gender = pd.Series(pd.Categorical(mark_scored.Gender))
df_between_gender = len(cat_gender.cat.categories)-1
print("df_between_gender: ",df_between_gender)

df_between_gender:  1


In [34]:
df_combined = df_between_teacher * df_between_gender
print("df_combined: ",df_combined)

df_combined:  2


In [35]:
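# N minus the number of teacher x gender cells (18 - 6), instead of a hard-coded n - 1 = 2
df_within = len(mark_scored) - len(cat_teach.cat.categories) * len(cat_gender.cat.categories)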
print("df_within: ", df_within)

df_within:  12
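
A quick sanity check is that the component degrees of freedom add up to the total, just as the sums of squares do. The sketch below restates the counts from this step; the variable names `a`, `b` and `n` are introduced here for illustration.

```python
a = 3  # number of teachers
b = 2  # number of genders
n = 3  # students in each teacher x gender cell

df_total = a * b * n - 1              # N - 1
df_teacher = a - 1
df_gender = b - 1
df_combined = df_teacher * df_gender
df_within = a * b * (n - 1)

df_parts = df_teacher + df_gender + df_combined + df_within
print("df_total:", df_total, "sum of components:", df_parts)
```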


#### Step 5: Calculating mean square values
Each mean square value is calculated as the ratio of a sum of squares to its degrees of freedom, as shown below:

In [36]:
MSB_teacher = SSB_teacher/df_between_teacher
MSB_gender = SSB_gender/df_between_gender
MSB_combined = SSB_combined/df_combined
MSW = SSW/df_within
print("MSB_teacher: ", MSB_teacher)
print("MSB_gender: ", MSB_gender)
print("MSB_combined: ", MSB_combined)
print("MSW: ", MSW)

MSB_teacher:  386.1666666666665
MSB_gender:  329.388888888889
MSB_combined:  13.388888888888943
MSW:  31.33333333333334


#### Step 6: Calculating F statistic

As in the case of one way ANOVA, the next step is to calculate the F statistic, which is the ratio of the mean square between (MSB) to the mean square within (MSW).

The agency calculates three F statistics: one for each of the independent variables and one for the combined effect of both independent variables. The F statistics to be calculated are given below.

- F statistic for the attribute teacher  
- F statistic for the attribute gender
- F statistic for the combination of teacher and gender



In [37]:
F_teacher = MSB_teacher/MSW
F_gender = MSB_gender/MSW
F_combined = MSB_combined/MSW
print("F_teacher: ", F_teacher)
print("F_gender: ", F_gender)
print("F_combined: ", F_combined)

F_teacher:  12.324468085106377
F_gender:  10.512411347517732
F_combined:  0.42730496453900874


#### Step 7: Calculating critical F statistics

The next task is to calculate the critical F values. The agency calculates three critical F values, taking the level of significance and the degrees of freedom into consideration. The level of significance is set at 0.05.

The SciPy function [f.ppf()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html), the percent point function (inverse CDF) of the F distribution, is used to find the critical F values.

In [38]:
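from scipy.stats import f
# Critical value = 0.95 quantile (1 - level of significance) of F(df of the effect, df_within)
F_critical_teacher = f.ppf(0.95, df_between_teacher, df_within)
F_critical_gender = f.ppf(0.95, df_between_gender, df_within)
F_critical_combined = f.ppf(0.95, df_combined, df_within)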
print("F_critical_teacher: ",F_critical_teacher)
print("F_critical_gender: ",F_critical_gender)
print("F_critical_combined: ",F_critical_combined)

F_critical_teacher:  3.8852938346523933
F_critical_gender:  4.747225346722511
F_critical_combined:  3.8852938346523933
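
An equivalent decision rule compares a P value with the level of significance instead of comparing F values. The sketch below uses `f.sf()`, the survival function of the F distribution in SciPy, with the F statistics and degrees of freedom printed in the earlier steps; the dictionary layout is just one way to organize the inputs.

```python
from scipy.stats import f

alpha = 0.05    # level of significance
df_within = 12  # denominator degrees of freedom from Step 4

# (calculated F statistic, numerator degrees of freedom) from Steps 4 and 6
effects = {
    "teacher": (12.324468085106377, 2),
    "gender": (10.512411347517732, 1),
    "teacher x gender": (0.42730496453900874, 2),
}

# P value = probability of an F value at least this large under the null hypothesis
p_values = {
    name: f.sf(F_value, dfn, df_within) for name, (F_value, dfn) in effects.items()
}

for name, p in p_values.items():
    print(f"{name}: p = {p:.6f}, reject H0: {p < alpha}")
```

Rejecting the null hypothesis when the P value is below 0.05 gives the same decision as rejecting it when the calculated F exceeds the critical F.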


The sums of squares, degrees of freedom, mean squares, calculated F statistics and critical F values are given in the following table.

![](8_20.PNG)

#### Step 8: Comparing F values and drawing conclusion

From the F table it is observed that, __for both the independent attributes, i.e. teacher and gender,__ $F_{calculated} > F_{critical}$.

Hence we can reject the null hypothesis in both these cases and conclude:
- The performance of a student is affected by the teacher under whom he/she has studied.
- There is a significant difference between the performance of males and females.

__For the combination of both the factors (gender + teacher), $F_{calculated} < F_{critical}$.__

Hence we fail to reject the null hypothesis and conclude:
- The combination of gender and teacher does not have a significant effect on the performance of a student.

__Solution to Challenge 6.3:__ The agency has concluded that there is a significant difference between the performance of male and female students. The difference in performance of the students who have studied under different teachers is also significant. Hence both factors, i.e. teacher and gender, have a significant effect on the performance of students, while their combination does not.

### Performing two way ANOVA in Python

Two way ANOVA can be performed in Python using the ols() function from statsmodels together with anova_lm(). The ols() function takes a model formula and the data frame containing the observations as input. The data frame must have each attribute as a separate column.

The table returned by anova_lm() gives the sum of squares, the degrees of freedom, the calculated F statistic and the P value for each independent attribute and also for the combination of the independent variables.

> Comparing the level of significance with the P values, we either reject or fail to reject the null hypothesis.



In [40]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
result_two_way_anova = ols('Mark ~ Teacher + Gender + Teacher:Gender', data = mark_scored).fit()
aov_table = sm.stats.anova_lm(result_two_way_anova, typ = 2)
aov_table

Unnamed: 0,sum_sq,df,F,PR(>F)
Teacher,772.333333,2.0,12.324468,0.001232
Gender,329.388889,1.0,10.512411,0.007057
Teacher:Gender,26.777778,2.0,0.427305,0.661811
Residual,376.0,12.0,,
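
The table can be read back against the manual calculation. In the sketch below, the `sum_sq` and `df` columns are copied from the output above into a small data frame (an illustrative stand-in for `aov_table`); the sums of squares are added up to recover SST, and the F statistics are recomputed from the mean squares.

```python
import pandas as pd

# sum_sq and df columns copied from the ANOVA table above
aov = pd.DataFrame(
    {
        "sum_sq": [772.333333, 329.388889, 26.777778, 376.0],
        "df": [2.0, 1.0, 2.0, 12.0],
    },
    index=["Teacher", "Gender", "Teacher:Gender", "Residual"],
)

# The four sums of squares add up to SST from Step 3
total_ss = aov["sum_sq"].sum()

# Mean square = sum of squares / degrees of freedom; F = mean square / residual mean square
mean_sq = aov["sum_sq"] / aov["df"]
F_from_table = (mean_sq / mean_sq["Residual"]).drop("Residual")

print("Total sum of squares:", total_ss)
print(F_from_table)
```

The `Residual` row corresponds to SSW in the manual calculation, and the `Teacher:Gender` row corresponds to the combined sum of squares.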
