# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll then compare the method to the t-tests you previously used for hypothesis testing.

## Objectives

You will be able to:
* Use ANOVA for testing multiple pairwise comparisons
* Understand and explain the methodology behind ANOVA tests

## Loading the Data

Start by loading in the data stored in the file **ToothGrowth.csv**.

In [8]:
import pandas as pd 
df = pd.read_csv("ToothGrowth.csv")
df.head(20)

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5
5,10.0,VC,0.5
6,11.2,VC,0.5
7,11.2,VC,0.5
8,5.2,VC,0.5
9,7.0,VC,0.5


## Generating the ANOVA Table

Now generate an ANOVA table to analyze the influence of the supplement type and dosage on tooth length.

In [7]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

formula = 'len ~ C(supp) + C(dose)'
lm = ols(formula, df).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

               sum_sq    df          F        PR(>F)
C(supp)    205.350000   1.0  14.016638  4.292793e-04
C(dose)   2426.434333   2.0  82.810935  1.871163e-17
Residual   820.425000  56.0        NaN           NaN


## Reading the Table

Briefly comment on what the table's statistics say about the effect of supplement and dosage on tooth length.

In [None]:
# Both supplement and dose appear to be influential: both p-values are far below .05,
# so we reject the null hypothesis for each factor. Dose accounts for much more of the
# variation in tooth length (larger sum of squares and F-statistic). Its smaller PR(>F) means
# stronger evidence against the null hypothesis; it is not by itself a measure of effect size.
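
To see where the `F` and `PR(>F)` columns come from, the sketch below recomputes them from the sums of squares and degrees of freedom printed in the table above. The numbers are copied by hand from that output; the variable names are illustrative only and not part of the lab's solution.

```python
from scipy import stats

# Values copied from the ANOVA table printed above.
sum_sq = {"C(supp)": 205.350000, "C(dose)": 2426.434333}
df_effect = {"C(supp)": 1, "C(dose)": 2}
ss_resid, df_resid = 820.425000, 56

# Residual mean square: the unexplained variance per residual degree of freedom.
ms_resid = ss_resid / df_resid

results = {}
for term, ss in sum_sq.items():
    ms_effect = ss / df_effect[term]      # mean square for the factor
    f_stat = ms_effect / ms_resid         # F = MS_effect / MS_residual
    # Upper-tail probability of the F distribution gives PR(>F).
    p_value = stats.f.sf(f_stat, df_effect[term], df_resid)
    results[term] = (f_stat, p_value)
    print(f"{term}: F = {f_stat:.4f}, PR(>F) = {p_value:.3e}")
```

The printed values should agree with the table up to rounding, which makes it clear that each F-statistic is a ratio of explained to unexplained variance.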

## Comparing to T-Tests

Now that you've had a brief chance to work with ANOVA, it's interesting to compare the results to those from the t-tests you were just working with. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterwards, you'll conduct a t-test to compare the tooth length of these two samples.

In [10]:
vc = df[(df['supp'] == 'VC')]
oj = df[(df['supp'] == 'OJ')]

Now run a t-test comparing these two groups and print the associated two-sided p-value.

In [16]:
#Your code here; calculate the 2-sided p-value for a t-test comparing the two supplement groups.
# H_0: mu_OJ = mu_VC
# H_A: mu_OJ =/= mu_VC

import numpy as np
from scipy import stats 
import math

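# Welch's t-test (equal_var=False) does not assume equal variances in the two groups.
# Call ttest_ind from scipy.stats directly; the scipy.stats.stats namespace is deprecated.
stats.ttest_ind(vc['len'], oj['len'], equal_var=False)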

Ttest_indResult(statistic=-1.91526826869527, pvalue=0.06063450788093387)

## A 2-Category ANOVA F-Test is Equivalent to a 2-Tailed t-Test!

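Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is equivalent to a 2-tailed t-test that pools the variances of the two groups, and the F-statistic is the square of the t-statistic. So, the p-value in the table should closely match your calculation above.

> Note: there may be a small difference (<0.001) between the two values. The t-test above used `equal_var=False` (Welch's t-test), which adjusts the degrees of freedom for unequal variances, whereas the F-test pools the variances. 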

In [17]:
#Your code here; conduct an ANOVA F-test of the oj and vc supplement groups.
#Compare the p-value to that of the t-test above. 
#They should nearly match (Welch's t-test and the pooled-variance F-test use slightly different degrees of freedom)

formula = 'len ~ C(supp)'
lm = ols(formula, df).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

               sum_sq    df         F    PR(>F)
C(supp)    205.350000   1.0  3.668253  0.060393
Residual  3246.859333  58.0       NaN       NaN
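
The relationship between the two tests can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the two samples are randomly generated (not the ToothGrowth data), and it uses `scipy.stats.f_oneway` as a stand-in for the one-factor ANOVA table. It compares the pooled-variance t-test, Welch's t-test, and the F-test.

```python
import numpy as np
from scipy import stats

# Synthetic samples (an assumption for illustration, not the lab's data):
# equal group sizes, unequal spreads.
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=30)
b = rng.normal(loc=11.0, scale=4.0, size=30)

# Student's t-test with pooled variance, Welch's t-test, and one-way ANOVA.
t_pooled, p_pooled = stats.ttest_ind(a, b, equal_var=True)
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)
f_stat, p_anova = stats.f_oneway(a, b)

print(f"pooled t^2 = {t_pooled**2:.6f}   F = {f_stat:.6f}")
print(f"pooled p   = {p_pooled:.6f}   ANOVA p = {p_anova:.6f}")
print(f"Welch p    = {p_welch:.6f}")
```

With two groups, the pooled t-test and the F-test give the same p-value, and the squared t-statistic equals F. With equal group sizes, Welch's test produces the same t-statistic but uses fewer degrees of freedom, so its p-value is slightly larger.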


## Generating Multiple T-Tests

While a 2-category ANOVA test is equivalent to a 2-tailed t-test, performing many t-tests leads to the multiple comparisons problem: each additional test increases the chance of at least one false positive. To investigate this, look at the various sample groups you could create from the 2 features: 

In [7]:
# Each iteration yields the (supp, dose) key and the matching 'len' values.
for group_name, data in df.groupby(['supp', 'dose'])['len']:
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)
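
The six groups listed above can be paired in 15 distinct ways. The sketch below estimates how quickly the chance of at least one false positive grows with that many tests. It assumes the tests are independent, which pairwise tests sharing groups are not, so treat the result as a rough approximation. The significance level of 0.05 is also an assumption.

```python
from math import comb

n_groups = 6     # supplement-dose groups listed above
alpha = 0.05     # per-test significance level (assumed)

# Number of distinct pairwise comparisons: 6 choose 2.
n_tests = comb(n_groups, 2)

# Probability of at least one false positive if every null hypothesis is true
# and the tests were independent (an approximation here).
fwer = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: per-test threshold that keeps the family-wise rate near alpha.
bonferroni_alpha = alpha / n_tests

print(f"{n_tests} tests, family-wise error rate ~ {fwer:.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.5f}")
```

Under these assumptions, the family-wise error rate is roughly 0.54, more than ten times the nominal 0.05 level of a single test.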


While this is bad practice, examine the effects of running multiple t-tests on the various combinations of these groups. To do this, generate all combinations of the above groups. For each pairwise combination, calculate the p-value of a two-sided t-test. Print the group combinations and their associated p-values.

In [37]:
#Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
#for all combinations of the supplement-dose groups listed above. 
#(Since there isn't a control group, compare each group to every other group.)
# Pair each supplement-dose subset with a short label for printing.
vc_5 = (df[(df['supp'] == 'VC') & (df['dose'] == 0.5)], 'vc_5')
vc_1 = (df[(df['supp'] == 'VC') & (df['dose'] == 1.0)], 'vc_1')
vc_2 = (df[(df['supp'] == 'VC') & (df['dose'] == 2.0)], 'vc_2')
oj_5 = (df[(df['supp'] == 'OJ') & (df['dose'] == 0.5)], 'oj_5')
oj_1 = (df[(df['supp'] == 'OJ') & (df['dose'] == 1.0)], 'oj_1')
oj_2 = (df[(df['supp'] == 'OJ') & (df['dose'] == 2.0)], 'oj_2')

dfs = [vc_5, vc_1, vc_2, oj_5, oj_1, oj_2]

# Compare every group with every other group.
# Note: each pair is printed twice, once in each order.
for df1 in dfs:
    for df2 in dfs:
        if df2[1] != df1[1]:
            print(df1[1], 'compared with', df2[1], ':', 
                  stats.ttest_ind(df1[0]['len'], df2[0]['len'], equal_var=False)[1])


vc_5 compared with vc_1 : 6.811017702865016e-07
vc_5 compared with vc_2 : 4.6815774144921145e-08
vc_5 compared with oj_5 : 0.006358606764096813
vc_5 compared with oj_1 : 3.6552067303259103e-08
vc_5 compared with oj_2 : 1.3621396478988818e-11
vc_1 compared with vc_5 : 6.811017702865016e-07
vc_1 compared with vc_2 : 9.155603056638692e-05
vc_1 compared with oj_5 : 0.04601033257637553
vc_1 compared with oj_1 : 0.001038375872299884
vc_1 compared with oj_2 : 2.3610742020468435e-07
vc_2 compared with vc_5 : 4.6815774144921145e-08
vc_2 compared with vc_1 : 9.155603056638692e-05
vc_2 compared with oj_5 : 7.196253524006043e-06
vc_2 compared with oj_1 : 0.09652612338267014
vc_2 compared with oj_2 : 0.9638515887233756
oj_5 compared with vc_5 : 0.006358606764096813
oj_5 compared with vc_1 : 0.04601033257637553
oj_5 compared with vc_2 : 7.196253524006043e-06
oj_5 compared with oj_1 : 8.784919055161479e-05
oj_5 compared with oj_2 : 1.3237838776972294e-06
oj_1 compared with vc_5 : 3.6552067303259103e-08
...
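
One common way to guard against the multiple comparisons problem is a Bonferroni correction. The sketch below applies it to the unique pairs visible in the output above, with p-values rounded by hand. It is an illustration, not part of the lab's solution: the 0.05 significance level is an assumption, and the `oj_1` versus `oj_2` comparison does not appear in the visible output, so it is left out of the dictionary while still counting toward the 15 tests.

```python
# p-values copied (rounded) from the output above; one entry per unique pair.
p_values = {
    ("vc_5", "vc_1"): 6.811e-07,
    ("vc_5", "vc_2"): 4.682e-08,
    ("vc_5", "oj_5"): 6.359e-03,
    ("vc_5", "oj_1"): 3.655e-08,
    ("vc_5", "oj_2"): 1.362e-11,
    ("vc_1", "vc_2"): 9.156e-05,
    ("vc_1", "oj_5"): 4.601e-02,
    ("vc_1", "oj_1"): 1.038e-03,
    ("vc_1", "oj_2"): 2.361e-07,
    ("vc_2", "oj_5"): 7.196e-06,
    ("vc_2", "oj_1"): 9.653e-02,
    ("vc_2", "oj_2"): 9.639e-01,
    ("oj_5", "oj_1"): 8.785e-05,
    ("oj_5", "oj_2"): 1.324e-06,
}

alpha = 0.05           # assumed significance level
n_tests = 15           # all pairwise comparisons among six groups
threshold = alpha / n_tests

# Pairs flagged without any correction versus with the Bonferroni threshold.
significant_raw = [pair for pair, p in p_values.items() if p < alpha]
significant_bonf = [pair for pair, p in p_values.items() if p < threshold]

print(f"Uncorrected (p < {alpha}): {len(significant_raw)} of {len(p_values)} pairs")
print(f"Bonferroni (p < {threshold:.5f}): {len(significant_bonf)} of {len(p_values)} pairs")
print("Dropped after correction:", sorted(set(significant_raw) - set(significant_bonf)))
```

Comparisons that only just cleared the 0.05 cutoff, such as `vc_1` versus `oj_5`, no longer count as significant once the threshold accounts for the number of tests.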

## Summary

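In this lab, you used ANOVA to generalize A/B testing methods to multiple groups and factors, compared a 2-category F-test to a 2-tailed t-test, and saw how running many pairwise t-tests leads to the multiple comparisons problem.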