# ANOVA - Lab

## Introduction

In this lab, you'll practice generating an ANOVA table (AOV) and interpreting its output. You'll then compare the method to the t-tests you previously used for hypothesis testing.

## Objectives

You will be able to:
* Use ANOVA for testing multiple pairwise comparisons
* Understand and explain the methodology behind ANOVA tests

## Loading the Data

Start by loading in the data stored in the file **ToothGrowth.csv**.

In [9]:
# Your code here
import pandas as pd
data = pd.read_csv('ToothGrowth.csv')

In [2]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

## Generating the ANOVA Table

Now generate an ANOVA table to analyze the influence of the supplement type and dosage on tooth length.

In [3]:
#Your code here
formula = 'len ~ C(supp) + dose'
lm = ols(formula, data).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

               sum_sq    df           F        PR(>F)
C(supp)    205.350000   1.0   11.446768  1.300662e-03
dose      2224.304298   1.0  123.988774  6.313519e-16
Residual  1022.555036  57.0         NaN           NaN
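
The `F` and `PR(>F)` columns can be reproduced from `sum_sq` and `df`. The sketch below is illustrative only: it hard-codes the values printed in the table above, assumes `scipy` is available, and uses variable names introduced here rather than anything defined in the lab. Each mean square is `sum_sq / df`, `F` is an effect's mean square divided by the residual mean square, and the p-value is the upper tail of the corresponding F distribution.

```python
from scipy import stats

# Values copied from the ANOVA table printed above
ss_supp, df_supp = 205.350000, 1
ss_dose, df_dose = 2224.304298, 1
ss_resid, df_resid = 1022.555036, 57

# Mean square = sum of squares / degrees of freedom
ms_resid = ss_resid / df_resid

# F = effect mean square / residual mean square
f_supp = (ss_supp / df_supp) / ms_resid
f_dose = (ss_dose / df_dose) / ms_resid

# p-value = probability that an F(df_effect, df_resid) variable exceeds the observed F
p_supp = stats.f.sf(f_supp, df_supp, df_resid)
p_dose = stats.f.sf(f_dose, df_dose, df_resid)

print(f"supp: F = {f_supp:.4f}, p = {p_supp:.6f}")
print(f"dose: F = {f_dose:.4f}, p = {p_dose:.3e}")
```

Both results should agree with the table up to rounding of the copied inputs.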


## Reading the Table

Briefly comment on what the statistics in the table indicate about the effect of supplement and dosage on tooth length.

#Your comment here

## Comparing to T-Tests

Now that you've had a brief chance to work with ANOVA, it's interesting to compare the results to those from the t-tests you were just working with. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterwards, you'll conduct a t-test to compare the tooth length of these two samples.

In [4]:
data.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


In [6]:
#Your code here
OJ = data[data.supp=='OJ']
VC = data[data.supp=='VC']

Now run a t-test comparing these two groups and print the associated two-sided p-value.

In [7]:
#Your code here; calculate the 2-sided p-value for a t-test comparing the two supplement groups.
from scipy import stats
stats.ttest_ind(OJ.len, VC.len)

Ttest_indResult(statistic=1.91526826869527, pvalue=0.06039337122412849)

## A 2-Category ANOVA F-Test is Equivalent to a 2-Tailed t-Test!

Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is the same as performing a 2-tailed t-Test! So, the p-value in the table should be identical to your calculation above.

> Note: there may be a small fractional difference (<0.001) between the two values due to rounding differences between implementations.

In [8]:
#Your code here; conduct an ANOVA F-test of the OJ and VC supplement groups.
#Compare the p-value to that of the t-test above.
#They should match (there may be a tiny fractional difference due to rounding errors in varying implementations).
formula = 'len ~ C(supp)'
lm = ols(formula, data).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

               sum_sq    df         F    PR(>F)
C(supp)    205.350000   1.0  3.668253  0.060393
Residual  3246.859333  58.0       NaN       NaN
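
One way to see why the two p-values agree: with two groups, the ANOVA F statistic equals the square of the pooled-variance t statistic, and an F distribution with (1, n) degrees of freedom is the distribution of a squared t variable with n degrees of freedom. The sketch below checks this using only the t statistic reported earlier and the 58 residual degrees of freedom from the table. It assumes `scipy` is available, and the variable names are illustrative.

```python
from scipy import stats

t_stat = 1.91526826869527  # statistic reported by stats.ttest_ind above
df_resid = 58              # residual degrees of freedom from the ANOVA table

# Squaring the t statistic gives the two-group F statistic
f_stat = t_stat ** 2

# Two-sided t-test p-value: both tails of the t distribution
p_from_t = 2 * stats.t.sf(abs(t_stat), df_resid)

# F-test p-value: upper tail of the F(1, df_resid) distribution
p_from_f = stats.f.sf(f_stat, 1, df_resid)

print(f"t^2 = {f_stat:.6f}")
print(f"p from t-test = {p_from_t:.6f}, p from F-test = {p_from_f:.6f}")
```

Note that this equivalence relies on the equal-variance t-test, which is the default for `stats.ttest_ind`.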


## Generating Multiple T-Tests

While the 2-category ANOVA test is equivalent to a 2-tailed t-test, performing multiple t-tests leads to the multiple comparisons problem: every additional test is another chance of a false positive. To investigate this, look at the sample groups you could create from the 2 features:

In [6]:
# Iterate over each (supp, dose) group without rebinding `data`, which later cells still use
for group_name, lengths in data.groupby(['supp', 'dose'])['len']:
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)
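
To get a feel for the scale of the problem, the sketch below counts the pairwise tests among these 6 groups and estimates the chance of at least one false positive. It assumes a significance level of 0.05 and treats the tests as independent; pairwise tests that share groups are not independent, so the figure is only a rough guide. The Bonferroni threshold shown is one common correction and is not part of the lab itself.

```python
from math import comb

n_groups = 6   # the (supp, dose) combinations listed above
alpha = 0.05   # assumed per-test significance level

# Number of distinct pairs of groups
n_tests = comb(n_groups, 2)

# Probability of at least one false positive across all tests,
# under the simplifying assumption that the tests are independent
fwer = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: per-test threshold that keeps the overall rate near alpha
bonferroni_alpha = alpha / n_tests

print(f"{n_tests} pairwise tests")
print(f"family-wise error rate: {fwer:.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.4f}")
```

With 15 tests, the chance of at least one spurious "significant" result is roughly 54% under these assumptions.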


Although it is bad practice, examine the effect of running multiple t-tests across these groups. To do this, generate all pairwise combinations of the above groups. For each pair, calculate the p-value of a two-sided t-test, then print the pair alongside its p-value.

In [11]:
#Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
#for all combinations of the supplement-dose groups listed above.
#(Since there isn't a control group, compare each group to every other group.)
# Hint: stats.ttest_ind(OJ.len, VC.len)
# Collect each (supp, dose) group name together with its tooth lengths
groups = []
for name, lengths in data.groupby(['supp', 'dose'])['len']:
    groups.append((name, lengths))
groups

[(('OJ', 0.5), 30    15.2
  31    21.5
  32    17.6
  33     9.7
  34    14.5
  35    10.0
  36     8.2
  37     9.4
  38    16.5
  39     9.7
  Name: len, dtype: float64), (('OJ', 1.0), 40    19.7
  41    23.3
  42    23.6
  43    26.4
  44    20.0
  45    25.2
  46    25.8
  47    21.2
  48    14.5
  49    27.3
  Name: len, dtype: float64), (('OJ', 2.0), 50    25.5
  51    26.4
  52    22.4
  53    24.5
  54    24.8
  55    30.9
  56    26.4
  57    27.3
  58    29.4
  59    23.0
  Name: len, dtype: float64), (('VC', 0.5), 0     4.2
  1    11.5
  2     7.3
  3     5.8
  4     6.4
  5    10.0
  6    11.2
  7    11.2
  8     5.2
  9     7.0
  Name: len, dtype: float64), (('VC', 1.0), 10    16.5
  11    16.5
  12    15.2
  13    17.3
  14    22.5
  15    17.3
  16    13.6
  17    14.5
  18    18.8
  19    15.5
  Name: len, dtype: float64), (('VC', 2.0), 20    23.6
  21    18.5
  22    33.9
  23    25.5
  24    26.4
  25    32.5
  26    26.7
  27    21.5
  28    23.3
  29    29.5
  Name: len, dtype: float64)]
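
The cell above only collects the groups. A minimal sketch of the remaining step, the pairwise t-tests, might look like the following. To keep it self-contained, the tooth lengths are copied from the output above into a plain dictionary; in the notebook you would iterate over the `groups` list instead. The `p_values` name is introduced here for illustration, and `scipy` is assumed to be available.

```python
from itertools import combinations
from scipy import stats

# Tooth lengths per (supp, dose) group, copied from the output above
groups = {
    ('OJ', 0.5): [15.2, 21.5, 17.6, 9.7, 14.5, 10.0, 8.2, 9.4, 16.5, 9.7],
    ('OJ', 1.0): [19.7, 23.3, 23.6, 26.4, 20.0, 25.2, 25.8, 21.2, 14.5, 27.3],
    ('OJ', 2.0): [25.5, 26.4, 22.4, 24.5, 24.8, 30.9, 26.4, 27.3, 29.4, 23.0],
    ('VC', 0.5): [4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0],
    ('VC', 1.0): [16.5, 16.5, 15.2, 17.3, 22.5, 17.3, 13.6, 14.5, 18.8, 15.5],
    ('VC', 2.0): [23.6, 18.5, 33.9, 25.5, 26.4, 32.5, 26.7, 21.5, 23.3, 29.5],
}

# Run a two-sided, equal-variance t-test (the scipy default) for every pair of groups
p_values = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    p_values[(name_a, name_b)] = stats.ttest_ind(a, b).pvalue

# Print the pairs from smallest to largest p-value
for pair, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    print(pair, f"p = {p:.3g}")
```

Because this runs 15 tests on the same data, individual p-values below 0.05 should be read with caution: some would be expected by chance alone.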

## Summary

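In this lab, you used ANOVA to generalize A/B testing methods to multiple groups and factors. You also saw that a 2-category ANOVA F-test is equivalent to a 2-tailed t-test, and that running many pairwise t-tests leads to the multiple comparisons problem.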