# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll also perform some investigations to compare the method to the t-tests you previously employed to conduct hypothesis testing.

## Objectives

In this lab you will: 

- Use ANOVA for testing multiple pairwise comparisons 
- Interpret results of an ANOVA and compare them to a t-test

## Load the data

Start by loading in the data stored in the file `'ToothGrowth.csv'`: 

In [1]:
# Your code here
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [6]:
teeth = pd.read_csv('ToothGrowth.csv')
print(teeth.head())
teeth.describe()

    len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5


             len       dose
count  60.000000  60.000000
mean   18.813333   1.166667
std     7.649315   0.628872
min     4.200000   0.500000
25%    13.075000   0.500000
50%    19.250000   1.000000
75%    25.275000   2.000000
max    33.900000   2.000000


## Generate the ANOVA table

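Now generate an ANOVA table to analyze the influence of the supplement and dosage on tooth length.

To generate the ANOVA table, you first fit a linear model and then build the table from the fitted model object. The formula takes the general form:

Control_Column ~ C(factor_col1) + factor_col2 + C(factor_col3) + ... + X

Here `Control_Column` stands for the dependent (response) variable, and the terms to the right of `~` are the predictors. We indicate categorical variables by wrapping them with `C()`; a column left unwrapped, such as `dose` in the cell below, enters the model as a numeric predictor.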

In [15]:

# set up pandas to display floats in a more human friendly way
pd.options.display.float_format = '{:,.5f}'.format

# Your code here
formula = 'len ~ C(supp) + dose'
lm = ols(formula, teeth).fit()
table = sm.stats.anova_lm(lm, typ=2)

print(table)

              sum_sq       df         F  PR(>F)
C(supp)    205.35000  1.00000  11.44677 0.00130
dose     2,224.30430  1.00000 123.98877 0.00000
Residual 1,022.55504 57.00000       nan     nan
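
The `F` column can be recovered from the other columns: each term's mean square (`sum_sq / df`) is divided by the residual mean square. The sketch below redoes that arithmetic with the values copied from the table above; it should reproduce the `F` column up to rounding.

```python
# Values copied from the ANOVA table above
sum_sq = {'C(supp)': 205.35000, 'dose': 2224.30430}
df = {'C(supp)': 1, 'dose': 1}
resid_ss, resid_df = 1022.55504, 57

# Mean square error: residual variation per residual degree of freedom
mse = resid_ss / resid_df

f_stats = {}
for term in sum_sq:
    # F = (mean square of the term) / (mean square error)
    f_stats[term] = (sum_sq[term] / df[term]) / mse
    print(f'{term}: F = {f_stats[term]:.3f}')
```

A large F means the term explains far more variation per degree of freedom than is left over in the residuals.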


## Interpret the output

Make a brief comment regarding the statistics and the effect of supplement and dosage on tooth length: 

In [16]:
# Your comment here

# A p-value (PR(>F)) below 0.05 (or whatever we set 𝛼 to) means we REJECT the null hypothesis.
# H0 (the null hypothesis) is that the factor has no effect on tooth length.
#
# Here both factors have p-values below 0.05, so both appear influential:
# dose has the larger F-statistic (123.99, p-value rounds to 0.00000),
# followed by supplement (11.45, p-value 0.0013).

## Compare to t-tests

Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples: 

In [26]:
# Your code here
oj_len = teeth[teeth['supp'] == 'OJ']['len']
vc_len = teeth[teeth['supp'] == 'VC']['len']
print(vc_len.head())

0    4.20000
1   11.50000
2    7.30000
3    5.80000
4    6.40000
Name: len, dtype: float64


Now run a t-test between these two groups and print the associated two-sided p-value: 

In [28]:
# Calculate the 2-sided p-value for a t-test comparing the two supplement groups

from scipy import stats

# equal_var=False runs Welch's t-test, which does not assume equal group variances
t_result = stats.ttest_ind(oj_len, vc_len, equal_var=False)
t_result[1]

0.06063450788093387

## A 2-Category ANOVA F-test is equivalent to a 2-tailed t-test!

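Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is equivalent to a 2-tailed t-test that pools the variance of the two groups (Student's t-test), so the p-value in the table should closely match your calculation above.

> Note: expect a small difference (<0.001) between the two values. The t-test above was run with `equal_var=False` (Welch's t-test), which adjusts the degrees of freedom for unequal variances. With `equal_var=True`, the two p-values agree up to floating-point precision. 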

In [30]:
# Your code here; conduct an ANOVA F-test of the oj and vc supplement groups.
# Compare the p-value to that of the t-test above. 
# They should agree to about three decimal places
# (the t-test above used Welch's correction, so the match is close but not exact)

formula_supp = 'len ~ C(supp)'
lm = ols(formula_supp, teeth).fit()
table = sm.stats.anova_lm(lm, typ=2)

print(table)

# The p-values are close: 0.06039 here vs. 0.06063 from the Welch t-test above

              sum_sq       df       F  PR(>F)
C(supp)    205.35000  1.00000 3.66825 0.06039
Residual 3,246.85933 58.00000     nan     nan
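
The equivalence can be checked by hand: for two groups, the square of the pooled-variance t statistic equals the one-way ANOVA F statistic. The sketch below uses only the standard library and two made-up samples (hypothetical values, not taken from ToothGrowth) to show this.

```python
from math import isclose
from statistics import mean, variance

# Made-up samples for illustration only (not ToothGrowth data)
a = [10.0, 12.0, 9.0, 14.0, 11.0]
b = [13.0, 15.0, 12.0, 18.0, 16.0, 14.0]
n_a, n_b = len(a), len(b)

# Pooled-variance (Student's) t statistic
pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
t_stat = (mean(a) - mean(b)) / (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5

# One-way ANOVA F statistic for the same two groups
grand_mean = mean(a + b)
ss_between = n_a * (mean(a) - grand_mean) ** 2 + n_b * (mean(b) - grand_mean) ** 2
ss_within = (n_a - 1) * variance(a) + (n_b - 1) * variance(b)
f_stat = (ss_between / 1) / (ss_within / (n_a + n_b - 2))

print(t_stat ** 2, f_stat)
print(isclose(t_stat ** 2, f_stat))
```

Because the two statistics carry the same information and share the same residual degrees of freedom, they lead to the same p-value. Welch's t-test uses a different variance estimate and degrees of freedom, so its p-value can differ slightly.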


## Run multiple t-tests

While a 2-category ANOVA test is equivalent to a 2-tailed t-test, running many t-tests leads to the multiple comparisons problem: every extra test is another chance of a false positive, so the overall Type I error rate climbs above 𝛼. To investigate this, look at the various sample groups you could create from the 2 features: 

In [32]:
# Each item yielded by the groupby is a (group_name, data) pair
for group_name, data in teeth.groupby(['supp', 'dose'])['len']:
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)
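
To get a feel for the scale of the problem, the sketch below counts the pairwise comparisons among the six groups listed above and estimates the chance of at least one false positive at 𝛼 = 0.05. The estimate assumes the tests are independent, which pairwise tests that share groups are not, so treat it as a rough guide. The Bonferroni threshold is shown as one common, conservative correction.

```python
from itertools import combinations

# The six supplement-dose groups printed above
groups = [('OJ', 0.5), ('OJ', 1.0), ('OJ', 2.0),
          ('VC', 0.5), ('VC', 1.0), ('VC', 2.0)]
alpha = 0.05

# Every unordered pair of groups is one t-test
pairs = list(combinations(groups, 2))
n_tests = len(pairs)

# Chance of at least one false positive if every null hypothesis is true
# (assumes independent tests, which is only an approximation here)
familywise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: divide alpha by the number of tests
bonferroni_alpha = alpha / n_tests

print(n_tests)
print(round(familywise_error, 4))
print(round(bonferroni_alpha, 5))
```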


Although it is bad practice, examine the effect of running multiple t-tests on the various combinations of these groups. To do this, generate all pairwise combinations of the groups above. For each pair, calculate the p-value of a 2-sided t-test, then print the pair of groups alongside its p-value.

In [None]:
# Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
# for all combinations of the supplement-dose groups listed above. 
# (Since there isn't a control group, compare each group to every other group.)
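
One possible approach, offered as a sketch and not as the reference solution, is to pair the groups with `itertools.combinations` and run `scipy.stats.ttest_ind` on each pair. The helper name `pairwise_ttests` and the `toy_groups` values below are hypothetical, made up to exercise the function.

```python
from itertools import combinations
from scipy import stats

def pairwise_ttests(groups, equal_var=False):
    """Return {(name_a, name_b): two-sided p-value} for every pair of groups."""
    results = {}
    for (name_a, data_a), (name_b, data_b) in combinations(groups.items(), 2):
        result = stats.ttest_ind(data_a, data_b, equal_var=equal_var)
        results[(name_a, name_b)] = result.pvalue
    return results

# Made-up values purely to exercise the helper (not ToothGrowth data).
# With the lab data, the mapping could be built as:
# groups = {name: data for name, data in teeth.groupby(['supp', 'dose'])['len']}
toy_groups = {
    'A': [10.0, 12.0, 9.0, 14.0, 11.0],
    'B': [13.0, 15.0, 12.0, 18.0, 16.0],
    'C': [10.5, 11.5, 9.5, 13.0, 12.0],
}

p_values = pairwise_ttests(toy_groups)
for pair, p in p_values.items():
    print(pair, round(p, 4))
```

When reading the resulting p-values, keep in mind that with this many tests some will fall below 0.05 by chance alone.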

## Summary

In this lab, you used ANOVA to generalize hypothesis testing to multiple groups and factors, and compared its results to those of t-tests.