# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll also perform some investigations to compare the method to the t-tests you previously employed to conduct hypothesis testing.

## Objectives

In this lab you will: 

- Use ANOVA for testing multiple pairwise comparisons 
- Interpret results of an ANOVA and compare them to a t-test

## Load the data

Start by loading in the data stored in the file `'ToothGrowth.csv'`: 

In [3]:
# Your code here
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.read_csv('ToothGrowth.csv')
data

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5
5,10.0,VC,0.5
6,11.2,VC,0.5
7,11.2,VC,0.5
8,5.2,VC,0.5
9,7.0,VC,0.5


## Generate the ANOVA table

Now generate an ANOVA table to analyze the influence of the supplement and dosage on tooth length:

In [5]:
# Your code here
formula = 'len ~ C(supp) + dose'
lm = ols(formula, data).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

               sum_sq    df           F        PR(>F)
C(supp)    205.350000   1.0   11.446768  1.300662e-03
dose      2224.304298   1.0  123.988774  6.313519e-16
Residual  1022.555036  57.0         NaN           NaN


## Interpret the output

Make a brief comment regarding the statistics and the effect of supplement and dosage on tooth length: 

In [2]:
# Your comment here
print("""
The p-value for supplement is about 0.0013, and the p-value for dose is about 6.3e-16.
At an alpha of 0.05, both factors have a statistically significant effect on tooth length,
because both p-values are far below alpha.

Dose has the much larger F statistic (about 124 versus about 11.4) and sum of squares,
so it accounts for more of the variation in tooth length than supplement type does.""")



The p-value for supplement is about 0.0013, and the p-value for dose is about 6.3e-16.
At an alpha of 0.05, both factors have a statistically significant effect on tooth length,
because both p-values are far below alpha.

Dose has the much larger F statistic (about 124 versus about 11.4) and sum of squares,
so it accounts for more of the variation in tooth length than supplement type does.


## Compare to t-tests

Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples:

In [5]:
# Your code here
VC_df = data[data['supp'] == 'VC']
OJ_df = data[data['supp'] == 'OJ']

Now run a t-test between these two groups and print the associated two-sided p-value: 

In [10]:
# Calculate the 2-sided p-value for a t-test comparing the two supplement groups
import scipy.stats as stats
stats.ttest_ind(VC_df['len'], OJ_df['len'])

Ttest_indResult(statistic=-1.91526826869527, pvalue=0.06039337122412849)

## A 2-Category ANOVA F-test is equivalent to a 2-tailed t-test!

Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is the same as performing a 2-tailed t-test! So, the p-value in the table should be identical to your calculation above.

> Note: there may be a small fractional difference (<0.001) between the two values due to rounding differences between implementations.
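The equivalence described above can be checked numerically. The following is a minimal sketch, assuming SciPy is available; the two samples are hypothetical toy values, not the tooth growth data. It compares the default equal-variance t-test from `scipy.stats.ttest_ind` with a one-way ANOVA from `scipy.stats.f_oneway`: for two groups, the F statistic should equal the squared t statistic, and the p-values should agree.

```python
import numpy as np
from scipy import stats

# Hypothetical toy samples for illustration only (not from ToothGrowth.csv)
group_a = np.array([1.0, 2.0, 3.0, 4.0])
group_b = np.array([2.5, 3.5, 5.0, 6.5])

# Two-sided t-test; equal variances are assumed by default
t_stat, t_pvalue = stats.ttest_ind(group_a, group_b)

# One-way ANOVA F-test on the same two groups
f_stat, f_pvalue = stats.f_oneway(group_a, group_b)

print("t squared:", t_stat ** 2, " F:", f_stat)
print("t-test p-value:", t_pvalue, " ANOVA p-value:", f_pvalue)
```

The match relies on the pooled-variance form of the t-test; Welch's t-test (`equal_var=False`) would generally not reproduce the ANOVA p-value.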

In [22]:
# Your code here; conduct an ANOVA F-test of the OJ and VC supplement groups.
# This is a one-way ANOVA: supplement is the only factor.

# Recombine the two supplement samples into a single DataFrame
combined_df = pd.concat([VC_df, OJ_df])

# 'supp' is the column holding the labels 'VC' and 'OJ', so it is the categorical
# predictor; 'len' is the numeric response and goes on the left of the formula
model = ols('len ~ C(supp)', data=combined_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)  # the keyword is `typ`, not `type`
print(anova_table)

# Compare the p-value to that of the t-test above.
# They should match (there may be a tiny fractional difference due to rounding errors in varying implementations)

In [15]:
VC_list = VC_df['len']
VC_list

0      4.2
1     11.5
2      7.3
3      5.8
4      6.4
5     10.0
6     11.2
7     11.2
8      5.2
9      7.0
10    16.5
11    16.5
12    15.2
13    17.3
14    22.5
15    17.3
16    13.6
17    14.5
18    18.8
19    15.5
20    23.6
21    18.5
22    33.9
23    25.5
24    26.4
25    32.5
26    26.7
27    21.5
28    23.3
29    29.5
Name: len, dtype: float64

In [23]:
combined_df = pd.concat([VC_df, OJ_df])
combined_df

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5
5,10.0,VC,0.5
6,11.2,VC,0.5
7,11.2,VC,0.5
8,5.2,VC,0.5
9,7.0,VC,0.5


## Run multiple t-tests

While the 2-category ANOVA test is identical to a 2-tailed t-test, performing multiple t-tests leads to the multiple comparisons problem. To investigate this, look at the various sample groups you could create from the 2 features: 

In [7]:
for group_name, group_lens in data.groupby(['supp', 'dose'])['len']:
    # group_name is a (supp, dose) tuple; group_lens holds that group's tooth lengths
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)
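To get a rough sense of why this matters, the sketch below counts the pairwise comparisons among the six groups and estimates the chance of at least one false positive at an alpha of 0.05. It assumes the tests are independent, which is not strictly true when comparisons share groups, so treat the figure as an illustration rather than an exact error rate.

```python
from math import comb

alpha = 0.05
n_groups = 6  # the six supplement-dose groups listed above

# Number of distinct pairs of groups
n_tests = comb(n_groups, 2)

# Probability of at least one false positive across all tests,
# under the simplifying assumption that the tests are independent
familywise_error = 1 - (1 - alpha) ** n_tests

print(n_tests)                     # 15
print(round(familywise_error, 3))  # 0.537
```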


Although running many t-tests is bad practice, examine what happens when you calculate a t-test for every combination of these groups. To do this, generate all pairwise combinations of the above groups. For each pair, calculate the p-value of a two-sided t-test, then print the pair of groups and its p-value.

In [None]:
# Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
# for all combinations of the supplement-dose groups listed above. 
# (Since there isn't a control group, compare each group to every other group.)
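One possible approach is sketched below, not as the definitive solution. The helper `pairwise_ttests` is a hypothetical name introduced here, and the small `toy` DataFrame is made-up data that only mimics the column names of `ToothGrowth.csv`, so the example runs on its own. It uses `itertools.combinations` to pair up the groups and `scipy.stats.ttest_ind` for each two-sided test.

```python
from itertools import combinations

import pandas as pd
from scipy import stats


def pairwise_ttests(df, group_cols=('supp', 'dose'), value_col='len'):
    """Return a list of (group_a, group_b, p_value) for every pair of groups."""
    # Map each group label, e.g. ('OJ', 0.5), to its Series of values
    groups = dict(iter(df.groupby(list(group_cols))[value_col]))
    results = []
    for (name_a, vals_a), (name_b, vals_b) in combinations(groups.items(), 2):
        # Two-sided, equal-variance t-test between the two groups
        p_value = stats.ttest_ind(vals_a, vals_b).pvalue
        results.append((name_a, name_b, p_value))
    return results


# Hypothetical toy data with the same column names as ToothGrowth.csv
toy = pd.DataFrame({
    'supp': ['VC'] * 4 + ['OJ'] * 4,
    'dose': [0.5, 0.5, 1.0, 1.0] * 2,
    'len': [1.0, 2.0, 5.0, 7.0, 3.0, 4.5, 8.0, 9.5],
})

for name_a, name_b, p_value in pairwise_ttests(toy):
    print(name_a, name_b, p_value)
```

On the lab data, the call would be `pairwise_ttests(data)`, assuming `data` still holds the full DataFrame loaded from `ToothGrowth.csv`.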

## Summary

In this lab, you used ANOVA to generalize hypothesis testing to multiple groups and factors, and compared its results to those of t-tests.