---
author: Elizabeth Czarniak (CZARNIA_ELIZ@bentley.edu)
---

We're going to use R's `esoph` dataset, about esophageal cancer cases.
We will focus on the impact of age group (`agegp`) and alcohol consumption (`alcgp`)
on the number of cases of the cancer (`ncases`).  We ask, does the interaction
between these two factors affect the number of cases?

First, we load in the dataset.  (See how to quickly load some sample data.)

In [4]:
from rdatasets import data
data = data('esoph') 
data.head()

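|   | agegp | alcgp     | tobgp    | ncases | ncontrols |
|---|-------|-----------|----------|--------|-----------|
| 0 | 25-34 | 0-39g/day | 0-9g/day | 0      | 40        |
| 1 | 25-34 | 0-39g/day | 10-19    | 0      | 10        |
| 2 | 25-34 | 0-39g/day | 20-29    | 0      | 6         |
| 3 | 25-34 | 0-39g/day | 30+      | 0      | 5         |
| 4 | 25-34 | 40-79     | 0-9g/day | 0      | 27        |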


Next, we create a model that includes the response variable we care about,
plus the two categorical variables we will be testing, as well as their
interaction term.

In [5]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# C(...) marks a variable as categorical, and : denotes the interaction of two terms.
model = ols('ncases ~ C(alcgp) + C(agegp) + C(alcgp):C(agegp)', data = data).fit()

A two-way ANOVA with interaction tests the following three null hypotheses.

 1. There is no interaction between the two categorical variables.
    (If we reject this we do not test the other two hypotheses.)
 2. The mean response is the same across all groups of the first factor.
    (In our example, that says the mean `ncases` is the same for all age groups.)
 3. The mean response is the same across all groups of the second factor.
    (In our example, that says the mean `ncases` is the same for all alcohol consumption groups.)

We choose a value $\alpha$, with $0 \le \alpha \le 1$, as the Type I error rate. Here we use $\alpha=0.05$.

In [6]:
sm.stats.anova_lm(model, typ=2)

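|                   | sum_sq     | df   | F         | PR(>F)       |
|-------------------|------------|------|-----------|--------------|
| C(alcgp)          | 52.695287  | 3.0  | 4.723387  | 0.004862447  |
| C(agegp)          | 267.026108 | 5.0  | 14.361068 | 2.021935e-09 |
| C(alcgp):C(agegp) | 107.557743 | 15.0 | 1.928206  | 0.0363271    |
| Residual          | 238.0      | 64.0 | NaN       | NaN          |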
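
As a quick sanity check on the table above, each $F$ statistic should equal that row's mean square (`sum_sq / df`) divided by the residual mean square. The short sketch below redoes that arithmetic with the numbers copied by hand from the table; the variable names are ours and are not part of `statsmodels`.

```python
# (sum of squares, degrees of freedom) copied from the ANOVA table above.
rows = {
    'C(alcgp)': (52.695287, 3),
    'C(agegp)': (267.026108, 5),
    'C(alcgp):C(agegp)': (107.557743, 15),
}
resid_sum_sq, resid_df = 238.0, 64

# Residual mean square: the denominator of every F statistic in the table.
ms_resid = resid_sum_sq / resid_df

# F = (mean square of the term) / (residual mean square)
f_stats = {term: (ss / df) / ms_resid for term, (ss, df) in rows.items()}

for term, f in f_stats.items():
    print(f'{term}: F = {f:.4f}')
```

The printed values should agree with the `F` column of the table up to rounding.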


The $p$-value for the interaction of age group and alcohol consumption is in
the third row, final column, $3.63271\times10^{-2}$.  It is less than $\alpha$,
so we can reject the null hypothesis that age group and alcohol consumption
do not interact with regard to the number of esophageal cancer cases.  That is,
we have reason to believe that their interaction does affect the outcome.

As we stated when we listed the hypotheses to test, since we rejected the first
null hypothesis, we will not test the other two.  If you had instead failed
to reject the first null hypothesis, you could then consider the $p$-values in the
first two rows of the above table, one for each of the two factors.
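
The decision procedure described above can be sketched as a small helper. This is only an illustration under the assumptions of this section: the function name `interpret_two_way_anova`, its parameters, and its return strings are hypothetical, not part of `statsmodels`, and the $p$-values are copied by hand from the table.

```python
def interpret_two_way_anova(p_interaction, p_agegp, p_alcgp, alpha=0.05):
    """Apply the three-hypothesis procedure described in this section."""
    # Hypothesis 1: if we reject "no interaction", we stop there.
    if p_interaction < alpha:
        return ['reject H0: no interaction (main effects not tested)']
    conclusions = ['fail to reject H0: no interaction']
    # Hypotheses 2 and 3: only examined when the interaction is not significant.
    for name, p in [('agegp', p_agegp), ('alcgp', p_alcgp)]:
        verdict = 'reject' if p < alpha else 'fail to reject'
        conclusions.append(f'{verdict} H0: equal mean ncases across {name}')
    return conclusions

# p-values copied from the PR(>F) column of the ANOVA table above.
result = interpret_two_way_anova(
    p_interaction=0.0363271,
    p_agegp=2.021935e-09,
    p_alcgp=0.004862447,
)
print(result)
```

With the values from our table, the helper stops at the interaction test, which matches the conclusion reached above.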