# Multi-sample Test (ANOVA)

**Goal:** Compare the mean value of more than 2 independent samples ($\mu_1$, $\mu_2$, ..., $\mu_k$). 

We are interested in which of the following hypotheses is supported by the data.

$$H_1: \mu_1 = \mu_2 = \dots = \mu_k$$

$$H_2: \mu_1 \ne \mu_2 = \dots = \mu_k$$

$$H_3: \mu_1 = \mu_2 \ne \dots = \mu_k$$

$$H_4: \mu_1 = \mu_2 = \dots \ne \mu_k$$

$$H_5: \mu_1 \ne \mu_2 \ne \dots \ne \mu_k$$

Now that we are interested in comparing more than two samples, we will use an ANOVA instead of a t-test. In Python, we fit a linear model with the `ols()` function from `statsmodels` and pass the fitted model to `anova_lm()`.

We will again work with chick weight data, as in the last couple of modules. The code below loads the *chickwts* dataset through `pydataset`; it records the weight of 71 chicks and the feed (diet) each one received. In the last module, we used a series of t-tests to compare the mean weight between chicks on different diets. With an ANOVA, we can run one test to see whether the mean weight of any one group (diet) differs from that of any other group. The ANOVA will not tell us which group is different, only that at least one group is. In the next module, we will learn how to identify which group differs.
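
To make the single test concrete, the sketch below computes a one-way ANOVA F statistic by hand: the variation between group means (divided by its degrees of freedom, $k - 1$) over the variation within groups (divided by $N - k$). The three groups are hypothetical toy values chosen for easy arithmetic, not chick data, and only the standard library is used.

```python
from statistics import mean

# Hypothetical toy data: three groups of three observations each
groups = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 3.0, 4.0],
    "c": [5.0, 6.0, 7.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)
k = len(groups)       # number of groups
n = len(all_values)   # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Within-group sum of squares: how far each observation is from its own group mean
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = k - 1
df_within = n - k

# F is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.2f}, F = {f_stat:.2f}")
```

A large F means the group means are spread out relative to the scatter inside each group, which is the evidence an ANOVA uses against the hypothesis that all means are equal.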


In [31]:
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pydataset import data

# Load the chickwts dataset and treat feed as a categorical variable
df = data('chickwts')
df['feed'] = df['feed'].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71 entries, 1 to 71
Data columns (total 2 columns):
weight    71 non-null int64
feed      71 non-null category
dtypes: category(1), int64(1)
memory usage: 1.4 KB


In [14]:
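# Fit a linear model of weight on feed, with feed as a categorical predictor
cw_lm = ols('weight ~ C(feed)', data=df).fit()
# Type II ANOVA table for the fitted model
sm.stats.anova_lm(cw_lm, typ=2)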

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(feed) | 231129.162103 | 5.0 | 15.3648 | 5.93642e-10 |
| Residual | 195556.020996 | 65.0 | | |


Looking at the output of the ANOVA, we can see that the mean weight for at least one of the groups (feeds) differs from that of at least one other group. We know this because the p-value (5.94e-10) is less than 0.05. The output table also gives us the F statistic (15.36).
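
As a quick check on how the table fits together, the F statistic can be recomputed from the sums of squares and degrees of freedom it reports. This is only an arithmetic sketch; the numbers are copied from the output above and the variable names are illustrative.

```python
# Values copied from the ANOVA output above
ss_feed, df_feed = 231129.162103, 5      # C(feed) row
ss_resid, df_resid = 195556.020996, 65   # Residual row

# Mean square = sum of squares / degrees of freedom
ms_feed = ss_feed / df_feed
ms_resid = ss_resid / df_resid

# F is the ratio of the feed mean square to the residual mean square
f_stat = ms_feed / ms_resid
print(round(f_stat, 4))
```

The result agrees with the F column of the table, up to rounding.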

If we want to see the coefficient values for each of the diets, we can use the fitted model's `params` attribute. This tells us the size of each diet's effect relative to the reference group.

In [37]:
# Coefficients: the intercept is the mean weight of the reference group;
# each remaining coefficient is a difference from that reference
cw_lm.params

For example, in the four-diet *ChickWeight* data, diet 1 (the reference group) has an average weight of 102.65 grams, which is the intercept. Relative to diet 1, diet 2 increases weight by 19.97 grams, diet 3 by 40.30 grams, and diet 4 by 32.62 grams.
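
Because each non-intercept coefficient is a difference from the reference group, the group means can be recovered by adding it to the intercept. The short sketch below does this with the figures quoted above; the dictionary names are illustrative only.

```python
# Coefficients quoted in the text above (grams)
intercept = 102.65  # mean weight for diet 1, the reference group
effects = {"diet 2": 19.97, "diet 3": 40.30, "diet 4": 32.62}

# Mean of each group = intercept + that group's coefficient
group_means = {"diet 1": intercept}
for diet, effect in effects.items():
    group_means[diet] = intercept + effect

for diet, m in group_means.items():
    print(f"{diet}: {m:.2f} g")
```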

We can also use the fitted model's `conf_int()` method to get the confidence intervals.


In [38]:
# 95% confidence intervals for each coefficient (the default, alpha=0.05)
cw_lm.conf_int()

Recall that the confidence interval for the intercept is around the mean of diet 1 (the reference group), and the confidence intervals for diets 2-4 are around the difference in means between that particular diet and diet 1.
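
For reference, intervals of this kind for a linear model coefficient typically take the form below. The notation is introduced here for illustration: $\hat{\beta}_j$ is the estimated coefficient, $\mathrm{SE}(\hat{\beta}_j)$ its standard error, $N$ the total number of observations, and $k$ the number of groups.

$$\hat{\beta}_j \pm t_{1-\alpha/2,\; N-k} \cdot \mathrm{SE}(\hat{\beta}_j)$$

A 95% interval corresponds to $\alpha = 0.05$ and a 99% interval to $\alpha = 0.01$. A smaller $\alpha$ gives a larger critical value $t$ and therefore a wider interval.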

We can also change the confidence level if we are interested in another level, such as 99%.

In [39]:
# 99% confidence intervals: alpha = 1 - 0.99
cw_lm.conf_int(alpha=0.01)

## Problem Set

1. Let's explore the *iris* dataset (a classic built-in dataset, also available through `pydataset`). This dataset gives us the measurements of sepal length and width and petal length and width for 50 flowers from each of 3 species of iris. First, create a boxplot of petal length for the 3 iris species.
2. Is there a difference in mean petal length between the 3 iris species? Use an ANOVA to answer this question.
3. What are the coefficients for each species and their 95% confidence intervals? Which species has the largest mean petal length?
