
# Linear Models: Multiple variables and interactions  <span class="tocSkip"></span>

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Formulae-with-interactions-in-R" data-toc-modified-id="Formulae-with-interactions-in-R-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Formulae with interactions in R</a></span></li><li><span><a href="#Model-1:-Mammalian-genome-size" data-toc-modified-id="Model-1:-Mammalian-genome-size-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model 1: Mammalian genome size</a></span></li><li><span><a href="#ANCOVA:-Body-Weight-in-Odonata" data-toc-modified-id="ANCOVA:-Body-Weight-in-Odonata-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>ANCOVA: Body Weight in Odonata</a></span></li></ul></li></ul></div>

# Introduction 

Aims of this chapter[$^{[1]}$](#fn1):

- Creating more complex models, including ANCOVA

- Looking at interactions between variables

- Plotting predictions from models

We will look at two models in this chapter:

- Model 1: Is mammalian genome size predicted by interactions between trophic level and whether species are ground dwelling?

- ANCOVA: Is body weight in Odonata predicted by interactions between genome size and taxonomic suborder?

So far, we have only looked at the independent effects of variables. For
example, in the trophic level and ground dwelling model from [MulExpl](16-MulExpl.ipynb), we only looked for specific differences for being an
omnivore *or* being ground dwelling, not for being
specifically a *ground dwelling omnivore*. These
independent effects of a variable are known as *main
effects*, and the effects of combinations of variables acting
together are known as *interactions*: they describe how
the variables *interact*.

## Formulae with interactions in R

We’ve already seen a number of different model formulae in R. They all
use this syntax:

`response variable ~ explanatory variable(s)`

but we are now going to add two extra pieces of syntax:

- `a:b` means the interaction between `a` and `b`: do combinations of these variables lead to different outcomes? Written out in full, a model with both main effects and their interaction is `y ~ a + b + a:b`.

- `a * b` is shorthand for the same model: the `*` means fit `a` and `b` as main effects and their interaction `a:b`.
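
To see that the two ways of writing the formula describe the same model, we can ask R to expand each one into its individual terms. This is a minimal sketch: `y`, `a` and `b` are placeholder names, and no data are needed because `terms()` only parses the formula.

```r
# Expand each formula into its individual terms
full  <- attr(terms(y ~ a + b + a:b), "term.labels")
short <- attr(terms(y ~ a * b), "term.labels")

print(full)
print(short)

# TRUE if both formulae contain exactly the same terms
print(identical(full, short))
```

Both calls list the two main effects followed by the `a:b` interaction, so either form can be used in `lm`.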

## Model 1: Mammalian genome size

$\star$ Make sure you have changed the working directory to `Code` in
your stats coursework directory.

Create a new blank script called `Interactions.R` and add some
introductory comments.

Use `load('mammals.Rdata')` to load the data.

If `mammals.Rdata` is missing, just import the data again
using `read.csv("../Data/MammalData.csv")`. You will then
have to add the log C Value column to the imported data frame again.

Let’s refit the model from [MulExpl](16-MulExpl.ipynb), but including the
interaction between trophic level and ground dwelling. We’ll immediately
check the model is appropriate:

In [None]:
> model <- lm(logCvalue ~ TrophicLevel * GroundDwelling, data= mammals)
> par(mfrow=c(2,2), mar=c(3,3,1,1), mgp=c(2, 0.8,0))
> plot(model)   

This gives:

<a id="fig:mamMod"></a>
<figure>
    <img src="./graphics/mamMod.svg" alt="mamMod" style="width:70%">
    <small> 
        <center>
            <figcaption> 
           Figure 1
            </figcaption>
        </center>
    </small>
</figure>

Now, we’ll examine the `anova` and `summary`
outputs for the model:

In [None]:
> anova(model)

 Analysis of Variance Table
 
 Response: logCvalue
                              Df Sum Sq Mean Sq F value  Pr(>F)    
 TrophicLevel                  2   0.81   0.407    8.06  0.0004 ***
 GroundDwelling                1   2.75   2.747   54.40 2.3e-12 ***
 TrophicLevel:GroundDwelling   2   0.43   0.216    4.27  0.0150 *  
 Residuals                   253  12.77   0.050                    
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Compared to the model from [MulExpl](16-MulExpl.ipynb), there is an extra
line at the bottom. The top two are the same and show that trophic level
and ground dwelling both have independent main effects. The extra line
shows that there is also an interaction between the two. It doesn’t
explain a huge amount of variation, about half as much as trophic level,
but it is significant.

Again, we can calculate the $r^2$ for the model:

$$\frac{0.81 + 2.75 + 0.43}{0.81+2.75+0.43+12.77} = 0.238$$

The model from [MulExpl](16-MulExpl.ipynb) without the interaction had an $r^2 = 0.212$, so our new
model explains 2.6% more of the variation in the data.
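
The same arithmetic can be done in R. The sketch below simply types in the sums of squares from the `anova` table above, so it runs without the data; the vector names are chosen here for illustration.

```r
# Sums of squares copied from the anova(model) table above
ss <- c(TrophicLevel = 0.81, GroundDwelling = 2.75,
        Interaction = 0.43, Residuals = 12.77)

# Explained variation is everything except the residual sum of squares
explained <- sum(ss[c("TrophicLevel", "GroundDwelling", "Interaction")])
r2 <- explained / sum(ss)
print(round(r2, 3))
```

With the fitted model in memory, the sums of squares could instead be taken from the `"Sum Sq"` column of `anova(model)`, which avoids the rounding in the printed table.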

The summary table is as follows:

In [None]:
> summary(model)
 
 Call:
 lm(formula = logCvalue ~ TrophicLevel * GroundDwelling, data = mammals)
 
 Residuals:
    Min     1Q Median     3Q    Max 
 -0.523 -0.171 -0.010  0.119  0.831 
 
 Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)    
 (Intercept)                               0.9589     0.0441   21.76  < 2e-16 ***
 TrophicLevelHerbivore                     0.0535     0.0554    0.97  0.33460    
 TrophicLevelOmnivore                      0.2328     0.0523    4.45  1.3e-05 ***
 GroundDwellingYes                         0.2549     0.0651    3.92  0.00012 ***
 TrophicLevelHerbivore:GroundDwellingYes   0.0303     0.0786    0.39  0.69979    
 TrophicLevelOmnivore:GroundDwellingYes   -0.1476     0.0793   -1.86  0.06384 .  
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
 
 Residual standard error: 0.225 on 253 degrees of freedom
   (120 observations deleted due to missingness)
 Multiple R-squared: 0.238, Adjusted R-squared: 0.223 
 F-statistic: 15.8 on 5 and 253 DF,  p-value: 1.5e-13 

The lines in the coefficient table are:

1. The reference level (intercept) for non-ground-dwelling carnivores. (The reference level is decided just by the alphabetic order of the levels.)
2. Two differences for being in different trophic levels.
3. One difference for being ground dwelling.
4. Two new differences that give specific differences for ground dwelling herbivores and omnivores.

The first four lines, as in the model from [ANOVA](15-anova.ipynb),
would allow us to find the predicted values for each group *if the
size of the differences did not vary between levels because of the
interactions*. That is, this part of the model only includes a
single difference between ground and non-ground species, which has to be the
same for each trophic group because it ignores interactions between
trophic level and the ground / non-ground identity of each species. The last
two lines then give the estimated coefficients associated with the
interaction terms, and allow the size of the differences to vary
between levels because of the further effects of interactions.

The table below shows how these combine to give the predictions for each
group combination, with those two new lines shown in red:

$$\begin{array}{|r|r|r|}
\hline
 & \textrm{Not ground} &  \textrm{Ground} \\
\hline
\textrm{Carnivore} & 0.96 = 0.96 &  0.96+0.25=1.21 \\
\textrm{Herbivore} & 0.96 + 0.05 = 1.01 & 0.96+0.05+0.25{\color{red}+0.03}=1.29\\
\textrm{Omnivore} & 0.96 + 0.23 = 1.19 & 0.96+0.23+0.25{\color{red}-0.15}=1.29\\
\hline
\end{array}$$
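
The sums in the table can be reproduced in R by adding up the relevant coefficients for each group. This is a sketch that types in the estimates from the `summary` table above (to four decimal places) instead of using the fitted model; the short names in `b` are made up for readability.

```r
# Coefficients copied from the summary(model) table above
b <- c(intercept = 0.9589, herbivore = 0.0535, omnivore = 0.2328,
       ground = 0.2549, herb_ground = 0.0303, omni_ground = -0.1476)

# Build the six group predictions by adding the relevant coefficients
byHand <- data.frame(
  TrophicLevel   = rep(c("Carnivore", "Herbivore", "Omnivore"), each = 2),
  GroundDwelling = rep(c("No", "Yes"), times = 3),
  predict = c(
    b[["intercept"]],
    b[["intercept"]] + b[["ground"]],
    b[["intercept"]] + b[["herbivore"]],
    b[["intercept"]] + b[["herbivore"]] + b[["ground"]] + b[["herb_ground"]],
    b[["intercept"]] + b[["omnivore"]],
    b[["intercept"]] + b[["omnivore"]] + b[["ground"]] + b[["omni_ground"]]
  )
)
print(byHand)

# Ground dwelling omnivores from the main effects alone, without the interaction
mainOnly <- b[["intercept"]] + b[["omnivore"]] + b[["ground"]]
print(mainOnly)
```

Comparing `mainOnly` with the last row of `byHand` shows how far the interaction term pulls the prediction for ground dwelling omnivores back down.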

So why are there two new coefficients? For interactions between two
factors, there are always $(n-1)\times(m-1)$ new coefficients, where $n$
and $m$ are the number of levels in the two factors (Ground dwelling or
not: 2 levels and trophic level: 3 levels, in our current example). So
in this model, $(3-1) \times (2-1) =2$. It is easier to understand why
graphically: the prediction for the white boxes below can be found by
adding the main effects together but for the grey boxes we need to find
specific differences and so there are $(n-1)\times(m-1)$ interaction
coefficients to add.

<a id="fig:interactionsdiag"></a>
<figure>
    <img src="./graphics/interactionsdiag.png" alt="interactionsdiag" style="width:50%">
    <small> 
        <center>
            <figcaption> 
           Figure 2
            </figcaption>
        </center>
    </small>
</figure>
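
The $(n-1)\times(m-1)$ rule can also be checked by asking R which columns it would build for this model. The sketch below does not use the mammal data: it only lays out the six factor combinations with `expand.grid` and inspects the design matrix from `model.matrix`.

```r
# All combinations of the two factors, stored as factors
grid <- expand.grid(TrophicLevel = c("Carnivore", "Herbivore", "Omnivore"),
                    GroundDwelling = c("No", "Yes"),
                    stringsAsFactors = TRUE)

# Design matrix for main effects plus interaction
X <- model.matrix(~ TrophicLevel * GroundDwelling, data = grid)
print(colnames(X))

# Interaction columns are the ones with a ':' in their name
nInteraction <- sum(grepl(":", colnames(X), fixed = TRUE))
n <- nlevels(grid$TrophicLevel)
m <- nlevels(grid$GroundDwelling)
print(c(counted = nInteraction, expected = (n - 1) * (m - 1)))
```

The column names match the coefficient names in the `summary` table, and the two interaction columns correspond to the grey boxes in the diagram.
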

If we put this together, what is the model telling us?

- Herbivores have the same genome sizes as carnivores, but omnivores have larger genomes.

- Ground dwelling mammals have larger genomes.

- These two findings suggest that ground dwelling omnivores should have extra big genomes. However, the interaction shows they are smaller than expected and are, in fact, similar to ground dwelling herbivores.

Note that although the interaction term in the `anova` output
is significant, neither of the two coefficients in the
`summary` has a $p<0.05$. There are two weak differences (one
very weak, one nearly significant) that together explain significant
variance in the data.

$\star$ Copy the code above into your script and run the model.

Make sure you understand the output!

Just to make sure the sums above are correct, we’ll use the same code as
in [MulExpl](16-MulExpl.ipynb) to get R to calculate predictions for us:

In [None]:
# a data frame of combinations of variables
> gd <- rep(levels(mammals$GroundDwelling), times = 3)
> print(gd)
 [1] "No"  "Yes" "No"  "Yes" "No"  "Yes"

> tl <- rep(levels(mammals$TrophicLevel), each = 2)
> print(tl)

 [1] "Carnivore" "Carnivore" "Herbivore" "Herbivore" "Omnivore"  "Omnivore" 

# New data frame
> predVals <- data.frame(GroundDwelling = gd, TrophicLevel = tl)

# predict using the new data frame
> predVals$predict <- predict(model, newdata = predVals)
> print(predVals)
 
    GroundDwelling TrophicLevel predict
 1             No    Carnivore  0.9589
 2            Yes    Carnivore  1.2138
 3             No    Herbivore  1.0125
 4            Yes    Herbivore  1.2977
 5             No     Omnivore  1.1918
 6            Yes     Omnivore  1.2990

$\star$ Run these predictions in your script.

If we plot these data points onto the barplot from [MulExpl](16-MulExpl.ipynb), they now lie exactly on the mean values, because we’ve
allowed for interactions. The triangle on this plot shows the
prediction for ground dwelling omnivores from the main effects alone
($0.96 + 0.23  + 0.25 = 1.44$); the interaction of $-0.15$ pushes the
prediction back down to $1.29$.

<a id="fig:predPlot"></a>
<figure>
    <img src="./graphics/predPlot.svg" alt="predPlot" style="width:80%">
</figure>


## ANCOVA: Body Weight in Odonata

We’ll go all the way back to the regression analyses from [regress](14-regress.ipynb). Remember that we fitted two separate regression
lines to the data for damselflies and dragonflies. We’ll now use an
interaction to fit these in a single model. This kind of linear model —
with a mixture of continuous variables and factors — is often called an
*analysis of covariance*, or ANCOVA. That is, ANCOVA is a
type of linear model that blends ANOVA and regression. ANCOVA evaluates
whether population means of a dependent variable are equal across levels
of a categorical independent variable, while statistically controlling
for the effects of other continuous variables that are not of primary
interest, known as covariates.

*That is, this is still a linear model, but with one categorical
and one or more continuous predictors*.
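
For one continuous predictor and a two-level factor, a model of this kind with an interaction can be written out as below. The symbols are introduced here for illustration and are not the chapter's own notation: $y$ is the response, $x$ the continuous predictor, and $z$ an indicator that is 0 for the first factor level and 1 for the second.

$$y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 (x \times z) + \varepsilon$$

With $z = 0$ the line has intercept $\beta_0$ and slope $\beta_1$; with $z = 1$ it has intercept $\beta_0 + \beta_2$ and slope $\beta_1 + \beta_3$, so $\beta_3$ is what lets the slopes differ between the groups.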

$\star$ Load the data:

`odonata <- read.csv('../Data/GenomeSize.csv')`

Create two new variables in the `odonata` data set called
`logGS` and `logBW` containing log genome size and
log body weight.
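
A sketch of this step is shown below on a tiny stand-in data frame so that it runs on its own. The column names `GenomeSize` and `BodyWeight` are assumptions about the imported file, and the numbers are made up purely for illustration; apply the same three assignments to `odonata` once you have checked its column names with `str(odonata)`.

```r
# Hypothetical stand-in for the imported data (values are illustrative only)
odonataDemo <- data.frame(
  Suborder   = c("Anisoptera", "Zygoptera"),
  GenomeSize = c(1.0, 0.5),
  BodyWeight = c(0.2, 0.01)
)

# Natural-log transformed versions of the two continuous variables
odonataDemo$logGS <- log(odonataDemo$GenomeSize)
odonataDemo$logBW <- log(odonataDemo$BodyWeight)

# Store suborder as a factor so it can be used for levels and plot colours
odonataDemo$Suborder <- factor(odonataDemo$Suborder)
str(odonataDemo)
```

The factor conversion matters in R 4.0 and later, where `read.csv` leaves text columns as character vectors by default.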

The models we fitted before looked like this:

<a id="fig:dragonData"></a>
<figure>
    <img src="./graphics/dragonData.svg" alt="dragonData" style="width:70%">
    <small> 
        <center>
            <figcaption> 
           Figure 3
            </figcaption>
        </center>
    </small>
</figure>

We can now fit the model of body weight as a function of both genome
size and suborder:

In [None]:
> odonModel <- lm(logBW ~ logGS * Suborder, data = odonata)

Again, we’ll look at the `anova` table first:

In [None]:
> anova(odonModel)
 
 Analysis of Variance Table
 
 Response: logBW
                Df Sum Sq Mean Sq F value  Pr(>F)    
 logGS           1    1.1     1.1    2.71     0.1    
 Suborder        1  112.0   112.0  265.13 < 2e-16 ***
 logGS:Suborder  1    9.1     9.1   21.65 1.1e-05 ***
 Residuals      94   39.7     0.4                    
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Interpreting this gives the following:

There is no significant main effect of log genome size. The word
*main* is the important thing here: genome size is
hugely important but does very different things for the two different
suborders. If we ignored `Suborder`, there wouldn’t be an overall
relationship: the average of those two lines is pretty much flat.

There is a very strong main effect of Suborder: the mean body weights in
the two groups are very different.

There is a strong interaction between suborder and genome size. This is
an interaction between a factor and a continuous variable and shows that
the *slopes* are different for the different factor levels.

The summary table looks like this:

In [None]:
> summary(odonModel)
 
 Call:
 lm(formula = logBW ~ logGS * Suborder, data = odonata)
 
 Residuals:
     Min      1Q  Median      3Q     Max 
 -1.3243 -0.3225  0.0073  0.3962  1.4976 
 
 Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
 (Intercept)              -2.3995     0.0848  -28.31  < 2e-16 ***
 logGS                     1.0052     0.2237    4.49  2.0e-05 ***
 SuborderZygoptera        -2.2489     0.1354  -16.61  < 2e-16 ***
 logGS:SuborderZygoptera  -2.1492     0.4619   -4.65  1.1e-05 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
 
 Residual standard error: 0.65 on 94 degrees of freedom
   (2 observations deleted due to missingness)
 Multiple R-squared: 0.755, Adjusted R-squared: 0.747 
 F-statistic: 96.5 on 3 and 94 DF,  p-value: <2e-16 
 

The first thing to note is that the $r^2$ value is really high. The
model explains three quarters (0.755) of the variation in the data.
Next, there are four coefficients:

The intercept is for the first level of `Suborder`, which is
Anisoptera (dragonflies).

The next line, for `logGS` (log genome size), is the slope for Anisoptera.

We then have a coefficient for the second level of `Suborder`, which is Zygoptera (damselflies). As with the first
model, this difference in factor levels is a difference in mean values
and shows the difference in the intercept for Zygoptera.

The last line is the interaction between `Suborder` and
`logGS`. This shows how the slope for Zygoptera differs from
the slope for Anisoptera.

How do these hang together to give the two lines shown in the model? We
can calculate these by hand:

$$\begin{aligned}
    \textrm{logBW} &= -2.40 + 1.01 \times \textrm{logGS} & \textrm{[Anisoptera]}\\
    \textrm{logBW} &= (-2.40 -2.25) + (1.01 - 2.15) \times \textrm{logGS} & \textrm{[Zygoptera]}\\
                   &= -4.65 - 1.14 \times \textrm{logGS}
\end{aligned}$$
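
The same intercepts and slopes can be assembled in R. This sketch types in the four estimates from the `summary` table above, so it runs without the data; the short names in `cf` are chosen here for readability.

```r
# Coefficients copied from the summary(odonModel) table above
cf <- c(intercept = -2.3995, logGS = 1.0052,
        zygo = -2.2489, logGS_zygo = -2.1492)

# Anisoptera is the reference level; Zygoptera adds its two differences
lineCoefs <- data.frame(
  Suborder  = c("Anisoptera", "Zygoptera"),
  intercept = c(cf[["intercept"]], cf[["intercept"]] + cf[["zygo"]]),
  slope     = c(cf[["logGS"]], cf[["logGS"]] + cf[["logGS_zygo"]])
)
print(lineCoefs)
```

The Zygoptera row is the sum of the reference line and the two Zygoptera differences, which is why its slope comes out negative even though the `logGS` coefficient is positive.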

$\star$ Add the code into your script and check that you understand the outputs.

We’ll use the `predict` function to get the predicted values
from the model and add lines to the plot above.

First, we’ll create a set of numbers spanning the range of genome size:

In [None]:
# get the range of the data, ignoring any missing values
> rng <- range(odonata$logGS, na.rm = TRUE)
# get a sequence from the min to the max with 100 equally spaced values
> LogGSForFitting <- seq(rng[1], rng[2], length.out = 100)

Have a look at these numbers:

In [None]:
> print(LogGSForFitting)

We can now use the model to predict the values of log body weight at each of
those points for each of the two suborders. We’ve added `se.fit=TRUE` to the function to get the standard error around the
regression lines.

In [None]:
# get a data frame of new data for the suborder
> ZygoVals <- data.frame(logGS = LogGSForFitting, Suborder = "Zygoptera")

# get the predictions and standard error
> ZygoPred <- predict(odonModel, newdata = ZygoVals, se.fit = TRUE)

# repeat for Anisoptera
> AnisoVals <- data.frame(logGS = LogGSForFitting, Suborder = "Anisoptera")

> AnisoPred <- predict(odonModel, newdata = AnisoVals, se.fit = TRUE)

Both `AnisoPred` and `ZygoPred` contain predicted values (called `fit`) and standard error values (called
`se.fit`) for each of the generated values in
`LogGSForFitting`, one set for each of the two suborders.
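
To see what `predict` returns when `se.fit = TRUE`, here is a self-contained sketch that uses R's built-in `iris` data as a stand-in for the Odonata data. The model has the same shape (one continuous predictor interacting with a factor), but the variables and results have nothing to do with this chapter's analysis.

```r
# Stand-in model on the built-in iris data (not the Odonata data)
fit <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)

# 100 equally spaced values across the range of the continuous predictor
xs <- seq(min(iris$Petal.Length), max(iris$Petal.Length), length.out = 100)
newVals <- data.frame(Petal.Length = xs, Species = "setosa")

# With se.fit = TRUE, predict returns a list and not a plain vector
pred <- predict(fit, newdata = newVals, se.fit = TRUE)
print(names(pred))

# One standard error either side of the fitted line
upper <- pred$fit + pred$se.fit
lower <- pred$fit - pred$se.fit
print(head(cbind(xs, fit = pred$fit, lower, upper)))
```

The `fit` and `se.fit` components each have one value per row of the new data frame, which is what lets them be plotted against the generated sequence.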

We can add the predictions onto a plot like this:

In [None]:
# plot the scatterplot of the data
> plot(logBW ~ logGS, data = odonata, col = Suborder)
# add the predicted lines
> lines(AnisoPred$fit ~ LogGSForFitting, col = "black")
> lines(AnisoPred$fit + AnisoPred$se.fit ~ LogGSForFitting, col = "black", lty = 2)
> lines(AnisoPred$fit - AnisoPred$se.fit ~ LogGSForFitting, col = "black", lty = 2)

$\star$ Copy the prediction code into your script and run the plot above.

Copy and modify the last three lines to add the lines for the Zygoptera. Your final plot should look like this.

<a id="fig:odonPlot"></a>
<figure>
    <img src="./graphics/odonPlot.svg" alt="odonPlot" style="width:70%">
    <small> 
        <center>
            <figcaption> 
           Figure 4
            </figcaption>
        </center>
    </small>
</figure>

---

<a id="fn1"></a>
[1]: Here you work with the script file `MulExplInter.R`