# Practical 8

## Aim

To learn how to examine and interpret interaction between two variables in a logistic model. 

In [None]:
library(tidyverse)

## Reading in the dataset and identifying relevant variables

In this practical session we will use a dataset from the study of helminths in Uganda.

To read in the dataset, type:

In [None]:
library(haven)

In [None]:
helminths_df <- read_dta("Data_files-20211113/helminths.dta")

In this session we will focus again on hookworm infection as the main exposure and severe anaemia as the outcome. If you are starting from this practical, make sure the hookworm infection variable is called hookworm. If it is currently called hk_bin, type:

In [None]:
helminths_df_2 <- helminths_df %>%
    mutate(hookworm = hk_bin)

This creates a variable called hookworm that describes hookworm infection status (a copy of hk_bin). We will examine whether the effect of hookworm on severe anaemia is modified by age. The aim of all statistical modelling exercises is to obtain the best measure of effect for the exposure of interest. There is no point in having a complex model which is difficult to interpret if a simpler one is statistically almost as good. So our strategy will be to test a model with interaction between hookworm and agegrp. If the interaction does not significantly improve the model, as assessed by the likelihood ratio test (LRT), then a simpler model with no interaction will be chosen.
 
First a recap of the variables of interest:

**anaemic_sev**  
    coded: 0=no, 1=yes 
 
**hookworm** is the variable name for hookworm infection status  
    coded:  0=uninfected, 1=infected 

**agegrp** is the variable name for age-group 
    coded: 0=<20, 1=20-24, 2=25-29, 3=30+ 

**malaria** 
    coded: 0=uninfected, 1=infected 


To illustrate with a less complex model, recode age group into a binary variable. This is only less complex in that there will be fewer parameters in the model. (It is possible to test interactions with more than two categories in a variable, but we are trying here to keep things simpler at first.) To create a new variable where age takes 2 levels (0 = <25 years, 1 = 25+ years), type:

In [None]:
helminths_df_3 <- helminths_df_2 %>%
    mutate(agebin = as.factor(if_else(agegrp < 2, 0, 1)))

Check the recoding of the new variable agebin by tabulating it against the original variable agegrp. Type:

In [None]:
table(helminths_df_3$agegrp, helminths_df_3$agebin)
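As a small sketch of what this check is looking for, the same recoding can be applied to toy data. The `toy` values below are made up for illustration and are not taken from the helminths dataset. The example also shows that `if_else` passes missing values through, and that `table` only displays them when asked to via `useNA`.

```r
library(dplyr)

# Toy data (hypothetical values, not the helminths dataset),
# including one missing age group
toy <- tibble(agegrp = c(0, 1, 2, 3, NA))

# Same recoding as above: age groups 0 and 1 become 0, groups 2 and 3 become 1
toy <- toy %>%
    mutate(agebin = as.factor(if_else(agegrp < 2, 0, 1)))

# useNA = "ifany" adds a row and column for missing values,
# which table() drops by default
table(toy$agegrp, toy$agebin, useNA = "ifany")
```

Each original age group should fall into exactly one column of the table, and any missing ages should stay missing rather than being assigned to a group.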

## Specifying a logistic model with an interaction term

In Practical 7 we produced a logistic model that assumed proportional odds. This assumed that the effect of hookworm infection on severe anaemia is the same for all levels of age. If the effect of hookworm is modified by age, i.e. there is interaction between the two variables, we should include the interaction in our model.

To produce a model with interaction in `glm` insert `*` between the two variables for which the interaction parameter(s) are required. Type:

In [None]:
anaemia_agebin_interaction_glm <-
glm(anaemic_sev ~ agebin * hookworm,
    data = helminths_df_3,
    family = binomial)

In [None]:
summary(anaemia_agebin_interaction_glm)

In [None]:
exp(cbind(OR = coef(anaemia_agebin_interaction_glm), confint(anaemia_agebin_interaction_glm)))

We will return to the interpretation of this output shortly.
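The `*` operator in the model formula is shorthand for both main effects plus their interaction. The sketch below checks this on a toy grid of the four age and hookworm combinations (the `toy` data frame is hypothetical and only mirrors the variable names used here) by comparing the design matrix columns that each formula generates.

```r
# Toy grid of the four age/hookworm combinations (hypothetical data)
toy <- expand.grid(agebin = factor(c(0, 1)), hookworm = c(0, 1))

# Columns generated by the shorthand and by the fully written-out formula
cols_star <- colnames(model.matrix(~ agebin * hookworm, data = toy))
cols_long <- colnames(model.matrix(~ agebin + hookworm + agebin:hookworm, data = toy))

cols_star
identical(cols_star, cols_long)
```

Because `agebin` is a factor, its non-baseline level appears as `agebin1`, and the interaction column is named `agebin1:hookworm`.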

## Likelihood ratio test for interaction  

Before attempting to interpret interaction terms let’s test this interaction formally using a likelihood ratio test.  Type the following commands: 

In [None]:
anaemia_agebin_glm <-
glm(anaemic_sev ~ agebin + hookworm,
    data = helminths_df_3,
    family = binomial)

In [None]:
library(lmtest)

In [None]:
lrtest(anaemia_agebin_interaction_glm, anaemia_agebin_glm)

The LRT tests the null hypothesis that the model with the interaction term does not improve the model fit when compared to the model without the interaction term, or, put another way, that the two models fit equally well. There is no strong evidence (P=0.672) against this null hypothesis, so there is no strong evidence for interaction and it is reasonable to accept the simpler model.
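To make the calculation behind the likelihood ratio test concrete, here is a minimal sketch that computes the statistic by hand. It uses simulated data, so the data frame `sim`, its effect sizes and the resulting numbers are assumptions for illustration only and will not match the helminths results.

```r
set.seed(1)

# Simulated data (hypothetical, generated with no true interaction)
n <- 500
sim <- data.frame(
    agebin = factor(rbinom(n, 1, 0.5)),
    hookworm = rbinom(n, 1, 0.4)
)
sim$anaemic_sev <- rbinom(
    n, 1,
    plogis(-2 + 1 * sim$hookworm - 0.3 * (sim$agebin == "1"))
)

# Models with and without the interaction term
full <- glm(anaemic_sev ~ agebin * hookworm, data = sim, family = binomial)
reduced <- glm(anaemic_sev ~ agebin + hookworm, data = sim, family = binomial)

# LRT statistic: twice the difference in maximised log-likelihoods
lr_stat <- 2 * (as.numeric(logLik(full)) - as.numeric(logLik(reduced)))

# Degrees of freedom: number of extra parameters in the larger model
df_diff <- attr(logLik(full), "df") - attr(logLik(reduced), "df")

# P-value from the chi-squared distribution
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
c(statistic = lr_stat, df = df_diff, p = p_value)
```

With a binary age variable and a single hookworm variable, the interaction adds one parameter, so the statistic is compared with a chi-squared distribution on 1 degree of freedom.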

In [None]:
summary(anaemia_agebin_glm)

In [None]:
exp(cbind(OR = coef(anaemia_agebin_glm), confint(anaemia_agebin_glm)))

From this model we would conclude that the odds of severe anaemia in hookworm infected women are 2.8 times the odds of severe anaemia in hookworm uninfected women irrespective of age, and that the odds of severe anaemia in those aged 25+ years are 0.7 times the odds of severe anaemia in those aged <25 years regardless of hookworm infection status. 

## Interpreting an interaction model 

There is very little evidence against the null hypothesis of no interaction in this example and we would therefore usually choose to assume that the proportional odds model is reasonable. However, for the purpose of demonstrating how to analyse and interpret interactions we will proceed with the interaction model. 

Let’s now return to the output of the interaction model that we fitted earlier in this practical.

In [None]:
summary(anaemia_agebin_interaction_glm)

In [None]:
exp(cbind(OR = coef(anaemia_agebin_interaction_glm), confint(anaemia_agebin_interaction_glm)))

This model contains an interaction between age and hookworm infection status. This means that there is no longer an overall age effect regardless of hookworm infection status because the effect of age differs according to hookworm infection status (and vice versa, the hookworm effect depends on age group). 

The coefficients for the main effects are interpreted as the effect of the exposure when all other exposures involved in the interaction in the model are equal to their baseline categories. The two exposures involved in the interaction in this model are age group and hookworm infection. 

As an aside, note that if we had also included a third variable in this model (say, maternal education), as a main effect only, we would interpret the coefficients for the main effect of agebin as the effect of agebin when hookworm infection was equal to its baseline category, but now adjusted for maternal education. Similarly, the coefficient for the main effect of hookworm infection would be interpreted as the effect of hookworm infection, adjusted for maternal education, and when agebin is equal to its baseline category. 

The odds ratio for `agebin` (shown as `agebin1` in the `R` output) is 0.75, which represents the effect of age group in the baseline category for hookworm. Therefore, among women who are uninfected with hookworm, the odds of severe anaemia in those aged 25+ are 0.75 times the odds of severe anaemia in those aged <25 years.

Similarly the odds of severe anaemia in hookworm infected women are 2.95 times the odds of severe anaemia in hookworm uninfected women in the <25 year age-group. These are known as stratum-specific effects: they are the effects within specific strata (meaning groups or combinations of groups). Let’s divide the age and hookworm categories into four strata: 

[A] = [age <25, hookworm uninfected] 

[B] = [age 25+, hookworm uninfected] 

[C] = [age <25, hookworm infected] 

[D] = [age 25+, hookworm infected] 

Here [A] corresponds to the stratum where both exposures equal their baseline categories. The coefficient for `agebin` represents the comparison [B vs A] and the coefficient for hookworm represents the comparison [C vs A].

The last row of the output table corresponds to the interaction parameter (which `R` refers to as `agebin1:hookworm`). The interaction coefficient is the additional effect, over and above the main effects, that occurs when both exposures are “present”. 
 
For example, the odds ratio for the comparison [D vs A] is obtained as: 

OR = 0.75 × 2.95 × 0.88 = 1.95.  
 
There are still some stratum-specific estimates that we have not yet evaluated. For example, we know the odds ratio in hookworm infected compared to uninfected women is 2.95 in the <25 years age group, but what is it in the 25+ age group? To determine this we must multiply the appropriate odds ratios from the model. This particular estimate corresponds to the comparison [D vs B]. Age is the same in both of these strata, therefore we do not need to include a main effect for age. We are comparing hookworm infection vs “no infection”, so we do require the main effect of hookworm. Both exposures are “present” in stratum [D], so we also need to include the interaction parameter. Therefore the odds ratio is calculated as OR = 2.95 × 0.88 = 2.60, i.e. in the 25+ age group, the odds of severe anaemia in hookworm infected women are 2.60 times the odds of severe anaemia in women without hookworm infection. 
 
Following similar logic, in hookworm infected women the odds of severe anaemia in those aged 25+ are 0.75 × 0.88 = 0.66 times the odds of severe anaemia in those aged <25 years. (This corresponds to the comparison [D vs C].) 
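The three multiplications above can be reproduced in a few lines. This sketch simply uses the rounded odds ratios quoted in the text (0.75, 2.95 and 0.88); the variable names are chosen here for illustration.

```r
# Rounded odds ratios quoted in the text for the interaction model
or_agebin <- 0.75       # age 25+ vs <25, in hookworm uninfected women [B vs A]
or_hookworm <- 2.95     # hookworm infected vs uninfected, in women aged <25 [C vs A]
or_interaction <- 0.88  # additional effect when both exposures are "present"

# Stratum-specific odds ratios obtained by multiplication
or_D_vs_A <- or_agebin * or_hookworm * or_interaction
or_D_vs_B <- or_hookworm * or_interaction
or_D_vs_C <- or_agebin * or_interaction

round(c(D_vs_A = or_D_vs_A, D_vs_B = or_D_vs_B, D_vs_C = or_D_vs_C), 2)
# D_vs_A D_vs_B D_vs_C
#   1.95   2.60   0.66
```

Multiplying odds ratios is the same as adding the corresponding coefficients on the log odds scale and then exponentiating the sum.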
 
Let’s just review the results from the models with and without the interaction. The simpler model (the proportional odds model, since it assumes the odds ratio for each exposure is the same regardless of the category of the other exposure) gives the overall age effect as OR = 0.70. On the other hand, the interaction model gives the stratum-specific (i.e. hookworm infection-specific) age effects as OR = 0.75 for hookworm uninfected and OR = 0.66 for hookworm infected women. 
 
Notes: 

1. The formal test for interaction suggested very little evidence against the null hypothesis, however the actual stratum-specific estimates are somewhat different to each other. It is important to take away from this that the size of the p-value for the test for interaction does not indicate the magnitude of the interaction effect. 
2. The overall effect lies somewhere between the two stratum-specific effects; it is essentially a weighted average of the two. 
3. While the two stratum specific odds ratios appear somewhat different, they both point in the same direction (i.e. are <1).  
4. Also, we have not calculated confidence intervals. If the confidence intervals overlap, or include the other stratum specific value, we might reconsider how different we feel the two stratum specific values are.  
5. Both models, with and without interaction, estimate the underlying ‘true’ effects in the population, and there is no strong evidence that one model is better than the other.  
6. Generally speaking, the stronger the evidence of interaction, the more helpful it is to present stratum-specific estimates rather than an overall estimate of effect. However, there is no absolute cut-off, and it is up to us to decide when the interaction is so important to our main question of interest that we cannot present an overall estimate of effect. We cover these choices in more detail in Session SM12 – Strategies of Analysis. 
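
Note 4 mentions confidence intervals for the stratum-specific estimates. One possible way to obtain them, sketched here on simulated data, is to add the relevant coefficients on the log odds scale and use the model's variance-covariance matrix to get a standard error for the sum. The data frame `sim` and its effect sizes are hypothetical, and the interval is a Wald-type interval, which is an assumption of this sketch and differs from the profile likelihood intervals that `confint` reports for a `glm`.

```r
set.seed(1)

# Simulated data (hypothetical, not the helminths dataset)
n <- 2000
sim <- data.frame(
    agebin = factor(rbinom(n, 1, 0.5)),
    hookworm = rbinom(n, 1, 0.4)
)
sim$anaemic_sev <- rbinom(
    n, 1,
    plogis(-2 + 1 * sim$hookworm - 0.3 * (sim$agebin == "1"))
)

fit <- glm(anaemic_sev ~ agebin * hookworm, data = sim, family = binomial)
b <- coef(fit)
V <- vcov(fit)

# Hookworm effect in the 25+ age group [D vs B]:
# main effect of hookworm plus the interaction, on the log odds scale
terms_DB <- c("hookworm", "agebin1:hookworm")
log_or <- sum(b[terms_DB])

# Variance of a sum = sum of the variances plus twice the covariance,
# which equals the sum of all entries of the 2 x 2 sub-matrix
se <- sqrt(sum(V[terms_DB, terms_DB]))

# Wald 95% confidence interval, back-transformed to the odds ratio scale
ci <- exp(log_or + c(-1, 0, 1) * qnorm(0.975) * se)
names(ci) <- c("lower", "OR", "upper")
ci
```

The covariance term matters: adding only the two variances would give the wrong standard error, because the main effect and interaction estimates are correlated.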
 

## Obtaining stratum-specific estimates using R

We have now seen how to calculate stratum-specific effects using the coefficients from the logistic regression model. R can produce stratum-specific results in the following ways: 
 
1. R can obtain stratum-specific estimates by fitting a logistic model for each level of the stratifying variable. For example, to produce estimates for each level of agebin, type:

In [None]:
# Hookworm odds ratio among women aged <25 (agebin == 0)
exp(coef(glm(anaemic_sev ~ hookworm, data = filter(helminths_df_3, agebin == 0), family = binomial)))
# Hookworm odds ratio among women aged 25+ (agebin == 1)
exp(coef(glm(anaemic_sev ~ hookworm, data = filter(helminths_df_3, agebin == 1), family = binomial)))