# Logistic regression 3 

## Aim 

To learn how to include quantitative exposures in a logistic model and decide whether a linear trend is appropriate. 

In [None]:
library(tidyverse)

## Reading in the dataset and identifying relevant variables 
In this practical session we will again use the dataset from the helminths study in Uganda.  
 
To read in the dataset, type:

In [None]:
library(haven)

In [None]:
helminths_df <- read_dta("Data_files-20211113/helminths.dta")

The variables we will be working with are:

**anaemic_sev**  is the variable for severe anaemia, 
    coded: 0=no, 1=yes 
 
**agegrp**     is the variable name for age-group  
    coded: 0=<20, 1=20-24, 2=25-29, 3=30+ 
 
Note that in previous practicals, hookworm was our main exposure of interest, with age considered a potential confounder. For this practical, we shall consider agegrp as our main exposure of interest, in order to illustrate methods for analysing ordered categorical exposure variables.

## Quantitative exposures

In most analyses it is convenient to group continuous variables such as age. We can then obtain a parameter estimate for each level of the variable compared to a baseline level. For example, with agegrp we have previously obtained parameter estimates for the odds ratios relative to the youngest age-group. To review these results use the following command: 

In [None]:
anaemia_agegrp_glm <-
glm(anaemic_sev ~ as.factor(agegrp),
    data = helminths_df,
    family = binomial)

In [None]:
summary(anaemia_agegrp_glm)

In [None]:
exp(cbind(OR = coef(anaemia_agegrp_glm), confint(anaemia_agegrp_glm)))

## Assuming a linear trend

With quantitative exposures it is also possible to model a linear effect (increasing or decreasing) on the log odds scale. Such a model assumes a common odds ratio; that is, the relative increase (or decrease) in the odds from one age-group to the next is the same. To model this in `R` we simply do not specify that the variable is a factor. The model then assumes the same change in log odds for each unit increase in the variable.
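As a sketch of what the two models assume, they can be written as follows. The notation is introduced here for illustration only: $p$ is the probability of severe anaemia, $I(\cdot)$ is an indicator (1 if the condition holds, 0 otherwise), and the $\beta$ and $\gamma$ symbols are the model parameters.

Separate effects (three parameters for age-group):

$$\log\left(\frac{p}{1-p}\right) = \gamma_0 + \gamma_1 I(\text{agegrp}=1) + \gamma_2 I(\text{agegrp}=2) + \gamma_3 I(\text{agegrp}=3)$$

Linear effect (one parameter for age-group):

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \text{agegrp}$$

Under the linear model the common odds ratio for a one-category increase is $\exp(\beta_1)$, and the odds ratio for category $k$ compared with the baseline category 0 is $\exp(k\beta_1) = \exp(\beta_1)^k$.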

To do this for agegrp type: 


In [None]:
anaemia_agegrp_linear_glm <-
glm(anaemic_sev ~ agegrp,
    data = helminths_df,
    family = binomial)

In [None]:
summary(anaemia_agegrp_linear_glm)

In [None]:
exp(cbind(OR = coef(anaemia_agegrp_linear_glm), confint(anaemia_agegrp_linear_glm)))

Note that only one parameter is required for age when a linear effect is assumed, so we have a simpler model that describes the same association. A simpler, or more parsimonious, model can be desirable when, for example, you have a limited number of observations, or in the case of particular analysis strategies. We will discuss these issues in more detail in SM12 ‘Strategies of Analysis’.
 
The odds ratio estimate for the linear effect of age-group is 0.74. How do we interpret this?

This is the odds ratio from one level to the next i.e. the common odds ratio for a unit increase in `agegrp`. This depends, of course, on the way in which the age-group categories are defined, and when reporting a linear effect in categories these definitions should be stated clearly. The odds ratios for each age-group compared to the youngest age-group (<20 years) assuming a linear effect are shown below. The estimates for separate effects are also shown.

| Age-group (years) | Linear effect ORs | Separate ORs |
|-------------------|-------------------|--------------|
| 20-24             | 0.74¹ = 0.74      | 0.59         |
| 25-29             | 0.74² = 0.55      | 0.42         |
| 30+               | 0.74³ = 0.41      | 0.48         |
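The linear-effect column of the table above can be reproduced with a few lines of `R`. This is a minimal sketch: it takes the common odds ratio of 0.74 reported in the text as given rather than re-estimating it, and the object names `common_or`, `steps` and `linear_ors` are introduced here for illustration only.

```r
# Common odds ratio for a one-category increase (value reported in the text)
common_or <- 0.74

# Number of categories above the baseline age-group (<20 years)
steps <- 1:3

# Under a linear effect, the OR for k categories above baseline is common_or^k
linear_ors <- round(common_or^steps, 2)
names(linear_ors) <- c("20-24", "25-29", "30+")
linear_ors
```

This returns 0.74, 0.55 and 0.41, matching the linear-effect odds ratios in the table.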

The odds ratio estimates of the linear effect and the separate effects are somewhat different.  If we can assume a linear trend, then the model is simpler.  However, we must first formally assess whether the separate age-group effects provide a better model for the data, i.e. test whether there is departure from a linear trend in the separate effects of `agegrp`.

## Testing the linear assumption  

Formally this is called a test of ‘departure from linear trend’. To test for departure we compare the model assuming a linear trend (OR = 0.74) to the model with separate age-group effects, using a likelihood ratio test.  

Both models have already been fitted above: the model with the most parameters (the one which models the separate effects) and the model that assumes a linear effect. The likelihood ratio test compares the log likelihoods of these two models.

To do this, type: 

In [None]:
library(lmtest)
lrtest(anaemia_agegrp_glm, anaemia_agegrp_linear_glm)

The null hypothesis of this test is that the association between age-group and severe anaemia is linear, or, more formally, that there is no difference in the goodness of fit of the two models assuming linear trend and estimating separate effects for each category. The result of this test is P=0.066, which provides some evidence against the null hypothesis. In other words we can say that there is some evidence (albeit not strong) that including a separate effect for each age group improves the fit of the model and that the linear effect may not sufficiently describe the data. Therefore we may want to model separate effects for each age group rather than assuming a linear trend in the categories. On the other hand, since the evidence against the null hypothesis of a linear trend is fairly weak, we could, for simplicity, assume a linear trend and use this simpler modelling approach.
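To make the mechanics of the test concrete, the sketch below carries out the same kind of comparison by hand using only base `R`. It uses simulated data, not the helminths dataset: the data frame `toy_df`, its variables `grp` and `y`, and the simulation settings are all hypothetical, so the numbers it produces say nothing about the study results.

```r
# Simulated data for illustration only (hypothetical, not the helminths data)
set.seed(1)
n <- 400
toy_df <- data.frame(grp = sample(0:3, n, replace = TRUE))
toy_df$y <- rbinom(n, size = 1, prob = plogis(-0.5 - 0.3 * toy_df$grp))

# Separate effects (three parameters for grp) and linear trend (one parameter)
toy_factor_glm <- glm(y ~ as.factor(grp), data = toy_df, family = binomial)
toy_linear_glm <- glm(y ~ grp, data = toy_df, family = binomial)

# Likelihood ratio statistic: twice the difference in log likelihoods
lr_stat <- as.numeric(2 * (logLik(toy_factor_glm) - logLik(toy_linear_glm)))

# Degrees of freedom: difference in the number of estimated parameters
lr_df <- attr(logLik(toy_factor_glm), "df") - attr(logLik(toy_linear_glm), "df")

# P-value from the chi-squared distribution
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = lr_p)
```

The test has two degrees of freedom because the separate-effects model uses three parameters for the four categories while the linear model uses one. The same statistic can be obtained with `anova(toy_linear_glm, toy_factor_glm, test = "Chisq")`.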


## Key points  
* For quantitative exposures that have been grouped into categories a linear effect is preferable to separate effects for each category, but only if modelling with separate effects does not improve the fit of the model.   
 
* The estimate for a linear effect can be interpreted as the OR (or increase in log odds) for a unit increase in the variable. A unit increase may be an increase of one category, for a grouped categorical variable, or an increase of one unit if the variable is on its original scale (e.g. age in years). 

## Review exercise  

Now try to carry out the same analysis on your own. For this exercise you should use the mwanza dataset which refers to a case-control study of HIV infection. The solutions are given in Section 4. 

1) Using the mwanza data, produce a table of HIV infection and number of injections in the past year.  

**inj** is the variable name for injections in the past year  
   coded: 1=none, 2=1, 3=2-4, 4=5-9, 5=10+, 9=missing 
 


In [None]:
mwanza_df <- read_dta("Data_files-20211113/MWANZA.dta")

In [None]:
table(mwanza_df$case, mwanza_df$inj)

2) Use tabodds to calculate the odds of infection for each level of number of injections.

There is no good replacement in `R` for Stata's `tabodds` command, so we'll do it by hand:

In [None]:
(mwanza_injections_table <- 
    table(mwanza_df$inj, mwanza_df$case))

In [None]:
mwanza_injections_odds <- 
    mwanza_injections_table[, 2] / mwanza_injections_table[, 1]

In [None]:
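# Standard error of the log odds at each level: sqrt(1/cases + 1/controls)
mwanza_injections_se <- sqrt((1 / mwanza_injections_table[, 2]) +
    (1 / mwanza_injections_table[, 1]))
# Error factor for a 95% confidence interval
mwanza_injections_ef <- exp(1.96 * mwanza_injections_se)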

In [None]:
mwanza_injections_lower <- mwanza_injections_odds / mwanza_injections_ef
mwanza_injections_upper <- mwanza_injections_odds * mwanza_injections_ef

In [None]:
# Combine counts, odds and confidence limits into one data frame
mwanza_injections_df <- data.frame(cbind(mwanza_injections_table,
    mwanza_injections_odds,
    mwanza_injections_lower,
    mwanza_injections_upper))
names(mwanza_injections_df) <- c("controls", "cases", "odds", "[95% Conf.", "Interval]")

In [None]:
mwanza_injections_df
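For reference, the standard calculation behind a table of this kind can be written as follows. The notation is introduced here for illustration: $d_i$ is the number of cases and $h_i$ the number of controls at level $i$ of `inj`, and $EF_i$ is the error factor for an approximate 95% confidence interval.

$$\text{odds}_i = \frac{d_i}{h_i}, \qquad SE(\log \text{odds}_i) = \sqrt{\frac{1}{d_i} + \frac{1}{h_i}}, \qquad EF_i = \exp\left(1.96 \times SE(\log \text{odds}_i)\right)$$

$$95\%\ \text{CI}: \quad \frac{\text{odds}_i}{EF_i} \ \text{ to } \ \text{odds}_i \times EF_i$$

Note that the standard error is calculated separately for each level, using the counts in that level only.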

3) Produce a logistic model that assumes a linear effect of number of injections in the past year (remember to exclude the missing category). 

    What is the common odds ratio from the model?

In [None]:
# Linear effect of injections, excluding the missing category (inj == 9)
mwanza_inj_linear_glm <-
glm(case ~ inj,
    data = mwanza_df,
    subset = inj != 9,
    family = binomial)

4) Use a likelihood ratio test to test for departure from a linear trend.