# Practical 7

## Aim

To learn how to carry out a simple logistic regression analysis.

In [None]:
library(tidyverse)

## Reading in the dataset and identifying relevant variables

In this practical session we will use a dataset from the study of helminths in Uganda.

To read in the dataset, type:

In [None]:
library(haven)

In [None]:
helminths_df <- read_dta("Data_files-20211113/helminths.dta")

In [None]:
head(helminths_df)

In this analysis we will work with the variable representing hookworm infection. It is currently called hk_bin. To make this clearer, we will copy it into a new variable with a more descriptive name. Type:

In [None]:
helminths_df_2 <- helminths_df %>%
    mutate(hookworm = hk_bin)

This has created the variable hookworm as a copy of hk_bin (the original hk_bin column is still present in the dataset).
 
In this analysis we will look at the association between severe anaemia and exposure to hookworm infection.  We will also consider how (if at all) the association changes with age and malaria infection status. 
 
**anaemic_sev** is the variable name for severe anaemia  
    coded: 0=no, 1=yes

**hookworm** is the variable name for hookworm infection status  
    coded: 0=uninfected, 1=infected

**agegrp** is the variable name for age-group  
    coded: 0=<20, 1=20-24, 2=25-29, 3=30+

**malaria** is the variable name for malaria infection status  
    coded: 0=uninfected, 1=infected

To produce frequency distributions for anaemic_sev, hookworm and agegrp, use `CrossTable` from the package `gmodels`. Type:

In [None]:
library(gmodels)

In [None]:
CrossTable(helminths_df_2$anaemic_sev)

There were 275 women with severe anaemia.

In [None]:
CrossTable(helminths_df_2$hookworm)

1,022 women were hookworm infected and 1,395 were not infected with hookworm. 

In [None]:
CrossTable(helminths_df_2$agegrp)

There were 607 women aged <20 years; 906 women aged 20 to 24 years; 545 women aged 25 to 29 years; and 359 women aged 30+ years.

## Testing for an association

For an initial examination of the association between severe anaemia and hookworm 
infection use the `CrossTable` command.  Type: 

In [None]:
CrossTable(helminths_df_2$anaemic_sev, helminths_df_2$hookworm,
prop.r = FALSE, prop.c = TRUE, chisq = TRUE)

From the table we can see that 17.7% of women infected with hookworm had severe anaemia, compared to 6.7% of women who were uninfected. There is very strong evidence (P<0.001) against the null hypothesis of *no association between severe anaemia and hookworm infection.*

To examine the odds of severe anaemia by hookworm infection status, there is no good replacement for Stata's `tabodds` command, so we'll do it by hand:

Create a table

In [None]:
hookworm_anaemia_table <- 
    table(helminths_df_2$hookworm, helminths_df_2$anaemic_sev)

Calculate the odds

In [None]:
hookworm_anaemia_odds <- 
    hookworm_anaemia_table[, 2] / hookworm_anaemia_table[, 1]

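Calculate the standard error of the log odds in each group, and the corresponding 95% error factor

In [None]:
# SE of log(odds) is computed per row (per hookworm group): sqrt(1/cases + 1/controls)
hookworm_anaemia_se <- sqrt((1 / hookworm_anaemia_table[, 2]) +
    (1 / hookworm_anaemia_table[, 1]))
hookworm_anaemia_ef <- exp(1.96 * hookworm_anaemia_se)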


Calculate upper and lower 95% confidence interval bounds

In [None]:
hookworm_anaemia_lower <- hookworm_anaemia_odds / hookworm_anaemia_ef
hookworm_anaemia_upper <- hookworm_anaemia_odds * hookworm_anaemia_ef
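
As a self-contained illustration of the error-factor method, the sketch below applies it to a small made-up 2x2 table. The counts and the `toy_` names are hypothetical and are not taken from the helminths data. It assumes the usual large-sample standard error of a log odds, sqrt(1/cases + 1/controls), computed separately for each exposure group.

```r
# Hypothetical counts, for illustration only (rows = exposure, columns = outcome)
toy_table <- matrix(c(90, 10,
                      60, 40),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(exposure = c("0", "1"),
                                    outcome = c("0", "1")))

toy_odds <- toy_table[, "1"] / toy_table[, "0"]              # cases / controls
toy_se <- sqrt(1 / toy_table[, "1"] + 1 / toy_table[, "0"])  # SE of log(odds), per row
toy_ef <- exp(1.96 * toy_se)                                 # 95% error factor

toy_result <- data.frame(odds = toy_odds,
                         lower = toy_odds / toy_ef,
                         upper = toy_odds * toy_ef)
toy_result
```

Dividing and multiplying by the error factor gives an interval that is symmetric on the log scale, which is why the limits are not equidistant from the odds estimate.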

Bind them together into a data frame and give it readable names

In [None]:
tibble(hookworm_anaemia_table,
hookworm_anaemia_odds,
hookworm_anaemia_lower,
hookworm_anaemia_upper)

In [None]:
# stringsAsFactors belongs to data.frame(), not cbind(); inside cbind() it would add an extra column
hookworm_anaemia_df <- data.frame(cbind(hookworm_anaemia_table,
    hookworm_anaemia_odds,
    hookworm_anaemia_lower,
    hookworm_anaemia_upper),
    stringsAsFactors = FALSE)
names(hookworm_anaemia_df) <- c("controls", "cases", "odds", "[95% Conf.", "Interval]")

Now see the output

In [None]:
hookworm_anaemia_df

And test for homogeneity

In [None]:
table(helminths_df_2$hookworm, helminths_df_2$anaemic_sev) %>%
    chisq.test()

We can see that the odds of severe anaemia are greater among hookworm infected 
women, and that the P-value (P<0.001) provides very strong evidence against the null 
hypothesis of no difference in odds of severe anaemia by hookworm infection status. Therefore we can conclude that the underlying ‘true’ odds of severe anaemia are greater in hookworm infected women than in uninfected women.

Use `epi.2by2` from the package `epiR` to obtain an odds ratio estimate. Type:

In [None]:
library("epiR")

In [None]:
epi.2by2(table(factor(helminths_df_2$hookworm, levels = c(1, 0)),
         factor(helminths_df_2$anaemic_sev, levels = c(1, 0))),
         method = "cross.sectional", digits = 2)

Therefore, the odds of severe anaemia in hookworm infected women are 2.98 times that 
in hookworm uninfected women (95% CI 2.29 to 3.88, P<0.001).
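
For reference, an odds ratio and an approximate 95% confidence interval can also be computed by hand from the four cells of a 2x2 table. The sketch below uses hypothetical counts (not the study data) and the standard large-sample formula for the standard error of log(OR); it is not claimed to be the exact method `epi.2by2` uses internally.

```r
# Hypothetical counts, for illustration only
exposed_cases <- 40
exposed_noncases <- 60
unexposed_cases <- 10
unexposed_noncases <- 90

# Odds ratio = odds in exposed / odds in unexposed
or_hat <- (exposed_cases / exposed_noncases) /
    (unexposed_cases / unexposed_noncases)

# Large-sample SE of log(OR): square root of the sum of reciprocal cell counts
se_log_or <- sqrt(1 / exposed_cases + 1 / exposed_noncases +
    1 / unexposed_cases + 1 / unexposed_noncases)

or_ci <- exp(log(or_hat) + c(-1, 1) * 1.96 * se_log_or)
c(OR = or_hat, lower = or_ci[1], upper = or_ci[2])
```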

## Logistic regression with one binary exposure 

Now let’s reproduce the result using logistic regression. To obtain a logistic model on a log scale we will use the `glm` command.  The `glm` command gives the parameter estimates for log odds.

The first model we will fit is:
    
    log(odds) = constant + hookworm

In [None]:
anaemia_hookworm_glm <- 
    glm(anaemic_sev ~ hookworm, data = helminths_df_2, family = "binomial")

In [None]:

summary(anaemia_hookworm_glm)

We can test the null hypothesis that *none of the variables in the model are 
associated with the outcome* using `lrtest` from the package `lmtest`. When given a single model, `lrtest` compares it with the model containing only the constant.

In [None]:
library(lmtest)

In [None]:
lrtest(anaemia_hookworm_glm)

In this instance there is only one variable (hookworm) in the model, so this is a test of the null hypothesis that hookworm is not associated with severe anaemia. Now return to the model output:

In [None]:
summary(anaemia_hookworm_glm)

Now consider the values in the table.  The first column gives the parameter names for each row of estimates in the table.  The model corresponds to: 

    log(odds) = constant + hookworm

The values in the Estimate column represent the log(OR) for the effect of hookworm infection (1.0915) and the constant log(odds) in the uninfected group (-2.6276).

    log(odds) = -2.628 + (1.091 × hookworm)
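
As a quick check on this equation, the fitted log(odds) can be converted back to predicted proportions with the inverse logit, `plogis()`. The sketch below uses only the rounded coefficients quoted above; the variable names are illustrative.

```r
log_odds_uninfected <- -2.628          # constant: log(odds) when hookworm = 0
log_odds_infected <- -2.628 + 1.091    # constant + hookworm coefficient

# plogis(x) is the inverse logit: exp(x) / (1 + exp(x))
p_uninfected <- plogis(log_odds_uninfected)
p_infected <- plogis(log_odds_infected)

round(100 * c(uninfected = p_uninfected, infected = p_infected), 1)
```

The resulting percentages agree with the 6.7% and 17.7% seen in the cross-tabulation, as expected for a model with a single binary exposure.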

The third column gives the standard error for the model coefficients. These are then used to calculate z, the Wald test statistic, and the corresponding p-value. Consider the estimates for hookworm: what can we conclude from these values about the effect of hookworm infection on odds of severe anaemia?

    Wald test:  z = 8.11,   P<0.001

There is very strong evidence (P<0.001) against the null hypothesis of no association between severe anaemia and hookworm infection.
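
The Wald statistic is the coefficient divided by its standard error, and its P-value comes from the standard normal distribution. A minimal sketch of the P-value calculation, using the z value quoted above:

```r
z <- 8.11                       # Wald statistic for hookworm, from summary()
p_value <- 2 * pnorm(-abs(z))   # two-sided P-value from the standard normal
p_value < 0.001
```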

To obtain the 95% confidence intervals, use `confint`

In [None]:
confint(anaemia_hookworm_glm)

The 95% CI excludes the value 0 (log(OR)=0 corresponds to OR=1). We can interpret it as being consistent with the ‘true’ log(OR) lying between 0.83 and 1.36, i.e. with the log(odds) of severe anaemia being between 0.83 and 1.36 higher in hookworm infected women than in uninfected women.

To obtain the OR estimate for the effect of hookworm infection we take the exponential of the coefficient

In [None]:
exp(coef(anaemia_hookworm_glm))

i.e. exp(1.091) = 2.98

The log scale is preferable for explaining the model estimates and how 95% confidence intervals and Wald tests are derived. However, `R` allows us to automatically obtain estimates on the odds ratio scale, which is convenient for reporting results.

In [None]:
exp(cbind(OR = coef(anaemia_hookworm_glm), confint(anaemia_hookworm_glm)))
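
To make the link between the two scales explicit, the sketch below rebuilds a Wald-type interval by hand and then exponentiates it. The standard error is not quoted in the text, so it is back-calculated here from the coefficient and z value (an assumption made for illustration only).

```r
log_or <- 1.0915                 # hookworm coefficient, log(OR)
z <- 8.11                        # Wald statistic
se <- log_or / z                 # back-calculated standard error (assumption)

# Wald 95% CI on the log scale: estimate +/- 1.96 * SE
ci_log <- log_or + c(-1, 1) * qnorm(0.975) * se
round(ci_log, 2)

# Exponentiate the estimate and both limits to move to the odds ratio scale
round(exp(c(OR = log_or, lower = ci_log[1], upper = ci_log[2])), 2)
```

To rounding, this reproduces the odds ratio of 2.98 and the interval 2.29 to 3.88 reported earlier. Note that `confint()` applied to a `glm` fit uses profile likelihood rather than the Wald formula, so its limits can differ slightly; `confint.default()` gives the Wald version.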

## Exposures with more than 2 levels

Use the `CrossTable` command to examine the association between age-group and severe 
anaemia.  Type:  

In [None]:
CrossTable(helminths_df_2$anaemic_sev, helminths_df_2$agegrp,
prop.r = FALSE, prop.c = TRUE, chisq = TRUE)

The prevalence of severe anaemia decreases from 17% in the <20 year age-group to 9% in the 30+ years age-group.

In order to inform `R` that a variable is categorical we need to convert it to a factor, using the `as.factor()` function. To produce a model for age-group on the log(odds) scale, type:

In [None]:
anaemia_agegrp_glm <-
    glm(anaemic_sev ~ as.factor(agegrp),
        data = helminths_df_2,
        family = binomial)

In [None]:
summary(anaemia_agegrp_glm)

In [None]:
exp(cbind(OR = coef(anaemia_agegrp_glm), confint(anaemia_agegrp_glm)))

Note that there are three odds ratios each of which refers to the same baseline group 
(those aged <20 years): 

* the odds ratio is 0.59 for those aged 20-24 compared to those aged <20 years
* the odds ratio is 0.42 for those aged 25-29 compared to those aged <20 years
* and the odds ratio is 0.48 for those aged 30+ compared to those aged <20 years.
 
There are three Wald test P-values (one for each odds ratio) which test the null hypothesis that the log(odds) of severe anaemia in that age category are the same as the log(odds) of severe anaemia in the <20 years age category (or equivalently that OR = 1). In this example, the tests for all three odds ratios provide very strong evidence against the null hypothesis (P<0.001 or P=0.001) and we conclude that the odds of severe anaemia in each age category differ from the odds of severe anaemia in the <20 age group. 
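
To see what `as.factor()` does inside the model formula, the sketch below builds the design matrix for four illustrative observations, one per age-group code (toy values, not the study data). Each non-baseline level gets its own 0/1 indicator column, which is why the model has three age-group coefficients. The second call uses `relevel()` to show how a different baseline could be chosen.

```r
agegrp <- c(0, 1, 2, 3)   # one illustrative observation per age-group code

# Default coding: the lowest level (0, i.e. <20 years) is the baseline
design <- model.matrix(~ as.factor(agegrp))
design

# Same variable with the 30+ group (code 3) as the baseline instead
design_ref3 <- model.matrix(~ relevel(as.factor(agegrp), ref = "3"))
design_ref3
```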

In [None]:
lrtest(anaemia_agegrp_glm)

The likelihood ratio statistic is 26.65 on 3 degrees of freedom (P<0.001).  Note that 
there is only one likelihood ratio test P-value.  This tests the agegrp variable as a whole, 
by simultaneously testing the null hypotheses for the three parameters in the model i.e.  

H0[1]: log(OR) = 0 for agegrp 1 vs agegrp 0, AND 

H0[2]: log(OR) = 0 for agegrp 2 vs agegrp 0, AND 

H0[3]: log(OR) = 0 for agegrp 3 vs agegrp 0 
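
The P-value for this likelihood ratio statistic can be checked directly from the $\chi^2$ distribution with `pchisq`, using the statistic (26.65) and degrees of freedom (3) reported above:

```r
lrs <- 26.65    # likelihood ratio statistic reported by lrtest()
lrs_df <- 3     # three age-group parameters are tested together

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(lrs, df = lrs_df, lower.tail = FALSE)
p_value < 0.001
```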

Note: Since hookworm is a binary variable coded as 1 and 0, it makes no difference if it 
is used as a factor. Check this by typing the two commands below and 
comparing them:

In [None]:
glm(anaemic_sev ~ hookworm,
    data = helminths_df_2,
    family = binomial)

In [None]:
glm(anaemic_sev ~ as.factor(hookworm),
    data = helminths_df_2,
    family = binomial)
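
The same check can be made programmatically. The sketch below uses simulated data (the data frame, variable names and true coefficients are all invented for illustration) and compares the coefficients from the two codings with `all.equal()`.

```r
set.seed(3)
n <- 300
sim_df <- data.frame(exposure = rbinom(n, 1, 0.4))
# Simulated outcome with an arbitrary, assumed true log(OR) of 1
sim_df$outcome <- rbinom(n, 1, plogis(-2 + 1 * sim_df$exposure))

fit_numeric <- glm(outcome ~ exposure, data = sim_df, family = binomial)
fit_factor <- glm(outcome ~ as.factor(exposure), data = sim_df, family = binomial)

# Coefficient names differ, so compare the values only
all.equal(unname(coef(fit_numeric)), unname(coef(fit_factor)))
```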

## Likelihood ratio test

In this section we will carry out a likelihood ratio test in `R`. Remember, a likelihood ratio test compares a model with parameter(s) of interest to a model without parameter(s) of interest, to assess the contribution of a variable (or parameter) to the model. 

As noted in CAL session 7, the likelihood ratio statistic (LRS) is calculated by using the difference between L1 (the log likelihood when the exposure variable is included in the model) and L0 (the log likelihood when the variable is excluded from the model): 

LRS=2(L1 - L0) 

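We then refer the LRS to the $\chi^2$ distribution, with degrees of freedom equal to the number of parameters dropped from the model when the exposure variable is excluded. For a binary outcome and an exposure with r categories this is r - 1, which matches the (r - 1)(c - 1) degrees of freedom of the corresponding r × c table (here c = 2).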
 
To compare the log likelihood from a model with agegrp (L1) and the log likelihood from the model without agegrp (L0), there are 3 steps involved: 

1. fit the first model and save L1 
2. fit the second model and save L0  
3. compare L0 to L1 
 
In `R` we type:

In [None]:
A <- glm(anaemic_sev ~ as.factor(agegrp),
    data = helminths_df_2,
    family = binomial)

B <- glm(anaemic_sev ~ 1,
    data = helminths_df_2,
    family = binomial)

lrtest(A, B)

This is the likelihood ratio test to assess the contribution of agegrp to the model, i.e. it is testing the null hypothesis that adding the agegrp variable does not improve the fit of the model to the data. In fact there is very strong evidence against the null hypothesis (P<0.0001), so we conclude that the model with the agegrp variable is a better model (i.e. a closer fit to the data).
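
The three steps can also be carried out by hand, which shows where the statistic LRS = 2(L1 - L0) comes from. The sketch below uses simulated data (all names and the true effect sizes are assumptions for illustration) and base R only.

```r
set.seed(1)
n <- 400
sim_df <- data.frame(agegrp = sample(0:3, n, replace = TRUE))
# Simulated binary outcome; the true log odds are chosen arbitrarily
sim_df$outcome <- rbinom(n, 1, plogis(-1.5 - 0.3 * sim_df$agegrp))

# Steps 1 and 2: fit both models and save their log likelihoods
fit_with <- glm(outcome ~ as.factor(agegrp), data = sim_df, family = binomial)
fit_without <- glm(outcome ~ 1, data = sim_df, family = binomial)
l1 <- as.numeric(logLik(fit_with))
l0 <- as.numeric(logLik(fit_without))

# Step 3: compare them
lrs <- 2 * (l1 - l0)
lrs_df <- length(coef(fit_with)) - length(coef(fit_without))  # parameters dropped
p_value <- pchisq(lrs, df = lrs_df, lower.tail = FALSE)
c(LRS = lrs, df = lrs_df, P = p_value)
```

For models fitted to individual binary outcomes, the same statistic equals the difference in residual deviance between the two fits.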
 
Note: The two models being compared in the LRT must be fitted on exactly the same data. This may not happen if some observations have missing values for the variable being tested, in which case `lrtest` would complain that the models were not fitted to the same size of dataset. This needs to be handled by ensuring that both models are fitted on exactly the same subset of records in the dataset, e.g. in the above example the command in step 2 could be changed to:

In [None]:
glm(anaemic_sev ~ 1,
    data = helminths_df_2 %>%
        filter(!is.na(agegrp)),
    family = binomial)
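
A simple way to spot this problem is to compare the number of observations each model actually used, with `nobs()`. The sketch below uses simulated data with some exposure values deliberately set to missing; the data and names are invented for illustration.

```r
set.seed(2)
n <- 200
sim_df <- data.frame(outcome = rbinom(n, 1, 0.3),
                     agegrp = sample(0:3, n, replace = TRUE))
sim_df$agegrp[1:10] <- NA   # introduce 10 missing exposure values

# glm() silently drops rows with missing values in any model variable
fit_with <- glm(outcome ~ as.factor(agegrp), data = sim_df, family = binomial)
fit_null_all <- glm(outcome ~ 1, data = sim_df, family = binomial)

# Restrict the null model to records with a non-missing exposure
fit_null_cc <- glm(outcome ~ 1,
                   data = subset(sim_df, !is.na(agegrp)),
                   family = binomial)

c(with = nobs(fit_with), null_all = nobs(fit_null_all), null_cc = nobs(fit_null_cc))
```

Only `fit_with` and `fit_null_cc` are based on the same records, so only that pair gives a valid likelihood ratio test.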

## Logistic regression with more than one exposure

We can now look at how to fit a model with two exposures. We will produce a logistic regression to obtain the odds ratios, confidence intervals and likelihood ratio statistics for the following effects on the odds of severe anaemia:

* the effect of hookworm controlling for agegrp, and
* the effect of agegrp controlling for hookworm.
 
To display the odds ratio estimates, type:

In [None]:
anaemia_hookworm_agegrp_glm <-
    glm(anaemic_sev ~ hookworm + as.factor(agegrp),
        data = helminths_df_2,
        family = binomial)

In [None]:
exp(cbind(OR = coef(anaemia_hookworm_agegrp_glm), confint(anaemia_hookworm_agegrp_glm)))

Have the coefficients changed from those in the models with each variable alone? 
 
In the model with only hookworm, the OR estimate for the effect of hookworm infection was 2.98. After controlling for age group the OR decreased to 2.82, so age slightly confounded the effect of hookworm infection. The OR estimates for age group without adjusting for hookworm infection were 0.59, 0.42 and 0.48, and these also change slightly after controlling for hookworm infection.

In the crude (unadjusted) analysis, hookworm infection is a risk factor for severe anaemia and older age is protective against it (in other words, younger age is a risk factor). The estimated association between hookworm infection and severe anaemia weakens when controlled for age group. Similarly, the estimated effects of the two older age groups get weaker (closer to 1) when controlled for hookworm infection.
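
One simple way to summarise how much the estimate moved is the percentage change from the crude to the adjusted odds ratio. The sketch below uses the two values quoted above; treating this percentage as a measure of confounding is a common rule of thumb rather than part of the analysis in this practical.

```r
or_crude <- 2.98       # hookworm OR from the model with hookworm only
or_adjusted <- 2.82    # hookworm OR after adjusting for age group

pct_change <- 100 * (or_adjusted - or_crude) / or_crude
round(pct_change, 1)
```

A change of roughly 5% is small, consistent with the description of age as only a slight confounder here.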

We will now use the likelihood ratio test  

1. To test hookworm adjusted for agegrp 
2. To test agegrp adjusted for hookworm. 
 
For the likelihood ratio test for hookworm, type: 

In [None]:
A <- glm(anaemic_sev ~ hookworm + as.factor(agegrp),
        data = helminths_df_2,
        family = binomial)

B <- glm(anaemic_sev ~ as.factor(agegrp),
        data = helminths_df_2,
        family = binomial)

lrtest(A, B)

Therefore, after adjusting for `agegrp`, `hookworm` still provides an important contribution to the model fit. In other words, after adjusting for age there is still very strong evidence of a difference in odds of severe anaemia between hookworm uninfected and infected women. 

For the likelihood ratio test for agegrp, type:

In [None]:
A <- glm(anaemic_sev ~ hookworm + as.factor(agegrp),
        data = helminths_df_2,
        family = binomial)

B <- glm(anaemic_sev ~ hookworm,
        data = helminths_df_2,
        family = binomial)

lrtest(A, B)

Therefore, after adjusting for `hookworm`, there is still very strong evidence for a difference in odds of severe anaemia between age groups.

## Review exercise 

Now try to carry out the same analyses on your own. For this exercise you should use 
the helminths dataset. The solutions are given in Section 4.

### 1) Examine the association between severe anaemia (anaemic_sev) and malaria using the `CrossTable` and `glm` commands.

In [None]:
CrossTable(helminths_df_2$anaemic_sev, helminths_df_2$malaria)

In [None]:
anaemia_malaria_glm <-
    glm(anaemic_sev ~ malaria,
        data = helminths_df_2,
        family = binomial)

In [None]:
summary(anaemia_malaria_glm)

#### Does it make any difference if you use `as.factor(malaria)` instead of `malaria` in the `glm` command?

In [None]:
anaemia_malaria_glm_2 <-
    glm(anaemic_sev ~ as.factor(malaria),
        data = helminths_df_2,
        family = binomial)

In [None]:
summary(anaemia_malaria_glm_2)

There is no difference, since malaria is coded as 0 or 1.

#### What is the OR estimate for malaria?

In [None]:
exp(cbind(OR = coef(anaemia_malaria_glm), confint(anaemia_malaria_glm)))

Odds ratio of 3.34

#### Is severe anaemia associated with malaria?

There is good evidence against the null hypothesis of no association.

### 2) Does the association between malaria and severe anaemia change when you control for the effects of hookworm and agegrp? 

In [None]:
anaemia_malaria_hookworm_agegrp_glm <-
    glm(anaemic_sev ~ malaria + hookworm + as.factor(agegrp),
        data = helminths_df_2,
        family = binomial)

In [None]:
summary(anaemia_malaria_hookworm_agegrp_glm)

In [None]:
exp(cbind(OR = coef(anaemia_malaria_hookworm_agegrp_glm),
    confint(anaemia_malaria_hookworm_agegrp_glm)))

The association remains significant with a slightly smaller odds ratio, so there is some confounding by hookworm and age.

### 3) Carry out a likelihood ratio test to assess whether malaria should be included in the model with hookworm and agegrp.

In [None]:
lrtest(anaemia_hookworm_agegrp_glm, anaemia_malaria_hookworm_agegrp_glm)

Malaria significantly improves the model.