# Case-control studies

## Aim

To learn how to analyse case-control data, obtain crude and adjusted estimates and test for trend with increasing exposures.

In [None]:
library(tidyverse)

## Reading in the dataset and identifying relevant variables

This practical session uses the dataset from Mwanza, Tanzania, on HIV infection among women.

To read in the dataset, type:

In [None]:
library(haven)

In [None]:
mwanza_df <- read_dta("Data_files-20211113/MWANZA.dta")

In [None]:
head(mwanza_df)

We will look at the association between HIV infection and exposure to formal education, number of sexual partners and religion.

**case** is the variable name for HIV infection coded: 1=case; 0=control

**age1** is a grouped age variable coded: 1=15-19, 2=20-24, 3=25-29, 4=30-34, 5=35-44, 6=45-54 years

**ed** is the variable name for level of education coded: 1=no formal education (none/adult only), 2=1-3 years, 3=4-6 years, 4=7+ years

**npa** is the variable name for number of sexual partners ever coded: 1=0-1, 2=2-4, 3=5-9, 4=10-19, 5=20-49, 6=50+, 9=missing

**rel** is the variable name for type of religion coded: 1=Moslem, 2=Catholic, 3=Protestant, 4=other, 9=missing

To examine how many cases and controls there are in the dataset, type:

In [None]:
library(gmodels)


In [None]:
CrossTable(mwanza_df$case)

To look at exposure to formal education, create a new variable `ed2` which takes the value 1 for women with no formal education and the value 2 for those with some education. Type:

In [None]:
mwanza_df_2 <- mwanza_df %>%
    mutate(ed2 = case_when(ed == 1 ~ 1,
                           ed > 1 ~ 2))

To check that the new variable has been coded correctly, tabulate it against the original variable. Type:

In [None]:
CrossTable(mwanza_df_2$ed, mwanza_df_2$ed2)

Similarly for age, recode age1 to a new variable age2 with the 4 categories: 1 = 15-19, 2 = 20-29, 3 = 30-44, 4 = 45+ years. Type:

In [None]:
mwanza_df_3 <- mwanza_df_2 %>%
    mutate(age2 = case_when(age1 == 1 ~ 1,
                            age1 < 4 ~ 2,
                            age1 < 6 ~ 3,
                            age1 == 6 ~ 4))

Again we should tabulate the old variable against the new variable to check the coding is correct.

In [None]:
CrossTable(mwanza_df_3$age1, mwanza_df_3$age2)

## Crude odds ratio estimate

To examine the relationship between being a case and formal education, type:

In [None]:
library("epiR")

In [None]:
# Case-control design: request case-control measures (odds ratio), not cohort measures
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0, 1)),
         factor(mwanza_df_3$ed2, levels = c(1, 2))),
         method = "case.control", digits = 2)

The P-value shows very strong evidence against the null hypothesis of no association.

Note: We should examine the row percentages because column percentages are affected by the different probabilities of selection for cases and controls.

We must be clear about which variable we are treating as the exposure and which category is a case in our interpretation of the table. Examine the table above. What is the proportion of cases with some formal education?

There are 140/189 cases with some formal education, i.e. 74.1%.
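As a minimal sketch of how row percentages can be computed directly, the block below applies `prop.table` to a small table. The counts are hypothetical and are not taken from the Mwanza data; rows stand for case status and columns for exposure category.

```r
# Hypothetical counts (rows: case status, columns: exposure category).
tab <- matrix(c(300, 200,
                 50, 150),
              nrow = 2, byrow = TRUE,
              dimnames = list(case = c("0", "1"),
                              exposure = c("1", "2")))

# margin = 1 divides each cell by its row total, giving row percentages.
row_pct <- 100 * prop.table(tab, margin = 1)
round(row_pct, 1)
```

Each row sums to 100%, so the figures describe the distribution of exposure within controls and within cases separately.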

To produce an odds ratio for exposure to formal education we can use the `epi.2by2` command. Try the following command first:

In [None]:
# Level 2 (some education) listed first for ed2; case-control measures requested
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0, 1)),
         factor(mwanza_df_3$ed2, levels = c(2, 1))),
         method = "case.control", digits = 2)

Now change the baseline for ed2. Type:

In [None]:
# Level 1 (no formal education) listed first for ed2; case-control measures requested
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0, 1)),
         factor(mwanza_df_3$ed2, levels = c(1, 2))),
         method = "case.control", digits = 2)

The first version of the command takes level 2 (some education) as baseline. Thus, the odds ratio is 1.0 divided by the odds ratio from the second version, which uses level 1 (no formal education) as the baseline. It is important to know which level is the baseline in our interpretation of the odds ratio.
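The reciprocal relationship can be checked by hand. The sketch below uses hypothetical counts (not the Mwanza data) and a small helper, `odds_ratio`, which is our own illustration and not part of `epiR`.

```r
# Hypothetical counts, not taken from the Mwanza data.
tab <- matrix(c(40, 60,    # controls: first level, second level
                70, 30),   # cases:    first level, second level
              nrow = 2, byrow = TRUE,
              dimnames = list(case = c("0", "1"),
                              exposure = c("level_a", "level_b")))

# Cross-product odds ratio for a 2x2 table (illustrative helper).
odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

or_original <- odds_ratio(tab)             # columns in the original order
or_swapped  <- odds_ratio(tab[, c(2, 1)])  # columns swapped, so the baseline changes

# The two estimates are reciprocals, so their product is 1.
c(original = or_original, swapped = or_swapped,
  product = or_original * or_swapped)
```

Swapping the column order only exchanges the numerator and denominator of the cross-product, which is why the two odds ratios multiply to 1.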

## Adjusted odds ratio estimates

Now let’s examine the effect of education on HIV infection adjusted for age. To produce tables of case status by education stratified by age, we can use the
`xtabs` function. Type:

In [None]:
xtabs(data = mwanza_df_3,
      formula = ~ case + ed2 + age2) %>%
    addmargins

Note: These tables have cases with the exposure in the bottom right corner, not in the top left corner.

To obtain the odds ratio for HIV infection, comparing those with and without education within each stratum, we will use `mhor`. Is it appropriate to produce a summary estimate of the odds ratio adjusted for age?

In [None]:
library(epiDisplay)

In [None]:
mhor(mwanza_df_3$case,
     mwanza_df_3$ed2,
     mwanza_df_3$age2,
     design = "case-control", graph = FALSE)

The $\chi^2$ value for effect modification suggests that there is a different effect of education on HIV infection depending on age. The confidence intervals of the ORs are wide but there is an indication that education may be protective in the youngest age group, or at least not as "harmful".

Given that there is some evidence of interaction, the combined estimate of 2.29 is less useful and it is preferable to present the stratum-specific estimates. It is plausible that the "effect" of education has changed if there has been awareness of HIV risks and teaching about this in schools in recent years.
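To make the difference between stratum-specific and combined estimates concrete, here is a rough sketch on a hypothetical 2 x 2 x 3 array (not the Mwanza data). It computes the stratum odds ratios and the Mantel-Haenszel summary by hand, then compares the summary with `mantelhaen.test` from base R's `stats` package. The cell layout and counts are assumptions chosen for illustration.

```r
# Hypothetical 2 x 2 x 3 array: case status by exposure within 3 strata.
strat <- array(c(20, 10, 30, 40,
                 15, 12, 25, 35,
                 10, 20, 30, 20),
               dim = c(2, 2, 3),
               dimnames = list(case = c("0", "1"),
                               exposure = c("1", "2"),
                               stratum = c("a", "b", "c")))

# Stratum-specific cross-product odds ratios.
stratum_or <- apply(strat, 3,
                    function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1]))

# Mantel-Haenszel summary odds ratio computed by hand.
n_k <- apply(strat, 3, sum)
mh_or <- sum(strat[1, 1, ] * strat[2, 2, ] / n_k) /
         sum(strat[1, 2, ] * strat[2, 1, ] / n_k)

# mantelhaen.test reports the same common odds ratio for 2 x 2 x K tables.
mh_test <- mantelhaen.test(strat)

stratum_or
c(by_hand = mh_or, mantelhaen = unname(mh_test$estimate))
```

When the stratum odds ratios differ as much as they do in this made-up example, a single summary value hides real variation, which is the reason for preferring stratum-specific estimates.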

## Test for trend

To look for evidence of a dose-response effect of years of schooling on HIV infection, we can use `chisq.test` to test for homogeneity and `prop.trend.test` to perform a test for trend. Type:

In [None]:
table(mwanza_df_3$case, mwanza_df_3$ed) %>%
    chisq.test

The first test (test for homogeneity, P<0.001) provides very strong evidence against the null hypothesis that the odds of HIV infection are the same in each education category.

To test for a trend, we need to reshape the data to fit the function `prop.trend.test`, which takes the number of events (`x`) and the total (`n`) in each category.
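As a rough sketch of the input that `prop.trend.test` expects, the block below runs it on hypothetical counts (not the Mwanza data). It also shows that the test statistic is the same whether cases or controls are counted as the "events", because one proportion is simply 1 minus the other.

```r
# Hypothetical counts across four ordered exposure categories.
cases    <- c(20, 30, 45, 60)
controls <- c(80, 70, 55, 40)
totals   <- cases + controls

# Trend in the proportion of cases across categories (default scores 1, 2, 3, 4).
trend_cases <- prop.trend.test(x = cases, n = totals)

# Counting controls instead gives the same chi-squared statistic.
trend_controls <- prop.trend.test(x = controls, n = totals)

c(cases = unname(trend_cases$statistic),
  controls = unname(trend_controls$statistic))
```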

In [None]:
(tab1 <- table(mwanza_df_3$case, mwanza_df_3$ed) %>%
    addmargins)

In [None]:
prop.trend.test(x = tab1[2, 1:4],   # row 2: cases (case == 1)
                n = tab1[3, 1:4])   # row 3: column totals

The second test (test for trend, P<0.001) provides very strong evidence against the null hypothesis that there is no trend in odds of HIV infection with increasing years of education.

Remember that this is a case-control study, so the ratio of cases to controls (D/H) in each education category does not give us the true odds, because the probabilities of selection differ between cases and controls. However, this ratio is a constant multiple of the true odds, so it can still be used to look at trends.
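The "constant multiple" argument can be illustrated numerically. The sketch below assumes a hypothetical source population and assumed sampling fractions (all cases, 2% of non-cases); none of these numbers come from the Mwanza study.

```r
# Hypothetical source population by exposure category (not real data).
pop_cases    <- c(50, 120, 200)
pop_noncases <- c(10000, 12000, 10000)

# Assumed sampling fractions: all cases, 2% of non-cases.
f_case    <- 1
f_control <- 0.02

true_odds   <- pop_cases / pop_noncases
sample_odds <- (pop_cases * f_case) / (pop_noncases * f_control)

# Every category is inflated by the same factor, f_case / f_control ...
inflation <- sample_odds / true_odds

# ... so odds ratios against the first category are unchanged.
true_or   <- true_odds / true_odds[1]
sample_or <- sample_odds / sample_odds[1]

rbind(true_or, sample_or)
```

Because the inflation factor is identical in every category, it cancels in any ratio of odds, which is why trends and odds ratios remain valid in a case-control sample.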

To investigate further whether there really is evidence that risk of HIV infection increases with years of schooling, we will perform a test for trend excluding women who had never been to school. To exclude women with no formal education, type:

In [None]:
prop.trend.test(x = tab1[2, 2:4],   # row 2: cases, categories 2 to 4 only
                n = tab1[3, 2:4])   # row 3: column totals

What should we conclude about the association between schooling and HIV infection?

There is no evidence to support a trend in odds of HIV infection among women with some formal education.

# Review exercise

#### Investigate whether religion (`rel`) confounds the association between schooling (`ed2`) and HIV infection (`case`).  Note: rel has a code 9 for missing values, so we suggest you set this to system-missing (`NA`).

In [None]:
mwanza_df_4 <- mwanza_df_3 %>%
    mutate(rel2 = if_else(rel == 9, NA_real_, rel))

In [None]:
mhor(mwanza_df_4$case,
     mwanza_df_4$ed2,
     mwanza_df_4$rel2,
     design = "case-control", graph = FALSE)

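The odds ratios for each stratum do not look too different, and their 95% confidence intervals overlap. The homogeneity test (P = 0.793) provides no evidence against the null hypothesis that the odds ratio is the same across religions. To judge confounding, compare the Mantel-Haenszel adjusted odds ratio with the crude odds ratio: if the two are similar, religion does not appreciably confound the association.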
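To illustrate what confounding looks like in this kind of comparison, the sketch below uses a deliberately extreme hypothetical example (not the Mwanza data) in which the odds ratio is 1 within each stratum but the crude odds ratio is well above 1. The array layout is an assumption made for illustration.

```r
# Hypothetical 2 x 2 x 2 array: case status by exposure within 2 strata.
conf_tab <- array(c(80, 20, 40, 10,
                    10, 40, 20, 80),
                  dim = c(2, 2, 2),
                  dimnames = list(case = c("0", "1"),
                                  exposure = c("1", "2"),
                                  stratum = c("a", "b")))

# Crude odds ratio: collapse over strata, then take the cross-product.
crude <- apply(conf_tab, c(1, 2), sum)
crude_or <- (crude[1, 1] * crude[2, 2]) / (crude[1, 2] * crude[2, 1])

# Adjusted (Mantel-Haenszel) odds ratio from base R.
adjusted_or <- unname(mantelhaen.test(conf_tab)$estimate)

c(crude = crude_or, adjusted = adjusted_or)
```

A large gap between the crude and adjusted estimates, as in this made-up example, is the signature of confounding; similar values suggest the stratifying variable is not an important confounder.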

#### You might expect an increasing risk of HIV infection with number of sexual partners.  Carry out a test for trend using npa and estimate the odds ratio for each increase in category of number of partners.

In [None]:
mwanza_df_5 <- mwanza_df_4 %>%
    mutate(npa2 = if_else(npa == 9, NA_real_, npa))

In [None]:
CrossTable(mwanza_df_5$npa2)

In [None]:
# Use npa2 so that the missing-value code 9 is not treated as a category
table(mwanza_df_5$case, mwanza_df_5$npa2) %>%
    chisq.test

In [None]:
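# Tabulate npa2 (code 9 recoded to NA) so missing values are excluded
(tab2 <- table(mwanza_df_5$case, mwanza_df_5$npa2) %>%
    addmargins)

In [None]:
# Use every partner category, dropping only the final "Sum" column
k <- ncol(tab2) - 1
prop.trend.test(x = tab2[2, 1:k],   # row 2: cases
                n = tab2[3, 1:k])   # row 3: column totals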

There is strong evidence against the null hypothesis of no trend with number of sexual partners.
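The exercise also asks for the odds ratio per increase in category. One common way to estimate this, not necessarily the method the exercise intends, is a logistic regression with the category score entered as a linear term. The sketch below uses hypothetical grouped counts (not the Mwanza data); the column names `score`, `cases` and `controls` are our own.

```r
# Hypothetical grouped data: cases and controls by ordered category 1 to 4.
grouped <- data.frame(score    = 1:4,
                      cases    = c(10, 20, 30, 40),
                      controls = c(40, 40, 30, 20))

# Logistic regression with the category score as a linear term.
fit <- glm(cbind(cases, controls) ~ score,
           family = binomial, data = grouped)

# Odds ratio per one-category increase, with a Wald 95% confidence interval.
or_per_category <- exp(coef(fit)[["score"]])
ci_per_category <- exp(confint.default(fit)["score", ])

or_per_category
ci_per_category
```

With real data the same model could be fitted to individual records, for example `glm(case ~ npa2, family = binomial)`, assuming `npa2` is stored as a numeric score.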