# Case-control studies

## Aim

To learn how to analyse case-control data, obtain crude and adjusted odds ratio estimates, and test for trend across increasing levels of exposure.

In [None]:
library(tidyverse)

## Reading in the dataset and identifying relevant variables

This practical session uses a dataset from Mwanza, Tanzania, on HIV infection among women.

To read in the dataset, type:

In [None]:
library(haven)

In [None]:
mwanza_df <- read_dta("Data_files-20211113/MWANZA.dta")

In [None]:
head(mwanza_df)

We will look at the association between HIV infection and exposure to formal education, number of sexual partners and religion.

**case** is the variable name for HIV infection coded: 1=case; 0=control

**age1** is a grouped age variable coded: 1=15-19, 2=20-24, 3=25-29, 4=30-34, 5=35-44, 6=45-54 years

**ed** is the variable name for level of education coded: 1=no formal education (none/adult only), 2=1-3 years, 3=4-6 years, 4=7+ years

**npa** is the variable name for number of sexual partners ever coded: 1=0-1, 2=2-4, 3=5-9, 4=10-19, 5=20-49, 6=50+, 9=missing

**rel** is the variable name for type of religion coded: 1=Moslem, 2=Catholic, 3=Protestant, 4=other, 9=missing
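Both **npa** and **rel** use 9 as a missing-value code, so 9 must not be treated as a real category when these variables are analysed. A minimal sketch of one way to handle this in base R is shown here. The vector `npa_toy` and its values are invented for illustration and are not taken from the dataset.

```r
# Toy vector using the npa coding above; values are invented, not from the dataset
npa_toy <- c(1, 2, 9, 4, 9, 6)

# Replace the missing-value code 9 with NA so it is not counted as a category
npa_clean <- replace(npa_toy, npa_toy == 9, NA)

# Tabulate, showing the NA count separately
table(npa_clean, useNA = "ifany")
```

The same idea could be applied to the real columns inside a `mutate()` call, assuming the codes are stored as plain numbers.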

To examine how many cases and controls there are in the dataset, type:

In [None]:
library(gmodels)
CrossTable(mwanza_df$case)

To look at exposure to formal education, create a new variable ed2 that takes the value 1 for women with no formal education and the value 2 for those with some education. Type:

In [None]:
mwanza_df_2 <- mwanza_df %>%
    mutate(ed2 = case_when(ed == 1 ~ 1,
                           ed > 1 ~ 2))

To check that the new variable has been coded correctly, tabulate it against the original variable. Type:

In [None]:
CrossTable(mwanza_df_2$ed, mwanza_df_2$ed2)

Similarly for age, recode age1 to a new variable age2 with the 4 categories: 1 = 15-19, 2 = 20-29, 3 = 30-44, 4 = 45+ years. Type:

In [None]:
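mwanza_df_3 <- mwanza_df_2 %>%
    # age1 holds group codes 1-6, not ages in years, so recode on the codes
    mutate(age2 = case_when(age1 == 1 ~ 1,          # 15-19
                            age1 %in% c(2, 3) ~ 2,  # 20-29
                            age1 %in% c(4, 5) ~ 3,  # 30-44
                            age1 == 6 ~ 4))         # 45+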

Again we should tabulate the old variable against the new variable to check the coding is correct.

In [None]:
CrossTable(mwanza_df_3$age1, mwanza_df_3$age2)

## Crude odds ratio estimate

To examine the relationship between being a case and formal education, type:

In [None]:
library("epiR")
# method = "case.control" because the data come from a case-control design
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0,1)), factor(mwanza_df_3$ed2, levels = c(1,2))),
         method = "case.control", digits=2)

The P-value shows very strong evidence against the null hypothesis of no association between being a case and formal education.

Note: We should examine the row percentages because column percentages are affected by the different probabilities of selection for cases and controls.

We must be clear about which variable we are treating as the exposure and which category is a case in our interpretation of the table. Examine the table above. What is the proportion of cases with some formal education?

There are 140/189 cases with some formal education, i.e. 74.1%.
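The row percentage can be reproduced by hand with `prop.table()`. The sketch below uses the case counts quoted above (140 of 189 cases with some formal education, hence 49 with none). The control counts are hypothetical placeholders chosen only so that the table is complete; they are not the Mwanza values.

```r
# Rows: case status (0 = control, 1 = case); columns: ed2 (1 = none, 2 = some)
# Case row comes from the text; control row is HYPOTHETICAL, not from the dataset
tab <- matrix(c(200, 49, 300, 140), nrow = 2,
              dimnames = list(case = c("0", "1"), ed2 = c("1", "2")))

# margin = 1 gives row percentages: the exposure distribution within controls and within cases
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
row_pct

# Proportion of cases with some formal education
row_pct["1", "2"]
```

Row percentages are used here because each row (cases, controls) was sampled separately, so percentages are only meaningful within a row.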

To produce an odds ratio for exposure to formal education we can use the `epi.2by2` command. Try the following command first:

In [None]:
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0,1)), factor(mwanza_df_3$ed2, levels = c(2,1))),
         method = "case.control", digits=2)

Now change the baseline for ed2. Type:

In [None]:
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0,1)), factor(mwanza_df_3$ed2, levels = c(1,2))),
         method = "case.control", digits=2)

The first version of the command takes level 2 (some education) as the baseline. Its odds ratio is therefore the reciprocal (1 divided by) of the odds ratio from the second version, which takes level 1 (no formal education) as the baseline. Always check which level is the baseline before interpreting an odds ratio.
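The effect of switching the baseline can be checked by computing both odds ratios by hand. This is a minimal sketch: the case counts come from the table discussed above (49 cases with no formal education, 140 with some), while the control counts are hypothetical placeholders and not the Mwanza values.

```r
# Case counts from the text; control counts are HYPOTHETICAL placeholders
cases    <- c(none = 49,  some = 140)
controls <- c(none = 200, some = 300)

# Odds ratio for some education, with no formal education as the baseline
or_base_none <- unname((cases["some"] / cases["none"]) /
                       (controls["some"] / controls["none"]))

# Odds ratio for no formal education, with some education as the baseline
or_base_some <- unname((cases["none"] / cases["some"]) /
                       (controls["none"] / controls["some"]))

# Each odds ratio is the reciprocal of the other, so their product is 1
# (up to floating-point rounding)
or_base_none * or_base_some
```

Whatever the counts, swapping the baseline inverts the odds ratio, and the confidence limits are inverted in the same way.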

## Adjusted odds ratio estimates