# Case-control studies

## Aim

To learn how to analyse case-control data, obtain crude and adjusted estimates and test for trend with increasing exposures.

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.3     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Reading in the dataset and identifying relevant variables

This practical session uses a dataset on HIV infection among women in Mwanza, Tanzania.

To read in the dataset, type:

In [2]:
library(haven)

In [3]:
mwanza_df <- read_dta("Data_files-20211113/MWANZA.dta")

In [4]:
head(mwanza_df)

idno,comp,case,age1,ed,eth,rel,msta,bld,inj,skin,fsex,npa,pa1,usedc,ud,ark,srk
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
112041,1,0,2,3,1,2,3,1,1,1,2,1,1,1,1,2,4
114002,1,0,6,1,1,4,1,1,1,1,1,2,2,1,1,4,4
114006,1,1,4,3,3,3,1,1,5,2,2,3,2,2,1,3,3
114020,1,1,3,3,3,3,1,1,1,2,2,3,2,1,2,2,2
114025,1,1,1,3,1,3,1,1,2,2,1,1,2,1,1,2,4
121006,1,0,2,1,1,2,1,1,1,1,2,1,2,1,1,1,4


We will look at the association between HIV infection and exposure to formal education, number of sexual partners and religion.

**case** is the variable name for HIV infection coded: 1=case; 0=control

**age1** is a grouped age variable coded: 1=15-19, 2=20-24, 3=25-29, 4=30-34, 5=35-44, 6=45-54 years

**ed** is the variable name for level of education coded: 1=no formal education (none/adult only), 2=1-3 years, 3=4-6 years, 4=7+ years

**npa** is the variable name for number of sexual partners ever coded: 1=0-1, 2=2-4, 3=5-9, 4=10-19, 5=20-49, 6=50+, 9=missing

**rel** is the variable name for type of religion coded: 1=Moslem, 2=Catholic, 3=Protestant, 4=other, 9=missing
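
The codes listed above can be attached to the data as labelled factors, which makes later tables easier to read. The sketch below uses a small made-up data frame (`toy_df` is hypothetical, not the Mwanza data) and base R only. It assumes, as the code list states, that 9 marks a missing value for **rel**.

```r
# Hypothetical toy data that reuses the Mwanza coding scheme
toy_df <- data.frame(
  case = c(1, 0, 0, 1),
  ed   = c(1, 3, 4, 2),
  rel  = c(1, 2, 9, 4)
)

# Treat code 9 as missing before attaching labels
toy_df$rel[toy_df$rel == 9] <- NA

# Attach the labels given in the variable descriptions
toy_df$case_f <- factor(toy_df$case, levels = c(0, 1),
                        labels = c("control", "case"))
toy_df$ed_f <- factor(toy_df$ed, levels = 1:4,
                      labels = c("none/adult only", "1-3 years",
                                 "4-6 years", "7+ years"))
toy_df$rel_f <- factor(toy_df$rel, levels = 1:4,
                       labels = c("Moslem", "Catholic", "Protestant", "other"))

# Tabulate religion, showing the missing value explicitly
table(toy_df$rel_f, useNA = "ifany")
```

The same pattern should carry over to `mwanza_df`, although the numeric codes are still needed for the recodes that follow.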

To examine how many cases and controls there are in the dataset, type:

In [5]:
library(gmodels)
CrossTable(mwanza_df$case)


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  763 

 
          |         0 |         1 | 
          |-----------|-----------|
          |       574 |       189 | 
          |     0.752 |     0.248 | 
          |-----------|-----------|



 


To look at exposure to formal education, create a new variable `ed2` that takes the value 1 for women with no formal education and the value 2 for those with some education. Type:

In [6]:
mwanza_df_2 <- mwanza_df %>%
    mutate(ed2 = case_when(ed == 1 ~ 1,
                           ed > 1 ~ 2))

To check that the new variable has been coded correctly, tabulate it against the original variable. Type:

In [7]:
CrossTable(mwanza_df_2$ed, mwanza_df_2$ed2)


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  763 

 
               | mwanza_df_2$ed2 
mwanza_df_2$ed |         1 |         2 | Row Total | 
---------------|-----------|-----------|-----------|
             1 |       312 |         0 |       312 | 
               |   266.581 |   184.419 |           | 
               |     1.000 |     0.000 |     0.409 | 
               |     1.000 |     0.000 |           | 
               |     0.409 |     0.000 |           | 
---------------|-----------|-----------|-----------|
             2 |         0 |        75 |        75 | 
               |    30.668 |    21.216 |           | 
               |     0.000 |     1.000 |     0.098 | 
               |     0.000 |     0.166 |           | 
               |     0.000 |     0.098 |           | 
---------
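
A visual check like the one above can also be written as an assertion, so that a faulty recode stops the script. This is a minimal sketch in base R on made-up codes (`ed` and `ed2` here are hypothetical vectors, not columns of the Mwanza data); `ifelse()` stands in for the `case_when()` call used above.

```r
# Hypothetical education codes (1 = none, 2 to 4 = some formal education)
ed <- c(1, 2, 3, 4, 1, 4)

# Same rule as the recode above, written in base R
ed2 <- ifelse(ed == 1, 1, 2)

# Cross-tabulate old against new
check <- table(ed, ed2)

# Every original code should land in exactly one new code
stopifnot(all(rowSums(check > 0) == 1))
check
```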

Similarly for age, recode age1 to a new variable age2 with the 4 categories: 1 = 15-19, 2 = 20-29, 3 = 30-44, 4 = 45+ years. Type:

In [8]:
mwanza_df_3 <- mwanza_df_2 %>%
    mutate(age2 = case_when(age1 == 1 ~ 1,           # 15-19
                            age1 %in% c(2, 3) ~ 2,   # 20-29
                            age1 %in% c(4, 5) ~ 3,   # 30-44
                            age1 == 6 ~ 4))          # 45+

Again we should tabulate the old variable against the new variable to check the coding is correct.

In [9]:
CrossTable(mwanza_df_3$age1, mwanza_df_3$age2)

Each **age1** code should fall in exactly one **age2** column: code 1 in column 1, codes 2 and 3 in column 2, codes 4 and 5 in column 3, and code 6 in column 4. The six age1 groups contain 109, 165, 123, 118, 137 and 111 of the 763 women, so the four age2 groups should contain 109, 288, 255 and 111.

## Crude odds ratio estimate

To examine the relationship between being a case and formal education, type:

In [10]:
library("epiR")
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0, 1)),
         factor(mwanza_df_3$ed2, levels = c(1, 2))),
         method = "cohort.count", digits = 2)

Loading required package: survival

Package epiR 2.0.41 is loaded

Type help(epi.about) for summary information

Type browseVignettes(package = 'epiR') to learn how to use epiR for applied epidemiological analyses






             Outcome +    Outcome -      Total        Inc risk *        Odds
Exposed +          263          311        574              45.8       0.846
Exposed -           49          140        189              25.9       0.350
Total              312          451        763              40.9       0.692

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 1.77 (1.37, 2.28)
Odds ratio                                     2.42 (1.68, 3.48)
Attrib risk in the exposed *                   19.89 (12.43, 27.35)
Attrib fraction in the exposed (%)            43.42 (26.84, 56.23)
Attrib risk in the population *                14.97 (7.81, 22.12)
Attrib fraction in the population (%)         36.60 (21.23, 48.97)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 23.279 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.001
 Wald conf

The P-value (<0.001) shows very strong evidence against the null hypothesis of no association between HIV infection and formal education.
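
The uncorrected chi-squared statistic in the output can be reproduced from the four cell counts alone. A minimal sketch in base R, with the counts copied from the table (`tab` and `res` are illustrative names):

```r
# Cell counts from the table: rows are case status (0, 1),
# columns are ed2 (1 = no formal education, 2 = some)
tab <- matrix(c(263, 311,
                 49, 140),
              nrow = 2, byrow = TRUE,
              dimnames = list(case = c("0", "1"), ed2 = c("1", "2")))

# Pearson chi-squared test without Yates' continuity correction
res <- chisq.test(tab, correct = FALSE)

round(unname(res$statistic), 3)  # 23.279, as in the epi.2by2 output
res$p.value < 0.001              # TRUE
```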

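Note: In this table the rows are case status (574 controls, then 189 cases) and the columns are education level, so the labels "Exposed" and "Outcome" printed by `epi.2by2` do not carry their usual meaning. We should examine the row percentages because column percentages are affected by the different probabilities of selection for cases and controls. For the same reason, only the odds ratio is interpretable for case-control data; the risk-based measures in the output are not.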

We must be clear about which variable we are treating as the exposure and which category is a case in our interpretation of the table. Examine the table above. What is the proportion of cases with some formal education?

There are 140/189 cases with some formal education, i.e. 74.1%.
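
The row percentages can be checked directly from the cell counts. A short sketch in base R, with the counts copied from the table above (`tab` and `row_pct` are illustrative names):

```r
# Counts from the table above: rows are case status, columns are education
tab <- matrix(c(263, 311,
                 49, 140),
              nrow = 2, byrow = TRUE,
              dimnames = list(case = c("control", "case"),
                              ed2  = c("none", "some")))

# Row percentages: distribution of education within controls and within cases
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
row_pct
```

The `case` row gives 74.1% with some formal education, against 54.2% among controls.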

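To produce an odds ratio for exposure to formal education we can use the `epi.2by2` function from the epiR package. The order of the levels given for **ed2** determines which category is the baseline. Try the following command first: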

In [11]:
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0, 1)),
         factor(mwanza_df_3$ed2, levels = c(2, 1))),
         method = "cohort.count", digits = 2)

             Outcome +    Outcome -      Total        Inc risk *        Odds
Exposed +          311          263        574              54.2        1.18
Exposed -          140           49        189              74.1        2.86
Total              451          312        763              59.1        1.45

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 0.73 (0.65, 0.82)
Odds ratio                                     0.41 (0.29, 0.60)
Attrib risk in the exposed *                   -19.89 (-27.35, -12.43)
Attrib fraction in the exposed (%)            -36.72 (-53.07, -22.11)
Attrib risk in the population *                -14.97 (-22.12, -7.81)
Attrib fraction in the population (%)         -25.32 (-35.57, -15.84)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 23.279 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.00

Now change the baseline for ed2. Type:

In [12]:
epi.2by2(table(factor(mwanza_df_3$case, levels = c(0, 1)),
         factor(mwanza_df_3$ed2, levels = c(1, 2))),
         method = "cohort.count", digits = 2)

             Outcome +    Outcome -      Total        Inc risk *        Odds
Exposed +          263          311        574              45.8       0.846
Exposed -           49          140        189              25.9       0.350
Total              312          451        763              40.9       0.692

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 1.77 (1.37, 2.28)
Odds ratio                                     2.42 (1.68, 3.48)
Attrib risk in the exposed *                   19.89 (12.43, 27.35)
Attrib fraction in the exposed (%)            43.42 (26.84, 56.23)
Attrib risk in the population *                14.97 (7.81, 22.12)
Attrib fraction in the population (%)         36.60 (21.23, 48.97)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 23.279 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.001
 Wald conf

The first version of the command (`levels = c(2, 1)`) takes level 2 (some education) as the baseline and gives an odds ratio of 0.41. The second version (`levels = c(1, 2)`) takes level 1 (no formal education) as the baseline and gives 2.42. The first odds ratio is therefore 1 divided by the second. It is important to know which level is the baseline when interpreting the odds ratio.
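
The relationship between the two estimates can be verified by hand. The sketch below computes the odds ratio and a Wald 95% confidence interval from the cell counts in base R, then inverts them to switch the baseline. Variable names are illustrative, and the Wald interval on the log scale is used on the assumption that it matches the interval printed by `epi.2by2` (the values agree to two decimal places).

```r
# Cell counts from the tables above
controls <- c(none = 263, some = 311)
cases    <- c(none = 49,  some = 140)

# Odds ratio for some education versus none (baseline: no formal education)
or_some <- (cases[["some"]] * controls[["none"]]) /
           (cases[["none"]] * controls[["some"]])

# Wald 95% confidence interval, computed on the log scale
se_log_or <- sqrt(sum(1 / c(cases, controls)))
ci_some   <- exp(log(or_some) + c(-1, 1) * qnorm(0.975) * se_log_or)

# Switching the baseline inverts the estimate and swaps the limits
or_none <- 1 / or_some
ci_none <- rev(1 / ci_some)

round(c(or_some, ci_some), 2)  # 2.42 1.68 3.48
round(c(or_none, ci_none), 2)  # 0.41 0.29 0.60
```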

## Adjusted odds ratio estimates