# Cohort studies

**Keywords:** epidemiology, cohort studies, exposed group, control group, odds, odds ratio, relative risk

## What are cohort and case-control studies?

**References:**

***Cohort studies:*** Book 1 pp9; HB pp10.1, pp10.5. ***Case-control studies:*** Book 1 pp23-24; HB pp10.2, pp11.6.

**Cohort studies** and **case-control studies** are studies of the association between an exposure $E$ and a disease $D$.

In a cohort study, $E$ is known at the outset and $D$ is unknown.
(This is because $D$ lies in the future, so it cannot be identified before the study begins.)

Alternatively, in a case-control study, $D$ is known at the outset and it is $E$ that is unknown.
(This is because $E$ happened in the past, before the study, so it has to be determined retrospectively.)

Both can be represented by a two-by-two contingency table, in which the column headers depend on the type of study but the row titles generally remain the same.
Tables for cohort studies include the row marginal totals; tables for case-control studies include the column marginal totals.

An example of a contingency table for a cohort study.

|           | D   |   Not D | Total   |
|-----------|-----|---------|---------|
| **E**     | *a* | *b*     | *n1*    |
| Not **E** | *c* | *d*     | *n2*    |

And for a case-control study.

|           | Cases   | Control   |
|-----------|---------|-----------|
| **E**     | *a*     | *b*       |
| Not **E** | *c*     | *d*       |
| **Total** | *m1*    | *m2*      |

## Measures of association

### Relative risk

**References:** Book 1 pp11, pp26; HB pp10.3

The **relative risk**, $RR$, is the ratio of the probability of having the disease given exposure to the probability of having the disease given no exposure.

In cohort studies, it can be estimated using

$$\widehat{RR} = \frac{\hat P (D|E)}{\hat P (D|\text{not } E)}.$$

It cannot be directly estimated in case-control studies, but it can be indirectly estimated from the odds ratio *(Book 1, pp15).*
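As a minimal sketch of the estimate above, the relative risk can be computed directly from the four cells of the cohort table. The function name `relative_risk` and the counts are made up for illustration; they are not taken from `opyn` or from the references.

```python
def relative_risk(a, b, c, d):
    """Estimate RR from a cohort 2x2 table.

    Rows are E / not E, columns are D / not D, matching the cohort table above.
    """
    n1 = a + b  # row total for the exposed group
    n2 = c + d  # row total for the unexposed group
    risk_exposed = a / n1      # estimate of P(D | E)
    risk_unexposed = c / n2    # estimate of P(D | not E)
    return risk_exposed / risk_unexposed


# Illustrative counts only (not real data).
rr = relative_risk(a=20, b=80, c=10, d=90)
print(rr)  # approximately 2: the exposed group has about twice the risk
```

The row totals $n_1$ and $n_2$ are needed here, which is why this calculation only makes sense for a cohort table.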

### Odds and the odds ratio

**References:** Book 1 pp13-14, pp27; HB pp10.4

The **odds of an event**, $OD$, is defined as the ratio of the probability of the event occurring to the probability of the event not occurring, so

$$OD(A) = \frac{P(A)}{P(\text{Not } A)} = \frac{P(A)}{1 - P(A)}.$$

In a cohort study it is possible to estimate the odds of *disease given exposure,* so $\widehat{OD} (D | E)$ and $\widehat{OD} (D | \text{ Not } E).$
These can be calculated as

$$\widehat{OD} (D | E) = \frac{a}{b}, \hspace{3mm} \widehat{OD} (D | \text{ Not } E) = \frac{c}{d}.$$

Alternatively, case-control studies can estimate the odds of *exposure given disease*, so $\widehat{OD} (E | D)$ and $\widehat{OD} (E | \text{ Not } D)$.

$$\widehat{OD} (E | D) = \frac{a}{c}, \hspace{3mm} \widehat{OD} (E | \text{ Not } D) = \frac{b}{d}.$$
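The odds estimates for both study types can be sketched as follows. The helper names (`odds`, `cohort_odds`, `case_control_odds`) and the counts are hypothetical and chosen only to illustrate the formulas; the last lines check that $a/b$ agrees with applying the definition of odds to the estimated probability $a/(a+b)$.

```python
import math


def odds(p):
    """Odds of an event with probability p, i.e. p / (1 - p)."""
    return p / (1 - p)


def cohort_odds(a, b, c, d):
    """Estimated odds of disease given exposure, and given no exposure."""
    return a / b, c / d


def case_control_odds(a, b, c, d):
    """Estimated odds of exposure among cases, and among controls."""
    return a / c, b / d


# Illustrative counts only (not real data).
a, b, c, d = 20, 80, 10, 90
od_d_given_e, od_d_given_not_e = cohort_odds(a, b, c, d)
od_e_given_d, od_e_given_not_d = case_control_odds(a, b, c, d)

# a / b is the same as the odds of the estimated probability a / (a + b).
assert math.isclose(od_d_given_e, odds(a / (a + b)))
print(od_d_given_e, od_d_given_not_e, od_e_given_d, od_e_given_not_d)
```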

The **odds ratio**, $OR$, is defined as the ratio of the odds of a disease happening given exposure to the odds of the disease happening given no exposure.
It represents the **strength of association.**
Like $RR,$ a value greater than 1 represents a positive association, a value less than 1 represents a negative association, and a value of 1 represents no association.
It can be shown that

$$\widehat{OR} = \frac{\widehat{OD} (D | E)}{\widehat{OD} (D | \text{ Not } E)} = \frac{\widehat{OD} (E | D)}{\widehat{OD} (E | \text{ Not } D)} = \frac{ad}{bc}.$$

(This means that the calculation of the odds ratio is the same for both cohort and case-control studies.)
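The identity above can be checked numerically. The sketch below uses exact rational arithmetic so that the three expressions can be compared without rounding error; the function name and counts are made up for illustration.

```python
from fractions import Fraction


def odds_ratio_three_ways(a, b, c, d):
    """Return the odds ratio computed by each of the three expressions."""
    via_disease_odds = Fraction(a, b) / Fraction(c, d)   # OD(D|E) / OD(D|not E)
    via_exposure_odds = Fraction(a, c) / Fraction(b, d)  # OD(E|D) / OD(E|not D)
    cross_product = Fraction(a * d, b * c)               # ad / bc
    return via_disease_odds, via_exposure_odds, cross_product


# Illustrative counts only (not real data).
results = odds_ratio_three_ways(a=20, b=80, c=10, d=90)
print(results[0] == results[1] == results[2])  # True
print(float(results[0]))  # 2.25
```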

### Comparing the relative risk and odds ratio

**References:** Book 1 pp15

### Confidence intervals for measures of association

**References:** Book 1 pp16-17; HB pp10.5 and pp11.6

The point estimates for the relative risk and the odds ratio are subject to sampling variability.
A confidence interval can be used to quantify the uncertainty.
This is done using a **binomial model,** with the assumption that the number of disease outcomes in the exposed group is independent of the number of disease outcomes in the control group.
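One common large-sample approach, which may differ in detail from the method in the references, is to build the interval on the log scale using a standard error derived from the two independent binomial counts, and then exponentiate the limits. The sketch below assumes that approach; the function names and counts are hypothetical and it is not the `opyn` implementation.

```python
import math
from statistics import NormalDist


def log_scale_ci(estimate, se_log, level=0.95):
    """Interval for a ratio, given the standard error of its logarithm."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    return estimate * math.exp(-z * se_log), estimate * math.exp(z * se_log)


def rr_ci(a, b, c, d, level=0.95):
    """Relative risk and an approximate interval (assumed log method)."""
    n1, n2 = a + b, c + d
    rr = (a / n1) / (c / n2)
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    return rr, log_scale_ci(rr, se_log, level)


def or_ci(a, b, c, d, level=0.95):
    """Odds ratio and an approximate interval (assumed log method)."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, log_scale_ci(odds_ratio, se_log, level)


# Illustrative counts only (not real data).
rr, (rr_low, rr_high) = rr_ci(20, 80, 10, 90)
odds_ratio, (or_low, or_high) = or_ci(20, 80, 10, 90)
print(rr, rr_low, rr_high)
print(odds_ratio, or_low, or_high)
```

Because the limits are symmetric on the log scale, the point estimate is the geometric mean of the lower and upper limits rather than their midpoint.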

## Multiple exposure categories

**References:** Book 1 pp29-30; HB pp11.7

Pick an exposure category as an arbitrary reference category.
All ratios are then calculated relative to this reference category.
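As a sketch of this idea, the odds ratio for each exposure category can be computed against a chosen reference category. The category labels, counts, and function name below are made up for illustration.

```python
import math


def odds_ratios_vs_reference(table, reference):
    """Odds ratio of each exposure category relative to the reference.

    `table` maps a category label to (diseased, not diseased) counts.
    """
    ref_diseased, ref_not_diseased = table[reference]
    ref_odds = ref_diseased / ref_not_diseased
    return {
        category: (diseased / not_diseased) / ref_odds
        for category, (diseased, not_diseased) in table.items()
    }


# Illustrative counts only (not real data).
exposure_table = {"none": (10, 90), "low": (20, 80), "high": (30, 70)}
ratios = odds_ratios_vs_reference(exposure_table, reference="none")
for category, ratio in ratios.items():
    print(category, round(ratio, 3))
```

The reference category always has a ratio of 1, and choosing a different reference rescales every ratio by the same factor.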

## Testing for no association

### Chi-squared test for no association

**References:** Book 1 pp31-39, HB pp11.8

A **chi-squared test for no association** tests the null hypothesis that there is no association between the exposure and the disease.

The test requires all expected frequencies to be greater than or equal to **five** for it to be adequate.

The degrees of freedom for the test when using a **2x2** contingency table is **1.**
Otherwise, use $\nu = (r - 1)(c - 1),$ where $r$ is the number of rows and $c$ is the number of columns.

If there is a need for a reference category, then the test is not affected by the choice of reference category, as all values are taken into account.

The $p$-value and $\widehat{OR}$ are often reported together:
The $p$-value quantifies the strength of the *evidence;* $\widehat{OR}$ quantifies the strength of *association.*
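The steps of the test can be sketched with the standard library alone. This is an illustration of the Pearson statistic without a continuity correction, so results may differ from library routines that apply one by default for 2x2 tables; the function names and counts are hypothetical. The $p$-value helper only covers the one-degree-of-freedom case.

```python
import math


def chi_squared_no_association(observed):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected frequency under no association: row total * column total / n.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    # Adequacy check: every expected frequency should be at least five.
    adequate = all(e >= 5 for row in expected for e in row)
    return statistic, df, expected, adequate


def p_value_df1(statistic):
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Illustrative counts only (not real data): rows E / not E, columns D / not D.
statistic, df, expected, adequate = chi_squared_no_association([[20, 80], [10, 90]])
p_value = p_value_df1(statistic)
print(statistic, df, adequate, p_value)
```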

### Fisher’s exact test

**References:** Book 1 pp40, HB pp11.8

## Implementation in `opyn`

Cohort studies and case-control studies are implemented in

```{python}
opyn.stats.observationalstudies
```

The example notebooks in this directory make use of the module.

```{python}
from opyn.stats import observationalstudies as studies  # noqa
studies?
```