# Cohort studies

**Keywords:** epidemiology, cohort studies, exposed group, control group, odds, odds ratio, relative risk

- What is a cohort study
- Conditional probabilities
- Measures of association
    - Risk and relative risk
    - Odds and the odds ratio
- Confidence intervals
    - Relative risk
    - Odds ratio
- Chi-squared test for no association
- Fisher's exact test

## Setup

In [1]:
# import libraries
from opyn.stats import tables

In [2]:
# set precision of notebook
%precision 6

'%.6f'

In [3]:
# initialise example data
cohort: tables.ContingencyTable = (
    tables.ContingencyTable([[14, 19], [13, 39]])
)

## What are cohort and case-control studies?

**References:**

***Cohort studies:*** Book 1 pp9; HB pp10.1, pp10.5. ***Case-control studies:*** Book 1 pp23-24; HB pp10.2, pp11.6.

**Cohort studies** and **case-control studies** are studies of the association between an exposure $E$ and a disease $D$.

In a cohort study, $E$ is known and $D$ is unknown.
The disease is to happen in the future, so it cannot be identified prior to the study. (Book 1 pp9)

Alternatively, in a case-control study, $E$ is unknown and $D$ is known.
The exposure happened in the past, so it can be determined prior to the study. (Book 1 pp23-24)

Both can be represented by a two-by-two contingency table.
The row titles are the same for both study types, but the column headers depend on the type.
Cohort studies include the row marginal totals; case-control studies include the column marginal totals.

An example of a contingency table for a cohort study is shown below.

In [4]:
cohort.show_table(incl_row_totals=True)

| Exposed Category \ Disease Category | Disease | No Disease | Total |
| --- | --- | --- | --- |
| Exposed | 14 | 19 | 33 |
| Not Exposed | 13 | 39 | 52 |


## Measures of association

### Relative risk

**References:** Book 1 pp11, pp26; HB pp10.3

The **relative risk**, $RR$, is the ratio of the probability of having the disease given exposure to the probability of having the disease given no exposure.

In cohort studies, it can be estimated from a sample by

$$\widehat{RR} = \frac{\hat P (D|E)}{\hat P (D|\text{not } E)}.$$

It cannot be directly estimated in case-control studies, but it can be indirectly estimated from the odds ratio *(Book 1, pp15).*

In [5]:
cohort.relative_risk()

RelativeRisk(point=1.696970, zconfint={lower=0.916424, upper=3.142331})
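The point estimate can be checked by hand from the four cell counts of the example table. The following is a minimal sketch in plain Python; the names `a`, `b`, `c`, `d` and the variable names are illustrative labels, not part of `opyn`.

```python
# Cell counts taken from the example cohort table.
a, b = 14, 19  # exposed: disease, no disease
c, d = 13, 39  # not exposed: disease, no disease

risk_exposed = a / (a + b)      # estimate of P(D | E)
risk_not_exposed = c / (c + d)  # estimate of P(D | not E)
relative_risk = risk_exposed / risk_not_exposed

print(f"{risk_exposed:.6f} {risk_not_exposed:.6f} {relative_risk:.6f}")
# prints: 0.424242 0.250000 1.696970
```

The denominators are the row marginal totals, which is why the estimate is only available when the study design fixes or observes those totals, as a cohort study does.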

### Odds and the odds ratio

**References:** Book 1 pp13-14, pp27; HB pp10.4

The **odds of an event**, $OD$, is defined as the ratio of the probability of the event happening to the probability of the event not happening:

$$OD(A) = \frac{P(A)}{P(\text{Not } A)} = \frac{P(A)}{1 - P(A)}.$$

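In a cohort study it is possible to estimate the odds of *disease given exposure,* so $\widehat{OD} (D | E)$ and $\widehat{OD} (D | \text{ Not } E)$.
Writing the cell counts of the contingency table as $a$ (exposed, disease), $b$ (exposed, no disease), $c$ (not exposed, disease) and $d$ (not exposed, no disease), these can be calculated as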

$$\widehat{OD} (D | E) = \frac{a}{b}, \hspace{3mm} \widehat{OD} (D | \text{ Not } E) = \frac{c}{d}.$$

Alternatively, case-control studies can estimate the odds of *exposure given disease*, so $\widehat{OD} (E | D)$ and $\widehat{OD} (E | \text{ Not } D)$.

$$\widehat{OD} (E | D) = \frac{a}{c}, \hspace{3mm} \widehat{OD} (E | \text{ Not } D) = \frac{b}{d}.$$

In [6]:
cohort.conditional_odds()

OddsDisease(given_e=0.736842, given_not_e=0.333333)

The **odds ratio**, $OR$, is defined as the ratio of the odds of a disease happening given exposure to the odds of the disease happening given no exposure.

It can be shown that

$$\widehat{OR} = \frac{\widehat{OD} (D | E)}{\widehat{OD} (D | \text{ Not } E)} = \frac{\widehat{OD} (E | D)}{\widehat{OD} (E | \text{ Not } D)} = \frac{ad}{bc}.$$

(This means that the calculation of the odds ratio is the same for both cohort and case-control studies.)

In [7]:
cohort.odds_ratio()

OddsRatio(point=2.210526, zconfint={lower=0.869522, upper=5.619669})
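The equality of the three expressions for $\widehat{OR}$ can be checked numerically on the example table. This is a small sketch in plain Python with illustrative variable names; it does not use `opyn`.

```python
# Cell counts taken from the example cohort table.
a, b = 14, 19  # exposed: disease, no disease
c, d = 13, 39  # not exposed: disease, no disease

# Odds of disease given exposure status (the cohort view).
odds_d_given_e = a / b
odds_d_given_not_e = c / d

# Odds of exposure given disease status (the case-control view).
odds_e_given_d = a / c
odds_e_given_not_d = b / d

or_from_disease_odds = odds_d_given_e / odds_d_given_not_e
or_from_exposure_odds = odds_e_given_d / odds_e_given_not_d
or_cross_product = (a * d) / (b * c)

print(f"{odds_d_given_e:.6f} {odds_d_given_not_e:.6f}")
# prints: 0.736842 0.333333
print(f"{or_from_disease_odds:.6f} {or_from_exposure_odds:.6f} {or_cross_product:.6f}")
# prints: 2.210526 2.210526 2.210526
```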

### Comparing the relative risk and odds ratio

**References:** Book 1 pp15

## Confidence intervals for measures of association

**References:** Book 1 pp16-17; HB pp10.5 and pp11.6

The point estimates for the relative risk and the odds ratio are subject to sampling variability.
A confidence interval can be used to quantify the uncertainty.
This is done using a **binomial model,** with the assumption that the number of disease outcomes in the exposed group is independent of the number of disease outcomes in the control group.
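As an illustration, the sketch below computes approximate 95% intervals using the common large-sample normal approximation on the log scale. This method is an assumption here: it may not be exactly what `opyn` implements, although on the example table it agrees with the `zconfint` values shown earlier. The helper `log_scale_interval` is a hypothetical name introduced for this example.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Cell counts taken from the example cohort table.
a, b = 14, 19  # exposed: disease, no disease
c, d = 13, 39  # not exposed: disease, no disease

# Two-sided 95% critical value of the standard normal (about 1.96).
z = NormalDist().inv_cdf(0.975)


def log_scale_interval(point, se_log):
    """Return (lower, upper) for an interval built on the log scale."""
    half_width = z * se_log
    return exp(log(point) - half_width), exp(log(point) + half_width)


# Relative risk: standard error of log(RR) under independent binomials.
rr = (a / (a + b)) / (c / (c + d))
se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
rr_lower, rr_upper = log_scale_interval(rr, se_log_rr)

# Odds ratio: standard error of log(OR).
odds_ratio = (a * d) / (b * c)
se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
or_lower, or_upper = log_scale_interval(odds_ratio, se_log_or)

print(f"RR: {rr:.6f} ({rr_lower:.6f}, {rr_upper:.6f})")
print(f"OR: {odds_ratio:.6f} ({or_lower:.6f}, {or_upper:.6f})")
```

Both intervals contain 1, so on this sample neither measure gives clear evidence of an association at the 5% level.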

## Chi-squared test for no association

**References:** Book 1 pp31-39, HB pp11.8

A **chi-squared test for no association** tests the null hypothesis that there is no association between the exposure and the disease.

The test requires all expected frequencies to be greater than or equal to **five** for it to be adequate. 

For a **2x2** contingency table, the test has **1** degree of freedom.
More generally, use $\nu = (r - 1)(c - 1)$, where $r$ and $c$ are the numbers of rows and columns in the table.

The test is not affected by the choice of reference category in a study with multiple exposure groups, as all values are taken into account.

The $p$-value and $\widehat{OR}$ are often reported together:
the $p$-value quantifies the strength of the *evidence* against the null hypothesis; $\widehat{OR}$ quantifies the strength of *association.*

In [10]:
cohort.expected_freq(incl_row_totals=True)

| Exposed Category \ Disease Category | Disease | No Disease | Total |
| --- | --- | --- | --- |
| Exposed | 10.482353 | 22.517647 | 33.0 |
| Not Exposed | 16.517647 | 35.482353 | 52.0 |


In [11]:
cohort.chi2_contribs()

| Exposed Category \ Disease Category | Disease | No Disease |
| --- | --- | --- |
| Exposed | 1.180445 | 0.549517 |
| Not Exposed | 0.749129 | 0.348732 |


In [12]:
cohort.chi2_test()

ChiSqTest(chisq=2.827823, pval=0.092644, df=1)
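The three results above can be reproduced step by step in plain Python. This is a sketch of the standard Pearson calculation without a continuity correction, which is an assumption that happens to match the reported statistic; the variable names are illustrative. The $p$-value shortcut via `math.erfc` is valid only for one degree of freedom.

```python
from math import erfc, sqrt

# Observed counts: rows are Exposed / Not Exposed,
# columns are Disease / No Disease.
observed = [[14, 19], [13, 39]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under no association: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Contribution of each cell: (observed - expected)^2 / expected.
contribs = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chisq = sum(sum(row) for row in contribs)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability of a chi-squared variable with df = 1 ONLY:
# P(X > x) = erfc(sqrt(x / 2)).
pval = erfc(sqrt(chisq / 2))

print(f"chisq={chisq:.6f}, pval={pval:.6f}, df={df}")
# prints: chisq=2.827823, pval=0.092644, df=1
```

All four expected frequencies exceed five, so the adequacy condition stated above is met for this table.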

## Fisher’s exact test

**References:** Book 1 pp40, HB pp11.8
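As a rough illustration of the idea, the sketch below applies the standard construction to the example table: with all marginal totals held fixed, the count in the exposed-and-diseased cell follows a hypergeometric distribution under the null hypothesis of no association. The two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention and is an assumption, not necessarily the one used in the references or in `opyn`. All names are illustrative.

```python
from math import comb

# Cell counts taken from the example cohort table.
a, b = 14, 19  # exposed: disease, no disease
c, d = 13, 39  # not exposed: disease, no disease

row1, row2 = a + b, c + d  # exposed / not exposed totals
col1 = a + c               # disease total
n = row1 + row2


def table_probability(k):
    """Hypergeometric probability that the top-left cell equals k."""
    return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)


# Range of top-left counts compatible with the fixed margins.
k_min = max(0, col1 - row2)
k_max = min(row1, col1)
probs = {k: table_probability(k) for k in range(k_min, k_max + 1)}

p_observed = probs[a]

# Two-sided: tables at most as probable as the observed one.
# The small relative tolerance guards against floating-point ties.
p_two_sided = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-9))

# One-sided: tables with at least as many exposed cases as observed.
p_one_sided = sum(p for k, p in probs.items() if k >= a)

print(f"two-sided p = {p_two_sided:.6f}, one-sided p = {p_one_sided:.6f}")
```

Because the probabilities are computed exactly, this test does not rely on the expected frequencies being at least five.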

## Multiple exposure categories

**References:** Book 1 pp29-30; HB pp11.7

Pick an exposure category as an arbitrary reference category.
All ratios are then calculated relative to this reference category.

In [8]:
# use this as an example
pre_term = [18, 266]
term = [402, 8565]
post_term = [45, 1100]
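
A sketch of the calculation for these counts follows. Two things are assumptions rather than facts from the notes: that each list is ordered `[disease, no disease]`, and that `term` is taken as the reference category. The helper names `risk` and `odds` are illustrative.

```python
# ASSUMPTION: each list is [disease, no disease] for that exposure category.
counts = {
    "pre_term": [18, 266],
    "term": [402, 8565],
    "post_term": [45, 1100],
}
reference = "term"  # ASSUMPTION: arbitrary choice of reference category


def risk(row):
    """Estimated probability of disease within one exposure category."""
    return row[0] / sum(row)


def odds(row):
    """Estimated odds of disease within one exposure category."""
    return row[0] / row[1]


ref_row = counts[reference]

# Relative risk and odds ratio of each category against the reference.
ratios = {
    name: (risk(row) / risk(ref_row), odds(row) / odds(ref_row))
    for name, row in counts.items()
}

for name, (rr, odds_ratio) in ratios.items():
    print(f"{name}: RR={rr:.4f}, OR={odds_ratio:.4f}")
```

The reference category always has a ratio of exactly 1 against itself, so only the other categories carry information.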