## Summary

Perform an epidemiological analysis on the results of a cohort or case-control study with two or more categories of exposure using *statsmodels* and *scipy*.

There are two main questions addressed in this notebook.

1. How do we ensure the instance of `Table2x2` from *statsmodels* initialises correctly?
1. How do we handle cases where there are multiple exposures?

Both of these questions are solved by defining three functions that handle the transformation and initialisation of the required data structures.
They aren't particularly interesting functions, so we don't discuss them beyond noting what the arguments expect.

We then show two worked examples, the first involving a cohort study with two exposures, and the second a case-control study with three exposures.
This is a recipe book rather than a theoretical overview, so we don't discuss the theory or interpretation of the results.

## Dependencies

In [1]:
from collections import defaultdict
import pandas as pd
from scipy import stats as st
from statsmodels import api as sm
import lrdataio

In [2]:
%load_ext watermark
%watermark --iv

scipy      : 1.9.3
statsmodels: 0.13.5
lrdataio   : 0.3.0
pandas     : 1.5.1



In [3]:
%precision 6

'%.6f'

## Functions

`mask(df: pd.DataFrame, exposures: list, outcome: str) -> pd.DataFrame`

In [4]:
#| code-fold: true
def mask(df, exposures, outcome) -> pd.DataFrame:
    e_dict = defaultdict(lambda: 'not exposed')
    for n, e in enumerate(exposures):
        e_dict[e] = f'exposure{n+1}'
    o_dict = defaultdict(lambda: 'no disease')
    o_dict[outcome] = 'disease'
    return (
        pd.DataFrame()
        .assign(n=df['n'],
                exposure=df['exposure'].map(lambda s: e_dict[s]),
                outcome=df['outcome'].map(lambda s: o_dict[s]))
        .pivot_table('n', 'exposure', 'outcome')
    )

Return a masked `DataFrame`, where the exposure and outcome labels are replaced with generic labels and the counts are pivoted so that rows are exposures and columns are outcomes.

Pre-conditions:

- *df* has three columns
   - 'n', `int`, the number of observations
   - 'exposure', `str`, the exposure category
   - 'outcome', `str`, the outcome
- *exposures* is a list of the exposures, sans the reference exposure
- *outcome* is the label representing a positive outcome, either the disease or the case
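The relabelling in `mask` relies on `defaultdict`: any label that has not been registered explicitly falls back to the default. A minimal sketch of that behaviour in plain Python, where the labels are illustrative only:

```python
from collections import defaultdict

# Exposures of interest, excluding the reference category (illustrative labels)
exposures = ['pre-term', 'post-term']

# Any label that is not registered below falls back to 'not exposed'
e_dict = defaultdict(lambda: 'not exposed')
for n, e in enumerate(exposures):
    e_dict[e] = f'exposure{n+1}'

raw_labels = ['pre-term', 'term', 'post-term']
generic_labels = [e_dict[s] for s in raw_labels]
print(generic_labels)
```

The reference exposure never needs to be named, which is why it is left out of the exposures list.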

`collect_tables(df: pd.DataFrame, exposures: list, outcome: str) -> dict[str, sm.stats.Table2x2]`

In [5]:
#| code-fold: true
def collect_tables(df, exposures, outcome) -> dict[str, sm.stats.Table2x2]:
    masked_df = df.pipe(mask, exposures, outcome)
    es = [f'exposure{n+1}' for n in range(len(exposures))]

    def filter_exposure(e):
        # Keep the row for exposure e and the reference row
        return (
            masked_df
            .query('exposure in [@e, "not exposed"]')
            .to_numpy()
        )

    return {e: sm.stats.Table2x2(filter_exposure(e)) for e in es}

Return a dictionary that maps each exposure to an instance of `Table2x2`.
The exposure labels are replaced by the generic labels in the mapping.

Pre-conditions:

- *df* has three columns
   - 'n', `int`, the number of observations
   - 'exposure', `str`, the exposure category
   - 'outcome', `str`, the outcome
- *exposures* is a list of the exposures, sans the reference exposure
- *outcome* is the label representing a positive outcome, either the disease or the case

`make_table(df: pd.DataFrame, exposures: list, outcome: str) -> sm.stats.Table`

In [6]:
#| code-fold: true
def make_table(df, exposures, outcome) -> sm.stats.Table:
    arr = (
        df
        .pipe(mask, exposures, outcome)
        .to_numpy()
    )

    return sm.stats.Table(arr)

Return an instance of the *statsmodels* `Table`.

The rows of the returned `Table` are ordered exposure1, exposure2, ..., not exposed, where exposure1, exposure2, ..., are the items of *exposures*.

Pre-conditions:

- *df* has three columns
   - 'n', `int`, the number of observations
   - 'exposure', `str`, the exposure category
   - 'outcome', `str`, the outcome
- *exposures* is a list of the exposures, sans the reference exposure
- *outcome* is the label representing a positive outcome, either the disease or the case
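The row order of the returned `Table` comes from `pivot_table` sorting the generic labels. Assuming plain lexicographic sorting of those labels, the reference row lands last, but with ten or more exposures `exposure10` would sort before `exposure2`. A small sketch using only the built-in `sorted`:

```python
# Generic labels as produced by the relabelling step
labels = ['not exposed', 'exposure2', 'exposure1']
ordered = sorted(labels)
print(ordered)

# With ten or more exposures, lexicographic order is no longer numeric order
many = [f'exposure{n}' for n in range(1, 12)] + ['not exposed']
ordered_many = sorted(many)
print(ordered_many[:4])
```

If that many exposure categories are ever needed, zero-padded labels would be one way to keep the rows in the intended order.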

## Two exposures

The data for this worked example are taken from a cohort study that investigated the association between compulsory redundancy (the exposure) and the incidence of serious self-harm (Keefe *et al*, 2002).

The background of the study is...

> *The association between unemployment and poor health outcomes is well documented.*
*Significant debate exists as to whether unemployment causes ill health or whether those with poor health find it harder to obtain and maintain employment.*
*Factory closure studies are well placed to comment on causation.*
*The objective of this study was to investigate associations between involuntary job loss, mortality and serious illness.*

### Setup

Cache and load the results into a `DataFrame`.

In [7]:
_temp_path = lrdataio.cache_url('https://raw.githubusercontent.com/ljk233/laughingrook-datasets/main/m249/medical/redundancy_ssi.csv')  # noqa
redundancy_df = pd.read_csv(_temp_path)
redundancy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   n         4 non-null      int64 
 1   exposure  4 non-null      object
 2   outcome   4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes


In [8]:
redundancy_df

| | n | exposure | outcome |
|---|---|---|---|
| 0 | 14 | redundant | ssi |
| 1 | 1931 | redundant | no ssi |
| 2 | 4 | not redundant | ssi |
| 3 | 1763 | not redundant | no ssi |


Use the `collect_tables` function to collect the instances of `Table2x2` as a `dict`, where each item in the `dict` maps an exposure to an instance of `Table2x2`.

Given there are only two exposures here, there will only be a single item in the returned `dict`.
We therefore bypass the `dict` and directly assign the item to a variable.

In [9]:
_exposures = ['redundant']  # exposure list
_outcome = 'ssi'
redundancy_table = (
    redundancy_df
    .pipe(collect_tables, _exposures, _outcome)
    ['exposure1']
)

Output the table to ensure it has initialised successfully.

In [10]:
print(redundancy_table)

A 2x2 contingency table with counts:
[[  14. 1931.]
 [   4. 1763.]]


### Measures of association

Output a summary table of the measures of association.

:::{.callout-note}
If you are coming here from **M249**, then we are only interested in the Odds ratio and Risk ratio rows.
Also, the risk ratio is the *relative risk*.
:::

In [11]:
redundancy_table.summary()

| | Estimate | SE | LCB | UCB | p-value |
|---|---|---|---|---|---|
| Odds ratio | 3.195 | | 1.050 | 9.726 | 0.041 |
| Log odds ratio | 1.162 | 0.568 | 0.049 | 2.275 | 0.041 |
| Risk ratio | 3.180 | | 1.049 | 9.642 | 0.041 |
| Log risk ratio | 1.157 | 0.566 | 0.047 | 2.266 | 0.041 |
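As a cross-check, the estimates and interval bounds in the summary can be reproduced by hand. The sketch below uses the standard large-sample (Wald) formulas on the log scale. It is assumed, not confirmed from the *statsmodels* source, that the summary uses the same formulas, although the numbers agree to three decimal places.

```python
import math
from statistics import NormalDist

# Counts from the redundancy table: rows are exposure, columns are outcome
a, b = 14, 1931   # redundant: ssi, no ssi
c, d = 4, 1763    # not redundant: ssi, no ssi

z = NormalDist().inv_cdf(0.975)  # critical value for a two-sided 95% interval

# Odds ratio with a Wald interval computed on the log scale
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
or_lcb = math.exp(math.log(odds_ratio) - z * se_log_or)
or_ucb = math.exp(math.log(odds_ratio) + z * se_log_or)

# Risk ratio (relative risk) with a Wald interval computed on the log scale
risk_ratio = (a / (a + b)) / (c / (c + d))
se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
rr_lcb = math.exp(math.log(risk_ratio) - z * se_log_rr)
rr_ucb = math.exp(math.log(risk_ratio) + z * se_log_rr)

print(f'OR = {odds_ratio:.3f} ({or_lcb:.3f}, {or_ucb:.3f})')
print(f'RR = {risk_ratio:.3f} ({rr_lcb:.3f}, {rr_ucb:.3f})')
```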


### Tests for no association

#### Chi-squared test

Expected frequencies under the null hypothesis of no association.

In [12]:
redundancy_table.fittedvalues

array([[   9.431573, 1935.568427],
       [   8.568427, 1758.431573]])

Differences between the observed and expected frequencies.

In [13]:
redundancy_table.table - redundancy_table.fittedvalues

array([[ 4.568427, -4.568427],
       [-4.568427,  4.568427]])

Contributions to the chi-squared test statistic.

In [14]:
redundancy_table.chi2_contribs

array([[2.212836, 0.010783],
       [2.435747, 0.011869]])

Chi-squared test result.

In [15]:
print(redundancy_table.test_nominal_association())

df          1
pvalue      0.030671871146104812
statistic   4.671234588585282
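The chi-squared steps above can also be traced with the standard library alone. This sketch recomputes the expected frequencies, the per-cell contributions and the test statistic from the observed counts. The p-value uses the closed form of the chi-squared survival function for one degree of freedom.

```python
import math

# Observed counts: rows are exposure, columns are outcome
observed = [[14, 1931],
            [4, 1763]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Each cell contributes (observed - expected)^2 / expected
contribs = [[(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
            for o_row, e_row in zip(observed, expected)]
statistic = sum(sum(row) for row in contribs)

# With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
pvalue = math.erfc(math.sqrt(statistic / 2))

print(f'statistic = {statistic:.6f}, pvalue = {pvalue:.6f}')
```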


#### Fisher's exact test

:::{.callout-note}
There is no implementation of Fisher's exact test in *statsmodels*, so we use the *scipy* implementation.
:::

In [16]:
pd.Series(
    data=st.fisher_exact(redundancy_table.table)[1],
    index=['pvalue'],
    name="fisher's exact"
)

pvalue    0.033877
Name: fisher's exact, dtype: float64
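To see what the two-sided p-value measures, it can be approximated directly from the hypergeometric distribution: with the margins held fixed, sum the probabilities of every table that is no more likely than the observed one. This is a sketch of that definition, not *scipy*'s implementation, and the tie tolerance is an assumption.

```python
from math import comb

# Observed 2x2 table: rows are exposure, columns are outcome
a, b = 14, 1931
c, d = 4, 1763

row1, row2 = a + b, c + d   # exposed and not exposed totals
col1 = a + c                # total with the outcome
total = row1 + row2
denominator = comb(total, col1)


def prob(k):
    """Probability that k of the col1 outcomes fall in the exposed row."""
    return comb(row1, k) * comb(row2, col1 - k) / denominator


# Feasible values of the top-left cell given the fixed margins
k_min = max(0, col1 - row2)
k_max = min(row1, col1)

p_observed = prob(a)
# Sum over tables no more likely than the observed one;
# the small relative tolerance guards against floating-point ties
pvalue = sum(p for p in map(prob, range(k_min, k_max + 1))
             if p <= p_observed * (1 + 1e-7))

print(f'pvalue = {pvalue:.6f}')
```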

## Three or more exposures

Data are taken from a cohort study that investigated the association between the duration of pregnancy (the exposure) and the incidence of early childhood asthma (Yuan *et al*, 2002).

The background of the study is...

> *Childhood asthma may have a fetal origin.*
*In order to examine this hypothesis we examined the association between fetal growth indicators and hospitalization with asthma during early childhood.*

### Setup

Cache and load the results into a `DataFrame`.

In [17]:
_temp_path = lrdataio.cache_url('https://raw.githubusercontent.com/ljk233/laughingrook-datasets/main/m249/medical/asthmagest.csv')  # noqa
asthma_df = pd.read_csv(_temp_path)
asthma_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   n         6 non-null      int64 
 1   exposure  6 non-null      object
 2   outcome   6 non-null      object
dtypes: int64(1), object(2)
memory usage: 272.0+ bytes


In [18]:
asthma_df

| | n | exposure | outcome |
|---|---|---|---|
| 0 | 18 | pre-term | hospitalised |
| 1 | 266 | pre-term | not hospitalised |
| 2 | 402 | term | hospitalised |
| 3 | 8565 | term | not hospitalised |
| 4 | 45 | post-term | hospitalised |
| 5 | 1100 | post-term | not hospitalised |


Use the `collect_tables` function to collect the instances of `Table2x2` as a `dict`.

We take the 'term' exposure to be the reference exposure, so we do not add it to the exposures list.

In [19]:
_exposures = ['pre-term', 'post-term']  # exposure list
_outcome = 'hospitalised'
asthma_tables = asthma_df.pipe(collect_tables, _exposures, _outcome)

Output the tables to ensure they have initialised successfully.

In [20]:
for e, t in asthma_tables.items():
    print(f'exposure={e}')
    print(t, '\n')

exposure=exposure1
A 2x2 contingency table with counts:
[[  18.  266.]
 [ 402. 8565.]] 

exposure=exposure2
A 2x2 contingency table with counts:
[[  45. 1100.]
 [ 402. 8565.]] 



### Measures of association

Output two summary tables, one for each exposure.

:::{.callout-note}
This is a case-control study, so only the Odds ratio is relevant.
:::

In [21]:
for e, t in asthma_tables.items():
    print(f'exposure={e}')
    print(t.summary(), '\n')

exposure=exposure1
               Estimate   SE   LCB    UCB  p-value
--------------------------------------------------
Odds ratio        1.442        0.885 2.348   0.141
Log odds ratio    0.366 0.249 -0.122 0.854   0.141
Risk ratio        1.414        0.895 2.233   0.138
Log risk ratio    0.346 0.233 -0.111 0.803   0.138
-------------------------------------------------- 

exposure=exposure2
               Estimate   SE   LCB    UCB  p-value
--------------------------------------------------
Odds ratio        0.872        0.636 1.194   0.392
Log odds ratio   -0.137 0.160 -0.452 0.177   0.392
Risk ratio        0.877        0.648 1.186   0.393
Log risk ratio   -0.132 0.154 -0.434 0.170   0.393
-------------------------------------------------- 
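Each summary compares one exposure category against the reference category ('term'). A short sketch of the odds ratio arithmetic behind the two estimates, using the counts printed earlier:

```python
# Counts per exposure category: (hospitalised, not hospitalised)
counts = {
    'pre-term': (18, 266),
    'post-term': (45, 1100),
    'term': (402, 8565),   # reference exposure
}

ref_cases, ref_noncases = counts['term']

odds_ratios = {}
for exposure, (cases, noncases) in counts.items():
    if exposure == 'term':
        continue
    # Odds of hospitalisation in this category relative to the reference
    odds_ratios[exposure] = (cases * ref_noncases) / (noncases * ref_cases)

for exposure, estimate in odds_ratios.items():
    print(f'{exposure}: OR = {estimate:.3f}')
```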



### Tests for no association

#### Chi-squared test

Initialise the instance of `Table`.

In [22]:
_exposures = ['pre-term', 'post-term']  # exposure list
_outcome = 'hospitalised'
asthma_table = asthma_df.pipe(make_table, _exposures, _outcome)

Output the table to ensure it has initialised successfully.

:::{.callout-note}
The rows are ordered: pre-term, post-term, term, and columns are ordered hospitalised, not hospitalised.
:::

In [23]:
print(asthma_table)

A 3x2 contingency table with counts:
[[  18.  266.]
 [  45. 1100.]
 [ 402. 8565.]]


Expected frequencies under the null hypothesis of no association.

In [24]:
asthma_table.fittedvalues

array([[  12.702963,  271.297037],
       [  51.214409, 1093.785591],
       [ 401.082628, 8565.917372]])

Differences between the observed and expected frequencies.

In [25]:
asthma_table.table - asthma_table.fittedvalues

array([[ 5.297037, -5.297037],
       [-6.214409,  6.214409],
       [ 0.917372, -0.917372]])

Contributions to the chi-squared test statistic.

In [26]:
asthma_table.chi2_contribs

array([[2.208824e+00, 1.034239e-01],
       [7.540629e-01, 3.530755e-02],
       [2.098250e-03, 9.824651e-05]])

Chi-squared test result.

In [27]:
print(asthma_table.test_nominal_association())

df          2
pvalue      0.21184355211692685
statistic   3.1038144769010287
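For a table with more than two rows the calculation is the same, but the degrees of freedom become (rows - 1) × (columns - 1). A standard-library sketch for the 3x2 table, using the closed form of the chi-squared survival function for two degrees of freedom:

```python
import math

# Observed counts: columns are hospitalised, not hospitalised
observed = [[18, 266],      # pre-term
            [45, 1100],     # post-term
            [402, 8565]]    # term (reference)

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells
statistic = sum(
    (o - r * c / total) ** 2 / (r * c / total)
    for row, r in zip(observed, row_totals)
    for o, c in zip(row, col_totals)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# With 2 degrees of freedom, P(X > x) = exp(-x / 2)
pvalue = math.exp(-statistic / 2)

print(f'df = {df}, statistic = {statistic:.6f}, pvalue = {pvalue:.6f}')
```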


## References

Keefe V, Reid P, Ormsby C, et al. Serious health events following involuntary job loss in New Zealand meat processing workers. Int J Epidemiol. 2002;31(6):1155-1161.
doi:10.1093/ije/31.6.1155

Yuan W, Basso O, Sorensen HT, Olsen J. Fetal growth and hospitalization with asthma during early childhood: a follow-up study in Denmark. Int J Epidemiol. 2002;31(6):1240-1245.
doi:10.1093/ije/31.6.1240