## Summary notes

Analyse the results of a cohort study using standard epidemiological methods: measures of association, a chi-squared test of no association, and Fisher's exact test.

This topic is covered by M249 Book 1, Part 1.

### About the study

We sourced data from a cohort study of the possible association between compulsory redundancy and incidents of serious self-inflicted injury (SSI) (Keefe, V., et al. (2002)).
The exposure is being made compulsorily redundant, and the disease is an incident of serious self-inflicted injury.

The study results were as follows.

|                            | SSI (+) | no SSI (-) |
| -------------------------- | ------: | ---------: |
| **made redundant (+)**     | 14      | 1931       |
| **not made redundant (-)** | 4       | 1763       |

### About the data

The data were stored remotely as a CSV file in a tidy manner with the schema:

| column     | dtype | description                                       |
| ---------- | ----- | ------------------------------------------------- |
| *n*        | `int` | Number of observations                            |
| *exposure* | `str` | Descriptive label indicating category of exposure |
| *disease*  | `str` | Descriptive label indicating category of disease  |

### Method

The analysis was undertaken using StatsModels and SciPy.

We defined a function `cache_file` to handle the retrieval of the data.

The exposure and disease labels were stored as two `list`s, named *exposures* and *diseases*.

:::{.callout-note}
Note that the orders of the two `list`s are important, and should correspond with:

- *exposures* = (*exposed*, *not exposed*)
- *diseases* = (*disease*, *no disease*)
:::
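
To illustrate why the order matters, here is a minimal sketch in plain Python. The helper `odds_ratio` is hypothetical and is not part of the analysis; it only shows that swapping the exposure rows inverts the odds ratio, so a table built from mis-ordered labels would report the wrong direction of association.

```python
def odds_ratio(table):
    """Return the odds ratio of a 2x2 table laid out as
    [[exposed & disease,     exposed & no disease],
     [not exposed & disease, not exposed & no disease]].
    """
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Rows in the expected order: (made redundant, not made redundant).
expected_order = [[14, 1931], [4, 1763]]
# The same counts with the exposure rows swapped.
swapped_rows = [[4, 1763], [14, 1931]]

print(round(odds_ratio(expected_order), 4))  # 3.1955
print(round(odds_ratio(swapped_rows), 4))    # 0.3129, the reciprocal
```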

The data were cached and used to initialise *data*, a Pandas `DataFrame`.

A new `DataFrame` *cat_data* was initialised from *data*, with the *exposure*, *disease* columns as ordered `Categorical`[^6] data types, and was sorted by (*exposure*, *disease*).

We took column *cat_data*[*n*] as *data_arr*, a NumPy `NDArray` reshaped to `(2, 2)`, and used it to initialise *ctable*, an instance of `Table2x2`.[^1]

Measures of association[^2] were calculated, including confidence interval estimates.
A chi-squared test of no association was used to test the strength of evidence of an association.
We rounded off the analysis by performing Fisher's exact test.[^3] [^4]

## Dependencies

In [1]:
import os
import requests
import numpy as np
import pandas as pd
from scipy import stats as st
from statsmodels import api as sm

## Functions

In [2]:
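def cache_file(url: str, fname: str, dir_: str = './__cache') -> str:
    """Cache the file at the given url in the given dir_ with the given
    fname, and return the local path.

    Preconditions:
    - dir_ exists
    """
    local_path = f'{dir_}/{fname}'
    # Only download the file if there is no cached copy.
    if fname not in os.listdir(dir_):
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()  # avoid caching an error response
        # Use a context manager so the file handle is always closed.
        with open(local_path, 'wb') as f:
            f.write(r.content)
    return local_path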
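
The precondition that `dir_` exists has to be met by the caller. One way to do that is sketched below; the helper name `ensure_cache_dir` is an assumption made for illustration and does not appear in the original notebook. The demonstration runs inside a temporary directory so that it does not touch the working directory.

```python
import os
import tempfile


def ensure_cache_dir(dir_: str) -> str:
    """Create dir_ if it does not already exist, and return it."""
    os.makedirs(dir_, exist_ok=True)  # no error if it is already there
    return dir_


# Demonstrate inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = ensure_cache_dir(os.path.join(tmp, '__cache'))
    created = os.path.isdir(cache_dir)
    print(created)  # True
```

In the notebook itself, calling `ensure_cache_dir('./__cache')` before `cache_file` would satisfy the precondition.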

## Main

### Initialise the labels

In [3]:
exposures = ['redundant', 'not redundant']
diseases = ['ssi', 'no ssi']

### Cache the data

In [4]:
local_path = cache_file(
    url=('https://raw.githubusercontent.com/ljk233/laughingrook-datasets'
         + '/main/m249/medical/redundancy_ssi.csv'),
    fname='redundancy_ssi.csv'
)

### Load the data

Use the cached file to initialise a `DataFrame`.

In [5]:
data = pd.read_csv(local_path)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   n         4 non-null      int64 
 1   exposure  4 non-null      object
 2   disease   4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes


Output a view of *data*.

In [6]:
data

|     | n    | exposure      | disease |
| --: | ---: | ------------- | ------- |
| 0   | 14   | redundant     | ssi     |
| 1   | 1931 | redundant     | no ssi  |
| 2   | 4    | not redundant | ssi     |
| 3   | 1763 | not redundant | no ssi  |


### Prepare the data

Initialise a new `DataFrame` using *data*, with the columns *exposure*, *disease* as ordered `Categorical` variables.

In [7]:
cat_data = pd.DataFrame().assign(
    n=data['n'].to_numpy(),
    exposure=pd.Categorical(data['exposure'], exposures, ordered=True),
    disease=pd.Categorical(data['disease'], diseases, ordered=True)
).sort_values(
    by=['exposure',
        'disease']
)
cat_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   n         4 non-null      int64   
 1   exposure  4 non-null      category
 2   disease   4 non-null      category
dtypes: category(2), int64(1)
memory usage: 320.0 bytes


Output a pivot table with the marginal totals.

In [8]:
cat_data.pivot_table(
    values='n',
    index='exposure',
    columns='disease',
    aggfunc='sum',
    margins=True,
    margins_name='total'
).query(
    "exposure != 'total'"
)

| exposure / disease | ssi | no ssi | total |
| ------------------ | --: | -----: | ----: |
| redundant          | 14  | 1931   | 1945  |
| not redundant      | 4   | 1763   | 1767  |


### Initialise the contingency table

Get *cat_data*[*n*] as a NumPy `NDArray` with shape `(2, 2)`.

In [9]:
data_arr = cat_data['n'].to_numpy().reshape(2, 2)
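
The reshape relies on the rows already being sorted by (*exposure*, *disease*), because the counts are laid out row by row. The following plain-Python sketch mimics that step without pandas or NumPy; the deliberately shuffled `records` list is made up for illustration, using the counts from the study.

```python
exposures = ['redundant', 'not redundant']
diseases = ['ssi', 'no ssi']

# (n, exposure, disease) records, deliberately out of order.
records = [
    (1763, 'not redundant', 'no ssi'),
    (14, 'redundant', 'ssi'),
    (4, 'not redundant', 'ssi'),
    (1931, 'redundant', 'no ssi'),
]

# Sort by each label's position in its list, which is how an
# ordered categorical sorts (rather than alphabetically).
ordered = sorted(
    records,
    key=lambda r: (exposures.index(r[1]), diseases.index(r[2]))
)
counts = [n for n, _, _ in ordered]

# Row-major "reshape" of the four counts into a 2x2 table.
table = [counts[0:2], counts[2:4]]
print(table)  # [[14, 1931], [4, 1763]]
```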

Initialise an instance of `Table2x2` using *data_arr*.

In [10]:
ctable = sm.stats.Table2x2(data_arr)
print(ctable)

A 2x2 contingency table with counts:
[[  14. 1931.]
 [   4. 1763.]]


### Measures of association

Return point and interval estimates of the odds ratio.

In [11]:
pd.Series(
    data=[ctable.oddsratio,
          ctable.oddsratio_confint()[0],
          ctable.oddsratio_confint()[1]],
    index=['point', 'lcb', 'ucb'],
    name='odds ratio'
)

point    3.195495
lcb      1.049877
ucb      9.726081
Name: odds ratio, dtype: float64

Return point and interval estimates of the relative risk.

In [12]:
pd.Series(
    data=[ctable.riskratio,
          ctable.riskratio_confint()[0],
          ctable.riskratio_confint()[1]],
    index=['point', 'lcb', 'ucb'],
    name='relative risk'
)

point    3.179692
lcb      1.048602
ucb      9.641829
Name: relative risk, dtype: float64
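
As a cross-check, the estimates above can be reproduced by hand. This is a minimal sketch that assumes the intervals are the usual large-sample 95% intervals computed on the log scale; the variable names are chosen for illustration only.

```python
import math
from statistics import NormalDist

a, b = 14, 1931  # made redundant: SSI, no SSI
c, d = 4, 1763   # not made redundant: SSI, no SSI

z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

# Odds ratio and its interval on the log scale.
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
or_ci = (math.exp(math.log(odds_ratio) - z * se_log_or),
         math.exp(math.log(odds_ratio) + z * se_log_or))

# Relative risk and its interval on the log scale.
risk_ratio = (a / (a + b)) / (c / (c + d))
se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
rr_ci = (math.exp(math.log(risk_ratio) - z * se_log_rr),
         math.exp(math.log(risk_ratio) + z * se_log_rr))

print(round(odds_ratio, 4), [round(x, 4) for x in or_ci])
print(round(risk_ratio, 4), [round(x, 4) for x in rr_ci])
```

Both intervals exclude 1, which agrees with the point estimates in suggesting a positive association between redundancy and SSI.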

### Chi-squared test for no association

Return the expected frequencies under the null hypothesis of no association.

In [13]:
ctable.fittedvalues

array([[   9.43157328, 1935.56842672],
       [   8.56842672, 1758.43157328]])

Return the differences between the observed and expected frequencies.

In [14]:
data_arr - ctable.fittedvalues

array([[ 4.56842672, -4.56842672],
       [-4.56842672,  4.56842672]])

Return the contributions to the chi-squared test statistic.

In [15]:
ctable.chi2_contribs

array([[2.21283577, 0.01078263],
       [2.43574736, 0.01186883]])

Return the results of the chi-squared test.[^5]

In [16]:
_res = ctable.test_nominal_association()
pd.Series(
    data=[_res.statistic, _res.pvalue, int(_res.df)],
    index=['statistic', 'pvalue', 'df'],
    name='chi-squared test',
    dtype=object
)

statistic    4.671235
pvalue       0.030672
df                  1
Name: chi-squared test, dtype: object
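
The expected frequencies, contributions, and test result can also be derived from first principles. The sketch below uses only the standard library and assumes a test with one degree of freedom and no continuity correction, for which the upper tail of the chi-squared distribution can be written with the complementary error function.

```python
import math

observed = [[14, 1931], [4, 1763]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under no association: row total * column total / total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Contribution of each cell: (observed - expected)^2 / expected.
contribs = [
    [(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]
statistic = sum(sum(row) for row in contribs)

# For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
pvalue = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 6), round(pvalue, 6))
```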

### Fisher's exact test

Return the results of Fisher's exact test.

In [17]:
_, _pvalue = st.fisher_exact(ctable.table)
pd.Series(
    data=_pvalue,
    index=['pvalue'],
    name="fisher's exact"
)

pvalue    0.033877
Name: fisher's exact, dtype: float64
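
For intuition, the two-sided p-value can be sketched directly from the hypergeometric distribution. This assumes the common definition in which the p-value is the total probability of all tables, with the same margins, that are no more likely than the observed one; the small relative tolerance is an assumption to guard against floating-point ties.

```python
from math import comb

a, b = 14, 1931  # made redundant: SSI, no SSI
c, d = 4, 1763   # not made redundant: SSI, no SSI

row1, row2 = a + b, c + d  # exposure totals
col1 = a + c               # total number of SSI cases
total = row1 + row2


def pmf(k: int) -> float:
    """Probability that k of the col1 SSI cases fall in the first row,
    given fixed margins (hypergeometric distribution)."""
    return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)


# Range of counts in the top-left cell that the margins allow.
lo, hi = max(0, col1 - row2), min(col1, row1)

p_obs = pmf(a)
pvalue = sum(
    pmf(k) for k in range(lo, hi + 1)
    if pmf(k) <= p_obs * (1 + 1e-7)
)
print(round(pvalue, 6))
```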

## References

Vera Keefe, Papaarangi Reid, Clint Ormsby, Bridget Robson, Gordon Purdie, Joanne Baxter, Ngäti Kahungunu Iwi Incorporated, Serious health events following involuntary job loss in New Zealand meat processing workers, *International Journal of Epidemiology*, Volume 31, Issue 6, December 2002, Pages 1155–1161, https://doi.org/10.1093/ije/31.6.1155

In [18]:
%load_ext watermark
%watermark --iv

statsmodels: 0.13.2
numpy      : 1.23.2
requests   : 2.28.1
scipy      : 1.9.0
pandas     : 1.4.3
sys        : 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]



[^1]: See [statsmodels.stats.contingency_tables.Table2x2](https://www.statsmodels.org/stable/generated/statsmodels.stats.contingency_tables.Table2x2.html)
[^2]: Odds ratio and relative risk.
[^3]: Technically this is not needed, given all expected values are greater than five, but we include it for completeness.
[^4]: There is no version of Fisher's exact test in StatsModels, so we use SciPy instead. See [scipy.stats.fisher_exact](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html)
[^5]: We pass the argument `dtype=object`, so the `Series` can handle both `float` and `int` data types.
[^6]: See [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html)