## Summary notes

Analyse the results of a 1-1 matched case-control study.

The data were taken from a case-control study undertaken to identify some of the risk factors associated with death during the heatwave that occurred in Chicago from 12 July 1995 to 16 July 1995.
Cases were persons aged 24+ years who died between 14 and 17 July 1995, with a cause mentioned on the death certificate that was possibly heat-related. For each case, a matched control of the same age and living in the same neighbourhood was selected.
The risk factor of interest was participation in group activities involving social interactions.
(Semenza, Rubin, Falter, *et al* (1996))

The results were as follows.

| Cases / Controls            | Participated (+) | Did not participate (-) |
| --------------------------- | ---------------: | ----------------------: |
| **Participated (+)**        | 77               | 63                      |
| **Did not participate (-)** | 90               | 74                      |

The analysis was undertaken using StatsModels[^3] and a user-defined function.[^4]
The results were initialised as a NumPy `array`.[^2]

The data were stored remotely, so we defined a function `cache_file` to handle the retrieval of the file.

We initialised a single `list`, *casecons*, which held the category labels shared by the *cases* and *controls* columns.

The data were cached and used to initialise a Pandas `DataFrame`.

We cast the *cases* and *controls* columns to ordered `Categorical` data, to ensure the rows would sort in the expected order.[^5]

A Mantel-Haenszel odds ratio for the association between participation in group activities and dying of heat-related disease was calculated, including a 95% confidence interval estimate.
Finally, McNemar's test[^1] was performed to test the null hypothesis of no association between participation in group activities and dying of heat-related disease.

These topics are covered by M249 Book 1, Part 2.

## Dependencies

In [1]:
import os
import requests
import numpy as np
import pandas as pd
from scipy import stats as st
from statsmodels import api as sm
from numpy.typing import ArrayLike

## Functions

In [2]:
def cache_file(url: str, fname: str, dir_: str = './__cache') -> str:
    """Cache the file at given url in the given dir_ with the given
    fname and return the local path.

    Preconditions:
    - dir_ exists
    """
    local_path = f'{dir_}/{fname}'
    if fname not in os.listdir(dir_):
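        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()  # do not cache an error page
        # Use a context manager so the file handle is always closed
        with open(local_path, 'wb') as f:
            f.write(r.content)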
    return local_path
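
`cache_file` requires that `dir_` already exists. As a minimal sketch of one way to relax that precondition, the hypothetical variant below (the name `cache_file_mkdir` is not part of the original notebook) creates the directory on demand. It also swaps `requests` for the standard library's `urllib.request.urlopen` so that the example has no third-party dependencies. That swap is an assumption made for illustration, not a recommendation.

```python
from pathlib import Path
from urllib.request import urlopen


def cache_file_mkdir(url: str, fname: str, dir_: str = './__cache') -> str:
    """Hypothetical variant of cache_file that creates dir_ if needed.

    Returns the local path of the cached file as a string.
    """
    cache_dir = Path(dir_)
    # Create the cache directory (and any parents) if it is missing
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / fname
    # Only download when the file has not been cached already
    if not local_path.exists():
        with urlopen(url) as response:
            local_path.write_bytes(response.read())
    return str(local_path)
```

Called with the same `url` and `fname` arguments as in the notebook, this should behave like `cache_file`, except that a missing `./__cache` directory is created rather than raising an error.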

In [3]:
def mh_odds_ratio(ctable: ArrayLike, alpha: float = 0.05) -> tuple:
    """Return point and 100(1-alpha)% interval estimates of the
    Mantel-Haenszel odds ratio.

    Pre-conditions:
    - ctable represents the results of a 1-1 matched case-control study
        - shape(ctable) = 2, 2
        - rows represent cases, columns represent controls
        - row 0, col 0 represent (+)
        - row 1, col 1 represent (-)
    - 0 < alpha < 1

    Post-conditions:
    - tuple of float estimates, (point, lcb, ucb)
    """

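    ctable = np.asarray(ctable)  # accept nested lists as well as arrays
    # Discordant pairs: f = (case +, control -), g = (case -, control +)
    f, g = ctable[0, 1], ctable[1, 0]
    # Half-width of the interval on the log odds ratio scale
    ste = (st.norm.ppf(1 - (alpha/2)) * np.sqrt(1/f + 1/g))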
    return f / g, f / g * np.exp(-ste), f / g * np.exp(ste)
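
For reference, the estimates returned by `mh_odds_ratio` can be written as follows. Here $f$ and $g$ are the discordant pair counts used in the function, while $z$ and $\Phi^{-1}$ (the standard normal quantile function) are notation introduced here for the quantity computed by `st.norm.ppf`.

```latex
\widehat{OR}_{MH} = \frac{f}{g},
\qquad
\left(
  \frac{f}{g}\,\exp\!\left(-z\sqrt{\tfrac{1}{f} + \tfrac{1}{g}}\right),\;
  \frac{f}{g}\,\exp\!\left(z\sqrt{\tfrac{1}{f} + \tfrac{1}{g}}\right)
\right),
\qquad
z = \Phi^{-1}\!\left(1 - \tfrac{\alpha}{2}\right)
```

Only the discordant pairs enter the estimate. The concordant cells (both participated, or neither participated) do not appear in either the point or the interval estimate.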

## Main

### Initialise the labels

In [4]:
casecons = ['participated', 'not participated']

### Cache the data

In [5]:
local_path = cache_file(
    url=('https://raw.githubusercontent.com/ljk233/laughingrook-datasets'
         + '/main/m249/medical/heat_deaths.csv'),
    fname='heat_deaths.csv'
)

### Load the data

Use the cached file to initialise a `DataFrame`.

In [6]:
data = pd.read_csv(local_path)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   n         4 non-null      int64 
 1   cases     4 non-null      object
 2   controls  4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes


In [7]:
data

|     |   n | cases            | controls         |
| --: | --: | ---------------- | ---------------- |
|   0 |  77 | participated     | participated     |
|   1 |  63 | participated     | not participated |
|   2 |  90 | not participated | participated     |
|   3 |  74 | not participated | not participated |


### Prepare the data

Initialise a new `DataFrame` using *data*, with the *cases* and *controls* columns as ordered `Categorical` variables.

In [8]:
cat_data = pd.DataFrame().assign(
    n=data['n'].to_numpy(),
    cases=pd.Categorical(data['cases'], casecons, ordered=True),
    controls=pd.Categorical(data['controls'], casecons, ordered=True)
).sort_values(
    by=['cases', 'controls']
)
cat_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   n         4 non-null      int64   
 1   cases     4 non-null      category
 2   controls  4 non-null      category
dtypes: category(2), int64(1)
memory usage: 320.0 bytes
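
The ordering matters because the later `reshape(2, 2)` assumes the rows are sorted with *participated* before *not participated*. A plain string sort would reverse that. The sketch below illustrates the difference on a small hypothetical `Series` (the values in `raw` are made up for illustration and are not the study data).

```python
import pandas as pd

# Same labels, in the same order, as casecons in the notebook
labels = ['participated', 'not participated']

# Hypothetical toy column, not the study data
raw = pd.Series(['not participated', 'participated', 'not participated'])

# Plain strings sort alphabetically, so 'not participated' comes first
plain_sorted = raw.sort_values().tolist()

# An ordered Categorical sorts by the position of each label in `labels`
ordered = pd.Series(pd.Categorical(raw, categories=labels, ordered=True))
cat_sorted = ordered.sort_values().tolist()

print(plain_sorted)
print(cat_sorted)
```

With the ordered categories, *participated* sorts first, which matches the layout of the results table in the summary.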


Output a pivot table with marginal totals.

In [9]:
cat_data.pivot_table(
    values='n',
    index='cases',
    columns='controls',
    aggfunc='sum',
    margins=True,
    margins_name='subtotal'
)

| cases / controls     | participated | not participated | subtotal |
| -------------------- | -----------: | ---------------: | -------: |
| **participated**     |           77 |               63 |      140 |
| **not participated** |           90 |               74 |      164 |
| **subtotal**         |          167 |              137 |      304 |


In [10]:
data_arr = cat_data['n'].to_numpy().reshape(2, 2)
data_arr

array([[77, 63],
       [90, 74]], dtype=int64)

### Mantel-Haenszel odds ratio

In [11]:
_point, _lcb, _ucb = mh_odds_ratio(data_arr)
pd.Series(
    data=[_point, _lcb, _ucb],
    index=['point', 'lcb', 'ucb'],
    name='mantel-haenszel odds ratio'
)

point    0.700000
lcb      0.507309
ucb      0.965881
Name: mantel-haenszel odds ratio, dtype: float64
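
As a cross-check, the same estimates can be reproduced by hand from the two discordant cells of the table (63 and 90). This is a minimal sketch that uses only the standard library. `statistics.NormalDist` stands in for `scipy.stats.norm`, which is a substitution made here and not what the notebook itself uses.

```python
from math import exp, sqrt
from statistics import NormalDist

# Discordant pairs from the results table
f = 63  # case participated, control did not
g = 90  # case did not participate, control did

point = f / g
z = NormalDist().inv_cdf(1 - 0.05 / 2)  # standard normal quantile for a 95% interval
se_log = sqrt(1 / f + 1 / g)            # standard error of the log odds ratio
lcb = point * exp(-z * se_log)
ucb = point * exp(z * se_log)

print(round(point, 6), round(lcb, 6), round(ucb, 6))
```

The printed values should agree with the output of `mh_odds_ratio` above to the displayed precision. The interval lies entirely below 1.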

### McNemar's test

In [12]:
_r = sm.stats.mcnemar(data_arr, exact=False)
pd.Series(
    data=[_r.statistic, _r.pvalue],
    index=['statistic', 'pvalue'],
    name="mcnemar's test"
)

statistic    4.418301
pvalue       0.035555
Name: mcnemar's test, dtype: float64
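
The test statistic can also be checked by hand. The sketch below assumes that `mcnemar` with `exact=False` applies a continuity correction (its `correction` argument defaults to `True`) and refers the statistic to a chi-squared distribution with 1 degree of freedom. The upper tail of that distribution is computed with `math.erfc`, so no third-party libraries are needed.

```python
from math import erfc, sqrt

# Discordant pairs from the results table
f = 63  # case participated, control did not
g = 90  # case did not participate, control did

# Continuity-corrected McNemar statistic
statistic = (abs(f - g) - 1) ** 2 / (f + g)

# Upper tail of chi-squared with 1 degree of freedom:
# P(X > x) = erfc(sqrt(x / 2))
pvalue = erfc(sqrt(statistic / 2))

print(round(statistic, 6), round(pvalue, 6))
```

These values should match the StatsModels output above to the displayed precision.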

## References

Semenza, J. C., Rubin, C. H., Falter, K. H., Selanikio, J. D., Flanders, W. D., Howe, H. L., & Wilhelm, J. L. (1996). Heat-related deaths during the July 1995 heat wave in Chicago. New England Journal of Medicine, **335**, 84-90.
[https://doi.org/10.1056/NEJM199607113350203](https://doi.org/10.1056/NEJM199607113350203).

In [13]:
%load_ext watermark
%watermark --iv

requests   : 2.28.1
scipy      : 1.9.0
statsmodels: 0.13.2
numpy      : 1.23.2
pandas     : 1.4.3
sys        : 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]



[^1]: See [statsmodels.stats.contingency_tables.mcnemar](https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html).
[^2]: We tried to represent the data using the StatsModels `Table2x2` class, but the `mcnemar` function does not accept an instance of `Table2x2`.
[^3]: See [statsmodels.stats.contingency_tables.mcnemar](https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html).
[^4]: See `mh_odds_ratio` under Functions, which calculates the Mantel-Haenszel odds ratio.
[^5]: See [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html).