## Summary notes

Perform a stratified analysis of the results of a stratified case-control study.

This topic is covered in M249, Book 1, Part 2.

### About the study

The data were taken from a study investigating the possible association between alcohol consumption and fatal car accidents in New York (J.R. McCarroll and W. Haddon Jr, 1962).
The data were stratified by marital status, which was believed to be a possible confounder.
The *exposure* was a blood alcohol level of 100mg% or greater;
*cases* were drivers who were killed in car accidents for which they were considered to be responsible;
and *controls* were drivers selected from those passing the locations where the cases' accidents occurred, at the same time of day and on the same day of the week.

The results were as follows.

Level: *Married*

|                      | fatality (+)   | no fatality (-) |
| -------------------- | -------------: | --------------: |
| **over 100mg% (+)**  | 4              | 5               |
| **under 100mg% (-)** | 5              | 103             |

Level: *Not married*

|                      | fatality (+)    | no fatality (-) |
| -------------------- | --------------: | --------------: |
| **over 100mg% (+)**  | 10              | 3               |
| **under 100mg% (-)** | 5               | 43              |
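
As a rough check on the two tables above, the level-specific and crude odds ratios can be computed by hand as *ad*/*bc*. The sketch below is illustrative only: the helper name `odds_ratio` is an assumption and does not appear in the analysis itself.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Return the odds ratio ad/bc of a 2x2 table laid out as
    [[a, b], [c, d]], with exposure in rows and case/control in columns.
    """
    return (a * d) / (b * c)


# Counts copied from the two tables above.
married = odds_ratio(4, 5, 5, 103)
not_married = odds_ratio(10, 3, 5, 43)

# The crude (unadjusted) estimate ignores marital status, so the two
# tables are added cell by cell before taking the ratio.
crude = odds_ratio(4 + 10, 5 + 3, 5 + 5, 103 + 43)

print(f'married={married:.2f}, not married={not_married:.2f}, crude={crude:.2f}')
```

All three point estimates are well above 1, which is why the interval estimates and hypothesis tests in the rest of the notebook are worth computing.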

### Method

The analysis was undertaken using StatsModels.

The data were stored remotely, so we defined the function `cache_file` to handle retrieval of the file.
The data were stored in a CSV file with the schema:

| column     | dtype | description                                              |
| ---------- | ----- | -------------------------------------------------------- |
| *n*        | `int` | Number of observations                                   |
| *level*    | `str` | Descriptive label indicating category of level           |
| *exposure* | `str` | Descriptive label indicating category of exposure        |
| *casecon*  | `str` | Descriptive label indicating category of case or control |

The function `get_odds_ratio_arr` was defined to return the point and interval estimates of the odds ratio as a `list`.

The exposure and case-control labels were stored as two `list`s, named *exposures* and *casecons*.

Note that the order of the labels in *exposures* and *casecons* is important:
`Table2x2`[^2] requires the data to be a sequence with shape `(2, 2)`:

| exposure            | disease (+) | no disease (-) |
| ------------------- | ----------: | -------------: |
| **exposed (+)**     | *a*         | *b*            |
| **not exposed (-)** | *c*         | *d*            |

Similarly, `StratifiedTable`[^3] requires the data to be a sequence with shape `(2, 2, n)`, where `n` is the number of levels in the data:

| level   | exposure            | disease (+) | no disease (-) |
| ------- | ------------------- | ----------: | -------------: |
| **one** | **exposed (+)**     | *a*         | *b*            |
|         | **not exposed (-)** | *c*         | *d*            |
| **two** | **exposed (+)**     | *e*         | *f*            |
|         | **not exposed (-)** | *g*         | *h*            |

As such, the labels should be ordered:

- *exposures* = (*exposed*, *not exposed*)
- *casecons* = (*case*, *control*)
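
To make the two shapes concrete, the sketch below builds both layouts directly from the counts reported in the study tables. It assumes only NumPy; the variable names are illustrative and are not the ones used later in the notebook.

```python
import numpy as np

# Rows are (exposed, not exposed); columns are (case, control).
married = np.array([[4, 5], [5, 103]])
not_married = np.array([[10, 3], [5, 43]])

# Stack along a new last axis, so the result has shape (2, 2, n_levels)
# and stratified[:, :, k] is the 2x2 table for level k.
stratified = np.stack([married, not_married], axis=-1)
print(stratified.shape)

# Summing over the level axis recovers the aggregated 2x2 table.
aggregated = stratified.sum(axis=-1)
print(aggregated)
```

If the exposure or case-control labels were listed in the opposite order, the rows or columns would be swapped and the odds ratio would be inverted, which is why the label order matters.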

The data were cached and used to initialise a Pandas `DataFrame`.

A new `DataFrame`, *cat_data*, was initialised from *data*, with the *level* column as `Categorical` data and the *exposure* and *casecon* columns as ordered `Categorical` data.
We cast the *exposure* and *casecon* columns to ordered `Categorical` data so that sorting would put the rows in the order the contingency tables require.[^7]
The unique values of the *level* column were taken as *levels*, a NumPy `ndarray`.

The main difficulty with a stratified analysis is that we need to deal with the data in many different shapes.
Three variables were initialised to handle the different views of the data.
These were:

- *level_tables*, a collection of `Table2x2` instances, where each instance represents a contingency table for a specific level
- *aggregated_table*, an instance of `Table2x2` using the aggregated data[^8]
- *strat_table*, an instance of `StratifiedTable`

The level-specific, unadjusted[^4] and adjusted[^5] odds ratios were calculated, including their confidence interval estimates.
Two hypothesis tests were performed: a test of no association and a test of homogeneity.
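
For reference, the Mantel-Haenszel estimate weights each level by its size: it is the sum of *ad*/*n* over the levels divided by the sum of *bc*/*n*. The sketch below applies that formula to the counts reported above; the function name `mantel_haenszel_or` is an assumption made for illustration, and the notebook itself relies on `StratifiedTable` instead.

```python
def mantel_haenszel_or(tables) -> float:
    """Return the Mantel-Haenszel pooled odds ratio.

    Each element of tables is a tuple (a, b, c, d) for one level,
    laid out as [[a, b], [c, d]].
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# (a, b, c, d) for the Married and Not married levels.
strata = [(4, 5, 5, 103), (10, 3, 5, 43)]
mh = mantel_haenszel_or(strata)
print(round(mh, 2))
```

The pooled estimate comes out at roughly 23, between the two level-specific estimates, as a weighted combination should.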

## Dependencies

In [None]:
import os
import requests
import numpy as np
import pandas as pd
from statsmodels import api as sm

## Functions

In [None]:
def cache_file(url: str, fname: str, dir_: str = './__cache') -> str:
    """Cache the file at given url in the given dir_ with the given
    fname and return the local path.

    Preconditions:
    - dir_ exists
    """
    local_path = f'{dir_}/{fname}'
    if fname not in os.listdir(dir_):
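        r = requests.get(url, allow_redirects=True)
        # Fail loudly rather than caching an error page.
        r.raise_for_status()
        # Use a context manager so the file handle is closed promptly.
        with open(local_path, 'wb') as f:
            f.write(r.content)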
    return local_path

In [None]:
def get_odds_ratio_arr(table: sm.stats.Table2x2, alpha: float = 0.05) -> list:
    """Return the point estimate of the odds ratio, followed by the lower
    and upper boundaries of its 100(1-alpha)% confidence interval.

    Preconditions:
    - 0 < alpha < 1
    """
    # Compute the interval once and unpack it.
    lcb, ucb = table.oddsratio_confint(alpha=alpha)
    return [table.oddsratio, lcb, ucb]
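
As background to the interval returned above, a common way to form a confidence interval for an odds ratio is a normal approximation on the log scale, with standard error sqrt(1/a + 1/b + 1/c + 1/d). The sketch below is a standard-library illustration of that approach, on the assumption that it matches what `oddsratio_confint` does for tables without zero cells; the name `odds_ratio_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Return (lower, upper) boundaries of an approximate
    100(1-alpha)% confidence interval for the odds ratio ad/bc.
    """
    log_or = math.log((a * d) / (b * c))
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Married level: a=4, b=5, c=5, d=103.
lcb, ucb = odds_ratio_ci(4, 5, 5, 103)
print(round(lcb, 2), round(ucb, 2))
```

The interval is wide because several cells are small, but its lower boundary is still above 1.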

## Main

### Initialise the labels

In [None]:
exposures = ['over 100mg', 'under 100mg']
casecons = ['fatality', 'no fatality']

### Cache the data

In [None]:
local_path = cache_file(
        url=('https://raw.githubusercontent.com/ljk233/laughingrook-datasets'
             + '/main/m249/medical/drinkdriving.csv'),
        fname='drinkdriving.csv'
)

### Load the data

Use the cached file to initialise a `DataFrame`.

In [None]:
data = pd.read_csv(local_path)
data.info()

Output a view of *data*.

In [None]:
data

### Prepare the data

Initialise a new `DataFrame` from *data*, with the *level* column as a `Categorical` variable and the *exposure* and *casecon* columns as ordered `Categorical` variables.

In [None]:
cat_data = pd.DataFrame().assign(
    n=data['n'].to_numpy(),
    level=pd.Categorical(data['level']),
    exposure=pd.Categorical(data['exposure'], exposures, ordered=True),
    casecon=pd.Categorical(data['casecon'], casecons, ordered=True)
).sort_values(
    by=['level', 'exposure', 'casecon']
)
cat_data.info()

Get the unique values of *level* as a NumPy `ndarray`.

In [None]:
levels = np.unique(cat_data['level'])
levels

Output pivot tables with marginal totals.

In [None]:
for level in levels:
    print(
        cat_data.query(
            'level == @level'
        ).pivot_table(
            values='n',
            index=['level', 'exposure'],
            columns='casecon',
            aggfunc='sum',
            margins=True,
            margins_name='subtotal'
        ).query(
            "level in [@level, 'subtotal']"
        ),
        '\n'
    )

### Initialise the contingency tables

Initialise contingency tables for different views of the data.

In [None]:
def _make_table2x2(df: pd.DataFrame, level: str) -> sm.stats.Table2x2:
    """Return the contingency table for the given level."""
    return sm.stats.Table2x2(
        df.query('level == @level')['n']
          .to_numpy()
          .reshape(2, 2)
    )

level_tables = [_make_table2x2(cat_data, level) for level in levels]
for table, level in zip(level_tables, levels):
    print(f'level={level}\n{table}\n')

In [None]:
aggregated_table = sm.stats.Table2x2(
    # Select 'n' before summing so the categorical 'level' column is not summed.
    cat_data.groupby(['exposure', 'casecon'], observed=False)['n']
            .sum()
            .to_numpy()
            .reshape((2, 2))
)
print(aggregated_table)

In [None]:
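strat_table = sm.stats.StratifiedTable(
    # Sort the ordered categorical data with level varying fastest, so the
    # reshape yields shape (2, 2, n) in the required label order.
    cat_data.sort_values(by=['exposure', 'casecon', 'level'])['n']
            .to_numpy()
            .reshape((2, 2, len(levels)))
)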
print(strat_table.table)

### Odds ratios

Return point and interval estimates of the level-specific, unadjusted, and adjusted odds ratios.

Return the level-specific odds ratios.

In [None]:
pd.DataFrame(
    data=[get_odds_ratio_arr(table) for table in level_tables],
    columns=['point', 'lcb', 'ucb'],
    index=pd.Index(levels, name='level')
).round(
    5
)

Return the unadjusted odds ratio.

In [None]:
pd.Series(
    data=get_odds_ratio_arr(aggregated_table),
    index=['point', 'lcb', 'ucb'],
    name='unadjusted odds ratio'
).round(
    5
)

Return the adjusted odds ratio.

In [None]:
pd.Series(
    data=[strat_table.oddsratio_pooled,
           strat_table.oddsratio_pooled_confint()[0],
           strat_table.oddsratio_pooled_confint()[1]],
    index=['point', 'lcb', 'ucb'],
    name='adjusted odds ratio'
).round(
    3
)

### Hypothesis testing

Return the result of a test of no association.

In [None]:
_r = strat_table.test_null_odds(correction=True)
pd.Series(
    data=[_r.statistic, _r.pvalue],
    index=['statistic', 'pvalue'],
    name='test of no association'
).round(
    3
)
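
To show what sits behind this test, the sketch below computes a Mantel-Haenszel chi-squared statistic by hand from the counts in the study tables: the observed number of exposed cases is compared with its expectation under no association, summed over the levels, with a continuity correction of 0.5. This is a standard-library illustration under the assumption that it mirrors `test_null_odds(correction=True)`; the name `mh_test` is hypothetical.

```python
import math


def mh_test(tables, correction=True):
    """Return (statistic, pvalue) for a Mantel-Haenszel test of no
    association, where each table is a tuple (a, b, c, d).
    """
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        r1, r2 = a + b, c + d  # row totals (exposed, not exposed)
        c1, c2 = a + c, b + d  # column totals (cases, controls)
        observed += a
        expected += r1 * c1 / n
        variance += r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))
    diff = abs(observed - expected)
    if correction:
        diff -= 0.5
    statistic = diff ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    pvalue = math.erfc(math.sqrt(statistic / 2))
    return statistic, pvalue


statistic, pvalue = mh_test([(4, 5, 5, 103), (10, 3, 5, 43)])
print(round(statistic, 1), pvalue)
```

The statistic is large relative to a chi-squared distribution with one degree of freedom, so the p-value is very small.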

Return the result of a test of homogeneity.

In [None]:
_r = strat_table.test_equal_odds(adjust=True)
pd.Series(
    data=[_r.statistic, _r.pvalue],
    index=['statistic', 'pvalue'],
    name='test of homogeneity'
).round(
    3
)

## References

McCarroll, J.R. and Haddon Jr, W., 1962. A controlled study of fatal automobile accidents in New York City. Journal of Chronic Diseases, 15(8), pp. 811-826.

In [None]:
%load_ext watermark
%watermark --iv

[^2]: See [statsmodels.stats.contingency_tables.Table2x2](https://www.statsmodels.org/v0.13.0/generated/statsmodels.stats.contingency_tables.Table2x2.html)
[^3]: See [statsmodels.stats.contingency_tables.StratifiedTable](https://www.statsmodels.org/v0.13.0/generated/statsmodels.stats.contingency_tables.StratifiedTable.html)
[^4]: This is the crude odds ratio
[^5]: This is the Mantel-Haenszel odds ratio
[^7]: See [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html)
[^8]: Aggregated by *exposure*, *casecon*