## Summary notes

Perform a stratified analyses on the results of a stratified case-control study.

This topic is covered in M249, Book 1, Part 2.

### About the study

Data were taken from investigating the possible association between alcohol consumption and fatal car accidents in New York (J.R. McCarroll and W. Haddon Jr, 1962).
The data were stratified by marital status, which was believed to be a possible confounder.
The *exposure* was blood alcohol level of 100mg% or greater;
*Cases* were drivers who were killed in car accidents for which they were considered to be responsible;
and *controls* were selected drivers passing the locations where the accidents of the cases occurred, at the same time of day and on the same day of the week.

The results were as follows.

Level: *Married*

|                      | fatality (+)   | no fatality (-) |
| -------------------- | -------------: | --------------: |
| **over 100mg% (+)**  | 4              | 5               |
| **under 100mg% (-)** | 5              | 103             |

Level: *Not married*

|                      | fatality (+)    | no fatality (-) |
| -------------------- | --------------: | --------------: |
| **over 100mg% (+)**  | 10              | 3               |
| **under 100mg% (-)** | 5               | 43              |

### About the data

The data were stored remotely as a CSV file in a tidy manner with the schema:

| column     | dtype | description                                              |
| ---------- | ----- | -------------------------------------------------------- |
| *n*        | `int` | Number of observations                                   |
| *level*    | `str` | Descriptive label indicating category of level           |
| *exposure* | `str` | Descriptive label indicating category of exposure        |
| *casecon*  | `str` | Descriptive label indicating category of case or control |

### Method

:::{.callout-warning}
The main difficulty with a stratified analysis is that we need to deal with the data in many different shapes.
Be careful when handling and reshaping the data!
:::

The analysis was undertaken using StatsModels.

We defined a function `cache_file` to handle the retrieval of the data.

The function `get_odds_ratio_arr` was defined to return a point and interval estimate of the odds ratio as a `list`.

The exposure and case-controls labels were stored as two `list`s with variables named *exposures*, *casecons*.

:::{.callout-note}
Note that the orders of the two `list`s are important, and should correspond with:

- *exposures* = (*exposed*, *not exposed*)
- *casecons* = (*case*, *control*)
:::

The data were cached and used to initialise *data*, a Pandas `DataFrame`.

A new `DataFrame` *cat_data* was initialised from *data*, with the *level*, *exposure*, *casecon* columns as ordered `Categorical`[^7] data types, and was sorted by (*exposure*, *disease*, *level*).
Column *cat_data*[*level*] columns was taken as *levels*, a NumPy `NDArray`.

Three variables were initialised to handle the different views of the data.
These were:

- *level_tables*, a collection of `Table2x2`[^2] instances, where each instance represents a contingency table for a specific level
- *aggregated_table*, an instance of `Table2x2` using the aggregated data[^8]
- *strat_table*, an instance of `StratifiedTable`[^3]

The level-specific, unadjusted[^4] and adjusted[^5] odds ratios were calculated, including their confidence interval estimates.
Two hypothesis tests were performed: a test of no association and a test of homogeneity.

## Dependencies

In [1]:
import os
import requests
from dataclasses import dataclass
import numpy as np
import pandas as pd
from statsmodels import api as sm

## Functions

In [2]:
def cache_file(url: str, fname: str, dir_: str = './__cache') -> str:
    """Cache the file at given url in the given dir_ with the given
    fname and return the local path.

    Preconditions:
    - dir_ exists
    """
    local_path = f'{dir_}/{fname}'
    if fname not in os.listdir(dir_):
        r = requests.get(url, allow_redirects=True)
        open(local_path, 'wb').write(r.content)
    return local_path

In [3]:
def get_odds_ratio_arr(table: sm.stats.Table2x2, alpha: float = 0.05) -> list:
    """Return the point and (100-alpha)% lower and upper confidence
    boundaries.

    Preconditions:
    - 0 < alpha < 1
    """
    return [table.oddsratio,
            table.oddsratio_confint(alpha=alpha)[0],
            table.oddsratio_confint(alpha=alpha)[1]]

## Main

### Initialise the labels

In [4]:
exposures = ['over 100mg', 'under 100mg']
casecons = ['fatality', 'no fatality']

### Cache the data

In [5]:
local_path = cache_file(
        url=('https://raw.githubusercontent.com/ljk233/laughingrook-datasets'
             + '/main/m249/medical/drinkdriving.csv'),
        fname='drinkdriving.csv'
)

### Load the data

Use the cached file to initialise a `DataFrame`.

In [6]:
data = pd.read_csv(local_path)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   n         8 non-null      int64 
 1   level     8 non-null      object
 2   exposure  8 non-null      object
 3   casecon   8 non-null      object
dtypes: int64(1), object(3)
memory usage: 384.0+ bytes


Output a view of *data*.

In [7]:
data

Unnamed: 0,n,level,exposure,casecon
0,4,married,over 100mg,fatality
1,5,married,over 100mg,no fatality
2,5,married,under 100mg,fatality
3,103,married,under 100mg,no fatality
4,10,not married,over 100mg,fatality
5,3,not married,over 100mg,no fatality
6,5,not married,under 100mg,fatality
7,43,not married,under 100mg,no fatality


### Prepare the data

Initialise a new `DataFrame` using *data*, with the columns *level* as a `Categorical` variable, and *exposure*, *casecon* as ordered `Categorical` variables.

In [8]:
cat_data = pd.DataFrame().assign(
    n=data['n'].to_numpy(),
    level=pd.Categorical(data['level']),
    exposure=pd.Categorical(data['exposure'], exposures, ordered=True),
    casecon=pd.Categorical(data['casecon'], casecons, ordered=True)
).sort_values(
    by=['exposure',
        'casecon',
        'level']
)
cat_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   n         8 non-null      int64   
 1   level     8 non-null      category
 2   exposure  8 non-null      category
 3   casecon   8 non-null      category
dtypes: category(3), int64(1)
memory usage: 524.0 bytes


Get *level* as a NumPy `NDArray`.

In [9]:
levels = np.unique(cat_data['level'])
levels

array(['married', 'not married'], dtype=object)

Output pivot tables with marginal totals.

In [10]:
for level in levels:
    print(
        cat_data.query(
            'level == @level'
        ).pivot_table(
            values='n',
            index=['level', 'exposure'],
            columns='casecon',
            aggfunc='sum',
            margins=True,
            margins_name='total'
        ).query(
            "level in [@level, 'total']"
        ),
        '\n'
    )

casecon              fatality  no fatality  total
level   exposure                                 
married over 100mg          4            5    9.0
        under 100mg         5          103  108.0
total                       9          108  117.0 

casecon                  fatality  no fatality  total
level       exposure                                 
not married over 100mg         10            3   13.0
            under 100mg         5           43   48.0
total                          15           46   61.0 



### Initialise the contingency tables

Initialise contingency tables for different views of the data.

In [11]:
_make_table2x2 = lambda df, level: (
    sm.stats.Table2x2(
        df.query('level == @level')['n']
          .to_numpy()
          .reshape(2, 2)
    )
)
level_tables = [_make_table2x2(cat_data, level) for level in levels]
for table, level in zip(level_tables, levels):
    print(f'level={level}\n{table}\n')

level=married
A 2x2 contingency table with counts:
[[  4.   5.]
 [  5. 103.]]

level=not married
A 2x2 contingency table with counts:
[[10.  3.]
 [ 5. 43.]]



In [12]:
aggregated_table = sm.stats.Table2x2(
    cat_data.groupby(['exposure', 'casecon'])
            .sum()
            .to_numpy()
            .reshape((2, 2))
)
print(aggregated_table)

A 2x2 contingency table with counts:
[[ 14.   8.]
 [ 10. 146.]]


In [13]:
strat_table = (
    sm.stats.StratifiedTable(
        data['n'].to_numpy().reshape((2, 2, 2))
    )
)
print(strat_table.table)

[[[  4.   5.]
  [  5. 103.]]

 [[ 10.   3.]
  [  5.  43.]]]


### Odds ratios

Return point and interval estimates of the level-specific, unadjusted, and adjusted odds ratios.

Return the level-specific odds ratios.

In [14]:
pd.DataFrame(
    data=[get_odds_ratio_arr(table) for table in level_tables],
    columns=['point', 'lcb', 'ucb'],
    index=pd.Index(levels, name='level')
).round(
    5
)

Unnamed: 0_level_0,point,lcb,ucb
level,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
married,16.48,3.35421,80.96998
not married,28.66667,5.85662,140.31607


Return the unadjusted odds ratio.

In [15]:
pd.Series(
    data=get_odds_ratio_arr(aggregated_table),
    index=['point', 'lcb', 'ucb'],
    name='unadjusted odds ratio'
).round(
    5
)

point    25.55000
lcb       8.68217
ucb      75.18883
Name: unadjusted odds ratio, dtype: float64

Return the adjusted odds ratio.

In [16]:
pd.Series(
    data=[strat_table.oddsratio_pooled,
           strat_table.oddsratio_pooled_confint()[0],
           strat_table.oddsratio_pooled_confint()[1]],
    index=['point', 'lcb', 'ucb'],
    name='adjusted odds ratio'
).round(
    3
)

point    0.545
lcb      0.182
ucb      1.634
Name: adjusted odds ratio, dtype: float64

### Hypothesis testing

Return the result of a test of no association.

In [17]:
_r = strat_table.test_null_odds(correction=True)
pd.Series(
    data=[_r.statistic, _r.pvalue],
    index=['statistic', 'pvalue'],
    name='test of no association'
).round(
    3
)

statistic    0.612
pvalue       0.434
Name: test of no association, dtype: float64

Return the result of a test of homogeneity.

In [18]:
_r = strat_table.test_equal_odds(adjust=True)
pd.Series(
    data=[_r.statistic, _r.pvalue],
    index=['statistic', 'pvalue'],
    name='test of homogeneity'
).round(
    3
)

statistic    0.233
pvalue       0.629
Name: test of homogeneity, dtype: float64

## References

McCarroll, J.R. and Haddon Jr, W., 1962. A controlled study of fatal automobile accidents in New York City. Journal of chronic diseases, 15(8), pp.811-826.

In [19]:
%load_ext watermark
%watermark --iv

statsmodels: 0.13.2
sys        : 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
requests   : 2.28.1
numpy      : 1.23.2
pandas     : 1.4.3



[^2]: See [statsmodels.stats.contingency_tables.Table2x2](https://www.statsmodels.org/v0.13.0/generated/statsmodels.stats.contingency_tables.Table2x2.html)
[^3]: See [statsmodels.stats.contingency_tables.StratifiedTable](https://www.statsmodels.org/v0.13.0/generated/statsmodels.stats.contingency_tables.StratifiedTable.html)
[^4]: This is the crude odds ratio
[^5]: This is the Mantel-Haenszel odds ratio
[^7]: See [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html)
[^8]: Aggregated by *exposure*, *casecon*