## Summary notes

Perform an epidemiological analysis of the results of a case-control study.

This topic is covered by M249 Book 1, Part 1.

### About the study

Data from a case-control study analysing the possible association between political activity and death by homicide were sourced from Mian, A., Mahmood, S.F., *et al.* (2002).
In total, 35 victims of homicide were included in the study, along with 85 controls with age and sex distributions similar to those of the victims.
Household members were questioned on the political activities of the study subjects.
Of the 35 victims, 11 had attended political meetings, compared with two of the 85 controls.

The study results were as follows.

|                        | homicide (+) | not homicide (-) |
| ---------------------- | -----------: | ---------------: |
| **attended (+)**       |           11 |                2 |
| **did not attend (-)** |           24 |               83 |

### About the data

The data were stored remotely as a CSV file in a tidy manner with the schema:

| column     | dtype | description                                               |
| ---------- | ----- | --------------------------------------------------------- |
| *n*        | `int` | Number of observations                                    |
| *exposure* | `str` | Descriptive label indicating category of exposure         |
| *casecon*  | `str` | Descriptive label indicating category of case or control  |

### Method

The analysis was undertaken using StatsModels and SciPy.

We defined a function `cache_file` to handle the retrieval of the data.

The exposure and case-control labels were stored as two NumPy arrays named *exposures* and *casecons*.

:::{.callout-note}
Note that the order of the labels in each sequence is important, and should correspond with:

- *exposures* = (*exposed*, *not exposed*)
- *casecons* = (*case*, *control*)
:::

The data were cached and used to initialise *data*, a Pandas `DataFrame`.

A new `DataFrame`, *cat_data*, was initialised from *data*, with the *exposure* and *casecon* columns as ordered `Categorical`[^6] data types, and was sorted by (*exposure*, *casecon*).

We took column *cat_data*[*n*] as *data_arr*, a NumPy `NDArray` reshaped to `(2, 2)`, and used it to initialise *ctable*, an instance of `Table2x2`.[^1]
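To illustrate why the label order matters, the following sketch repeats the sort-and-reshape step on a deliberately shuffled copy of the four counts. The shuffled `rows` frame and the helper `to_table` are assumptions made for this illustration; only the labels and counts come from the study.

```python
import pandas as pd

# Study counts, deliberately shuffled (the row order here is hypothetical).
rows = pd.DataFrame({
    'n': [83, 2, 24, 11],
    'exposure': ['did not attend', 'attended', 'did not attend', 'attended'],
    'casecon': ['not homicide', 'not homicide', 'homicide', 'homicide'],
})
exposures = ['attended', 'did not attend']   # (exposed, not exposed)
casecons = ['homicide', 'not homicide']      # (case, control)


def to_table(frame, exposure_order, casecon_order):
    """Hypothetical helper: sort by the ordered categories, then reshape."""
    ordered = frame.assign(
        exposure=pd.Categorical(
            frame['exposure'], categories=exposure_order, ordered=True),
        casecon=pd.Categorical(
            frame['casecon'], categories=casecon_order, ordered=True),
    ).sort_values(by=['exposure', 'casecon'])
    # Row-major reshape: each row of the result is one exposure category.
    return ordered['n'].to_numpy().reshape(2, 2)


counts = to_table(rows, exposures, casecons)
# Reversing the exposure labels swaps the rows of the table.
flipped = to_table(rows, exposures[::-1], casecons)
print(counts)
print(flipped)
```

With the rows swapped, the exposed group would no longer sit in the first row, so any measure computed from the table would describe the opposite comparison.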

The odds ratio was calculated, including confidence interval estimates.
A chi-squared test of no association was used to test the strength of evidence of an association.
We rounded off the analysis by performing Fisher's exact test.[^3] [^4]

## Dependencies

In [1]:
import os
import requests
import numpy as np
import pandas as pd
from scipy import stats as st
from statsmodels import api as sm

## Functions

In [2]:
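def cache_file(url: str, fname: str, dir_: str = './__cache') -> str:
    """Cache the file at given url in the given dir_ with the given
    fname, and return the local path.

    Preconditions:
    - dir_ exists
    """
    local_path = f'{dir_}/{fname}'
    # Only download the file if it is not already cached
    if fname not in os.listdir(dir_):
        r = requests.get(url, allow_redirects=True)
        # Fail loudly rather than caching an error page
        r.raise_for_status()
        # The context manager ensures the file handle is closed
        with open(local_path, 'wb') as f:
            f.write(r.content)
    return local_path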
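The precondition that `dir_` exists could be relaxed by creating the directory on demand. The following is a minimal sketch of that idea, not the function used in this notebook: `cache_file_sketch` is a hypothetical name, it swaps `requests` for the standard library's `urllib.request`, and the URL in the usage example is a placeholder that is never fetched because the file is already cached.

```python
import os
import tempfile
import urllib.request


def cache_file_sketch(url: str, fname: str, dir_: str = './__cache') -> str:
    """Hypothetical variant of cache_file that creates dir_ if missing."""
    os.makedirs(dir_, exist_ok=True)
    local_path = os.path.join(dir_, fname)
    if not os.path.exists(local_path):
        # Cache miss: download the file and write it to disk.
        with urllib.request.urlopen(url) as response, \
                open(local_path, 'wb') as f:
            f.write(response.read())
    return local_path


# Usage on a cache hit: the file already exists, so no request is made.
with tempfile.TemporaryDirectory() as tmp:
    seeded = os.path.join(tmp, 'example.csv')
    with open(seeded, 'w') as f:
        f.write('n,exposure,casecon\n')
    cached_path = cache_file_sketch(
        'https://example.invalid/example.csv', 'example.csv', tmp)
    cache_hit = cached_path == seeded and os.path.exists(cached_path)
print(cache_hit)
```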

## Main

### Initialise the labels

In [3]:
exposures = np.array(['attended', 'did not attend'])
casecons = np.array(['homicide', 'not homicide'])

### Cache the data

In [4]:
local_path = cache_file(
    url=('https://raw.githubusercontent.com/ljk233/laughingrook-datasets'
         + '/main/m249/medical/karachi.csv'),
    fname='karachi.csv'
)

### Load the data

Use the cached file to initialise a `DataFrame`.

In [5]:
data = pd.read_csv(local_path)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   n         4 non-null      int64 
 1   exposure  4 non-null      object
 2   casecon   4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes


Output a view of *data*.

In [6]:
data

|     |   n | exposure       | casecon      |
| --- | --: | -------------- | ------------ |
| 0   |  11 | attended       | homicide     |
| 1   |   2 | attended       | not homicide |
| 2   |  24 | did not attend | homicide     |
| 3   |  83 | did not attend | not homicide |


### Prepare the data

Initialise a new `DataFrame` using *data*, with the columns *exposure* and *casecon* as ordered `Categorical` variables.

In [7]:
cat_data = pd.DataFrame().assign(
    n=data['n'].to_numpy(),
    exposure=pd.Categorical(data['exposure'], exposures, ordered=True),
    casecon=pd.Categorical(data['casecon'], casecons, ordered=True)
).sort_values(
    by=['exposure', 'casecon']
)
cat_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   n         4 non-null      int64   
 1   exposure  4 non-null      category
 2   casecon   4 non-null      category
dtypes: category(2), int64(1)
memory usage: 320.0 bytes


Output a pivot table with the case and control totals.

In [8]:
cat_data.pivot_table(
    values='n',
    index='exposure',
    columns='casecon',
    aggfunc='sum',
    margins=True,
    margins_name='total'
).drop(
    columns='total'
)

| exposure       | homicide | not homicide |
| -------------- | -------: | -----------: |
| attended       |       11 |            2 |
| did not attend |       24 |           83 |
| total          |       35 |           85 |


### Initialise the contingency table

Get *cat_data*[*n*] as a NumPy `NDArray` with shape `(2, 2)`.

In [9]:
data_arr = cat_data['n'].to_numpy().reshape(2, 2)

Initialise an instance of `Table2x2` using *data_arr*.

In [10]:
ctable = sm.stats.Table2x2(data_arr)
print(ctable)

A 2x2 contingency table with counts:
[[11.  2.]
 [24. 83.]]


### Measures of association

Return point and interval estimates of the odds ratio.

In [11]:
pd.Series(
    data=[ctable.oddsratio,
          ctable.oddsratio_confint()[0],
          ctable.oddsratio_confint()[1]],
    index=['point', 'lcb', 'ucb'],
    name='odds ratio'
)

point    19.020833
lcb       3.942873
ucb      91.758497
Name: odds ratio, dtype: float64
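As a cross-check, the odds ratio can be recomputed by hand from the four cell counts. The sketch below assumes the interval is the usual large-sample interval built on the log odds ratio; that this is what `Table2x2` computes internally is an assumption, although the figures should agree with the output above to rounding.

```python
import math
from statistics import NormalDist

# Cell counts from the study table: rows are exposure, columns case/control.
a, b = 11, 2    # attended: homicide, not homicide
c, d = 24, 83   # did not attend: homicide, not homicide

# Point estimate: cross-product ratio.
odds_ratio = (a * d) / (b * c)

# Standard error of the log odds ratio (large-sample approximation).
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Two-sided 95% interval, built on the log scale and transformed back.
z = NormalDist().inv_cdf(0.975)
lcb = math.exp(math.log(odds_ratio) - z * se_log_or)
ucb = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f'{odds_ratio:.4f} ({lcb:.4f}, {ucb:.4f})')
```

The interval is wide because the count of two exposed controls dominates the standard error.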

### Chi-squared test for no association

Return the expected frequencies under the null hypothesis of no association.

In [12]:
ctable.fittedvalues

array([[ 3.79166667,  9.20833333],
       [31.20833333, 75.79166667]])
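Under the null hypothesis of no association, each expected frequency is the product of its row and column totals divided by the grand total. A minimal sketch of that arithmetic in plain Python, with variable names chosen for this illustration:

```python
# Observed counts: rows are exposure, columns are case/control.
observed = [[11, 2], [24, 83]]

row_totals = [sum(row) for row in observed]          # [13, 107]
col_totals = [sum(col) for col in zip(*observed)]    # [35, 85]
grand_total = sum(row_totals)                        # 120

# Expected count = row total * column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]
print(expected)
```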

Return the differences between the observed and expected frequencies.

In [13]:
data_arr - ctable.fittedvalues

array([[ 7.20833333, -7.20833333],
       [-7.20833333,  7.20833333]])

Return the contributions to the chi-squared test statistic.

In [14]:
ctable.chi2_contribs

array([[13.70375458,  5.64272247],
       [ 1.66494215,  0.68556441]])

Return the result of the chi-squared test.[^5]

In [15]:
_res = ctable.test_nominal_association()
pd.Series(
    data=[_res.statistic, _res.pvalue, int(_res.df)],
    index=['statistic', 'pvalue', 'df'],
    name='chi-squared test',
    dtype=object
)

statistic    21.696984
pvalue        0.000003
df                   1
Name: chi-squared test, dtype: object
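The test statistic is the sum of the four contributions, each being the squared difference between observed and expected counts divided by the expected count. The sketch below recomputes it without a continuity correction, which is an assumption that appears consistent with the contributions shown earlier. For one degree of freedom the p-value can be written with the complementary error function, so no statistics library is needed.

```python
import math

observed = [[11, 2], [24, 83]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Contribution of each cell: (observed - expected)^2 / expected.
contribs = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi2 = sum(sum(row) for row in contribs)

# With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(chi2, p_value)
```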

### Fisher's exact test

Return the results of Fisher's exact test.

In [16]:
_, _pvalue = st.fisher_exact(ctable.table)
pd.Series(
    data=_pvalue,
    index=['pvalue'],
    name="fisher's exact"
)

pvalue    0.000018
Name: fisher's exact, dtype: float64
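Fisher's exact test conditions on the margins of the table, so the count in the top-left cell follows a hypergeometric distribution under the null hypothesis. The sketch below assumes the common two-sided definition, which sums the probabilities of every table that is no more likely than the observed one. It is an illustration of the idea using only the standard library, not a reproduction of SciPy's implementation.

```python
from math import comb

a, b, c, d = 11, 2, 24, 83
row1 = a + b          # total who attended: 13
col1 = a + c          # total homicides: 35
n = a + b + c + d     # grand total: 120


def table_prob(x: int) -> float:
    """Probability that the top-left cell equals x with margins fixed."""
    return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)


# Range of top-left counts that are possible given the margins.
x_min = max(0, row1 + col1 - n)
x_max = min(row1, col1)

p_observed = table_prob(a)
# Small relative tolerance guards against floating-point ties.
p_value = sum(
    p for p in (table_prob(x) for x in range(x_min, x_max + 1))
    if p <= p_observed * (1 + 1e-7)
)
print(p_value)
```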

## References

Mian, A., Mahmood, S.F., Chotani, H. and Luby, S., 2002. Vulnerability to homicide in Karachi: political activity as a risk factor. *International Journal of Epidemiology*, **31**(3), 581-585.

In [17]:
%load_ext watermark
%watermark --iv

scipy      : 1.9.0
sys        : 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
statsmodels: 0.13.2
pandas     : 1.4.3
numpy      : 1.23.2
requests   : 2.28.1



[^1]: See [statsmodels.stats.contingency_tables.Table2x2](https://www.statsmodels.org/stable/generated/statsmodels.stats.contingency_tables.Table2x2.html)
[^3]: One expected frequency (approximately 3.79) is less than five, so the chi-squared approximation may be unreliable here and Fisher's exact test is a useful check
[^4]: There is no version of Fisher's exact test in StatsModels, so we use SciPy instead. See [scipy.stats.fisher_exact](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html)
[^5]: We pass the argument `dtype=object`, so the `Series` can handle both `float` and `int` data types
[^6]: See [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html)