### Example: Cholera Pandemic in London
---

#### Source

https://vincentarelbundock.github.io/Rdatasets/doc/HistData/Cholera.html

#### Background

1846-1860 Cholera Pandemic (Wikipedia https://en.wikipedia.org/wiki/1846%E2%80%931860_cholera_pandemic)

By examining the cases of cholera in London, John Snow concluded in 1855 that cholera was mainly transmitted through a contaminated water supply. Bingham et al. (2004) re-examined Snow's study using a more advanced statistical method, logistic regression.

#### List of variables

+ `district` - name of the district in London, a character vector
+ `cholera_drate` - deaths from cholera in 1849 per 10,000 inhabitants, a numeric vector
+ `cholera_deaths` - number of deaths registered from cholera in 1849, a numeric vector
+ `popn` - population, in the middle of 1849, a numeric vector
+ `elevation` - elevation, in feet above the high water mark, a numeric vector
+ `region` - a grouping of the London districts, a factor with levels West North Central South Kent
+ `water` - water supply region, a factor with levels Battersea New River Kew; see Details
+ `annual_deaths` - annual deaths from all causes, 1838-1844, a numeric vector
+ `pop_dens` - population density (persons per acre), a numeric vector
+ `persons_house` - persons per inhabited house, a numeric vector
+ `house_valpp` - average annual value of house, per person (pounds), a numeric vector
+ `poor_rate` - poor rate precept per pound of house value, a numeric vector
+ `area` - district area, a numeric vector
+ `houses` - number of houses, a numeric vector
+ `house_val` - total house values, a numeric vector

*Details:*

The supply of water was classified as “Thames, between Battersea and Waterloo Bridges” (central London), “New River, Rivers Lea and Ravensbourne”, and “Thames, at Kew and Hammersmith” (western London). The factor levels use abbreviations for these. The data frame is sorted by increasing elevation above the high water mark.


#### Reference

Bingham, P., Verlander, N. Q., and Cheal, M. J. (2004). "John Snow, William Farr and the 1849 outbreak of cholera that affected London: a reworking of the data highlights the importance of the water supply." *Public Health*, 118(6), 387-394, https://doi.org/10.1016/j.puhe.2004.05.007

The notebook loads the cholera data from the CSV file `Cholera.csv`. Download this file from the keio.jp website and place it in the same folder as this Jupyter notebook.

In [1]:
import numpy as np
import scipy.stats as st
import scipy.optimize as opt
import pandas as pd
from IPython.display import display
from bokeh.io import show, output_notebook
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, HoverTool, Slider, Span
from bokeh.plotting import figure
output_notebook()

In [2]:
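def beta_hpdi(ci0, alpha, beta, prob):
    """Solve for the HPDI of Beta(alpha, beta), starting from the initial guess ci0."""
    def hpdi_conditions(v, a, b, p):
        # The interval (v[0], v[1]) must contain posterior probability p ...
        eq1 = st.beta.cdf(v[1], a, b) - st.beta.cdf(v[0], a, b) - p
        # ... and the density must be equal at both endpoints.
        eq2 = st.beta.pdf(v[1], a, b) - st.beta.pdf(v[0], a, b)
        return np.hstack((eq1, eq2))
    return opt.root(hpdi_conditions, ci0, args=(alpha, beta, prob)).x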
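
`beta_hpdi()` looks for the interval that contains probability `prob` and has equal density at both endpoints, which for a unimodal density is the shortest such interval. As an independent cross-check, the following sketch approximates the same kind of interval on a grid by keeping the highest-density cells until their mass reaches `prob`. This is a swapped-in grid method, not the root-finding approach of the cell above; the names `beta_pdf` and `grid_hpdi` and the example Beta(3, 9) are assumptions made for illustration.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at 0 < x < 1, computed via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

def grid_hpdi(a, b, prob, num=20001):
    """Approximate the HPDI of a unimodal Beta(a, b) (a, b > 1) on an even grid."""
    step = 1.0 / num
    xs = [(i + 0.5) * step for i in range(num)]   # midpoints of the grid cells
    dens = [beta_pdf(x, a, b) for x in xs]
    # Visit cells from highest to lowest density and accumulate their mass.
    order = sorted(range(num), key=lambda i: dens[i], reverse=True)
    mass = 0.0
    kept = []
    for i in order:
        kept.append(i)
        mass += dens[i] * step
        if mass >= prob:
            break
    # For a unimodal density the kept cells form one contiguous interval.
    return xs[min(kept)], xs[max(kept)], mass

# Hypothetical example: Beta(3, 9), whose mode is (3 - 1) / (3 + 9 - 2) = 0.2.
lower, upper, mass = grid_hpdi(3.0, 9.0, 0.95)
print(f"HPDI ~ ({lower:.4f}, {upper:.4f}), mass = {mass:.4f}")
```

The grid version is slower and only as accurate as the cell width, but it does not need a starting value, so it can serve as a check when the root finder is started far from the solution.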

In [3]:
def bernoulli_stats(y, n, a0, b0, prob):
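    """
        y: the sum of all observations (number of successes)
        n: the number of observations
        a0, b0: the hyperparameters in the beta prior, Beta(a0, b0)
        prob: posterior probability for CI and HPDI
        returns: a one-row DataFrame of posterior statistics, plus the posterior parameters a and b
    """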
    a = y + a0
    b = n - y + b0
    mean_pi = st.beta.mean(a, b)
    median_pi = st.beta.median(a, b)
    mode_pi = (a - 1.0) / (a + b - 2.0)
    sd_pi = st.beta.std(a, b)
    ci_pi = st.beta.interval(prob, a, b)
    hpdi_pi = beta_hpdi(ci_pi, a, b, prob)
    stats = np.hstack((mean_pi, median_pi, mode_pi, sd_pi, ci_pi, hpdi_pi))
    stats = stats.reshape((1, 8))
    stats_string = ['mean', 'median', 'mode', 'sd', 'ci (lower)', 'ci (upper)', 'hpdi (lower)', 'hpdi (upper)']
    param_string = ['$\\theta$']
    results = pd.DataFrame(stats, index=param_string, columns=stats_string)
    return results, a, b

In [4]:
cholera = pd.read_csv('Cholera.csv', index_col=0)
display(cholera)

| | district | cholera_drate | cholera_deaths | popn | elevation | region | water | annual_deaths | pop_dens | persons_house | house_valpp | poor_rate | area | houses | house_val |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Newington | 144 | 907 | 63074 | -2 | Kent | Battersea | 232 | 101 | 5.8 | 3.788 | 0.075 | 624 | 9370 | 207460 |
| 2 | Rotherhithe | 205 | 352 | 17208 | 0 | Kent | Battersea | 277 | 19 | 5.8 | 4.238 | 0.143 | 886 | 2420 | 59072 |
| 3 | Bermondsey | 164 | 836 | 50900 | 0 | Kent | Battersea | 267 | 180 | 7.0 | 3.318 | 0.089 | 282 | 6663 | 155175 |
| 4 | St George Southwark | 161 | 734 | 45500 | 0 | Kent | Battersea | 264 | 66 | 6.2 | 3.077 | 0.134 | 688 | 5674 | 107821 |
| 5 | St Olave | 181 | 349 | 19278 | 2 | Kent | Battersea | 281 | 114 | 7.9 | 4.559 | 0.079 | 169 | 2523 | 90583 |
| 6 | St Saviour | 153 | 539 | 35227 | 2 | Kent | Battersea | 292 | 141 | 7.1 | 5.291 | 0.076 | 250 | 4659 | 174732 |
| 7 | Westminster | 68 | 437 | 64109 | 2 | West | Battersea | 260 | 70 | 8.8 | 4.189 | 0.039 | 917 | 6439 | 238164 |
| 8 | Lambeth | 120 | 1618 | 134768 | 3 | Kent | Battersea | 233 | 34 | 6.5 | 4.389 | 0.072 | 4015 | 17791 | 510341 |
| 9 | Camberwell | 97 | 504 | 51704 | 4 | Kent | Battersea | 197 | 12 | 5.8 | 4.508 | 0.038 | 4342 | 6843 | 180418 |
| 10 | Greenwich | 75 | 718 | 95954 | 8 | Kent | New River | 238 | 18 | 6.8 | 3.379 | 0.081 | 5367 | 11995 | 274478 |


Suppose each inhabitant independently dies of cholera with probability $\theta$. In this context, $\theta$ is interpreted as the true death rate from cholera. We use the uniform prior, $\mathrm{Beta}(1, 1)$, for $\theta$.

In [5]:
a0 = 1.0
b0 = 1.0

The posterior distribution of $\theta$ is $\mathrm{Beta}(\alpha_\star, \beta_\star)$, and its parameters $\alpha_\star$ and $\beta_\star$ are computed as follows.

In [6]:
y = cholera['cholera_deaths'].sum()
n = cholera['popn'].sum()
a_star = a0 + y
b_star = n - y + b0
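
The update in the cell above follows from the conjugacy of the beta prior with the Bernoulli likelihood. As a brief sketch, write $\alpha_0$ and $\beta_0$ for the prior hyperparameters `a0` and `b0`, $y$ for the total number of cholera deaths, and $n$ for the total population. Multiplying the likelihood by the prior density gives

$$
\begin{aligned}
p(\theta \mid y) &\propto \theta^{y}(1-\theta)^{n-y} \times \theta^{\alpha_0-1}(1-\theta)^{\beta_0-1} \\
&= \theta^{y+\alpha_0-1}(1-\theta)^{n-y+\beta_0-1},
\end{aligned}
$$

which is the kernel of $\mathrm{Beta}(\alpha_\star, \beta_\star)$ with $\alpha_\star = y + \alpha_0$ and $\beta_\star = n - y + \beta_0$. With the uniform prior, $\alpha_0 = \beta_0 = 1$, this reduces to $\alpha_\star = y + 1$ and $\beta_\star = n - y + 1$.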

Then we plot the posterior density of the death rate together with the prior density.

In [7]:
q = np.linspace(0.004, 0.008, 1001)
source = ColumnDataSource(
    data=dict(
        q = q,
        prior_pdf = st.beta.pdf(q, a0, b0),
        posterior_pdf = st.beta.pdf(q, a_star, b_star)
    )
)
hover = HoverTool(
    tooltips=[
        ('\u03B8', '@q{0.0000}'), 
        ('prior', '@prior_pdf{0.0000}'),
        ('posterior', '@posterior_pdf{0.0000}')
    ]
)
p = figure(width=400, height=300,
           tools=[hover], toolbar_location=None, title='Posterior Distribution')
p.line('q', 'posterior_pdf', source=source, line_color='navy', line_width=2,
       legend_label='Posterior distribution')
p.line('q', 'prior_pdf', source=source, line_color='firebrick', line_width=2, line_dash='dashed',
       legend_label='Prior distribution')
p.xaxis.axis_label = '\u03B8'
p.yaxis.axis_label = 'Probability density'
p.legend.location = 'top_left'
p.legend.click_policy = 'hide'
p.legend.border_line_color = p.xgrid.grid_line_color = p.ygrid.grid_line_color = p.outline_line_color = None
show(p)

Using the function `bernoulli_stats()` we already defined, we compute the posterior statistics of $\theta$.

In [8]:
prob = 0.95
results = bernoulli_stats(y, n, a0, b0, prob)[0]
display(results)

| | mean | median | mode | sd | ci (lower) | ci (upper) | hpdi (lower) | hpdi (upper) |
|---|---|---|---|---|---|---|---|---|
| $\theta$ | 0.006147 | 0.006147 | 0.006146 | 5.2e-05 | 0.006046 | 0.006249 | 0.006045 | 0.006248 |
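
The equal-tailed interval and the HPDI in this table nearly coincide, which suggests that the posterior is close to symmetric. As a rough check, and not part of the notebook's own method, the following sketch compares the reported interval with a normal approximation, mean ± z × sd, using the rounded mean and sd from the table. The variable names are illustrative.

```python
from statistics import NormalDist

# Rounded posterior summaries copied from the table of pooled results.
post_mean = 0.006147
post_sd = 5.2e-05
prob = 0.95

# Two-sided standard normal quantile for the given probability (about 1.96).
z = NormalDist().inv_cdf(0.5 + prob / 2.0)

approx_lower = post_mean - z * post_sd
approx_upper = post_mean + z * post_sd
print(f"normal approximation: ({approx_lower:.6f}, {approx_upper:.6f})")
```

Both endpoints land within about one unit in the sixth decimal place of the exact interval (0.006046, 0.006249). The remaining gap reflects the rounded inputs and the approximation itself, so this is only a plausibility check.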


Next, we compute the posterior statistics of $\theta$ in each district.

In [9]:
cholera_results = pd.DataFrame(
    np.vstack([bernoulli_stats(x.cholera_deaths, x.popn, a0, b0, prob)[0].values for x in cholera.itertuples()]),
    columns = ['mean', 'median', 'mode', 'sd', 'ci (lower)', 'ci (upper)', 'hpdi (lower)', 'hpdi (upper)'],
    index = cholera['district'].values)
display(cholera_results)

| | mean | median | mode | sd | ci (lower) | ci (upper) | hpdi (lower) | hpdi (upper) |
|---|---|---|---|---|---|---|---|---|
| Newington | 0.014395 | 0.01439 | 0.01438 | 0.000474 | 0.01348 | 0.015339 | 0.01347 | 0.015329 |
| Rotherhithe | 0.020511 | 0.020493 | 0.020456 | 0.00108 | 0.018447 | 0.022681 | 0.018411 | 0.022643 |
| Bermondsey | 0.016443 | 0.016437 | 0.016424 | 0.000564 | 0.015357 | 0.017566 | 0.015344 | 0.017553 |
| St George Southwark | 0.016153 | 0.016146 | 0.016132 | 0.000591 | 0.015015 | 0.017331 | 0.015001 | 0.017317 |
| St Olave | 0.018154 | 0.018137 | 0.018104 | 0.000961 | 0.016317 | 0.020085 | 0.016285 | 0.02005 |
| St Saviour | 0.015328 | 0.015319 | 0.015301 | 0.000655 | 0.014072 | 0.016637 | 0.014054 | 0.016618 |
| Westminster | 0.006832 | 0.006827 | 0.006817 | 0.000325 | 0.006209 | 0.007484 | 0.006199 | 0.007473 |
| Lambeth | 0.012013 | 0.012011 | 0.012006 | 0.000297 | 0.011438 | 0.012602 | 0.011434 | 0.012597 |
| Camberwell | 0.009767 | 0.00976 | 0.009748 | 0.000432 | 0.008937 | 0.010632 | 0.008925 | 0.010619 |
| Greenwich | 0.007493 | 0.00749 | 0.007483 | 0.000278 | 0.006957 | 0.008048 | 0.00695 | 0.008041 |
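
Because each district's posterior is a beta distribution, the mean, mode, and standard deviation in every row have closed forms. As a sanity check, the sketch below recomputes them for Newington from the counts in the data table (907 deaths among 63,074 inhabitants) using only the standard library. The helper name `beta_summary` is hypothetical and not part of the notebook.

```python
import math

def beta_summary(y, n, a0=1.0, b0=1.0):
    """Closed-form mean, mode and sd of the Beta(y + a0, n - y + b0) posterior."""
    a = y + a0
    b = n - y + b0
    mean = a / (a + b)
    mode = (a - 1.0) / (a + b - 2.0)   # valid when a > 1 and b > 1
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return mean, mode, sd

# Newington: 907 cholera deaths, population 63,074 (from the data table).
mean, mode, sd = beta_summary(907, 63074)
print(f"mean={mean:.6f}, mode={mode:.6f}, sd={sd:.6f}")
print(f"posterior mean per 10,000 inhabitants: {10000 * mean:.1f}")
```

The values agree with the Newington row, and the posterior mean expressed per 10,000 inhabitants is about 144, in line with the `cholera_drate` recorded for that district.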
