## 1. Generating SIR Training Data: Modeling Mean Counts $S(t)$ and $I(t)$
Created: Mar 18, 2024 HBP and Hassan Mahamat Nil, Johannesburg, South Africa<br>

### Introduction

The susceptible (S), infected (I), removed (R) model is a classic model of an epidemic in which each individual is in one of three classes (or compartments), denoted by S, I, or R. Individuals can transition from S to I with a probability proportional to $\beta \, s \, i$, where $s$ and $i$ are the numbers of susceptible and infected persons at time $t$, and from I to R with a probability proportional to $\alpha \, i$. The SIR model assumes that the population is thoroughly mixed; that is, every individual has an equal probability of coming into contact with every other individual. For a highly localized epidemic, such as an outbreak of flu in a school, this may be a reasonable assumption, but it is unlikely to be reasonable for an epidemic spread over a large area. Nevertheless, the SIR model, the prototype of more sophisticated models of epidemics, is still widely studied.

There are two broad approaches to studying models of epidemics:
  * The use of **deterministic models** based on ordinary differential equations.
  * The use of **stochastic models** based on Markov chains.

A deterministic model seeks to approximate the **mean counts** in each compartment as a function of time, $t$. For the SIR model, this would be $S(t) \equiv \langle s \rangle(t)$, $I(t) \equiv \langle i \rangle(t)$, and $R(t) \equiv \langle r \rangle(t)$. 

### The SIR Deterministic Model

As noted above, this is the prototypical model of an epidemic with 3 compartments: susceptible (S), infected (I), and removed (R). The deterministic model is described using the ordinary differential equations,
\begin{align}
    \frac{dS}{dt} & = - \beta \, \langle s \, i \rangle,\\
    \frac{dI}{dt} & = - \alpha \, I + \beta \langle s \, i \rangle ,\\
    \frac{dR}{dt} & = \alpha \, I.
\end{align}
If the correlation between the susceptible and infected counts is neglected, then we can approximate
$\langle s \, i \rangle \approx S I$, in which case we arrive at the standard form of the SIR deterministic model,
\begin{align}
    \frac{dS}{dt} & = - \beta \, S I,\\
    \frac{dI}{dt} & = - \alpha I + \beta \, S I ,\\
    \frac{dR}{dt} & = \alpha \, I.
\end{align}
The parameters of the model are:
\begin{align*}
    \alpha &= \mbox{removal rate (due to recovery or mortality); so $1/\alpha$ is the mean infectious period, and}\\
    \beta &= \mbox{transmission rate per infected person.}
\end{align*}
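As an illustration of the standard form above, the following minimal sketch integrates the three equations with a fixed-step fourth-order Runge-Kutta scheme. The function names `sir_rhs` and `integrate_sir` are hypothetical (they are not part of `SIR_genutil`), the step count is an arbitrary choice, and the parameter and initial values are assumed to be the `alpha0`, `beta0`, `S0`, `I0`, and `R0` entries printed from `SIRdata` further down in this notebook.

```python
import numpy as np

def sir_rhs(y, alpha, beta):
    """Right-hand side of the standard SIR equations for y = (S, I, R)."""
    S, I, R = y
    new_infections = beta * S * I   # flow from S to I
    removals = alpha * I            # flow from I to R
    return np.array([-new_infections, new_infections - removals, removals])

def integrate_sir(y0, alpha, beta, tmax, nsteps):
    """Fixed-step fourth-order Runge-Kutta integration from t = 0 to tmax."""
    h = tmax / nsteps
    t = np.linspace(0.0, tmax, nsteps + 1)
    y = np.empty((nsteps + 1, 3))
    y[0] = y0
    for k in range(nsteps):
        k1 = sir_rhs(y[k], alpha, beta)
        k2 = sir_rhs(y[k] + 0.5 * h * k1, alpha, beta)
        k3 = sir_rhs(y[k] + 0.5 * h * k2, alpha, beta)
        k4 = sir_rhs(y[k] + h * k3, alpha, beta)
        y[k + 1] = y[k] + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return t, y

# assumed values: alpha0, beta0, S0, I0, R0 as printed from SIRdata
alpha, beta = 0.465, 0.00237
t, y = integrate_sir([763.0, 3.0, 0.0], alpha, beta, tmax=14.0, nsteps=1400)
S, I, R = y.T
print(f"peak I(t) = {I.max():.1f} at t = {t[I.argmax()]:.2f}")
```

Because the three right-hand sides sum to zero, $S + I + R$ stays constant along the solution, which is a convenient sanity check on any integrator used here.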

### The SIR Stochastic Model
For each SIR parameter point $\theta = (\alpha, \beta)$, this notebook simulates a single SIR epidemic, represented as a sequence of quadruplets, $(t, s, i, r)$, where $s$, $i$, and $r$ are the counts of individuals in the compartments S, I, and R at time $t$, respectively. The quadruplets are randomly shuffled and constitute the training data to be used later in an attempt to model $S(t)$ and $I(t)$ from the simulations.
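One common way to simulate such a Markov chain is the Gillespie algorithm, sketched below. This is an assumption for illustration only: the notebook's own simulator is the `generate` function in `SIR_genutil`, whose implementation is not shown here and may differ. The name `gillespie_sir` is hypothetical, and the parameter values are assumed to be the `alpha0`, `beta0`, `S0`, `I0`, `R0`, and `tmax` entries printed from `SIRdata` further down.

```python
import numpy as np

def gillespie_sir(alpha, beta, s0, i0, r0, tmax, rng):
    """Simulate one SIR epidemic; return a list of (t, s, i, r) quadruplets."""
    t, s, i, r = 0.0, s0, i0, r0
    states = [(t, s, i, r)]
    while i > 0:
        rate_infect = beta * s * i      # S -> I transition rate
        rate_remove = alpha * i         # I -> R transition rate
        total = rate_infect + rate_remove
        # waiting time to the next event is exponentially distributed
        t += rng.exponential(1.0 / total)
        if t > tmax:
            break
        # choose which event occurs, in proportion to its rate
        if rng.random() < rate_infect / total:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        states.append((t, s, i, r))
    return states

# assumed values: alpha0, beta0, S0, I0, R0, tmax as printed from SIRdata
rng = np.random.default_rng(128)
states = gillespie_sir(0.465, 0.00237, 763, 3, 0, 14.0, rng)
print(f"{len(states)} recorded states, final state: {states[-1]}")
```

Each event changes exactly one individual's compartment, so $s + i + r$ is conserved, and the simulation stops either when no infected individuals remain or when the time horizon is reached.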

In [2]:
! ls ../../01_sbi_tutorial/src

1_SIR_generate_data.ipynb      SIR_dnnutil.py
2_SIR_train_cdf.ipynb          SIR_genutil.py
3_SIR_coverage_check_cdf.ipynb __pycache__


In [4]:
BASE = '../../01_sbi_tutorial/'

import os, sys

sys.path.append(f'{BASE:s}/src')

import numpy as np
import pandas as pd

# the standard modules for high-quality plots
import matplotlib as mp
import matplotlib.pyplot as plt
%matplotlib inline

# to reload modules
import importlib

from tqdm import tqdm

# update fonts
FONTSIZE = 18
font = {'family' : 'serif',
        'weight' : 'normal',
        'size'   : FONTSIZE}
mp.rc('font', **font)

# set usetex = False if LaTeX is not 
# available on your system or if the 
# rendering is too slow
mp.rc('text', usetex=True)

# set a seed to ensure reproducibility
seed = 128
rnd  = np.random.RandomState(seed)


### Load SIR data and generate function

In [5]:
from SIR_genutil import generate, observe, Fsolve, SIRdata
print(SIRdata)

 D           : [  3  25  75 227 296 258 236 192 126  71  28  11   7]
 I0          : 3
 O           : [  3  25  75 227 296 258 236 192 126  71  28  11   7]
 R0          : 0
 S0          : 763
 T           : [ 0  2  3  4  5  6  7  8  9 10 11 12 13]
 alpha0      : 0.465
 alpha_bins  : 16
 alpha_max   : 1.0
 alpha_min   : 0.0
 alpha_scale : 1.0
 beta0       : 0.00237
 beta_bins   : 16
 beta_max    : 0.7
 beta_min    : 0.2
 beta_scale  : 0.005
 model       : SIR
 scale       : 50
 tmax        : 14.0
 tmin        : 0.0



### Load $(\alpha, \beta)$ data

In [6]:
# N: number of epidemics to generate (at most N rows are read)
N  = 25000
df = pd.read_csv(f'{BASE:s}/data/SIR_alpha_beta_110k.csv.gz', nrows=N)
# the file may contain fewer than N rows, so use the actual count
N  = len(df)
print('number of entries: %d' % N)
df[:5]

number of entries: 25000


Unnamed: 0,alpha,beta
0,0.556824,0.432547
1,0.917183,0.617733
2,0.222595,0.684092
3,0.513685,0.2314
4,0.533168,0.343659


### Generate synthetic epidemics
We'll choose a uniform prior $\pi_\theta$ as our __proposal distribution__ over the parameter space.

In [15]:
print(f'generate {N:d} epidemics')

# get randomly sampled parameters
alpha = df.alpha.to_numpy()
beta  = df.beta.to_numpy()

epidemics = []
for j in tqdm(range(N)):
    params = (alpha[j], beta[j])
    states = generate(params, SIRdata)
    states.insert(0, beta[j])
    states.insert(0, alpha[j])
    epidemics.append( states )    

generate 25000 epidemics


100%|████████████████████████████████████| 25000/25000 [03:18<00:00, 126.10it/s]


### Observe epidemics at equally spaced observation times

In [40]:
tmin, tmax = SIRdata.tmin, SIRdata.tmax
T = np.linspace(tmin, tmax, 2*int(tmax)+1)

data = []
for j in tqdm(range(N)):
    a, b = epidemics[j][:2]
    states = epidemics[j][2:]
    
    obs  = observe(T, states) 
    
    for t, (s, i, r) in zip(T, obs):
        data.append([a, b, t, s, i])

100%|███████████████████████████████████| 25000/25000 [00:06<00:00, 3602.94it/s]
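The implementation of `observe` lives in `SIR_genutil` and is not shown in this notebook. As a rough sketch of what such a step can look like, the hypothetical function `observe_at` below assumes that an epidemic is a time-ordered list of `(t, s, i, r)` tuples and reports, for each observation time, the counts of the most recent state at or before that time (the counts are piecewise constant between events). Both the data layout and the function name are assumptions.

```python
import numpy as np

def observe_at(T, states):
    """Return the (s, i, r) counts in effect at each observation time in T.

    states: time-ordered sequence of (t, s, i, r) tuples (assumed layout).
    """
    times = np.array([st[0] for st in states])
    counts = np.array([st[1:] for st in states])
    # index of the last event with time <= each observation time
    idx = np.searchsorted(times, T, side='right') - 1
    # observation times before the first event fall back to the first state
    idx = np.clip(idx, 0, len(times) - 1)
    return counts[idx]

# small hand-made epidemic for illustration
states = [(0.0, 763, 3, 0), (0.4, 762, 4, 0), (1.1, 762, 3, 1), (2.7, 761, 4, 1)]
T = np.array([0.0, 0.5, 1.0, 1.5, 3.0])
obs = observe_at(T, states)
print(obs)
```

Observation times after the last recorded event simply repeat the final state, which matches the behavior of an epidemic that has died out.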


### Randomly shuffle `data`

In [41]:
d = np.array(data)
# shuffle rows with the seeded generator so the result is reproducible
rnd.shuffle(d)
K = 110000
d = d[:K].T
d.shape

(5, 110000)

### Write to CSV file

In [42]:
df = pd.DataFrame({'alpha': d[0], 
                   'beta': d[1],
                   't': d[2],
                   's': d[3], 
                   'i': d[4]})

df.to_csv('../data/traindata_110k.csv.gz', index=False, compression='gzip')

df[:5]

Unnamed: 0,alpha,beta,t,s,i
0,0.213605,0.555996,1.0,752.0,13.0
1,0.035917,0.609252,6.5,0.0,687.0
2,0.90136,0.650791,2.0,571.0,141.0
3,0.40607,0.219739,4.5,666.0,50.0
4,0.039018,0.56604,9.5,0.0,597.0


In [13]:
d = data.mean(axis=0)
d.shape

(57, 3)