# Introduction to calibration and uncertainty propagation

## Problem to solve

Explain the problem here...

In [None]:
# pip install the required packages if running in Colab
try:
  import google.colab
  IN_COLAB = True
  %pip install summerepi
except ImportError:
  IN_COLAB = False

In [None]:
# Standard imports, plotting option and constant definition
import pandas as pd
from scipy import stats

from summer import CompartmentalModel

pd.options.plotting.backend = "plotly"

## Create some dummy data we want our model to fit to

In [None]:
data = pd.DataFrame({"active_cases":
{
    60.: 3000.,
    80.: 8500.,
    100.: 21000.,
    120.: 40000.,
    140.: 44000.,
    160.: 30000.,
    180.: 16000.,
    200.: 7000.,
}}
)
data['active_cases'].plot(kind="scatter")

# Model

## Define a simple SIR model

In [None]:
def build_sir_model(parameters: dict) -> CompartmentalModel:
    """
    Create a compartmental model, with the minimal compartmental structure needed to run and produce some sort of 
    meaningful outputs.
    
    Args:
        parameters: Flow parameters
    Returns:
        A compartmental model currently without stratification applied
    """

    model = CompartmentalModel(
        times=(parameters["start_time"], parameters["end_time"]),
        compartments=["S", "I", "R"],
        infectious_compartments=["I"],
    )

    infectious_seed = parameters["infectious_seed"]
    initial_population = parameters["initial_population"]
    assert initial_population >= infectious_seed, "Initial population size must be at least as large as the infectious seed"

    model.set_initial_population(
        distribution=
        {
            "S": initial_population - infectious_seed, 
            "I": infectious_seed
        }
    )
    
    # Susceptible people can get infected
    model.add_infection_frequency_flow(
        name="infection", 
        contact_rate=parameters["contact_rate"], 
        source="S", 
        dest="I",
    )
    
    # Infectious people recover
    model.add_transition_flow(
        name="recovery",
        fractional_rate= 1. / parameters["infection_duration"],
        source="I",
        dest="R",
    )

    return model

## Run the model with some example parameters

In [None]:
base_parameters = {
    "initial_population": 1.e6,
    "infectious_seed": 100.,
    "start_time": 0,
    "end_time": 365,
    
    "contact_rate": 0.3,
    "infection_duration": 7.,
}

# Get an SIR model object
model = build_sir_model(base_parameters)

# Run the model
model.run()

# Plot the model outputs against the data
output_df = pd.DataFrame({
    "modelled": model.get_outputs_df()["I"],
    "observed": data.active_cases
})
output_df.plot(kind='scatter')


## Calibration specifications
The main objective of our calibration is to estimate the **posterior distribution** of the calibrated parameters. This is the probability distribution of the parameter values that can reproduce our observations, given a mathematical model and some prior knowledge about these parameters. In other words, it answers the question: "What values should the parameters take for our model to capture the data, given the prior information we had about them before even running the model?"

The posterior probability of a parameter set $\theta$ associated with the data $y$ is denoted $P(\theta | y)$.

Let's write Bayes' theorem for reference:
$$P(\theta | y) = \frac{P(y | \theta) \times P(\theta)}{P(y)} \quad.$$

Within the MCMC loop, we are only interested in the acceptance ratio that defines the probability of acceptance of a newly proposed parameter set $\theta '$, when the last accepted parameter set was $\theta$. This is the ratio of the posterior probabilities between $\theta '$ and $\theta$:
$$H := \frac{P(\theta ' | y)}{P(\theta | y)} = \frac{\frac{P(y | \theta ') \times P(\theta ')}{P(y)}}{\frac{P(y | \theta) \times P(\theta)}{P(y)}} = \frac{P(y | \theta ') \times P(\theta ')}{P(y | \theta) \times P(\theta)} \quad .$$

Here we will define the fundamental aspects of our MCMC calibration:
- Our prior knowledge about the calibrated parameters: $P(\theta)$
- The likelihood associated with our model. This is the probability of observing the data under a given model parameterisation: $P(y|\theta)$

... and some other technical aspects:
- Initial point from which the MCMC algorithm starts
- The proposal function (or jumping process), defining how we move around in our parameter space. This is defined by $\pi(\theta' | \theta)$ which is the probability of reaching the parameter set $\theta'$, when starting from the parameter set $\theta$.

### Using log-transformed quantities

The prior and likelihood quantities are often extremely small numbers in practice, which may make computation difficult due to computer precision limits. To avoid issues related to rounding, the probabilities are usually transformed using the logarithm function before calculation of the acceptance ratio.
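
As a small illustration of why the log scale helps, the sketch below multiplies a handful of made-up likelihood values (chosen only to trigger underflow, not taken from the model) and compares the result with the sum of their logarithms. It also shows how the acceptance ratio $H$ for a symmetric proposal could be evaluated on the log scale. The function names `log_acceptance_ratio` and `accept` are hypothetical helpers, not part of the notebook.

```python
import math
import random

# Hypothetical per-observation likelihood values, chosen only to illustrate underflow
likelihoods = [1e-40] * 10

# Multiplying the raw probabilities underflows to exactly zero in double precision
product = math.prod(likelihoods)

# Summing the logarithms keeps the same information as a representable number
log_sum = sum(math.log(p) for p in likelihoods)


def log_acceptance_ratio(log_lik_new, log_prior_new, log_lik_old, log_prior_old):
    """Return log(H) for a symmetric proposal: products in H become sums of logs."""
    return (log_lik_new + log_prior_new) - (log_lik_old + log_prior_old)


def accept(log_h, rng=random):
    """Accept with probability min(1, H), doing the comparison on the log scale."""
    if log_h >= 0.0:
        return True
    # 1 - random() lies in (0, 1], so the logarithm is always defined
    return math.log(1.0 - rng.random()) < log_h
```

With this formulation, the ratio of two posterior probabilities is never computed directly, only the difference of their logarithms.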

### Prior distributions

In [None]:
def evaluate_log_priors(proposed_parameters: dict) -> float:
    # Initialise the log-prior to 0 (i.e. a prior probability of 1)
    prior_log_proba = 0.

    # Use a uniform prior on [0., 0.5] for the contact_rate 
    prior_log_proba += stats.uniform.logpdf(x=proposed_parameters['contact_rate'], loc=0, scale=0.5)

    # Use a normal prior for the infection duration, with mean=7 days and sd=.5
    prior_log_proba += stats.norm.logpdf(x=proposed_parameters['infection_duration'], loc=7, scale=.5)

    return prior_log_proba

### Likelihood function

In [None]:
def evaluate_log_likelihood(proposed_parameters: dict) -> float:

    # build and run the model with the selected parameters
    parameter_set = dict(base_parameters, **proposed_parameters)
    model = build_sir_model(parameter_set)
    model.run()
    modelled_active = model.get_outputs_df()['I']

    # calculate the log-likelihood associated with the model run
    log_likelihood = 0.
    for data_time, data_value in data['active_cases'].items():
        modelled_value = modelled_active.loc[data_time]
        # use a normal likelihood with sd=100, centered on the model estimate
        log_likelihood += stats.norm.logpdf(x=data_value, loc=modelled_value, scale=100.)

    return log_likelihood


### Proposal (jumping) function

In [None]:
jumping_sds = {
    "contact_rate": .1,
    "infection_duration": .5
}

def propose_parameter_set(previous_parameters: dict, jumping_sds: dict) -> dict:
    
    proposed_parameters = {}
    for param_name in ["contact_rate", "infection_duration"]:
        proposed_parameters[param_name] = stats.norm.rvs(loc=previous_parameters[param_name], scale=jumping_sds[param_name])

    return proposed_parameters


In [None]:
param_set_1 = {
    "contact_rate": .4,
    "infection_duration": 6.
}

param_set_2 = {
    "contact_rate": .4,
    "infection_duration": 12.
}

# print(propose_parameter_set(param_set_2, jumping_sds))

#print(evaluate_log_likelihood(param_set_2))


#print(evaluate_log_priors(param_set_1))
#print(evaluate_log_priors(param_set_2))
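
The prior, likelihood and proposal functions defined above are the building blocks of a Metropolis-Hastings loop. The following is a minimal sketch of such a loop, not the notebook's own implementation. To keep it runnable with only the standard library, it samples a single parameter from a toy target (an unnormalised normal density with mean 3) instead of running the SIR model. The names `metropolis_hastings` and `toy_log_posterior` are assumptions made for this illustration.

```python
import math
import random


def metropolis_hastings(log_posterior, initial, jumping_sd, n_iterations, seed=0):
    """Random-walk Metropolis sampler for one parameter (symmetric normal proposal)."""
    rng = random.Random(seed)
    current = initial
    current_log_post = log_posterior(current)
    samples, n_accepted = [], 0

    for _ in range(n_iterations):
        # Propose a new value by jumping from the last accepted one
        proposed = rng.gauss(current, jumping_sd)
        proposed_log_post = log_posterior(proposed)

        # Log of the acceptance ratio H (the evidence P(y) cancels out)
        log_h = proposed_log_post - current_log_post
        if log_h >= 0.0 or math.log(1.0 - rng.random()) < log_h:
            current, current_log_post = proposed, proposed_log_post
            n_accepted += 1

        # Record the current state whether or not the proposal was accepted
        samples.append(current)

    return samples, n_accepted / n_iterations


def toy_log_posterior(x):
    # Toy target: unnormalised log-density of a normal with mean 3 and sd 1
    return -0.5 * (x - 3.0) ** 2


samples, acceptance_rate = metropolis_hastings(
    toy_log_posterior, initial=0.0, jumping_sd=1.0, n_iterations=20000
)

# Discard the first iterations as burn-in before summarising
kept = samples[2000:]
posterior_mean = sum(kept) / len(kept)
```

For the SIR calibration, the log-posterior would instead be the sum of `evaluate_log_priors` and `evaluate_log_likelihood`, and proposals would come from `propose_parameter_set`, with the state being a dictionary of parameters rather than a single number.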

## Some further considerations
We have presented a very simple implementation of an MCMC-based calibration. In practice, there may be other aspects to consider, including:

- Use of other MCMC algorithms

Here we have implemented a "simple" Metropolis-Hastings algorithm, which is one of the simplest MCMC algorithms. Other MCMC algorithms exist, including Gibbs sampling and Hamiltonian Monte Carlo. There are also other Bayesian sampling methods that do not satisfy the Markov property (i.e. they are not MCMC) but can be used for the same purpose. These include adaptive Metropolis samplers such as the Haario algorithm.

- Use of multiple MCMC chains

We have only implemented a single MCMC chain that explores the parameter space and samples posterior estimates. In practice, it is common to run multiple chains in parallel to generate more samples in the same period of time. Samples from the different chains are then combined, and statistical diagnostics (e.g. the R-hat statistic) are used to check for convergence and consistency between the chains.
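
As an illustration of the R-hat diagnostic, here is a sketch of the classic (non-split) Gelman-Rubin statistic for one scalar parameter, using only the standard library. The chains are synthetic normal draws standing in for real MCMC output, and `gelman_rubin_rhat` is a hypothetical helper name. Dedicated libraries use more refined variants of this statistic.

```python
import random
import statistics


def gelman_rubin_rhat(chains):
    """Classic R-hat for equal-length chains of a single scalar parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]

    # W: average of the within-chain sample variances
    within = statistics.fmean(statistics.variance(chain) for chain in chains)
    # B: between-chain variance, scaled by the chain length
    between = n * statistics.variance(chain_means)

    # Pooled estimate of the posterior variance
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)

# Four synthetic chains sampling the same distribution: R-hat should be close to 1
mixed_chains = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
rhat_mixed = gelman_rubin_rhat(mixed_chains)

# Four synthetic chains stuck around different values: R-hat is well above 1
stuck_chains = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 3.0, 3.0)]
rhat_stuck = gelman_rubin_rhat(stuck_chains)
```

Values close to 1 suggest that the chains agree with each other, whereas larger values indicate that they have not converged to the same distribution.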

- Non-symmetric proposal function

We have used a symmetric proposal (jumping) function in this example. This means that $\pi(\theta'|\theta) = \pi(\theta|\theta')$. If the proposal function is not symmetric, we should adjust the acceptance ratio as follows:
$$ H= \frac{P(y | \theta ') \times P(\theta ') \times \pi(\theta|\theta') }{P(y | \theta) \times P(\theta) \times \pi(\theta' |\theta)} \quad .$$
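
To make the correction term concrete, the sketch below evaluates $\log\left[\pi(\theta|\theta') / \pi(\theta'|\theta)\right]$ for a multiplicative log-normal jump, which is one example of a non-symmetric proposal. This proposal is an assumption made for illustration and is not used elsewhere in this notebook. The helper names are hypothetical.

```python
import math


def lognormal_logpdf(x, log_mean, sd):
    """Log-density at x of a variable whose logarithm is Normal(log_mean, sd)."""
    z = (math.log(x) - log_mean) / sd
    return -math.log(x * sd * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def log_proposal_correction(current, proposed, sd):
    """log[pi(current | proposed) / pi(proposed | current)] for a log-normal jump."""
    log_reverse = lognormal_logpdf(current, math.log(proposed), sd)
    log_forward = lognormal_logpdf(proposed, math.log(current), sd)
    return log_reverse - log_forward


# Correction to add to log(H) when jumping from 0.2 to 0.3
correction = log_proposal_correction(current=0.2, proposed=0.3, sd=0.5)
```

For this particular proposal the correction simplifies to $\log(\theta'/\theta)$, so it vanishes only when the proposed value equals the current one.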

- Parameter transformation

When parameter supports are bounded (e.g. finite interval), we often transform the parameters into quantities that are unbounded to make sampling easier. For example, with the transformed parameter space, we don't have to worry about having a proposal function defined on a bounded support. These transformations imply some more adjustments to the acceptance ratio that are not discussed here.
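
As an example of such a transformation, the sketch below maps a parameter bounded on a finite interval (such as `contact_rate` on [0, 0.5]) to the whole real line with a scaled logit, and back again. The function names are hypothetical, and the associated adjustment to the acceptance ratio is deliberately left out.

```python
import math


def to_unbounded(value, lower, upper):
    """Map a value strictly inside (lower, upper) to the real line (scaled logit)."""
    return math.log((value - lower) / (upper - value))


def to_bounded(z, lower, upper):
    """Inverse transformation: map any real number back into the interval."""
    # Numerically stable logistic function, valid for large positive or negative z
    if z >= 0.0:
        fraction = 1.0 / (1.0 + math.exp(-z))
    else:
        exp_z = math.exp(z)
        fraction = exp_z / (1.0 + exp_z)
    return lower + (upper - lower) * fraction


# Round trip for a contact rate of 0.3 with support [0, 0.5]
z = to_unbounded(0.3, lower=0.0, upper=0.5)
recovered = to_bounded(z, lower=0.0, upper=0.5)
```

A proposal made on the transformed scale can then take any real value, and `to_bounded` always returns a value within the original support.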

- Thinning

The samples generated by some MCMC algorithms (e.g. Metropolis-Hastings) are often highly auto-correlated. This is due to the iterative way in which the samples are generated. To address this issue we often apply thinning after generating the samples. That is, we only retain every n-th sampled parameter set.
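
The effect of thinning can be sketched with a synthetic auto-correlated sequence standing in for MCMC output (an AR(1) process, which is an assumption made for illustration only). The hypothetical helper `lag1_autocorrelation` measures the correlation between consecutive samples before and after keeping every 10th value.

```python
import random


def lag1_autocorrelation(values):
    """Correlation between each sample and the next one in the sequence."""
    n = len(values)
    mean = sum(values) / n
    numerator = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator


# Synthetic auto-correlated chain: each value depends strongly on the previous one
rng = random.Random(0)
chain = [0.0]
for _ in range(19999):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))

# Thinning: keep every 10th sample only
thinned = chain[::10]

raw_autocorrelation = lag1_autocorrelation(chain)
thinned_autocorrelation = lag1_autocorrelation(thinned)
```

The thinned sequence is ten times shorter, but its consecutive samples are much less correlated than those of the original chain.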

- Algorithm tuning...

This is probably the most challenging aspect of the Metropolis-Hastings sampler. It consists of finding an adequate proposal (jumping) function that ensures an exhaustive and efficient exploration of the parameter space. If transitions (or jumps) are too big, we will rarely accept the proposed parameters because they will often fall outside the high-density regions. If transitions are too small, we may not explore the parameter space comprehensively because the chain may stay in the same region. There is no pre-defined rule about how to define a "good" proposal function, but we often want to achieve an acceptance rate of about 10-40%.
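
The link between jump size and acceptance rate can be explored with a small experiment. The sketch below runs a random-walk Metropolis sampler on a toy standard normal target (not the SIR model) for three arbitrary jumping standard deviations and records the fraction of accepted proposals. The function name and the chosen values are assumptions made for illustration.

```python
import math
import random


def acceptance_rate(jumping_sd, n_iterations=5000, seed=0):
    """Fraction of accepted jumps for random-walk Metropolis on a standard normal target."""
    rng = random.Random(seed)
    current, n_accepted = 0.0, 0
    for _ in range(n_iterations):
        proposed = rng.gauss(current, jumping_sd)
        # Log acceptance ratio for the unnormalised log-density -x^2 / 2
        log_h = -0.5 * proposed ** 2 + 0.5 * current ** 2
        if log_h >= 0.0 or math.log(1.0 - rng.random()) < log_h:
            current = proposed
            n_accepted += 1
    return n_accepted / n_iterations


# Very small, moderate and very large jumps relative to the target's sd of 1
rates = {sd: acceptance_rate(sd) for sd in (0.05, 2.5, 50.0)}
```

Very small jumps are accepted almost every time but move the chain slowly, while very large jumps are rarely accepted. Tuning aims for a jump size between these two extremes.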

## Now the good news...
There are multiple libraries that handle Bayesian sampling with MCMC algorithms already implemented. They also have self-tuning functionalities and other features (e.g. automatic parameter transformation) that address the issues listed above. Our next session will introduce one of these libraries: numpyro.