Quickstart
==========

Here is a small example of how to do Bayesian model selection.

Step by step explanation
----------------------------

### Defining a model


To do model selection, we first need some models. In the simplest case, a model
is just a callable that takes a single `dict` as input and returns a single `dict` as output. The keys of the input dictionary are the parameters of the model; the keys of the output
dictionary denote the summary statistics.
Here, the `dict` is passed as `parameters` and has the entry `x`, which denotes the mean of a Gaussian.
The model returns the simulated summary statistic `y`, which is just the sampled value.

In [1]:
%matplotlib inline
import os
import tempfile
import numpy as np

import scipy.stats as st

from pyabc import (ABCSMC, RV, Distribution,
                   PercentileDistanceFunction)

# Define a gaussian model
sigma = 1.


def model(parameters):
    # sample from a gaussian
    y = st.norm(parameters.x, sigma).rvs()
    # return the sample as dictionary
    return {"y": y}
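
To make the input/output contract concrete, here is a minimal sketch that calls such a model by hand. The model above reads its parameter as `parameters.x`, so a plain `dict` is not enough for this; `AttrDict` is a hypothetical helper introduced only for this illustration, and `random.gauss` stands in for the `scipy.stats` call.

```python
import random


class AttrDict(dict):
    """Hypothetical stand-in for the parameter object: a dict with attribute access."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as err:
            raise AttributeError(name) from err


SIGMA = 1.0  # same fixed standard deviation as in the cell above


def gaussian_model(parameters):
    # Draw one sample from a Gaussian with mean parameters.x
    return {"y": random.gauss(parameters.x, SIGMA)}


random.seed(0)  # seed chosen only to make the sketch repeatable
sample = gaussian_model(AttrDict(x=0.5))
print(sample)  # a dict with the single key "y"
```

The returned dictionary has exactly one key, `y`, which is the summary statistic that the distance function compares later on.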

For model selection we usually have more than one model.
These are assembled in a list. We also
require a Bayesian prior over the models;
the default is a uniform prior over the model classes.
This concludes the model definition.
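
As a rough sketch of what a uniform prior over the model classes means, the snippet below first draws a model index and then a parameter from that model's prior. It uses only the standard library; the prior means 0, 1 and -1 are the ones used later in this example, and the function name `sample_from_joint_prior` is made up for this illustration.

```python
import random

random.seed(1)  # arbitrary seed, only for repeatability

# Prior means of the three candidate models and their common standard deviation
prior_means = [0.0, 1.0, -1.0]
prior_std = 1.0

# Uniform prior over model classes: each model has probability 1/K
n_models = len(prior_means)
model_prior = [1.0 / n_models] * n_models


def sample_from_joint_prior():
    # First draw a model index m ~ p(m), then a parameter x ~ p(x | m)
    m = random.choices(range(n_models), weights=model_prior)[0]
    x = random.gauss(prior_means[m], prior_std)
    return m, x


draws = [sample_from_joint_prior() for _ in range(5)]
print(draws)
```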

In [2]:
# We define three models, but they are identical so far
models = [model, model, model]


# However, our models' priors are not the same.
# Their means differ.
mu_x_1, mu_x_2, mu_x_3 = 0, 1, -1
parameter_priors = [
    Distribution(x=RV("norm", mu_x_1, sigma)),
    Distribution(x=RV("norm", mu_x_2, sigma)), 
    Distribution(x=RV("norm", mu_x_3, sigma))
]

### Configuring the ABCSMC run

With the models defined, we can assemble the `ABCSMC` object.
We need a distance function
to measure the distance between simulated and observed summary statistics.

In [3]:
# We plug all the ABC options together
abc = ABCSMC(
    models, parameter_priors,
    PercentileDistanceFunction(measures_to_use=["y"]))
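
To illustrate the role a distance function plays in ABC, here is a minimal sketch of a distance between two summary-statistic dictionaries together with the acceptance rule built on it. This is a plain absolute difference chosen for illustration, not the computation performed by `PercentileDistanceFunction`; the names `abs_distance` and `accept` are assumptions of this sketch.

```python
def abs_distance(simulated, observed, keys=("y",)):
    # Sum of absolute differences over the selected summary statistics
    return sum(abs(simulated[k] - observed[k]) for k in keys)


def accept(simulated, observed, epsilon):
    # ABC acceptance rule: keep the simulation if it is within epsilon of the data
    return abs_distance(simulated, observed) <= epsilon


observed = {"y": 0.5}
print(accept({"y": 0.45}, observed, epsilon=0.1))  # True
print(accept({"y": 1.20}, observed, epsilon=0.1))  # False
```

A smaller `epsilon` accepts fewer simulations but brings the accepted ones closer to the observed data.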

### Setting the observed data

The measured data can now be passed to the `ABCSMC` object.
This is done via the `new` method, which indicates that we start
a new run as opposed to resuming a stored run (see the "resume stored run" example).
We also have to set the output database in which the ABC-SMC run
is logged.

## Generate test data for comparison to NDE

In [5]:
# set same seed as in the NDE notebook
rng = np.random.RandomState(seed=42)
sx_t = rng.normal(loc=0, scale=1, size=1)
sx_t

array([0.49671415])

## Run a loop over all test data points

In [6]:
n_simulations = np.zeros_like(sx_t)
phat = np.zeros((3, sx_t.size))

for idx, y_observed in enumerate(sx_t): 
    # y_observed is the important piece here: it is our actual observation.
    # We also define where to store the results.
    db_path = ("sqlite:///" +
               os.path.join(tempfile.gettempdir(), "test.db"))
    abc_id = abc.new(db_path, {"y": y_observed})

    # We run the ABC until either criterion is met
    history = abc.run(minimum_epsilon=0.0001, max_nr_populations=10)
    
    n_simulations[idx] = history.total_nr_simulations
    
    phat[:, idx] = history.get_model_probabilities().values[-1, :]

INFO:History:Start <ABCSMC(id=108, start_time=2018-12-14 17:23:15.213389, end_time=None)>
INFO:Epsilon:initial epsilon is 0.4373312603003668
INFO:ABC:t:0 eps:0.4373312603003668
INFO:ABC:t:1 eps:0.1956997542977105
INFO:ABC:t:2 eps:0.09489582533615787
INFO:ABC:t:3 eps:0.0541448837359886
INFO:ABC:t:4 eps:0.028632904101118524
INFO:ABC:t:5 eps:0.018295346365303035
INFO:ABC:t:6 eps:0.009101333120114215
INFO:ABC:t:7 eps:0.005109189915900939
INFO:ABC:t:8 eps:0.0030281484175126295
INFO:ABC:t:9 eps:0.001688973306761143
INFO:History:Done <ABCSMC(id=108, start_time=2018-12-14 17:23:15.213389, end_time=2018-12-14 17:23:35.265310)>
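
To show where the model probabilities come from, here is a minimal sketch of model selection with plain rejection ABC. It is deliberately simpler than the sequential Monte Carlo scheme that `ABCSMC` runs: the threshold `epsilon` is fixed instead of shrinking over populations, and the observation `0.5` is an illustrative value, not the test data from this notebook.

```python
import random

random.seed(42)  # arbitrary seed, only for repeatability

prior_means = [0.0, 1.0, -1.0]  # prior means of the three models
sigma = 1.0                     # prior and likelihood standard deviation
y_observed = 0.5                # illustrative observation (assumption)
epsilon = 0.1                   # fixed acceptance threshold (assumption)
n_draws = 50_000

accepted = [0] * len(prior_means)
for _ in range(n_draws):
    m = random.randrange(len(prior_means))   # m ~ uniform model prior
    x = random.gauss(prior_means[m], sigma)  # x ~ p(x | m)
    y = random.gauss(x, sigma)               # y ~ p(y | x)
    if abs(y - y_observed) <= epsilon:       # accept if close to the data
        accepted[m] += 1

# The share of accepted simulations per model estimates p(m | y_observed)
total = sum(accepted)
model_probabilities = [count / total for count in accepted]
print(model_probabilities)
```

The third model, whose prior mean is farthest from the observation, ends up with the smallest share of accepted simulations.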


## Calculate true posterior 

In [53]:
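import scipy.stats  # imported explicitly because scipy.stats.norm is used below
import sys

# BaseModel comes from the model_comparison package located two directories up
sys.path.append('../../')
from model_comparison.models import BaseModel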


# background model prior 
prior_m0 = scipy.stats.norm(0, 1)
# signal model prior 
prior_m1 = scipy.stats.norm(1, 1)
# third model 
prior_m2 = scipy.stats.norm(-1, 1)

class GaussianModel(BaseModel):
    def __init__(self, std, dim_param=1, sample_size=10, n_workers=1, seed=None):
        super().__init__(dim_param=dim_param, sample_size=sample_size, n_workers=n_workers, seed=seed)
        self.std = std
        self.posterior = None

    def gen_single(self, params):
        # in multiprocessing the parameter vector additionally contains a seed
        if self.run_parallel:
            mu, seed = params
            self.rng.seed(int(seed))
        else:
            mu = params
        return self.rng.normal(loc=mu, scale=self.std, size=self.sample_size)

# models 
sample_size = 1
m0 = GaussianModel(std=1, sample_size=sample_size)
m1 = GaussianModel(std=1, sample_size=sample_size)
m2 = GaussianModel(std=1, sample_size=sample_size)

marli0 = np.array([scipy.stats.norm.pdf(x=xo, loc=prior_m0.mean(), 
                                        scale=np.sqrt(m0.std**2 + prior_m0.std()**2)) for xo in sx_t])
marli1 = np.array([scipy.stats.norm.pdf(x=xo, loc=prior_m1.mean(), 
                                        scale=np.sqrt(m1.std**2 + prior_m1.std()**2)) for xo in sx_t])
marli2 = np.array([scipy.stats.norm.pdf(x=xo, loc=prior_m2.mean(), 
                                        scale=np.sqrt(m2.std**2 + prior_m2.std()**2)) for xo in sx_t])

p_m0_xtest = marli0 / (marli1 + marli0 + marli2)
p_m1_xtest = marli1 / (marli1 + marli0 + marli2)
p_m2_xtest = marli2 / (marli1 + marli0 + marli2)
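
The closed form used above follows from the conjugacy of the Gaussian prior and the Gaussian likelihood. As a short derivation, write $\mu_m$ for the prior mean of model $m$, $\sigma_0$ for the prior standard deviation and $\sigma$ for the standard deviation of the likelihood (this notation is introduced here and does not appear in the code). For a single observation $y$, the marginal likelihood is

$$
p(y \mid m) = \int \mathcal{N}(y;\, x, \sigma^2)\, \mathcal{N}(x;\, \mu_m, \sigma_0^2)\, dx = \mathcal{N}(y;\, \mu_m, \sigma^2 + \sigma_0^2),
$$

which is why the `scale` argument above is `np.sqrt(m.std**2 + prior.std()**2)`. With a uniform prior $p(m) = 1/3$ over the three models, Bayes' rule gives

$$
p(m \mid y) = \frac{p(y \mid m)\, p(m)}{\sum_{k=0}^{2} p(y \mid k)\, p(k)} = \frac{p(y \mid m)}{\sum_{k=0}^{2} p(y \mid k)},
$$

so the model prior cancels and each posterior probability is a marginal likelihood divided by the sum of all three.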

In [57]:
ptrue = np.vstack((p_m0_xtest, p_m1_xtest, p_m2_xtest))

In [58]:
np.abs(phat - ptrue).mean()

0.04090708698630825
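
As a cross-check that does not depend on the local `model_comparison` package, the true posterior can also be computed with the standard library alone. This is a sketch under the same assumptions as above (unit prior and likelihood standard deviations, uniform model prior, a single observation); `normal_pdf` is a helper written for this illustration.

```python
import math


def normal_pdf(x, mean, std):
    # Density of a univariate Gaussian
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


y_observed = 0.49671415          # the test data point printed above
prior_means = [0.0, 1.0, -1.0]   # prior means of the three models
prior_std = 1.0
likelihood_std = 1.0

# Marginal likelihood of y under each model: Gaussian with the two variances added
marginal_std = math.sqrt(likelihood_std**2 + prior_std**2)
marginals = [normal_pdf(y_observed, mu, marginal_std) for mu in prior_means]

# Posterior model probabilities under a uniform model prior
posterior = [ml / sum(marginals) for ml in marginals]
print(posterior)
```

Because the observation lies just below 0.5, the model with prior mean 0 is slightly favored over the one with prior mean 1, and the model with prior mean -1 is the least probable.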