# Bayesian uncertainty analysis
*R.A. Collenteur, Eawag, June, 2023*

This notebook shows how an MCMC algorithm can be used to estimate the model parameters of a Pastas model and to quantify the parameter uncertainties using a Bayesian approach. For this purpose the `EmceeSolve` solver is introduced, which is based on the [emcee](https://emcee.readthedocs.io) Python package.

Besides Pastas, the following Python packages have to be installed to run this notebook:

- [emcee](https://emcee.readthedocs.io)
- [corner](https://corner.readthedocs.io)

<div class="alert alert-warning">
<b>Note:</b>
The `EmceeSolve` solver is still an experimental feature and some of its arguments might change in the near future (2023/06/22). We welcome testing and feedback on this new feature!
</div>

In [None]:
import numpy as np
import pandas as pd

import pastas as ps
import emcee
import corner

import matplotlib.pyplot as plt

ps.set_log_level("ERROR")
ps.show_versions()

## 1. Create a Pastas Model
The first step is to create a Pastas Model, including the RechargeModel to simulate the effect of precipitation and evaporation on the heads. Here, we first estimate the model parameters using the standard least-squares approach.

In [None]:
head = pd.read_csv(
    "data/B32C0639001.csv", parse_dates=["date"], index_col="date"
).squeeze()

evap = pd.read_csv("data/evap_260.csv", index_col=0, parse_dates=[0]).squeeze()
rain = pd.read_csv("data/rain_260.csv", index_col=0, parse_dates=[0]).squeeze()

ml = ps.Model(head)

# Select a recharge model
rch = ps.rch.FlexModel()

rm = ps.RechargeModel(rain, evap, recharge=rch, rfunc=ps.Gamma(), name="rch")
ml.add_stressmodel(rm)

ml.solve(noise=True, tmin="1990")

ax = ml.plot(figsize=(10, 3))

## 2. Use the EmceeSolver

We will now use the `EmceeSolve` solver to estimate the model parameters and their uncertainties. This solver wraps the [emcee](https://emcee.readthedocs.io) package, which implements several MCMC move algorithms. A good understanding of emcee helps when using this solver, so we recommend checking out its documentation as well.

To set up the solver, a number of decisions need to be made:

- Determine the priors of the parameters
- Choose a (log) likelihood function
- Choose the number of steps and thinning

### 2a. Choose and set the priors

The first step is to choose and set the priors of the parameters. This is done with the `ml.set_parameter` method and its `dist` argument (short for distribution). Any continuous distribution from `scipy.stats` can be chosen (https://docs.scipy.org/doc/scipy/tutorial/stats/continuous.html), for example `uniform`, `norm`, or `lognorm`. Here, for the sake of the example, we set all prior distributions to a normal distribution.

In [None]:
# Set the prior distribution of all parameters to a normal distribution
for name in ml.parameters.index:
    ml.set_parameter(name, dist="norm")

ml.parameters

Pastas will use the `initial` value of the parameter for the `loc` argument of the distribution (e.g., the mean of a normal distribution), and the `stderr` as the `scale` argument (e.g., the standard deviation of a normal distribution). Only for the parameters with a `uniform` distribution, the `pmin` and `pmax` values are used to determine a uniform prior. By default, all parameters are assigned a `uniform` prior.

<div class="alert alert-warning">
<b>Note:</b>
This means that either `pmin` and `pmax` should be set for uniform distributions, or `stderr` for any other distribution. That is why in this example the model was first solved using LeastSquares, in order to obtain estimates for the `stderr`. In practice, these could also be set based on expert judgement or other information about the parameters.
</div>
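The `loc` and `scale` convention described above can be illustrated directly with `scipy.stats`. The snippet below is a minimal sketch and not the Pastas implementation: the values of `initial`, `stderr`, `pmin` and `pmax` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical values for a single parameter (not taken from the model above)
initial, stderr = 1.5, 0.3  # used as loc and scale for non-uniform priors
pmin, pmax = 0.0, 5.0  # used as bounds for a uniform prior

# Normal prior: loc is the mean, scale is the standard deviation
normal_prior = stats.norm(loc=initial, scale=stderr)

# Uniform prior on [pmin, pmax]: scipy defines it on [loc, loc + scale]
uniform_prior = stats.uniform(loc=pmin, scale=pmax - pmin)

value = 1.8
print(normal_prior.logpdf(value))  # log-prior under the normal distribution
print(uniform_prior.logpdf(value))  # constant inside the bounds
print(uniform_prior.logpdf(6.0))  # -inf outside the bounds
```

Note that a uniform prior only constrains the parameter to its bounds, whereas a normal prior with a small `stderr` pulls the posterior towards the `initial` value.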

### 2b. Create the solver instance

The next step is to create an instance of the `EmceeSolve` solver class. At this stage all the settings need to be provided on how the Ensemble Sampler is created (https://emcee.readthedocs.io/en/stable/user/sampler/). Important settings are `nwalkers`, `moves`, and `objective_function`. More advanced options are to parallelize the MCMC algorithm (`parallel=True`) and to set a backend to store the results. Here's an example:

In [None]:
# Choose the objective function
ln_prob = ps.objfunc.GaussianLikelihoodAr1()

# Create the EmceeSolver with some settings
s = ps.EmceeSolve(
    nwalkers=20,
    moves=emcee.moves.DEMove(),
    objective_function=ln_prob,
    progress_bar=True,
    parallel=False,
)

In the above code we created an `EmceeSolve` instance with 20 walkers, which take steps according to the `DEMove` move algorithm (see Emcee docs), and a Gaussian likelihood function that assumes AR1 correlated errors. Different objective functions are available, see the Pastas documentation on the different options. 

Depending on the likelihood function, a number of additional parameters need to be inferred. These parameters are not added to the Pastas Model instance, but are available from the solver object. They can be changed using the `set_parameter` method of the solver. In this example, where we use the `GaussianLikelihoodAr1` function, two additional parameters are estimated: sigma, the unknown standard deviation of the errors, and theta, the autoregressive parameter.

In [None]:
s.parameters

In [None]:
s.set_parameter("ln_sigma", initial=0.0028, vary=False, dist="norm")
s.parameters
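To make the role of sigma and theta more concrete, the following is a minimal sketch of a Gaussian log-likelihood with AR(1) correlated errors. It is not the code used by `GaussianLikelihoodAr1`: the function name `gaussian_loglik_ar1` is hypothetical, equidistant time steps are assumed, and the likelihood is conditional on the first residual.

```python
import numpy as np


def gaussian_loglik_ar1(residuals, sigma, theta):
    """Conditional Gaussian log-likelihood for AR(1) correlated residuals.

    Hypothetical helper for illustration; assumes equidistant time steps.
    """
    res = np.asarray(residuals, dtype=float)
    # Innovations: the part of each residual not explained by the previous one
    innovations = res[1:] - theta * res[:-1]
    n = innovations.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum(
        innovations**2
    ) / sigma**2


# Residuals that decay by a factor 0.5 at each step are fully explained by
# theta=0.5, so that value yields a higher likelihood than theta=0.0
residuals = [1.0, 0.5, 0.25, 0.125]
ll_ar1 = gaussian_loglik_ar1(residuals, sigma=1.0, theta=0.5)
ll_white = gaussian_loglik_ar1(residuals, sigma=1.0, theta=0.0)
print(ll_ar1 > ll_white)
```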

### 2c. Run the solver and solve the model

After setting the parameters and creating an `EmceeSolve` solver instance we are ready to run the MCMC analysis. We can do this by running `ml.solve`. We can pass the same arguments that we normally provide to this method (e.g., `tmin` or `fit_constant`). Here we start from the parameters obtained with our least-squares solve (`initial=False`), and do not fit a noise model (`noise=False`), because we take autocorrelated errors into account through the likelihood function.

All the arguments that are not used by `ml.solve`, for example `steps` and `tune`, are passed on to the `run_mcmc` method of the sampler (see the emcee docs). The most important one is the `steps` argument, which determines how many steps each of the walkers takes.

In [None]:
# Use the solver to run MCMC
ml.solve(
    solver=s,
    initial=False,
    fit_constant=False,
    noise=False,  # We have to set noise to False !
    tmin="1990",
    steps=1000,
    tune=True,
)
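To illustrate what walkers and steps mean, the sketch below runs a plain random-walk Metropolis sampler written in NumPy on a toy target distribution. This is deliberately not the `DEMove` algorithm from emcee and not what `EmceeSolve` does internally; the function names, the toy target and the step size are assumptions made for illustration only. The resulting chain has the shape `(steps, nwalkers, ndim)`.

```python
import numpy as np


def log_prob(theta):
    # Toy target: independent normal with mean 2 and standard deviation 0.5
    return -0.5 * np.sum(((theta - 2.0) / 0.5) ** 2, axis=-1)


def run_metropolis(log_prob, p0, steps, step_size=0.5, seed=0):
    """Random-walk Metropolis for several independent walkers at once."""
    rng = np.random.default_rng(seed)
    nwalkers, ndim = p0.shape
    chain = np.empty((steps, nwalkers, ndim))
    current = p0.copy()
    current_lp = log_prob(current)
    for i in range(steps):
        # Propose a new position for every walker
        proposal = current + step_size * rng.standard_normal((nwalkers, ndim))
        proposal_lp = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current))
        accept = np.log(rng.random(nwalkers)) < proposal_lp - current_lp
        current[accept] = proposal[accept]
        current_lp[accept] = proposal_lp[accept]
        chain[i] = current
    return chain


p0 = np.zeros((20, 2))  # 20 walkers, 2 parameters, all starting at zero
chain = run_metropolis(log_prob, p0, steps=1000)

# Discard the first 500 steps as burn-in and merge all walkers
flat = chain[500:].reshape(-1, chain.shape[-1])
print(chain.shape, flat.shape)
print(flat.mean(axis=0), flat.std(axis=0))
```

The sample mean and standard deviation of the flattened chain should approach those of the toy target, which is the same idea that is applied to the Pastas model parameters above.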

## 3. Posterior parameter distributions

The results from the MCMC analysis are stored in the `sampler` object, accessible through `ml.fit.sampler`. The attribute `ml.fit.sampler.flatchain` contains an array with the parameter samples. After discarding a burn-in period and thinning, the number of samples $n$ is calculated as follows:

$n = \frac{\left(\text{steps}-\text{burn}\right)\cdot\text{nwalkers}}{\text{thin}} $
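As a worked example of this formula, the sketch below uses `steps=1000` and `nwalkers=20` from the solver settings in this notebook. The values `burn=500` and `thin=1` are assumptions chosen for illustration, and the helper name `n_samples` is hypothetical.

```python
def n_samples(steps, burn, nwalkers, thin=1):
    """Number of samples left after burn-in removal and thinning."""
    return (steps - burn) * nwalkers // thin


n = n_samples(steps=1000, burn=500, nwalkers=20)
print(n)  # 10000
```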

### Corner.py
Corner is a simple but great Python package that makes it easy to create corner plots. A couple of lines of code suffice to create a plot of the parameter distributions and the covariances between the parameters.

In [None]:
# Corner plot of the results
fig = plt.figure(figsize=(8, 8))

labels = list(ml.parameters.index[ml.parameters.vary]) + list(
    ml.fit.parameters.index[ml.fit.parameters.vary]
)
labels = [label.split("_")[1] for label in labels]

best = list(ml.parameters[ml.parameters.vary == True].optimal) + list(
    ml.fit.parameters[ml.fit.parameters.vary == True].optimal
)

axes = corner.corner(
    ml.fit.sampler.get_chain(flat=True, discard=500),
    quantiles=[0.025, 0.5, 0.975],
    labelpad=0.1,
    show_titles=True,
    title_kwargs=dict(fontsize=10),
    label_kwargs=dict(fontsize=10),
    max_n_ticks=3,
    fig=fig,
    labels=labels,
    truths=best,
)

plt.show()
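The numbers summarized in a corner plot can also be computed directly from the samples. The following sketch uses a synthetic two-parameter sample array as a stand-in for the flattened chain (an assumption, so that the snippet runs on its own) and computes the same quantiles as above, together with the correlation between the parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a flat chain: 10,000 samples of 2 correlated parameters
cov = [[1.0, 0.8], [0.8, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)

# Median and 95% credible interval per parameter (one column per parameter)
lower, median, upper = np.quantile(samples, [0.025, 0.5, 0.975], axis=0)

# Correlation matrix between the parameters
corr = np.corrcoef(samples, rowvar=False)

print(lower, median, upper)
print(corr)
```

For the model in this notebook, `samples` could be replaced by the array passed to `corner.corner`.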

## 4. What happens to the walkers at each step?
At each step, the walkers move in different directions. It is expected that, after a number of steps, the direction of the steps becomes random, which is a sign that the walkers have converged and are sampling around the optimum. This can be checked by looking at the autocorrelation of the chains, which should become insignificant after a number of steps. Below we only show how to obtain and plot the chains of the individual walkers; their interpretation is outside the scope of this notebook.

In [None]:
fig, axes = plt.subplots(len(labels), figsize=(10, 7), sharex=True)

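# Keep the walker dimension: the shape is (steps, nwalkers, ndim)
samples = ml.fit.sampler.get_chain()
for i in range(len(labels)):
    ax = axes[i]
    # One line per walker, plotted against the step number
    ax.plot(samples[:, :, i], "k", alpha=0.5)
    ax.set_xlim(0, len(samples))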
    ax.set_ylabel(labels[i])
    ax.yaxis.set_label_coords(-0.1, 0.5)

axes[-1].set_xlabel("step number")
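The autocorrelation check mentioned above can be sketched with NumPy alone. The helper `autocorrelation` below is hypothetical and is applied to two synthetic series (assumptions for illustration): white noise, which has no memory, and a strongly correlated AR(1) series. For an actual analysis, the emcee sampler provides the `get_autocorr_time` method.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Normalized sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array(
        [np.dot(x[: x.size - k], x[k:]) / var for k in range(max_lag + 1)]
    )


rng = np.random.default_rng(0)
noise = rng.standard_normal(5000)

# Synthetic chain with strong memory: each value depends on the previous one
chain = np.empty_like(noise)
chain[0] = noise[0]
for t in range(1, noise.size):
    chain[t] = 0.9 * chain[t - 1] + noise[t]

acf_noise = autocorrelation(noise, max_lag=50)
acf_chain = autocorrelation(chain, max_lag=50)

print(acf_noise[1], acf_chain[1], acf_chain[50])
```

A chain whose autocorrelation decays slowly contains fewer effectively independent samples, so more steps or stronger thinning are needed.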

## 5. Plot simulated time series to display the uncertainty

We can now draw parameter sets from the chain and simulate the heads for each set, to visualize the uncertainty of the simulated heads that results from the parameter uncertainty.

In [None]:
# Plot results and uncertainty
ax = ml.plot(figsize=(10, 3))
plt.title(None)

chain = ml.fit.sampler.get_chain(flat=True, discard=500)
inds = np.random.randint(len(chain), size=100)
for ind in inds:
    params = chain[ind]
    p = ml.parameters.optimal.copy().values
    p[ml.parameters.vary] = params[: ml.parameters.vary.sum()]
    ml.simulate(p, tmin="1990").plot(c="gray", alpha=0.1, zorder=-1)

plt.legend(["Measurements", "Simulation", "Ensemble members"], numpoints=3)
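Instead of plotting the individual ensemble members, the ensemble can also be summarized as an uncertainty band. The sketch below computes a 95% band from a synthetic ensemble that stands in for the simulated heads (an assumption, so that the snippet runs without the model): each row is one simulation and each column one time step.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(100)

# Synthetic ensemble: 200 simulations (rows) of 100 time steps (columns)
ensemble = np.sin(t / 10.0) + 0.2 * rng.standard_normal((200, t.size))

# 95% interval at every time step, computed across the ensemble members
lower, upper = np.percentile(ensemble, [2.5, 97.5], axis=0)

# Fraction of simulated values that fall inside the band
inside = ((ensemble >= lower) & (ensemble <= upper)).mean()
print(lower.shape, upper.shape, inside)
```

With real simulations, `lower` and `upper` could be drawn with `ax.fill_between` to show the band around the simulated heads.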
            