# Histogram fits with `pyhf`

Often we don't have an analytic parametrization for our fit templates, so we resort to Monte Carlo (MC) simulations and use histograms as templates that we fit to data in the same bins.

We are going to use the [`pyhf`](https://github.com/scikit-hep/pyhf) package for these fits. The documentation can be found at https://pyhf.readthedocs.io/.

It can be installed with pip, e.g.

`pip install --user pyhf`

In addition we are using the `mplhep` package in this notebook to make histogram plotting more convenient and `iminuit` to extract uncertainties on fit parameters.

`pip install --user mplhep iminuit`

In [None]:
import pyhf

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import mplhep as hep

Let's create two artificial histograms with 10 bins (i.e. 11 bin boundaries). You can think of them as two different background processes for which we have MC simulations: we ran an event selection on the simulated events and filled a histogram for each process. For now, let's assume that the shapes of these distributions are modelled correctly and we only need to fit the normalization (for both templates independently) to data.

In [None]:
bins = np.arange(11)

In [None]:
hist1 = 3 * np.array([0.5, 1, 2, 2.5, 2.1, 2.2, 2, 1.5, 1, 0.5])
hist2 = 3 * np.array([1, 2, 3, 4, 5, 3, 2, 1, 0.1, 0.05])
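In a real analysis, arrays like `hist1` and `hist2` would come from filling histograms with the selected MC events. The following is a minimal sketch of that step. The Gaussian toy sample, the seed and the per-event weight are made up for illustration and are not part of this notebook's example.

```python
import numpy as np

bins = np.arange(11)  # 11 boundaries -> 10 bins

# Hypothetical "MC events": a toy observable drawn from a Gaussian
rng = np.random.default_rng(seed=0)
events = rng.normal(loc=5.0, scale=2.0, size=1000)

# Hypothetical per-event weight that scales the MC sample to the expected yield
weights = np.full(events.shape, 0.05)

# Events outside [0, 10] do not fall into any bin and are dropped
template, edges = np.histogram(events, bins=bins, weights=weights)
print(template)
```

The resulting `template` array has one entry per bin and could be used in the same way as the hand-written histograms above.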

In [None]:
hep.histplot([hist1, hist2], bins)

We want to stack them, since we expect the sum of both to give the expected data yield.

In [None]:
hep.histplot([hist1, hist2], bins, stack=True, histtype="fill")

Now, let's assume we observed the following data counts in each bin:

In [None]:
data = np.array([ 4, 17, 26, 23, 34, 23, 21,  7,  8,  4])

In [None]:
hep.histplot([hist1, hist2], bins, stack=True, histtype="fill")
hep.histplot(data, bins, histtype="errorbar", color="black")

Data points are often drawn with error bars that indicate the 1 $\sigma$ interval of a Poisson-distributed event count, which helps to visualize the expected statistical spread.

In [None]:
hep.histplot([hist1, hist2], bins, stack=True, histtype="fill")
hep.histplot(data, bins, histtype="errorbar", color="black", w2=data)
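For intuition about the size of these error bars: a Poisson distribution with mean $n$ has standard deviation $\sqrt{n}$, so a common symmetric approximation of the 1 $\sigma$ spread is $\sqrt{n}$ per bin. The sketch below only illustrates this approximation. It is an assumption that this matches what is drawn above, since plotting libraries may use asymmetric Poisson intervals instead.

```python
import numpy as np

data = np.array([4, 17, 26, 23, 34, 23, 21, 7, 8, 4])

# Symmetric approximation: the standard deviation of a Poisson count n is sqrt(n)
yerr = np.sqrt(data)

for n, err in zip(data, yerr):
    print(f"{n:3d} +- {err:.2f}")
```

The relative uncertainty $1/\sqrt{n}$ is largest in the sparsely populated bins at the edges.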

## One template fits it all

`pyhf` performs fits with the maximum-likelihood method and uses the HistFactory ([CERN-OPEN-2012-016](https://cds.cern.ch/record/1456844)) template. In the simplest case the pdf (probability density function) is just a product of Poisson probabilities, one for each bin:

$$p(n|\lambda) = \prod_{\mathrm{bin}\, b} \mathrm{Pois}(n_b | \lambda_b)$$

where $\mathrm{Pois}(n_b | \lambda_b)$ is the Poisson distribution for $\lambda_b$ expected and $n_b$ observed counts. In our case $\lambda_b$ would be given by

$$\lambda_b = \mu_1 b_1 + \mu_2 b_2$$

where $b_1$ and $b_2$ are the expected counts in that bin from our two histograms and $\mu_1$ and $\mu_2$ are the normalization factors we want to fit. Evaluated at the observed data, this pdf defines the likelihood function, which is later maximized to obtain the best-fit parameter values.
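To make the formula concrete, here is a minimal sketch that evaluates the negative log-likelihood $-\ln p(n|\lambda)$ of this simple model directly with `numpy`. The function name `nll` is chosen for illustration only. This is not how `pyhf` implements it internally, and the data counts are the ones observed above.

```python
import math
import numpy as np

hist1 = 3 * np.array([0.5, 1, 2, 2.5, 2.1, 2.2, 2, 1.5, 1, 0.5])
hist2 = 3 * np.array([1, 2, 3, 4, 5, 3, 2, 1, 0.1, 0.05])
data = np.array([4, 17, 26, 23, 34, 23, 21, 7, 8, 4])


def nll(mu1, mu2):
    """Negative log-likelihood of the product of Poisson terms."""
    # Expected counts per bin: lambda_b = mu1 * b1 + mu2 * b2
    lam = mu1 * hist1 + mu2 * hist2
    # ln(n!) computed via the log-gamma function
    log_factorial = np.array([math.lgamma(n + 1) for n in data])
    # -ln Pois(n | lam) = lam - n * ln(lam) + ln(n!)
    return float(np.sum(lam - data * np.log(lam) + log_factorial))


print(nll(1.0, 1.0), nll(1.5, 1.5))
```

A smaller value means better agreement with the data, so scaling both templates up by a factor of 1.5 describes these counts better than the nominal normalization.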

The general template is more complicated, allowing for constraint terms and separation into arbitrary channels, but we will come back to that later.

Models in `pyhf` are defined with a JSON-like specification. In our case we can define the model as follows:

In [None]:
samples = [
    {
        "name": "sample1",
        "data": list(hist1),
        "modifiers": [
            {"name": "mu1", "type": "normfactor", "data" : None}
        ],
    },
    {
        "name": "sample2",
        "data": list(hist2),
        "modifiers": [
            {"name": "mu2", "type": "normfactor", "data" : None}
        ],
    },
]
spec = {"channels" : [{"name" : "singlechannel", "samples" : samples}]}

In [None]:
# info: `poi_name=None` is necessary here since we don't want to do a hypothesis test
model = pyhf.Model(spec, poi_name=None)

We will now run a *maximum likelihood fit*, which gives us the parameter values that maximize the likelihood, the *maximum likelihood estimates* (MLE).

In [None]:
mu1, mu2 = pyhf.infer.mle.fit(data, model)

In [None]:
mu1, mu2

We did not have to specify initial parameter values or bounds. For normalization factors the initial value is `1` by default and the bounds (fit range) are `[0, 10]`:

In [None]:
model.config.suggested_init()

In [None]:
model.config.suggested_bounds()
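As an independent cross-check of such a fit, the maximum of this simple Poisson likelihood can also be found with a few lines of `numpy`. The sketch below uses a multiplicative fixed-point update of the expectation-maximization type. This is a swapped-in technique for illustration, not the minimizer that `pyhf` uses, and it ignores the parameter bounds.

```python
import numpy as np

hist1 = 3 * np.array([0.5, 1, 2, 2.5, 2.1, 2.2, 2, 1.5, 1, 0.5])
hist2 = 3 * np.array([1, 2, 3, 4, 5, 3, 2, 1, 0.1, 0.05])
data = np.array([4, 17, 26, 23, 34, 23, 21, 7, 8, 4])

templates = np.array([hist1, hist2])  # shape (2, 10)
mu = np.ones(2)  # start from the nominal normalization

for _ in range(2000):
    lam = mu @ templates  # expected counts per bin
    # Update: mu_k <- mu_k * sum_b(n_b * b_kb / lam_b) / sum_b(b_kb)
    mu = mu * (templates @ (data / lam)) / templates.sum(axis=1)

lam = mu @ templates
# Gradient of the negative log-likelihood, should vanish at the maximum
grad = templates.sum(axis=1) - templates @ (data / lam)
print("mu:", mu)
print("gradient:", grad)
print("fitted total:", lam.sum(), "observed total:", data.sum())
```

At the maximum, the total fitted yield equals the total observed count, which is a general property of Poisson fits in which every template has a free normalization.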

Let's look at the fitted templates, together with the data:

In [None]:
hep.histplot([mu1 * hist1, mu2 * hist2], bins, stack=True, histtype="fill")
hep.histplot(data, bins, histtype="errorbar", color="black", w2=data)

Often, we are also interested in the uncertainties and correlations between fit parameters. We can use `iminuit` as a fitting backend for `pyhf` to extract them:

In [None]:
pyhf.set_backend('numpy', 'minuit')

In [None]:
parameters, correlations = pyhf.infer.mle.fit(data, model, return_uncertainties=True, return_correlations=True)

In [None]:
parameters

In [None]:
correlations
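To see where such uncertainties and correlations come from, one can invert the matrix of second derivatives (Hessian) of the negative log-likelihood at its minimum. The sketch below does this with plain `numpy` for our simple model. It assumes that the reported uncertainties correspond to this kind of Hessian-based estimate, and it locates the minimum with a simple fixed-point iteration instead of MINUIT, so small numerical differences are to be expected.

```python
import numpy as np

hist1 = 3 * np.array([0.5, 1, 2, 2.5, 2.1, 2.2, 2, 1.5, 1, 0.5])
hist2 = 3 * np.array([1, 2, 3, 4, 5, 3, 2, 1, 0.1, 0.05])
data = np.array([4, 17, 26, 23, 34, 23, 21, 7, 8, 4])
templates = np.array([hist1, hist2])  # shape (2, 10)

# Locate the maximum of the likelihood (illustrative fixed-point iteration)
mu = np.ones(2)
for _ in range(2000):
    lam = mu @ templates
    mu = mu * (templates @ (data / lam)) / templates.sum(axis=1)
lam = mu @ templates

# Hessian of -ln L: H_kl = sum_b n_b * b_kb * b_lb / lam_b**2
hessian = (templates * (data / lam**2)) @ templates.T

# Covariance estimate: inverse of the Hessian
cov = np.linalg.inv(hessian)
sigma = np.sqrt(np.diag(cov))
rho = cov[0, 1] / (sigma[0] * sigma[1])
print("sigma:", sigma)
print("correlation:", rho)
```

The correlation comes out negative: since the two templates overlap, an upward shift of one normalization can be partly compensated by a downward shift of the other.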

We can visualize the impact of these uncertainties on our fit templates using [linear error propagation](https://en.wikipedia.org/wiki/Propagation_of_uncertainty).

In this simple case, let's calculate this manually:

$$\sigma_{\lambda_b}^2 = \left(\frac{\partial \lambda_b}{\partial \mu_1}\sigma_{\mu_1}\right)^2 + \left(\frac{\partial \lambda_b}{\partial \mu_2}\sigma_{\mu_2}\right)^2 + 2 \frac{\partial \lambda_b}{\partial \mu_1}\frac{\partial \lambda_b}{\partial \mu_2}\sigma_{\mu_1}\sigma_{\mu_2}\rho_{12} = \left(b_1\sigma_{\mu_1}\right)^2 + \left(b_2\sigma_{\mu_2}\right)^2 + 2 b_1 b_2 \sigma_{\mu_1}\sigma_{\mu_2}\rho_{12}$$

In [None]:
sigma1, sigma2 = parameters[:, 1]
sigma1, sigma2

In [None]:
hist_error = np.sqrt(
    (hist1 * sigma1) ** 2 + (hist2 * sigma2) ** 2 + 2 * hist1 * hist2 * sigma1 * sigma2 * correlations[0][1]
)
hist_error
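The same propagation can be written in matrix form, $\sigma_{\lambda_b}^2 = J_b \, C \, J_b^T$, with the Jacobian $J_b = (b_1, b_2)$ and the covariance matrix $C$ of the fit parameters. This form generalizes directly to more than two parameters. The sketch below checks that both forms agree. The values of `sigma1`, `sigma2` and `rho` are made up for illustration and are not the fit results.

```python
import numpy as np

hist1 = 3 * np.array([0.5, 1, 2, 2.5, 2.1, 2.2, 2, 1.5, 1, 0.5])
hist2 = 3 * np.array([1, 2, 3, 4, 5, 3, 2, 1, 0.1, 0.05])

# Made-up illustrative uncertainties and correlation (NOT the fit results)
sigma1, sigma2, rho = 0.2, 0.15, -0.8

# Covariance matrix of (mu1, mu2)
cov = np.array([
    [sigma1**2, rho * sigma1 * sigma2],
    [rho * sigma1 * sigma2, sigma2**2],
])

# Jacobian d(lambda_b)/d(mu_k): one row per bin, one column per parameter
jac = np.stack([hist1, hist2], axis=1)  # shape (10, 2)

# Matrix form: var_b = sum_ij J_bi C_ij J_bj
var_matrix = np.einsum("bi,ij,bj->b", jac, cov, jac)

# Explicit two-parameter formula from above
var_explicit = (
    (hist1 * sigma1) ** 2
    + (hist2 * sigma2) ** 2
    + 2 * hist1 * hist2 * sigma1 * sigma2 * rho
)

print(np.allclose(var_matrix, var_explicit))
print(np.sqrt(var_matrix))
```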

In [None]:
def errorband(bins, hist, hist_error, ax=None):
    ax = ax or plt.gca()
    n = hist
    s = hist_error

    def extend(x):
        return np.append(x, x[-1])

    ax.fill_between(
        bins,
        extend(n - s),
        extend(n + s),
        step="post",
        color="black",
        alpha=0.3,
        linewidth=0
    )

In [None]:
hep.histplot([mu1 * hist1, mu2 * hist2], bins, stack=True, histtype="fill")
hep.histplot(data, bins, histtype="errorbar", color="black", w2=data)
errorband(bins, mu1 * hist1 + mu2 * hist2, hist_error)

# Advanced:  Uncertainties on the histogram templates