# Statistical Inference

[HistFactory](https://cds.cern.ch/record/1456844) is a tool for building probability density functions from template histograms, which together define a likelihood function. In this exercise we use HistFactory via [pyhf](https://pyhf.readthedocs.io/), a Python implementation of this tool. We also use the cabinetry package, a Python library for building HistFactory models and running fits with them. In the end, pyhf turns the statistical model into a likelihood function.
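To make the idea of a likelihood built from template histograms concrete, here is a minimal pure-Python sketch. It is not pyhf's implementation: the templates, the data, the 10% background uncertainty, and the linear scaling of the background are all assumptions chosen for illustration.

```python
import math


def poisson_logpmf(n, lam):
    """Log of the Poisson probability of observing n events given expectation lam."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def nll(mu, theta, data, signal, background, bkg_rel_unc=0.1):
    """Negative log-likelihood of a toy template model.

    mu    : signal strength scaling the signal template (parameter of interest)
    theta : nuisance parameter shifting the background normalization,
            constrained by a unit Gaussian
    """
    total = 0.0
    for n, s, b in zip(data, signal, background):
        # Expected yield per bin: scaled signal plus background with a
        # normalization uncertainty (linear scaling is a simplification).
        expected = mu * s + (1.0 + bkg_rel_unc * theta) * b
        total -= poisson_logpmf(n, expected)
    # Gaussian constraint term for the nuisance parameter (constant dropped).
    total += 0.5 * theta**2
    return total


# Hypothetical three-bin templates; the data equal signal + background exactly.
signal = [5.0, 10.0, 5.0]
background = [50.0, 40.0, 30.0]
data = [55, 50, 35]

nll_nominal = nll(1.0, 0.0, data, signal, background)
nll_no_signal = nll(0.0, 0.0, data, signal, background)
print(f"NLL at mu=1: {nll_nominal:.3f}, NLL at mu=0: {nll_no_signal:.3f}")
```

Because the toy data match the prediction at `mu = 1`, the NLL is lower there than at `mu = 0`. A fit searches for the parameter values that minimize this quantity.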

In [None]:
import logging
import cabinetry

logging.getLogger("cabinetry").setLevel(logging.INFO)

A statistical model can be defined declaratively using cabinetry, capturing the $\mathrm{region \otimes sample \otimes systematics}$ structure. The configuration consists of general settings (`General`), a list of phase space regions such as signal and control regions (`Regions`), a list of samples, both MC and data (`Samples`), a list of systematic uncertainties (`Systematics`), and a list of normalization factors (`NormFactors`).
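As a rough illustration of that structure, the five blocks can be pictured as a Python dictionary. This is a deliberately incomplete sketch, not the contents of `cabinetry_config.yml`: the region, sample, and systematic names are hypothetical, the field names inside each block reflect our understanding of cabinetry's schema and should be checked against its documentation, and the dictionary would not pass validation as written.

```python
# Incomplete, illustrative skeleton of a cabinetry-style configuration.
# All values are hypothetical; required fields such as input paths are omitted.
config_sketch = {
    "General": {
        "Measurement": "example_measurement",
        "POI": "ttbar_norm",  # parameter of interest
    },
    "Regions": [
        {"Name": "example_signal_region"},
    ],
    "Samples": [
        {"Name": "data", "Data": True},
        {"Name": "ttbar"},
    ],
    "Systematics": [
        # A variation that can change both the yield and the shape of ttbar.
        {"Name": "example_variation", "Type": "NormPlusShape", "Samples": "ttbar"},
    ],
    "NormFactors": [
        # Unconstrained scale factor acting on the ttbar sample.
        {"Name": "ttbar_norm", "Samples": "ttbar", "Nominal": 1.0, "Bounds": [0, 10]},
    ],
}

for block, content in config_sketch.items():
    size = len(content)
    print(f"{block}: {size} entr{'y' if size == 1 else 'ies'}")
```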

In the `Systematics` section we specify which systematic effects to take into account. The full file covers W+jets scale variations, b-tagging variations, and jet energy scale and resolution. Here we show two variations for the ttbar samples: `_ME_var` (what does the result look like if we choose another generator?) and `_PS_var` (what does the result look like if we use a different hadronizer?).
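Variations such as `_ME_var` and `_PS_var` each provide a single alternative histogram, whereas a HistFactory systematic needs an "up" and a "down" template. One common way to get both is to mirror the variation around the nominal prediction. The sketch below assumes that approach; whether and how this configuration symmetrizes is set in the config file itself, and the bin contents here are hypothetical.

```python
def symmetrize(nominal, variation):
    """Build up/down templates from a single alternative histogram.

    The variation is used as the "up" template; the "down" template
    mirrors it around the nominal, bin by bin.
    """
    up = list(variation)
    down = [2.0 * n - v for n, v in zip(nominal, variation)]
    return up, down


# Hypothetical bin contents for a nominal ttbar sample and one alternative.
nominal = [100.0, 80.0, 60.0]
alternative = [110.0, 78.0, 63.0]

up, down = symmetrize(nominal, alternative)
print("up:  ", up)
print("down:", down)
```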

In [None]:
config = cabinetry.configuration.load("cabinetry_config.yml")
cabinetry.templates.collect(config)
ws = cabinetry.workspace.build(config)
cabinetry.workspace.save(ws, "workspace.json")

In [None]:
!pyhf inspect workspace.json

Now we perform our maximum likelihood fit:

In [None]:
model, data = cabinetry.model_utils.model_and_data(ws)
fit_results = cabinetry.fit.fit(model, data)

and visualize the pulls of parameters in the fit:

In [None]:
pull_fig = cabinetry.visualize.pulls(
    fit_results, exclude="ttbar_norm", close_figure=False, save_figure=False
)

What are pulls? For a nuisance parameter in the fit, the pull is defined as $(\hat{\theta} - \theta_0)/\Delta\theta$: the difference between the fitted value $\hat{\theta}$ and the nominal (pre-fit) value $\theta_0$, divided by the pre-fit uncertainty $\Delta\theta$. Pulls help you judge how well (or how badly) the fit performed. For unbiased estimates and correctly estimated uncertainties, the pull should have a central value of 0 and an uncertainty of 1. A central value away from 0 means some feature of the data differs from the expectation, which may need investigation if the shift is large. An uncertainty below 1 means the parameter is constrained by the data, and you should check whether that constraint is legitimate or points to a modeling issue.
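The pull definition above translates directly into code. This is a minimal sketch with made-up numbers: the parameter names, the post-fit values, and the thresholds used to flag parameters are hypothetical and not taken from this fit.

```python
def pull(theta_hat, theta_0, delta_theta):
    """Pull of a nuisance parameter: (fitted - nominal) / pre-fit uncertainty."""
    return (theta_hat - theta_0) / delta_theta


# Hypothetical post-fit results: name -> (best-fit value, post-fit uncertainty),
# in units where the nominal value is 0 and the pre-fit uncertainty is 1.
postfit = {
    "example_np_a": (0.10, 0.98),
    "example_np_b": (-1.40, 0.55),
}

flagged = []
for name, (best, unc) in postfit.items():
    p = pull(best, theta_0=0.0, delta_theta=1.0)
    # Ad hoc thresholds for illustration only; pick ones suited to the analysis.
    if abs(p) > 1.0 or unc < 0.7:
        flagged.append(name)
    print(f"{name}: pull = {p:+.2f}, post-fit uncertainty = {unc:.2f}")

print("worth a closer look:", flagged)
```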

What does the model look like before and after the fit? We can visualize each with the following code:

In [None]:
model_prediction = cabinetry.model_utils.prediction(model)
figs = cabinetry.visualize.data_mc(model_prediction, data, close_figure=False)

model_prediction_postfit = cabinetry.model_utils.prediction(model, fit_results=fit_results)
figs = cabinetry.visualize.data_mc(model_prediction_postfit, data, close_figure=False)

We can see that there is very good post-fit agreement. Finally, what’s the $t\bar{t}$ cross section (for our pseudodata) divided by the Standard Model prediction?

In [None]:
poi_index = model.config.poi_index
print(f"\nfit result for ttbar_norm: {fit_results.bestfit[poi_index]:.3f} +/- {fit_results.uncertainty[poi_index]:.3f}")

We can also visualize the Negative Log-Likelihood (NLL) scan, which shows how the likelihood varies as a function of our parameter of interest (POI), the $t\bar{t}$ signal strength. The minimum of this curve corresponds to the most likely value of the POI.

In [None]:
poi_scan = cabinetry.fit.scan(
    model,
    data,
    "ttbar_norm"
)

In [None]:
import matplotlib.pyplot as plt

x = poi_scan.parameter_values
y = poi_scan.delta_nlls

plt.plot(x, y, marker='o')
plt.xlabel("POI")
plt.ylabel("ΔNLL")
plt.title("NLL Scan")
plt.grid(True)
plt.show()
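A scan like this can also be used to read off an approximate confidence interval: the points where the curve rises by a fixed amount above its minimum bound the interval. The sketch below does this on a toy single-bin counting model rather than on `poi_scan`; the yields are hypothetical. It uses a plain ΔNLL, for which the 68% threshold is 0.5. If the scanned quantity is instead $-2\Delta\log L$, the threshold is 1, so check which convention `delta_nlls` follows before applying this to the real scan.

```python
import math


def crossings(xs, ys, level):
    """x positions where the piecewise-linear curve (xs, ys) crosses `level`."""
    found = []
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if (y0 - level) * (y1 - level) < 0:
            # Linear interpolation between the two neighbouring scan points.
            found.append(x0 + (level - y0) * (x1 - x0) / (y1 - y0))
    return found


# Toy counting experiment: 100 observed events, expectation mu * 50 + 50.
n_obs, n_sig, n_bkg = 100, 50.0, 50.0


def toy_nll(mu):
    lam = mu * n_sig + n_bkg
    return lam - n_obs * math.log(lam)  # Poisson NLL up to a constant


mus = [0.5 + 0.01 * i for i in range(101)]  # scan mu from 0.5 to 1.5
nlls = [toy_nll(mu) for mu in mus]
delta_nll = [value - min(nlls) for value in nlls]

interval = crossings(mus, delta_nll, level=0.5)
print(f"approximate 68% interval: [{interval[0]:.3f}, {interval[1]:.3f}]")
```

The interval comes out slightly asymmetric around the minimum at `mu = 1` because the Poisson likelihood is not exactly parabolic.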