# Hypothesis tests with pyhf

This notebook will provide you with the tools to do sensitivity estimates which can be used for search region optimization or sensitivity projections.

## p-value for discovery of a new signal

In searches for new physics we want to know how significant a potential deviation from our Standard Model (SM) expectation is. We quantify this with a hypothesis test in which we try to reject the SM ("background only") hypothesis. For this we use a so-called **p-value** $p_0$, abstractly defined by:

$$p_0 = \int\limits_{t_\mathrm{obs}}^{\infty}p(t|H_0)\mathrm{d}t$$

where $t$ is a test statistic (a number we calculate from our data observations) and $p(t|H_0)$ is the probability distribution for $t$ under the assumption of our **null Hypothesis** $H_0$, in this case the background only hypothesis. This p-value is then typically converted into a number of standard deviations $z$, the **significance** ("number of sigmas") via the inverse of the cumulative standard normal distribution $\Phi$:

$$z = \Phi^{-1}(1 - p)$$

The typical convention for particle physics is to speak of *evidence* when $z>3$ and of an *observation* when $z>5$.
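As a quick numerical check of these conventions, the following sketch converts the two thresholds back into one-sided p-values. It relies only on the Python standard library; the helper name `significance_to_pvalue` is made up for this illustration.

```python
import math


def significance_to_pvalue(z):
    """One-sided tail probability p = 1 - Phi(z) of a standard normal."""
    # 1 - Phi(z) = 0.5 * erfc(z / sqrt(2)); erfc stays accurate for large z
    return 0.5 * math.erfc(z / math.sqrt(2))


p_evidence = significance_to_pvalue(3)
p_observation = significance_to_pvalue(5)
print(f"z = 3 -> p = {p_evidence:.2e}")
print(f"z = 5 -> p = {p_observation:.2e}")
```

So "evidence" corresponds to a p-value of roughly one in a thousand, and "observation" to a few in ten million.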

So what do we use for $t$? We want something that discriminates well between our null Hypothesis and an **alternative Hypothesis** that we have in mind. When we try to discover new physics, our null Hypothesis is the absence and the alternative Hypothesis the presence of a signal. We can parametrize this by a **signal strength** parameter $\mu$. The test statistics used in almost all LHC searches are based on the **profile likelihood ratio**

$$\Lambda_\mu = \frac{L(\mu, \hat{\hat{\theta}})}{L(\hat{\mu}, \hat{\theta})}$$

where $\theta$ are the other parameters of our model that are not part of the test, the so-called **nuisance parameters**. In contrast, the parameter that we want to test, $\mu$, is called our **parameter of interest** (POI). The nuisance parameters include all other fit parameters, like normalization factors and parameters describing uncertainties. $L(\mu, \hat{\hat{\theta}})$ is the likelihood function maximized under the condition that our parameter of interest takes the value $\mu$, and $L(\hat{\mu}, \hat{\theta})$ is the unconditionally maximized likelihood. Roughly speaking, we are calculating the fraction of the maximum possible likelihood that we can reach under our test condition. A high value speaks for our hypothesis, a low value against it. The test statistic $t_\mu$ is then defined as

$$t_\mu = -2\ln\Lambda_\mu$$

giving us a test statistic where **high values speak against the null hypothesis**.
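To make the definition concrete, here is a minimal sketch for the simplest possible case: a single-bin counting experiment without any nuisance parameters, $L(\mu) = \mathrm{Pois}(n|\mu s + b)$, where the unconditional maximum is simply $\hat{\mu} = (n - b)/s$. The function names and the numbers are illustrative assumptions, and the sketch assumes $n > 0$.

```python
import math


def log_poisson(n, nu):
    """ln Pois(n | nu) for an observed count n >= 0 and expectation nu > 0."""
    return n * math.log(nu) - nu - math.lgamma(n + 1)


def t_mu(mu, n, s, b):
    """t_mu = -2 ln(L(mu) / L(mu_hat)) with L(mu) = Pois(n | mu * s + b)."""
    mu_hat = (n - b) / s  # unconditional maximum: mu_hat * s + b = n
    return -2 * (log_poisson(n, mu * s + b) - log_poisson(n, mu_hat * s + b))


# Illustrative numbers: 12 observed events, 7 expected signal, 5 expected background
n, s, b = 12, 7, 5
for mu in (0.0, 0.5, 1.0, 2.0):
    print(f"mu = {mu:3.1f}: t_mu = {t_mu(mu, n, s, b):.3f}")
```

The test statistic vanishes at the best-fit value (here $\mu = 1$) and grows the further the tested $\mu$ is from it.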

<div class="alert alert-block alert-success">
    <b>Question 7a:</b> If we want to discover a new signal (using the p-value $p_0$), which value of $\mu$ are we testing against? Or in other words, what is our null Hypothesis?
</div>

All that's left now is to know the distribution $p(t_\mu|H_0)$. [Wilks' theorem](https://en.wikipedia.org/wiki/Wilks%27_theorem) tells us that the distribution of $t_\mu$ is asymptotically (for large sample sizes) a chi-square distribution. For the discovery p-value we use a slightly modified version of the test statistic, called $q_0$, where $\hat{\mu}$ is required to be $\geq 0$ ($q_0=0$ for $\hat{\mu} < 0$). For $q_0$ the p-value in the asymptotic limit collapses to a very simple formula:

$$p_0 = 1 - \Phi(\sqrt{q_0})$$

In other words, the significance is simply $z = \sqrt{q_0}$.

The asymptotic limit often matches quite well even for fairly small sample sizes, but it should be kept in mind this is an approximation. Alternatively, one can evaluate $p(t_\mu|H_0)$ by Monte Carlo sampling ("toys").
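The quality of the approximation can be checked in a toy setting. The sketch below assumes a counting experiment with exactly known background (no nuisance parameters), for which $q_0 = 2\,(n\ln(n/b) - (n - b))$ when $n > b$. Because $q_0$ then grows monotonically with $n$, the exact p-value is just the Poisson tail probability $P(N \geq n_\mathrm{obs}|b)$, so no sampling is needed. All function names are illustrative.

```python
import math


def q0_counting(n, b):
    """q0 for L(mu) = Pois(n | mu * s + b) with b known exactly."""
    if n <= b:  # mu_hat <= 0: no excess, so q0 = 0
        return 0.0
    return 2 * (n * math.log(n / b) - (n - b))


def p0_asymptotic(q0):
    """Asymptotic p-value p0 = 1 - Phi(sqrt(q0)), written with erfc."""
    return 0.5 * math.erfc(math.sqrt(q0) / math.sqrt(2))


def p0_exact(n, b):
    """Exact Poisson tail probability P(N >= n | b)."""
    cdf_below = sum(math.exp(-b) * b**k / math.factorial(k) for k in range(n))
    return 1 - cdf_below


n_obs, b = 12, 5
q0 = q0_counting(n_obs, b)
print(f"q0 = {q0:.3f}, z = {math.sqrt(q0):.3f}")
print(f"asymptotic p0 = {p0_asymptotic(q0):.5f}")
print(f"exact p0      = {p0_exact(n_obs, b):.5f}")
```

For these small counts the two p-values agree in order of magnitude but not to better than a few tens of percent, which is the kind of difference toys are meant to catch.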

## CLs for exclusion of an absent signal

Now, sadly, not all searches find evidence for new physics. What we can still do in such a case is to try to exclude models by rejecting the hypothesis of a signal being present. That usually means we test against $\mu=1$ or some other value $>0$. The rest of the procedure is very similar, with one detail worth mentioning: in high energy physics it is very common to use a quantity called $CL_s$ instead of a plain p-value. It is defined by

$$CL_s = \frac{CL_{s+b}}{CL_{b}}$$

where $CL_{s+b}$ is the p-value for rejecting the hypothesis of signal + background being present (what would be the "normal" p-value) and $CL_{b}$ is the p-value for rejecting the background only hypothesis, but now using the test statistic for $\mu=1$ (so this is different from $p_0$!). We won't go into further details how to calculate those p-values. `pyhf` has the formulas included and does it automatically for us. The asymptotic distributions for all different variants are described in the paper "Asymptotic formulae for likelihood-based tests of new physics" ([arXiv:1007.1727](https://arxiv.org/abs/1007.1727)).

Just a qualitative explanation of why we use $CL_s$ instead of the p-value: We want to avoid excluding signals in cases where we don't have sensitivity, but observe an *underfluctuation* of the data. In these cases $CL_{s+b}$ and $CL_b$ will be very similar and consequently lead to a large value of $CL_{s}$, telling us the signal is **not** excluded. If our observations exactly match the background expectation, $CL_b = 0.5$ in the asymptotic limit, so for a typical background-only outcome $CL_s$ is about twice as large as the plain p-value.

The typical convention for particle physics is to speak of an **exclusion** of a signal if $CL_s < 0.05$.
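The protection against underfluctuations can be illustrated with a simplified counting experiment. This sketch assumes exactly known `s` and `b` and orders outcomes by the event count instead of the profile likelihood ratio, so that $CL_{s+b} = P(N \leq n|s+b)$ and $CL_b = P(N \leq n|b)$. It is not what `pyhf` computes internally, and all names and numbers are illustrative.

```python
import math


def poisson_cdf(n, nu):
    """P(N <= n | nu) for a Poisson-distributed count N."""
    return sum(math.exp(-nu) * nu**k / math.factorial(k) for k in range(n + 1))


def cls_counting(n, s, b):
    """Return (CLs, CL_{s+b}, CL_b) for a counting experiment with known s and b."""
    cl_sb = poisson_cdf(n, s + b)  # chance of seeing <= n events with signal
    cl_b = poisson_cdf(n, b)  # chance of seeing <= n events without signal
    return cl_sb / cl_b, cl_sb, cl_b


# Sensitive search: large signal, observation equal to the background expectation
cls_sens, clsb_sens, clb_sens = cls_counting(n=5, s=7, b=5)
print(f"sensitive:   CLs+b = {clsb_sens:.4f}, CLb = {clb_sens:.4f}, CLs = {cls_sens:.4f}")

# Insensitive search: tiny signal, but the data fluctuate far below the background
cls_low, clsb_low, clb_low = cls_counting(n=1, s=1, b=5)
print(f"insensitive: CLs+b = {clsb_low:.4f}, CLb = {clb_low:.4f}, CLs = {cls_low:.4f}")
```

In the second case $CL_{s+b}$ alone falls below 0.05, although the experiment can hardly distinguish the two hypotheses; dividing by the equally small $CL_b$ keeps $CL_s$ well above the exclusion threshold.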

## Discovery or exclusion of a signal for a cut & count experiment

Let's start with a simple case where we only want to count the number of events in a certain search region. We assume a certain number of expected background events `b`, expected signal events `s` and a total uncertainty on the expected background `delta_b` ($\sigma_b$).

The likelihood function for this can be formulated as a primary measurement of `n` events and a control ("auxiliary") measurement of `m` events that constrains our background parameter within the uncertainty. So, a product of two Poisson distributions:

$$L(s, b) = \mathrm{Pois}(n|s + b)\cdot \mathrm{Pois}(m|\tau b)$$

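The parameter $\tau$ can be expressed in terms of $\sigma_b$ by asking the question "How many times more events do I have to measure in the control region to get the relative uncertainty $\sigma_b / b$?". Requiring the relative Poisson uncertainty $1/\sqrt{\tau b}$ of the control measurement to equal $\sigma_b / b$ gives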

$$\tau = \frac{b}{\sigma_b^2}$$

Equivalently, we can replace $b$ by $\gamma b$ and $s$ by $\mu s$ to fit normalization factors (initialized to 1) and keep $s$ and $b$ fixed to our expectation.

$$L'(\mu, \gamma) = L(\mu s, \gamma b)$$
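Written out by hand, this likelihood could look like the following sketch. It fixes the auxiliary measurement to its nominal value $m = \tau b$ and extends the Poisson distribution to non-integer counts via the gamma function, both of which are assumptions of this illustration; the function names and numbers are made up as well.

```python
import math


def log_poisson(n, nu):
    """ln Pois(n | nu), extended to non-integer n via the gamma function."""
    return n * math.log(nu) - nu - math.lgamma(n + 1)


def log_likelihood(mu, gamma, n, s, b, sigma_b):
    """ln L'(mu, gamma) = ln Pois(n | mu*s + gamma*b) + ln Pois(m | tau*gamma*b)."""
    tau = b / sigma_b**2
    m = tau * b  # auxiliary "measurement", fixed to its nominal value
    return log_poisson(n, mu * s + gamma * b) + log_poisson(m, tau * gamma * b)


# Illustrative numbers: the observation sits exactly at the s + b expectation
s, b, sigma_b = 7, 5, 2
n = s + b
ll_best = log_likelihood(1.0, 1.0, n, s, b, sigma_b)
ll_bkg_only = log_likelihood(0.0, 1.0, n, s, b, sigma_b)
print(f"ln L'(mu=1, gamma=1) = {ll_best:.3f}")
print(f"ln L'(mu=0, gamma=1) = {ll_bkg_only:.3f}")
```

With both normalization factors at 1 each Poisson term sits at its maximum, so any other choice of $\mu$ or $\gamma$ lowers the likelihood for this data.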

`pyhf` has a convenience function to create the specification for such a model: `pyhf.simplemodels.hepdata_like` (more recent `pyhf` releases provide it under the name `pyhf.simplemodels.uncorrelated_background`). It also works for arbitrarily many bins, but for now let's go with one bin and 5 expected background events, 7 expected signal events and an uncertainty of 2 on the expected background events:

In [None]:
import pyhf
from scipy import stats

In [None]:
s = 7
b = 5
delta_b = 2

In [None]:
model = pyhf.simplemodels.hepdata_like(
    signal_data=[s], bkg_data=[b], bkg_uncerts=[delta_b]
)

The model comes with a "parameter of interest" (POI) called `mu` that is our signal strength:

In [None]:
model.config.poi_name

In addition, we have one nuisance parameter, the constrained background normalization $\gamma$, called `uncorr_bkguncrt` here:

In [None]:
model.config.par_order

Its initial value should be 1:

In [None]:
gamma_initial = 1

So the expected data in our model scales with `mu`. For `mu=1` we get `5 + 1 * 7 = 12`:

In [None]:
model.expected_actualdata([1, gamma_initial])

For `mu=2` we get `5 + 2 * 7 = 19`:

In [None]:
model.expected_actualdata([2, gamma_initial])

The auxiliary data corresponds to $\tau b$ in the formula above:

In [None]:
model.config.auxdata

Its value follows from `b` and our background uncertainty `delta_b` as $\tau b = b^2 / \sigma_b^2$:

In [None]:
b ** 2 / (delta_b ** 2)

To get the p-value for rejection of the background only hypothesis, we call `pyhf.infer.hypotest` with the test value 0 of our POI $\mu$ using the `q0` test statistic.

We want to know which p-value we would get if we observed an excess of events of precisely the expected signal, so we plug in `s + b` for the data:

In [None]:
pvalue = pyhf.infer.hypotest(
    poi_test=0,
    data=[s + b] + model.config.auxdata,
    pdf=model,
    test_stat="q0"
)
pvalue

We can convert this into a significance (number of standard deviations) using the inverse of the cumulative standard normal distribution $\Phi$. Note: for most implementations $z = -\Phi^{-1}(p)$ is more numerically stable than $z = \Phi^{-1}(1 - p)$ for small p-values.

In [None]:
def pvalue_to_significance(pvalue):
    return - stats.norm.ppf(pvalue)

In [None]:
pvalue_to_significance(pvalue)

That would not count as "Evidence" yet.
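For this simple model the two fits entering $q_0$ can also be done by hand, which gives a cross-check that should land close to the `pyhf` result, assuming `pyhf` builds the two-Poisson likelihood described above. In the sketch below the auxiliary measurement is fixed to $m = \tau b$; the unconditional fit then yields $\hat{\gamma} = 1$ and $\hat{\mu} = (n - b)/s$, while the background-only fit yields $\hat{\hat{\gamma}} = (n + m) / (b\,(1 + \tau))$. The function name is hypothetical.

```python
import math


def q0_with_background_uncertainty(n, b, sigma_b):
    """q0 for L'(mu, gamma) = Pois(n | mu*s + gamma*b) * Pois(m | tau*gamma*b)."""
    tau = b / sigma_b**2
    m = tau * b  # nominal auxiliary measurement
    if n <= b:  # mu_hat <= 0: no excess, so q0 = 0
        return 0.0
    # Background-only (mu = 0) fit: gamma absorbs the excess as far as m allows
    gamma_cond = (n + m) / (b * (1 + tau))
    # The summed expectations equal n + m in both fits, so only log terms remain
    return 2 * (n * math.log(n / (gamma_cond * b)) - m * math.log(gamma_cond))


# Same numbers as in the pyhf example: s = 7, b = 5, delta_b = 2, data = s + b
q0 = q0_with_background_uncertainty(n=12, b=5, sigma_b=2)
z = math.sqrt(q0)
p0 = 0.5 * math.erfc(z / math.sqrt(2))
print(f"q0 = {q0:.3f}, z = {z:.3f}, p0 = {p0:.4f}")
```

Compared to the case of an exactly known background, the uncertainty on `b` reduces the significance noticeably, because part of the excess can be attributed to a larger background normalization.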

<div class="alert alert-block alert-success">
    <b>Exercise 7b:</b> How many excess events would we need to observe in our search region (assuming an unchanged expected background) to have the potential for finding evidence (3 $\sigma$) of a new signal?
</div>

Similarly, we can test for exclusion and calculate $CL_s$. For that we use 1 as the test value for $\mu$ and the `qtilde` test statistic.

We want to know if we could exclude a signal if we observed no more events than our background expectation, so we set our data to `b`:

In [None]:
CLs = pyhf.infer.hypotest(
    poi_test=1,
    data=[b] + model.config.auxdata,
    pdf=model,
    test_stat="qtilde"
)
CLs

<div class="alert alert-block alert-success">
    <b>Question 7c:</b> Would that signal count as excluded?
</div>

## Run an upper limit scan

## Run a scan in a signal grid