# Homework 4
## Due by 11:59 pm November 1, 2022

**Submit this notebook as well as the PDF printout to bcourse**

**Please add your SID to the file name**

# Problem  Hypothesis Testing 

In this homework problem, we will use data from the Higgs boson discovery paper to perform a few exercises. They will help you familiarize yourself with concepts and procedures in statistical analysis and give you experience with numpy and other Python tools.

## Introduction
In July 2012, the ATLAS and CMS experiments announced the discovery of a new particle in their searches for the Higgs boson. It was later confirmed that this new particle was indeed a Higgs boson, as predicted by the Standard Model of particle physics. One of the two measurements that helped establish the evidence for the Higgs boson was the analysis of collision events that contain four leptons, often referred to as the Higgs to four-lepton channel or simply $H \rightarrow 4\ell$. In this process, the colliding protons produce a Higgs boson, which then decays to four leptons. We can reconstruct the mass of the Higgs boson from the energies and momenta of the four leptons measured by our detector. This is conceptually the same procedure used in Problem 4 of your last homework, where you reconstructed the mass of the Z boson from its decay products (two leptons). In this problem, we do not worry about the reconstruction of the Higgs boson mass. Instead, we work directly with the observed data produced by the ATLAS experiment.

The ATLAS experiment collected three independent samples of collision events that contain four leptons. These three samples differ from each other in their lepton composition. The ATLAS experiment considered two types of leptons: electrons and muons. The three samples are a sample of events with exactly four muons, a sample of events with exactly four electrons, and a sample of events with exactly two electrons and exactly two muons. These samples are referred to as the $4\mu$, $4e$, and $2e2\mu$ samples, respectively. 

#### Observed data
The ATLAS experiment reconstructed the four-lepton invariant mass ($m_{4\ell}$), which should equal the Higgs boson mass within the resolution of the detector. It then reported the number of four-lepton events observed in the mass range 120 GeV < $m_{4\ell}$ < 130 GeV. They are:

$$ 4\mu : 6 $$
$$ 2e2\mu : 5 $$
$$ 4e : 2 $$

#### Background-only hypothesis
The ATLAS experiment also estimated the expected number of events from the `background-only hypothesis` through computer simulation and other data-driven methods. The background-only hypothesis *assumes* that the Higgs boson does not exist in nature and that anything collected by the experiment comes from processes that were already known to physicists at the time (July 2012). These known processes are called the background processes. The expected numbers of events from the `background-only` hypothesis (also referred to as the expected background events/yields, expected number of background events, etc.) are given as:

$$ 4\mu : 1.25 $$
$$ 2e2\mu : 2.07 $$
$$ 4e : 1.53 $$

In other words, if indeed the Higgs boson doesn't exist, the ATLAS experiment would see 1.25 events in the $4\mu$ sample. Of course, the 1.25 events is the expected number of events. The actual outcomes can be any integer number of events, and these outcomes would follow a Poisson distribution with a mean of 1.25.

#### Signal-plus-background hypothesis

From theory calculation and computer simulation, the ATLAS experiment was able to predict the expected number of Higgs boson signal events, assuming that the Higgs boson exists in nature. These expected signal events are given as:

$$ 4\mu : 2.09 $$
$$ 2e2\mu : 2.29 $$
$$ 4e : 0.9 $$

We must use **both the expected signal events and the expected background events** to construct a `signal-plus-background` hypothesis, which is a possible alternative to the background-only hypothesis. This is because in any experiment one would never totally eliminate the background (or noise in some contexts) in their experiment, and therefore, background must be part of their alternative hypothesis (to the background-only one). 

The signal-plus-background hypothesis would predict the following expected number of events:

$$ 4\mu : 1.25 + 2.09 $$
$$ 2e2\mu : 2.07 + 2.29 $$
$$ 4e : 1.53 + 0.9 $$

What these numbers tell you is the following. If the Higgs boson exists in nature and its production rate (i.e., how often it is produced in proton-proton collisions) is correctly predicted by theory, then the possible experimental outcomes (observed numbers of events) in the 4$\mu$ sample follow a Poisson distribution with a mean of 3.34. As always, the actual experimental outcome could be any integer number of events from 0 to a very large number, but the probability of each outcome is given by the Poisson function.
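To make the two hypotheses concrete, the following sketch tabulates the Poisson probability of each possible count in the 4$\mu$ sample under both expectations. It assumes `scipy.stats.poisson.pmf` for the Poisson function; the variable names (`lam_b`, `lam_sb`, `p_b`, `p_sb`) are illustrative and not part of the assignment.

```python
import numpy as np
from scipy.stats import poisson

# Expected yields in the 4mu sample, copied from the text above
lam_b = 1.25           # background-only hypothesis
lam_sb = 1.25 + 2.09   # signal-plus-background hypothesis

k = np.arange(0, 11)            # possible observed counts 0..10
p_b = poisson.pmf(k, lam_b)     # P(k | background-only)
p_sb = poisson.pmf(k, lam_sb)   # P(k | signal-plus-background)

for ki, pb, psb in zip(k, p_b, p_sb):
    print(f"k={ki:2d}  P(k|b)={pb:.4f}  P(k|s+b)={psb:.4f}")
```

Large counts such as the observed 6 events are much more probable under the signal-plus-background expectation than under the background-only one, which is the intuition the rest of the problem quantifies.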

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Observed counts and expected yields, in the order 4mu, 2e2mu, 4e.
# Each quantity can be a one-dimensional array of any size;
# the number of entries equals the number of independent measurements.

obs = np.array([6, 5, 2])
bkg = np.array([1.25,2.07,1.53])
sig = np.array([2.09, 2.29, 0.9])




## Part 0 Visualization of the observed result

- A visualization of the measurement result is created here to help you understand the context.
- The plot shows three distributions:
    - Black dots with vertical error bars are the observed data in the three samples.
    - The orange filled histogram is the expected number of background events in the three samples, corresponding to the expectation of the background-only hypothesis.
    - The histogram shown as the blue dashed line is the expected number of signal events plus the expected number of background events, corresponding to the expectation of the signal-plus-background hypothesis.

**You do not need to modify the code cell below**

In [None]:
fig,ax=plt.subplots()
plt.hist([0.5,1.5,2.5], bins=3, range=(0,3),weights=sig+bkg,histtype="step",ls="dashed",label="Expected signal + background")
plt.hist([0.5,1.5,2.5], bins=3, range=(0,3),weights=bkg,label="Background")
plt.errorbar([0.5,1.5,2.5], obs, yerr=np.sqrt(obs), fmt='o',color='black')

plt.ylabel("Number of events")
plt.xlabel("Region")
plt.xticks([0.5,1.5,2.5,3.5])
plt.xlim(0,3)
plt.ylim(0,10)
plt.legend(frameon=False)


plt.text(0.25,9,"ATLAS Reproduced",fontsize=14, style='italic', weight='bold')
# raw strings keep the LaTeX backslash from being read as an escape sequence
ax.set_xticklabels([r'4$\mu$',r'2e2$\mu$','4e',''])


## Part 1  Log Poisson function
In your code, define a **log Poisson** function, as described below.

The Poisson function is defined as:
$$ Poisson(k, \lambda) = \frac{e^{-\lambda}\lambda^{k}}{k!},$$ where $k$ is the number of observed events and $\lambda$ is the expectation.

We get the **logPoisson** function when we take the logarithm of both sides of the equation
$$ \mathrm{log}Poisson(k,\lambda) = -\lambda + k\mathrm{log}(\lambda) - \mathrm{log}(k!)$$

This can be computed directly with the `logpmf` method of `scipy.stats.poisson`.

- **Calculate the log Poisson value for each one of the three samples (4$\mu$, 2e2$\mu$, and 4$e$) under the background-only hypothesis**
    - $\lambda$ should take the value of the expected number of background events 
    - $k$ should take the value of the observed number of events
    - you should get three numbers, one for each sample
    
- **Calculate the log Poisson value for each one of the three samples (4$\mu$, 2e2$\mu$, and 4$e$) under the signal-plus-background hypothesis**
    - $\lambda$ should take the value of the expected number of background events plus the expected number of signal events
    - $k$ should take the value of the observed number of events
    - you should get three numbers, one for each sample
- **Create a markdown cell to state your results in a clear and unambiguous format**
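As a sanity check on the formula above, the sketch below writes the log Poisson expression by hand and compares it with `scipy.stats.poisson.logpmf`. The helper name `log_poisson_manual` and the toy inputs are hypothetical and are not the homework values; $\mathrm{log}(k!)$ is evaluated as `gammaln(k + 1)`, which is an implementation choice made here to avoid overflow for large $k$.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

def log_poisson_manual(k, lam):
    """log Poisson(k; lam) = -lam + k*log(lam) - log(k!), using log(k!) = gammaln(k + 1)."""
    k = np.asarray(k, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return -lam + k * np.log(lam) - gammaln(k + 1.0)

# Toy inputs for illustration only (not the homework data)
k_toy = np.array([0, 3, 7])
lam_toy = np.array([0.5, 2.5, 4.0])

manual = log_poisson_manual(k_toy, lam_toy)
reference = poisson.logpmf(k_toy, lam_toy)

print(manual)
print(reference)
```

Both approaches accept numpy arrays and return one value per entry, so the three samples can be evaluated in a single call.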

In [None]:
from scipy.stats import poisson
def logPoisson( k, Lambda ):
    # define the log Poisson here 
    # you may either use an existing method of scipy.stats.poisson
    # or write your own code here 
    return 

In [None]:
# Develop your code here

State your results here:



## Part 2 Negative Log Likelihood Ratio
As alluded to in the previous part, the logPoisson function takes two inputs: the expectation and the observation. In this problem, the ATLAS experiment's observation consists of the observed numbers of events in the three samples, and the expectation can come from either the b-only hypothesis or the signal-plus-background hypothesis, each of which gives the expected numbers of events in the three samples. 

We can construct a likelihood function from the ATLAS experimental results, combining the observed numbers of events with one of the hypotheses. For example, P(obs|b-only) is the likelihood of obtaining the ATLAS observation given the background-only hypothesis. This likelihood function is defined as follows:

$$ \mathrm{P(obs|b-only)} = \prod_{i}  Poisson(k_i, \lambda_{i}) $$

where subscript $i$ indicates the sample ($4\mu$, $2e2\mu$, or $4e$). In other words, the likelihood function for multiple measurements is the product of their individual likelihood functions. 

The log likelihood of the measurement is then: 

$$ \mathrm{log(P(obs|b-only))} = \sum_{i} \mathrm{log} Poisson(k_i, \lambda_{i}) $$

Here you can see that by taking the log of the likelihood function, we convert the product of multiple Poisson functions to a sum of multiple logPoisson functions. This simplifies the computation. 
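One practical reason to prefer the sum of logs is numerical: a product of many small probabilities underflows in double precision, while the sum of their logarithms stays finite. The sketch below illustrates this with a toy setup of 1000 independent counting measurements; the mean of 2.0 and the seed are arbitrary assumptions chosen only for the demonstration.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility

# Toy setup: 1000 independent counting measurements, each with mean 2.0
lam = np.full(1000, 2.0)
k = rng.poisson(lam)

# Product of 1000 probabilities, each below 0.28: underflows to 0.0
likelihood = np.prod(poisson.pmf(k, lam))

# Sum of 1000 log probabilities: a finite negative number
log_likelihood = np.sum(poisson.logpmf(k, lam))

print(likelihood)
print(log_likelihood)
```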

**In this part, you are required to use the logPoisson function you defined in the previous cell, to calculate the following two log likelihood functions:**

- the value of log likelihood function with the ATLAS observed number of events and **the background-only hypothesis**
    - conceptually, this would be logPoisson(6,1.25) + logPoisson(5,2.07) + logPoisson(2,1.53), but can you make this more concise with numpy arrays? What if we have 1000 terms instead of 3, do you really want to write a long sequence of terms?
- the value of log likelihood function with the ATLAS observed number of events and **the signal-plus-background hypothesis**

Hint: if your calculation is correct, you should get -11.0828 and -5.8159 for these two quantities, respectively.

- **calculate the `negative log likelihood ratio (NLLR)`, defined as**
$$ - 2\cdot\mathrm{log}\frac{\mathrm{P(obs|sb)}}{\mathrm{P(obs|b-only)}} $$
which is simply
    $$ - 2\cdot( \mathrm{log}\mathrm{P(obs|sb)} - \mathrm{log}\mathrm{P(obs|b-only)} ) \\
    = 2\cdot(\mathrm{log}\mathrm{P(obs|b-only)} - \mathrm{log}\mathrm{P(obs|sb)}) $$
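One possible way to organize this calculation with numpy arrays is sketched below. It uses `scipy.stats.poisson.logpmf` in place of your own logPoisson function, and the helper name `log_likelihood` is hypothetical. The input arrays are copied from the cell near the top of the notebook.

```python
import numpy as np
from scipy.stats import poisson

obs = np.array([6, 5, 2])
bkg = np.array([1.25, 2.07, 1.53])
sig = np.array([2.09, 2.29, 0.9])

def log_likelihood(k, lam):
    """Sum of the per-sample log Poisson terms (hypothetical helper)."""
    return np.sum(poisson.logpmf(k, lam))

ll_b = log_likelihood(obs, bkg)          # background-only hypothesis
ll_sb = log_likelihood(obs, sig + bkg)   # signal-plus-background hypothesis
nllr = -2.0 * (ll_sb - ll_b)             # negative log likelihood ratio

print(f"log L(b-only) = {ll_b:.4f}")
print(f"log L(s+b)    = {ll_sb:.4f}")
print(f"NLLR          = {nllr:.2f}")
```

Because `logpmf` is evaluated element-wise on the arrays, the same code works unchanged for 3 samples or 1000.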


In [None]:
LLB = 
LLSB = 
NLLR= 
print('The log likelihood of the B-only hypothesis is %4.4f' % LLB)
print('The log likelihood of the Signal plus background hypothesis is %4.4f' % LLSB)
print('The NLLR is {:4.2f}'.format(NLLR))

## Part 3 Pseudo experiment

- **Use the background-only hypothesis to generate two million pseudo experiments**
    - If the underlying truth is the background-only hypothesis, then the experimental outcome is a set of three Poisson random numbers drawn from the expected background yields in the three samples ($4\mu$, $2e2\mu$ and $4e$). In other words, each pseudo experiment has three pseudo observed numbers, all of them Poisson random numbers, generated from mean values of [1.25,2.07,1.53], respectively. Recall that these are the expected numbers of events for the background-only hypothesis.
    - Hints: you may consult the Jupyter notebook of Lecture 7, where an example of pseudo experiment generation was shown.
        - you should use a numpy array to store the ensemble of 2,000,000 pseudo experiments. Since each pseudo experiment has three observed numbers of events, your numpy array should have a shape of (2000000,3). Each entry along axis 0 is a pseudo experiment; each such entry has three entries along axis 1, which are the three observed numbers of events of that pseudo experiment.
        
- **Use the signal-plus-background hypothesis to generate two million pseudo experiments**
    - repeat the procedure developed for generating the b-only pseudo experiments but change the expectation to signal-plus-background
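The generation step can be sketched as follows, assuming numpy's `Generator.poisson` as the random number source (the Lecture 7 notebook may use a different call). The ensemble size is reduced to 100,000 and the seed is arbitrary, both to keep the sketch fast; the name `toys_b` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for reproducibility

bkg = np.array([1.25, 2.07, 1.53])
n_pseudo = 100_000  # reduced from 2,000,000 for a quick demonstration

# lam has shape (3,) and broadcasts against size=(n_pseudo, 3),
# so column j is drawn from a Poisson distribution with mean bkg[j]
toys_b = rng.poisson(lam=bkg, size=(n_pseudo, 3))

print(toys_b.shape)         # (100000, 3)
print(toys_b[:3])           # first three pseudo experiments
print(toys_b.mean(axis=0))  # should be close to bkg
```

The column means are a quick check that each column was drawn from the intended expectation.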

In [None]:
# pseudo experiments generated under the b-only hypothesis
sampleB = 

# pseudo experiments generated under the S+B hypothesis

sampleSB = 

- **The log likelihood ratio distribution for pseudo experiments generated under the b-only hypothesis**
    - For each pseudo experiment generated under the b-only hypothesis, calculate the log likelihood ratio for it.
    - Since you have 2 million pseudo experiments generated from the b-only hypothesis, your LLR distribution also consists of 2 million entries
    
- **The log likelihood ratio distribution for pseudo experiments generated under the signal-plus-background hypothesis**
    - Repeat the above steps for pseudo experiments generated under the signal-plus-background hypothesis
    
- **Plot the LLR distributions from the two ensembles of pseudo experiments**
    - Draw the LLR distribution of the pseudo experiments generated from the B-only hypothesis and the LLR distribution of pseudo experiments generated from the S+B hypothesis
    - Use a vertical line to indicate the observed LLR value from data
    - Label the plot properly, and your final plot should look like the one shown below
    <img src="https://portal.nersc.gov/project/m3438/physics77/HW4/PE.png" width=500>
    - This plot gives you an intuitive understanding of the "compatibility" between observed data and a hypothesis. 
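A vectorized way to evaluate the test statistic for a whole ensemble is sketched below. It assumes the NLLR convention of Part 2, $-2\cdot\mathrm{log}[\mathrm{P(obs|sb)}/\mathrm{P(obs|b-only)}]$, and uses a reduced ensemble size; the helper name `nllr_per_experiment` and the seed are hypothetical choices.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(seed=1)  # arbitrary seed

bkg = np.array([1.25, 2.07, 1.53])
sig = np.array([2.09, 2.29, 0.9])
n_pseudo = 50_000  # reduced ensemble size for a quick demonstration

toys_b = rng.poisson(bkg, size=(n_pseudo, 3))
toys_sb = rng.poisson(sig + bkg, size=(n_pseudo, 3))

def nllr_per_experiment(toys):
    """NLLR for each row of toys; the sum over axis 1 combines the three samples."""
    ll_sb = poisson.logpmf(toys, sig + bkg).sum(axis=1)
    ll_b = poisson.logpmf(toys, bkg).sum(axis=1)
    return -2.0 * (ll_sb - ll_b)

nllr_b = nllr_per_experiment(toys_b)    # one value per b-only pseudo experiment
nllr_sb = nllr_per_experiment(toys_sb)  # one value per S+B pseudo experiment

print(nllr_b.shape, nllr_b.mean(), nllr_sb.mean())
```

The two resulting arrays can be passed to `plt.hist` with a common binning to draw the two distributions. With this sign convention, the S+B ensemble sits at lower (more negative) values than the b-only ensemble.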
    


In [None]:
# Develop your code here

## Part 4 p-value and significance of the observation
- Calculate the p-value for the background-only hypothesis
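    - First, count the pseudo experiments generated under the background-only hypothesis that are at least as signal-like as the observed data. With the NLLR convention of Part 2, $-2\cdot\mathrm{log}[\mathrm{P(obs|sb)}/\mathrm{P(obs|b-only)}]$, more signal-like means a smaller (more negative) value, so count the entries that are less than or equal to the observed value. If you instead define the LLR as $\mathrm{log}[\mathrm{P(obs|sb)}/\mathrm{P(obs|b-only)}]$, count the entries that are greater than or equal to the observed value.
    - Second, the p-value is the ratio of the above number to the total number of pseudo experiments generated under the b-only hypothesis.
    - Last, convert the observed p-value to a one-sided significance using the norm.ppf method of scipy.stats, e.g. norm.ppf(1 - pvalue)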
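The p-value and significance steps can be illustrated on a toy ensemble. In the sketch below, a standard normal sample stands in for the null-hypothesis distribution of a test statistic, and smaller values are treated as more signal-like; the names `p_value_to_z`, `t_null`, and `t_obs` are hypothetical, and the numbers are not the homework result. `norm.isf(p)` is used as an equivalent of `norm.ppf(1 - p)`.

```python
import numpy as np
from scipy.stats import norm

def p_value_to_z(p):
    """One-sided Gaussian significance; norm.isf(p) is equivalent to norm.ppf(1 - p)."""
    return norm.isf(p)

# Toy ensemble standing in for a test-statistic distribution under the null hypothesis
rng = np.random.default_rng(seed=7)  # arbitrary seed
t_null = rng.normal(size=1_000_000)
t_obs = -2.0  # toy observed value; smaller is treated as more signal-like here

# Fraction of null pseudo experiments at least as extreme as the observation
p_toy = np.mean(t_null <= t_obs)
z_toy = p_value_to_z(p_toy)

print(f"p-value = {p_toy:.3e}, significance = {z_toy:.2f}")
```

A finite ensemble of $N$ pseudo experiments can only resolve p-values down to about $1/N$. If no pseudo experiment passes the cut, the estimated p-value is 0 and the significance is infinite, which signals insufficient statistics rather than a real result.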

**In particle physics, if the background-only hypothesis is rejected at three standard deviations or beyond, we claim that evidence of a signal is established. Does your result support a claim of evidence for the Higgs boson signal in the four-lepton final state?**

In [None]:
from scipy.stats import norm

# develop your code here


pvalue =

Z = 

print("The observed p-value for background-only hypothesis is {:.2e}".format(pvalue))

print("The corresponding significance for rejecting b-only hypothesis is {:2.2f}".format(Z))



## Part 5 Measure the size of the signal

**Signal strength ($\mu$)**

The expected number of signal events given in the introductory part of this problem is based on the theory prediction. The actual size of the signal observed in data may differ from the prediction. We parameterize the expectation of a given sample $i$ as $$ \lambda_i = \mu{s_i} + b_i,$$ where $s_i$ and $b_i$ are the expected number of signal events and the expected number of background events, respectively, and $\mu$ is a variable that scales the expected signal up or down. $\mu$ is known as the signal strength. Note that $\mu$ does not have a subscript $i$, meaning that it is assumed to be the same for the three samples (4$\mu$, 2e2$\mu$, and 4$e$). The relative expected numbers of signal events among these samples are dictated by underlying physics that we are fairly confident about. We assume that the difference between data and theory prediction can be absorbed into this single scale factor $\mu$. For the nominal theory prediction of the signal, $\mu$ is 1. For our observed data, $\mu$ is likely to be different from 1. We are interested in measuring the $\mu$ value from data. 

- To measure $\mu$, we create a series of `mu*S+B` hypotheses (with mu = np.linspace(0,3,3001), as an example) and use them to calculate the negative log likelihood ratio of the *observed data*. 
    - In the `mu*S + B hypothesis`, the expected number of signal events is $\mu\cdot{s_i}$, and the expectation is then $\mu\cdot{s_i}+b_i$
    - We refer to these $\mu$ dependent NLLR values as NLLR(mu)

- Draw the quantity of NLLR(mu) - min{NLLR(mu)} as a function of $\mu$. 
    - min{NLLR(mu)} is the minimum value of NLLR(mu)
        - e.g., say your NLLR(mu) for this series of mu values is NLLR_mu, which is a numpy array of shape (3001,), subtract the minimum value of the NLLR_mu from all entries of NLLR_mu by NLLR_mu - np.min(NLLR_mu)
    - this quantity is labeled as $-2\Delta{LL}$ in the plot

- The $\mu$ value that minimizes $-2\Delta{LL}$ should be the central value of the measured $\mu$, and we denote it as $\hat{\mu}$

- The two $\mu$ values that give a $-2\Delta{LL}$ value that differs from the minimum by 1 should define the $\pm 1 \sigma$ uncertainties. 
    - One of these $\mu$ value is smaller than the central value and is referred to as $\mu_{lo}$, and the other is larger than the central value and is referred to as $\mu_{hi}$
    - The $+1 \sigma$ and $-1 \sigma$ uncertainties of the $\mu$ measurement are then $\mu_{hi} - \hat{\mu}$ and $\mu_{lo} - \hat{\mu}$, respectively
    - We can present the measurement of $\mu$ in the format of
    # $$ \mu = \hat{\mu}^{+ (\mu_{hi} - \hat{\mu})}_{- (\hat{\mu} - \mu_{lo})}$$

**Produce the $-2\Delta{LL}$ plot and report the central value and the $\pm 1 \sigma$ uncertainty of the measured $\mu$**

Your plot will look like the one linked below. 
<img src="https://portal.nersc.gov/project/m3438/physics77/HW4/muscan.png" width=500>
This plot should also help you understand the measurement procedure. 
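The scan described above can be sketched as follows. This is one possible arrangement, not the reference solution: it uses `scipy.stats.poisson.logpmf` in place of your logPoisson function, takes the grid points closest to $-2\Delta{LL} = 1$ on each side of the minimum instead of interpolating, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import poisson

obs = np.array([6, 5, 2])
bkg = np.array([1.25, 2.07, 1.53])
sig = np.array([2.09, 2.29, 0.9])

mu_grid = np.linspace(0, 3, 3001)

# Expected yields for every mu value: shape (3001, 3) via broadcasting
lam = mu_grid[:, None] * sig + bkg

ll_mu = poisson.logpmf(obs, lam).sum(axis=1)  # log L(mu*S+B), one value per mu
ll_b = poisson.logpmf(obs, bkg).sum()         # log L(b-only)
nllr_mu = -2.0 * (ll_mu - ll_b)
delta = nllr_mu - nllr_mu.min()               # the -2 Delta LL curve

i_min = np.argmin(delta)
mu_hat = mu_grid[i_min]

# Grid points closest to delta = 1 on each side of the minimum
i_lo = np.argmin(np.abs(delta[:i_min] - 1.0))
i_hi = i_min + np.argmin(np.abs(delta[i_min:] - 1.0))
mu_lo, mu_hi = mu_grid[i_lo], mu_grid[i_hi]

print(f"mu_hat = {mu_hat:.2f}, mu_lo = {mu_lo:.2f}, mu_hi = {mu_hi:.2f}")
```

With a grid spacing of 0.001, the nearest-grid-point approach is accurate to roughly that spacing. The sketch assumes both crossings lie inside the scanned range, so it is worth confirming this on the plot of `delta` against `mu_grid`.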

In [None]:
#Develop your code here

print("The central value of mu is {:4.2f}".format(mu_measured))
print("The minus one sigma error of mu is {:4.2f}".format(mu_lo - mu_measured))
print("The plus one sigma error of mu is {:4.2f}".format(mu_hi - mu_measured))

