In [None]:
# Download the data files if we need them. If you download the repo as a ZIP, 
# this cell is not needed. If running on colab, it will automatically download
# all required data.

from pathlib import Path


data_path = Path('../data-v2/VHbb_data_2jet.csv')
class_path = Path('ucl_masterclass.py')
if not data_path.exists():
    !wget -P ../data-v2/ https://raw.githubusercontent.com/nikitapond/in2HEP/master/data-v2/VHbb_data_2jet.csv
else:
    print("Data file already found")

if not class_path.exists():
    !wget https://raw.githubusercontent.com/nikitapond/in2HEP/master/notebooks/ucl_masterclass.py
else:
    print("Required custom classes already found")

## 2. $H\rightarrow b\bar{b}$ via Sequential Cuts

To provide a simple baseline against which a multivariate analysis can be compared, we first optimise a set of selection cuts on the kinematic and topological parameters. The goal is to apply cuts that maximise the _signal sensitivity_. Cuts should be applied on variables other than $m_{bb}$, because the $m_{bb}$ distribution is used to evaluate the _signal sensitivity_, which is calculated on a bin-by-bin basis from a given distribution. ($m_{bb}$ is the most sensitive single variable, so in the absence of a multivariate approach it is the best single variable for distinguishing signal from background.)

# 2.1 Why do we apply cuts?

In particle physics, we want to find evidence of a signal process, such as the production of a Higgs boson which then decays to two b-jets. We do this by measuring the final-state particles, the b-jets, and combining their properties to reconstruct the properties of the Higgs, such as its mass. In the first example below, we show an ideal case, where many Higgs bosons are produced and there is no background.

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy

from ucl_masterclass import *

# 2.1.1 An Ideal Case

In [None]:

def generate_dummy_signal(n):
    '''
    Generates a dummy Higgs signal with a mean of 125 GeV and a standard deviation of 5 GeV.
    '''
    return 125 + np.random.randn(n)*5

def make_hist(data, title):
    '''
    Makes a histogram of the data.
    '''
    plt.hist(data, bins=np.linspace(50, 200, 50));
    plt.xlabel("Mass [GeV]")
    plt.ylabel("Number of Events")
    plt.title(title)

make_hist(generate_dummy_signal(20000), "Dummy signal data")


# 2.1.2 What about background?

In the above example, we can clearly see a peak at 125 GeV, highlighting a particle that exists with this mass. But this is an ideal case, with a lot of signal, and no background. What happens if we change these two factors?


In [None]:
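def generate_dummy_background(n):
    '''
    Generates a dummy background from three processes (one exponential and two
    Gaussian), each contributing n events, so 3*n events are returned in total.
    '''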
    exp_bkg = np.random.exponential(100, n)
    gaus_bkg_1 = np.random.randn(n)*30 + 100
    gaus_bkg_2 = np.random.randn(n)*30 + 200
    return np.concatenate([exp_bkg, gaus_bkg_1, gaus_bkg_2])

# Generate a dummy background, caused by several other processes
dummy_back = generate_dummy_background(20000)
# Generate a dummy signal, caused by the Higgs boson, with high statistics
dummy_signal = generate_dummy_signal(20000)
# What we actually measure is all the signal and background
dummy_measured = np.concatenate([dummy_back, dummy_signal])
make_hist(dummy_measured, "Dummy measured data")

In the above case, we've added some background to make things more realistic, but we still have a lot of signal, so it's still easy to see the peak at 125 GeV.

# 2.1.3 Lowering the signal

The actual rate at which Higgs bosons are produced is significantly smaller than the background rate. Let's see how that affects these plots.

In [None]:

# Generate a dummy background, caused by several other processes
dummy_back = generate_dummy_background(10000)
# Generate a dummy signal, caused by the Higgs boson, with low statistics
dummy_signal = generate_dummy_signal(250)
# What we actually measure is all the signal and background
dummy_measured = np.concatenate([dummy_back, dummy_signal])

make_hist(dummy_measured, "Dummy measured data")


The above plot will look different each time you run it, because it is built from random numbers. This is very similar to what actually happens in the ATLAS detector, where random fluctuations occur in the data. Sometimes you might get lucky and see a clear peak around 125 GeV, but a lot of the time you won't. Even worse, you might see random peaks elsewhere!
So, how can we find our Higgs boson when there's so much other stuff going on?
We can apply cuts!


# 2.1.4 Ideal cuts
In this ideal example, we shall apply 'cuts' that simply reduce the amount of background and signal we have, but at ideal rates: each cut keeps 90% of the signal and only 50% of the background.

In [None]:
bkg_factor = 0.5     # fraction of background kept by each cut
signal_factor = 0.9  # fraction of signal kept by each cut
num_cuts = 4
bkg_reduced = 1.0
sig_reduced = 1.0

dummy_back = generate_dummy_background(10000)
# Mix the three background components so that slicing keeps a representative fraction
np.random.shuffle(dummy_back)
dummy_signal = generate_dummy_signal(250)
bins = np.linspace(50, 200, 50)

plt.title("Measured data after cuts")
plt.xlabel("Mass [GeV]")
plt.ylabel("Number of Events")
plt.hist(np.concatenate([dummy_back, dummy_signal]), bins=bins, label="No cuts");
for i in range(num_cuts):
    bkg_reduced *= bkg_factor
    sig_reduced *= signal_factor
    # Keep only the surviving fraction of each sample
    n_back = int(len(dummy_back) * bkg_reduced)
    n_signal = int(len(dummy_signal) * sig_reduced)
    dummy_measured = np.concatenate([dummy_back[:n_back], dummy_signal[:n_signal]])
    plt.hist(dummy_measured, bins=bins, label=f"Cut {i+1}");
plt.legend()


In this final plot, we can see that after applying a series of cuts to our signal and background, we eventually start to see our peak more and more clearly. Once we can see a peak clearly above the background noise, we can use it to declare that we've found a Higgs.
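
To put rough numbers on this, the sketch below tracks a common rule-of-thumb figure of merit, $s/\sqrt{b}$, through four cuts with the same efficiencies as the cell above (90% of signal kept, 50% of background kept). This is only an illustration: $s/\sqrt{b}$ is an assumed stand-in, not the sensitivity measure used later in this notebook, and it counts all events rather than only those near the peak. The starting background of 30000 reflects that `generate_dummy_background(10000)` returns three samples of 10000 events.

```python
import math

# Starting yields, matching the cell above: generate_dummy_background(10000)
# returns 3 * 10000 events, and generate_dummy_signal(250) returns 250.
n_bkg = 30000.0
n_sig = 250.0

bkg_factor = 0.5     # fraction of background surviving each cut
signal_factor = 0.9  # fraction of signal surviving each cut

def simple_significance(s, b):
    """Rule-of-thumb significance s / sqrt(b); an illustrative approximation only."""
    return s / math.sqrt(b)

# Significance before any cut, then after each of the four cuts
significances = [simple_significance(n_sig, n_bkg)]
for _ in range(4):
    n_bkg *= bkg_factor
    n_sig *= signal_factor
    significances.append(simple_significance(n_sig, n_bkg))

for i, z in enumerate(significances):
    print(f"After {i} cut(s): s/sqrt(b) = {z:.2f}")
```

Each cut multiplies $s/\sqrt{b}$ by $0.9/\sqrt{0.5} \approx 1.27$, which is why the peak becomes steadily easier to see even though some signal is lost.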

When it comes to declaring that we've discovered a particle, we need to ensure that the peak we see is larger than the random fluctuations we expect in the background. If we skipped this check and called every peak a new particle, we'd be discovering new particles every week! We therefore look for a peak that is statistically significant compared to the background, which means calculating the probability that the peak is down to random noise alone. To claim a discovery, we require this probability to be about 1 in 3.5 million (the "5 sigma" threshold).
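
Particle physicists usually quote such probabilities as a number of standard deviations ("sigma") of a normal distribution, with 5 sigma as the conventional discovery threshold. The sketch below converts between the two, assuming the one-sided tail convention; the function name `one_sided_p_value` is made up for this illustration.

```python
import math

def one_sided_p_value(n_sigma):
    """Probability that a standard normal variable exceeds n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

# Print the chance of a background-only fluctuation at least this large
for n_sigma in (1, 2, 3, 5):
    p = one_sided_p_value(n_sigma)
    print(f"{n_sigma} sigma -> p = {p:.3g} (about 1 in {1 / p:,.0f})")
```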

# 2.2 A more realistic example

In the above code, we generated some dummy data, with a clear peak from the Higgs boson decay and a few different possible background processes. When applying cuts, we just imagined we were able to apply a cut that reduced the background by 50% while only reducing the signal by 10%. We also generated only a single variable, the combined (invariant) mass of the two b-jets. In real data, each point represents an event in the ATLAS detector.

But how do we know if the cuts we're applying are working? We work with simulations instead, and compare these to what we see in the data. These are called Monte-Carlo (MC) simulations: they use random numbers to simulate quantum processes, which have an intrinsically random nature. When we generate MC, we know which processes we have generated, so we can plot the different background and signal events in different colours to make it clear what each process is contributing. Below, let's look at the distributions for some of our variables:

In [None]:
# Load data into a pandas data frame
df = pd.read_csv('../data-v2/VHbb_data_2jet.csv')
df_original = deepcopy(df)

# plot_variable takes two arguments: first the data frame used to plot the distributions,
# and second the name of the variable to plot.
plot_variable(df,'mBB') # Draw the mBB distribution
plot_variable(df,'pTB1') # Draw the pTB1 distribution
plot_variable(df,'Mtop') # Draw the Mtop distribution

We see that the signal and background distributions in the above plots are different. We can use this information to figure out which cuts to apply to try to maximise our *sensitivity*.

## 2.2.1 What is Sensitivity?

Sensitivity is the quantitative measure of how much our peak sticks up above the background. It has a fancy calculation that we don't need to get into, but the rough idea is this:
- Take our histogram of signal and background
- For each bin, calculate a per bin sensitivity. The per bin sensitivity increases when we have more signal in that bin.
- Add up the per-bin sensitivity 
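
The steps above can be sketched as follows. This is an illustration only, not the code behind `sensitivity_cut_based` (which lives in `ucl_masterclass.py` and may differ). It assumes the commonly used Asimov formula for the per-bin significance, $Z_i = \sqrt{2\left[(s_i+b_i)\ln(1+s_i/b_i) - s_i\right]}$, and assumes the bins are combined in quadrature. The function name and the toy yields are hypothetical.

```python
import numpy as np

def binned_asimov_sensitivity(signal_counts, background_counts):
    """Combine per-bin Asimov significances in quadrature (illustrative sketch)."""
    s = np.asarray(signal_counts, dtype=float)
    b = np.asarray(background_counts, dtype=float)
    # Skip bins with no background, where the formula is undefined
    mask = b > 0
    s, b = s[mask], b[mask]
    # Per-bin significance squared: grows as the signal in a bin increases
    z_squared = 2.0 * ((s + b) * np.log1p(s / b) - s)
    return float(np.sqrt(np.sum(z_squared)))

# Toy histogram contents (made-up yields per bin)
signal = [0, 5, 20, 5, 0]
background = [100, 80, 60, 40, 30]
print(binned_asimov_sensitivity(signal, background))
```

For bins where the signal is much smaller than the background, each term reduces to roughly $s_i/\sqrt{b_i}$, which matches the intuition that more signal or less background in a bin increases the sensitivity.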

In the cell below, we calculate the sensitivity using the variable $m_{bb}$. This is the mass obtained by combining the two b-jets, which for signal events represents the mass of the Higgs boson.

In [None]:
# The code below plots the mBB distribution before any selection is applied
plot_variable(df_original,'mBB')
# Calculate and output the sensitivity based upon the original mBB distribution prior to any selection
# The sensitivity is calculated using the profile likelihood ratio test and Asimov approach
print("Sensitivity achieved before cuts ",sensitivity_cut_based(df_original))

We can try applying a simple cut to increase our sensitivity, which we do below.

In [None]:
# Apply cut
df = df.loc[df["Mtop"] / 1e3 > 100] # divide by 1e3 to convert from MeV to GeV

# The code below plots the mBB distribution before any selection is applied
plot_variable(df_original,'mBB')

# Calculate and output the sensitivity based upon the original mBB distribution prior to any selection
# The sensitivity is calculated using the profile likelihood ratio test and Asimov approach
print("Sensitivity achieved before cuts ",sensitivity_cut_based(df_original))

# The code below plots the mBB distribution after the selection has been applied
plot_variable(df,'mBB')

# Calculate and output the sensitivity based upon the mBB distribution after the selection has been applied
print("Sensitivity achieved after cuts ",sensitivity_cut_based(df))

Here we see that we were able to increase the sensitivity by cutting on the variable 'Mtop'.
Is the cut we tried above optimal?
- Try changing the point at which we cut. At the moment it is 100 GeV; try a different value.
- At the moment, we remove everything below 100 GeV. What happens if you remove everything above a certain cut value instead?
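
One way to explore this systematically is to scan over a range of cut values and record the figure of merit at each one. The sketch below does this on toy data, using a simple $s/\sqrt{b}$ as a stand-in metric. Everything in it (the toy distributions, `simple_significance`, `scan_lower_cut`) is hypothetical and chosen only to show the pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one variable (in GeV): signal sits higher than background
sig_values = rng.normal(150, 30, 500)
bkg_values = rng.normal(100, 40, 20000)

def simple_significance(s, b):
    """Rule-of-thumb s / sqrt(b); returns 0 if no background survives (a guard, not physics)."""
    return s / np.sqrt(b) if b > 0 else 0.0

def scan_lower_cut(sig, bkg, thresholds):
    """Significance obtained by keeping only events above each threshold."""
    return [simple_significance(np.sum(sig > t), np.sum(bkg > t)) for t in thresholds]

thresholds = np.arange(50, 201, 10)
scores = scan_lower_cut(sig_values, bkg_values, thresholds)
best = thresholds[int(np.argmax(scores))]
print(f"Best lower cut in this toy: {best} GeV")
```

In the notebook itself, the same loop could evaluate `sensitivity_cut_based(df_original.loc[df_original["Mtop"] / 1e3 > t])` for each threshold `t`, and flipping `>` to `<` scans an upper cut instead.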

In [None]:
# Try out some cuts here! Copy and paste the relevant code from the above cell


##### Multiple cuts

As well as applying cuts on a single variable, you should also try applying multiple cuts one after the other. When you cut on two variables together, are the optimal cut values the same as when you cut on each variable alone?
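
A minimal sketch of how two cuts can be chained in pandas is shown below. The column names `Mtop` and `pTB1` come from the data set above, but the toy values, the 50 GeV threshold on `pTB1`, and the assumption that `pTB1` is stored in MeV like `Mtop` are made up for illustration.

```python
import pandas as pd

# Toy data frame; column names follow the notebook, values (in MeV) are made up
toy = pd.DataFrame({
    "Mtop": [90e3, 120e3, 150e3, 200e3],
    "pTB1": [40e3, 60e3, 30e3, 80e3],
})

# Option 1: apply the cuts one after the other
step1 = toy.loc[toy["Mtop"] / 1e3 > 100]
sequential = step1.loc[step1["pTB1"] / 1e3 > 50]

# Option 2: combine both conditions in a single boolean mask
combined = toy.loc[(toy["Mtop"] / 1e3 > 100) & (toy["pTB1"] / 1e3 > 50)]

# Both approaches select the same events
print(sequential.equals(combined))
```

Because both forms select the same events, the order in which cuts are applied does not change the final selection. The cut values that maximise the sensitivity, however, can shift once another cut is already in place.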

**Based upon material originally produced by hackingEducation for use in outreach**  
<img src="images/logo-black.png" width="50" align = 'left'/>