# Hackathon Prompt: Vector Boson Measurement
## IAIFI Summer School 2022

## 1. Introduction
In this lab we will investigate W bosons produced in the LHC's 8 TeV proton-proton collisions. These samples were produced 4 years ago in a fun experiment that opened up the option of performing low-mass resonance searches at the LHC. The studies done then have led to a wealth of results from the ATLAS and CMS experiments at the LHC. To understand how this study works, we first need to introduce a few concepts.

![](images/Wqq.png)

Let's first consider the process that we would like to look for: the production of W bosons in proton collisions. The figure above is a Feynman diagram of the process. The left part of the diagram represents the production of the W boson via some initial quark interaction (quarks and anti-quarks are present when two protons collide). At the right, a gluon (bottom) is produced in association with the W boson (top). At the top right, the W boson decays. It can decay to many things. The full list of W boson decays is here, in the W branching ratios section. The quark label generically means that the W boson decays to two quarks. In the reference document this is equivalent to a decay to hadrons.

Both quarks and gluons hadronize into objects that we refer to as jets. A jet is a collection of particles coming from an original quark or gluon.

## 2. Measurement of Interest
You will do a **Bump Hunt** in mass *(Explanation and example code)*.
Your goal will be to extract the mass of the W from the given data, using a classifier to remove as much background as possible.
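To make the idea of a bump hunt concrete, here is a minimal sketch on toy data. Everything in it is an assumption made for illustration: the exponential background, the Gaussian bump placed at 80 GeV, the bin width, and the sideband boundaries are not taken from the lab samples, and a real analysis would use a proper likelihood fit rather than a log-linear `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs (assumed, not the lab data): falling background plus a Gaussian bump.
background = 40.0 + rng.exponential(scale=60.0, size=200_000)
signal = rng.normal(loc=80.0, scale=8.0, size=6_000)
mass = np.concatenate([background, signal])

# Histogram the mass spectrum in 2 GeV bins.
edges = np.linspace(40.0, 200.0, 81)
counts, _ = np.histogram(mass, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])

# Fit the background shape in the sidebands only (the signal window is masked out).
sideband = (centers < 60.0) | (centers > 100.0)
slope, intercept = np.polyfit(centers[sideband], np.log(counts[sideband]), deg=1)
background_fit = np.exp(intercept + slope * centers)

# The bump is what remains after subtracting the fitted background.
excess = counts - background_fit
window = ~sideband
peak_mass = np.sum(centers[window] * excess[window]) / np.sum(excess[window])
print(f"estimated bump position: {peak_mass:.1f} GeV")
```

The sideband fit only works if the background is smooth under the peak. A classifier that reshapes the background mass spectrum breaks that assumption, which is the issue discussed in the next section.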

## 3. Possible Issue
Neural networks (and classifiers with enough capacity in general) tend to learn features they should not. The field of machine learning concerned with studying this phenomenon is called [algorithmic fairness](https://en.wikipedia.org/wiki/Fairness_(machine_learning)). Explore how your classifier is using the mass to infer the labels (even if you do not directly use it as a training feature).
*(Explain how features can be correlated).*
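As a rough illustration of how a classifier can pick up the mass without ever seeing it, the sketch below builds a toy background sample and a hypothetical input feature that tracks the mass. The feature, its noise level, and the sigmoid "score" are all invented for this example and are not the lab's variables. Cutting on the score changes the mass distribution of the selected events.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy background only (assumed): a smoothly falling mass spectrum.
mass = 40.0 + rng.exponential(scale=60.0, size=100_000)

# Hypothetical substructure feature that tracks the mass, plus noise.
# The "classifier" below never sees `mass`, only `feature`.
feature = 0.01 * mass + rng.normal(scale=0.3, size=mass.size)

# Stand-in classifier score: a monotonic function of the feature.
score = 1.0 / (1.0 + np.exp(-(feature - 1.0)))

# Linear correlation between the score and the mass.
corr = np.corrcoef(score, mass)[0, 1]

# Keep the 10% most signal-like events and compare the mass before and after.
selected = score > np.quantile(score, 0.9)

print(f"correlation(score, mass) = {corr:.2f}")
print(f"mean mass before cut: {mass.mean():.1f} GeV")
print(f"mean mass after cut:  {mass[selected].mean():.1f} GeV")
```

Because the selected background no longer has the original falling shape, a cut like this can distort the spectrum that the bump hunt relies on.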

## 4. Example solutions

- Moment decomposition to control classifier bias: https://arxiv.org/pdf/2010.09745.pdf
- Distance Correlation: https://arxiv.org/pdf/2001.05310.pdf
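
For orientation, here is a small NumPy sketch of the sample distance correlation between two one-dimensional variables, written from the textbook definition (double-centered pairwise distance matrices). It only computes the statistic. The function name and the toy inputs are assumptions for this example, and using the quantity as a training penalty, as the linked paper discusses, is not shown.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1D arrays of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute distances within each variable.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each distance matrix.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2_xy = (A * B).mean()
    dcov2_xx = (A * A).mean()
    dcov2_yy = (B * B).mean()
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=1000)
y_dependent = x**2                                 # depends on x, but not linearly
y_independent = rng.uniform(-1.0, 1.0, size=1000)  # unrelated to x

print("Pearson(x, x^2):", np.corrcoef(x, y_dependent)[0, 1])
print("dCor(x, x^2):   ", distance_correlation(x, y_dependent))
print("dCor(x, noise): ", distance_correlation(x, y_independent))
```

The pairwise matrices make this quadratic in the number of events, so in practice it would be evaluated on batches rather than on a full sample.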



## Creating an environment with conda

In [1]:
# ! conda create -n vqq python=3.9
# ! conda activate vqq
# ! pip install -r requirements.txt

## Loading the data

In [6]:
#3GB Data Set
# !wget https://www.dropbox.com/s/bcyab2lljie72aj/data.tgz
# or if you don't have wget
!curl -LO https://www.dropbox.com/s/bcyab2lljie72aj/data.tgz
#130MB Data Set
# !wget https://www.dropbox.com/s/p756oa4mfw17lfw/data.zip
# !curl -LO https://www.dropbox.com/s/p756oa4mfw17lfw/data.zip

# Extract the data
# !unzip data.zip
!tar -xvf data.tgz

# # Clean the downloaded file
# !rm data.zip 
# !rm data.tgz

._data
data/
data/._WQQ_s.root
data/WQQ_s.root
data/._ZZ.root
data/ZZ.root
data/._skimh
data/skimh/
data/._QCD_s.root
data/QCD_s.root
data/._ZQQ_s.root
data/ZQQ_s.root
data/._JetHT_s.root
data/JetHT_s.root
data/._TT.root
data/TT.root
data/._ggH.root
data/ggH.root
data/._WW.root
data/WW.root
data/._WZ.root
data/WZ.root
data/skimh/._WQQ_sh.root
data/skimh/WQQ_sh.root
data/skimh/._ZQQ_sh.root
data/skimh/ZQQ_sh.root
data/skimh/._VQQ_sh.root
data/skimh/VQQ_sh.root


In [7]:
import uproot
# Now let's look at the data. Our data sample is the JetHT dataset.
# That means the events passed a trigger that requires a jet (discussed below).
data   = uproot.open("data/JetHT_s.root")["Tree"]

# In addition to above we have Monte Carlo Simulation of many processes
# Some of these processes are well modelled in simulation and some of them are not
#-------------------------------------------------------------------------------

# Now we have our actual process qq=>W=>qq at 8TeV collision energy
wqq    = uproot.open("data/WQQ_s.root")["Tree"]

# Now we have our actual process qq=>Z=>qq at 8TeV collision energy
zqq    = uproot.open("data/ZQQ_s.root")["Tree"] 

#Hint: You could check for files in the data directory by doing "!ls data/" in a cell.
#The ZQQ file name is similar to WQQ

# Unfortunately, the samples above, which I made a long time ago, are very small.
# To train NNs and make nice plots we will use larger samples produced at a different collision energy.
# qq=>W=>qq at 13TeV collision energy
wqq13  = uproot.open("data/skimh/WQQ_sh.root")["Tree"]

# qq=>Z=>qq at 13TeV collision energy
zqq13  = uproot.open("data/skimh/ZQQ_sh.root")["Tree"]

# Now we have our worst-modeled background, which is also our main background.
# This is our di-jet quark and gluon background.
# We just call these backgrounds QCD because they are produced via Quantum Chromodynamics.
qcd    = uproot.open("data/QCD_s.root")["Tree"]

# Now we have the Higgs boson sample (we might need this in the future)
ggh    = uproot.open("data/ggH.root")["Tree"]

# And top-quark pair production background. 
tt     = uproot.open("data/TT.root")["Tree"]

# Finally we have the rarer double W, W+Z and Z+Z diboson samples where we have two bosons instead of one
ww     = uproot.open("data/WW.root")["Tree"]
wz     = uproot.open("data/WZ.root")["Tree"]
zz     = uproot.open("data/ZZ.root")["Tree"]
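
Before opening the trees, it can help to confirm that the archive was extracted where the notebook expects it. The sketch below uses only the standard library. The file paths are copied from the cell above, while the dictionary and the helper name `missing_samples` are made up for this example.

```python
from pathlib import Path

# Paths copied from the loading cell above.
SAMPLE_FILES = {
    "data": "data/JetHT_s.root",
    "wqq": "data/WQQ_s.root",
    "zqq": "data/ZQQ_s.root",
    "wqq13": "data/skimh/WQQ_sh.root",
    "zqq13": "data/skimh/ZQQ_sh.root",
    "qcd": "data/QCD_s.root",
    "ggh": "data/ggH.root",
    "tt": "data/TT.root",
    "ww": "data/WW.root",
    "wz": "data/WZ.root",
    "zz": "data/ZZ.root",
}

def missing_samples(samples=SAMPLE_FILES, base_dir="."):
    """Return the names of samples whose file is not found under base_dir."""
    base = Path(base_dir)
    return [name for name, rel in samples.items() if not (base / rel).is_file()]

missing = missing_samples()
print("missing samples:", missing if missing else "none")
```

If the list is not empty, rerun the download and extraction cell before calling `uproot.open` on those paths.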