# Session 6: Rediscover the Higgs boson by yourself!

## Welcome to Session 6!

This analysis focuses on the discovery of the Higgs boson, the most significant finding to date at CERN, using data from the ATLAS experiment.<br>
The relevant paper is https://arxiv.org/pdf/1207.7214.pdf; we will mostly follow Sections 5 and 5.1 ($H\rightarrow \gamma\gamma$).

Higgs boson decays into photon pairs (or Z$\gamma$): <CENTER><img src="images/Higss_gamma_gamma.png" style="width:70%"></CENTER>

The search for the SM Higgs boson via the decay $H\rightarrow \gamma\gamma$ covers the mass range from 110 GeV to 150 GeV. The primary background is SM diphoton production ($\gamma\gamma$), with additional contributions from $\gamma$+jet and jet+jet production where one or two jets are misidentified as photons ($\gamma$j and jj), as well as from the Drell-Yan process.

By the end of this exercise you will be able to:

1. Rediscover the Higgs boson by yourself
2. Understand some more advanced principles of a particle physics analysis

## Using RDataFrame

RDataFrame is the high-level data analysis interface provided by the ROOT software package. It lets users analyze large datasets efficiently from C++ or, as in this notebook, from Python.

RDataFrame operates on columnar data, where each column represents a variable or attribute of the dataset. It provides a declarative and functional approach to data analysis. Instead of writing explicit loops or iteration, users define a series of operations on the data, which are then applied to the entire dataset in a parallel and optimized manner.

With RDataFrame, users can perform a wide range of operations on the data, such as filtering, selecting, transforming, and aggregating. 

Additionally, RDataFrame provides built-in support for multi-threading and distributed computing, allowing for efficient analysis of large datasets.

In [None]:
import ROOT
import os

In [None]:
# Enable multi-threading
ROOT.ROOT.EnableImplicitMT()

### File path and samples for Data and MC

In [None]:
path = "https://atlas-opendata.web.cern.ch/atlas-opendata/samples/2020/"

def get_data_samples():
    samples = ROOT.std.vector("string")()
    for tag in ["A", "B", "C", "D"]:
        samples.push_back(os.path.join(path, "GamGam/Data/data_{}.GamGam.root".format(tag)))
    return samples

def get_ggH125_samples():
    samples = ROOT.std.vector("string")()
    samples.push_back(os.path.join(path, "GamGam/MC/mc_343981.ggH125_gamgam.GamGam.root"))
    return samples

### Define dataframes using RDataFrame in ROOT

In [None]:
df = {}
df["data"] = ROOT.RDataFrame("mini", get_data_samples())
df["ggH"] = ROOT.RDataFrame("mini", get_ggH125_samples())
processes = list(df.keys())

In [None]:
# Apply scale factors and MC weight for simulated events and a weight of 1 for the data
for p in ["ggH"]:
    df[p] = df[p].Define("weight", "scaleFactor_PHOTON * scaleFactor_PhotonTRIGGER * scaleFactor_PILEUP * mcWeight")
df["data"] = df["data"].Define("weight", "1.0")

### Apply preselections and cuts on the photons

1. Apply preselection cut on photon trigger
2. Find two good photons with tight ID, $p_T > 25$ GeV, inside the fiducial region $|\eta| \leq 2.37$ and outside the transition region between barrel and endcap ($1.37\leq |\eta| \leq 1.52$)
3. Select isolated photons only: photon_ptcone30 / photon_pT < 0.065 and photon_etcone20 / photon_pT < 0.065
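
The geometric part of step 2 can be sketched in plain Python, independent of ROOT. The helper name `in_acceptance` is hypothetical and the thresholds are the ones quoted in the list above; treat it as an illustration of the logic rather than as the solution of the cell that follows:

```python
def in_acceptance(eta, eta_max=2.37, crack_lo=1.37, crack_hi=1.52):
    """Return True if |eta| is inside the fiducial region and outside
    the barrel/endcap transition region (the "crack")."""
    abs_eta = abs(eta)
    in_fiducial = abs_eta < eta_max
    in_crack = crack_lo <= abs_eta <= crack_hi
    return in_fiducial and not in_crack

# Made-up pseudorapidities: central, in the crack, endcap, outside acceptance
etas = [0.3, -1.45, 1.60, 2.50]
print([in_acceptance(eta) for eta in etas])
```

In the RDataFrame expression the same condition is written as a boolean mask over all photons of an event, which is why the event filter then counts how many entries of the mask are true.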

In [None]:
for p in processes:
    # Apply preselection cut on photon trigger
    df[p] = df[p].Filter("") # Fill the line appropriately!!!

    # Find two good photons with tight ID, pt > 25 GeV, outside the barrel/endcap transition region
    # (photon_pt is stored in MeV: it is divided by 1000.0 later to obtain GeV)
    # !!! Replace each ? with the appropriate value
    df[p] = df[p].Define("goodphotons", "photon_isTightID && (photon_pt > ?) && (abs(photon_eta) < ?) && ((abs(photon_eta) < ?) || (abs(photon_eta) > ?))")\
                 .Filter("Sum(goodphotons) == 2")

    # Take only isolated photons
    df[p] = df[p].Filter("Sum(photon_ptcone30[goodphotons] / photon_pt[goodphotons] < ? ) == 2")\
                 .Filter("? < ? ) == 2") # Fill the etcone isolation

### Declare the function that calculates the invariant mass (in C++ code) using the ROOT.gInterpreter

In [None]:
ROOT.gInterpreter.Declare(
"""
#include <math.h> // for M_PI
using Vec_t = const ROOT::VecOps::RVec<float>;
float ComputeInvariantMass(Vec_t& pt, Vec_t& eta, Vec_t& phi, Vec_t& e) {
    float dphi = abs(phi[0] - phi[1]);
    dphi = dphi < M_PI ? dphi : 2 * M_PI - dphi;
    return sqrt(2 * pt[0] / 1000.0 * pt[1] / 1000.0 * (cosh(eta[0] - eta[1]) - cos(dphi)));
}
""");

**<font color='red'>Task - Homework</font>: Calculate the invariant mass in an alternative way, using the formula implemented in Session 2 (Z boson invariant mass)**
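
As a reference point for cross-checks: for two massless particles such as photons, the `ComputeInvariantMass` function declared above evaluates $m_{\gamma\gamma} = \sqrt{2\,p_{T,1}\,p_{T,2}\,(\cosh\Delta\eta - \cos\Delta\phi)}$, with the transverse momenta divided by 1000 so that the result is in GeV. A pure-Python mirror of the same formula (the name `diphoton_mass` is hypothetical, and the inputs are assumed to be in GeV already) makes it easy to test by hand:

```python
import math

def diphoton_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless particles; pt in GeV, result in GeV."""
    deta = eta1 - eta2
    dphi = phi1 - phi2  # cos is even and 2*pi periodic, so no wrapping is needed
    return math.sqrt(2.0 * pt1 * pt2 * (math.cosh(deta) - math.cos(dphi)))

# Two back-to-back photons at eta = 0 with pt = 62.5 GeV each:
# by construction their invariant mass is 2 * 62.5 GeV.
m = diphoton_mass(62.5, 0.0, 0.0, 62.5, 0.0, math.pi)
print(m)
```

Any alternative implementation should agree with this value for the same inputs, up to floating-point precision.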

In [None]:
hists = {}
for p in processes:
    # Compute the diphoton invariant mass from the two selected photons
    df[p] = df[p].Define("m_yy", "ComputeInvariantMass(photon_pt[goodphotons], photon_eta[goodphotons], photon_phi[goodphotons], photon_E[goodphotons])")

    # Make additional kinematic cuts and select mass window
    df[p] = df[p].Filter("photon_pt[goodphotons][0] / 1000.0 / m_yy > 0.35")\
                 .Filter("photon_pt[goodphotons][1] / 1000.0 / m_yy > 0.25")\
                 .Filter("(m_yy > 105) && (m_yy < 160)")

    # Book histogram of the invariant mass with this selection
    hists[p] = df[p].Histo1D(
            ROOT.ROOT.RDF.TH1DModel(p, "Diphoton invariant mass; m_{#gamma#gamma} [GeV];Events / bin", 30, 105, 160),
            "m_yy", "weight")

In [None]:
# Run the event loop
ggh = hists["ggH"].GetValue()
data = hists["data"].GetValue()

### Fitting and plotting

After making the histograms, we need to fit the data with a model for the signal and background processes. We use a fit model that combines a third-order polynomial (for the background) and a Gaussian distribution (for the signal).

**The fitting procedure is implemented step by step in the next cell.**
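
As a rough sketch of that model in plain Python (no ROOT; the function name `signal_plus_background` and the parameter values in the example are hypothetical and chosen only for illustration), the fitted function is a cubic polynomial plus a Gaussian described by an amplitude, a mean and a width:

```python
import math

def signal_plus_background(x, p):
    """Cubic polynomial (p[0]..p[3]) plus a Gaussian with
    amplitude p[4], mean p[5] and width p[6]."""
    background = p[0] + p[1] * x + p[2] * x**2 + p[3] * x**3
    signal = p[4] * math.exp(-0.5 * ((x - p[5]) / p[6]) ** 2)
    return background + signal

# Illustrative parameters only: a flat background of 1000 events per bin and
# a Gaussian of amplitude 100 centred at 125 GeV with a width of 2 GeV.
params = [1000.0, 0.0, 0.0, 0.0, 100.0, 125.0, 2.0]
peak = signal_plus_background(125.0, params)      # background + full amplitude
sideband = signal_plus_background(105.0, params)  # essentially background only
print(peak, sideband)
```

The parameter ordering matches the `TF1` formula string used in the fit, so parameters 0 to 3 describe the background and parameters 4 to 6 the signal peak.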

In [None]:
# Set styles
ROOT.gStyle.SetOptStat(0)
ROOT.gStyle.SetOptTitle(0)
ROOT.gStyle.SetMarkerStyle(20)
ROOT.gStyle.SetMarkerSize(1.2)
size = 0.08
ROOT.gStyle.SetLabelSize(size, "x")
ROOT.gStyle.SetLabelSize(size, "y")
ROOT.gStyle.SetTitleSize(size, "x")
ROOT.gStyle.SetTitleSize(size, "y")

# Create canvas with pads for the main plot and the data-minus-background residuals
c = ROOT.TCanvas("c", "", 700, 750)

upper_pad = ROOT.TPad("upper_pad", "", 0, 0.29, 1, 1)
lower_pad = ROOT.TPad("lower_pad", "", 0, 0, 1, 0.29)
for p in [upper_pad, lower_pad]:
    p.SetLeftMargin(0.14)
    p.SetRightMargin(0.05)
upper_pad.SetBottomMargin(0)
lower_pad.SetTopMargin(0)

upper_pad.Draw()
lower_pad.Draw()

data.SetStats(0)
data.SetTitle("")

# Fit signal + background model to data
upper_pad.cd()
fit = ROOT.TF1("fit", "([0]+[1]*x+[2]*x^2+[3]*x^3)+[4]*exp(-0.5*((x-[5])/[6])^2)", 105, 160)
fit.FixParameter(5, 125.0)
fit.FixParameter(4, 119.1)
fit.FixParameter(6, 2.39)
data.Fit("fit", "", "E SAME", 105, 160)
fit.SetLineColor(2)
fit.SetLineStyle(1)
fit.SetLineWidth(2)
fit.Draw("SAME")

# Draw background
bkg = ROOT.TF1("bkg", "([0]+[1]*x+[2]*x^2+[3]*x^3)", 105, 160)
for i in range(4):
    bkg.SetParameter(i, fit.GetParameter(i))
bkg.SetLineColor(4)
bkg.SetLineStyle(2)
bkg.SetLineWidth(2)
bkg.Draw("SAME")

# Draw data
data.SetMarkerStyle(20)
data.SetMarkerSize(1.2)
data.SetLineWidth(2)
data.SetLineColor(ROOT.kBlack)
data.Draw("E SAME")
data.SetMinimum(1e-3)
data.SetMaximum(8e3)

# Scale simulated events to luminosity * cross-section; the histogram integral is used
# here in place of the sum of weights. Only the ggH sample enters the Higgs signal.
lumi = 10064.0 # in pb^-1
ggh.Scale(lumi * 0.102 / ggh.Integral())
higgs = ggh
higgs.Draw("HIST SAME")

# Draw ratio
lower_pad.cd()

ratiofit = ROOT.TH1F("ratiofit", "ratiofit", 5500, 105, 160)
ratiofit.Eval(fit)
ratiofit.SetLineColor(2)
ratiofit.SetLineStyle(1)
ratiofit.SetLineWidth(2)
ratiofit.Add(bkg, -1)
ratiofit.Draw()
ratiofit.SetMinimum(-150)
ratiofit.SetMaximum(225)
ratiofit.GetYaxis().SetTitle("Data - bkg")
ratiofit.GetYaxis().CenterTitle()
ratiofit.GetYaxis().SetNdivisions(503, False)
ratiofit.SetTitle("")
ratiofit.GetXaxis().SetTitle("m_{#gamma#gamma} [GeV]")

ratio = data.Clone()
ratio.Add(bkg, -1)
ratio.Draw("E SAME")
# Restore the data uncertainties on the residuals (bins run from 1 to N inclusive)
for i in range(1, data.GetNbinsX() + 1):
    ratio.SetBinError(i, data.GetBinError(i))

# Add legend
upper_pad.cd()
legend = ROOT.TLegend(0.60, 0.55, 0.89, 0.85)
legend.SetFillStyle(0)
legend.SetBorderSize(0)
legend.SetTextSize(0.05)
legend.SetTextAlign(32)
legend.AddEntry(data, "Data" ,"lep")
legend.AddEntry(bkg, "Background", "l")
legend.AddEntry(fit, "Signal + Bkg.", "l")
legend.AddEntry(higgs, "Signal", "l")
legend.Draw("SAME")

# Add ATLAS label
text = ROOT.TLatex()
text.SetNDC()
text.SetTextFont(72)
text.SetTextSize(0.05)
text.DrawLatex(0.18, 0.84, "ATLAS")

text.SetTextFont(42)
text.DrawLatex(0.18 + 0.13, 0.84, "Open Data")

text.SetTextSize(0.04)
text.DrawLatex(0.18, 0.78, "#sqrt{s} = 13 TeV, 10 fb^{-1}")

In [None]:
%jsroot on
c.Draw()

# **<font color='red'>Tasks - Homework</font> :** 
### 1) Compare the output histogram with the one from the paper. Try to explain the differences, if there are any.
### 2) Decrease the fraction of data and check the impact of the lower statistics. NB: For MC, you should also change the lumi value used in the normalization procedure (the values below are in fb$^{-1}$, while `lumi` in the plotting cell is in pb$^{-1}$):
lumi = 0.5 #fb-1 # data_A only

lumi = 1.9 # fb-1 # data_B only

lumi = 2.9 # fb-1 # data_C only

lumi = 4.7 # fb-1 # data_D only

lumi = 10 # fb-1 # data_A,data_B,data_C,data_D
### 3) Optimize the cuts by consulting the paper mentioned at the beginning
### 4) Find chi-squared for the fit
### 5) Find the mean and the width of the fitted Gaussian
### 6) Explore different initial guesses for the parameters of the fit
### 7) Try different functions for the fit
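
For task 4, one possible starting point is the usual definition $\chi^2 = \sum_i (n_i - f_i)^2 / \sigma_i^2$, where $n_i$ is the bin content, $f_i$ the fitted function at the bin centre and $\sigma_i$ the bin uncertainty. The sketch below uses plain Python with made-up numbers; the function name `chi_squared` and the number of free parameters are assumptions for illustration only:

```python
def chi_squared(observed, expected, errors):
    """Sum of squared pulls; bins with zero uncertainty are skipped."""
    return sum(((o - e) / s) ** 2
               for o, e, s in zip(observed, expected, errors) if s > 0)

# Made-up bin contents, fit values and uncertainties (not real data)
observed = [102.0, 95.0, 110.0]
expected = [100.0, 100.0, 100.0]
errors = [10.0, 10.0, 10.0]

chi2 = chi_squared(observed, expected, errors)
n_free_parameters = 1                    # assumed for this toy example
ndf = len(observed) - n_free_parameters  # degrees of freedom
print(chi2, chi2 / ndf)
```

In ROOT, the fitted `TF1` also provides `GetChisquare()` and `GetNDF()`, which can be compared against a hand calculation of this kind.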


