# Install or upgrade libraries

It might be that you are running with the latest libraries and that they all work together fine. 

Running the following cell takes a minute or so but ensures that you have a consistent set of python tools. 

In [None]:
#'''
!pip install --upgrade pip

!pip install futures 
!pip install --upgrade awkward
!pip install --upgrade uproot

!pip install fsspec-xrootd

!pip install vector

!pip install --upgrade pandas


!pip install --upgrade matplotlib
#'''

We've also prepared some helper code that makes it easier to work with the data in this lesson.

You can see the code [here](https://github.com/cms-opendata-workshop/workshop2024-lesson-event-selection/blob/main/instructors/dpoa_workshop_utilities.py) but we will explain the functions and data objects in this notebook. 

Let's download it first. 

In [None]:
!wget https://raw.githubusercontent.com/cms-opendata-workshop/workshop2024-lesson-event-selection/main/instructors/dpoa_workshop_utilities.py

## Imports

Import all the libraries we will need and check their versions, in case you run into issues. 

In [None]:
%load_ext autoreload
%autoreload 2

# The classics
import numpy as np
import matplotlib.pyplot as plt
import matplotlib # To get the version

import pandas as pd

# The newcomers
import awkward as ak
import uproot

import vector
vector.register_awkward()

import requests
import os

import time

import json

import dpoa_workshop_utilities
from dpoa_workshop_utilities import nanoaod_filenames
from dpoa_workshop_utilities import get_files_for_dataset
from dpoa_workshop_utilities import pretty_print
from dpoa_workshop_utilities import build_lumi_mask

import sys

In [None]:
print("Versions --------\n")
print(f"{sys.version = }\n")
print(f"{ak.__version__ = }\n")
print(f"{uproot.__version__ = }\n")
print(f"{np.__version__ = }\n")
print(f"{matplotlib.__version__ = }\n")
print(f"{vector.__version__ = }\n")
print(f"{pd.__version__ = }\n")
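
If one of the imports above fails, it can help to check which packages are installed without importing them. The following is a minimal sketch using the standard library's `importlib.metadata`. The helper name `installed_version` is ours and is not part of the workshop utilities.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages used in this notebook (distribution names as they appear on PyPI)
for name in ["awkward", "uproot", "vector", "numpy", "pandas", "matplotlib"]:
    found = installed_version(name)
    print(f"{name:12s} {found if found is not None else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added by re-running the install cell at the top of the notebook.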

## Datasets and files

### Download the essential files

Eventually, we will want to process all the files that are in some datasets. For now, we would like to make sure
that not every one of you is accessing the same file at the same time. 

We've prepared some utilities in a file called `dpoa_workshop_utilities` that will make some of your work in this lesson easier. 

In an [earlier lesson](https://cms-opendata-workshop.github.io/workshop2024-lesson-dataset-scouting/instructor/index.html), you learned that the Open Data Portal provides files that list all of the ROOT files that are part of a given dataset. 

We've provided those files as a dictionary called `nanoaod_filenames`. It's a python dictionary that has the different datasets as keys. 






In [None]:
print(nanoaod_filenames.keys())

`collision` refers to the data and the other names refer to the signal MC and background MC samples. 

* Signal MC datasets
  * `signal_M2000`
* Background MC datasets
  * `ttsemilep`
  * `tthadronic`
  * `ttleptonic`
  * `Wjets`
  
We can look at the file names of one of these datasets. Remember, these are the files *that contain the names and locations* of the actual ROOT files. 

In [None]:
nanoaod_filenames['ttsemilep']

We're going to download the contents of these files (not the ROOT files!) to your docker space, so that you can run over subsets of these datasets. 

We merge the output into files with names like `FILE_LIST_ttsemilep.txt` for example. 

In [None]:
# Download these files

for datasetname in nanoaod_filenames.keys():
    
    print(datasetname)

    outfilename = f'FILE_LIST_{datasetname}.txt'

    # Remove the file if it exists
    try:
        os.remove(outfilename)
    except OSError:
        pass

    for url in nanoaod_filenames[datasetname]:
        print(url)

        r = requests.get(url, allow_redirects=True)

        # Append to the combined list and close the file after each write
        with open(outfilename, 'a') as outfile:
            outfile.write(r.text)

We can execute some Linux commands to see what is inside them.

In [None]:
!ls -ltr | tail

In [None]:
# Look at the last 5 lines of one of these files. 
!tail -5 FILE_LIST_ttsemilep.txt

### **Challenge!**

You can get the number of files by counting the lines in each of your combined data files. You can do this
by running the command

```
!wc -l FILE_LIST_*.txt
```

Run this command in the cell below. How many collision files are there? How many signal files are there? How many files are there combined in the background sample files? 

In [None]:
# Your code here

!wc -l FILE_LIST_*.txt
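
If you would rather stay in Python, the same count can be sketched with the standard library. The helper `count_lines` is a hypothetical name introduced here. It assumes the `FILE_LIST_*.txt` files were written to the current directory by the download cell above.

```python
import glob


def count_lines(path):
    """Count the non-empty lines in a text file (one ROOT file name per line)."""
    with open(path) as infile:
        return sum(1 for line in infile if line.strip())


for path in sorted(glob.glob("FILE_LIST_*.txt")):
    print(f"{count_lines(path):6d} {path}")
```

Unlike `wc -l`, this sketch skips blank lines, so the two counts can differ slightly if a list contains empty lines.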



We also need the file that tells us what runs and events we have good luminosity calculations for, so let's get that now. 

In [None]:
# Download the lumi file
!wget https://opendata.cern.ch/record/14220/files/Cert_271036-284044_13TeV_Legacy2016_Collisions16_JSON.txt

We've also provided you with a helper function to get the names of the ROOT files for a dataset from the dictionary, 
either all of them or a subset that we pick randomly. 

This is going to allow most of us to access different files. Hopefully this works! :)

We'll try this for the `ttsemilep` since there are many files in the list so there is a better chance that most people will access a different file. 

In [None]:
filenames = get_files_for_dataset("ttsemilep", random=True, n=1)

filename = filenames[0]

print(filename)

Did we mostly get different files? Cut and paste the last set of numbers and letters from the filename into the chat. 
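
To illustrate the idea of spreading the load, here is a minimal sketch of how a random subset could be drawn from one of the `FILE_LIST_*.txt` files. It is not the implementation of `get_files_for_dataset`. The function name `pick_random_files` and its arguments are assumptions made for this example.

```python
import random


def pick_random_files(file_list_path, n=1, seed=None):
    """Return up to n distinct file names chosen at random from a file list."""
    with open(file_list_path) as infile:
        names = [line.strip() for line in infile if line.strip()]
    rng = random.Random(seed)  # passing a seed makes the choice reproducible
    return rng.sample(names, min(n, len(names)))
```

Calling `pick_random_files("FILE_LIST_ttsemilep.txt", n=1)` without a seed would give each participant an independent draw, which is why most of us should end up with different files.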

Now we're ready to work with these files!

## Selecting the data

### Can you open the datafile?

We'll work with the random `ttsemilep` file you already selected in the previous cells. We'll make use of `uproot` to open the file and 
pull out the number of events. 

Depending on your connection and which random file you open, it may take 10-30 seconds for the cell to run. 

In [None]:
filename = filenames[0]
print(f"Opening...{filename}")
f = uproot.open(filename)

events = f['Events']

nevents = events.num_entries

print(f"{nevents = }")

The `events` object is a `TTree` implementation in python and behaves like a dictionary. This means 
we can get all the keys if we want. 

In [None]:
# Uncomment the following line to print all the keys

#print(events.keys())

Again, we have provided you with a helper function called `pretty_print` that will print subsets of the keys, based on strings
that you require or ignore. 

It will also format that output based on how many characters you want in a column (you are limited to 80 characters per line). 

Here is some example usage. 

In [None]:
# Pretty print all the keys with the default format
#pretty_print(events.keys())

# Pretty print keys with 30 characters per column, for keys that contain `FatJet`
#pretty_print(events.keys(), fmt='30s', require='FatJet')

# Pretty print keys with 40 characters per column, for keys that contain `Muon` and `Iso` but ignore ones with `HLT`
pretty_print(events.keys(), fmt='40s', require=['Muon', 'Iso'], ignore='HLT')

# Pretty print keys with 40 characters per column, for keys that contain `HLT` and `TkMu50`
#pretty_print(events.keys(), fmt='40s', require=['HLT', 'TkMu50'])

# Pretty print keys with 40 characters per column, for keys that contain `HLT`
#pretty_print(events.keys(), fmt='40s', require='HLT')

# Pretty print keys with 40 characters per column, for keys that contain `Jet_` but ignore ones with `Fat`
#pretty_print(events.keys(), fmt='40s', require='Jet_', ignore='Fat')

# Pretty print keys with 40 characters per column, for keys that contain `PuppiMET` but ignore ones with `Raw`
#pretty_print(events.keys(), fmt='40s', require='PuppiMET', ignore='Raw')
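
The `require` and `ignore` arguments amount to simple substring filters. Here is a rough sketch of that logic, assuming (as the examples above suggest) that a key must contain every `require` string and none of the `ignore` strings. `filter_keys` is a hypothetical helper, not the code inside `pretty_print`, and the key list is a small hand-picked sample.

```python
def filter_keys(keys, require=None, ignore=None):
    """Keep keys containing all `require` substrings and no `ignore` substrings."""
    def as_list(value):
        # Accept None, a single string, or a list of strings
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    require, ignore = as_list(require), as_list(ignore)
    return [
        key for key in keys
        if all(r in key for r in require) and not any(i in key for i in ignore)
    ]


example_keys = ["Muon_pt", "Muon_miniIsoId", "HLT_IsoMu24", "Jet_pt", "FatJet_pt"]
print(filter_keys(example_keys, require=["Muon", "Iso"], ignore="HLT"))
print(filter_keys(example_keys, require="Jet_", ignore="Fat"))
```

Note that `Jet_` also matches `FatJet_pt`, which is why the last example needs `ignore='Fat'`.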

## Extract some data

We're going to pull out subsets of the data in order to do our analysis. 

As a reminder, you can find a list of the variable names in each dataset on the CERN Open Data Portal page for that dataset, for example, [here](https://opendata.cern.ch/eos/opendata/cms/dataset-semantics/NanoAODSIM/75156/ZprimeToTT_M2000_W20_TuneCP2_PSweights_13TeV-madgraph-pythiaMLM-pythia8_doc.html).

We're going to work with the following sets of variables
* `FatJet` for merged jets
* `Jet` for non-merged jets
* `Muon` for muons
* `PuppiMET` for the missing energy in the transverse plane (MET), calculated with the pileup per particle identification (PUPPI) algorithm



# My new stuff

In [None]:
# Jets ---------------------------------------------------
# B-tagging variable
jet_btag = events['Jet_btagDeepB'].array()

# Measure of quality of measurement of jet
jet_jetid = events['Jet_jetId'].array()

# 4-momentum in pt, eta, phi, mass 
jet_pt = events['Jet_pt'].array()
jet_eta = events['Jet_eta'].array()
jet_phi = events['Jet_phi'].array()
jet_mass = events['Jet_mass'].array()


# Muons ---------------------------------------------------
# Muon isolation
muon_iso = events['Muon_miniIsoId'].array()

# Measure of quality of how well the muon is reconstructed
muon_tightId = events['Muon_tightId'].array()

# 4-momentum in pt, eta, phi, mass 
muon_pt = events['Muon_pt'].array()
muon_eta = events['Muon_eta'].array()
muon_phi = events['Muon_phi'].array()
muon_mass = events['Muon_mass'].array()


# MET ------------------------------------------------------
# 3-momentum in pt, eta, phi, mass 
met_pt = events['PuppiMET_pt'].array()
met_eta = 0*events['PuppiMET_pt'].array()  # Fix this to be 0
met_phi = events['PuppiMET_phi'].array() 

In [None]:
plt.figure(figsize=(12,4))

plt.subplot(1,2,1)
plt.hist(ak.flatten(jet_pt),bins=100,range=(0,250));

plt.subplot(1,2,2)
plt.hist(ak.flatten(jet_pt),bins=100,range=(0,50));

In [None]:
cut_jet_pt = jet_pt>20

njets = ak.num(jet_pt)
njets_after_cut = ak.num(jet_pt[cut_jet_pt])

plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.hist(njets,          bins=20, range=(0,20))
plt.locator_params(axis='x', nbins=20)

plt.subplot(1,2,2)
plt.hist(njets_after_cut,bins=20, range=(0,20))
plt.locator_params(axis='x', nbins=20)

In [None]:
#njet_cut = ak.num(jet_pt[cut_jet_pt], axis=1) >= 6
njet_cut = ak.num(jet_pt[cut_jet_pt], axis=1) == 6

njet_cut

In [None]:
nevents_pass_njet_cut = ak.num(njet_cut[njet_cut], axis=0)
nevents_pass_njet_cut

In [None]:
cut_muon = (muon_pt>25) & (muon_eta>-2.4) & (muon_eta<2.4) & \
           (muon_tightId == True) & (muon_iso>1)

# MET ------------------------------------------------------
# 3-momentum in pt, eta, phi, mass 
met_pt = events['PuppiMET_pt'].array()
met_eta = 0*events['PuppiMET_pt'].array()  # Fix this to be 0
met_phi = events['PuppiMET_phi'].array() 


In [None]:
cut_full_event = njet_cut

cut_jet = cut_jet_pt

In [None]:
jets = ak.zip(
    {"pt": jet_pt[cut_full_event][cut_jet[cut_full_event]], 
     "eta": jet_eta[cut_full_event][cut_jet[cut_full_event]], 
     "phi": jet_phi[cut_full_event][cut_jet[cut_full_event]], 
     "mass": jet_mass[cut_full_event][cut_jet[cut_full_event]]},
    with_name="Momentum4D",
)

#'''
muons = ak.zip(
    {"pt": muon_pt[cut_full_event][cut_muon[cut_full_event]], 
     "eta": muon_eta[cut_full_event][cut_muon[cut_full_event]], 
     "phi": muon_phi[cut_full_event][cut_muon[cut_full_event]], 
     "mass": muon_mass[cut_full_event][cut_muon[cut_full_event]]},
    with_name="Momentum4D",
)

met = ak.zip(
    {"pt": met_pt[cut_full_event], 
     "eta": met_eta[cut_full_event], 
     "phi": met_phi[cut_full_event], 
     "mass": 0}, # We assume this is a neutrino with 0 mass
    with_name="Momentum4D",
)
#'''

In [None]:
p4jets = ak.combinations(jets, 6)

p4j1, p4j2, p4j3, p4j4, p4j5, p4j6 = ak.unzip(p4jets)

#p4mu,p4fj,p4j,p4met = ak.unzip(ak.cartesian([muons, fatjets, jets, met]))
#p4j1, p4j2, p4j3, p4j4, p4j5, p4j6, p4mu, p4met = ak.unzip(ak.cartesian([p4jets, muons, met]))
p4j1, p4j2, p4j3, p4j4, p4j5, p4j6, p4mu, p4met = ak.unzip(ak.cartesian([p4j1, p4j2, p4j3, p4j4, p4j5, p4j6, muons, met]))


#p4mu = ak.combinations(muons,1)

#p4mmet = ak.combinations(met,1)


# Calculate a sum of the 4-momenta
#p4tot = p4mu + p4fj + p4j + p4met

In [None]:
p4jets[0][0]

In [None]:
print(p4j1[6])

In [None]:
#p4j1, p4j2, p4j3, p4j4, p4j5, p4j6 = ak.unzip(p4jets)
#p4j2.mass

In [None]:
p4tot = p4j1 + p4j2 + p4j3 + p4j4 + p4j5

p4tot2 = p4j6 + p4mu + met

In [None]:
print(ak.num(p4tot.mass))
print(ak.num(p4tot2.mass))

In [None]:
m = ak.flatten(p4tot.mass)

plt.figure(figsize=(12,4))

plt.subplot(1,3,1)
plt.hist(m, bins=300, range=(0,3000));

plt.subplot(1,3,2)
plt.hist(m, bins=50, range=(0,500));

plt.subplot(1,3,3)
plt.hist(m, bins=25, range=(100,250));

In [None]:
m = ak.flatten(p4tot2.mass)

plt.figure(figsize=(12,4))

plt.subplot(1,3,1)
plt.hist(m, bins=300, range=(0,3000));

plt.subplot(1,3,2)
plt.hist(m, bins=50, range=(0,500));

plt.subplot(1,3,3)
plt.hist(m, bins=25, range=(100,250));

In [None]:
p4tot = p4j1 + p4j2 + p4j3 + p4j4 + p4j5
masses1 = ak.flatten(p4tot.mass).to_numpy().tolist()

p4tot = p4j1 + p4j2 + p4j3 + p4j4 + p4j6
masses1 += ak.flatten(p4tot.mass).to_numpy().tolist()

p4tot = p4j1 + p4j2 + p4j3 + p4j5 + p4j6
masses1 += ak.flatten(p4tot.mass).to_numpy().tolist()

p4tot = p4j1 + p4j2 + p4j4 + p4j5 + p4j6
masses1 += ak.flatten(p4tot.mass).to_numpy().tolist()

p4tot = p4j1 + p4j3 + p4j4 + p4j5 + p4j6
masses1 += ak.flatten(p4tot.mass).to_numpy().tolist()

p4tot = p4j2 + p4j3 + p4j4 + p4j5 + p4j6
masses1 += ak.flatten(p4tot.mass).to_numpy().tolist()

plt.figure(figsize=(12,4))

plt.subplot(1,3,1)
plt.hist(masses1, bins=300, range=(0,3000));

plt.subplot(1,3,2)
plt.hist(masses1, bins=50, range=(0,500));

plt.subplot(1,3,3)
plt.hist(masses1, bins=25, range=(100,250));
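
The six sums above are the six ways of leaving one jet out of six. As a sketch, `itertools.combinations` from the standard library enumerates the same groups, which could be used to build the sums in a loop instead of writing each one by hand. The labels 1 to 6 stand in for `p4j1` to `p4j6`.

```python
from itertools import combinations

jet_labels = [1, 2, 3, 4, 5, 6]

# All ways of choosing 5 jets out of 6, in the order itertools generates them
five_jet_groups = list(combinations(jet_labels, 5))

for group in five_jet_groups:
    left_out = (set(jet_labels) - set(group)).pop()
    print(f"sum jets {group}, leave out jet {left_out}")
```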



In [None]:
p4tot2 = p4j6 + p4mu + met
masses2 = ak.flatten(p4tot2.mass).to_numpy().tolist()

p4tot2 = p4j5 + p4mu + met
masses2 += ak.flatten(p4tot2.mass).to_numpy().tolist()

p4tot2 = p4j4 + p4mu + met
masses2 += ak.flatten(p4tot2.mass).to_numpy().tolist()

p4tot2 = p4j3 + p4mu + met
masses2 += ak.flatten(p4tot2.mass).to_numpy().tolist()

p4tot2 = p4j2 + p4mu + met
masses2 += ak.flatten(p4tot2.mass).to_numpy().tolist()

p4tot2 = p4j1 + p4mu + met
masses2 += ak.flatten(p4tot2.mass).to_numpy().tolist()

plt.figure(figsize=(12,4))

plt.subplot(1,3,1)
plt.hist(masses2, bins=300, range=(0,3000));

plt.subplot(1,3,2)
plt.hist(masses2, bins=50, range=(0,500));

plt.subplot(1,3,3)
plt.hist(masses2, bins=25, range=(100,250));


In [None]:
# Fat jets ---------------------------------------------------

# Soft-drop mass, calculated using a particular algorithm
fatjet_mSD = events['FatJet_msoftdrop'].array()

# A newer tagging algorithm to identify merged top-quark jets
fatjet_tag = events['FatJet_particleNet_TvsQCD'].array()

# Measures of subjettiness, used in the original analysis
fatjet_tau2 = events['FatJet_tau2'].array()
fatjet_tau3 = events['FatJet_tau3'].array()

# 4-momentum in pt, eta, phi, mass 
fatjet_pt = events['FatJet_pt'].array()
fatjet_eta = events['FatJet_eta'].array()
fatjet_phi = events['FatJet_phi'].array()
fatjet_mass = events['FatJet_mass'].array()


# Jets ---------------------------------------------------
# B-tagging variable
jet_btag = events['Jet_btagDeepB'].array()

# Measure of quality of measurement of jet
jet_jetid = events['Jet_jetId'].array()

# 4-momentum in pt, eta, phi, mass 
jet_pt = events['Jet_pt'].array()
jet_eta = events['Jet_eta'].array()
jet_phi = events['Jet_phi'].array()
jet_mass = events['Jet_mass'].array()


# Muons ---------------------------------------------------
# Muon isolation
muon_iso = events['Muon_miniIsoId'].array()

# Measure of quality of how well the muon is reconstructed
muon_tightId = events['Muon_tightId'].array()

# 4-momentum in pt, eta, phi, mass 
muon_pt = events['Muon_pt'].array()
muon_eta = events['Muon_eta'].array()
muon_phi = events['Muon_phi'].array()
muon_mass = events['Muon_mass'].array()


# MET ------------------------------------------------------
# 3-momentum in pt, eta, phi, mass 
met_pt = events['PuppiMET_pt'].array()
met_eta = 0*events['PuppiMET_pt'].array()  # Fix this to be 0
met_phi = events['PuppiMET_phi'].array() 

# Scalar quantity used in selection criteria
ht_lep = muon_pt + met_pt

## Selections

We are going to create a series of masks that will return only subsets of the data that pass certain criteria. 

We will distinguish between cuts that select *events* and cuts that select *particles*. 

For example, a cut that selects *events* might only select events with MET greater than a certain value, since there is only one MET for any event, or events that are recorded on a Friday. 

A cut that selects certain particles might select muons with some $p_T$ greater than a certain value or jets that are tagged as coming from $b$-quark hadronization. 

When we apply our *particle* cuts, we have to apply them only to the events that also pass the *event* cuts! It can be confusing, so let's start with a simple example. 

In [None]:
# Make a mock Awkward array of 3 events where we record MET and the muon pT.

# There are 3 muons in the first event, 1 muon in the second event and 4 muons in the 3rd event

arr = ak.Array({'MET':[80,20,54], 'muon_pt': [[110, 90, 10], [40], [250, 25, 10, 5]]})

arr

In [None]:
# Let's select events with MET > 25 and muon pT > 30

event_cut = arr['MET'] > 25

muon_cut = arr['muon_pt'] > 30


But what are these python objects?

In [None]:
print(event_cut)
print()

print(muon_cut)

They are just arrays of `True` and `False` that correspond to either the event or the particle passing
that particular cut. 

Note that they have very different shapes. 

`event_cut` has 3 entries corresponding to our 3 events. 

`muon_cut` has the same shape as the `muon_pt` array because there is a `True`/`False` for each muon.

We can now pass these back into our original data array to pull out subsets of the data. 

In [None]:
# When we use the muon cut, we actually have to pick out subsets that pass the event cut
# So the order of this approach is
# arr[event_cut] - select events that pass the event cut
# arr[event_cut]['muon_pt'] - pull out the muon_pt values
# arr[event_cut]['muon_pt'][muon_cut[event_cut]] - select values that pass the muon cut,
#                                                  but make sure the muon cut is selected for 
#                                                  the surviving events

pt = arr[event_cut]['muon_pt'][muon_cut[event_cut]]

print(pt)

It may help you to play around with the above and print out the different steps. 
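
If the nested masking is hard to follow, the same selection can be written with plain Python lists and a loop. This sketch is only meant to show the logic, using the same mock values as above. For real data the Awkward version is the one to use.

```python
met = [80, 20, 54]
muon_pt = [[110, 90, 10], [40], [250, 25, 10, 5]]

selected = []
for event_met, event_muons in zip(met, muon_pt):
    # Event cut: skip the whole event if MET is too low
    if event_met <= 25:
        continue
    # Particle cut: keep only the muons in this event with pT > 30
    selected.append([pt for pt in event_muons if pt > 30])

print(selected)  # [[110, 90], [250]]
```

The second event is dropped entirely because its MET fails the event cut, even though its one muon would have passed the muon cut.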

Now let's make the selections for our data.

In [None]:
# Particle-specific cuts -------------------------------------------

# Fat jet cuts --------------------------------
# Calculate a ratio of subjettiness variables
tau32 = fatjet_tau3/fatjet_tau2

# This is what was used in the old analysis
#cut_fatjet = (tau32>0.67) & (fatjet_eta>-2.4) & (fatjet_eta<2.4) & (fatjet_mSD>105) & (fatjet_mSD<220)

# We simplified it for this lesson
cut_fatjet = (fatjet_pt > 500) & (fatjet_tag > 0.5)

# Muon cuts ----------------------------------
cut_muon = (muon_pt>55) & (muon_eta>-2.4) & (muon_eta<2.4) & \
           (muon_tightId == True) & (muon_iso>1) & (ht_lep>150)

# Non-boosted jet cuts ------------------------
cut_jet = (jet_btag > 0.5) & (jet_jetid>=4)



# Event cuts --------------------------------------------------------
# MET cuts --------------------
cut_met = (met_pt > 50)

# Cuts on number of muons that pass our cuts
cut_nmuons = ak.num(cut_muon[cut_muon]) == 1

# Cut on the event passing the trigger 
cut_trigger = (events['HLT_TkMu50'].array())

# Cut on the number of fat jets that pass our selection criteria
cut_ntop = ak.num(cut_fatjet[cut_fatjet]) == 1

# Create a cut for the full event that is the "and" of all the separate cuts
cut_full_event = cut_trigger & cut_nmuons & cut_met & cut_ntop

## Use the cuts and calculate some values

We can use the [Vector class](https://vector.readthedocs.io/en/latest/)
to help us with our 4-vector arithmetic. 

We create 4-vector objects for the fat jets, jets, muons, and MET. These can be used with the Awkward array class to naturally handle combinations of particles within a given event, for example, calculating all candidates when we have more than 1 jet passing our selection criteria. 

For MET, we make the assumption that the momentum vector represents a neutrino, and so has effectively 0 mass for our purposes. We are still making an incorrect assumption about $\eta$, but this will work for our lesson. 
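
As a reminder of what the vector library computes for us, the invariant mass of a set of particles follows from summing their energies and momentum components. The sketch below does this by hand with the standard `math` module for particles given as $(p_T, \eta, \phi, m)$. The function names are ours and the numbers are made up.

```python
import math


def to_cartesian(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to (E, px, py, pz) in natural units."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz


def invariant_mass(particles):
    """Invariant mass of the summed 4-momenta of a list of (pt, eta, phi, mass)."""
    e, px, py, pz = (sum(c) for c in zip(*(to_cartesian(*p) for p in particles)))
    m2 = e**2 - (px**2 + py**2 + pz**2)
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding errors


# A muon and a massless "neutrino" with eta fixed to 0, back to back in phi
muon = (50.0, 0.0, 0.0, 0.106)
neutrino = (50.0, 0.0, math.pi, 0.0)
print(invariant_mass([muon, neutrino]))
```

Setting $\eta = 0$ for the MET means its longitudinal momentum is taken to be zero, which is the incorrect assumption mentioned above.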

In [None]:
fatjets = ak.zip(
    {"pt": fatjet_pt[cut_full_event][cut_fatjet[cut_full_event]], 
     "eta": fatjet_eta[cut_full_event][cut_fatjet[cut_full_event]], 
     "phi": fatjet_phi[cut_full_event][cut_fatjet[cut_full_event]], 
     "mass": fatjet_mass[cut_full_event][cut_fatjet[cut_full_event]]},
    with_name="Momentum4D",
)

muons = ak.zip(
    {"pt": muon_pt[cut_full_event][cut_muon[cut_full_event]], 
     "eta": muon_eta[cut_full_event][cut_muon[cut_full_event]], 
     "phi": muon_phi[cut_full_event][cut_muon[cut_full_event]], 
     "mass": muon_mass[cut_full_event][cut_muon[cut_full_event]]},
    with_name="Momentum4D",
)

jets = ak.zip(
    {"pt": jet_pt[cut_full_event][cut_jet[cut_full_event]], 
     "eta": jet_eta[cut_full_event][cut_jet[cut_full_event]], 
     "phi": jet_phi[cut_full_event][cut_jet[cut_full_event]], 
     "mass": jet_mass[cut_full_event][cut_jet[cut_full_event]]},
    with_name="Momentum4D",
)

met = ak.zip(
    {"pt": met_pt[cut_full_event], 
     "eta": met_eta[cut_full_event], 
     "phi": met_phi[cut_full_event], 
     "mass": 0}, # We assume this is a neutrino with 0 mass
    with_name="Momentum4D",
)

We make use of some of the cool features of 
[Awkward to calculate the different combinations of particles](https://awkward-array.org/doc/main/getting-started/thinking-in-arrays.html).

We can then use the python objects to calculate the sum of the 4-momentum of all our particles.

In [None]:
# Calculate all the different combinations
p4mu,p4fj,p4j,p4met = ak.unzip(ak.cartesian([muons, fatjets, jets, met]))

# Calculate a sum of the 4-momenta
p4tot = p4mu + p4fj + p4j + p4met

### **Challenge!**

Now we can easily calculate the mass and make a quick and simple histogram, just to see if everything works. 

*Warning!* Make sure to `ak.flatten` your awkward array of masses before you pass the values to a histogram. 

In [None]:
# Get the mass
x = p4tot.mass

print(x)

#ncand_cut = ak.num(x)==1
ncand_cut = ak.num(x)>0

# Plot it!
# Your code here
plt.figure()
plt.hist(ak.flatten(x[ncand_cut]), bins=40, range=(0,4000));
plt.hist(x[ncand_cut][:,0], bins=40, range=(0,4000));


## Processing data files

When we process the collision data, everything is going to be exactly the same, except that we need to select (mask)
the data for which the beam luminosity was correctly calculated. 

This was explained in more detail in a [previous lesson](https://cms-opendata-workshop.github.io/workshop2024-lesson-triggers-lumi/instructor/index.html) so we will just show you how to use it. 

We included a function to do this called `build_lumi_mask` and you imported it at the start of this notebook. It takes in the file (which we downloaded) that has the run information and the `events` `TTree` object. 

You can look at the JSON file if you like. This has the run info for the luminosity calculations.

In [None]:
!head -10 Cert_271036-284044_13TeV_Legacy2016_Collisions16_JSON.txt

Let's try this with a collisions file. 

In [None]:
filenames = get_files_for_dataset("collision", random=True, n=1)

filename = filenames[0]

print(f"Opening...{filename}")
f = uproot.open(filename)

events = f['Events']

nevents = events.num_entries

print(f"{nevents = }")

In [None]:
mask_lumi = build_lumi_mask('Cert_271036-284044_13TeV_Legacy2016_Collisions16_JSON.txt', events)

print(mask_lumi)
print(len(mask_lumi))

Note that the `mask_lumi` is an array of `True` and `False` for each event! So we can add this to our `cut_full_event`
mask, *when we are processing collision data files*. 
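
Conceptually, the certification file maps each run number to a list of `[first, last]` luminosity-section ranges that are considered good. The following is a simplified, event-by-event sketch of that check. It is not the implementation of `build_lumi_mask`, and the run numbers and ranges are invented for illustration.

```python
import json

# Same layout as the certification JSON: run number -> list of [first, last] ranges
good_lumis = json.loads('{"273158": [[1, 100], [150, 200]], "273302": [[1, 50]]}')


def is_good_lumi(run, lumi, good_lumis):
    """Return True if (run, lumi) falls inside one of the certified ranges."""
    ranges = good_lumis.get(str(run), [])  # JSON keys are strings
    return any(first <= lumi <= last for first, last in ranges)


runs = [273158, 273158, 273302, 999999]
lumis = [42, 120, 50, 1]
mask = [is_good_lumi(r, l, good_lumis) for r, l in zip(runs, lumis)]
print(mask)  # [True, False, True, False]
```

The result has one `True` or `False` per event, which is the same shape as the other event-level cuts.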

## Putting it all together

We've shown you all the building blocks to process both simulation and collision datasets. As you can see, in addition 
to processing the data, there is a lot of bookkeeping that has to be done in order to track datasets and files.

*There is no single correct way to do this!* :)

In fact, there are a number of different tools used by CMS experimentalists to do this, like [`coffea`](https://coffeateam.github.io/coffea/).

Some of these hide away all of the extra bookkeeping but for purposes of our lesson, we will make it explicit. 

In the cell below we have taken all of the above code and put it inside a python function and taken out some of the comments, in the interest of brevity. 

The function takes as input
* `filename`: the name of an input NanoAOD ROOT file
* `dataset`:  string to be used to organize output files
* `IS_DATA`: a flag that should be set to `True` if the input datafile is *collision* data.

Again, there is no one right way to analyze all these datafiles but you will usually choose to write out a subset of the data for later analysis. You can do this in any format you like. We have chosen to write out `.csv` files, using the 
[`pandas` python library](https://pandas.pydata.org/) for ease of use. 

We are saving the following kinematic variables. 

* Mass of the $t\overline{t}$ system
* $p_T$ of the muon
* $\eta$ of the muon

Those last two variables will be used to calculate uncertainties downstream in the analysis. 

There are also a few more variables that we are writing out, though we have not explained them yet. 

* `pileup`
* `weight`
* `nevents`
* `N_gen`
* `gw_pos`
* `gw_neg`

Let's give a very short explanation of each, but they will be discussed more in future lessons. 

`pileup` records how many proton-proton interactions were assumed when *simulated* data is processed. 

`weight` is the weight calculated by the Monte Carlo code used to simulate the interaction. It is often 1, but not always. This is a different value for each event. 

`nevents` is just the number of events in that file. We record the same number for each event in a file, just for ease of storage. 

`N_gen` is the difference of `gw_pos` and `gw_neg`. These are, respectively, the number of events in the file with `pos`itive generator weights and the number of events with `neg`ative generator weights. We calculate these over all the events in a file and then store the same value for each event entry, again for ease of storage. 

These will be used to scale our simulation data accordingly when we compare with collision data and will be discussed in more detail in a future lesson. 
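
As a small illustration with made-up weights, the sketch below mirrors the counting done with `ak.count` in the `process_file` function further down, using plain Python lists.

```python
# Invented generator weights for six simulated events
gen_weights = [1.0, 1.0, -1.0, 1.0, -1.0, 1.0]

gw_pos = sum(1 for w in gen_weights if w > 0)  # events with positive weight
gw_neg = sum(1 for w in gen_weights if w < 0)  # events with negative weight
N_gen = gw_pos - gw_neg

print(gw_pos, gw_neg, N_gen)  # 4 2 2
```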

When we write out the `.csv` file, we name the file so that it contains the name of the input NanoAOD file and the dataset (e.g. `ttsemilep`) we are processing. 
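
For example, the name is built from the last part of the input path with its extension removed. The file path below is a placeholder, not a real Open Data file.

```python
dataset = "ttsemilep"
filename = "root://example.host//store/some/path/ABC123-DEF456.root"  # placeholder

# Take the text after the last '/' and before the first '.'
stem = filename.split('/')[-1].split('.')[0]
outfilename = f"OUTPUT_{dataset}_{stem}.csv"

print(outfilename)  # OUTPUT_ttsemilep_ABC123-DEF456.csv
```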

The function returns a pandas dataframe, so you can play around with that, if you like. 

OK, let's process some files!

In [None]:
def process_file(filename, dataset='default', IS_DATA=False):
    
    print(f"Opening...{filename}")
    
    try:
        f = uproot.open(filename)
    except Exception:
        print(f"Could not open {filename}")
        return None
        
    events = f['Events']

    nevents = events.num_entries

    print(f"{nevents = }")

    # FatJet -----------------------------------------------------
    fatjet_mSD = events['FatJet_msoftdrop'].array()

    fatjet_tag = events['FatJet_particleNet_TvsQCD'].array()

    fatjet_tau2 = events['FatJet_tau2'].array()
    fatjet_tau3 = events['FatJet_tau3'].array()

    fatjet_pt = events['FatJet_pt'].array()
    fatjet_eta = events['FatJet_eta'].array()
    fatjet_phi = events['FatJet_phi'].array()
    fatjet_mass = events['FatJet_mass'].array()
    
    # Muons -------------------------------------------------------
    muon_pt = events['Muon_pt'].array()
    muon_eta = events['Muon_eta'].array()
    muon_phi = events['Muon_phi'].array()
    muon_mass = events['Muon_mass'].array()

    muon_iso = events['Muon_miniIsoId'].array()

    muon_tightId = events['Muon_tightId'].array()

    
    # Jets -------------------------------------------------------
    jet_btag = events['Jet_btagDeepB'].array()
    jet_jetid = events['Jet_jetId'].array()

    jet_pt = events['Jet_pt'].array()
    jet_eta = events['Jet_eta'].array()
    jet_phi = events['Jet_phi'].array()
    jet_mass = events['Jet_mass'].array()
    
    # MET ---------------------------------------------------------
    met_pt = events['PuppiMET_pt'].array()
    met_eta = 0*events['PuppiMET_pt'].array()  # Fix this to be 0
    met_phi = events['PuppiMET_phi'].array() 

    ht_lep = muon_pt + met_pt
    
    #####################################################################################
    # Cuts
    #####################################################################################

    # Particle-specific cuts --------------------------------------
    tau32 = fatjet_tau3/fatjet_tau2

    #cut_fatjet = (tau32>0.67) & (fatjet_eta>-2.4) & (fatjet_eta<2.4) & (fatjet_mSD>105) & (fatjet_mSD<220)
    cut_fatjet = (fatjet_pt > 500) & (fatjet_tag > 0.5)

    cut_muon = (muon_pt>55) & (muon_eta>-2.4) & (muon_eta<2.4) & \
               (muon_tightId == True) & (muon_iso>1) & (ht_lep>150)

    cut_jet = (jet_btag > 0.5) & (jet_jetid>=4)



    # Event cuts -------------------------------------------------
    cut_met = (met_pt > 50)

    cut_nmuons = ak.num(cut_muon[cut_muon]) == 1
    cut_njets = ak.num(cut_jet[cut_jet]) == 1


    cut_trigger = (events['HLT_TkMu50'].array())
    
    cut_ntop = ak.num(cut_fatjet[cut_fatjet]) == 1

    cut_full_event = None
    if IS_DATA:    
        mask_lumi = build_lumi_mask('Cert_271036-284044_13TeV_Legacy2016_Collisions16_JSON.txt', events)#, verbose=True)
        cut_full_event = cut_trigger & cut_nmuons & cut_met & cut_ntop & mask_lumi
    else:
        cut_full_event = cut_trigger & cut_nmuons & cut_met & cut_ntop
    
    # Apply the cuts and calculate the di-top mass
    fatjets = ak.zip(
        {"pt": fatjet_pt[cut_full_event][cut_fatjet[cut_full_event]], 
         "eta": fatjet_eta[cut_full_event][cut_fatjet[cut_full_event]], 
         "phi": fatjet_phi[cut_full_event][cut_fatjet[cut_full_event]], 
         "mass": fatjet_mass[cut_full_event][cut_fatjet[cut_full_event]]},
        with_name="Momentum4D",
    )

    muons = ak.zip(
        {"pt": muon_pt[cut_full_event][cut_muon[cut_full_event]], 
         "eta": muon_eta[cut_full_event][cut_muon[cut_full_event]], 
         "phi": muon_phi[cut_full_event][cut_muon[cut_full_event]], 
         "mass": muon_mass[cut_full_event][cut_muon[cut_full_event]]},
        with_name="Momentum4D",
    )

    jets = ak.zip(
        {"pt": jet_pt[cut_full_event][cut_jet[cut_full_event]], 
         "eta": jet_eta[cut_full_event][cut_jet[cut_full_event]], 
         "phi": jet_phi[cut_full_event][cut_jet[cut_full_event]], 
         "mass": jet_mass[cut_full_event][cut_jet[cut_full_event]]},
        with_name="Momentum4D",
    )

    met = ak.zip(
        {"pt": met_pt[cut_full_event], 
         "eta": met_eta[cut_full_event], 
         "phi": met_phi[cut_full_event], 
         "mass": 0}, # We assume this is a neutrino with 0 mass
        with_name="Momentum4D",
    )
    
    p4mu,p4fj,p4j,p4met = ak.unzip(ak.cartesian([muons, fatjets, jets, met]))
    
    p4tot = p4mu + p4fj + p4j + p4met
    
    # Shape the weights and pileup
    N_gen = -999
    pileup = -999
    gw_pos = -999
    gw_neg = -999

    pileup_per_candidate = None
    
    tmpval_events = np.ones(len(ak.flatten(p4tot.mass)))
    tmpval = ak.ones_like(p4tot.mass)


    # Put in the MC weights
    if not IS_DATA:
        gen_weights = events['genWeight'].array()[cut_full_event]
        pileup = events['Pileup_nTrueInt'].array()[cut_full_event]

        gen_weights_per_candidate = tmpval * gen_weights
        #print(gen_weights_per_candidate)

        pileup_per_candidate = tmpval * pileup
        #print(pileup_per_candidate)

        # Get values associated with the total number of events. 
        # It's going to duplicate the number of entries, but we'll save the same value to 
        # each event
        gen_weights_org = events['genWeight'].array()

        gw_pos = ak.count(gen_weights_org[gen_weights_org > 0])
        gw_neg = ak.count(gen_weights_org[gen_weights_org < 0])
        N_gen = gw_pos - gw_neg
    else:
        pileup_per_candidate = -999*tmpval
        gen_weights_per_candidate = -999*tmpval
    

    # Build a dictionary and dataframe to write out the subset of data
    # we are interested in
    mydict = {}
    mydict['mtt'] = ak.flatten(p4tot.mass) 
    mydict['mu_pt'] = ak.flatten(p4mu.pt) 
    mydict['mu_abseta'] = np.abs(ak.flatten(p4mu.eta))
    mydict['pileup'] = ak.flatten(pileup_per_candidate)
    mydict['weight'] = ak.flatten(gen_weights_per_candidate)
    mydict['nevents'] = nevents*tmpval_events
    mydict['N_gen'] = N_gen*tmpval_events
    mydict['gw_pos'] = gw_pos*tmpval_events
    mydict['gw_neg'] = gw_neg*tmpval_events

    df = pd.DataFrame.from_dict(mydict)

    outfilename = f"OUTPUT_{dataset}_{filename.split('/')[-1].split('.')[0]}.csv"
    print(f'Saving output to {outfilename}')

    df.to_csv(outfilename, index=False)

    return df

Let's try using that function to process a couple of simulation files and a collision data file.

Depending on the particular dataset you are looking at, the speed of your computer, and how many people are accessing the file,
it may take between 20 and 90 seconds to process a single file. 

In [None]:
dataset = 'ttleptonic'

filenames = get_files_for_dataset(dataset, random=True, n=1)
filename = filenames[0]
df = process_file(filename, dataset=dataset, IS_DATA=False)

df

In [None]:
dataset = 'collision'

filenames = get_files_for_dataset(dataset, random=True, n=1)
filename = filenames[0]

# Don't forget to set IS_DATA to be true!
df = process_file(filename, dataset=dataset, IS_DATA=True)

df

Note that for the collision data, we don't store any information related to the weights or pileup, just the 
total number of events that was in the file. 

Once you have a dataframe, you can make simple plots with it, just by pulling out the values from a column. 

In [None]:
x = df['mu_pt'].values

plt.figure()
plt.hist(x,bins=50,range=(0,500))
plt.xlabel(r'Muon $p_T$ [GeV/c]', fontsize=14)
plt.tight_layout()



In [None]:
# An example of processing multiple files, each one writes out a separate .csv file

dataset = 'tthadronic'

# Get 3 files
filenames = get_files_for_dataset(dataset, random=True, n=3)

for filename in filenames:
    df = process_file(filename, dataset=dataset, IS_DATA=False)
    
    print(df)

We can list the contents of our directory to see the last few files we generated. 

In [None]:
!ls -ltr | tail

And we can make sure there are entries in them. 

In [None]:
!head -5 *tthadronic*.csv