# Accessing the ATLAS Open Data

This notebook uses ATLAS Open Data:
http://opendata.atlas.cern

ATLAS Open Data provides open access to proton-proton collision data at the LHC. 

The ATLAS Collaboration makes available approximately 1/10 of the data and the corresponding Monte Carlo samples for didactic purposes. 

The data are made available in the form of ROOT ntuples; their contents are discussed in the introduction to the hands-on exercise.


The purpose of this notebook is to give an example of how to access these ntuples with the uproot package, load the data into a pandas dataframe, and save them as csv files for further use.



## Installation of packages not available by default on colab

In [1]:
import sys
# update the pip package installer
#%pip install --upgrade --user pip
# install required packages
#%pip install --upgrade --user uproot awkward vector numpy matplotlib

#!pip install uproot
#!pip install vector
#!pip install awkward

## Import packages used in the analysis

We're going to be using a number of tools to help us:
* uproot: lets us read .root files typically used in particle physics into data formats used in python
* awkward: lets us store data as awkward arrays, a format that generalizes numpy to nested data with possibly variable length lists
* vector: to allow vectorized 4-momentum calculations
* numpy: provides numerical calculations such as histogramming
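* matplotlib: common tool for making plots, figures, images, visualisations
* pandas: provides the dataframe in which we store the selected variables before writing them to csv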

In [2]:
import uproot # for reading .root files
import awkward as ak # to represent nested data in columnar format
import vector # for 4-momentum calculations
import time # to measure time to analyse
import math # for mathematical functions such as square root
import numpy as np # for numerical calculations such as histogramming
import matplotlib.pyplot as plt
import pandas as pd

[Back to contents](#contents)

<a id='fraction'></a>

## File path

The files contain processes with two leptons in the final state. We consider three files, one for each of the following simulated physics processes:

* pp -> WW -> llvv
* pp -> h -> WW -> llvv
* pp -> tt -> WWbb -> bbllvv

where "l" is an electron or a muon, "v" is a neutrino, which escapes detection, and "b" is a jet produced by the hadronisation of a b-quark.

The files are accessed on the fly from the open data repository at CERN; you may want to download them locally if you want to run faster.

In [3]:
isamp=3 # sample to process: 2 = H->WW, 3 = ttbar, any other value = WW
fn1="https://atlas-opendata.web.cern.ch/atlas-opendata/samples/2020/2lep/MC/mc_363492.llvv.2lep.root"
norm1=0.02473775926465947 # WW normalisation
filename='ww_atlas.csv.gz'
if isamp==2:
  fn1="https://atlas-opendata.web.cern.ch/atlas-opendata/samples/2020/2lep/MC/mc_345324.ggH125_WW2lep.2lep.root"
  norm1=2.652880224948454e-05 # HWW normalisation
  filename='hww_atlas.csv.gz'
if isamp==3:
  fn1="https://atlas-opendata.web.cern.ch/atlas-opendata/samples/2020/2lep/MC/mc_410000.ttbar_lep.2lep.root"
  norm1=0.0916632363839584 # ttbar normalisation
  filename='ttlep_atlas.csv.gz'
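
The chain of `if isamp==...` statements above can also be written as a lookup table. The following is a minimal sketch that reuses the URLs, normalisation constants and output file names from the cell above. The names `BASE_URL`, `SAMPLES` and `select_sample` are hypothetical helpers, not part of the original notebook, and the key 1 for the WW sample is an assumption (the original cell treats WW as the default).

```python
# Hypothetical lookup table, keyed by the same integer used for isamp.
# Assumption: key 1 stands for the WW sample (the default in the original cell).
BASE_URL = "https://atlas-opendata.web.cern.ch/atlas-opendata/samples/2020/2lep/MC/"

SAMPLES = {
    1: ("mc_363492.llvv.2lep.root", 0.02473775926465947, "ww_atlas.csv.gz"),
    2: ("mc_345324.ggH125_WW2lep.2lep.root", 2.652880224948454e-05, "hww_atlas.csv.gz"),
    3: ("mc_410000.ttbar_lep.2lep.root", 0.0916632363839584, "ttlep_atlas.csv.gz"),
}

def select_sample(isamp):
    """Return (url, normalisation, output csv name) for the chosen sample."""
    root_name, norm, csv_name = SAMPLES[isamp]
    return BASE_URL + root_name, norm, csv_name

fn1, norm1, filename = select_sample(3)
print(fn1, norm1, filename)
```

Unlike the original cell, an unknown value of `isamp` raises a `KeyError` here instead of silently falling back to the WW sample.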


## Reading in the file

The ATLAS ntuples are in ROOT format. They also include a lot of information that we do not need for this exercise.

Our aim is to extract the information we need from the ntuples, i.e. the 4-vectors of the two leptons and the missing transverse momentum, and to put it into a pandas dataframe. We can then use the dataframe to study what our events look like, and finally to develop a classification exercise.

The following code cells perform this data extraction and produce the <i>dftot</i> dataframe.

You are encouraged to work through the steps used to create the dataframe; this is very useful if you want to learn how to handle complex HEP ntuples.

For the purposes of this course you can simply run them and take the resulting dataframe dftot as the starting point for further operations.

Steps:

* Read in the tree with uproot https://pypi.org/project/uproot/, and check the number of events and the structure of the tree
* Define the variables of the ntuple which are of interest for our analysis; we have two groups:
  * The vectors of leptons
  * The missing ET and the number of jets
* Store them in a series of awkward arrays https://awkward-array.org/doc/main/index.html
* Convert the arrays to numpy vectors
* Stack them
* Insert them into a pandas dataframe
* Label the columns of the dataframe

In [4]:
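numread=10000 # number of events to read from the file
with uproot.open( fn1 + ":mini") as tree1:
    # print out the content of the ntuple
    # (show() prints the table itself and returns None, so no print() is needed)
    tree1.show()

    numevents1 = tree1.num_entries # number of events

    print("Events in tree",numevents1)
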

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
runNumber            | int32_t                  | AsDtype('>i4')
eventNumber          | int32_t                  | AsDtype('>i4')
channelNumber        | int32_t                  | AsDtype('>i4')
mcWeight             | float                    | AsDtype('>f4')
scaleFactor_PILEUP   | float                    | AsDtype('>f4')
scaleFactor_ELE      | float                    | AsDtype('>f4')
scaleFactor_MUON     | float                    | AsDtype('>f4')
scaleFactor_PHOTON   | float                    | AsDtype('>f4')
scaleFactor_TAU      | float                    | AsDtype('>f4')
scaleFactor_BTAG     | float                    | AsDtype('>f4')
scaleFactor_LepTR... | float                    | AsDtype('>f4')
scaleFactor_Photo... | float                    | AsDtype('>f4')
trigE                | bool                     | AsDtype(

## Load into the structure lep_momentum1 only the variables listed in tupvar

### Meaning of the lepton variables:
* lep_pt, lep_eta, lep_phi, lep_E: components of the 4-momentum of the lepton in collider coordinates
* lep_charge: charge of the lepton, +1 or -1
* lep_type: type of lepton, electron=11, muon=13

### Variables that we want to extract from the ntuple:
* The vectors with the characteristics of leptons in the event
* The two components of the missing transverse momentum: modulus and azimuthal angle
* The number of jets in the event
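
To make the collider coordinates concrete, the sketch below converts (pt, eta, phi, E) to Cartesian components using the standard relations px = pt cos(phi), py = pt sin(phi), pz = pt sinh(eta), and computes the invariant mass of a lepton pair. The function names and the toy values are illustrative and do not come from the notebook.

```python
import numpy as np

def to_cartesian(pt, eta, phi, e):
    """Convert (pt, eta, phi, E) to (px, py, pz, E)."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    return px, py, pz, e

def invariant_mass(p1, p2):
    """Invariant mass of the sum of two 4-momenta given as (px, py, pz, E)."""
    px, py, pz, e = (a + b for a, b in zip(p1, p2))
    m2 = e**2 - px**2 - py**2 - pz**2
    # clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum(m2, 0.0))

# Two massless toy leptons, back to back in the transverse plane (invented values)
pt = np.array([40.0])
eta = np.array([0.0])
energy = pt * np.cosh(eta)  # E = |p| for a massless particle
lep1 = to_cartesian(pt, eta, np.array([0.0]), energy)
lep2 = to_cartesian(pt, eta, np.array([np.pi]), energy)

mll = invariant_mass(lep1, lep2)
print(mll)
```

The `vector` package imported at the top of the notebook offers the same kind of calculation; the explicit formulas are used here only to show what the four stored variables mean.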

In [5]:
lepvar=["lep_pt", "lep_eta", "lep_phi","lep_E",
        "lep_charge","lep_type"]
scalvar=["met_et","met_phi","jet_n"]

weivar=["mcWeight"] # variables to calculate Monte Carlo weight

tupvar=lepvar+scalvar+weivar
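
The two missing transverse momentum variables in `scalvar` are a modulus (`met_et`) and an azimuthal angle (`met_phi`). As a small illustration, the sketch below turns them into x and y components and computes an azimuthal difference wrapped into the range [-pi, pi). The function names and input values are invented for the example.

```python
import numpy as np

def met_components(met_et, met_phi):
    """x and y components of the missing transverse momentum."""
    return met_et * np.cos(met_phi), met_et * np.sin(met_phi)

def delta_phi(phi1, phi2):
    """Azimuthal difference phi1 - phi2, wrapped into [-pi, pi)."""
    return (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi

# Toy values: MET pointing along the y axis
met_x, met_y = met_components(np.array([10.0]), np.array([np.pi / 2]))

# Two directions on either side of the +/- pi boundary
dphi = delta_phi(np.array([3.0]), np.array([-3.0]))
print(met_x, met_y, dphi)
```

The wrapping matters because a naive difference of 3.0 and -3.0 gives 6.0, while the two directions are actually only about 0.28 rad apart.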

In [6]:
lep_momentum1 = tree1.arrays(tupvar,entry_stop=numread,library="ak")

The leptons in each event appear in the ntuple as vectors whose size is the number of leptons in the event (variable lep_n). The events in the data samples used here have at least two leptons.
We want to store the components of the first and second leptons in dedicated vectors with variable names:

In [7]:
colnam=["ptl1","etal1","phil1","el1","chl1","typl1",
        "ptl2","etal2","phil2","el2","chl2","typl2"]

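The order of the names in `colnam` has to match the order in which the columns are filled: all variables of the first lepton, then all variables of the second. As a sketch, the same list can be generated from `lepvar`; the `short` mapping below is a hypothetical helper introduced only for this illustration.

```python
lepvar = ["lep_pt", "lep_eta", "lep_phi", "lep_E", "lep_charge", "lep_type"]

# Hypothetical mapping from ntuple branch name to the short prefix used in colnam
short = {"lep_pt": "pt", "lep_eta": "eta", "lep_phi": "phi",
         "lep_E": "e", "lep_charge": "ch", "lep_type": "typ"}

# Lepton index in the outer loop, variable in the inner loop
colnam_generated = [f"{short[var]}l{i}" for i in (1, 2) for var in lepvar]
print(colnam_generated)
```

Generating the names this way keeps them aligned with `lepvar` if a variable is added or removed.
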
In [8]:
# build a numpy array from the first two entries of each lep_ array
# (assumes that every event has at least 2 leptons)
for i in range(0,2):
    for j in range(0,len(lepvar)):
        if i==0 and j==0:
            ptlep1=ak.to_numpy(lep_momentum1[lepvar[j]][:,i])
        else:
            ptlep1=np.vstack([ptlep1,ak.to_numpy(lep_momentum1[lepvar[j]][:,i])])
# this gives a 2d numpy array with n_var rows and n_event columns;
# transpose it, since in pandas 'features' are columns and 'observations' are rows
ptlep1=ptlep1.transpose()
# create dataframe
dftot = pd.DataFrame(ptlep1)
# add names of columns
dftot.columns=colnam
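
As an aside, the repeated `np.vstack` followed by a transpose gives the same result as a single call to `np.column_stack` on the list of per-variable arrays. The sketch below checks this on invented toy arrays; it is an alternative formulation, not the method used in this notebook.

```python
import numpy as np

# Toy per-variable arrays for four events (invented values)
arrays = [
    np.array([1.0, 2.0, 3.0, 4.0]),      # e.g. pt of lepton 1
    np.array([0.1, 0.2, 0.3, 0.4]),      # e.g. eta of lepton 1
    np.array([10.0, 20.0, 30.0, 40.0]),  # e.g. pt of lepton 2
]

# Approach used in the cell above: grow an (n_var, n_event) array, then transpose
stacked = arrays[0]
for a in arrays[1:]:
    stacked = np.vstack([stacked, a])
loop_result = stacked.transpose()

# One-call alternative: each 1D array becomes one column
direct_result = np.column_stack(arrays)

print(loop_result.shape, direct_result.shape)
```

Both results have one row per event and one column per variable, which is the layout pandas expects.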

In [9]:
# add scalar variables
for i in range(0, len(scalvar)):
   dftot[scalvar[i]]=lep_momentum1[scalvar[i]]

In [10]:
# add normalisation
dftot["totalWeight"]=lep_momentum1[weivar[0]]*norm1*numevents1/numread
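
The weight formula can be checked with toy numbers. The sketch below assumes, for illustration only, that all generator weights are equal to 1, and it uses invented values for the sample size and the normalisation constant. It shows that the factor `numevents1/numread` compensates for reading only the first `numread` events, so that the summed weight of the subset matches that of the full sample.

```python
import numpy as np

numevents = 50_000             # events in the full toy sample (invented)
numread = 10_000               # events actually read
norm = 0.02                    # toy per-event normalisation constant (invented)
mc_weight = np.ones(numread)   # toy generator weights, all equal to 1

# Same formula as in the cell above
total_weight = mc_weight * norm * numevents / numread

# Yield expected if the full sample had been read with weight norm per event
full_sample_yield = numevents * norm
# Yield obtained from the subset with the rescaled weights
subset_yield = total_weight.sum()

print(full_sample_yield, subset_yield)
```

With non-uniform generator weights the two numbers agree only on average, since the subset is a sample of the full file.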

In [11]:
# inspect dataframe
coln=dftot.columns

print(coln)

print(dftot.shape)

print(dftot.head())

dftot.info() # info() prints its summary itself and returns None

Index(['ptl1', 'etal1', 'phil1', 'el1', 'chl1', 'typl1', 'ptl2', 'etal2',
       'phil2', 'el2', 'chl2', 'typl2', 'met_et', 'met_phi', 'jet_n',
       'totalWeight'],
      dtype='object')
(10000, 16)
           ptl1     etal1     phil1            el1  chl1  typl1          ptl2  \
0  68258.742188  0.573097  2.797778   79778.437500  -1.0   13.0  32021.345703   
1  35269.156250 -1.321083 -1.891969   70790.914062   1.0   11.0  13273.810547   
2  59904.750000 -1.798291 -2.042090  185851.484375   1.0   11.0  16296.021484   
3  62116.527344 -0.077092  0.415694   62301.207031   1.0   11.0  30084.500000   
4  52995.558594  0.484091 -0.398400   59327.367188   1.0   11.0  20755.710938   

      etal2     phil2            el2  chl2  typl2         met_et   met_phi  \
0 -0.520212  1.882341   36452.757812   1.0   11.0   25995.283203 -0.886724   
1 -1.938462 -0.744185   47068.914062  -1.0   13.0  127970.046875  2.881174   
2 -2.553680  2.423617  105371.085938  -1.0   13.0   43242.371094 -0.558963   


In [12]:
# write dataframe to csv
dftot.to_csv(filename,compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1},index=False)
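
For further use, the compressed csv can be read back with `pd.read_csv`, which infers gzip compression from the `.gz` extension by default. The sketch below performs a round trip on a toy dataframe written to a temporary directory; the file name and values are invented.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe (invented values)
df = pd.DataFrame({"ptl1": [1.5, 2.25], "jet_n": [2, 0]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_atlas.csv.gz")
    # Same options as in the cell above; a fixed mtime keeps the gzip header
    # identical between runs
    df.to_csv(path,
              compression={"method": "gzip", "compresslevel": 1, "mtime": 1},
              index=False)
    # Compression is inferred from the .gz extension
    df_back = pd.read_csv(path)

print(df_back)
```

Because `index=False` was used when writing, the file contains only the data columns and `read_csv` creates a fresh default index.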

<a id='going_further'></a>