# Python in Data Analysis

iSTEP 2025, instructor Bingxuan Liu (SYSU, liubx28@mail.sysu.edu.cn)

In this tutorial, we will learn how to use Python to perform basic data analysis tasks in HEP experiments. As you will see, there is an increasing trend of utilising open-source, community-maintained tools/packages in our field. This path ensures better knowledge transfer and equips you with the essential skill-set for a broad range of applications, not necessarily in HEP only.     

## Data Structure 

As in data analysis generally, the data structure is a vital component. If we think about the data we usually analyse in particle physics, what features does it have?

### An Event

In particle physics, we consider either collision data or simulated events. An event contains the products of a given physics process (or of several processes at hadron colliders, where pile-up interactions occur), and the products are experiment-specific. The contents, though, depend on the data-processing stage. The data-processing pipeline collects raw detector inputs (pure electronic signals) and converts them to the format needed by the downstream steps. The chain can be quite long, depending on the complexity of the experiment. We will not go into detail in this tutorial. If you are interested, you can find literature by looking up this keyword: **Event Data Model (EDM)**.

This tutorial uses a data format that is ready to be analysed, which means we have direct access to the physics objects, such as jets, leptons and photons. Sometimes even higher-level objects, such as top, W/Z and Higgs candidates, are available. Now, let's ask ourselves the following question:

<center><b>How should we store everything in a coherent way?</b></center>


### Event Structure

Let's say we have an event from collision data collected by ATLAS, what does it usually look like?

|Event   | Jet      | Muon    | Electron | Photon  |
|--------| -------- | ------- | -------- | ------- |
|        | Jet1     | Muon1   | Electron1| Photon1 | 
| Event1 | Jet2     | --      | Electron2| Photon2 |
|        | Jet3     | --      | --       | --      |
|        | Jet1     | Muon1   | Electron1| Photon1 | 
| Event2 | Jet2     | Muon2   | --       | --      |
|        | Jet3     | --      | --       | --      |
|        | Jet4     | --      | --       | --      |
|  ...   |  ...     |  ...    |  ...     |  ...    | 

And for each object such as a jet, we would like to know its $p_{\mathrm{T}}$, $\eta$, $\phi$ and other important properties such as the hadronic energy fraction, etc. For simplicity, let's only consider the 4-vectors of the physics objects. Therefore, the full table becomes:

|Event   | Jet      | Muon    | Electron | Photon  |
|--------| -------- | ------- | -------- | ------- |
|        | Jet1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | Muon1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | Electron1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$]| Photon1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$]| 
| Event1 | Jet2 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | --      | Electron2 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$]| Photon2 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] |
|        | Jet3 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | --      | --       | --      |
|        | Jet1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | Muon1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | Electron1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$]| Photon1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | 
| Event2 | Jet2 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | Muon2 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | --       | --      |
|        | Jet3 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | --      | --       | --      |
|        | Jet4 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | --      | --       | --      |
|  ...   |  ...     |  ...    |  ...     |  ...    |

### Data Structure

The data structure should have the ability to locate every single variable in data, for example, $p_{\mathrm{T}}$ of jet1 in event2. The simplest structure is very common and you must have seen it:

|Event   | Jet1 $p_{\mathrm{T}}$ | Jet1 $\eta$ | Jet1 $\phi$ | Jet1 $m$  | Jet2 $p_{\mathrm{T}}$ | Jet2 $\eta$ | Jet2 $\phi$ | Jet2 $m$  | ...|
|--------| -------- | ------- | -------- | ------- | -------- | ------- | -------- | ------- |------- |
|Event1   | 100 GeV | 1.2 | -1.5 | 15 GeV | 80 GeV | 0.2 | 1.4 | 12 GeV | ...| 
|Event2   | 120 GeV | 0.9 | 2.1 | 8 GeV | NaN | NaN | NaN | NaN | ...| 
|Event3   | ... | ... | ... | ... | ... | ... | ... | ... | ...|

Yep, it is just like the Excel sheets we have all used (or been tortured by). Stored as plain text, this is the CSV format (comma-separated values). As the name indicates, the syntax is straightforward: the values in each row are separated by commas. This is arguably the most widely adopted data format in data analysis, so we encourage you to get familiar with it and use it as the starting point. It also allows you to easily use data produced in other disciplines.

Another more advanced data structure groups similar variables, such as jet $p_\mathrm{T}$. In this case, each event contains a field (column) called "jet $p_\mathrm{T}$", etc. If we present this structure in the above fashion, we will get:

|Event   | Jet $p_\mathrm{T}$ | Jet $\eta$ | Jet $\phi$ | Jet $m$  | ... |
|--------| -------- | ------- | -------- | ------- | ------- |
| Event1 | 100 GeV  | 1.2   | -1.5 | 15 GeV | ... |
|        | 80 GeV   | 0.2   | 1.4  | 12 GeV | ... |
| Event2 | 120 GeV  | 0.9   | 2.1  | 8 GeV  | ... |
|  ...   |  ...     |  ...  |  ... |  ...   | ... |

This is the so-called **"flat n-tuple"**, in which each field (column) is a variable, instead of a physics object as in the first table. If the data contains physics objects such as jets directly, it is the **Analysis Object Data (AOD)**:

|Event   | Jet      | ...    | 
|--------| -------- | ------- | 
| Event1 | Jet1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | ... | 
|        | Jet2 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | ... | 
| Event2 | Jet1 [$p_{\mathrm{T}}$, $\eta$, $\phi$, $m$] | ... | 
|  ...   |  ...     |  ...    |  
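
To make the three layouts concrete, here is a minimal sketch in plain Python and numpy. The numbers are the illustrative values from the tables above, and the variable names (`flat_table`, `jet_pt`, `jet_eta`, `events`) are hypothetical, chosen only for this example.

```python
import numpy as np

# 1) Spreadsheet/CSV-like layout: one row per event, one column per
#    (object, variable) pair. Missing objects must be padded, here with NaN.
#    Columns: jet1 pT, jet1 eta, jet2 pT, jet2 eta
flat_table = np.array([
    [100.0, 1.2, 80.0, 0.2],        # Event1 has two jets
    [120.0, 0.9, np.nan, np.nan],   # Event2 has only one jet
])

# 2) Flat n-tuple layout: one variable-length list per event and per variable.
jet_pt = [[100.0, 80.0], [120.0]]
jet_eta = [[1.2, 0.2], [0.9]]

# 3) Object layout: each event holds a list of jet objects (here dictionaries).
events = [
    [{"pt": pt, "eta": eta} for pt, eta in zip(pts, etas)]
    for pts, etas in zip(jet_pt, jet_eta)
]

# All three layouts locate the same number, e.g. the pT of jet1 in Event2:
print(flat_table[1, 0], jet_pt[1][0], events[1][0]["pt"])
```

The first layout wastes space on padding as soon as events have different numbers of jets, which is why the second and third layouts are preferred in HEP.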


## Python

How is all of this related to Python? Well, every format mentioned above can be seen as arrays. The simplest CSV format is equivalent to a 2D array. Let's see how to handle those first.

### Numpy and Pandas

numpy (https://numpy.org/) is an open-source package specialised in array and matrix operations. Because of this design goal, it is extremely popular in data analysis and machine learning. pandas (https://pandas.pydata.org/) is a Python data analysis library with a lot of nice built-in functionality. In this section, we will go through some basic usage of these two packages.

In [None]:
#Load the libraries
import numpy as np
import pandas as pd

Let's create a numpy array by hand:

In [None]:
jet_pt1 = np.array([100, 120],'d')
print(jet_pt1)
print(jet_pt1.size)
print(jet_pt1.dtype)

We have created an array with two elements, corresponding to the leading jet (jet1) $p_{\mathrm{T}}$. Now let's create a longer one:

In [None]:
jet_pt1 = np.array([100, 120, 80, 70, 150, 200, 160, 250],'d')
print(jet_pt1)
print(jet_pt1.size)
print(jet_pt1.dtype)

In data analysis, we need to select data satisfying a given condition. This can be done via array operations. For instance, let's pick out the events with leading jet $p_{\mathrm{T}}$ larger than 100 GeV.

In [None]:
selection = jet_pt1 > 100
print(selection)
print(jet_pt1[selection])

As you can see, a comparison applied to an array gives you a boolean mask, and you can use this mask to select the data you want. Let's look at a more complicated case:

In [None]:
jet_pt1 = np.array([100, 120, 80, 70, 150, 200, 160, 250],'d')
jet_eta1 = np.array([2.1, 1.5, -1.1, -2.5, 0.8, -0.4, -0.6, 0.9],'d')
selection = (np.abs(jet_eta1) < 2) & (jet_pt1 > 100)
print(selection)
print(jet_pt1[selection])
print(jet_eta1[selection])

In the above example, we applied a selection that requires the leading jet $p_{\mathrm{T}}$ to be larger than 100 GeV and the absolute $\eta$ to be less than 2. You can construct any selection you want, although complicated logic may take some thought. It is much more convenient than the traditional C++ way:

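```
#include <cmath>
#include <vector>
using namespace std;

// jet_pt1 and jet_eta1 are vector<float> objects filled elsewhere
vector<float> selected_jet_pt1;
vector<float> selected_jet_eta1;

// The loop index must be initialised, otherwise the behaviour is undefined
for (size_t i = 0; i < jet_pt1.size(); i++) {
    if (jet_pt1.at(i) > 100 && std::abs(jet_eta1.at(i)) < 2) {
        selected_jet_pt1.push_back(jet_pt1.at(i));
        selected_jet_eta1.push_back(jet_eta1.at(i));
    }
}
```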

You need to gradually get familiar with the numpy array operations. With the help of the internet and AI, you can quickly find solutions to almost all your numpy-related problems. Experience comes with practice.
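
A few details of boolean masks tend to trip up newcomers, so here is a short sketch using the same toy arrays as above. It only relies on standard numpy calls (`np.count_nonzero`, `np.where`); the cut values are arbitrary illustrations.

```python
import numpy as np

jet_pt1 = np.array([100, 120, 80, 70, 150, 200, 160, 250], 'd')
jet_eta1 = np.array([2.1, 1.5, -1.1, -2.5, 0.8, -0.4, -0.6, 0.9], 'd')

# Element-wise logic uses & (and), | (or) and ~ (not).
# When conditions are written inline, each one needs its own parentheses,
# because & binds more tightly than < and >.
central = np.abs(jet_eta1) < 2
hard = jet_pt1 > 100
passed = central & hard          # both conditions
failed = ~passed                 # everything else
either = central | hard          # at least one condition

# Counting and locating the selected events
n_passed = np.count_nonzero(passed)
indices = np.where(passed)[0]

print(n_passed, indices)
print(jet_pt1[failed])
```

Note that the Python keywords `and`, `or` and `not` do not work element-wise on arrays; the operators `&`, `|` and `~` are needed instead.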

pandas, on the other hand, offers higher-level functions to handle datasets. You will find it very useful when you have an input CSV file. It can also be used to create CSV files. Let's first create one:

In [None]:
jet_pt1 = np.array([100, 120, 80, 70, 150, 200, 160, 250],'d')
df = pd.DataFrame(jet_pt1, columns=['jet_pt1']) 
df.to_csv("dummy.csv")

Now you should see a dummy.csv file in the current working directory. Open it and see what it contains. You may realise instantly that there is a column without a header. Next, let's load this file and do some operations.

In [None]:
read_df = pd.read_csv('dummy.csv')
print(read_df)
print(read_df.info())

There is a column with a header called "Unnamed: 0", which is not needed. Let's drop it:

In [None]:
read_df.drop(columns="Unnamed: 0", inplace=True)
print(read_df)
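
The extra column is the DataFrame index, which `to_csv` writes by default. As a small sketch of two standard ways to avoid it, the example below either skips the index on writing (`index=False`) or declares it on reading (`index_col=0`). It writes to in-memory buffers (`io.StringIO`) instead of `dummy.csv` purely to keep the example self-contained.

```python
import io

import numpy as np
import pandas as pd

jet_pt1 = np.array([100, 120, 80, 70, 150, 200, 160, 250], 'd')
df = pd.DataFrame(jet_pt1, columns=['jet_pt1'])

# Option 1: do not write the index at all
buffer_no_index = io.StringIO()
df.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
df_a = pd.read_csv(buffer_no_index)

# Option 2: write the index, then tell read_csv that column 0 is the index
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
df_b = pd.read_csv(buffer_with_index, index_col=0)

print(list(df_a.columns), list(df_b.columns))
```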

We can also add a new column, which is very useful. When trying ML, in particular supervised ML techniques, we need to label the dataset. Suppose the data comes from background and we want to add a new column to represent the label. How do we achieve this?

In [None]:
read_df['label'] = 'background'
print(read_df)

You can always construct new variables and add them to the dataset; the syntax is the same. After reading the data from a CSV file, we may also want to drop certain columns, or select only particular columns.

In [None]:
read_df_X = read_df.drop(columns="label").copy()
read_df_y = read_df['label'].copy()
print(read_df_X)
print(read_df_y)
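
The boolean masks introduced for numpy work on DataFrames as well. The sketch below, with illustrative column values taken from the earlier toy arrays, builds a derived column, selects rows by a condition and selects columns by name. The column name `jet_abs_eta1` is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'jet_pt1': [100., 120., 80., 70., 150., 200., 160., 250.],
    'jet_eta1': [2.1, 1.5, -1.1, -2.5, 0.8, -0.4, -0.6, 0.9],
})

# Derived column, built with the same syntax as adding a label
df['jet_abs_eta1'] = np.abs(df['jet_eta1'])

# Row selection with a boolean mask (note the parentheses around each condition)
mask = (df['jet_pt1'] > 100) & (df['jet_abs_eta1'] < 2)
selected = df[mask]

# Column selection by name; a list of names returns a DataFrame
only_pt = selected[['jet_pt1']]

print(selected)
print(only_pt.shape)
```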

We can stack different columns to create a bigger dataset.

In [None]:
jet_pt1 = np.array([100, 120, 80, 70, 150, 200, 160, 250],'d')
jet_eta1 = np.array([2.1, 1.5, -1.1, -2.5, 0.8, -0.4, -0.6, 0.9],'d')
jet_eta1_and_pt1 = np.column_stack((jet_pt1, jet_eta1))
df = pd.DataFrame(jet_eta1_and_pt1, columns=['jet_pt1', 'jet_eta1']) 
df.to_csv("dummy.csv")
read_df = pd.read_csv('dummy.csv')
print(read_df)
print(read_df.info())

### Awkward Array

Everything is easy, right? The above examples can be handled with vanilla numpy and pandas. One important feature of these examples is that the dimension of the dataset is fixed, and numpy's regular arrays are designed for this rectangular type of data. Let's see a failing example:

In [None]:
jet_pt1 = np.array([100, 120, 80, 70, 150, 200, 160, 250],'d')
jet_eta1 = np.array([2.1, 1.5, -1.1, -2.5, 0.8, -0.4, -0.6],'d')
jet_eta1_and_pt1 = np.column_stack((jet_pt1, jet_eta1))
df = pd.DataFrame(jet_eta1_and_pt1, columns=['jet_pt1', 'jet_eta1']) 
df.to_csv("dummy.csv")
read_df = pd.read_csv('dummy.csv')
print(read_df)
print(read_df.info())

This is a genuine problem because, as we have noticed, in HEP each event may contain a varying number of physics objects. Although we can pad the missing objects with NaN or other special values, it is not convenient. A library that can handle variable-length arrays, also called jagged arrays, is very useful in this field. Here awkward array (https://awkward-array.org/doc/main/index.html) comes to the rescue.

In [None]:
import awkward as ak

jet_eta1_and_pt1 = ak.Array([
    [{"jet_pt1": 100, "jet_eta1": 2.1}], [{"jet_pt1": 120, "jet_eta1": 1.5}],
    [{"jet_pt1": 80, "jet_eta1": -1.1}], [{"jet_pt1": 70, "jet_eta1": -2.5}],
    [{"jet_pt1": 150, "jet_eta1": 0.8}], [{"jet_pt1": 200, "jet_eta1": -0.4}],
    [{"jet_pt1": 160}], [{"jet_pt1": 250, "jet_eta1": -0.6}],
])

print(jet_eta1_and_pt1)
print(jet_eta1_and_pt1["jet_pt1"])
print(jet_eta1_and_pt1["jet_eta1"])

In the above example, we omitted jet_eta1 for the second-to-last entry, but the code did not throw an error; it just filled the missing element with [None]. This is a very naive example. In reality, we should not have a dataset where one particular jet_eta1 is missing (it would mean the data integrity is compromised). But we can see from the earlier tables that events may contain different numbers of jets, etc. So a library that can read common data formats and convert them to jagged arrays efficiently would be very convenient.

### Uproot

uproot (https://github.com/scikit-hep/uproot5) is a pure-Python library that allows us to work with ROOT files. In this tutorial, we attempt to demystify ROOT. Usually, the first data format you encounter in particle physics is ROOT (https://root.cern.ch/). Many introductions explain ROOT through its basic structures such as branches, leaves, etc., but these are nothing more than arrays. If you have understood the tables shown earlier, you have understood the basic structure of a ROOT file. Previously, one had to use C++ or PyROOT to deal with ROOT files, but with uproot we can convert everything to jagged arrays and embrace a pure Python workflow.

We have prepared a test Delphes output for you, in which a heavy particle (Z') decays to two jets. The heavy particle mass is set to 1 TeV, so we are supposed to see a peak around 1 TeV when checking the invariant mass distribution of the jets. ROOT files store the data in TTrees. We will extract the jet info and save the 4-vectors in an array.

In [None]:
# Use the following syntax to read a delphes root file and parse the branches to arrays
# You might want to open the test file using ROOT to see the structure. 
# What is the tree name? How are all the branches defined?
import uproot as r

f = r.open("./data/delphes_zprime_1TeV.root")
delphes_tree = f["Delphes;1"]

pt = delphes_tree["Jet.PT"].array()
eta = delphes_tree["Jet.Eta"].array()
phi = delphes_tree["Jet.Phi"].array()
m = delphes_tree["Jet.Mass"].array()               

Most of the numpy array syntax can be applied seamlessly to awkward arrays, but there are also a few awkward-specific functions.

In [None]:
print(pt)
print(pt.type)
print(ak.num(pt))
print(pt[ak.num(pt) > 2])

OK, so uproot can convert ROOT files to arrays and everything can be done in Python, but how about other precious ROOT libraries? TLorentzVector, RooStats? Good question! The HEP community has put plenty of effort into creating Python alternatives. In this tutorial, we will introduce the vector library (https://github.com/scikit-hep/vector), which supports vector operations. It is a very important package for data analysis in HEP.

In [None]:
import vector

In [None]:
vector.register_awkward()
jet_vec = ak.zip({
  "pt": pt,
  "phi": phi,
  "eta": eta,
  "mass": m,
},with_name="Momentum4D")                      

Now let's check the constructed jet_vec object. We want to select events with at least two jets in order to get the invariant mass. The syntax is also very similar. 

In [None]:
print(len(jet_vec))
jet_vec_select = jet_vec[ak.num(jet_vec) >= 2]
print(len(jet_vec_select))

Then let us get the leading and sub-leading jets. We can assign them to individual Lorentz vectors, after which all Lorentz vector operations supported by the vector library can be used. 

In [None]:
lead_jet = jet_vec_select[:,0] # https://stackoverflow.com/questions/16815928/what-does-mean-on-numpy-arrays
sublead_jet = jet_vec_select[:,1]

dijet = lead_jet + sublead_jet
dijet_mass = dijet.mass
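
The invariant mass of the summed four-vector follows the textbook relation $m^2 = (E_1 + E_2)^2 - |\vec{p}_1 + \vec{p}_2|^2$ in natural units. As a cross-check of what the sum and `.mass` represent, the sketch below evaluates this by hand with numpy for a single pair of made-up jets. The helper `to_cartesian` is hypothetical and not part of the vector library.

```python
import numpy as np

def to_cartesian(pt, eta, phi, m):
    """Convert (pT, eta, phi, m) to (E, px, py, pz) in natural units."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = np.sqrt(px**2 + py**2 + pz**2 + m**2)
    return np.array([energy, px, py, pz])

# Two made-up, massless, back-to-back jets with pT = 500 GeV each
jet1 = to_cartesian(500.0, 0.0, 0.0, 0.0)
jet2 = to_cartesian(500.0, 0.0, np.pi, 0.0)

# Sum the four-vectors component by component, then apply m^2 = E^2 - |p|^2
total = jet1 + jet2
m_squared = total[0]**2 - np.sum(total[1:]**2)
dijet_mass_by_hand = np.sqrt(max(m_squared, 0.0))  # guard against rounding below zero

print(dijet_mass_by_hand)
```

For this back-to-back configuration the momenta cancel, so the mass is simply the sum of the two jet energies.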

## Visualising the data

Visualising the data, or plotting, is a common task in HEP. You can hardly find a publication without a plot showing the data. ROOT has quite an extensive plotting suite, which can satisfy most of your needs. But the rapidly growing usage of ML in HEP encourages us to adopt a more broadly used package. matplotlib (https://matplotlib.org) is a Python plotting library whose pyplot interface was modelled on MATLAB's plotting commands, and it is one of the most widely used plotting tools in the Python ecosystem.

You may think about what kind of plots are useful in HEP and how to make those in matplotlib.

### Histogram

You have to know how to make a histogram, without any doubt. A histogram represents so-called "binned" data: instead of showing the raw data, data points are grouped into bins. This reduces the statistical fluctuations in each bin and captures the big picture. Let's see a few examples below:

In [None]:
import matplotlib.pyplot as plt

# The matplotlib hist function takes 1-D numpy arrays as the input 
# so a step converting awkward array to 1-D numpy is needed

print(pt.type)
print(ak.flatten(pt).type)

plt.figure()
plt.hist(ak.flatten(pt),bins=20,range=(0,1000)); #https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html
plt.xlabel(r'Jet $p_T$ (GeV/c)',fontsize=14)
plt.show()

plt.hist(pt[ak.num(pt) >= 1][:,0],bins=20,range=(0,1000));
plt.xlabel(r'Leading Jet $p_T$ (GeV/c)',fontsize=14)
plt.show()

plt.hist(pt[ak.num(pt) >= 2][:,1],bins=20,range=(0,1000));
plt.xlabel(r'Sub-leading Jet $p_T$ (GeV/c)',fontsize=14)
plt.show()

plt.hist(dijet_mass,bins=20,range=(0,2000));
plt.xlabel(r'Dijet mass (GeV/$c^2$)',fontsize=14)
plt.show()

One important thing to remember is that the hist function returns the bin counts, the bin edges and the drawn patch objects:

In [None]:
counts, bins, patches = plt.hist(ak.flatten(pt),bins=20,range=(0,1000))
print(counts)
print(bins)
print(patches)

This means you can save the histogram contents. You will realise in your career that a plot usually needs a few rounds of formatting before it can be published. If we save the histogram data, we only have to deal with the formatting later on. One good trick is to save a pickle file:

In [None]:
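import pickle

print(counts, bins)

# Context managers make sure the files are closed after writing/reading
with open('pt.pkl', 'wb') as f_out:
    pickle.dump((counts, bins), f_out)

with open('pt.pkl', 'rb') as f_in:
    counts_new, bins_new = pickle.load(f_in)

print(counts_new, bins_new)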

We have preserved the histogram data. This is convenient if you need to completely remake the plot or combine this histogram with something else. Here we would like to emphasise reproducibility in research, which, as we all know, is the very essence of science. We strongly encourage you to build the habit of documenting what you have done and checkpointing your intermediate results.
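
If only the numbers are needed, the same kind of binning can be obtained without drawing anything via `np.histogram`. The sketch below uses made-up $p_{\mathrm{T}}$ values and, assuming unweighted events, attaches the usual Poisson uncertainty $\sqrt{N}$ to each bin; the variable names are illustrative.

```python
import numpy as np

# Made-up jet pT values in GeV, standing in for ak.flatten(pt)
pt_values = np.array([35.0, 60.0, 120.0, 180.0, 260.0, 410.0, 420.0, 980.0])

# Same style of arguments as plt.hist: number of bins and axis range
counts, edges = np.histogram(pt_values, bins=5, range=(0, 1000))

# Bin centres are convenient for drawing points with error bars later
centres = 0.5 * (edges[:-1] + edges[1:])

# Poisson uncertainty per bin, valid for unweighted event counts
errors = np.sqrt(counts)

for c, n, e in zip(centres, counts, errors):
    print(f"bin centre {c:6.1f} GeV: {n} +- {e:.2f}")
```

The arrays `counts`, `edges` and `errors` can be stored in the same way as the pickle example above and formatted into a plot at any later time.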

### Scatter Plots

When we have multiple variables, we often want to know if there is a correlation between them. A scatter plot is the simplest method. As the name indicates, the data points are scattered across a 2D plane.

In [None]:
plt.scatter(pt[ak.num(pt) >= 2][:,0], pt[ak.num(pt) >= 2][:,1])
plt.xlabel("Leading jet pT [GeV]")
plt.ylabel("Sub-leading jet pT [GeV]")
plt.show()
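
A scatter plot shows correlation visually; a single number that quantifies linear correlation is the Pearson coefficient, available through `np.corrcoef`. The sketch below uses made-up leading and sub-leading $p_{\mathrm{T}}$ values in place of the Delphes arrays.

```python
import numpy as np

# Made-up leading and sub-leading jet pT values in GeV
lead_pt = np.array([500.0, 420.0, 610.0, 300.0, 480.0])
sublead_pt = np.array([450.0, 400.0, 520.0, 250.0, 470.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# element is the Pearson coefficient between the two variables
corr_matrix = np.corrcoef(lead_pt, sublead_pt)
pearson_r = corr_matrix[0, 1]

print(f"Pearson r = {pearson_r:.3f}")
```

A value close to +1 or -1 indicates a strong linear relation, while a value near 0 only rules out a linear one, so the scatter plot remains worth inspecting.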

You can also make a 2D histogram. We leave this as your own exercise.

## Exercise -- Put everything together

Alright, the above info should be sufficient for you to start exploring a specific topic. Here is the practice:

In the data folder, you will find several signal samples and a background sample. Try to use what you have learned today to answer these questions:

1. What is the main feature of the signal? How do those different signal processes compare?
2. What variables can we construct to separate the signal from the background?
3. What analysis strategy can you come up with? Could you use what we have introduced today to illustrate your reasoning?