# AwkwardArrays in Julia for High-Energy Physics Data Analysis

Awkward Array is designed to handle irregular, nested data structures, with Python as its primary language. Related talks:

 * [Reading RNTuple data with Uproot](https://ariostas-talks.github.io/2024-07-02-pyhep-uproot-rntuple/lab/index.html) by Andres Rios-Tascon
 * [Distributed Columnar HEP analysis using coffea + dask](https://github.com/ikrommyd/pyhep2024-coffea-dask) by Iason Krommydas
 * Easy Columnar File Conversion with 'hepconvert' by Zoë Bilodeau
 * A new SymPy backend for vector: uniting experimental and theoretical physicists by Saransh Chopra

The recent integration of Awkward Array into Julia offers new possibilities.

## Why Julia?

Sharing Awkward Array data structures between Python and Julia lets Python users run their analysis both in the ecosystem of their choice and in Julia.

Physicists already use Awkward Array in Python, and data format conversion is the hardest part of hopping across a language boundary.

## Installation and Setup

[PythonCall and JuliaCall](https://juliapy.github.io/PythonCall.jl/stable/) let you call Python code from Julia and Julia code from Python through a symmetric interface.

In [None]:
%pip install juliacall

In [None]:
from juliacall import Main as jl

## Install UnROOT

In [None]:
jl.seval("using Pkg")
jl.seval("Pkg.add(\"UnROOT\")")

## Reading data

In [None]:
jl.seval("using UnROOT")

In [None]:
file = jl.Main.ROOTFile("data/SMHiggsToZZTo4L.root")

In [None]:
file

In [None]:
events = jl.Main.LazyTree(file, "Events")

In [None]:
events

In [None]:
events.Muon_pt

In [None]:
type(events)

In [None]:
muons_pt = events.Muon_pt
muons_eta = events.Muon_eta
muons_phi = events.Muon_phi
muons_mass = events.Muon_mass
muons_charge = events.Muon_charge
muons_isolation = events.Muon_pfRelIso03_all

In [None]:
type(muons_pt)

## Using Julia Functions

In [None]:
jl.seval("""
function my_fun(x, y)
    return 2x .+ y
end
""")

In [None]:
jl.my_fun(2,3)

## Introducing AwkwardArrays

In [None]:
import awkward as ak

In [None]:
jl.seval("using AwkwardArray")

Let's write a function that converts UnROOT's `LazyBranch` objects into an AwkwardArray in Julia and returns that AwkwardArray to Python.

*Note:* Jerry Ling is planning to provide this feature in UnROOT.

In [None]:
jl.seval("""
using AwkwardArray

function make_record_array(events)
    array = AwkwardArray.RecordArray(
        NamedTuple{(:pt, :eta, :phi, :mass, :charge, :isolation)}((
            AwkwardArray.from_iter(events.Muon_pt),
            AwkwardArray.from_iter(events.Muon_eta),
            AwkwardArray.from_iter(events.Muon_phi),
            AwkwardArray.from_iter(events.Muon_mass),
            AwkwardArray.from_iter(events.Muon_charge),
            AwkwardArray.from_iter(events.Muon_pfRelIso03_all),
        ))
    )
    return AwkwardArray.convert(array)
end
""")

In [None]:
%%time
muons = jl.make_record_array(events)

In [None]:
%%time
muons = jl.make_record_array(events)

In [None]:
type(muons)

In [None]:
muons

In [None]:
muons.pt

In [None]:
%%time
muons = ak.zip(
    {
        "pt": events.Muon_pt,
        "eta": events.Muon_eta,
        "phi": events.Muon_phi,
        "mass": events.Muon_mass,
        "charge": events.Muon_charge,
        "isolation": events.Muon_pfRelIso03_all,
    }
)

In [None]:
muons

## Data Analysis Example 

Thanks to Iason Krommydas; see his talk (linked above) for more details.

In [None]:
def find_4lep_kernel(events_leptons, builder):
    """Search for valid 4-lepton combinations from an array of events * leptons {charge, ...}

    A valid candidate has two pairs of leptons that each have balanced charge
    Outputs an array of events * candidates {indices 0..3} corresponding to all valid
    permutations of all valid combinations of unique leptons in each event
    (omitting permutations of the pairs)
    """
    for leptons in events_leptons:
        builder.begin_list()
        nlep = len(leptons)
        for i0 in range(nlep):
            for i1 in range(i0 + 1, nlep):
                if leptons[i0].charge + leptons[i1].charge != 0:
                    continue
                for i2 in range(nlep):
                    for i3 in range(i2 + 1, nlep):
                        if len({i0, i1, i2, i3}) < 4:
                            continue
                        if leptons[i2].charge + leptons[i3].charge != 0:
                            continue
                        builder.begin_tuple(4)
                        builder.index(0).integer(i0)
                        builder.index(1).integer(i1)
                        builder.index(2).integer(i2)
                        builder.index(3).integer(i3)
                        builder.end_tuple()
        builder.end_list()

    return builder
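
To make the kernel's output concrete, here is a minimal pure-Python sketch of the same loop structure. It is an illustration under stated assumptions, not the notebook's implementation: each event is reduced to a list of integer charges, the `ArrayBuilder` is replaced by plain lists, and the name `find_4lep_plain` and the toy input are hypothetical.

```python
def find_4lep_plain(events_charges):
    """Return, per event, all index 4-tuples (i0, i1, i2, i3) in which
    (i0, i1) and (i2, i3) are charge-balanced pairs of four distinct leptons."""
    out = []
    for charges in events_charges:
        candidates = []
        nlep = len(charges)
        for i0 in range(nlep):
            for i1 in range(i0 + 1, nlep):
                if charges[i0] + charges[i1] != 0:
                    continue
                for i2 in range(nlep):
                    for i3 in range(i2 + 1, nlep):
                        if len({i0, i1, i2, i3}) < 4:
                            continue
                        if charges[i2] + charges[i3] != 0:
                            continue
                        candidates.append((i0, i1, i2, i3))
        out.append(candidates)
    return out


# Toy input (made up): one event with charges + - + -, one with only three leptons.
toy_events = [[1, -1, 1, -1], [1, -1, 1]]
toy_result = find_4lep_plain(toy_events)
print(toy_result)
```

Each pairing shows up twice, once with each pair in the first slot; only the order inside a pair is fixed (`i0 < i1`, `i2 < i3`).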


This code was translated to Julia by ChatGPT:

In [None]:
using AwkwardArray

"""
    find_4lep_kernel(events_leptons, builder)

Search for valid 4-lepton combinations from an array of events * leptons {charge, ...}

A valid candidate has two pairs of leptons that each have balanced charge.
Outputs an array of events * candidates {indices 0..3} corresponding to all valid
permutations of all valid combinations of unique leptons in each event
(omitting permutations of the pairs).
"""
function find_4lep_kernel(events_leptons, builder)
    for leptons in events_leptons
        builder.begin_list()
        nlep = length(leptons)
        for i0 in 1:nlep
            for i1 in (i0 + 1):nlep
                if leptons[i0].charge + leptons[i1].charge != 0
                    continue
                end
                for i2 in 1:nlep
                    for i3 in (i2 + 1):nlep
                        if length(Set([i0, i1, i2, i3])) < 4
                            continue
                        end
                        if leptons[i2].charge + leptons[i3].charge != 0
                            continue
                        end
                        builder.begin_tuple(4)
                        builder.index(1).integer(i0 - 1)  # Julia is 1-based, subtract 1 for 0-based indexing
                        builder.index(2).integer(i1 - 1)
                        builder.index(3).integer(i2 - 1)
                        builder.index(4).integer(i3 - 1)
                        builder.end_tuple()
                    end
                end
            end
        end
        builder.end_list()
    end

    return builder
end


In [None]:
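# Imports this method relies on (it is an excerpt from a coffea processor class)
import awkward as ak
import dask_awkward as dak
import hist
import hist.dask as hda
from coffea.nanoevents.methods import candidate

def process(self, events):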
        dataset_axis = hist.axis.StrCategory(
            [], growth=True, name="dataset", label="Primary dataset"
        )
        mass_axis = hist.axis.Regular(
            300, 0, 300, name="mass", label=r"$m_{\mu\mu}$ [GeV]"
        )
        pt_axis = hist.axis.Regular(300, 0, 300, name="pt", label=r"$p_{T,\mu}$ [GeV]")

        h_nMuons = hda.Hist(
            dataset_axis,
            hda.hist.hist.axis.IntCategory(
                range(6), name="nMuons", label="Number of good muons"
            ),
            storage="weight",
            label="Counts",
        )
        h_m4mu = hda.hist.Hist(
            dataset_axis, mass_axis, storage="weight", label="Counts"
        )
        h_mZ1 = hda.hist.Hist(dataset_axis, mass_axis, storage="weight", label="Counts")
        h_mZ2 = hda.hist.Hist(dataset_axis, mass_axis, storage="weight", label="Counts")
        h_ptZ1mu1 = hda.hist.Hist(
            dataset_axis, pt_axis, storage="weight", label="Counts"
        )
        h_ptZ1mu2 = hda.hist.Hist(
            dataset_axis, pt_axis, storage="weight", label="Counts"
        )

        cutflow = dict()

        dataset = events.metadata["dataset"]
        muons = ak.zip(
            {
                "pt": events.Muon_pt,
                "eta": events.Muon_eta,
                "phi": events.Muon_phi,
                "mass": events.Muon_mass,
                "charge": events.Muon_charge,
                "isolation": events.Muon_pfRelIso03_all,
            },
            with_name="PtEtaPhiMCandidate",
            behavior=candidate.behavior,
        )

        # make sure they are sorted by transverse momentum
        muons = muons[ak.argsort(muons.pt, axis=1)]

        cutflow["all events"] = ak.num(muons, axis=0)

        # impose some quality and minimum pt cuts on the muons
        muons = muons[(muons.pt > 5) & (muons.isolation < 0.2)]
        cutflow["at least 4 good muons"] = ak.sum(ak.num(muons) >= 4)
        h_nMuons.fill(dataset=dataset, nMuons=ak.num(muons))

        # reduce first axis: skip events without enough muons
        muons = muons[ak.num(muons) >= 4]

        # find all candidates with helper function
        fourmuon = dak.map_partitions(find_4lep, muons)
        fourmuon = [muons[fourmuon[idx]] for idx in "0123"]

        fourmuon = ak.zip(
            {
                "z1": ak.zip(
                    {
                        "lep1": fourmuon[0],
                        "lep2": fourmuon[1],
                        "p4": fourmuon[0] + fourmuon[1],
                    }
                ),
                "z2": ak.zip(
                    {
                        "lep1": fourmuon[2],
                        "lep2": fourmuon[3],
                        "p4": fourmuon[2] + fourmuon[3],
                    }
                ),
            }
        )

        cutflow["at least one candidate"] = ak.sum(ak.num(fourmuon) > 0)

        # require minimum dimuon mass
        fourmuon = fourmuon[(fourmuon.z1.p4.mass > 60.0) & (fourmuon.z2.p4.mass > 20.0)]
        cutflow["minimum dimuon mass"] = ak.sum(ak.num(fourmuon) > 0)

        # choose permutation with z1 mass closest to nominal Z boson mass
        bestz1 = ak.singletons(ak.argmin(abs(fourmuon.z1.p4.mass - 91.1876), axis=1))
        fourmuon = ak.flatten(fourmuon[bestz1])

        h_m4mu.fill(
            dataset=dataset,
            mass=(fourmuon.z1.p4 + fourmuon.z2.p4).mass,
        )
        h_mZ1.fill(
            dataset=dataset,
            mass=fourmuon.z1.p4.mass,
        )
        h_mZ2.fill(
            dataset=dataset,
            mass=fourmuon.z2.p4.mass,
        )
        h_ptZ1mu1.fill(
            dataset=dataset,
            pt=fourmuon.z1.lep1.pt,
        )
        h_ptZ1mu2.fill(
            dataset=dataset,
            pt=fourmuon.z1.lep2.pt,
        )
        return {
            "nMuons": h_nMuons,
            "mass": h_m4mu,
            "mass_z1": h_mZ1,
            "mass_z2": h_mZ2,
            "pt_z1_mu1": h_ptZ1mu1,
            "pt_z1_mu2": h_ptZ1mu2,
            "cutflow": {dataset: cutflow},
        }
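
The two selection steps near the end of `process` (the dimuon mass cut and the choice of the candidate whose Z1 mass is closest to the nominal Z mass) can be sketched for a single event in plain Python. This is a rough illustration only: the candidate masses are made-up numbers, the helper name `select_best_candidate` is hypothetical, and the code above applies the same idea to whole jagged arrays with `ak.argmin`.

```python
Z_MASS = 91.1876  # nominal Z boson mass in GeV, the value used in process()


def select_best_candidate(candidates, z1_min=60.0, z2_min=20.0):
    """Pick one candidate for a single event.

    candidates: list of (m_z1, m_z2) dimuon masses in GeV.
    Returns the pair that passes both mass cuts and has m_z1 closest to
    Z_MASS, or None when no candidate passes.
    """
    passing = [(m1, m2) for m1, m2 in candidates if m1 > z1_min and m2 > z2_min]
    if not passing:
        return None
    return min(passing, key=lambda c: abs(c[0] - Z_MASS))


# Made-up candidate masses for one event.
toy_candidates = [(88.0, 25.0), (92.0, 15.0), (70.0, 30.0)]
best = select_best_candidate(toy_candidates)
print(best)
```

The second toy candidate has the Z1 mass nearest to 91.1876 GeV, but it is removed by the Z2 mass cut before the closest-mass choice is made, mirroring the order of operations in `process`.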


In [None]:
using Histograms
using Awkward
using DataFrames
using DataFramesMeta

function process(events)
    # Define axes
    dataset_axis = Histograms.StrCategory([], growth=true, name="dataset", label="Primary dataset")
    mass_axis = Histograms.Regular(300, 0, 300, name="mass", label=raw"$m_{\mu\mu}$ [GeV]")
    pt_axis = Histograms.Regular(300, 0, 300, name="pt", label=raw"$p_{T,\mu}$ [GeV]")

    # Define histograms
    h_nMuons = Histograms.Hist(dataset_axis, Histograms.IntCategory(0:5, name="nMuons", label="Number of good muons"), storage="weight", label="Counts")
    h_m4mu = Histograms.Hist(dataset_axis, mass_axis, storage="weight", label="Counts")
    h_mZ1 = Histograms.Hist(dataset_axis, mass_axis, storage="weight", label="Counts")
    h_mZ2 = Histograms.Hist(dataset_axis, mass_axis, storage="weight", label="Counts")
    h_ptZ1mu1 = Histograms.Hist(dataset_axis, pt_axis, storage="weight", label="Counts")
    h_ptZ1mu2 = Histograms.Hist(dataset_axis, pt_axis, storage="weight", label="Counts")

    cutflow = Dict()

    dataset = events.metadata["dataset"]

    # Prepare muons
    muons = Awkward.zip(Dict(
        "pt" => events.Muon_pt,
        "eta" => events.Muon_eta,
        "phi" => events.Muon_phi,
        "mass" => events.Muon_mass,
        "charge" => events.Muon_charge,
        "isolation" => events.Muon_pfRelIso03_all
    ))

    # Sort muons by transverse momentum
    muons = muons[Awkward.argsort(muons.pt, axis=1)]

    cutflow["all events"] = Awkward.num(muons, axis=0)

    # Quality and minimum pt cuts
    muons = muons[(muons.pt .> 5) .& (muons.isolation .< 0.2)]
    cutflow["at least 4 good muons"] = sum(Awkward.num(muons) .>= 4)
    Histograms.fill!(h_nMuons, Dict("dataset" => dataset, "nMuons" => Awkward.num(muons)))

    # Skip events without enough muons
    muons = muons[Awkward.num(muons) .>= 4]

    # Find four-muon candidates
    fourmuon = find_4lep(muons)
    fourmuon = [muons[fourmuon[idx]] for idx in ["0", "1", "2", "3"]]

    fourmuon = Awkward.zip(Dict(
        "z1" => Awkward.zip(Dict(
            "lep1" => fourmuon[1],
            "lep2" => fourmuon[2],
            "p4" => fourmuon[1] + fourmuon[2]
        )),
        "z2" => Awkward.zip(Dict(
            "lep1" => fourmuon[3],
            "lep2" => fourmuon[4],
            "p4" => fourmuon[3] + fourmuon[4]
        ))
    ))

    cutflow["at least one candidate"] = sum(Awkward.num(fourmuon) .> 0)

    # Minimum dimuon mass requirement
    fourmuon = fourmuon[(fourmuon.z1.p4.mass .> 60.0) .& (fourmuon.z2.p4.mass .> 20.0)]
    cutflow["minimum dimuon mass"] = sum(Awkward.num(fourmuon) .> 0)

    # Choose permutation with z1 mass closest to nominal Z boson mass
    bestz1 = Awkward.singletons(Awkward.argmin(abs.(fourmuon.z1.p4.mass .- 91.1876), axis=1))
    fourmuon = Awkward.flatten(fourmuon[bestz1])

    # Fill histograms
    Histograms.fill!(h_m4mu, Dict("dataset" => dataset, "mass" => (fourmuon.z1.p4 + fourmuon.z2.p4).mass))
    Histograms.fill!(h_mZ1, Dict("dataset" => dataset, "mass" => fourmuon.z1.p4.mass))
    Histograms.fill!(h_mZ2, Dict("dataset" => dataset, "mass" => fourmuon.z2.p4.mass))
    Histograms.fill!(h_ptZ1mu1, Dict("dataset" => dataset, "pt" => fourmuon.z1.lep1.pt))
    Histograms.fill!(h_ptZ1mu2, Dict("dataset" => dataset, "pt" => fourmuon.z1.lep2.pt))

    return Dict(
        "nMuons" => h_nMuons,
        "mass" => h_m4mu,
        "mass_z1" => h_mZ1,
        "mass_z2" => h_mZ2,
        "pt_z1_mu1" => h_ptZ1mu1,
        "pt_z1_mu2" => h_ptZ1mu2,
        "cutflow" => Dict(dataset => cutflow)
    )
end


In [None]:
# Sort muons by transverse momentum
muons = muons[ak.argsort(muons.pt, axis=1)]

In [None]:
cutflow = {}
cutflow["all events"] = ak.num(muons, axis=0)

In [None]:
# Quality and minimum pt cuts
# (the Julia translation above writes this with broadcasting: (muons.pt .> 5) .& (muons.isolation .< 0.2))
muons = muons[(muons.pt > 5) & (muons.isolation < 0.2)]
cutflow["at least 4 good muons"] = ak.sum(ak.num(muons) >= 4)

In [None]:
# Skip events without enough muons
muons = muons[ak.num(muons) >= 4]

In [None]:
muons
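
For readers new to jagged-array masking, the cells above can be mirrored on plain Python lists. The sketch below is a toy under stated assumptions: each muon is a `(pt, isolation)` tuple with invented values, the names `toy_events`, `toy_cutflow`, `good` and `selected` are hypothetical, and the thresholds are the ones used in the notebook (`pt > 5`, `isolation < 0.2`, at least four muons).

```python
# Toy events (invented numbers): each muon is (pt [GeV], relative isolation).
toy_events = [
    [(30.0, 0.05), (22.0, 0.10), (12.0, 0.01), (7.0, 0.15), (3.0, 0.02)],
    [(40.0, 0.30), (25.0, 0.05), (10.0, 0.05), (6.0, 0.10)],
    [(15.0, 0.02)],
]

toy_cutflow = {"all events": len(toy_events)}

# Per-muon cut (inner axis): keep muons passing both requirements.
good = [[mu for mu in ev if mu[0] > 5 and mu[1] < 0.2] for ev in toy_events]

# Per-event cut (outer axis): keep events with at least four good muons.
toy_cutflow["at least 4 good muons"] = sum(len(ev) >= 4 for ev in good)
selected = [ev for ev in good if len(ev) >= 4]

print(toy_cutflow)
print([len(ev) for ev in selected])
```

The first mask changes how many muons each event holds without dropping any event; the second drops whole events, which is why the notebook describes it as reducing the first axis.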

In [None]:
jl.seval("""
using AwkwardArray

function find_4lep(events_leptons)

    array = AwkwardArray.ListArray{
        AwkwardArray.Index64,
        AwkwardArray.TupleArray{Tuple{
            AwkwardArray.PrimitiveArray{Int64},
            AwkwardArray.PrimitiveArray{Int64},
            AwkwardArray.PrimitiveArray{Int64},
            AwkwardArray.PrimitiveArray{Int64}}
        }
    }()
    for leptons in events_leptons
        nlep = length(leptons[:charge])
        for i0 in 1:nlep
            for i1 in (i0 + 1):nlep
                if leptons[i0][:charge] + leptons[i1][:charge] != 0
                    continue
                end
                for i2 in 1:nlep
                    for i3 in (i2 + 1):nlep
                        if length(Set([i0, i1, i2, i3])) < 4
                            continue
                        end
                        if leptons[i2][:charge] + leptons[i3][:charge] != 0
                            continue
                        end
                        
                        push!(array.content, (i0 - 1))  # Julia is 1-based, subtract 1 for 0-based indexing
                        push!(array.content, (i1 - 1))
                        push!(array.content, (i2 - 1))
                        push!(array.content, (i3 - 1))
                        
                        AwkwardArray.end_tuple!(array.content)
                    end
                end
            end
        end
        AwkwardArray.end_list!(array)
    end

    return array

end
""")

In [None]:
# Find four-muon candidates
jl.find_4lep(muons)
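
`find_4lep` is meant to return a jagged list of 4-tuples. As a rough mental model, and not a description of the actual AwkwardArray.jl internals, such a structure can be stored as one flat column per tuple slot plus an offsets list marking where each event's candidates start and stop. The class name `ToyListOfTuples` and its methods are hypothetical.

```python
class ToyListOfTuples:
    """Toy columnar store for a jagged list of fixed-width integer tuples."""

    def __init__(self, width):
        self.columns = [[] for _ in range(width)]  # one flat column per tuple slot
        self.offsets = [0]                         # event boundaries

    def append_tuple(self, values):
        if len(values) != len(self.columns):
            raise ValueError("tuple width does not match the store")
        for column, value in zip(self.columns, values):
            column.append(value)

    def end_list(self):
        # Close the current event: record how many tuples exist so far.
        self.offsets.append(len(self.columns[0]))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, event):
        start, stop = self.offsets[event], self.offsets[event + 1]
        return [tuple(col[k] for col in self.columns) for k in range(start, stop)]


store = ToyListOfTuples(4)
store.append_tuple((0, 1, 2, 3))
store.append_tuple((2, 3, 0, 1))
store.end_list()  # event 0 has two candidates
store.end_list()  # event 1 has none
print(store[0], store[1], store.offsets)
```

Closing an event only appends one number to `offsets`, so an event without candidates costs no storage in the columns.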