In [1]:
import uproot
import awkward

Let's load the example data in [NanoAOD](https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookNanoAOD) format:

In [2]:
# Open the ROOT file and read its "Events" TTree
tree = uproot.open("data/nano_dy.root")["Events"]
# TTree -> awkward.Array[awkward.Record[str, awkward.Array]]
array = tree.arrays(ak_add_doc=True)

In [3]:
array.show()

[{run: 1, luminosityBlock: 13889, event: 3749778, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749762, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749777, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749768, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749761, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749773, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749781, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749786, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749788, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749783, HTXS_Higgs_pt: 0, ...},
 ...,
 {run: 1, luminosityBlock: 13889, event: 3749862, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749866, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889, event: 3749861, HTXS_Higgs_pt: 0, ...},
 {run: 1, luminosityBlock: 13889

The resulting data is a list of records. Each record represents a single event and holds all of its fields. For example, here is some of the data for the first event in our file:

In [4]:
array[0].show(50)

{run: 1,
 luminosityBlock: 13889,
 event: 3749778,
 HTXS_Higgs_pt: 0,
 HTXS_Higgs_y: nan,
 HTXS_stage1_1_cat_pTjet25GeV: 0,
 HTXS_stage1_1_cat_pTjet30GeV: 0,
 HTXS_stage1_1_fine_cat_pTjet25GeV: 0,
 HTXS_stage1_1_fine_cat_pTjet30GeV: 0,
 HTXS_stage_0: 0,
 HTXS_stage_1_pTjet25: 0,
 HTXS_stage_1_pTjet30: 0,
 HTXS_njets25: 0,
 HTXS_njets30: 0,
 btagWeight_CSVV2: 0.951,
 btagWeight_DeepCSVB: 0.893,
 CaloMET_phi: 2.79,
 CaloMET_pt: 32.1,
 CaloMET_sumEt: 652,
 ChsMET_phi: 2.51,
 ChsMET_pt: 33.7,
 ChsMET_sumEt: 784,
 nCorrT1METJet: 5,
 CorrT1METJet_area: [0.579, 0.449, 0.509, 0.519, 0.638],
 CorrT1METJet_eta: [-2.36, 4.33, 2.27, 3.92, 2.62],
 CorrT1METJet_muonSubtrFactor: [3.59e-08, 1.08e-08, ..., 7.16e-09, -2.98e-08],
 CorrT1METJet_phi: [0.387, 2.03, 1.56, 2.39, -0.405],
 CorrT1METJet_rawPt: [12.9, 15.3, 10.2, 14.9, 9.41],
 nElectron: 0,
 Electron_deltaEtaSC: [],
 Electron_dr03EcalRecHitSumEt: [],
 Electron_dr03HcalDepth1TowerSumEt: [],
 Electron_dr03TkSumPt: [],
 Electron_dr03TkSumPtHEEP: []

## Awkward-zipper example usage

The goal of the awkward-zipper package is to restructure the record of each event. The records are restructured in the same manner as in the [Coffea package](https://coffea-hep.readthedocs.io/en/v2025.1.1/api/coffea.nanoevents.NanoAODSchema.html).

In [5]:
from awkward_zipper import NanoAOD

restructure = NanoAOD(version="latest")
result = restructure(array)

In [6]:
awkward.materialize(result)
result

Now let's go step by step through how awkward-zipper restructures the original NanoAOD data.

## How the new fields are added


In [7]:
from awkward_zipper.kernels import (
    counts2offsets,
    local2globalindex,
)

For example, local index branches with names matching `{source}_{target}Idx*` are converted to global indices for the event chunk (marked with the postfix `G`).
The mapping from each local index to the collection it points into is taken from the `NanoAOD.all_cross_references` dictionary.

In [8]:
local_index = 'Jet_electronIdx1'

# cross_reference = NanoAOD.all_cross_references[local_index]
# global_index = "n" + cross_reference
global_index = 'nElectron'

array["Jet_electronIdx1G"] = local2globalindex(array[local_index], array[global_index])
array["Jet_electronIdx1G"]
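The idea behind the conversion can be sketched in plain Python. This is only an illustration of the concept, not the code in `awkward_zipper.kernels`: the helper names `counts_to_offsets` and `local_to_global` are hypothetical, and treating negative or out-of-range indices as "no match" (kept as `-1`) is an assumption.

```python
from itertools import accumulate


def counts_to_offsets(counts):
    """Turn per-event counts into cumulative offsets, starting at 0."""
    return [0, *accumulate(counts)]


def local_to_global(local_index, target_counts):
    """Map per-event local indices to positions in the flattened target collection."""
    offsets = counts_to_offsets(target_counts)
    result = []
    for event, indices in enumerate(local_index):
        start = offsets[event]
        n_target = target_counts[event]
        # Assumption: negative or out-of-range indices mean "no match" and stay -1.
        result.append([start + i if 0 <= i < n_target else -1 for i in indices])
    return result


# Three events with 2, 0 and 1 electrons; each inner list has one entry per jet.
n_electron = [2, 0, 1]
jet_electron_idx = [[0, -1, 1], [-1], [0, -1]]
global_idx = local_to_global(jet_electron_idx, n_electron)
```

A global index points into the electron collection of the whole chunk once it is flattened across events, so a jet can look up its electron without knowing which event it belongs to.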

The main difference between awkward-zipper and coffea is how functions like `local2globalindex` work.

awkward-zipper performs its internal calculations on awkward arrays, while coffea performs them on [forms and buffers](https://awkward-array.org/doc/main/reference/generated/ak.to_buffers.html).

This change should make it easier for users to create their own "schemas" or modify existing ones.




## Grouping fields by name

The fields are then grouped by name as follows:

If a branch named `{name}` exists and no branches start with `{name}_`, it is interpreted as a single flat array.

In [9]:
# Example: each event has exactly one run ID. The interpreted flat array looks like this:
result.run

If a branch named `{name}` and a branch named `n{name}` both exist, and no branches start with `{name}_`, they are interpreted as a single jagged array.

In [10]:
# Example: each event has a flat array of PS weights. The interpreted single jagged array looks like this:
result.PSWeight

If no branch named `{name}` exists and several branches start with `{name}_`, they are interpreted as a flat table.

In [11]:
# Example: each event has a SINGLE Generator, which is a record of Generator parameters. These parameters can be scalars or flat arrays. The interpreted flat table looks like this:
result.Generator

If a branch named `n{name}` exists and several branches start with `{name}_`, they are interpreted as a jagged table.

In [12]:
# Example: each event has an array of Jets. Each Jet is a record of Jet parameters. These parameters can be scalars or flat arrays. The interpreted jagged table looks like this:
result.Jet
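The four grouping rules can be summarized as a small decision function. This is a minimal sketch that works on branch names only; the function `classify` and the `"unclassified"` fallback are hypothetical and are not part of awkward-zipper.

```python
def classify(name, branches):
    """Return how a collection called `name` would be interpreted."""
    has_self = name in branches                      # branch `{name}`
    has_counter = f"n{name}" in branches             # branch `n{name}`
    has_members = any(b.startswith(f"{name}_") for b in branches)  # `{name}_*`

    if has_members:
        if has_counter:
            return "jagged table"
        if not has_self:
            return "flat table"
        return "unclassified"  # combination not covered by the rules
    if has_self:
        return "single jagged array" if has_counter else "single flat array"
    return "unclassified"


branches = {
    "run",
    "PSWeight", "nPSWeight",
    "Generator_weight", "Generator_x1",
    "nJet", "Jet_pt", "Jet_eta",
}
kinds = {name: classify(name, branches) for name in ("run", "PSWeight", "Generator", "Jet")}
```

Here `kinds` maps `run` to a single flat array, `PSWeight` to a single jagged array, `Generator` to a flat table and `Jet` to a jagged table, matching the four examples shown in the cells.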

Finally, all collections are zipped into one NanoEvents record and returned.

Final result:

In [13]:
result

In [14]:
result.Jet.mass

## Zipper with virtual arrays

Let's load the same ROOT file, this time as virtual arrays. Virtual arrays do not read the data from disk until it is needed; in other words, the data is not materialized up front.

In [15]:
# Open the ROOT file and read its "Events" TTree
tree = uproot.open("data/nano_dy.root")["Events"]
# Log that records which buffers get materialized
access_log = []
# TTree -> awkward.Array[awkward.Record[str, awkward.Array]], loaded as virtual arrays
array = tree.arrays(ak_add_doc=True, access_log=access_log, virtual=True)

Now call the zipper:

In [16]:
restructure = NanoAOD(version="latest")
result = restructure(array)


In [17]:
access_log

[]

In [18]:
result

In [19]:
result.Jet

In [20]:
access_log

[]
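The access log is still empty at this point: neither restructuring nor looking at `result.Jet` has read any branch data. The behaviour can be illustrated with a toy model in plain Python. `LazyColumn` is a hypothetical stand-in written for this sketch and has nothing to do with how uproot or awkward implement virtual arrays.

```python
class LazyColumn:
    """Hypothetical stand-in for a virtual array: loads its data on first access."""

    def __init__(self, name, loader, log):
        self.name = name
        self._loader = loader
        self._log = log
        self._data = None

    def materialize(self):
        if self._data is None:
            self._log.append(self.name)  # record the first real read
            self._data = self._loader()
        return self._data


log = []
columns = {
    "Muon_pt": LazyColumn("Muon_pt", lambda: [[45.1, 40.2], []], log),
    "Jet_pt": LazyColumn("Jet_pt", lambda: [[30.5], [22.0, 18.3]], log),
}

# Regrouping columns into collections only moves references around.
events = {"Muon": {"pt": columns["Muon_pt"]}, "Jet": {"pt": columns["Jet_pt"]}}
log_after_restructure = list(log)  # still empty

# Only an actual computation on the values triggers a read.
muon_pt = events["Muon"]["pt"].materialize()
```

After the last line, `log` contains only `"Muon_pt"`; the jet column was regrouped but never read.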

## Example calculation of a Z-peak

In [21]:
restructure = NanoAOD(version="latest")
result = restructure(array)


In [22]:
zcands = awkward.combinations(result.Muon, 2)

In [23]:
access_log

[Accessed(branch='nMuon', buffer_key="('<root>', 'nMuon')-data")]

In [24]:
# calculate invariant mass
mass = awkward.flatten((zcands["0"] + zcands["1"]).mass)
mass

In [25]:
access_log

[Accessed(branch='nMuon', buffer_key="('<root>', 'nMuon')-data"),
 Accessed(branch='Muon_pt', buffer_key="('<root>', 'Muon_pt')-offsets"),
 Accessed(branch='Muon_pt', buffer_key="('<root>', 'Muon_pt', None)-data"),
 Accessed(branch='Muon_phi', buffer_key="('<root>', 'Muon_phi')-offsets"),
 Accessed(branch='Muon_phi', buffer_key="('<root>', 'Muon_phi', None)-data"),
 Accessed(branch='Muon_eta', buffer_key="('<root>', 'Muon_eta')-offsets"),
 Accessed(branch='Muon_eta', buffer_key="('<root>', 'Muon_eta', None)-data"),
 Accessed(branch='Muon_mass', buffer_key="('<root>', 'Muon_mass')-offsets"),
 Accessed(branch='Muon_mass', buffer_key="('<root>', 'Muon_mass', None)-data"),
 Accessed(branch='Muon_charge', buffer_key="('<root>', 'Muon_charge')-offsets"),
 Accessed(branch='Muon_charge', buffer_key="('<root>', 'Muon_charge', None)-data")]

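The access log shows that only the branches needed for this calculation were loaded: the muon four-vector components (`Muon_pt`, `Muon_eta`, `Muon_phi`, `Muon_mass`) and `Muon_charge`, which were used to add the two muons of each combination.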
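For reference, the same pairing and invariant-mass calculation can be written out by hand for a few muons. This is a plain-Python sketch with made-up muon values; the helpers `to_cartesian` and `pair_mass` are hypothetical and only mirror what the vector addition above does with the `pt`, `eta`, `phi` and `mass` fields.

```python
import math
from itertools import combinations


def to_cartesian(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz


def pair_mass(mu1, mu2):
    """Invariant mass of the sum of two four-vectors."""
    e, px, py, pz = (a + b for a, b in zip(to_cartesian(*mu1), to_cartesian(*mu2)))
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(e**2 - px**2 - py**2 - pz**2, 0.0))


# Made-up events: each muon is (pt, eta, phi, mass) in GeV and radians.
events = [
    [(45.6, 0.0, 0.0, 0.105), (45.6, 0.0, math.pi, 0.105)],  # back-to-back pair
    [(30.0, 1.2, 0.5, 0.105)],                                # single muon, no pair
]

# All unordered muon pairs per event, flattened across events.
masses = [pair_mass(a, b) for muons in events for a, b in combinations(muons, 2)]
```

The first event contributes one pair with a mass close to 91.2 GeV, and the second event contributes none because it has only one muon.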