# Columnar data analysis with `DAOD_PHYSLITE`

`DAOD_PHYSLITE` is a prototype format within ATLAS that provides a small, generic format for end-user analysis. A standard set of calibrations is already applied during production, making it suitable for fast downstream processing.
<img src="img/run3_model_focus.png" width="800"/>
(plot from [presentation at CHEP2020](https://doi.org/10.1051/epjconf/202024506014))


The format and the corresponding analysis applications are still under development. This presentation focuses on columnar data analysis of this format with Python tools. For further information, see also

- [VCHEP2021 presentation](https://doi.org/10.1051/epjconf/202125103001)
- code for R&D studies: https://gitlab.cern.ch/nihartma/physlite-experiments

## Reading the data using uproot

The PHYSLITE ROOT files currently follow a structure similar to that of regular ATLAS xAODs. They contain several trees; the one holding the actual data is called `CollectionTree`:

In [None]:
import uproot

In [None]:
f = uproot.open("data/DAOD_PHYSLITE_21.2.108.0.art.pool.root")

In [None]:
f.keys()

All branches are stored with the highest split level, and in most cases the data is stored in branches called `Aux.<something>` or `AuxDyn.<something>`. Typically these are vectors of fundamental types, e.g. pt/eta/phi of particle collections. Uproot can read them into numpy arrays efficiently because the data is stored as contiguous blocks, apart from the 10-byte vector headers, whose positions are known from ROOT's event offsets.
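To illustrate why this layout is cheap to read, here is a minimal sketch on synthetic data. It is not uproot's actual implementation: the helper names `pack_events` and `unpack_events` are hypothetical, the header bytes are filled with zeros instead of real ROOT header content, and the payload is assumed to be big-endian `float32`. The point is only that, once the byte offsets of the events are known, the headers can be removed with one array operation and the rest reinterpreted as a flat array.

```python
import numpy as np

HEADER_BYTES = 10  # per-event vector header size, as stated in the text


def pack_events(events):
    """Build a synthetic buffer: zero-filled header + big-endian float32 payload per event.

    Hypothetical helper, only used to create test data for this sketch.
    """
    chunks, offsets = [], [0]
    for values in events:
        payload = np.asarray(values, dtype=">f4").tobytes()
        chunks.append(b"\x00" * HEADER_BYTES + payload)
        offsets.append(offsets[-1] + HEADER_BYTES + len(payload))
    raw = np.frombuffer(b"".join(chunks), dtype=np.uint8)
    return raw, np.asarray(offsets, dtype=np.int64)


def unpack_events(raw, byte_offsets):
    """Drop all headers with a single mask and view the remaining bytes as float32."""
    starts = byte_offsets[:-1]
    is_header = np.zeros(len(raw), dtype=bool)
    # Positions of all header bytes, computed without a Python loop over events
    is_header[(starts[:, None] + np.arange(HEADER_BYTES)).ravel()] = True
    content = raw[~is_header].view(">f4")
    # Number of floats per event, converted to offsets into the flat content array
    counts = (np.diff(byte_offsets) - HEADER_BYTES) // 4
    content_offsets = np.concatenate([[0], np.cumsum(counts)])
    return content, content_offsets


events = [[1.0, 2.0], [], [3.5]]
raw, byte_offsets = pack_events(events)
content, content_offsets = unpack_events(raw, byte_offsets)

# Only for display: rebuild the per-event lists from flat content + offsets
jagged = [
    content[a:b].tolist()
    for a, b in zip(content_offsets[:-1], content_offsets[1:])
]
print(jagged)
```

The flat `content` array together with `content_offsets` is the same kind of representation that jagged array libraries use internally, so no per-event Python objects need to be created.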

In [None]:
f["CollectionTree"].show("/AnalysisElectronsAuxDyn.(pt|eta|phi)$/i", name_width=30, interpretation_width=50)

The most relevant exception to this are `ElementLink` branches, which provide cross-references into other collections. First, they are often 2-dimensional (`vector<vector<...>>`). Second, their data part (`ElementLink`) is serialized as a structure of two 32-bit unsigned integers: a hash `m_persKey`, identifying the target collection, and an index `m_persIndex`, giving the array index of the corresponding particle in the target collection.

In [None]:
f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"].typename

In [None]:
f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"].streamer

In [None]:
[element.all_members for element in f.file.streamer_named("ElementLinkBase").elements]
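As a rough illustration of the structure described above, the pair of 32-bit unsigned integers can be modelled with a numpy structured dtype. This is a simplified sketch on synthetic bytes: it ignores the vector headers and the nesting present in the real branch, and the key value and the `track_pt` column are made up for the example.

```python
import numpy as np

# One link = two big-endian 32-bit unsigned integers (ROOT serializes big-endian)
link_dtype = np.dtype([("m_persKey", ">u4"), ("m_persIndex", ">u4")])

# Made-up hash value standing in for the key of the target collection
HYPOTHETICAL_KEY = 123456789

# Synthetic flat list of three links, serialized to raw bytes
flat_links = np.array(
    [(HYPOTHETICAL_KEY, 0), (HYPOTHETICAL_KEY, 3), (HYPOTHETICAL_KEY, 1)],
    dtype=link_dtype,
)
raw = flat_links.tobytes()

# Decoding is a reinterpretation of the bytes, without a Python loop
decoded = np.frombuffer(raw, dtype=link_dtype)
indices = decoded["m_persIndex"].astype(np.int64)

# Hypothetical column of the target collection (e.g. track pt values)
track_pt = np.array([10.0, 20.0, 30.0, 40.0])

# Following the links is plain integer indexing into the target column
linked_pt = track_pt[indices]
print(linked_pt)
```

In the real file the links of all electrons in an event sit inside nested vectors, so the headers have to be skipped before such a reinterpretation is possible.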

Uproot can read this, but the loop that deserializes the data runs in Python and is therefore slow. This does not matter for this very small file, but becomes important for larger files.

In [None]:
%%time
f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"].array()

This can be handled by [AwkwardForth](https://doi.org/10.1051/epjconf/202125103002), which, however, is not yet integrated with uproot (as of November 2021). I included a small module that handles the relevant PHYSLITE branches with a function `branch_to_array`, which uses AwkwardForth internally.

One can actually see a significant improvement already for the small file with only 40 events!

In [None]:
from awkward_forth_physlite import branch_to_array

In [None]:
branch_to_array(f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"])

In [None]:
%%timeit
# using standard uproot
f.file.array_cache.clear()
f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"].array()

In [None]:
%%timeit
# using awkward forth
f.file.array_cache.clear()
branch_to_array(f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"])

## Integration with `coffea.nanoevents`

This integration is still in development.

In [None]:
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema

In [None]:
factory = NanoEventsFactory.from_root(
    "data/DAOD_PHYSLITE_21.2.108.0.art.pool.root", "CollectionTree", schemaclass=PHYSLITESchema
)
events = factory.events()

In [None]:
events

In [None]:
events.Electrons

In [None]:
events.Electrons.pt

In [None]:
closest_jets = events.Electrons.nearest(events.Jets)
events.Electrons.delta_r(closest_jets)
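For orientation, the quantity computed above is the angular distance ΔR = sqrt(Δη² + Δφ²), with Δφ wrapped into [-π, π). The following is a plain numpy sketch for a single event with made-up values. It is not how coffea implements `nearest` or `delta_r` (coffea operates on jagged arrays over all events at once), and the function names `delta_r` and `nearest_index` below are local to this example.

```python
import numpy as np


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance sqrt(deta^2 + dphi^2), with dphi wrapped into [-pi, pi)."""
    deta = eta1 - eta2
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(deta, dphi)


def nearest_index(ele_eta, ele_phi, jet_eta, jet_phi):
    """For each electron, return the index of the closest jet and the distance to it."""
    # Broadcasting builds the full (n_electrons, n_jets) distance matrix
    dr = delta_r(
        ele_eta[:, None], ele_phi[:, None], jet_eta[None, :], jet_phi[None, :]
    )
    return dr.argmin(axis=1), dr.min(axis=1)


# Made-up single event: two electrons, three jets
ele_eta = np.array([0.0, 1.0])
ele_phi = np.array([3.0, 0.0])
jet_eta = np.array([0.0, 1.2, -2.0])
jet_phi = np.array([-3.0, 0.1, 1.0])

idx, dr = nearest_index(ele_eta, ele_phi, jet_eta, jet_phi)
print(idx, dr)
```

The first electron shows why the wrapping matters: its φ of 3.0 and the first jet's φ of -3.0 lie on opposite sides of the ±π boundary, so they are much closer than the naive difference of 6.0 suggests.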

## Open questions

- How to handle systematics and more complicated quantities (e.g. MET)?
  - Simplify application of systematics, e.g. parametrized for simple application?
  - Or can we provide an interface to existing C++ CP tools?
- How far can this analysis style be brought upstream?
  - Directly run on raw PHYSLITE content?
  - Or produce smaller ntuples?