Feature request/suggestion: Awkward Arrays of events #65

jpivarski · 2022-09-15T15:26:11Z

I learned about this project from @HDembinski in PyHEP 2022:

btw, I am working an a generator wrapper called impy, which will offer the Pythia8 record in numpy array form

I agree that providing a truly Pythonic interface to not just one but a lot of generators is a great idea!

Providing an iterator that yields NumPy arrays, like numpythia and pyjet, partially accelerates analysis by allowing vectorization over particles, but not over events. Each event has a different number of particles, so such an array can't be a NumPy array, but it can be an Awkward Array.

@aryan26roy and I implemented such an interface in https://github.com/scikit-hep/fastjet and wrote about it in arXiv:2202.03911. The key idea is to minimize the number of Python objects and steps through the Python interpreter:

The fastjet library was written with a minimal coupling between Python and C++ (having learned how not to do Python-C++ couplings), passing only simple data types—numbers and borrowed arrays—between the languages, and it lets Python/NumPy own the array buffers, so that they're scoped as Python users expect. However, the fastjet interface also builds the Awkward types manually, and there is now a better way to do it, which requires less maintenance.

This summer, @ManasviGoyal implemented an Awkward Array-builder in header-only C++ that automates this process without a loss of performance (which the ak.ArrayBuilder interface has, if you're familiar with it). Her new awkward::LayoutBuilder specializes an Awkward data structure using C++ templates, which can then be filled and converted to a Python Awkward Array through ak.from_buffers.

The LayoutBuilder documentation is still being integrated into the docs:

https://github.com/scikit-hep/awkward/blob/manasvi/layout-builder-user-guide/docs-sphinx/user-guide/how-to-use-header-only-layoutbuilder.md

but I can give a short walk-through in this issue.

Getting the code

The header-only LayoutBuilder is in this directory (4 files):

https://github.com/scikit-hep/awkward/tree/main/src/awkward/_v2/cpp-headers/awkward

which could be included in the impy project as a git submodule (how I generally deal with C++ header-only dependencies), and it is also shipped in the awkward Python package; you can get -I compiler flags like this:

python -m awkward._v2.config --cflags

LayoutBuilder for records and variable-length lists

An array of events with one-value-per-event attributes and different-length lists of particles can be modeled using only records (structs) and variable-length lists. Here's an example of building that with LayoutBuilder:

  #include "awkward/LayoutBuilder.h"

  using UserDefinedMap = std::map<std::size_t, std::string>;

  template<class PRIMITIVE>
  using NumpyBuilder = awkward::LayoutBuilder::Numpy<PRIMITIVE>;

  template<class PRIMITIVE, class BUILDER>
  using ListOffsetBuilder = awkward::LayoutBuilder::ListOffset<PRIMITIVE, BUILDER>;

  template<class... BUILDERS>
  using RecordBuilder = awkward::LayoutBuilder::Record<UserDefinedMap, BUILDERS...>;

  template<std::size_t field_name, class BUILDER>
  using RecordField = awkward::LayoutBuilder::Field<field_name, BUILDER>;

  enum Field : std::size_t {x, y};

  UserDefinedMap fields_map({
    {Field::x, "x"},
    {Field::y, "y"}});

  ListOffsetBuilder<int64_t,
      RecordBuilder<
          RecordField<Field::x, NumpyBuilder<double>>,
          RecordField<Field::y, ListOffsetBuilder<int64_t,
              NumpyBuilder<int32_t>>
  >>> builder;

This builder can then be filled with

  auto& subbuilder = builder.begin_list();
  subbuilder.set_field_names(fields_map);

  auto& x_builder = subbuilder.field<Field::x>();
  auto& y_builder = subbuilder.field<Field::y>();

  x_builder.append(1.1);
  auto& y_subbuilder = y_builder.begin_list();
  y_subbuilder.append(1);
  y_builder.end_list();

  x_builder.append(2.2);
  y_builder.begin_list();
  y_subbuilder.append(1);
  y_subbuilder.append(2);
  y_builder.end_list();

  builder.end_list();

  builder.begin_list();
  builder.end_list();

  builder.begin_list();

  x_builder.append(3.3);
  y_builder.begin_list();
  y_subbuilder.append(1);
  y_subbuilder.append(2);
  y_subbuilder.append(3);
  y_builder.end_list();

  builder.end_list();

though you would likely do that in a loop (like ArrayBuilder). The above data are equivalent to:

[
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
    [],
    [{"x": 3.3, "y": [1, 2, 3]}]
]

Getting these data into Python

The LayoutBuilder code doesn't have helper methods to pass the data through pybind11 so that different projects can use different binding generators. (It's currently being used in one Cling project, Awkward ↔ RDataFrame, and one Cython project, ctapipe.) So here's an explanation of how to use it with pybind11.

The data comes out of LayoutBuilder as a set of named buffers and a Form (JSON) that tells Awkward Array how to put it all together. Here's an example of making an Awkward Array like that in Python (not from LayoutBuilder):

>>> import awkward._v2 as ak
>>> import numpy as np
>>> container = {            # normally, you would get these buffers from C++
...     "node0-offsets": np.array([0, 2, 2, 3], dtype=np.int64),
...     "node2-data": np.array([1.1, 2.2, 3.3], dtype=np.float64),
...     "node3-offsets": np.array([0, 1, 3, 6], dtype=np.int64),
...     "node4-data": np.array([1, 1, 2, 1, 2, 3], dtype=np.int32),
... }
>>> form = """
... {
...     "class": "ListOffsetArray",
...     "offsets": "i64",
...     "content": {
...         "class": "RecordArray",
...         "contents": {
...             "x": {
...                 "class": "NumpyArray",
...                 "primitive": "float64",
...                 "form_key": "node2"
...             },
...             "y": {
...                 "class": "ListOffsetArray",
...                 "offsets": "i64",
...                 "content": {
...                     "class": "NumpyArray",
...                     "primitive": "int32",
...                     "form_key": "node4"
...                 },
...                 "form_key": "node3"
...             }
...         },
...         "form_key": "node1"
...     },
...     "form_key": "node0"
... }
... """

>>> array = ak.from_buffers(form=form, length=3, container=container)

>>> array.show(type=True)
type: 3 * var * {
    x: float64,
    y: var * int32
}
[[{x: 1.1, y: [1]}, {x: 2.2, y: [1, 2]}],
 [],
 [{x: 3.3, y: [1, 2, 3]}]]

So we just need LayoutBuilder to give us these pieces.

Moreover, we want NumPy to own the array buffers, so that they get deleted when the Awkward Array goes out of Python scope, not when the LayoutBuilder goes out of C++ scope. The hand-off therefore needs a few steps.

First, you ask for the set of buffer names and their sizes, as a number of bytes (not a number of items):

  std::map<std::string, size_t> names_nbytes = {};
  builder.buffer_nbytes(names_nbytes);

Next, you allocate memory for those buffers in Python, presumably with np.empty(nbytes, dtype=np.uint8) and get void* pointers to these buffers by casting the output of numpy_array.ctypes.data (pointer as integer).
Then, you let the LayoutBuilder fill these buffers with

  std::map<std::string, void*> buffers;
  builder.to_buffers(buffers);

Finally, you get the JSON form with

  std::string form = builder.form();

Now you can pass everything over the border from C++ to Python using pybind11's py::buffer_protocol for the buffers, as well as an integer for the length and a string for the Form.

Unlike the fastjet interface, if you ever need to make a change to the format of the records—add, remove, rename, change the type of a field—you don't need to change anything in the Python-C++ interface. All of that is contained in the specialization of the C++ template and the filling procedure, which are both in your C++ code.

Conclusion

So, I gave you a lot of information without even knowing if you're interested in providing this kind of interface, but it's so that you can assess whether you want to undertake this step, with a realistic sense of what it would involve. Getting batches of events per Python iteration rather than single events can have ergonomic and performance benefits, especially in the limit of large numbers of small events.

Thank you for your consideration!

The text was updated successfully, but these errors were encountered:

HDembinski · 2022-09-15T16:44:54Z

Hi Jim, thank you for this note, but I think that adding support for awkward is adding complexity for us without a real performance benefit.

If the computational cost of generating the event and doing computation with the event is large compared to the cost of the Python loop over events, then you do not gain from using awkward. This is most certainly the case here. You can just use ordinary numpy arrays, which is simpler than working with awkward arrays.

As far as I understand, impy + pyhepmc is already faster in generating a HepMC3 output file than the pure C++ code CRMC, although the impy loops over events in Python and the generator output is converted into numpy arrays which are converted back into HepMC3 data structures before they are written to the disk. This seems to indicate that the current approach is already very performant.

If the particle multiplicity per event was small, then awkward should accelerate things, but the multiplicity at the LHC is fairly large.

jpivarski mentioned this issue Oct 18, 2022

ak.Array wrongly deduces input type scikit-hep/awkward#1807

Open

jpivarski mentioned this issue Mar 6, 2023

C++ Header files missing conda-forge/awkward-cpp-feedstock#13

Closed

1 task

HDembinski mentioned this issue Nov 7, 2023

Alternative interface to produce many events with varying initial states #188

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request/suggestion: Awkward Arrays of events #65

Feature request/suggestion: Awkward Arrays of events #65

jpivarski commented Sep 15, 2022

HDembinski commented Sep 15, 2022 •

edited

Loading

Feature request/suggestion: Awkward Arrays of events #65

Feature request/suggestion: Awkward Arrays of events #65

Comments

jpivarski commented Sep 15, 2022

Getting the code

LayoutBuilder for records and variable-length lists

Getting these data into Python

Conclusion

HDembinski commented Sep 15, 2022 • edited Loading

HDembinski commented Sep 15, 2022 •

edited

Loading