Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request/suggestion: Awkward Arrays of events #65

Open
jpivarski opened this issue Sep 15, 2022 · 1 comment
Open

Feature request/suggestion: Awkward Arrays of events #65

jpivarski opened this issue Sep 15, 2022 · 1 comment

Comments

@jpivarski
Copy link

I learned about this project from @HDembinski in PyHEP 2022:

btw, I am working an a generator wrapper called impy, which will offer the Pythia8 record in numpy array form

I agree that providing a truly Pythonic interface to not just one but a lot of generators is a great idea!

Providing an iterator that yields NumPy arrays, like numpythia and pyjet, partially accelerates analysis by allowing vectorization over particles, but not over events. Each event has a different number of particles, so such an array can't be a NumPy array, but it can be an Awkward Array.

@aryan26roy and I implemented such an interface in https://github.com/scikit-hep/fastjet and wrote about it in arXiv:2202.03911. The key idea is to minimize the number of Python objects and steps through the Python interpreter:

image

The fastjet library was written with a minimal coupling between Python and C++ (having learned how not to do Python-C++ couplings), passing only simple data types—numbers and borrowed arrays—between the languages, and it lets Python/NumPy own the array buffers, so that they're scoped as Python users expect. However, the fastjet interface also builds the Awkward types manually, and there is now a better way to do it, which requires less maintenance.

This summer, @ManasviGoyal implemented an Awkward Array-builder in header-only C++ that automates this process without a loss of performance (which the ak.ArrayBuilder interface has, if you're familiar with it). Her new awkward::LayoutBuilder specializes an Awkward data structure using C++ templates, which can then be filled and converted to a Python Awkward Array through ak.from_buffers.

The LayoutBuilder documentation is still being integrated into the docs:

https://github.com/scikit-hep/awkward/blob/manasvi/layout-builder-user-guide/docs-sphinx/user-guide/how-to-use-header-only-layoutbuilder.md

but I can give a short walk-through in this issue.

Getting the code

The header-only LayoutBuilder is in this directory (4 files):

https://github.com/scikit-hep/awkward/tree/main/src/awkward/_v2/cpp-headers/awkward

which could be included in the impy project as a git submodule (how I generally deal with C++ header-only dependencies), and it is also shipped in the awkward Python package; you can get -I compiler flags like this:

python -m awkward._v2.config --cflags

LayoutBuilder for records and variable-length lists

An array of events with one-value-per-event attributes and different-length lists of particles can be modeled using only records (structs) and variable-length lists. Here's an example of building that with LayoutBuilder:

  #include "awkward/LayoutBuilder.h"

  using UserDefinedMap = std::map<std::size_t, std::string>;

  template<class PRIMITIVE>
  using NumpyBuilder = awkward::LayoutBuilder::Numpy<PRIMITIVE>;

  template<class PRIMITIVE, class BUILDER>
  using ListOffsetBuilder = awkward::LayoutBuilder::ListOffset<PRIMITIVE, BUILDER>;

  template<class... BUILDERS>
  using RecordBuilder = awkward::LayoutBuilder::Record<UserDefinedMap, BUILDERS...>;

  template<std::size_t field_name, class BUILDER>
  using RecordField = awkward::LayoutBuilder::Field<field_name, BUILDER>;

  enum Field : std::size_t {x, y};

  UserDefinedMap fields_map({
    {Field::x, "x"},
    {Field::y, "y"}});

  ListOffsetBuilder<int64_t,
      RecordBuilder<
          RecordField<Field::x, NumpyBuilder<double>>,
          RecordField<Field::y, ListOffsetBuilder<int64_t,
              NumpyBuilder<int32_t>>
  >>> builder;

This builder can then be filled with

  auto& subbuilder = builder.begin_list();
  subbuilder.set_field_names(fields_map);

  auto& x_builder = subbuilder.field<Field::x>();
  auto& y_builder = subbuilder.field<Field::y>();

  x_builder.append(1.1);
  auto& y_subbuilder = y_builder.begin_list();
  y_subbuilder.append(1);
  y_builder.end_list();

  x_builder.append(2.2);
  y_builder.begin_list();
  y_subbuilder.append(1);
  y_subbuilder.append(2);
  y_builder.end_list();

  builder.end_list();

  builder.begin_list();
  builder.end_list();

  builder.begin_list();

  x_builder.append(3.3);
  y_builder.begin_list();
  y_subbuilder.append(1);
  y_subbuilder.append(2);
  y_subbuilder.append(3);
  y_builder.end_list();

  builder.end_list();

though you would likely do that in a loop (like ArrayBuilder). The above data are equivalent to:

[
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
    [],
    [{"x": 3.3, "y": [1, 2, 3]}]
]

Getting these data into Python

The LayoutBuilder code doesn't have helper methods to pass the data through pybind11 so that different projects can use different binding generators. (It's currently being used in one Cling project, Awkward ↔ RDataFrame, and one Cython project, ctapipe.) So here's an explanation of how to use it with pybind11.

The data comes out of LayoutBuilder as a set of named buffers and a Form (JSON) that tells Awkward Array how to put it all together. Here's an example of making an Awkward Array like that in Python (not from LayoutBuilder):

>>> import awkward._v2 as ak
>>> import numpy as np
>>> container = {            # normally, you would get these buffers from C++
...     "node0-offsets": np.array([0, 2, 2, 3], dtype=np.int64),
...     "node2-data": np.array([1.1, 2.2, 3.3], dtype=np.float64),
...     "node3-offsets": np.array([0, 1, 3, 6], dtype=np.int64),
...     "node4-data": np.array([1, 1, 2, 1, 2, 3], dtype=np.int32),
... }
>>> form = """
... {
...     "class": "ListOffsetArray",
...     "offsets": "i64",
...     "content": {
...         "class": "RecordArray",
...         "contents": {
...             "x": {
...                 "class": "NumpyArray",
...                 "primitive": "float64",
...                 "form_key": "node2"
...             },
...             "y": {
...                 "class": "ListOffsetArray",
...                 "offsets": "i64",
...                 "content": {
...                     "class": "NumpyArray",
...                     "primitive": "int32",
...                     "form_key": "node4"
...                 },
...                 "form_key": "node3"
...             }
...         },
...         "form_key": "node1"
...     },
...     "form_key": "node0"
... }
... """

>>> array = ak.from_buffers(form=form, length=3, container=container)

>>> array.show(type=True)
type: 3 * var * {
    x: float64,
    y: var * int32
}
[[{x: 1.1, y: [1]}, {x: 2.2, y: [1, 2]}],
 [],
 [{x: 3.3, y: [1, 2, 3]}]]

So we just need LayoutBuilder to give us these pieces.

Moreover, we want NumPy to own the array buffers, so that they get deleted when the Awkward Array goes out of Python scope, not when the LayoutBuilder goes out of C++ scope. The hand-off therefore needs a few steps.

  1. First, you ask for the set of buffer names and their sizes, as a number of bytes (not a number of items):
  std::map<std::string, size_t> names_nbytes = {};
  builder.buffer_nbytes(names_nbytes);
  1. Next, you allocate memory for those buffers in Python, presumably with np.empty(nbytes, dtype=np.uint8) and get void* pointers to these buffers by casting the output of numpy_array.ctypes.data (pointer as integer).
  2. Then, you let the LayoutBuilder fill these buffers with
  std::map<std::string, void*> buffers;
  builder.to_buffers(buffers);
  1. Finally, you get the JSON form with
  std::string form = builder.form();

Now you can pass everything over the border from C++ to Python using pybind11's py::buffer_protocol for the buffers, as well as an integer for the length and a string for the Form.

Unlike the fastjet interface, if you ever need to make a change to the format of the records—add, remove, rename, change the type of a field—you don't need to change anything in the Python-C++ interface. All of that is contained in the specialization of the C++ template and the filling procedure, which are both in your C++ code.

Conclusion

So, I gave you a lot of information without even knowing if you're interested in providing this kind of interface, but it's so that you can assess whether you want to undertake this step, with a realistic sense of what it would involve. Getting batches of events per Python iteration rather than single events can have ergonomic and performance benefits, especially in the limit of large numbers of small events.

Thank you for your consideration!

@HDembinski
Copy link
Collaborator

HDembinski commented Sep 15, 2022

Hi Jim, thank you for this note, but I think that adding support for awkward is adding complexity for us without a real performance benefit.

If the computational cost of generating the event and doing computation with the event is large compared to the cost of the Python loop over events, then you do not gain from using awkward. This is most certainly the case here. You can just use ordinary numpy arrays, which is simpler than working with awkward arrays.

As far as I understand, impy + pyhepmc is already faster in generating a HepMC3 output file than the pure C++ code CRMC, although the impy loops over events in Python and the generator output is converted into numpy arrays which are converted back into HepMC3 data structures before they are written to the disk. This seems to indicate that the current approach is already very performant.

If the particle multiplicity per event was small, then awkward should accelerate things, but the multiplicity at the LHC is fairly large.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants