### Conversions of Awkward Arrays to and from RDataFrame (C++)

ROOT's RDataFrame is a declarative, parallel framework for data analysis and manipulation. RDataFrame reads columnar data through a data source. Transformations can then be applied to the data to select rows or to define new columns, and actions produce results such as histograms.

In [1]:
import awkward as ak

In [2]:
import ROOT

Welcome to JupyROOT 6.28/04


#### From Awkward to RDataFrame

The function for Awkward → RDataFrame conversion is ak.to_rdataframe().

The function takes a dictionary of the form { \<column name string\> : \<awkward array\> } and always returns a cppyy.gbl.ROOT.RDF.RInterface object.

In [3]:
array_x = ak.Array(
    [
        {"x": [1.1, 1.2, 1.3]},
        {"x": [2.1, 2.2]},
        {"x": [3.1]},
        {"x": [4.1, 4.2, 4.3, 4.4]},
        {"x": [5.1]},
    ]
)
array_y = ak.Array([1, 2, 3, 4, 5])
array_z = ak.Array([[1.1], [2.1, 2.3, 2.4], [3.1], [4.1, 4.2, 4.3], [5.1]])

The arrays given for the columns must all have the same length:

In [4]:
assert len(array_x) == len(array_y) == len(array_z)
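A failed assertion does not say which column is at fault. As a minimal sketch, a small helper can report the offending lengths before the dictionary is passed to ak.to_rdataframe(). The name check_equal_lengths is hypothetical and not part of Awkward Array. The helper relies only on len(), so it should work for ak.Array objects as well as for the plain lists used here:

```python
def check_equal_lengths(columns):
    """Raise ValueError unless every column has the same length.

    `columns` maps column names to any sized objects (ak.Array, list, ...).
    Returns the common length (0 for an empty dictionary).
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    # All lengths agree, so any one of them is the common length.
    return next(iter(lengths.values()), 0)


# Plain lists stand in for the Awkward Arrays defined above.
n_rows = check_equal_lengths({"y": [1, 2, 3], "z": [[1.1], [2.1, 2.3], [3.1]]})
print(n_rows)  # prints 3
```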

The dictionary key defines a column name in RDataFrame.

In [5]:
df = ak.to_rdataframe({"x": array_x, "y": array_y, "z": array_z})

The ak.to_rdataframe() function presents a generated-on-demand Awkward Array view as an RDataFrame source. There is a small overhead of generating Awkward RDataSource C++ code. This operation does not execute the RDataFrame event loop. The array data are not copied.

The column readers are generated based on the run-time type of the views. Here is a description of the RDataFrame columns:

In [6]:
df.Describe().Print()

Dataframe from datasource Custom Datasource

Property                Value
--------                -----
Columns in total            3
Columns from defines        0
Event loops run             0
Processing slots            1

Column  Type                            Origin
------  ----                            ------
x       awkward::Record_9q0lAle34wU     Dataset
y       int64_t                         Dataset
z       ROOT::VecOps::RVec<double>      Dataset

In [7]:
array = ak.from_rdataframe(
    df,
    columns=(
        "x",
        "y",
        "z",
    ),
)
array

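The x column contains an Awkward Array with a generated type name, for example awkward::Record_cKnX5DyNVM. The hashed suffix can differ from one session to another, as it does in the output above.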

Awkward Arrays are dynamically typed, so in a C++ context the type name is generated from a hash. In practice, there is no need to know the type: C++ code should use the placeholder type specifier auto, and the type of the declared variable is then deduced automatically from its initializer.
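To illustrate, the sketch below defines a new column that sums the nested values of the x column, with auto standing in for the generated view types. It assumes, as in the analysis example later in this section, that record fields are exposed as C++ member functions (x.x()) and that the resulting list view supports size() and indexing. The helper name add_sum_column and the column name sum_x are illustrative only.

```python
# C++ body for RDataFrame's Define. `auto` lets the compiler deduce the
# generated Awkward view types, so no hashed type name appears in the code.
SUM_X_EXPR = """
double total = 0.0;
auto values = x.x();      // field "x" of the record stored in column "x"
auto n = values.size();   // assumed: the list view provides size()
for (decltype(n) i = 0; i < n; ++i) {
    total += values[i];
}
return total;
"""


def add_sum_column(df, name="sum_x"):
    """Return a new RDataFrame node with the summed-x column attached.

    `df` is expected to be an RDataFrame node, such as the object
    returned by ak.to_rdataframe().
    """
    return df.Define(name, SUM_X_EXPR)
```

With the dataframe created in In [5], add_sum_column(df) would return a node from which the new column can be requested like any other, for instance through the columns argument of ak.from_rdataframe().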

#### From RDataFrame to Awkward

The function for RDataFrame → Awkward conversion is ak.from_rdataframe(). Its columns argument accepts a tuple of strings naming the RDataFrame columns to convert. By default, the function returns an ak.Array.

In [8]:
array = ak.from_rdataframe(
    df,
    columns=(
        "x",
        "y",
        "z",
    ),
)
array

When RDataFrame runs multi-threaded event loops, the entry processing order is not guaranteed:

In [9]:
ROOT.ROOT.EnableImplicitMT()

Let’s recreate the dataframe to reflect the new multi-threading mode:

In [10]:
df = ak.to_rdataframe({"x": array_x, "y": array_y, "z": array_z})

If the keep_order parameter is set to True, the entries of the returned columns keep their original order after filtering:

In [11]:
df = df.Filter("y % 2 == 0")

array = ak.from_rdataframe(
    df,
    columns=(
        "x",
        "y",
        "z",
    ),
    keep_order=True,
)
array
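Why ordering needs explicit care can be shown with a toy model that uses only the Python standard library. This is a conceptual illustration under stated assumptions, not the implementation used by ak.from_rdataframe(): worker threads filter separate ranges of entries, results are collected in whatever order the workers finish, and the original order is restored by sorting on an entry index carried along with each value. The function process_chunk and the toy y column are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

Y = [1, 2, 3, 4, 5, 6, 7, 8]  # toy stand-in for a "y" column


def process_chunk(start, stop):
    """Keep the even values in Y[start:stop], tagged with their entry index."""
    return [(entry, Y[entry]) for entry in range(start, stop) if Y[entry] % 2 == 0]


chunks = [(0, 3), (3, 6), (6, 8)]
collected = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(process_chunk, start, stop) for start, stop in chunks]
    # as_completed yields results in completion order, which may vary per run.
    for future in as_completed(futures):
        collected.extend(future.result())

# Sorting by the entry index restores the original row order.
ordered = [value for _, value in sorted(collected)]
print(ordered)  # prints [2, 4, 6, 8]
```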

#### Analysis Example

In [12]:
df = ROOT.RDataFrame('Events', 'root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root')

In [13]:
array = ak.from_rdataframe(
    df,
    columns=("Muon_charge", "Muon_eta", "Muon_mass", "Muon_phi", "Muon_pt", "nMuon"),
)

In [14]:
array.show(type=True)

type: 61540413 * {
    Muon_charge: var * int32,
    Muon_eta: var * float32,
    Muon_mass: var * float32,
    Muon_phi: var * float32,
    Muon_pt: var * float32,
    nMuon: uint32
}
[{Muon_charge: [1, -1], Muon_eta: [1.58, 1.29], Muon_mass: [...], ...},
 {Muon_charge: [-1, 1], Muon_eta: [1.06, -0.853], Muon_mass: [...], ...},
 {Muon_charge: [-1, -1, -1], Muon_eta: [1.82, ..., 1.98], Muon_mass: ..., ...},
 {Muon_charge: [1, -1, -1], Muon_eta: [-2.39, ...], Muon_mass: [...], ...},
 {Muon_charge: [-1, 1], Muon_eta: [-2.11, -1.16], Muon_mass: [...], ...},
 {Muon_charge: [1, -1], Muon_eta: [-0.0816, ...], Muon_mass: [...], ...},
 {Muon_charge: [1, -1, 1], Muon_eta: [-0.0208, ...], Muon_mass: [...], ...},
 {Muon_charge: [-1, -1], Muon_eta: [-0.913, ...], Muon_mass: [...], ...},
 {Muon_charge: [1, 1], Muon_eta: [-0.361, -0.69], Muon_mass: [...], ...},
 {Muon_charge: [1, -1], Muon_eta: [-0.61, -0.601], Muon_mass: [...], ...},
 ...,
 {Muon_charge: [-1, 1, -1], Muon_eta: [-1.41, ..., 1.75], M

In [15]:
df = ak.to_rdataframe({"Events": array})

rdf = df.Filter('Events.nMuon() == 2')\
    .Filter('Events.Muon_charge()[0] != Events.Muon_charge()[1]')\
    .Define("dimuon_mass", """
return std::sqrt(2 * Events.Muon_pt()[0] * Events.Muon_pt()[1]
    * (std::cosh(Events.Muon_eta()[0] - Events.Muon_eta()[1])
    - std::cos(Events.Muon_phi()[0] - Events.Muon_phi()[1])));
""")

In [16]:
rdf.Describe().Print()

Dataframe from datasource Custom Datasource

Property                Value
--------                -----
Columns in total            2
Columns from defines        1
Event loops run             0
Processing slots           12

Column          Type                            Origin
------          ----                            ------
Events          awkward::Record_053ktaRwcM      Dataset
dimuon_mass     float                           Define
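The expression passed to Define for dimuon_mass is the invariant mass of two particles whose rest masses are neglected. As a cross-check outside RDataFrame, the same formula can be written in plain Python. The function name and the input values below are illustrative and do not come from the dataset.

```python
import math


def dimuon_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two muons, neglecting the muon rest mass.

    The result is in the same units as the transverse momenta.
    """
    return math.sqrt(
        2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    )


# Two back-to-back muons (phi differs by pi) at the same eta, each with pt = 45:
# cosh(0) - cos(pi) = 2, so the mass is sqrt(4 * 45 * 45) = 90.
mass = dimuon_mass(45.0, 0.5, 0.0, 45.0, 0.5, math.pi)
print(round(mass, 6))  # prints 90.0
```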