# Awkward RDataFrame Tutorial

* [Awkward Array and RDataFrame](#Awkward_Array_and_RDataFrame)
* [From Awkward Array to RDataFrame](#From_Awkward_Array_to_RDataFrame)
    * [`ak.to_rdataframe` function](#ak.to_rdataframe_function)
    * [Columns type](#Columns_type)
    * [Operations on data in RDataFrame](#Operations_on_data_in_RDataFrame)
    * [Retrieve selected columns](#Retrieve_selected_columns)
    * [Layout details](#Layout_details)
    * [The same operation on Awkward arrays in Python](#The_same_operation_on_Awkward_arrays_in_Python)
* [From RDataFrame to Awkward Array](#From_RDataFrame_to_Awkward_Array)
* [Data analysis: from C++ to Python](#Data_analysis:_from_C++_to_Python)
    * [Convert selected columns to Awkward Array](#Convert_selected_columns_to_Awkward_Array)

## Awkward Array and RDataFrame <a class="anchor" id="Awkward_Array_and_RDataFrame"></a>

Awkward Array and RDataFrame are two very different ways of performing calculations at scale.

Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.

RDataFrame is ROOT's declarative analysis interface; it supports many input formats.

The Awkward-RDataFrame bridge gives users more flexibility in mixing packages and languages in their analyses, so they can combine the strengths of Python and C++. Physicists can mix analyses using Awkward Arrays, Numba, and ROOT C++ in memory, without saving to disk and without leaving their environment.

## From Awkward Array to RDataFrame <a class="anchor" id="From_Awkward_Array_to_RDataFrame"></a>

In [1]:
import awkward as ak
import ROOT

Welcome to JupyROOT 6.26/10


In [2]:
array_x = ak.Array(
        [
            {"x": [1.1, 1.2, 1.3]},
            {"x": [2.1, 2.2]},
            {"x": [3.1]},
            {"x": [4.1, 4.2, 4.3, 4.4]},
            {"x": [5.1]},
        ]
    )
array_y = ak.Array([1, 2, 3, 4, 5])
array_z = ak.Array([[1.1], [2.1, 2.3, 2.4], [3.1], [4.1, 4.2, 4.3], [5.1]])

In [3]:
array_x.show(type=True)

type: 5 * {
    x: var * float64
}
[{x: [1.1, 1.2, 1.3]},
 {x: [2.1, 2.2]},
 {x: [3.1]},
 {x: [4.1, 4.2, 4.3, 4.4]},
 {x: [5.1]}]


In [4]:
array_y.show(type=True)

type: 5 * int64
[1,
 2,
 3,
 4,
 5]


In [5]:
array_z.show(type=True)

type: 5 * var * float64
[[1.1],
 [2.1, 2.3, 2.4],
 [3.1],
 [4.1, 4.2, 4.3],
 [5.1]]


### `ak.to_rdataframe` function <a class="anchor" id="ak.to_rdataframe_function"></a>

* The `ak.to_rdataframe` function takes a dictionary:
    * each key is unique and defines a column name in the RDataFrame
    * the arrays given for the columns must all have the same length

* Generating the Awkward RDataSource C++ code adds a small overhead:
    * this operation does not execute the RDF event loop
    * the array data are not copied

In [6]:
assert len(array_x) == len(array_y) == len(array_z)

In [7]:
df = ak.to_rdataframe({"x": array_x, "y": array_y, "z": array_z})

In [8]:
df.Describe().Print()

Dataframe from datasource Custom Datasource

Property                Value
--------                -----
Columns in total            3
Columns from defines        0
Event loops run             0
Processing slots            1

Column  Type                            Origin
------  ----                            ------
x       awkward::Record_TCZWUpuv5XA     Dataset
y       int64_t                         Dataset
z       ROOT::VecOps::RVec<double>      Dataset

The `x` column contains an Awkward Array with a generated type name: `awkward::Record_TCZWUpuv5XA`.

Awkward Arrays are dynamically typed, so in a C++ context the type name is hashed. In practice, there is no need to know the type. C++ code should use the placeholder type specifier `auto`, so that the type of the variable being declared is deduced automatically from its initializer.

### Columns type  <a class="anchor" id="Columns_type"></a>

In [9]:
df.GetColumnType("x")

'awkward::Record_TCZWUpuv5XA'

In [10]:
df.GetColumnType("y")

'int64_t'

In [11]:
df.GetColumnType("z")

'ROOT::VecOps::RVec<double>'

### Operations on data in RDataFrame   <a class="anchor" id="Operations_on_data_in_RDataFrame"></a>

Scheduling a filtering operation does not execute the event loop.

In [12]:
df = df.Filter("y > 2")

Let's check the state of the dataframe to make sure that the event loop was not triggered.

In [13]:
df.Describe().Print()

Dataframe from datasource Custom Datasource

Property                Value
--------                -----
Columns in total            3
Columns from defines        0
Event loops run             0
Processing slots            1

Column  Type                            Origin
------  ----                            ------
x       awkward::Record_TCZWUpuv5XA     Dataset
y       int64_t                         Dataset
z       ROOT::VecOps::RVec<double>      Dataset
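The deferred execution seen here can be pictured with a toy model in plain Python. This is only a conceptual sketch: `LazyFrame` is a made-up class and does not reflect how ROOT implements RDataFrame. Filters are recorded when they are scheduled, and the data are traversed only when a result is requested.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (not ROOT's implementation)."""

    def __init__(self, rows, predicates=(), stats=None):
        self._rows = rows
        self._predicates = tuple(predicates)
        # The loop counter is shared by all frames derived from the same source.
        self._stats = stats if stats is not None else {"loops": 0}

    def filter(self, predicate):
        # Scheduling a filter only records the predicate; no data are read.
        return LazyFrame(self._rows, self._predicates + (predicate,), self._stats)

    def collect(self):
        # Requesting a result runs one loop over the rows.
        self._stats["loops"] += 1
        return [row for row in self._rows if all(p(row) for p in self._predicates)]

    @property
    def loops_run(self):
        return self._stats["loops"]


frame = LazyFrame([1, 2, 3, 4, 5]).filter(lambda y: y > 2)
loops_before = frame.loops_run  # nothing has been executed yet
selected = frame.collect()      # triggers the single loop
loops_after = frame.loops_run
print(loops_before, selected, loops_after)
```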

### Retrieve selected columns <a class="anchor" id="Retrieve_selected_columns"></a>

The `ak.from_rdataframe` function converts selected columns to native Awkward Arrays.

The function takes a tuple of strings that are the RDF column names.

The event loop is triggered once to retrieve all selected columns.

In [15]:
out = ak.from_rdataframe(
    df,
    columns=("x", "y", "z",),
)

In [16]:
out

Let's check the filtered entries for y > 2:

In [20]:
print(out["y"].to_list())
print(out["z"].to_list())
print(out["x"].to_list())

[3, 4, 5]
[[3.1], [4.1, 4.2, 4.3], [5.1]]
[{'x': [3.1]}, {'x': [4.1, 4.2, 4.3, 4.4]}, {'x': [5.1]}]


Check to make sure we triggered the event loop only once:

In [21]:
df.Describe().Print()

Dataframe from datasource Custom Datasource

Property                Value
--------                -----
Columns in total            3
Columns from defines        0
Event loops run             1
Processing slots            1

Column  Type                            Origin
------  ----                            ------
x       awkward::Record_TCZWUpuv5XA     Dataset
y       int64_t                         Dataset
z       ROOT::VecOps::RVec<double>      Dataset

### Layout details <a class="anchor" id="Layout_details"></a>

The data of the `x` column are not copied. Because of the filter selection, the original RecordArray (and the NumpyArray content underneath it) is wrapped in an IndexedArray that points at the selected entries. The data of the other two columns, `y` and `z`, are copied.

In [22]:
out.layout

<RecordArray is_tuple='false' len='3'>
    <content index='0' field='y'>
        <NumpyArray dtype='int64' len='3'>[3 4 5]</NumpyArray>
    </content>
    <content index='1' field='z'>
        <ListOffsetArray len='3'>
            <offsets><Index dtype='int64' len='4'>
                [0 1 4 5]
            </Index></offsets>
            <content><NumpyArray dtype='float64' len='5'>[3.1 4.1 4.2 4.3 5.1]</NumpyArray></content>
        </ListOffsetArray>
    </content>
    <content index='2' field='x'>
        <IndexedArray len='3'>
            <index><Index dtype='int64' len='3'>
                [2 3 4]
            </Index></index>
            <content><RecordArray is_tuple='false' len='5'>
                <content index='0' field='x'>
                    <ListArray len='5'>
                        <starts><Index dtype='int64' len='5'>
                            [ 0  3  5  6 10]
                        </Index></starts>
                        <stops><Index dtype='int64' len='5'>
      

### The same operation on Awkward arrays in Python <a class="anchor" id="The_same_operation_on_Awkward_arrays_in_Python"></a>
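* Is the array type as expected? Yes.
* Is its layout the same? No. Awkward arrays are immutable, so the selection can wrap the whole original RecordArray in an IndexedArray rather than copy any of its columns.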

In [23]:
array_yzx = ak.Array({"y": array_y, "z": array_z, "x": array_x})
filtered_array = array_yzx[array_yzx["y"] > 2]
filtered_array.show(type=True)

type: 3 * {
    y: int64,
    z: var * float64,
    x: {
        x: var * float64
    }
}
[{y: 3, z: [3.1], x: {x: [3.1]}},
 {y: 4, z: [4.1, 4.2, 4.3], x: {x: [4.1, ...]}},
 {y: 5, z: [5.1], x: {x: [5.1]}}]


In [24]:
filtered_array.layout

<IndexedArray len='3'>
    <index><Index dtype='int64' len='3'>
        [2 3 4]
    </Index></index>
    <content><RecordArray is_tuple='false' len='5'>
        <content index='0' field='y'>
            <NumpyArray dtype='int64' len='5'>[1 2 3 4 5]</NumpyArray>
        </content>
        <content index='1' field='z'>
            <ListOffsetArray len='5'>
                <offsets><Index dtype='int64' len='6'>
                    [0 1 4 5 8 9]
                </Index></offsets>
                <content><NumpyArray dtype='float64' len='9'>
                    [1.1 2.1 2.3 2.4 3.1 4.1 4.2 4.3 5.1]
                </NumpyArray></content>
            </ListOffsetArray>
        </content>
        <content index='2' field='x'>
            <RecordArray is_tuple='false' len='5'>
                <content index='0' field='x'>
                    <ListOffsetArray len='5'>
                        <offsets><Index dtype='int64' len='6'>
                            [ 0  3  5  6 10 11]
                 

## From RDataFrame to Awkward Array <a class="anchor" id="From_RDataFrame_to_Awkward_Array"></a>

The `ak.from_rdataframe` function converts selected columns to native Awkward Arrays. The function takes a tuple of strings that are the RDF column names and recognizes the following column data types:

* Primitive types: `integer`, `float`, `double`, `std::complex<double>`, etc.
    
* Lists of primitive types and arbitrarily deep nested lists of primitive types: `std::vector<double>`, `RVec<int>`, etc.
    
* Awkward types: run-time generated types derived from `awkward::ArrayView` or `awkward::RecordView`
    * no copy required because Awkward Arrays are immutable

## Data analysis: from C++ to Python <a class="anchor" id="Data_analysis:_from_C++_to_Python"></a>

In [25]:
import awkward as ak
import ROOT

In [26]:
df = ROOT.RDataFrame('Events', 'root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root')

CMS data from CERN Open Data portal [DOI:10.7483/OPENDATA.CMS.LVG5.QT81](http://opendata.web.cern.ch/record/12341)

* This dataset contains about 60 million data events from the CMS detector, taken in 2012 during Runs B and C.
* The original AOD dataset is converted to the NanoAOD format and reduced to the muon collections.
* The dataset in the file is called <b>Events</b> and contains the following columns:
    * <b>nMuon</b> `unsigned int` <i>Number of muons in this event</i>
    * <b>Muon_pt</b> `float[nMuon]` <i>Transverse momentum of the muons (stored as an array of size nMuon)</i>
    * <b>Muon_eta</b> `float[nMuon]` <i>Pseudorapidity of the muons</i>
    * <b>Muon_phi</b> `float[nMuon]` <i>Azimuth of the muons</i>
    * <b>Muon_mass</b> `float[nMuon]` <i>Mass of the muons</i>
    * <b>Muon_charge</b> `int[nMuon]` <i>Charge of the muons (either 1 or -1)</i>

In [27]:
# Describe the state of the dataframe.
# Note that this operation is not running the event loop.
df.Describe().Print()

Dataframe from TChain Events in file root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root

Property                Value
--------                -----
Columns in total            6
Columns from defines        0
Event loops run             0
Processing slots            1

Column          Type                            Origin
------          ----                            ------
Muon_charge     ROOT::VecOps::RVec<Int_t>       Dataset
Muon_eta        ROOT::VecOps::RVec<Float_t>     Dataset
Muon_mass       ROOT::VecOps::RVec<Float_t>     Dataset
Muon_phi        ROOT::VecOps::RVec<Float_t>     Dataset
Muon_pt         ROOT::VecOps::RVec<Float_t>     Dataset
nMuon           UInt_t                          Dataset

Build a small analysis studying the invariant mass of dimuon systems.

* See the [ROOT tutorial](https://root.cern.ch/doc/master/df102__NanoAODDimuonAnalysis_8py.html) for more information.

In [28]:
df = df.Filter('nMuon == 2')\
       .Filter('Muon_charge[0] != Muon_charge[1]')\
       .Define('Dimuon_mass', 'InvariantMass(Muon_pt, Muon_eta, Muon_phi, Muon_mass)')\
       .Filter('Dimuon_mass > 70')\
       .Range(1000)
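For orientation, the quantity defined as `Dimuon_mass` can be written out with NumPy. The sketch below uses the standard relations between (pt, eta, phi, mass) and a four-momentum; `dimuon_invariant_mass` is a hypothetical helper written for this illustration, not ROOT's source code, and the input values are made up.

```python
import numpy as np


def dimuon_invariant_mass(pt, eta, phi, mass):
    """Invariant mass of the summed four-momenta of the given particles."""
    pt, eta, phi, mass = (
        np.asarray(v, dtype=np.float64) for v in (pt, eta, phi, mass)
    )
    # Cartesian momentum components from pt, eta, phi
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = np.sqrt(px**2 + py**2 + pz**2 + mass**2)
    # M^2 = (sum E)^2 - |sum p|^2, clipped at zero against rounding
    m2 = energy.sum() ** 2 - px.sum() ** 2 - py.sum() ** 2 - pz.sum() ** 2
    return float(np.sqrt(max(m2, 0.0)))


# Two massless, back-to-back muons with pt = 45 GeV each (made-up values)
m = dimuon_invariant_mass([45.0, 45.0], [0.0, 0.0], [0.0, np.pi], [0.0, 0.0])
print(f"{m:.2f} GeV")
```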

Trigger the event loop by asking for the mean of the dimuon mass:

In [29]:
print('\nApproximate mass of the Z boson: {:.2f} GeV\n'.format(
        df.Mean('Dimuon_mass').GetValue()))


Approximate mass of the Z boson: 91.44 GeV



Check that this operation triggered the event loop once:

In [30]:
df.Describe().Print()

Dataframe from TChain Events in file root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root

Property                Value
--------                -----
Columns in total            7
Columns from defines        1
Event loops run             1
Processing slots            1

Column          Type                            Origin
------          ----                            ------
Dimuon_mass     float                           Define
Muon_charge     ROOT::VecOps::RVec<Int_t>       Dataset
Muon_eta        ROOT::VecOps::RVec<Float_t>     Dataset
Muon_mass       ROOT::VecOps::RVec<Float_t>     Dataset
Muon_phi        ROOT::VecOps::RVec<Float_t>     Dataset
Muon_pt         ROOT::VecOps::RVec<Float_t>     Dataset
nMuon           UInt_t                          Dataset

### Convert selected columns to Awkward Array <a class="anchor" id="Convert_selected_columns_to_Awkward_Array"></a>

* The scheduled analysis executed the event loop once
* A user can take the data out as an Awkward Array
* If the column type is not an Awkward type, the `ROOT::VecOps::RVec` content is copied to a NumPy buffer

In [31]:
array = ak.from_rdataframe(
        df,
        columns=(
            "Dimuon_mass",
        ),
    )

In [32]:
array