# Choosing RecordArray structures suitable for the nanoAOD data format and masking with `ak.mask`
According to A. Hollands in the discussion https://github.com/scikit-hep/awkward/discussions/1585, the type of an awkward array matters when masking.
A type like `3 * var * {"Muon_pt": float64, "Muon_eta": float64}` is preferred over `3 * {"Muon_pt": var * float64, "Muon_eta": var * float64}` for proper broadcasting.

> [...] The main issue here is that your record structures are not deeply nested, i.e. your array has type
> 
> 3 * {"Muon_pt": var * float64, "Muon_eta": var * float64}
> 
> instead of
> 
> 3 * var * {"Muon_pt": float64, "Muon_eta": float64}
> 
> The form of your array is an important choice that should make it easy to work with your data in a natural way. In the case of records, that means choosing whether to have var * {'x': float64} or {'x': var * float64}. In this instance, it feels like you want the former, not the latter. Otherwise, you need to slice each field, [...]

> Record of arrays:
> 
> 3 * {"Muon_pt": var * float64, "Muon_eta": var * float64}
> 
> Array of records:
> 
> 3 * var * {"Muon_pt": float64, "Muon_eta": float64}
> 
> Uproot returns the "record of arrays", as by default there is no information about whether the branches are compatible with one another.
> 
> You can pass how="zip" into the arrays() method that uproot provides to instruct it to give you the branches as a zipped array. You might need to limit arrays() to only the branches that you want to zip together.

In [1]:
import awkward as ak
print("awkward version:", ak.__version__)
import uproot
print("uproot version: ", uproot.__version__)
import numpy as np
print("numpy version:  ", np.__version__)

awkward version: 1.8.0
uproot version:  4.3.3
numpy version:   1.23.1


## Preparation

A function to print useful information about awkward arrays:

In [2]:
def print_array_info(akarray, layout: bool = False):
    """Print the fields and type of an awkward array, optionally its layout."""
    print("fields:\n{} \n".format(akarray.fields))
    print("type: \n{} \n".format(akarray.type))
    if layout:
        print("layout: \n{}".format(akarray.layout))

Import the `Events` tree from a nanoAOD file:

In [3]:
events = uproot.open("./nanoAOD_MC_testfile.root:Events")

Choosing branch keys to extract from the `Events` tree:

In [4]:
muon_branches = ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_tightId", "Muon_pfRelIso04_all", "Muon_charge"]

In [5]:
jet_branches = [
    "Jet_pt",
    "Jet_eta",
    "Jet_phi",
    "Jet_btagDeepFlavB",
    "Jet_btagDeepFlavCvB",
    "Jet_btagDeepFlavCvL",
    "Jet_jetId",
    "Jet_puId",
    "Jet_hadronFlavour",
]

## load muon branches without zipping

In [6]:
muons_no_zip = events.arrays(muon_branches) #, entry_stop=10)
print_array_info(muons_no_zip)

fields:
['Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_tightId', 'Muon_pfRelIso04_all', 'Muon_charge'] 

type: 
1089694 * {"Muon_pt": var * float32, "Muon_eta": var * float32, "Muon_phi": var * float32, "Muon_tightId": var * bool, "Muon_pfRelIso04_all": var * float32, "Muon_charge": var * int32} 



## load muon branches with manual zipping using `ak.zip`

In [7]:
# strip the "Muon_" prefix so that the fields are named pt, eta, ...
muons_manual_zip = ak.zip({name.split("_", maxsplit=1)[1]: events[name].array() for name in muon_branches})
print_array_info(muons_manual_zip)

fields:
['pt', 'eta', 'phi', 'tightId', 'pfRelIso04_all', 'charge'] 

type: 
1089694 * var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32} 
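
The field names in the manual zip come from splitting each branch name at the first underscore only. A small sketch of that step, with a hypothetical helper `strip_prefix` that is not part of the notebook:

```python
muon_branches = ["Muon_pt", "Muon_eta", "Muon_pfRelIso04_all"]


def strip_prefix(branch_name: str) -> str:
    """Drop everything up to and including the first underscore (hypothetical helper)."""
    return branch_name.split("_", maxsplit=1)[1]


field_names = [strip_prefix(name) for name in muon_branches]
print(field_names)  # ['pt', 'eta', 'pfRelIso04_all']

# Without maxsplit=1 the name would be cut at every underscore:
print("Muon_pfRelIso04_all".split("_")[1])  # pfRelIso04
```

Limiting the split to one cut keeps branch names that contain further underscores, such as `Muon_pfRelIso04_all`, intact.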



## load muon branches with automatic zipping using uproot's `how="zip"`

In [8]:
muons_zip = events.arrays(muon_branches, how="zip") #, entry_stop=10)
print_array_info(muons_zip)
print_array_info(muons_zip.Muon)

fields:
['Muon'] 

type: 
1089694 * {"Muon": var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32}} 

fields:
['pt', 'eta', 'phi', 'tightId', 'pfRelIso04_all', 'charge'] 

type: 
1089694 * var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32} 



## testing array structure with two different object types (Muon and Jet)

In [9]:
muons_with_jets_no_zip = events.arrays(muon_branches + jet_branches)
print_array_info(muons_with_jets_no_zip) # , layout=True)

fields:
['Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_tightId', 'Muon_pfRelIso04_all', 'Muon_charge', 'Jet_pt', 'Jet_eta', 'Jet_phi', 'Jet_btagDeepFlavB', 'Jet_btagDeepFlavCvB', 'Jet_btagDeepFlavCvL', 'Jet_jetId', 'Jet_puId', 'Jet_hadronFlavour'] 

type: 
1089694 * {"Muon_pt": var * float32, "Muon_eta": var * float32, "Muon_phi": var * float32, "Muon_tightId": var * bool, "Muon_pfRelIso04_all": var * float32, "Muon_charge": var * int32, "Jet_pt": var * float32, "Jet_eta": var * float32, "Jet_phi": var * float32, "Jet_btagDeepFlavB": var * float32, "Jet_btagDeepFlavCvB": var * float32, "Jet_btagDeepFlavCvL": var * float32, "Jet_jetId": var * int32, "Jet_puId": var * int32, "Jet_hadronFlavour": var * int32} 



In [10]:
muons_with_jets_zip = events.arrays(muon_branches + jet_branches, how="zip")
print_array_info(muons_with_jets_zip) # , layout=True)

fields:
['Muon', 'Jet'] 

type: 
1089694 * {"Muon": var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32}, "Jet": var * {"pt": float32, "eta": float32, "phi": float32, "btagDeepFlavB": float32, "btagDeepFlavCvB": float32, "btagDeepFlavCvL": float32, "jetId": int32, "puId": int32, "hadronFlavour": int32}} 



In [11]:
print_array_info(muons_with_jets_zip.Muon) # , layout=True)


fields:
['pt', 'eta', 'phi', 'tightId', 'pfRelIso04_all', 'charge'] 

type: 
1089694 * var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32} 



In [12]:
print_array_info(muons_with_jets_zip.Jet) # , layout=True)


fields:
['pt', 'eta', 'phi', 'btagDeepFlavB', 'btagDeepFlavCvB', 'btagDeepFlavCvL', 'jetId', 'puId', 'hadronFlavour'] 

type: 
1089694 * var * {"pt": float32, "eta": float32, "phi": float32, "btagDeepFlavB": float32, "btagDeepFlavCvB": float32, "btagDeepFlavCvL": float32, "jetId": int32, "puId": int32, "hadronFlavour": int32} 



In [13]:
all_muon_and_jet_branches = events.arrays(filter_name=["Muon_*","Jet_*"], how="zip")
print_array_info(all_muon_and_jet_branches)

fields:
['Jet', 'Muon'] 

type: 
1089694 * {"Jet": var * {"pt": float32, "eta": float32, "phi": float32, "btagDeepFlavB": float32, "btagDeepFlavCvB": float32, "btagDeepFlavCvL": float32, "jetId": int32, "puId": int32, "hadronFlavour": int32}, "Muon": var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32}} 



In [14]:
print_array_info(all_muon_and_jet_branches.Muon)

fields:
['pt', 'eta', 'phi', 'tightId', 'pfRelIso04_all', 'charge'] 

type: 
1089694 * var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32} 



In [15]:
print_array_info(all_muon_and_jet_branches.Jet)


fields:
['pt', 'eta', 'phi', 'btagDeepFlavB', 'btagDeepFlavCvB', 'btagDeepFlavCvL', 'jetId', 'puId', 'hadronFlavour'] 

type: 
1089694 * var * {"pt": float32, "eta": float32, "phi": float32, "btagDeepFlavB": float32, "btagDeepFlavCvB": float32, "btagDeepFlavCvL": float32, "jetId": int32, "puId": int32, "hadronFlavour": int32} 



## prepare masks and information printouts

Create a mask with the criterion that the absolute value of eta has to be smaller than 0.5. This value is chosen for demonstration purposes only.

In [16]:
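# for the unzipped "record of arrays" (fields named Muon_*)
def mask_function1(akarray):
    return abs(akarray.Muon_eta) < 0.5
# for the manually zipped "array of records" (fields pt, eta, ...)
def mask_function2(akarray):
    return abs(akarray.eta) < 0.5
# for the how="zip" output (records nested under the Muon field)
def mask_function3(akarray):
    return abs(akarray.Muon.eta) < 0.5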

In [17]:
def slice_test(akarray, mask_function):
    array_copy = ak.copy(akarray)
    mask = mask_function(array_copy)
    sliced_array = array_copy[mask]
    print_array_info(sliced_array)
    print("first 10 events to list")
    print(sliced_array[:10].to_list())
    return sliced_array

In [18]:
def mask_test(akarray, mask_function):
    array_copy = ak.copy(akarray)
    mask = mask_function(array_copy)
    masked_array = ak.mask(array_copy, mask)  # equivalent to array_copy.mask[mask]
    print_array_info(masked_array)
    print("first 10 events to list")
    print(masked_array[:10].to_list())
    return masked_array

## initial data in the first ten events without slicing or masking

In [19]:
print(muons_no_zip[:10].Muon_eta.to_list())

[[1.1298828125, 0.162200927734375], [], [0.66748046875, -0.584716796875], [-0.3990478515625, -2.0595703125], [2.24853515625], [2.115234375], [], [0.9600830078125], [1.9267578125], []]


Already in the first events we see some entries with |eta| > 0.5, which we expect to be masked or sliced away.

## masking and slicing on the three types of array structures

## no zip
### masking

In [20]:
muons_no_zip_masked = mask_test(muons_no_zip, mask_function=mask_function1)
print("\nonly the eta array:")
print(muons_no_zip_masked[:10].Muon_eta.to_list())

fields:
['Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_tightId', 'Muon_pfRelIso04_all', 'Muon_charge'] 

type: 
1089694 * var * ?{"Muon_pt": var * float32, "Muon_eta": var * float32, "Muon_phi": var * float32, "Muon_tightId": var * bool, "Muon_pfRelIso04_all": var * float32, "Muon_charge": var * int32} 

first 10 events to list
[[None, {'Muon_pt': [53.02759552001953, 30.640729904174805], 'Muon_eta': [1.1298828125, 0.162200927734375], 'Muon_phi': [1.251220703125, -2.2509765625], 'Muon_tightId': [True, True], 'Muon_pfRelIso04_all': [0.018847892060875893, 0.016491081565618515], 'Muon_charge': [-1, 1]}], [], [None, None], [{'Muon_pt': [36.98743438720703, 32.94463348388672], 'Muon_eta': [-0.3990478515625, -2.0595703125], 'Muon_phi': [-1.376708984375, 1.7509765625], 'Muon_tightId': [True, True], 'Muon_pfRelIso04_all': [0.0, 0.08915365487337112], 'Muon_charge': [1, -1]}, None], [None], [None], [], [None], [None], []]

only the eta array:
[[None, [1.1298828125, 0.162200927734375]], [], [None, None]

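Looking again at eta of the first ten events, we see that the shape of the list changed and the values with |eta| > 0.5 are still there instead of being replaced with `None`. As the type `var * ?{...}` and the list printout show, each muon slot now holds either `None` or the whole event record.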

### slicing

In [21]:
muons_no_zip_sliced = slice_test(muons_no_zip, mask_function=mask_function1)
print("\nonly the eta array:")
print(muons_no_zip_sliced[:10].Muon_eta.to_list())

fields:
['Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_tightId', 'Muon_pfRelIso04_all', 'Muon_charge'] 

type: 
1089694 * {"Muon_pt": var * float32, "Muon_eta": var * float32, "Muon_phi": var * float32, "Muon_tightId": var * bool, "Muon_pfRelIso04_all": var * float32, "Muon_charge": var * int32} 

first 10 events to list
[{'Muon_pt': [30.640729904174805], 'Muon_eta': [0.162200927734375], 'Muon_phi': [-2.2509765625], 'Muon_tightId': [True], 'Muon_pfRelIso04_all': [0.016491081565618515], 'Muon_charge': [1]}, {'Muon_pt': [], 'Muon_eta': [], 'Muon_phi': [], 'Muon_tightId': [], 'Muon_pfRelIso04_all': [], 'Muon_charge': []}, {'Muon_pt': [], 'Muon_eta': [], 'Muon_phi': [], 'Muon_tightId': [], 'Muon_pfRelIso04_all': [], 'Muon_charge': []}, {'Muon_pt': [36.98743438720703], 'Muon_eta': [-0.3990478515625], 'Muon_phi': [-1.376708984375], 'Muon_tightId': [True], 'Muon_pfRelIso04_all': [0.0], 'Muon_charge': [1]}, {'Muon_pt': [], 'Muon_eta': [], 'Muon_phi': [], 'Muon_tightId': [], 'Muon_pfRelIso04_all': [

Slicing gives the expected result: entries that fail the cut are removed rather than replaced with `None`, which is what we would expect from slicing anyhow.

## automatic zip
### masking

In [23]:
muons_zip_masked = mask_test(muons_zip, mask_function=mask_function3)
print("\nonly the eta array:")
print(muons_zip_masked[:10].Muon.eta.to_list())

fields:
['Muon'] 

type: 
1089694 * var * ?{"Muon": var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32}} 

first 10 events to list
[[None, {'Muon': [{'pt': 53.02759552001953, 'eta': 1.1298828125, 'phi': 1.251220703125, 'tightId': True, 'pfRelIso04_all': 0.018847892060875893, 'charge': -1}, {'pt': 30.640729904174805, 'eta': 0.162200927734375, 'phi': -2.2509765625, 'tightId': True, 'pfRelIso04_all': 0.016491081565618515, 'charge': 1}]}], [], [None, None], [{'Muon': [{'pt': 36.98743438720703, 'eta': -0.3990478515625, 'phi': -1.376708984375, 'tightId': True, 'pfRelIso04_all': 0.0, 'charge': 1}, {'pt': 32.94463348388672, 'eta': -2.0595703125, 'phi': 1.7509765625, 'tightId': True, 'pfRelIso04_all': 0.08915365487337112, 'charge': -1}]}, None], [None], [None], [], [None], [None], []]

only the eta array:
[[None, [1.1298828125, 0.162200927734375]], [], [None, None], [[-0.3990478515625, -2.0595703125], None], [None], [None], [], [Non

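Despite the different array structure, broadcasting didn't work as desired here either: the top level is still one record per event (`{"Muon": var * {...}}`), and each muon slot again holds either `None` or the whole event record.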

### slicing

In [24]:
muons_zip_sliced = slice_test(muons_zip, mask_function=mask_function3)
print("\nonly the eta array:")
print(muons_zip_sliced[:10].Muon.eta.to_list())

fields:
['Muon'] 

type: 
1089694 * {"Muon": var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32}} 

first 10 events to list
[{'Muon': [{'pt': 30.640729904174805, 'eta': 0.162200927734375, 'phi': -2.2509765625, 'tightId': True, 'pfRelIso04_all': 0.016491081565618515, 'charge': 1}]}, {'Muon': []}, {'Muon': []}, {'Muon': [{'pt': 36.98743438720703, 'eta': -0.3990478515625, 'phi': -1.376708984375, 'tightId': True, 'pfRelIso04_all': 0.0, 'charge': 1}]}, {'Muon': []}, {'Muon': []}, {'Muon': []}, {'Muon': []}, {'Muon': []}, {'Muon': []}]

only the eta array:
[[0.162200927734375], [], [], [-0.3990478515625], [], [], [], [], [], []]


It looks like slicing worked again in the way we expect it to behave.

## manual zip
### masking

In [26]:
muons_manual_zip_masked = mask_test(muons_manual_zip, mask_function=mask_function2)
print("\nonly the eta array:")
print(muons_manual_zip_masked[:10].eta.to_list())

fields:
['pt', 'eta', 'phi', 'tightId', 'pfRelIso04_all', 'charge'] 

type: 
1089694 * var * ?{"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32} 

first 10 events to list
[[None, {'pt': 30.640729904174805, 'eta': 0.162200927734375, 'phi': -2.2509765625, 'tightId': True, 'pfRelIso04_all': 0.016491081565618515, 'charge': 1}], [], [None, None], [{'pt': 36.98743438720703, 'eta': -0.3990478515625, 'phi': -1.376708984375, 'tightId': True, 'pfRelIso04_all': 0.0, 'charge': 1}, None], [None], [None], [], [None], [None], []]

only the eta array:
[[None, 0.162200927734375], [], [None, None], [-0.3990478515625, None], [None], [None], [], [None], [None], []]


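Looking only at the eta array, it looks like we finally made it. Looking closely at the type and the `.to_list()` representation of all fields in the array, however, the result is still not what we expected: the option marker `?` sits on the record as a whole (`var * ?{...}`), so a rejected muon appears as a single `None` instead of a record.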

### slicing

In [27]:
muons_manual_zip_sliced = slice_test(muons_manual_zip, mask_function=mask_function2)
print("\nonly the eta array:")
print(muons_manual_zip_sliced[:10].eta.to_list())

fields:
['pt', 'eta', 'phi', 'tightId', 'pfRelIso04_all', 'charge'] 

type: 
1089694 * var * {"pt": float32, "eta": float32, "phi": float32, "tightId": bool, "pfRelIso04_all": float32, "charge": int32} 

first 10 events to list
[[{'pt': 30.640729904174805, 'eta': 0.162200927734375, 'phi': -2.2509765625, 'tightId': True, 'pfRelIso04_all': 0.016491081565618515, 'charge': 1}], [], [], [{'pt': 36.98743438720703, 'eta': -0.3990478515625, 'phi': -1.376708984375, 'tightId': True, 'pfRelIso04_all': 0.0, 'charge': 1}], [], [], [], [], [], []]

only the eta array:
[[0.162200927734375], [], [], [-0.3990478515625], [], [], [], [], [], []]


Slicing looks fine again.

## conclusion

In all three cases (arrays of records, records of arrays, and records of arrays of records), broadcasting seems to work properly with slicing but not with `ak.mask`.
When using `ak.mask`, the type becomes `1089694 * var * ?{"pt": float32, ... }` instead of `1089694 * var * {"pt": float32, ... }` without the `?`. The difference can also be observed in the `.to_list()` representation, where entries are not masked as expected.