# Reading hepfiles

*Note*: If you have not yet worked through `write_hepfile`, do that first to generate its output file. That output file is used as the input here.

## Reading the Entire File

In [1]:
# import the load function
from hepfile import load

We begin with a file and load its contents into a data dictionary:

In [2]:
infile = 'output_from_scratch.hdf5'
data, event = load(infile)

Building the indices...

Built the indices!
Data is read in and input file is closed.


`data` is a dictionary containing counters, indices, and data for all the features we care about. `event` is a dictionary with one key per dataset, each initially set to `None`, waiting to be filled with data from a single event.

In [3]:
print(data)

{'_MAP_DATASETS_TO_COUNTERS_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER', 'jet': 'jet/njet', 'muons': 'muons/nmuon', 'jet/e': 'jet/njet', 'jet/px': 'jet/njet', 'jet/py': 'jet/njet', 'jet/pz': 'jet/njet', 'jet/algorithm': 'jet/njet', 'jet/words': 'jet/njet', 'muons/e': 'muons/nmuon', 'muons/px': 'muons/nmuon', 'muons/py': 'muons/nmuon', 'muons/pz': 'muons/nmuon', 'METpx': '_SINGLETONS_GROUP_/COUNTER', 'METpy': '_SINGLETONS_GROUP_/COUNTER'}, '_MAP_DATASETS_TO_INDEX_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet': 'jet/njet_INDEX', 'muons': 'muons/nmuon_INDEX', 'jet/e': 'jet/njet_INDEX', 'jet/px': 'jet/njet_INDEX', 'jet/py': 'jet/njet_INDEX', 'jet/pz': 'jet/njet_INDEX', 'jet/algorithm': 'jet/njet_INDEX', 'jet/words': 'jet/njet_INDEX', 'muons/e': 'muons/nmuon_INDEX', 'muons/px': 'muons/nmuon_INDEX', 'muons/py': 'muons/nmuon_INDEX', 'muons/pz': 'muons/nmuon_INDEX', 'METpx': '_SINGLETONS_GROUP_/COUNTER_INDEX', 'METpy': '_SINGLETONS_GROUP_/COUNTER_INDEX'}, '_LIST

In [4]:
print(event)

{'METpx': None, 'METpy': None, '_SINGLETONS_GROUP_/COUNTER': None, 'jet/algorithm': None, 'jet/e': None, 'jet/njet': None, 'jet/px': None, 'jet/py': None, 'jet/pz': None, 'jet/words': None, 'muons/e': None, 'muons/nmuon': None, 'muons/px': None, 'muons/py': None, 'muons/pz': None}
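
To make the roles of the counters and indices concrete, here is a minimal pure-Python sketch of how one bucket's values could be pulled out of the flat arrays and placed into an `event`-style dictionary. It does not use hepfile itself: the toy values and the `fill_event` helper are hypothetical, and it assumes that an `_INDEX` entry stores the starting offset of each bucket in the flat dataset.

```python
from itertools import accumulate

# Toy stand-in for a loaded data dictionary (values are made up for illustration).
data = {
    "jet/njet": [2, 0, 3],                 # counter: number of jets in each bucket
    "jet/px": [0.1, 0.2, 0.3, 0.4, 0.5],   # flat values for all buckets, back to back
}

# Assumed meaning of the index: the offset where each bucket starts in the flat
# array, i.e. the running sum of the counter shifted by one position.
counts = data["jet/njet"]
data["jet/njet_INDEX"] = [0] + list(accumulate(counts))[:-1]


def fill_event(event, data, n):
    """Copy the values of bucket n into the event dictionary (hypothetical helper)."""
    start = data["jet/njet_INDEX"][n]
    stop = start + data["jet/njet"][n]
    event["jet/njet"] = data["jet/njet"][n]
    event["jet/px"] = data["jet/px"][start:stop]
    return event


# Start from an event whose entries are all None, as printed above.
event = {"jet/njet": None, "jet/px": None}
fill_event(event, data, 2)
print(event)
```

A bucket with a counter of zero simply yields an empty list, which is how events without any jets or muons fit into the same layout.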


## Reading Part of a File

If you only want to read part of a file, you can load only certain groups. This is especially useful for very large datasets.

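To do this, use the `desired_groups` and `subset` arguments to `load`. `desired_groups` lists the groups to read, and `subset` gives the range of buckets as a `(start, stop)` pair in which `stop` is excluded: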

In [5]:
data, event = load(infile, desired_groups=['jet'], subset=(5, 10))

Will read in a subset of the file!
From bucket 5 (inclusive) through bucket 9 (inclusive)
Bucket 10 is not read in
Reading in 5 buckets

Not reading out muons/pz from the file....
Not reading out muons/py from the file....
Not reading out muons/px from the file....
Not reading out muons/nmuon from the file....
Not reading out muons/e from the file....
Not reading out muons from the file....
Not reading out _SINGLETONS_GROUP_/COUNTER from the file....
Not reading out _SINGLETONS_GROUP_ from the file....
Not reading out METpy from the file....
Not reading out METpx from the file....
Building the indices...

Built the indices!
Data is read in and input file is closed.


In [6]:
print(data.keys())

dict_keys(['_MAP_DATASETS_TO_COUNTERS_', '_MAP_DATASETS_TO_INDEX_', '_LIST_OF_COUNTERS_', '_LIST_OF_DATASETS_', '_META_', '_NUMBER_OF_BUCKETS_', '_SINGLETONS_GROUP_', '_SINGLETONS_GROUP_/COUNTER', '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet/njet', 'jet/njet_INDEX', 'muons/nmuon', 'muons/nmuon_INDEX', 'jet/algorithm', 'jet/e', 'jet/px', 'jet/py', 'jet/pz', 'jet/words', '_GROUPS_', '_MAP_DATASETS_TO_DATA_TYPES_', '_PROTECTED_NAMES_'])
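
The log messages above suggest two selection rules: the `subset` pair behaves like a Python `range` (start included, stop excluded), and a dataset is read only when it belongs to one of the desired groups. The sketch below mimics both rules in plain Python. The `keep_dataset` helper is hypothetical and is only an illustration of the behaviour seen in the output, not hepfile's actual implementation.

```python
subset = (5, 10)
desired_groups = ["jet"]

# Buckets covered by the subset: start is included, stop is not.
buckets = list(range(*subset))
print(buckets)


def keep_dataset(name, groups):
    """Return True if a dataset name is a desired group or lives inside one (hypothetical)."""
    return any(name == g or name.startswith(g + "/") for g in groups)


# A few dataset names taken from the output above.
datasets = ["jet/e", "jet/px", "muons/e", "muons/px", "METpx", "METpy"]
kept = [name for name in datasets if keep_dataset(name, desired_groups)]
print(kept)
```

Under this rule the singleton datasets `METpx` and `METpy` are dropped as well, which matches the "Not reading out" messages printed by `load`.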


## Reading into Awkward Arrays

Awkward arrays are a fast data type for heterogeneous datasets. Reading a hepfile into one is straightforward: pass `return_type='awkward'` to `load`. Note that the returned `event` is still a plain dictionary.

In [7]:
data, event = load(infile, return_type='awkward')

Building the indices...

Built the indices!
Data is read in and input file is closed.


In [8]:
data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset

[{METpx: 0.0112, METpy: 0.779, jet: {...}, muons: {...}},
 {METpx: 0.295, METpy: 0.353, jet: {...}, muons: {...}},
 {METpx: 0.931, METpy: 0.66, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.00724, METpy: 0.558, jet: {...}, muons: {...}},
 {METpx: 0.993, METpy: 0.491, jet: {...}, muons: {...}},
 {METpx: 0.744, METpy: 0.211, jet: {...}, muons: {...}},
 {METpx: 0.838, METpy: 0.773, jet: {...}, muons: {...}},
 {METpx: 0.763, METpy: 0.254, jet: {...}, muons: {...}},
 {METpx: 0.4, METpy: 0.443, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.933, METpy: 0.257, jet: {...}, muons: {...}},
 ...,
 {METpx: 0.785, METpy: 0.288, jet: {...}, muons: {...}},
 {METpx: 0.871, METpy: 0.609, jet: {...}, muons: {...}},
 {METpx: 0.81, METpy: 0.105, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.477, METpy: 0.654, jet: {...}, muons: {...}},
 {METpx: 0.992, METpy: 0.17, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.0397, METpy: 0.755, jet: {...}, muons: {...}},
 {METpx: 0.111, METpy: 0.973, jet: {...}, muons: {...}}

In [9]:
event

{'METpx': None,
 'METpy': None,
 '_SINGLETONS_GROUP_/COUNTER': None,
 'jet/algorithm': None,
 'jet/e': None,
 'jet/njet': None,
 'jet/px': None,
 'jet/py': None,
 'jet/pz': None,
 'jet/words': None,
 'muons/e': None,
 'muons/nmuon': None,
 'muons/px': None,
 'muons/py': None,
 'muons/pz': None}

With `return_type='awkward'`, you can still select a subset of the data in the same way:

In [10]:
data, event = load(infile, return_type='awkward', desired_groups=['jet'], subset=(5, 10))

Will read in a subset of the file!
From bucket 5 (inclusive) through bucket 9 (inclusive)
Bucket 10 is not read in
Reading in 5 buckets

Not reading out muons/pz from the file....
Not reading out muons/py from the file....
Not reading out muons/px from the file....
Not reading out muons/nmuon from the file....
Not reading out muons/e from the file....
Not reading out muons from the file....
Not reading out _SINGLETONS_GROUP_/COUNTER from the file....
Not reading out _SINGLETONS_GROUP_ from the file....
Not reading out METpy from the file....
Not reading out METpx from the file....
Building the indices...

Built the indices!
Data is read in and input file is closed.


In [11]:
data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset

[{jet: {algorithm: [0, -1, 0, ..., 0, -1], e: [...], px: [...], ...}},
 {jet: {algorithm: [-1, 0, -1, ..., -1, -1], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, -1, -1, ..., 0, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [-1, 0, 0, ..., 0, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [-1, -1, 0, ..., 0, 0], e: [...], px: [...], ...}}]

[{algorithm: [0, -1, 0, 0, ..., 0, 0, -1], e: [0.759, ...], px: [...], ...},
 {algorithm: [-1, 0, -1, ..., -1, -1, -1], e: [0.386, ...], px: [...], ...},
 {algorithm: [0, -1, -1, -1, ..., 0, 0, 0], e: [0.519, ...], px: [...], ...},
 {algorithm: [-1, 0, 0, 0, ..., -1, 0, 0], e: [0.589, ...], px: [...], ...},
 {algorithm: [-1, -1, 0, -1, ..., -1, 0, 0], e: [0.0114, ...], px: [...], ...}]

[[0.86, 0.0234, 0.608, 0.811, 0.844, ..., 0.0332, 0.508, 0.617, 0.692, 0.972],
 [0.144, 0.981, 0.521, 0.198, 0.159, ..., 0.709, 0.22, 0.316, 0.656, 0.475],
 [0.391, 0.198, 0.928, 0.754, 0.749, ..., 0.0771, 0.846, 0.152, 0.177, 0.822],
 [0.661, 0.733, 0.

In [12]:
event

{'jet/algorithm': None,
 'jet/e': None,
 'jet/njet': None,
 'jet/px': None,
 'jet/py': None,
 'jet/pz': None,
 'jet/words': None}

## Reading into a Dictionary of Pandas DataFrames

To read into a dictionary of pandas DataFrames, where each DataFrame holds the data for one group, pass `return_type='pandas'` to `load`.

In [13]:
data, event = load(infile, return_type='pandas')

Building the indices...

Built the indices!
Data is read in and input file is closed.


In [14]:
print(f'Group Names: {data.keys()}')

Group Names: dict_keys(['_SINGLETONS_GROUP_', 'jet', 'muons'])


In [15]:
print('jet information:')
data['jet']

jet information:


|  | algorithm | e | px | py | pz | words | event_num |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -1 | 0.178622 | 0.600409 | 0.835830 | 0.588082 | b'bye' | 0 |
| 1 | 0 | 0.761923 | 0.402971 | 0.210146 | 0.897793 | b'aloha' | 0 |
| 2 | -1 | 0.251654 | 0.130950 | 0.093568 | 0.918656 | b'aloha' | 0 |
| 3 | -1 | 0.869398 | 0.848791 | 0.604355 | 0.957420 | b'hi' | 0 |
| 4 | 0 | 0.206576 | 0.180672 | 0.894818 | 0.835994 | b'ciao' | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 169995 | 0 | 0.194587 | 0.837496 | 0.512124 | 0.277076 | b'aloha' | 9999 |
| 169996 | 0 | 0.548121 | 0.555247 | 0.166261 | 0.876548 | b'ciao' | 9999 |
| 169997 | 0 | 0.149202 | 0.419852 | 0.079596 | 0.189418 | b'aloha' | 9999 |
| 169998 | 0 | 0.395414 | 0.693458 | 0.594776 | 0.329097 | b'aloha' | 9999 |


Once again, we can read a subset of the data with specific groups. Note, however, that the event numbers are reset to 0-4 when we read a subset of 5 buckets, so `event_num` no longer matches the bucket numbers in the file. If this is a problem, consider loading with the default return type and converting the result to a dictionary of pandas DataFrames by hand with `hf.df_tools.hepfile_to_df`.

In [16]:
data, event = load(infile, return_type='pandas', desired_groups=['jet'], subset=(5, 10))

Will read in a subset of the file!
From bucket 5 (inclusive) through bucket 9 (inclusive)
Bucket 10 is not read in
Reading in 5 buckets

Not reading out muons/pz from the file....
Not reading out muons/py from the file....
Not reading out muons/px from the file....
Not reading out muons/nmuon from the file....
Not reading out muons/e from the file....
Not reading out muons from the file....
Not reading out _SINGLETONS_GROUP_/COUNTER from the file....
Not reading out _SINGLETONS_GROUP_ from the file....
Not reading out METpy from the file....
Not reading out METpx from the file....
Building the indices...

Built the indices!
Data is read in and input file is closed.


In [17]:
data['jet']

|  | algorithm | e | px | py | pz | words | event_num |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.758752 | 0.859935 | 0.125270 | 0.438869 | b'hi' | 0 |
| 1 | -1 | 0.879485 | 0.023435 | 0.451097 | 0.183107 | b'hi' | 0 |
| 2 | 0 | 0.131377 | 0.608286 | 0.516602 | 0.375593 | b'hi' | 0 |
| 3 | 0 | 0.622495 | 0.811466 | 0.631376 | 0.121259 | b'aloha' | 0 |
| 4 | -1 | 0.594603 | 0.844051 | 0.304491 | 0.176327 | b'aloha' | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | 0 | 0.870541 | 0.061788 | 0.651052 | 0.488788 | b'hi' | 4 |
| 81 | 0 | 0.238655 | 0.299215 | 0.831584 | 0.526371 | b'aloha' | 4 |
| 82 | -1 | 0.612054 | 0.814023 | 0.239940 | 0.262937 | b'hi' | 4 |
| 83 | 0 | 0.830777 | 0.684267 | 0.012055 | 0.138162 | b'hi' | 4 |
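
If only the renumbering matters, one lightweight workaround is to shift `event_num` back by the start of the subset. The sketch below shows the arithmetic on a plain list with made-up values. It assumes that the reset numbering counts from the first bucket of the subset, so that `event_num` 0 corresponds to bucket 5 here. With a DataFrame the same shift could be applied to the `event_num` column directly.

```python
subset = (5, 10)

# Made-up event numbers as they might appear after reading the subset (reset to 0-4).
event_num = [0, 0, 1, 2, 2, 3, 4]

# Assumption: event_num 0 corresponds to the first bucket of the subset,
# so adding the subset start recovers the bucket number in the full file.
original_event_num = [n + subset[0] for n in event_num]
print(original_event_num)
```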
