# Reading hepfiles

*Note*: If you have not yet worked through the `write_hepfile` tutorial, do that first. It generates the output file that is used as the input here.

## Reading the Entire File

In [1]:
# import the load function
from hepfile import load

We begin with a file and load its contents into a data dictionary:

In [2]:
infile = 'output_from_scratch.hdf5'
data, event = load(infile)

`data` is a dictionary containing counters, indices, and data for all the features we care about. `event` is an empty dictionary waiting to be filled with data from a single event.

In [3]:
print(data)

{'_MAP_DATASETS_TO_COUNTERS_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER', 'jet': 'jet/njet', 'muons': 'muons/nmuon', 'jet/e': 'jet/njet', 'jet/px': 'jet/njet', 'jet/py': 'jet/njet', 'jet/pz': 'jet/njet', 'jet/algorithm': 'jet/njet', 'jet/words': 'jet/njet', 'muons/e': 'muons/nmuon', 'muons/px': 'muons/nmuon', 'muons/py': 'muons/nmuon', 'muons/pz': 'muons/nmuon', 'METpx': '_SINGLETONS_GROUP_/COUNTER', 'METpy': '_SINGLETONS_GROUP_/COUNTER'}, '_MAP_DATASETS_TO_INDEX_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet': 'jet/njet_INDEX', 'muons': 'muons/nmuon_INDEX', 'jet/e': 'jet/njet_INDEX', 'jet/px': 'jet/njet_INDEX', 'jet/py': 'jet/njet_INDEX', 'jet/pz': 'jet/njet_INDEX', 'jet/algorithm': 'jet/njet_INDEX', 'jet/words': 'jet/njet_INDEX', 'muons/e': 'muons/nmuon_INDEX', 'muons/px': 'muons/nmuon_INDEX', 'muons/py': 'muons/nmuon_INDEX', 'muons/pz': 'muons/nmuon_INDEX', 'METpx': '_SINGLETONS_GROUP_/COUNTER_INDEX', 'METpy': '_SINGLETONS_GROUP_/COUNTER_INDEX'}, '_LIST

In [4]:
print(event)

{'METpx': None, 'METpy': None, '_SINGLETONS_GROUP_/COUNTER': None, 'jet/algorithm': None, 'jet/e': None, 'jet/njet': None, 'jet/px': None, 'jet/py': None, 'jet/pz': None, 'jet/words': None, 'muons/e': None, 'muons/nmuon': None, 'muons/px': None, 'muons/py': None, 'muons/pz': None}
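The printed dictionary pairs each dataset with a counter through `_MAP_DATASETS_TO_COUNTERS_` (for example, `'jet/px'` maps to `'jet/njet'`). The sketch below shows how such a counter can be used to pick out the values belonging to one event. It assumes that each dataset is stored as one flat sequence covering all events and that the counter holds the number of entries per event. `toy_data` and `values_for_event` are hypothetical stand-ins written for this illustration; they are not part of hepfile.

```python
from itertools import accumulate

# Hypothetical stand-in for the dictionary returned by load():
# three events holding 2, 0, and 3 jets. Real values come from the file.
toy_data = {
    "_MAP_DATASETS_TO_COUNTERS_": {"jet/px": "jet/njet"},
    "jet/njet": [2, 0, 3],
    "jet/px": [0.1, 0.2, 0.3, 0.4, 0.5],
}


def values_for_event(data, dataset, i):
    """Return the slice of a flat dataset that belongs to event i.

    Assumes the counter stores the number of entries per event.
    """
    counter = data["_MAP_DATASETS_TO_COUNTERS_"][dataset]
    # Running totals of the counter give the start offset of each event.
    starts = [0, *accumulate(data[counter])]
    return data[dataset][starts[i]:starts[i + 1]]


px_event_2 = values_for_event(toy_data, "jet/px", 2)
print(px_event_2)  # [0.3, 0.4, 0.5]
```

An event with no jets simply yields an empty slice, as `values_for_event(toy_data, "jet/px", 1)` does here.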


## Reading Part of a File

If you only want to read part of a file, you can load only certain groups. This is especially useful for very large datasets.

To do this, pass the `desired_groups` and `subset` arguments to `load`. `desired_groups` lists the groups to read, and `subset` gives the range of events to read:

In [5]:
data,event = load(infile,desired_groups=['jet'],subset=(5,10))

In [6]:
print(data.keys())

dict_keys(['_MAP_DATASETS_TO_COUNTERS_', '_MAP_DATASETS_TO_INDEX_', '_LIST_OF_COUNTERS_', '_LIST_OF_DATASETS_', '_META_', '_NUMBER_OF_BUCKETS_', '_SINGLETONS_GROUP_', '_SINGLETONS_GROUP_/COUNTER', '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet/njet', 'jet/njet_INDEX', 'muons/nmuon', 'muons/nmuon_INDEX', 'jet/algorithm', 'jet/e', 'jet/px', 'jet/py', 'jet/pz', 'jet/words', '_GROUPS_', '_MAP_DATASETS_TO_DATA_TYPES_', '_PROTECTED_NAMES_'])


## Reading into Awkward Arrays

Awkward arrays are a very fast datatype for heterogeneous datasets. Reading hepfiles into them is straightforward: add the argument `return_type='awkward'` to `load`. Note: the returned `event` is still a simple dictionary.

In [7]:
data,event = load(infile, return_type='awkward')

In [8]:
data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset

[{METpx: 0.386, METpy: 0.213, jet: {...}, muons: {...}},
 {METpx: 0.123, METpy: 0.927, jet: {...}, muons: {...}},
 {METpx: 0.863, METpy: 0.178, jet: {...}, muons: {...}},
 {METpx: 0.0628, METpy: 0.754, jet: {...}, muons: {...}},
 {METpx: 0.161, METpy: 0.408, jet: {...}, muons: {...}},
 {METpx: 0.217, METpy: 0.853, jet: {...}, muons: {...}},
 {METpx: 0.539, METpy: 0.761, jet: {...}, muons: {...}},
 {METpx: 0.631, METpy: 0.723, jet: {...}, muons: {...}},
 {METpx: 0.376, METpy: 0.846, jet: {...}, muons: {...}},
 {METpx: 0.091, METpy: 0.517, jet: {...}, muons: {...}},
 ...,
 {METpx: 0.0554, METpy: 0.0152, jet: {...}, muons: {...}},
 {METpx: 0.404, METpy: 0.41, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.825, METpy: 0.688, jet: {...}, muons: {...}},
 {METpx: 0.392, METpy: 0.177, jet: {...}, muons: {...}},
 {METpx: 0.187, METpy: 0.311, jet: {...}, muons: {...}},
 {METpx: 0.873, METpy: 0.356, jet: {...}, muons: {...}},
 {METpx: 0.036, METpy: 0.052, jet: {...}, muons: {...}},
 {METpx: 0.168, 

In [9]:
event

{'METpx': None,
 'METpy': None,
 '_SINGLETONS_GROUP_/COUNTER': None,
 'jet/algorithm': None,
 'jet/e': None,
 'jet/njet': None,
 'jet/px': None,
 'jet/py': None,
 'jet/pz': None,
 'jet/words': None,
 'muons/e': None,
 'muons/nmuon': None,
 'muons/px': None,
 'muons/py': None,
 'muons/pz': None}

With `return_type='awkward'`, you can still select a subset of the data in the same way:

In [10]:
data,event = load(infile, return_type='awkward', desired_groups=['jet'], subset=(5,10))

In [11]:
data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset

[{jet: {algorithm: [-1, 0, -1, ..., -1, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, -1, 0, ..., 0, 0, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, 0, 0, ..., 0, -1, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, -1, 0, ..., 0, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, 0, 0, ..., -1, 0], e: [...], px: [...], ...}}]

[{algorithm: [-1, 0, -1, 0, ..., 0, -1, 0], e: [0.667, ...], px: [...], ...},
 {algorithm: [0, -1, 0, 0, ..., 0, 0, 0], e: [0.75, ...], px: [...], ...},
 {algorithm: [0, 0, 0, 0, ..., 0, -1, 0], e: [0.564, ...], px: [...], ...},
 {algorithm: [0, -1, 0, -1, ..., -1, 0, 0], e: [0.344, ...], px: [...], ...},
 {algorithm: [0, 0, 0, 0, ..., -1, -1, 0], e: [0.289, ...], px: [...], ...}]

[[0.891, 0.684, 0.719, 1, 0.843, 0.973, ..., 0.74, 0.663, 0.204, 0.787, 0.666],
 [0.867, 0.442, 0.319, 0.476, 0.162, ..., 0.285, 0.848, 0.907, 0.99, 0.627],
 [0.47, 0.867, 0.454, 0.656, 0.66, ..., 0.381, 0.67, 0.232, 0.814, 0.883],
 [0.769, 0.144, 0.452, 0

In [12]:
event

{'jet/algorithm': None,
 'jet/e': None,
 'jet/njet': None,
 'jet/px': None,
 'jet/py': None,
 'jet/pz': None,
 'jet/words': None}

## Reading into a Dictionary of Pandas DataFrames

To read into a dictionary of pandas DataFrames, where each DataFrame holds the data for one group, pass `return_type='pandas'` to `load`.

In [13]:
data, event = load(infile, return_type='pandas')

In [14]:
print(f'Group Names: {data.keys()}')

Group Names: dict_keys(['_SINGLETONS_GROUP_', 'jet', 'muons'])


In [15]:
print('jet information:')
data['jet']

jet information:


| | algorithm | e | px | py | pz | words | event_num |
|---|---|---|---|---|---|---|---|
| 0 | -1 | 0.518825 | 0.725761 | 0.133617 | 0.498106 | b'aloha' | 0 |
| 1 | 0 | 0.066499 | 0.167956 | 0.662646 | 0.454498 | b'hi' | 0 |
| 2 | 0 | 0.812855 | 0.551531 | 0.123245 | 0.056884 | b'ciao' | 0 |
| 3 | -1 | 0.292169 | 0.459002 | 0.758781 | 0.953022 | b'ciao' | 0 |
| 4 | 0 | 0.512365 | 0.520725 | 0.240334 | 0.485343 | b'bye' | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 169995 | 0 | 0.625886 | 0.863305 | 0.751234 | 0.550784 | b'aloha' | 9999 |
| 169996 | 0 | 0.069708 | 0.845667 | 0.879986 | 0.359886 | b'bye' | 9999 |
| 169997 | 0 | 0.277128 | 0.495122 | 0.723845 | 0.481453 | b'bye' | 9999 |
| 169998 | 0 | 0.404217 | 0.762674 | 0.064453 | 0.647051 | b'bye' | 9999 |


Once again, we can load a subset of the data restricted to specific groups. Note, however, that the event numbers are reset to 0-4 when we load a subset of 5 events. If this is a problem, consider converting the default output of `load` to a dictionary of pandas DataFrames by hand using `hf.df_tools.hepfile_to_df`.

In [16]:
data,event = load(infile, return_type='pandas', desired_groups=['jet'], subset=(5,10))

In [17]:
data['jet']

| | algorithm | e | px | py | pz | words | event_num |
|---|---|---|---|---|---|---|---|
| 0 | -1 | 0.667102 | 0.891190 | 0.718513 | 0.768162 | b'bye' | 0 |
| 1 | 0 | 0.146582 | 0.683707 | 0.756508 | 0.472253 | b'ciao' | 0 |
| 2 | -1 | 0.865275 | 0.718874 | 0.927169 | 0.794849 | b'hi' | 0 |
| 3 | 0 | 0.776318 | 0.999773 | 0.350176 | 0.440168 | b'hi' | 0 |
| 4 | 0 | 0.462614 | 0.843460 | 0.351398 | 0.929219 | b'hi' | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | -1 | 0.544807 | 0.688257 | 0.273543 | 0.637789 | b'ciao' | 4 |
| 81 | -1 | 0.239300 | 0.235639 | 0.579898 | 0.609811 | b'bye' | 4 |
| 82 | -1 | 0.971387 | 0.206133 | 0.797268 | 0.155473 | b'aloha' | 4 |
| 83 | -1 | 0.576240 | 0.870437 | 0.513300 | 0.039285 | b'bye' | 4 |
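
If the only issue is the renumbering, a lighter workaround is to shift the `event_num` values back by the start of the subset. The sketch below assumes that the first element of `subset` is the zero-based index of the first event read, which is consistent with `subset=(5,10)` returning 5 events but is not stated explicitly above. `restore_event_numbers` is a hypothetical helper written for this illustration, not a hepfile function.

```python
def restore_event_numbers(local_event_nums, subset):
    """Shift zero-based event numbers back to their position in the full file.

    Assumes subset[0] is the index of the first event that was read.
    """
    first_event = subset[0]
    return [first_event + n for n in local_event_nums]


subset = (5, 10)
local_event_nums = [0, 0, 1, 4]  # hypothetical event_num values from a subset load
restored = restore_event_numbers(local_event_nums, subset)
print(restored)  # [5, 5, 6, 9]
```

With a pandas DataFrame, the same shift can be written as `data['jet']['event_num'] + subset[0]`.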
