# Handling LH5 data

LEGEND stores its data in [HDF5](https://www.hdfgroup.org/solutions/hdf5) format, a high-performance data format becoming popular in experimental physics. LEGEND Data Objects (LGDO) are represented as HDF5 objects according to a custom specification, documented [here](https://legend-exp.github.io/legend-data-format-specs/dev/hdf5).

## Reading data from disk

Let's start by downloading a small test LH5 file with the [pylegendtestdata](https://pypi.org/project/pylegendtestdata/) package (it takes a while depending on your internet connection):

In [None]:
from legendtestdata import LegendTestData

ldata = LegendTestData()
lh5_file = ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5")

We can use `lgdo.lh5.ls()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5.tools.ls) to inspect the file contents:

In [None]:
from lgdo.lh5 import ls

ls(lh5_file)

This particular file contains an HDF5 group (groups behave like directories). The second argument of `ls()` selects a group to inspect. With a trailing `/`, the contents of the group are listed; without it, only the group name itself is returned, if it exists:

In [None]:
ls(lh5_file, "geds/")  # returns ['geds/raw'], which is a group again
ls(lh5_file, "geds/raw/")

<div class="alert alert-info">

**Note:** As an alternative to `ls()`, `show()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5.tools.show) prints a readable representation of the LH5 file contents (with LGDO types) on screen:

</div>

In [None]:
from lgdo.lh5 import show

show(lh5_file)

The group contains several LGDOs. Let's read them in memory. We start by initializing an `LH5Store` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5.store.LH5Store) object:

In [None]:
from lgdo.lh5 import LH5Store

store = LH5Store()

`read()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5.store.LH5Store.read) reads an LGDO from disk and returns the object in memory together with the number of rows (as a tuple), if an object has such a property. Let's try to read `geds/raw`:

In [None]:
store.read("geds/raw", lh5_file)

As shown by the type signature, it is interpreted as a `Table` with 100 rows. Its contents (or "columns") can therefore be viewed as LGDO objects of the same length. For example, `timestamp`:

In [None]:
obj, n_rows = store.read("geds/raw/timestamp", lh5_file)
obj

is an LGDO `Array` with 100 elements.

`read()` also supports more advanced data reading. For example, let's read only 10 rows starting from row 15 (that is, rows 15 to 24):

In [None]:
obj, n_rows = store.read("geds/raw/timestamp", lh5_file, start_row=15, n_rows=10)
print(obj)

Or, let's read only columns `timestamp` and `energy` from the `geds/raw` table and rows `[1, 3, 7, 9, 10, 15]`:

In [None]:
obj, n_rows = store.read(
    "geds/raw", lh5_file, field_mask=("timestamp", "energy"), idx=[1, 3, 7, 9, 10, 15]
)
print(obj)
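
The selection options used so far (`start_row`/`n_rows`, `field_mask` and `idx`) can be pictured on a toy table made of plain Python lists. This is only a sketch of the selection semantics, not how `lgdo` reads data from disk: `select` and `toy_table` are hypothetical names, and the sketch assumes that `start_row=15, n_rows=10` means "10 rows starting at row 15".

```python
# Toy stand-in for a table: column name -> list of values (one per row).
toy_table = {
    "timestamp": [float(i) for i in range(100)],
    "energy": [1000 + i for i in range(100)],
    "channel": [i % 4 for i in range(100)],
}


def select(table, start_row=0, n_rows=None, field_mask=None, idx=None):
    """Hypothetical helper mimicking row/column selection on a dict of columns."""
    # Keep only the requested columns (all of them if no mask is given).
    names = list(table) if field_mask is None else list(field_mask)
    out = {}
    for name in names:
        column = table[name]
        if idx is not None:
            # Explicit row indices: take exactly the listed rows.
            out[name] = [column[i] for i in idx]
        else:
            # Contiguous range: n_rows rows starting at start_row.
            stop = len(column) if n_rows is None else start_row + n_rows
            out[name] = column[start_row:stop]
    return out


rows = select(toy_table, field_mask=("timestamp",), start_row=15, n_rows=10)
picked = select(
    toy_table, field_mask=("timestamp", "energy"), idx=[1, 3, 7, 9, 10, 15]
)

print(rows["timestamp"][0], rows["timestamp"][-1], len(rows["timestamp"]))
print(picked["energy"])
```

The first call returns rows 15 to 24 (row 25 is not included), and the second one returns six rows for the two requested columns only.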

As you might have noticed, `read()` loads all the requested data in memory at once. This can be a problem when dealing with large datasets. `LH5Iterator` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5.iterator.LH5Iterator) makes it possible to handle data one chunk at a time (sequentially) to avoid running out of memory:

In [None]:
from lgdo.lh5 import LH5Iterator

for lh5_obj, entry, n_rows in LH5Iterator(lh5_file, "geds/raw/energy", buffer_len=20):
    print(f"entry {entry}, energy = {lh5_obj} ({n_rows} rows)")
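
To see why chunked reading keeps memory usage bounded, here is a minimal pure-Python sketch of the same pattern. It is not the `LH5Iterator` implementation: `iter_chunks` is a hypothetical generator that only mirrors the `(object, entry, n_rows)` tuples used in the loop above, with an in-memory list standing in for the data on disk.

```python
def iter_chunks(values, buffer_len):
    """Hypothetical generator yielding (chunk, entry, n_rows), one buffer at a time."""
    for entry in range(0, len(values), buffer_len):
        # Only buffer_len elements (or fewer, for the last chunk) are handled at once.
        chunk = values[entry : entry + buffer_len]
        yield chunk, entry, len(chunk)


energies = list(range(50))  # stand-in for a column too large to load in one go

total = 0
seen = []
for chunk, entry, n_rows in iter_chunks(energies, buffer_len=20):
    total += sum(chunk)  # accumulate a result without keeping the chunks around
    seen.append((entry, n_rows))

print(seen)
print(total)
```

Note that the last chunk can be shorter than `buffer_len`, which is why the number of valid rows is reported alongside each chunk.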

### Converting LGDO data to alternative formats

Each LGDO is equipped with a method called `view_as()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.types.html#lgdo.types.lgdo.LGDO.view_as), which allows the user to "view" the data (i.e. avoiding copying data as much as possible) in a different, third-party format.

LGDOs generally support viewing as NumPy (`np`), Pandas (`pd`) or [Awkward](https://awkward-array.org) (`ak`) data structures, with some exceptions. We strongly recommend having a look at the `view_as()` API docs of each LGDO type for more details (for `Table.view_as()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.types.html#lgdo.types.table.Table.view_as), for example).

<div class="alert alert-info">

**Note:** To obtain a copy of the data in the selected third-party format, the user can call the appropriate third-party copy method on the view (e.g. `pandas.DataFrame.copy()`, if viewing the data as a Pandas dataframe).

</div>
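
The difference between a view and a copy can be illustrated with the Python standard library alone. This sketch does not involve `lgdo` at all; it only assumes the usual meaning of "view" (a second handle on the same memory) and uses `array` and `memoryview` as stand-ins for an LGDO and its third-party view.

```python
from array import array

data = array("d", [1.0, 2.0, 3.0])  # stand-in for the original data buffer

view = memoryview(data)  # a view: shares memory with `data`
copy = array("d", data)  # a copy: owns an independent buffer

data[0] = 99.0  # modify the original in place

print(view[0])  # the view sees the change
print(copy[0])  # the copy still holds the old value
```

A view is cheap to create but stays tied to the original data, while a copy costs memory and time but can be modified safely.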

Let's play around with our good old table. Can we view it as a Pandas dataframe?

In [None]:
obj, _ = store.read("geds/raw", lh5_file)
df = obj.view_as("pd")
df

Yes! But how are the nested objects being handled?

Nested tables have been flattened by prefixing their column names with the table object name (`obj.waveform.values` becomes `df.waveform_values`) and multi-dimensional columns are represented by Awkward arrays:

In [None]:
df.waveform_values
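
The naming rule for flattened columns can be sketched with nested dictionaries. This is an illustration of the rule described above, not the code used by `Table.view_as()`: `flatten_columns` is a hypothetical helper, and the `t0` and `dt` fields of the toy `waveform` table are assumed here only for the sake of the example.

```python
def flatten_columns(table, prefix=""):
    """Hypothetical helper: flatten nested dicts, joining names with '_'."""
    flat = {}
    for name, value in table.items():
        full_name = f"{prefix}{name}"
        if isinstance(value, dict):
            # Nested table: recurse, prefixing its columns with the table name.
            flat.update(flatten_columns(value, prefix=f"{full_name}_"))
        else:
            flat[full_name] = value
    return flat


nested = {
    "timestamp": [0.1, 0.2],
    "waveform": {
        "t0": [0, 0],
        "dt": [16, 16],
        "values": [[1, 2, 3], [4, 5, 6]],
    },
}

flat = flatten_columns(nested)
print(list(flat))
```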

But what if we wanted to have the waveform values as a NumPy array?

In [None]:
obj.waveform.values.view_as("np")

Can we just view the full table as a huge Awkward array? Of course:

In [None]:
obj.view_as("ak")

Note that viewing a `VectorOfVectors` as an Awkward array is a nearly zero-copy operation and gives access to the fast, vectorized computations offered by Awkward:

In [None]:
import awkward as ak

# tracelist is a VoV on disk
trlist = obj.tracelist.view_as("ak")
ak.mean(trlist)
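
One way to picture why this view is cheap: assuming the layout described in the LGDO specification, a vector of vectors is kept as one flat data array plus an array of cumulative lengths, which is close to how jagged arrays are commonly represented. The pure-Python sketch below only illustrates that layout; `unflatten` is a hypothetical helper and the numbers are made up.

```python
# Three vectors of different lengths, [1, 2], [3] and [4, 5, 6], stored flat.
flattened_data = [1, 2, 3, 4, 5, 6]
cumulative_length = [2, 3, 6]  # index one past the end of each vector


def unflatten(flattened_data, cumulative_length):
    """Hypothetical helper: rebuild the list of vectors from the flat layout."""
    vectors = []
    start = 0
    for stop in cumulative_length:
        vectors.append(flattened_data[start:stop])
        start = stop
    return vectors


vectors = unflatten(flattened_data, cumulative_length)
print(vectors)

# A mean over all elements only needs the flat data, not the vector boundaries.
mean = sum(flattened_data) / len(flattened_data)
print(mean)
```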

Last but not least, we support attaching physical units (that might be stored in the `units` attribute of an LGDO) to data views through Pint, if the third-party format allows it:

In [None]:
df = obj.view_as("pd", with_units=True)
df.timestamp.dtype

Note that we also provide the `read_as()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.lh5.html#lgdo.lh5.tools.read_as) shortcut to save some typing, for users that would like to read LH5 data on disk straight into some third-party format:

In [None]:
from lgdo.lh5 import read_as

read_as("geds/raw", lh5_file, "pd", with_units=True)

## Writing data to disk

Let's start by creating some LGDOs:

In [None]:
from lgdo import Array, Scalar, WaveformTable
import numpy as np

rng = np.random.default_rng(12345)

scalar = Scalar("made with legend-pydataobj!")
array = Array(rng.random(size=10))
wf_table = WaveformTable(values=rng.integers(low=1000, high=5000, size=(10, 1000)))

The `write()` [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5.store.LH5Store.write) method of `LH5Store` makes it possible to write LGDO objects to disk. Let's start by writing `scalar` with name `message` in a file named `my_objects.lh5` in the current directory:

In [None]:
store = LH5Store()

store.write(scalar, name="message", lh5_file="my_objects.lh5", wo_mode="overwrite_file")

Let's now inspect the file contents:

In [None]:
from lgdo.lh5 import show

show("my_objects.lh5")

The string object has been written at the root of the file `/`. Let's also write `array` and `wf_table`, this time in an HDF5 group called `closet`:

In [None]:
store.write(array, name="numbers", group="closet", lh5_file="my_objects.lh5")
store.write(wf_table, name="waveforms", group="closet", lh5_file="my_objects.lh5")
show("my_objects.lh5")

Everything looks right!

<div class="alert alert-info">

**Note:** `LH5Store.write()` allows for more advanced usage, like writing only some rows of the input object or appending to existing array-like structures. Have a look at the [[docs]](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5.store.LH5Store.write) for more information.

</div>