# PyTables: working with hierarchical data formats

An introductory tutorial to `pytables` and `HDF5`, based on the documentation at [https://www.pytables.org/usersguide/tutorials.html](https://www.pytables.org/usersguide/tutorials.html).

## Importing tables objects
Before starting, you need to import the public objects in the `tables` package. You normally do that by executing:

In [1]:
from rich import print
import tables
import numpy as np

## Declaring a Column Descriptor
Now, imagine that we have a particle detector and we want to create a table object in order to save data retrieved from it. First we need to define the table: the number of columns it has, what kind of object is contained in each column, and so on.

Having determined our columns and their types, we can now declare a new Particle class that will contain all this information:

In [2]:
from tables import *

In [3]:
class Particle(IsDescription):
    name      = StringCol(16)   # 16-character String
    idnumber  = Int64Col()      # Signed 64-bit integer
    ADCcount  = UInt16Col()     # Unsigned short integer
    TDCcount  = UInt8Col()      # unsigned byte
    grid_i    = Int32Col()      # 32-bit integer
    grid_j    = Int32Col()      # 32-bit integer
    pressure  = Float32Col()    # float  (single-precision)
    energy    = Float64Col()    # double (double-precision)
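
The printed description later in this tutorial lists the columns alphabetically rather than in declaration order. The following is a minimal sketch of how the optional `pos` argument of the column classes can pin the order. `OrderedParticle`, `DefaultParticle`, and the scratch file name are hypothetical, introduced only for this illustration:

```python
import os
import tempfile

import tables


class OrderedParticle(tables.IsDescription):
    # Hypothetical variant of Particle: `pos` fixes the column order.
    name = tables.StringCol(16, pos=0)
    idnumber = tables.Int64Col(pos=1)
    energy = tables.Float64Col(pos=2)


class DefaultParticle(tables.IsDescription):
    # Same columns without `pos`: the columns end up sorted by name.
    name = tables.StringCol(16)
    idnumber = tables.Int64Col()
    energy = tables.Float64Col()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "column_order.h5")  # throwaway file
    with tables.open_file(path, mode="w") as h5:
        ordered_names = list(h5.create_table("/", "ordered", OrderedParticle).colnames)
        default_names = list(h5.create_table("/", "default", DefaultParticle).colnames)

print(ordered_names)
print(default_names)
```

Column order matters mostly when rows are supplied as plain tuples, because the tuple fields must follow the table's column order.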

## Creating a PyTables file from scratch
Use the top-level `open_file()` function to create a PyTables file:

In [4]:
h5file = open_file("tutorial1.h5", mode="w", title="Test file")

## Creating a new group
Now, to better organize our data, we will create a `group` called `detector` that branches from the root node. We will save our particle data table in this group:

In [5]:
group = h5file.create_group("/", 'detector', title='Detector information')

Here, we have taken the File instance `h5file` and invoked its `File.create_group()` method to create a new group called `detector` branching from `/` (another way to refer to the `h5file.root` object). This creates a new Group object instance (see The Group class) that is assigned to the variable `group`.
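
As a small sketch of the `where` argument, the snippet below passes a path string for a parent that does not exist yet and asks PyTables to create it with `createparents=True`. The `/run1` group and the scratch file name are assumptions made for this example only:

```python
import os
import tempfile

import tables

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "groups.h5")  # throwaway file
    with tables.open_file(path, mode="w") as h5:
        # createparents=True builds the missing /run1 group on the way down.
        det = h5.create_group("/run1", "detector",
                              title="Detector information",
                              createparents=True)
        group_path = det._v_pathname
        group_title = det._v_title
        # The same node is reachable through natural naming from the root.
        natural_path = h5.root.run1.detector._v_pathname

print(group_path, group_title)
```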

## Creating a new table (dataset)
Let’s now create a Table (see The Table class) object as a branch off the newly-created group. We do that by calling the `File.create_table()` method of the h5file object:

In [6]:
table = h5file.create_table(group, 'readout', Particle, title="Readout example")

We create the Table instance under group. We assign this table the node name `readout`. The Particle class declared before is the description parameter (to define the columns of the table) and finally we set `Readout example` as the Table title. With all this information, a new Table instance is created and assigned to the variable table.

If you are curious about how the object tree looks right now, simply print the File instance variable `h5file`, and examine the output:

In [7]:
print(h5file)

As you can see, a dump of the object tree is displayed. It’s easy to see the Group and Table objects we have just created. If you want more information, just type the variable containing the File instance:

In [8]:
h5file

File(filename=tutorial1.h5, title='Test file', mode='w', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) 'Test file'
/detector (Group) 'Detector information'
/detector/readout (Table(0,)) 'Readout example'
  description := {
  "ADCcount": UInt16Col(shape=(), dflt=0, pos=0),
  "TDCcount": UInt8Col(shape=(), dflt=0, pos=1),
  "energy": Float64Col(shape=(), dflt=0.0, pos=2),
  "grid_i": Int32Col(shape=(), dflt=0, pos=3),
  "grid_j": Int32Col(shape=(), dflt=0, pos=4),
  "idnumber": Int64Col(shape=(), dflt=0, pos=5),
  "name": StringCol(itemsize=16, shape=(), dflt=b'', pos=6),
  "pressure": Float32Col(shape=(), dflt=0.0, pos=7)}
  byteorder := 'little'
  chunkshape := (1394,)

More detailed information is displayed about each object in the tree. Note how Particle, our table descriptor class, is printed as part of the readout table description information. In general, you can **obtain much more information about the objects and their children by just printing them**. That introspection capability is very useful, and I recommend that you use it extensively.
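
The same information is also available programmatically. The sketch below walks the object tree with `File.walk_nodes()` and reads the column names and types from a table. The single-column `Readout` description and the scratch file are hypothetical stand-ins for the tutorial's objects:

```python
import os
import tempfile

import tables


class Readout(tables.IsDescription):
    # Hypothetical one-column description, just enough to inspect.
    energy = tables.Float64Col()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "introspect.h5")  # throwaway file
    with tables.open_file(path, mode="w", title="Test file") as h5:
        group = h5.create_group("/", "detector", title="Detector information")
        h5.create_table(group, "readout", Readout, title="Readout example")

        # Every node below the root, including the root itself.
        node_paths = [node._v_pathname for node in h5.walk_nodes("/")]

        # Column names and NumPy dtypes of the table.
        table = h5.get_node("/detector/readout")
        column_types = {name: str(table.coldtypes[name])
                        for name in table.colnames}

print(node_paths)
print(column_types)
```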

The time has come to fill this table with some values. First we will **get a pointer to the Row** (see The Row class) instance of this table instance:

In [9]:
particle = table.row

The row attribute of table points to the Row instance that will be used to write data rows into the table. We write data simply by assigning the Row instance the values for each row as if it were a dictionary (although it is actually an extension class), using the column names as keys.

Below is an example of how to write rows:

In [10]:
for i in range(10):
    particle['name']  = f'Particle: {i:6d}'
    particle['TDCcount'] = i % 256
    particle['ADCcount'] = (i * 256) % (1 << 16)
    particle['grid_i'] = i
    particle['grid_j'] = 10 - i
    particle['pressure'] = float(i*i)
    particle['energy'] = float(particle['pressure'] ** 4)
    particle['idnumber'] = i * (2 ** 34)

    # Insert a new particle record
    particle.append()

This code should be easy to understand. The lines inside the loop just assign values to the different columns in the Row instance particle (see The Row class). A call to its append() method **writes this information to the table I/O buffer**.

After we have processed all our data, we should **flush the table’s I/O buffer if we want to write all this data to disk**. We achieve that by calling the table.flush() method:

In [11]:
table.flush()

Remember, flushing a table is a very important step as it will not only help to **maintain the integrity of your file**, but also will **free valuable memory resources** (i.e. internal buffers) that your program may need for other things.
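
Filling a `Row` in a loop is not the only way to add data. As a hedged alternative, the sketch below builds a NumPy structured array with the table's own dtype and writes it in one `Table.append()` call. The `Hit` description and the file name are hypothetical:

```python
import os
import tempfile

import numpy as np
import tables


class Hit(tables.IsDescription):
    # Hypothetical two-column description for this example.
    idnumber = tables.Int64Col()
    energy = tables.Float64Col()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "bulk_append.h5")  # throwaway file
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "hits", Hit)

        # Fill fields by name, so the column order does not matter here.
        rows = np.zeros(5, dtype=table.dtype)
        rows["idnumber"] = np.arange(5)
        rows["energy"] = np.arange(5) ** 2

        table.append(rows)   # write all five rows at once
        table.flush()        # push buffered data to disk

        n_rows = table.nrows
        energies = table.col("energy").tolist()

print(n_rows, energies)
```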

## Reading (and selecting) data in a table
Ok. We have our data on disk, and now we need to access it and select from specific columns the values we are interested in. See the example below:

In [12]:
table = h5file.root.detector.readout
pressure = [x['pressure'] for x in table.iterrows() if x['TDCcount'] > 3 
            and 20 <= x['pressure'] < 50]

print(pressure)

The first line creates a “shortcut” to the readout table deeper on the object tree. As you can see, we use the natural naming schema to access it. We also could have used the `h5file.get_node()` method, as we will do later on.
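
For comparison, here is a minimal sketch of the two access styles side by side. The array stored under `/detector/readout` is a hypothetical placeholder for the tutorial's table:

```python
import os
import tempfile

import tables

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "get_node.h5")  # throwaway file
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "detector")
        h5.create_array(group, "readout", [1, 2, 3])  # placeholder leaf

        # Natural naming: attribute access from the root group.
        by_name = h5.root.detector.readout._v_pathname
        # get_node(): full path, or parent path plus node name.
        by_path = h5.get_node("/detector/readout")._v_pathname
        by_parts = h5.get_node("/detector", "readout")._v_pathname

print(by_name, by_path, by_parts)
```

`get_node()` is convenient when the path is only known at run time, for example when it is built from a string.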

You will recognize the last two lines as a Python list comprehension. It loops over the rows in table as they are provided by the `Table.iterrows()` iterator. The iterator returns values until all the data in table is exhausted. These rows are filtered using the expression:

```python
x['TDCcount'] > 3 and 20 <= x['pressure'] < 50
```

PyTables does offer other, more powerful ways of performing selections, which may be more suitable if you have very large tables or if you need very high query speeds. They are called **in-kernel and indexed queries**, and you can use them through `Table.where()` and other related methods.

Let’s use an in-kernel selection to query the name column for the same set of cuts:

In [13]:
names = [ x['name'] for x in table.where("""(TDCcount > 3) & (20 <= pressure) & (pressure < 50)""") ]

print(names)
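
The sketch below, built on a reduced two-column table, checks that the in-kernel condition selects the same rows as the plain Python filter. It also shows `Table.read_where()`, which returns the matching rows as a NumPy structured array. The `Readout` description, the file name, and the `lo`/`hi` variable names are assumptions made for this example:

```python
import os
import tempfile

import tables


class Readout(tables.IsDescription):
    # Hypothetical subset of the Particle columns.
    TDCcount = tables.UInt8Col()
    pressure = tables.Float32Col()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "queries.h5")  # throwaway file
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "readout", Readout)
        row = table.row
        for i in range(10):
            row["TDCcount"] = i
            row["pressure"] = float(i * i)
            row.append()
        table.flush()

        condition = "(TDCcount > 3) & (20 <= pressure) & (pressure < 50)"

        # Plain Python filtering, row by row.
        plain = [float(r["pressure"]) for r in table.iterrows()
                 if r["TDCcount"] > 3 and 20 <= r["pressure"] < 50]

        # In-kernel query through the iterator.
        in_kernel = [float(r["pressure"]) for r in table.where(condition)]

        # In-kernel query returning a structured array in one call.
        selected = table.read_where(condition)["pressure"].tolist()

        # Condition variables can be supplied explicitly via condvars.
        bounded = table.read_where("(pressure >= lo) & (pressure < hi)",
                                   condvars={"lo": 20, "hi": 50})
        n_bounded = len(bounded)

print(plain, in_kernel, selected, n_bounded)
```

Note that the condition string combines comparisons with `&` and parentheses instead of `and` or a chained comparison, as in the tutorial's own query.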

## Creating new array objects
In order to separate the selected data from the mass of detector data, we will create a new group columns branching off the root group. Afterwards, under this group, we will create two arrays that will contain the selected data. First, we create the group:

In [14]:
gcolumns = h5file.create_group(h5file.root, "columns", "Pressure and Name")

Note that this time we have specified the first parameter using natural naming (h5file.root) instead of with an absolute path string (“/”).

Now, create the first of the two Array objects we’ve just mentioned:

In [15]:
h5file.create_array(gcolumns, 'pressure', np.array(pressure), "Pressure column selection")

/columns/pressure (Array(3,)) 'Pressure column selection'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

We already know the first two parameters of the `File.create_array()` method (these are the same as the first two in `create_table`): they are the parent group where the Array will be created and the Array instance name. The third parameter is the object we want to save to disk. In this case, it is a NumPy array that is built from the selection list we created before. The fourth parameter is the title.

Now, we will save the second array. It contains the list of strings we selected before: we save this object as-is, with no further conversion:

In [16]:
h5file.create_array(gcolumns, 'name', names, "Name column selection")

/columns/name (Array(3,)) 'Name column selection'
  atom := StringAtom(itemsize=16, shape=(), dflt=b'')
  maindim := 0
  flavor := 'python'
  byteorder := 'irrelevant'
  chunkshape := None

As you can see, `File.create_array()` accepts names (which is a regular Python list) as an object parameter. Actually, it accepts a variety of different regular objects (see create_array()) as parameters. The flavor attribute (see the output above) saves the original kind of object that was saved. Based on this flavor, PyTables will be able to retrieve exactly the same object from disk later on.
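
To illustrate the role of the flavor, the following sketch stores one NumPy array and one plain Python list, reopens the file, and checks what kind of object comes back from `read()`. The node names and values are hypothetical:

```python
import os
import tempfile

import numpy as np
import tables

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "flavors.h5")  # throwaway file
    with tables.open_file(path, mode="w") as h5:
        h5.create_array("/", "pressure", np.array([25.0, 36.0, 49.0]),
                        "NumPy input")
        h5.create_array("/", "names", [b"Particle: 5", b"Particle: 6"],
                        "List input")

    with tables.open_file(path, mode="r") as h5:
        pressure_node = h5.get_node("/pressure")
        names_node = h5.get_node("/names")

        pressure_flavor = pressure_node.flavor
        names_flavor = names_node.flavor

        pressure_back = pressure_node.read()  # comes back as a NumPy array
        names_back = names_node.read()        # comes back as a Python list

print(pressure_flavor, type(pressure_back).__name__)
print(names_flavor, type(names_back).__name__)
```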

Note that in these examples, the `create_array` method returns an Array instance that is not assigned to any variable. Don’t worry, this is intentional to show the kind of object we have created by displaying its representation. The Array objects have been attached to the object tree and saved to disk, as you can see if you print the complete object tree:

In [17]:
print(h5file)

## Closing the file

In [18]:
h5file.close()
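
An explicit `close()` is easy to forget when an exception interrupts the program. A `File` object can also be used as a context manager, which closes the file when the `with` block ends. A minimal sketch, with a hypothetical file and node name:

```python
import os
import tempfile

import tables

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "context.h5")  # throwaway file

    with tables.open_file(path, mode="w", title="Test file") as h5:
        h5.create_array("/", "numbers", [1, 2, 3])
        open_inside = bool(h5.isopen)

    # Leaving the block closed the file; no explicit close() was needed.
    open_after = bool(h5.isopen)

    # The data was written and can be read back from a fresh handle.
    with tables.open_file(path, mode="r") as h5:
        title = h5.title
        numbers = h5.get_node("/numbers").read()

print(open_inside, open_after, title, numbers)
```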

You have now created your first PyTables file with a table and two arrays. You can examine it with any generic HDF5 tool, such as `h5dump` or `h5ls`. Here is what the tutorial1.h5 looks like when read with the h5ls program.

In [19]:
!h5ls -rd tutorial1.h5

/                        Group
/columns                 Group
/columns/name            Dataset {3}
    Data:
        (0) "Particle:      5", "Particle:      6", "Particle:      7"
/columns/pressure        Dataset {3}
    Data:
        (0) 25, 36, 49
/detector                Group
/detector/readout        Dataset {10/Inf}
    Data:
        (0) {0, 0, 0, 0, 10, 0, "Particle:      0", 0},
        (1) {256, 1, 1, 1, 9, 17179869184, "Particle:      1", 1},
        (2) {512, 2, 256, 2, 8, 34359738368, "Particle:      2", 4},
        (3) {768, 3, 6561, 3, 7, 51539607552, "Particle:      3", 9},
        (4) {1024, 4, 65536, 4, 6, 68719476736, "Particle:      4", 16},
        (5) {1280, 5, 390625, 5, 5, 85899345920, "Particle:      5", 25},
        (6) {1536, 6, 1679616, 6, 4, 103079215104, "Particle:      6", 36},
        (7) {1792, 7, 5764801, 7, 3, 120259084288, "Particle:      7", 49},
        (8) {2048, 8, 16777216, 8, 2, 137438953472, "Particle:      8", 64},
        (9) {2304, 9, 4304672

## Setting and getting user attributes

This time, we open the file in append mode (`"a"`). We use this mode to add more information to the existing file.

In [19]:
h5file = open_file("tutorial1.h5", "a")

PyTables provides an easy and concise way to complement the meaning of your node objects on the tree by using the AttributeSet class (see The AttributeSet class). You can access this object through the standard attribute `attrs` in **Leaf nodes** and `_v_attrs` in **Group nodes**.

For example, let’s imagine that we want to save the date indicating when the data in /detector/readout table has been acquired, as well as the temperature during the gathering process:

In [20]:
table = h5file.root.detector.readout
table.attrs.gath_date = "Wed, 06/12/2003 18:33"
table.attrs.temperature = 18.4
table.attrs.temp_scale = "Celsius"

Now, let’s set a somewhat more complex attribute in the `/detector` group:

In [21]:
detector = h5file.root.detector
detector._v_attrs.stuff = [5, (2.3, 4.5), "Integer and tuple"]

Note how the AttributeSet instance is accessed with the `_v_attrs` attribute because detector is a Group node. In general, you can save any standard Python data structure as an attribute node. See The AttributeSet class for a more detailed explanation of how they are serialized for export to disk.
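
As a sketch of the full round trip, the snippet below sets attributes on a leaf and on a group, closes the file, and reads them back from a fresh handle. It also lists only the user-defined attribute names with `_f_list("user")`. The file and the placeholder array are hypothetical:

```python
import os
import tempfile

import tables

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "attributes.h5")  # throwaway file

    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "detector")
        leaf = h5.create_array(group, "readout", [1, 2, 3])  # placeholder leaf
        leaf.attrs.temperature = 18.4                        # Leaf: .attrs
        group._v_attrs.stuff = [5, (2.3, 4.5), "Integer and tuple"]  # Group: ._v_attrs

    with tables.open_file(path, mode="r") as h5:
        leaf = h5.get_node("/detector/readout")
        # Only the attributes we set ourselves, without CLASS, TITLE, etc.
        user_attrs = sorted(leaf.attrs._f_list("user"))
        temperature = float(leaf.attrs.temperature)
        has_scale = "temp_scale" in leaf.attrs
        stuff = h5.root.detector._v_attrs.stuff

print(user_attrs, temperature, has_scale, stuff)
```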

Retrieving the attributes is equally simple:

In [22]:
table.attrs.gath_date

'Wed, 06/12/2003 18:33'

In [23]:
table.attrs.temperature

18.4

In [24]:
table.attrs.temp_scale

'Celsius'

In [25]:
detector._v_attrs.stuff

[5, (2.3, 4.5), 'Integer and tuple']

If you want to examine the current user attribute set of `/detector/readout`, you can print its representation (try hitting the TAB key twice if you are on a Unix Python console with the rlcompleter module active):

In [26]:
table.attrs

/detector/readout._v_attrs (AttributeSet), 23 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'ADCcount',
    FIELD_1_FILL := 0,
    FIELD_1_NAME := 'TDCcount',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'energy',
    FIELD_3_FILL := 0,
    FIELD_3_NAME := 'grid_i',
    FIELD_4_FILL := 0,
    FIELD_4_NAME := 'grid_j',
    FIELD_5_FILL := 0,
    FIELD_5_NAME := 'idnumber',
    FIELD_6_FILL := b'',
    FIELD_6_NAME := 'name',
    FIELD_7_FILL := 0.0,
    FIELD_7_NAME := 'pressure',
    NROWS := 10,
    TITLE := 'Readout example',
    VERSION := '2.7',
    gath_date := 'Wed, 06/12/2003 18:33',
    temp_scale := 'Celsius',
    temperature := 18.4]