# What Is HDF5? What Is PyTables?

This tutorial is a modified hodgepodge of material from Francesc Alted's and Anthony Scopatz's PyTables tutorials:

www.pytables.org/docs/PyData2012-NYC.pdf
http://pyvideo.org/video/2705/hdf5-is-for-lovers-tutorial-part-1

HDF5 is a free, open-source binary file format, together with a library for reading and writing it. It is a key tool for 'big science' because of its
scalability, multiple APIs, and efficiency with structured, dense arrays of
numbers.

The acronym stands for *Hierarchical Data Format*. It is hierarchical in the
sense that the data is structured much like a directory tree: HDF5 lets datasets live in nested groups. In effect, an HDF5 file is a file system within a file. (SQL databases, by contrast, have only flat tables.)

HDF5 is a binary data format that can store many different datasets along with their metadata, with optimized I/O and the ability to query its contents. It:

- performs operations on-disk
- is free software (BSD)
- has many APIs (C, C++, Fortran 90, Java, and *Python*)
- has no limit on number of data objects
- can represent various data objects as well as metadata

PyTables is a Python package, built on top of the HDF5 library and NumPy, for managing hierarchical datasets. It is not [alted2013]:

- a relational database replacement
- a distributed database
- extremely secure or safe (it’s more about speed!)
- a mere HDF5 wrapper

In [1]:
import numpy as np
import tables as tb

In [3]:
# Create a new file
f = tb.open_file("atest.h5", "w")

In [4]:
# Create a NumPy array, number of ghosts per sq. ft.
a = np.arange(100).reshape(20,5)

In [5]:
# Save the array
f.create_array(f.root, "ghosts", a)

/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

In [6]:
# See data
f.root.ghosts[:]

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44],
       [45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54],
       [55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64],
       [65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74],
       [75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84],
       [85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94],
       [95, 96, 97, 98, 99]])

In [7]:
# Select a slice of the data: every third row from 1 to 9, columns 2 to 4
ta = f.root.ghosts
ta[1:10:3,2:5]

array([[ 7,  8,  9],
       [22, 23, 24],
       [37, 38, 39]])

In [8]:
np.allclose(ta[1:10:3,2:5], a[1:10:3,2:5])

True

In [10]:
# Create another array, number of goblins per square foot
ta2 = f.create_array(f.root, "goblins", np.arange(10))

In [11]:
np.allclose(ta2, np.arange(10))

True

In [12]:
ls -l atest.h5

-rw-rw-r--  1 khuff  staff  0 Nov 14 21:13 atest.h5


In [13]:
# Flush data to the file (very important to keep all your data safe!)
f.flush()

In [14]:
ls -l atest.h5

-rw-rw-r--  1 khuff  staff  3024 Nov 14 21:13 atest.h5


In [15]:
f.close()  # close access to file

In [16]:
ta[:]

ClosedNodeError: the node object is closed
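
The `ClosedNodeError` above is what happens when a node handle outlives its file. One way to make the flush-and-close step hard to forget is to open the file in a `with` block. The sketch below is a minimal illustration, not part of the original tutorial: the temporary directory and the file name `ghosts_demo.h5` are assumptions chosen so that it does not touch `atest.h5`.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch location (not a file used elsewhere in this tutorial).
path = os.path.join(tempfile.mkdtemp(), "ghosts_demo.h5")

# A File object works as a context manager: leaving the block closes the
# file, and closing flushes any buffered data to disk.
with tb.open_file(path, "w") as f:
    node = f.create_array(f.root, "ghosts", np.arange(100).reshape(20, 5))
    print("open inside the block:", bool(f.isopen))

print("open after the block:", bool(f.isopen))

# Reading through the stale node handle fails, just like ta[:] above.
closed_error = None
try:
    node[:]
except tb.ClosedNodeError as err:
    closed_error = err
    print("ClosedNodeError:", err)

# Reopen read-only to get the data back from disk.
with tb.open_file(path, "r") as f:
    restored = f.root.ghosts[:]

print(restored.shape)
```

Because the data is copied into `restored` while the file is open, it remains usable as an ordinary NumPy array after the file is closed.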

## There are different kinds of datasets, though

- Array
- CArray (chunked array)
- EArray (extendable array)
- VLArray (variable length array)
- Table (structured array w/ named fields)

We just had an example of Arrays. Let's look at Tables. 
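
Before turning to Tables, here is a minimal sketch of three of the other containers in the list. The file name, the node names (`sightings`, `grid`, `ragged`) and the choice of the zlib filter are all assumptions made for illustration; none of them come from the original tutorial.

```python
import os
import tempfile

import numpy as np
import tables as tb

path = os.path.join(tempfile.mkdtemp(), "kinds_demo.h5")  # hypothetical file

with tb.open_file(path, "w") as f:
    # EArray: the dimension given as 0 in `shape` is the extendable one.
    ea = f.create_earray(f.root, "sightings", atom=tb.Int64Atom(), shape=(0, 5))
    ea.append(np.arange(10).reshape(2, 5))       # add 2 rows
    ea.append(np.arange(10, 25).reshape(3, 5))   # add 3 more rows
    earray_shape = tuple(int(n) for n in ea.shape)

    # CArray: fixed shape, stored in chunks so it can be compressed.
    ca = f.create_carray(f.root, "grid", atom=tb.Float64Atom(), shape=(100, 100),
                         filters=tb.Filters(complevel=5, complib="zlib"))
    ca[10:20, 10:20] = 1.0    # write only a small block; the rest stays 0.0
    grid_sum = float(ca[:].sum())

    # VLArray: every row may have a different length.
    vl = f.create_vlarray(f.root, "ragged", atom=tb.Int32Atom())
    vl.append([1, 2, 3])
    vl.append([4])
    ragged_lengths = [len(row) for row in vl]

print(earray_shape, grid_sum, ragged_lengths)
```

The plain `Array` used earlier needs all of its data up front and cannot be compressed, which is the main reason to reach for one of these variants.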

In [17]:
ta

<closed tables.array.Array at 0x10b20cc48>

In [18]:
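# Note: openFile is the old camelCase spelling and no longer exists in
# current PyTables, so this call fails; open_file is the name to use.
f = tb.openFile("atest.h5", "r")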

AttributeError: module 'tables' has no attribute 'openFile'

In [19]:
f

<closed File>

In [20]:
# The description for the tabular data
class TabularData(tb.IsDescription):
    names = tb.StringCol(200)
    ages = tb.IntCol()
    heights = tb.FloatCol()

In [22]:
# Open a file and create the Table container
f = tb.open_file('atable.h5', 'w')
t = f.create_table(f.root, 'table', TabularData, 'table title', filters=tb.Filters(5, 'blosc'))

In [23]:
t

/table (Table(0,), shuffle, blosc(5)) 'table title'
  description := {
  "ages": Int32Col(shape=(), dflt=0, pos=0),
  "heights": Float64Col(shape=(), dflt=0.0, pos=1),
  "names": StringCol(itemsize=200, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (309,)

In [26]:
# Fill the table with 1 million rows
from time import time
t0 = time()
r = t.row
for i in range(1000*1000):
    r['names'] = str(i)
    r['ages'] = i + 1
    r['heights'] = i * (i + 1)
    r.append()
t.flush()
print("Insert time: %.3fs" % (time()-t0,))

Insert time: 1.229s


In [27]:
t

/table (Table(1000000,), shuffle, blosc(5)) 'table title'
  description := {
  "ages": Int32Col(shape=(), dflt=0, pos=0),
  "heights": Float64Col(shape=(), dflt=0.0, pos=1),
  "names": StringCol(itemsize=200, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (309,)

In [28]:
# Size on disk
!ls -lh atable.h5

-rw-rw-r--  1 khuff  staff   5.7M Nov 14 21:16 atable.h5


In [29]:
# Uncompressed size in MiB (number of rows x bytes per row)
np.prod(t.shape) * t.dtype.itemsize / 2**20.

202.178955078125
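
The gap between the two numbers above (about 5.7 MB on disk versus about 202 MiB uncompressed) comes from the compression filter, which does very well on the mostly empty 200-byte `names` field. The sketch below reproduces the comparison from within Python under a few stated assumptions: it uses a temporary file with a made-up name, only 100,000 rows to keep it quick, and zlib instead of the tutorial's blosc, so the exact ratio will differ from the one shown above.

```python
import os
import tempfile

import tables as tb


class TabularData(tb.IsDescription):
    names = tb.StringCol(200)
    ages = tb.IntCol()
    heights = tb.FloatCol()


path = os.path.join(tempfile.mkdtemp(), "compress_demo.h5")  # hypothetical file

with tb.open_file(path, "w") as f:
    t = f.create_table(f.root, "table", TabularData,
                       filters=tb.Filters(complevel=5, complib="zlib"))
    r = t.row
    for i in range(100_000):   # fewer rows than the tutorial's 1 million
        r["names"] = str(i)
        r["ages"] = i + 1
        r["heights"] = i * (i + 1)
        r.append()
    t.flush()
    # Uncompressed size: number of rows times bytes per row (200 + 4 + 8).
    raw_bytes = int(t.nrows) * int(t.rowsize)

# What actually landed on disk, measured after the file is closed.
disk_bytes = os.path.getsize(path)

print("uncompressed: %.1f MiB" % (raw_bytes / 2**20))
print("on disk:      %.1f MiB" % (disk_bytes / 2**20))
print("ratio:        %.0fx" % (raw_bytes / disk_bytes))
```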

In [30]:
# Regular query: iterate over every row in Python
%time [r['names'] for r in t if r['ages'] < 10]

CPU times: user 296 ms, sys: 13.1 ms, total: 309 ms
Wall time: 316 ms


[b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8']

In [31]:
# Repeat the query using the in-kernel method (the condition is evaluated inside PyTables)
%time [r['names'] for r in t.where('ages < 10')]

CPU times: user 200 ms, sys: 24.6 ms, total: 225 ms
Wall time: 215 ms


[b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8']

In [32]:
# Performing complex conditions (regular query)
%time [r['names'] for r in t if r['ages'] < 10 and r['heights'] < 10]

CPU times: user 280 ms, sys: 8.39 ms, total: 288 ms
Wall time: 291 ms


[b'0', b'1', b'2']

In [33]:
# Complex, in-kernel queries
%time [r['names'] for r in t.where('(ages < 10) & (heights < 10)')]

CPU times: user 221 ms, sys: 33.9 ms, total: 255 ms
Wall time: 242 ms


[b'0', b'1', b'2']

In [34]:
# Read the whole table from disk into a NumPy structured array
sa = t[:]
sa

array([(1, 0.0, b'0'), (2, 2.0, b'1'), (3, 6.0, b'2'), ...,
       (999998, 999995000006.0, b'999997'),
       (999999, 999997000002.0, b'999998'),
       (1000000, 999999000000.0, b'999999')], 
      dtype=[('ages', '<i4'), ('heights', '<f8'), ('names', 'S200')])

In [35]:
# Perform the same query in memory with NumPy
%time sa[((sa['ages'] < 10) & (sa['heights'] < 10))]['names']

CPU times: user 24.5 ms, sys: 3.48 ms, total: 27.9 ms
Wall time: 28.1 ms


array([b'0', b'1', b'2'], 
      dtype='|S200')

In [39]:
# Create an index for the on-disk table
%time t.cols.ages.create_csindex()

CPU times: user 944 ms, sys: 33.9 ms, total: 978 ms
Wall time: 988 ms


1000000

In [40]:
# Repeat the complex query (indexed)
%time [r['names'] for r in t.where('(ages < 10) & (heights < 10)')]

CPU times: user 3.73 ms, sys: 1.56 ms, total: 5.29 ms
Wall time: 5.44 ms


[b'0', b'1', b'2']
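
The drop in query time above comes from the index on `ages`: once it exists, `where()` can skip most of the table. A small sketch of how one might check that a condition will actually use an index is given below. It assumes a scratch file with a made-up name and a 10,000-row table, and it uses `will_query_use_indexing`, `is_indexed` and `read_where`, which do not appear in the original cells.

```python
import os
import tempfile

import tables as tb


class TabularData(tb.IsDescription):
    names = tb.StringCol(200)
    ages = tb.IntCol()
    heights = tb.FloatCol()


path = os.path.join(tempfile.mkdtemp(), "index_demo.h5")  # hypothetical file

with tb.open_file(path, "w") as f:
    t = f.create_table(f.root, "table", TabularData)
    r = t.row
    for i in range(10_000):
        r["names"] = str(i)
        r["ages"] = i + 1
        r["heights"] = i * (i + 1)
        r.append()
    t.flush()

    query = "(ages < 10) & (heights < 10)"

    # Names of the indexed columns the query can use (empty before indexing).
    before = t.will_query_use_indexing(query)
    t.cols.ages.create_csindex()
    after = t.will_query_use_indexing(query)

    indexed = bool(t.cols.ages.is_indexed)

    # read_where returns all matching rows as one structured array.
    names = t.read_where(query)["names"].tolist()

print(before, after, indexed, names)
```

Only `ages` is indexed here, so the `heights` part of the condition is still evaluated row by row on the candidates that the index selects.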

In [41]:
f.close()

## What's all this about hierarchy?

In [42]:
!cp atest.h5 atest2.h5

In [44]:
f = tb.open_file("atest2.h5", "a")

In [45]:
f.create_group(f.root, 'inschool', 'The kids from school')

/inschool (Group) 'The kids from school'
  children := []

In [46]:
f

File(filename=atest2.h5, title='', mode='a', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/inschool (Group) 'The kids from school'

In [48]:
f.move_node(f.root.ghosts, f.root.inschool)

In [49]:
f

File(filename=atest2.h5, title='', mode='a', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/inschool (Group) 'The kids from school'
/inschool/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

In [50]:
f.create_group('/g1/g2/g3/g4', 'g5', createparents=True)

/g1/g2/g3/g4/g5 (Group) ''
  children := []

In [51]:
f

File(filename=atest2.h5, title='', mode='a', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/g1 (Group) ''
/inschool (Group) 'The kids from school'
/inschool/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/g1/g2 (Group) ''
/g1/g2/g3 (Group) ''
/g1/g2/g3/g4 (Group) ''
/g1/g2/g3/g4/g5 (Group) ''

In [53]:
f.create_array(f.root.g1.g2.g3.g4.g5, 'goblins', np.arange(10))

/g1/g2/g3/g4/g5/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

In [54]:
f

File(filename=atest2.h5, title='', mode='a', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/g1 (Group) ''
/inschool (Group) 'The kids from school'
/inschool/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/g1/g2 (Group) ''
/g1/g2/g3 (Group) ''
/g1/g2/g3/g4 (Group) ''
/g1/g2/g3/g4/g5 (Group) ''
/g1/g2/g3/g4/g5/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

In [55]:
f.root.g1.g2.g3.g4.g5.goblins[:]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [56]:
f.remove_node(f.root.g1.g2.g3.g4.g5.goblins)

In [57]:
f

File(filename=atest2.h5, title='', mode='a', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/g1 (Group) ''
/inschool (Group) 'The kids from school'
/inschool/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/g1/g2 (Group) ''
/g1/g2/g3 (Group) ''
/g1/g2/g3/g4 (Group) ''
/g1/g2/g3/g4/g5 (Group) ''

In [59]:
for n in f: print(n)

/ (RootGroup) ''
/g1 (Group) ''
/goblins (Array(10,)) ''
/inschool (Group) 'The kids from school'
/g1/g2 (Group) ''
/inschool/ghosts (Array(20, 5)) ''
/g1/g2/g3 (Group) ''
/g1/g2/g3/g4 (Group) ''
/g1/g2/g3/g4/g5 (Group) ''


In [61]:
for n in f.walk_nodes(): print(n)

/ (RootGroup) ''
/g1 (Group) ''
/goblins (Array(10,)) ''
/inschool (Group) 'The kids from school'
/g1/g2 (Group) ''
/inschool/ghosts (Array(20, 5)) ''
/g1/g2/g3 (Group) ''
/g1/g2/g3/g4 (Group) ''
/g1/g2/g3/g4/g5 (Group) ''


In [63]:
for n in f.walk_nodes(f.root.inschool): print(n)

/inschool (Group) 'The kids from school'
/inschool/ghosts (Array(20, 5)) ''


In [65]:
for n in f.walk_nodes(f.root.inschool, classname="Array"): print(n[:2])

[[0 1 2 3 4]
 [5 6 7 8 9]]
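
Besides attribute access (`f.root.inschool.ghosts`) and `walk_nodes`, nodes can also be addressed by path string. The sketch below shows a few such calls on a small tree that mirrors the one built above. The file name is an assumption, and `get_node`, `list_nodes`, `walk_groups` and the `in` test are additions that the original cells do not use.

```python
import os
import tempfile

import numpy as np
import tables as tb

path = os.path.join(tempfile.mkdtemp(), "tree_demo.h5")  # hypothetical file

with tb.open_file(path, "w") as f:
    f.create_group("/", "inschool", "The kids from school")
    f.create_array("/inschool", "ghosts", np.arange(100).reshape(20, 5))
    f.create_group("/g1/g2", "g3", createparents=True)

    # Address a node by its path string instead of attribute access.
    ghosts = f.get_node("/inschool/ghosts")
    first_rows = ghosts[:2]

    # `in` tests whether a path exists in the file.
    has_g3 = "/g1/g2/g3" in f
    has_g9 = "/g1/g9" in f

    # list_nodes returns only the direct children of a group.
    children = sorted(n._v_pathname for n in f.list_nodes("/"))

    # walk_groups visits every group below the starting point, at any depth.
    groups = sorted(g._v_pathname for g in f.walk_groups("/"))

print(first_rows)
print(has_g3, has_g9)
print(children)
print(groups)
```

Path strings are convenient when the location of a node is computed at run time, for example from a configuration value.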


In [66]:
f.close()

## Let's talk briefly about metadata

In [67]:
!cp atest2.h5 atest3.h5

In [68]:
import tables as tb
f = tb.open_file("atest3.h5", "a")

In [69]:
f

File(filename=atest3.h5, title='', mode='a', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/g1 (Group) ''
/inschool (Group) 'The kids from school'
/inschool/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/g1/g2 (Group) ''
/g1/g2/g3 (Group) ''
/g1/g2/g3/g4 (Group) ''
/g1/g2/g3/g4/g5 (Group) ''

In [70]:
f.root.goblins.attrs

/goblins._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    TITLE := '',
    VERSION := '2.4']

In [71]:
f.root.goblins.attrs.myattr = "All the goblins on the block!"

In [72]:
f.root.goblins.attrs

/goblins._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    TITLE := '',
    VERSION := '2.4',
    myattr := 'All the goblins on the block!']

In [73]:
# h5dump and h5ls inspect the contents of the file
!h5ls -av atest3.h5

Opened "atest3.h5" with sec2 driver.
g1                       Group
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "GROUP"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "1.0"
    Location:  1:4240
    Links:     1
goblins                  Dataset {10/10}
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "ARRAY"
    Attribute: FLAVOR scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "numpy"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "2.4"
    Location:  1:1712
    Links:     1
    Storage:   80 logical bytes, 80 allocated bytes, 100.00% utilization
    Type:      n

In [74]:
# Flush so the new attribute is written to disk
f.flush()

In [75]:
!h5ls -av atest3.h5

Opened "atest3.h5" with sec2 driver.
g1                       Group
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "GROUP"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "1.0"
    Location:  1:4240
    Links:     1
goblins                  Dataset {10/10}
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "ARRAY"
    Attribute: FLAVOR scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "numpy"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "2.4"
    Attribute: myattr scalar
        Type:      29-byte null-terminated UTF-8 string
        Data:  "All the goblins on the block

In [76]:
f.root.goblins.attrs.grades = np.arange(10)
f.flush()

In [77]:
!h5ls -av atest3.h5

Opened "atest3.h5" with sec2 driver.
g1                       Group
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "GROUP"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "1.0"
    Location:  1:4240
    Links:     1
goblins                  Dataset {10/10}
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "ARRAY"
    Attribute: FLAVOR scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "numpy"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "2.4"
    Attribute: grades {10}
        Type:      native long
        Data:
            (0) 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
    Attrib

In [78]:
attrs = f.root.goblins.attrs

In [79]:
attrs

/goblins._v_attrs (AttributeSet), 6 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    TITLE := '',
    VERSION := '2.4',
    grades := array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
    myattr := 'All the goblins on the block!']

In [80]:
del attrs.grades
attrs

/goblins._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    TITLE := '',
    VERSION := '2.4',
    myattr := 'All the goblins on the block!']

In [81]:
attrs.candy = 12.3
attrs

/goblins._v_attrs (AttributeSet), 6 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    TITLE := '',
    VERSION := '2.4',
    candy := 12.300000000000001,
    myattr := 'All the goblins on the block!']

In [86]:
for n in f.walk_nodes(f.root.inschool, classname="Array"): print(n.attrs)

/inschool/ghosts._v_attrs (AttributeSet), 4 attributes
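
As a recap of the attribute cells above, here is a minimal sketch of a full round trip: write attributes, close the file, then reopen it and read them back. The scratch file name is an assumption. The path-based `set_node_attr` / `get_node_attr` calls and the `_v_attrnamesuser` list are alternatives to the attribute-style access used in the original cells.

```python
import os
import tempfile

import numpy as np
import tables as tb

path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")  # hypothetical file

with tb.open_file(path, "w") as f:
    goblins = f.create_array(f.root, "goblins", np.arange(10))

    # Attribute-style assignment, as in the cells above.
    goblins.attrs.myattr = "All the goblins on the block!"
    goblins.attrs.grades = np.arange(10)

    # The same kind of operation through the File object, addressed by path.
    f.set_node_attr("/goblins", "candy", 12.3)

with tb.open_file(path, "r") as f:
    attrs = f.root.goblins.attrs
    # User-defined attributes only (CLASS, TITLE, etc. are system attributes).
    user_names = sorted(attrs._v_attrnamesuser)
    candy = float(f.get_node_attr("/goblins", "candy"))
    grades = np.asarray(attrs.grades)
    has_candy = "candy" in attrs

print(user_names)
print(candy, grades, has_candy)
```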


In [87]:
f.close()