# What Is HDF5? What Is PyTables?

This tutorial is adapted from material in Francesc Alted's and Anthony Scopatz's PyTables tutorials:

www.pytables.org/docs/PyData2012-NYC.pdf
http://pyvideo.org/video/2705/hdf5-is-for-lovers-tutorial-part-1

HDF5 is a free, open source binary file format, together with the library that reads and writes it. It is a key tool for 'big science' due to its
scalability, multiple APIs, and efficiency with structured, dense arrays of
numbers.

The acronym stands for *Hierarchical Data Format*. It is hierarchical in the
sense that the data is structured much like a directory tree: HDF5 allows datasets to live in a nested tree structure. In effect, HDF5 is a file system within a file. (SQL, by contrast, has flat tables only.)

HDF5 is a binary, database-like format that can store many different datasets along with their metadata, with optimized I/O and the ability to query its contents.

- performs operations on-disk
- free software (BSD)
- has many APIs (C, C++, Fortran 90, Java, and *Python*)
- has no limit on number of data objects
- can represent various data objects as well as metadata

PyTables is not [alted2013]:

- Not a relational database replacement
- Not a distributed database
- Not extremely secure or safe (it’s more about speed!)
- Not a mere HDF5 wrapper

In [1]:
import numpy as np
import tables as tb

In [2]:
# Create a new file
f = tb.openFile("atest.h5", "w")

In [3]:
# Create a NumPy array, number of ghosts per sq. ft.
a = np.arange(100).reshape(20,5)

In [4]:
# Save the array
f.createArray(f.root, "ghosts", a)

/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

In [5]:
# See data
f.root.ghosts[:]

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44],
       [45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54],
       [55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64],
       [65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74],
       [75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84],
       [85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94],
       [95, 96, 97, 98, 99]])

In [6]:
# Select some data areas
ta = f.root.ghosts
ta[1:10:3,2:5]

array([[ 7,  8,  9],
       [22, 23, 24],
       [37, 38, 39]])

In [8]:
np.allclose(ta[1:10:3,2:5], a[1:10:3,2:5])

True

In [9]:
# Create another array, number of goblins per square foot
ta2 = f.createArray(f.root, "goblins", np.arange(10))

In [10]:
np.allclose(ta2, np.arange(10))

True

In [11]:
ls -l atest.h5

-rw-rw-r--  1 khuff  staff  0 Sep  4 14:26 atest.h5


In [12]:
# Flush data to the file (very important to keep all your data safe!)
f.flush()

In [13]:
ls -l atest.h5

-rw-rw-r--  1 khuff  staff  3024 Sep  4 14:28 atest.h5


In [14]:
f.close()  # close access to file

In [15]:
ta[:]

ClosedNodeError: the node object is closed
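
The `ClosedNodeError` above is what you get when a node outlives its file. Here is a minimal sketch of the same round trip, assuming a newer PyTables release (3.x), where the methods use snake_case names (`open_file`, `create_array`) and the file object can be used as a context manager. The scratch path is hypothetical and only there for illustration.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "atest_sketch.h5")

a = np.arange(100).reshape(20, 5)

# The with-block flushes and closes the file on exit, even on error.
with tb.open_file(path, "w") as f:
    f.create_array(f.root, "ghosts", a)

with tb.open_file(path, "r") as f:
    ghosts = f.root.ghosts
    # Slicing a node reads the selected region from disk into a NumPy array.
    subset = ghosts[1:10:3, 2:5]

# `subset` is an ordinary in-memory array, so it survives the close;
# the node `ghosts` itself was closed together with the file.
print(subset)
```

Copying what you need into NumPy arrays before the file closes is one way to avoid touching a closed node later.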

## There are different kinds of datasets, though

- Array
- CArray (chunked array)
- EArray (extendable array)
- VLArray (variable length array)
- Table (structured array w/ named fields)
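
Of these, the extendable array is probably the easiest to picture: one dimension is left open so that rows can be appended later. This is a rough sketch assuming the PyTables 3.x names (`create_earray`, `Int64Atom`); the node name `ghost_log` and the scratch path are made up for the example.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "earray_sketch.h5")

with tb.open_file(path, "w") as f:
    # A 0 in the shape marks the dimension that is allowed to grow.
    ea = f.create_earray(f.root, "ghost_log", atom=tb.Int64Atom(), shape=(0, 5))

    # Each append adds rows along the growable dimension.
    ea.append(np.arange(10, dtype=np.int64).reshape(2, 5))
    ea.append(np.arange(10, 25, dtype=np.int64).reshape(3, 5))

    final_shape = tuple(ea.shape)  # two rows plus three rows
    last_row = ea[-1]              # read the most recently appended row

print(final_shape, last_row)
```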

We just had an example of Arrays. Let's look at Tables. 

In [17]:
ta

<closed tables.array.Array at 0x104e64490>

In [24]:
f = tb.openFile("atest.h5", "r")

In [25]:
f

File(filename=atest.h5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/ghosts (Array(20, 5)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/goblins (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

In [18]:
# The description for the tabular data
class TabularData(tb.IsDescription):
    names = tb.StringCol(200)
    ages = tb.IntCol()
    heights = tb.FloatCol()

In [19]:
# Open a file and create the Table container
f = tb.openFile('atable.h5', 'w')
t = f.createTable(f.root, 'table', TabularData, 'table title', filters=tb.Filters(5, 'blosc'))

In [20]:
t

/table (Table(0,), shuffle, blosc(5)) 'table title'
  description := {
  "ages": Int32Col(shape=(), dflt=0, pos=0),
  "heights": Float64Col(shape=(), dflt=0.0, pos=1),
  "names": StringCol(itemsize=200, shape=(), dflt='', pos=2)}
  byteorder := 'little'
  chunkshape := (309,)

In [22]:
# Fill the table with 1 million rows
from time import time
t0 = time()
r = t.row
for i in xrange(1000*1000):
    r['names'] = str(i)
    r['ages'] = i + 1
    r['heights'] = i * (i + 1)
    r.append()
t.flush()
print "Insert time: %.3fs" % (time()-t0,) 

Insert time: 1.424s
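
For readers on Python 3 and a recent PyTables, the same fill loop can be sketched as follows. It assumes the 3.x snake_case API (`open_file`, `create_table`), uses a hypothetical scratch path, and writes far fewer rows than the session above so it runs instantly.

```python
import os
import tempfile

import tables as tb


class TabularData(tb.IsDescription):
    names = tb.StringCol(200)
    ages = tb.IntCol()
    heights = tb.FloatCol()


# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "atable_sketch.h5")
n_rows = 1000  # kept small here; the session above uses one million

with tb.open_file(path, "w") as f:
    t = f.create_table(f.root, "table", TabularData, "table title",
                       filters=tb.Filters(complevel=5, complib="blosc"))
    r = t.row
    for i in range(n_rows):
        r["names"] = str(i)        # stored as a fixed-width byte string
        r["ages"] = i + 1
        r["heights"] = i * (i + 1)
        r.append()                 # queue the row in the I/O buffer
    t.flush()                      # push buffered rows to the file
    stored_rows = t.nrows
    first_row = t[0]               # one record, copied into memory

print(stored_rows, first_row)
```

Under Python 3 the `names` field comes back as bytes (for example `b'0'`) rather than the plain strings shown in the recorded output.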


In [24]:
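# 2,000,000 rows: the fill cell above was evidently run twice in this session,
# which is also why each match appears twice in the query results below
t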

/table (Table(2000000,), shuffle, blosc(5)) 'table title'
  description := {
  "ages": Int32Col(shape=(), dflt=0, pos=0),
  "heights": Float64Col(shape=(), dflt=0.0, pos=1),
  "names": StringCol(itemsize=200, shape=(), dflt='', pos=2)}
  byteorder := 'little'
  chunkshape := (309,)

In [25]:
# Size on disk
!ls -lh atable.h5

-rw-rw-r--  1 khuff  staff   9.7M Nov 21 17:14 atable.h5


In [27]:
# Uncompressed size in MiB (number of rows x bytes per row)
np.prod(t.shape) * t.dtype.itemsize / 2**20.

404.35791015625
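
The figure above can be checked by hand with NumPy alone. This sketch rebuilds the row layout from the column types in the table description (a 4-byte integer, an 8-byte float, and a 200-byte string) and multiplies by the row count the table reports.

```python
import numpy as np

# Same fields the table reports: ages (int32), heights (float64), names (S200).
row_dtype = np.dtype([("ages", "<i4"), ("heights", "<f8"), ("names", "S200")])

n_rows = 2000000                     # row count shown in the table repr above
bytes_per_row = row_dtype.itemsize   # 4 + 8 + 200 bytes
size_mib = n_rows * bytes_per_row / 2**20

print(bytes_per_row, size_mib)
```

Compared with the 9.7M reported by `ls -lh`, the compressed file is roughly 40 times smaller than the raw data, largely because the 200-byte `names` field is mostly padding.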

In [28]:
# Do a query (regular)
%time [r['names'] for r in t if r['ages'] < 10]

CPU times: user 598 ms, sys: 15 ms, total: 613 ms
Wall time: 657 ms


['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8']

In [29]:
# Repeat the query, but using the in-kernel method
%time [r['names'] for r in t.where('ages < 10')]

CPU times: user 440 ms, sys: 49.2 ms, total: 489 ms
Wall time: 489 ms


['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8']

In [30]:
# Performing complex conditions (regular query)
%time [r['names'] for r in t if r['ages'] < 10 and r['heights'] < 10]

CPU times: user 674 ms, sys: 15.6 ms, total: 689 ms
Wall time: 690 ms


['0', '1', '2', '0', '1', '2']

In [31]:
# Complex, in-kernel queries
%time [r['names'] for r in t.where('(ages < 10) & (heights < 10)')]

CPU times: user 439 ms, sys: 54.7 ms, total: 494 ms
Wall time: 492 ms


['0', '1', '2', '0', '1', '2']
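
The difference between the two query styles is where the condition is evaluated: the regular query tests each row in Python, while `where` hands the condition string to PyTables. Here is a small sketch of both on Python 3, assuming the PyTables 3.x API; `read_where` is a related call that returns all matches as one structured array. The table is built from a NumPy structured array with a hypothetical 1000 rows.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Build the rows in memory first (same fields as the table above).
row_dtype = np.dtype([("ages", "<i4"), ("heights", "<f8"), ("names", "S200")])
i = np.arange(1000)
rows = np.zeros(i.size, dtype=row_dtype)
rows["ages"] = i + 1
rows["heights"] = i * (i + 1)
rows["names"] = [str(k).encode() for k in i]

# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "query_sketch.h5")

with tb.open_file(path, "w") as f:
    t = f.create_table(f.root, "table", description=rows.dtype)
    t.append(rows)
    t.flush()

    # Regular query: every row is tested by the Python interpreter.
    regular = [r["names"] for r in t if r["ages"] < 10 and r["heights"] < 10]

    # In-kernel query: the condition string is evaluated inside PyTables.
    in_kernel = [r["names"] for r in t.where("(ages < 10) & (heights < 10)")]

    # Same condition, but returning the matching rows as a structured array.
    matched = t.read_where("(ages < 10) & (heights < 10)")

print(regular, in_kernel, matched["ages"])
```

Note that the condition string uses `&` and parentheses rather than Python's `and`, just as in the session above.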

In [32]:
# Read the whole table from disk into an in-memory structured array
sa = t[:]
sa

array([(1, 0.0, '0'), (2, 2.0, '1'), (3, 6.0, '2'), ...,
       (999998, 999995000006.0, '999997'),
       (999999, 999997000002.0, '999998'),
       (1000000, 999999000000.0, '999999')], 
      dtype=[('ages', '<i4'), ('heights', '<f8'), ('names', 'S200')])

In [None]:
# Perform the same query in memory with NumPy
%time sa[((sa['ages'] < 10) & (sa['heights'] < 10))]['names']

In [None]:
# Create an index for the on-disk table
%time t.cols.ages.createCSIndex()

In [None]:
# Repeat the complex query (indexed)
%time [r['names'] for r in t.where('(ages < 10) & (heights < 10)')]

In [None]:
f.close()
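
The indexing step can be sketched the same way. This assumes the PyTables 3.x spelling `create_csindex` (the session above uses the older `createCSIndex`) and the `is_indexed` attribute on a column; the small table and scratch path are made up for the example, so it says nothing about timings.

```python
import os
import tempfile

import numpy as np
import tables as tb

row_dtype = np.dtype([("ages", "<i4"), ("heights", "<f8"), ("names", "S200")])
i = np.arange(1000)
rows = np.zeros(i.size, dtype=row_dtype)
rows["ages"] = i + 1
rows["heights"] = i * (i + 1)
rows["names"] = [str(k).encode() for k in i]

# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "index_sketch.h5")

with tb.open_file(path, "w") as f:
    t = f.create_table(f.root, "table", description=rows.dtype)
    t.append(rows)
    t.flush()

    indexed_before = bool(t.cols.ages.is_indexed)
    t.cols.ages.create_csindex()   # build a completely sorted index on 'ages'
    indexed_after = bool(t.cols.ages.is_indexed)

    # The query text is unchanged; PyTables can now use the index for 'ages'.
    names = [r["names"] for r in t.where("(ages < 10) & (heights < 10)")]

print(indexed_before, indexed_after, names)
```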

## What's all this about hierarchy?

In [None]:
!cp atest.h5 atest2.h5

In [None]:
f
f = tb.openFile("atest2.h5", "a")

In [None]:
f.createGroup(f.root, 'inschool', 'The kids from school')

In [None]:
f

In [None]:
f.moveNode(f.root.ghosts, f.root.inschool)

In [None]:
f

In [None]:
f.createGroup('/g1/g2/g3/g4', 'g5', createparents=True)

In [None]:
f

In [None]:
f.createArray(f.root.g1.g2.g3.g4.g5, 'goblins', np.arange(10))

In [None]:
f

In [None]:
f.root.g1.g2.g3.g4.g5.goblins[:]

In [None]:
f.removeNode(f.root.g1.g2.g3.g4.g5.goblins)

In [None]:
f

In [None]:
for n in f: print n

In [None]:
for n in f.walkNodes(): print n

In [None]:
for n in f.walkNodes(f.root.inschool): print n

In [None]:
for n in f.walkNodes(f.root.inschool, classname="Array"): print n[:2]

In [None]:
f.close()
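
Since the cells above were recorded without their output, here is a compact sketch of the same group operations that collects the resulting paths. It assumes the PyTables 3.x names (`create_group`, `move_node`, `walk_nodes`) and a hypothetical scratch file instead of `atest2.h5`.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "hierarchy_sketch.h5")

with tb.open_file(path, "w") as f:
    f.create_array(f.root, "ghosts", np.arange(100).reshape(20, 5))
    f.create_array(f.root, "goblins", np.arange(10))

    # Groups play the role of directories; leaves can be moved between them.
    f.create_group(f.root, "inschool", "The kids from school")
    f.move_node(f.root.ghosts, f.root.inschool)

    # Intermediate groups are created on demand with createparents=True.
    f.create_group("/g1/g2/g3/g4", "g5", createparents=True)

    # Walk the whole tree, then only the Array leaves.
    paths = [n._v_pathname for n in f.walk_nodes("/")]
    arrays = sorted(n._v_pathname for n in f.walk_nodes("/", classname="Array"))

print(paths)
print(arrays)
```

Every node has a path such as `/inschool/ghosts`, which is what makes the "file system within a file" comparison apt.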

## Let's talk briefly about metadata

In [None]:
!cp atest2.h5 atest3.h5

In [None]:
import tables as tb
f = tb.openFile("atest3.h5", "a")

In [None]:
f

In [None]:
f.root.goblins.attrs

In [None]:
f.root.goblins.attrs.myattr = "All the goblins on the block!"

In [None]:
f.root.goblins.attrs

In [None]:
# h5dump and h5ls inspect the contents of the file
!h5ls -av atest3.h5

In [None]:
# Flush so the new attribute reaches the file on disk
f.flush()

In [None]:
!h5ls -av atest3.h5

In [None]:
f.root.goblins.attrs.grades = np.arange(10)
f.flush()

In [None]:
!h5ls -av atest3.h5

In [None]:
attrs = f.root.goblins.attrs

In [None]:
attrs

In [None]:
del attrs.grades
attrs

In [None]:
attrs.candy = 12.3
attrs

In [None]:
for n in f.walkNodes(f.root.inschool, classname="Array"): print repr(n.attrs)

In [None]:
f.close()
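
The attribute cells above can be condensed into one sketch. It assumes PyTables 3.x and a hypothetical scratch file, and it relies on `_v_attrnamesuser`, the attribute set's list of user-defined attribute names, to show what remains after the deletion.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "attrs_sketch.h5")

with tb.open_file(path, "w") as f:
    goblins = f.create_array(f.root, "goblins", np.arange(10))
    attrs = goblins.attrs

    # Attributes are set like ordinary Python attributes.
    attrs.myattr = "All the goblins on the block!"
    attrs.grades = np.arange(10)   # arrays can be stored as metadata too
    attrs.candy = 12.3

    del attrs.grades               # and removed again with del
    user_attrs = sorted(attrs._v_attrnamesuser)

# Reopen read-only to confirm the metadata was written to the file.
with tb.open_file(path, "r") as f:
    myattr = f.root.goblins.attrs.myattr
    candy = f.root.goblins.attrs.candy

print(user_attrs, myattr, candy)
```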