# HDF5 files with Pandas


In [1]:
# import python packages here...

Documentation:
* https://pandas.pydata.org/pandas-docs/stable/io.html
* https://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook-hdf

In [2]:
import numpy as np
import pandas as pd

In [3]:
FILE = "test.h5"

In [4]:
!rm test.h5

rm: test.h5: No such file or directory


## Read/write HDF files using `HDFStore` objects API

`HDFStore` is a dict-like object that reads and writes pandas objects in the high-performance HDF5 format, using the PyTables library.

Documentation: https://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables
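
To make "dict-like" a little more concrete, here is a minimal sketch of a few dictionary-style operations on a store. The scratch file `dict_demo.h5` and the variable names are made up for this example, and it assumes PyTables is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical scratch file, kept separate from the notebook's test.h5.
path = os.path.join(tempfile.mkdtemp(), "dict_demo.h5")

with pd.HDFStore(path) as demo_store:
    demo_store["numbers"] = pd.Series([1.0, 2.0, 3.0])  # like d[key] = value
    has_numbers = "numbers" in demo_store               # like key in d
    n_objects = len(demo_store)                         # like len(d)
    del demo_store["numbers"]                           # like del d[key]
    keys_after_delete = demo_store.keys()

print(has_numbers, n_objects, keys_after_delete)
```

The `with` block closes the file automatically, so no explicit `close()` call is needed here.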

### Make data to save/load

In [5]:
s = pd.Series([3.14, 2.72, np.nan], index=['pi', 'e', 'nan'])
s

pi     3.14
e      2.72
nan     NaN
dtype: float64

In [6]:
df = pd.DataFrame(np.array([[3, 1, 4],[2, 7, np.nan]]).T,
                  index=pd.date_range('1/1/2000', periods=3),
                  columns=['A', 'B'])
df

              A    B
2000-01-01  3.0  2.0
2000-01-02  1.0  7.0
2000-01-03  4.0  NaN


### Write to HDF5 file

In [7]:
store = pd.HDFStore(FILE)

Objects can be written to the file just like adding key-value pairs to a dict:

In [8]:
store['df'] = df      # the equivalent of: store.put('df', df)
store['series'] = s   # the equivalent of: store.put('series', s)

Close the store once everything has been written:

In [9]:
store.close()

In [10]:
del df
del s
del store

### Read from HDF5 file

In [11]:
with pd.HDFStore(FILE) as store:
    print(store.keys())
    df = store['df']      # the equivalent of: store.get('df')
    s = store['series']   # the equivalent of: store.get('series')

['/df', '/series']


In [12]:
df

              A    B
2000-01-01  3.0  2.0
2000-01-02  1.0  7.0
2000-01-03  4.0  NaN


In [13]:
s

pi     3.14
e      2.72
NaN     NaN
dtype: float64

## Read/Write HDF5 files using the `to_hdf()`/`read_hdf()` top-level API

`HDFStore` also supports a top-level API: `read_hdf` for reading and `to_hdf` for writing.

Documentation: https://pandas.pydata.org/pandas-docs/stable/io.html#id2

### Make data to save/load

In [14]:
!rm test.h5

In [15]:
s = pd.Series([3.14, 2.72, np.nan], index=['pi', 'e', 'nan'])
s

pi     3.14
e      2.72
nan     NaN
dtype: float64

In [16]:
df = pd.DataFrame(np.array([[3, 1, 4],[2, 7, np.nan]]).T,
                  index=pd.date_range('1/1/2000', periods=3),
                  columns=['A', 'B'])
df

              A    B
2000-01-01  3.0  2.0
2000-01-02  1.0  7.0
2000-01-03  4.0  NaN


### Write a Series and a DataFrame to an HDF5 file

In [23]:
s.to_hdf(FILE, key='series')
df.to_hdf(FILE, key='df')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->index] [items->None]

  f(store)


Useful parameters:

    format : 'fixed(f)|table(t)', default is 'fixed'
        fixed(f) : Fixed format
                   Fast writing/reading. Not-appendable, nor searchable
        table(t) : Table format
                   Write as a PyTables Table structure which may perform
                   worse but allow more flexible operations like searching
                   / selecting subsets of the data
    append : boolean, default False
        For Table formats, append the input data to the existing
    data_columns :  list of columns, or True, default None
        List of columns to create as indexed data columns for on-disk
        queries, or True to use all columns. By default only the axes
        of the object are indexed. See `here
        <http://pandas.pydata.org/pandas-docs/stable/io.html#query-via-data-columns>`__.
        Applicable only to format='table'.
    complevel : int, 0-9, default None
        Specifies a compression level for data.
        A value of 0 disables compression.
    complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'
        Specifies the compression library to be used.
        As of v0.20.2 these additional compressors for Blosc are supported
        (default if no compressor specified: 'blosc:blosclz'):
        {'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy',
        'blosc:zlib', 'blosc:zstd'}.
        Specifying a compression library which is not available issues
        a ValueError.
    fletcher32 : bool, default False
        If applying compression use the fletcher32 checksum
    dropna : boolean, default False.
        If true, ALL nan rows will not be written to store.
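
As a rough illustration of the `format`, `append` and `data_columns` parameters listed above, the sketch below writes a small frame in table format, appends one more row, and then reads back only the rows that match a condition, using the `where` argument of `read_hdf`. The scratch file `table_demo.h5` and the variable names are made up for this example, and it assumes PyTables is installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical scratch file, kept separate from the notebook's test.h5.
path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")

df_demo = pd.DataFrame({"A": [3.0, 1.0, 4.0], "B": [2.0, 7.0, np.nan]},
                       index=pd.date_range("2000-01-01", periods=3))

# format='table' makes the stored object appendable and searchable;
# data_columns=True indexes every column for on-disk queries.
df_demo.to_hdf(path, key="df", format="table", data_columns=True)

# append=True adds rows to the existing table instead of replacing it.
more = pd.DataFrame({"A": [1.0], "B": [5.0]},
                    index=pd.date_range("2000-01-04", periods=1))
more.to_hdf(path, key="df", format="table", data_columns=True, append=True)

# Read back only the rows where column A is greater than 2.
subset = pd.read_hdf(path, key="df", where="A > 2")
n_rows_total = len(pd.read_hdf(path, key="df"))
print(subset)
print(n_rows_total)
```

With the default `format='fixed'`, neither the append nor the `where` query would be available, which is the trade-off the parameter list describes.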

In [18]:
del df
del s
del store

### Read a Series and a DataFrame from an HDF5 file

In [19]:
s = pd.read_hdf(FILE, key='series')  # the `key` param can be omitted if the HDF file contains a single pandas object
s

pi     3.14
e      2.72
NaN     NaN
dtype: float64

In [20]:
df = pd.read_hdf(FILE, key='df')  # the `key` param can be omitted if the HDF file contains a single pandas object
df

              A    B
2000-01-01  3.0  2.0
2000-01-02  1.0  7.0
2000-01-03  4.0  NaN
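
The comments in the two cells above mention that `key` can be omitted when the file holds a single pandas object. A minimal sketch of that case, using a made-up scratch file `single_object.h5` and assuming PyTables is installed:

```python
import os
import tempfile

import pandas as pd

# Hypothetical scratch file, kept separate from the notebook's test.h5.
path = os.path.join(tempfile.mkdtemp(), "single_object.h5")

pd.Series([3.14, 2.72]).to_hdf(path, key="series")

# Only one object is stored, so read_hdf can find it without a key.
restored = pd.read_hdf(path)
print(restored)
```

Since `test.h5` in this notebook holds both `series` and `df`, the key has to be given explicitly there.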


In [21]:
!rm test.h5

### Read/Write a compressed HDF5 file

In [50]:
a = np.random.randint(10, size=(1000, 1000))
df = pd.DataFrame(a)
del a

In [51]:
df.to_hdf(FILE, key='df')

In [52]:
!ls -lh test.h5

-rw-r--r--  1 jdecock  staff    16M Feb 22 00:03 test.h5


In [53]:
df.to_hdf(FILE,
          key='df',
          complevel=9,     # 0-9, default None, Specifies a compression level for data. 0 = disables compression
          complib='zlib') # 'zlib', 'lzo', 'bzip2', 'blosc', 'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd'. default 'zlib'


In [54]:
!ls -lh test.h5

-rw-r--r--  1 jdecock  staff   8.9M Feb 22 00:03 test.h5


In [58]:
df = pd.read_hdf(FILE, key='df')  # the `key` param can be omitted if the HDF file contains a single pandas object
df.memory_usage().sum()

8008000
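
The `ls -lh` comparison above can also be done from Python. The sketch below writes the same kind of random frame twice, once without compression and once with `complevel=9` and `complib='zlib'`, and compares the file sizes with `os.path.getsize`. The scratch file names are made up for this example, each variant goes into its own new file (`mode='w'`) so the sizes are directly comparable, and PyTables is assumed to be installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")  # hypothetical file names
zlib_path = os.path.join(tmp_dir, "zlib.h5")

data = pd.DataFrame(np.random.randint(10, size=(1000, 1000)))

# mode='w' starts from an empty file for each variant.
data.to_hdf(plain_path, key="df", mode="w")
data.to_hdf(zlib_path, key="df", mode="w", complevel=9, complib="zlib")

plain_size = os.path.getsize(plain_path)
compressed_size = os.path.getsize(zlib_path)
print(f"uncompressed: {plain_size / 1e6:.1f} MB")
print(f"zlib level 9: {compressed_size / 1e6:.1f} MB")
```

The exact numbers depend on the data and on the library versions, so they are not reproduced here.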

In [21]:
!rm test.h5