# Contents

- Main features of HDF5
- Overview: Packages for HDF5 with Python
  + h5py
  + pytables
  + pandas
  + xarray
  
Notebooks here: https://github.com/mcrot/meetup-python-hdf5



# HDF5: **H**ierarchical **D**ata **F**ormat Version **5**

*HDF5* is used as a term for three things:

- file format
- data model 
- software (libraries, tools, interfaces for many languages)

Features:

- storage of large amounts of (scientific) data
- data structures often used in science (arrays, images, ..)
- rich set of data types including composite and user-defined data types
- hierarchical structure
- includes metadata ($\rightarrow$ *self-describing*)
- supports compression
- binary, platform-independent
- easily shareable
- *old*, reliable technology
- open source (BSD-style)
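
A few of these features can be tried out with `h5py`. The following is only a minimal sketch: the file name, the dataset name `samples`, the record fields and the attribute text are made up for illustration, and gzip is just one of the available compression filters.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical composite (compound) record type: one row per measurement.
sample_dtype = np.dtype([("time", "f8"), ("sensor_id", "i4"), ("value", "f4")])
records = np.zeros(1000, dtype=sample_dtype)
records["time"] = np.arange(1000) * 0.1
records["sensor_id"] = 7

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "features_demo.h5")

with h5py.File(path, "w") as f:
    # Compression is requested per dataset and is transparent when reading.
    dset = f.create_dataset("samples", data=records,
                            compression="gzip", compression_opts=4)
    # Metadata is stored next to the data, which makes the file self-describing.
    dset.attrs["description"] = "illustrative sensor readings"

with h5py.File(path, "r") as f:
    dset = f["samples"]
    compression = dset.compression          # name of the compression filter
    field_names = dset.dtype.names          # fields of the compound type
    description = dset.attrs["description"]
    first_ids = dset[:3]["sensor_id"]       # read three records, pick one field

print(compression, field_names, description, first_ids)
```
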

# Data Model

HDF5 is not only a file format but also a data model.

https://support.hdfgroup.org/HDF5/Tutor/HDF5Intro.pdf

Two primary types of objects (among many others):
  
1. Groups

   - the HDF5 file itself is a group (the root group `/`)
   - can hold other groups, links to other elements, and
     *datasets*
   
2. Datasets

   - $n$-dimensional array of elements with metadata,
     each element of the dataset may be a complex object itself
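
The two object types map directly onto `h5py` objects. A minimal sketch, assuming `h5py` and NumPy are installed; the file name, the group path `experiment/run_001`, the dataset name and the attribute are hypothetical names chosen for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data_model_demo.h5")

with h5py.File(path, "w") as f:
    # The file object itself acts as the root group "/".
    run = f.create_group("experiment/run_001")  # parent group is created as well
    # A dataset: an n-dimensional array (here 3 x 4) ...
    temps = run.create_dataset("temperature",
                               data=np.arange(12.0).reshape(3, 4))
    # ... with metadata attached as attributes.
    temps.attrs["unit"] = "K"
    # Groups can also hold links to other elements.
    f["latest"] = h5py.SoftLink("/experiment/run_001")

with h5py.File(path, "r") as f:
    names = []
    f.visit(names.append)                       # walk the hierarchy
    shape = f["latest/temperature"].shape       # resolved through the soft link
    unit = f["latest/temperature"].attrs["unit"]

print(names, shape, unit)
```
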


# File Format
 
The specification of the file format is complex: more than 150 pages.

https://support.hdfgroup.org/HDF5/doc/Specs.html

In practice there is one implementation, written in C, by the HDF Group.
Users generally work with libraries rather than with the file format itself.
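
As an illustration of working through a library, the sketch below lets `h5py` (which wraps the C library) write a file and then peeks at the first 8 bytes, the HDF5 format signature. The file name and the dataset name `x` are made up, and the signature is only expected at offset 0 for files without a user block, as in this default case.

```python
import os
import tempfile

import h5py

# Version of the underlying HDF5 C library that h5py was built against.
print("HDF5 C library:", h5py.version.hdf5_version)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "format_demo.h5")

# The library takes care of all format details when writing.
with h5py.File(path, "w") as f:
    f.create_dataset("x", data=[1, 2, 3])

# Read the raw bytes only to look at the format signature.
with open(path, "rb") as raw:
    signature = raw.read(8)

is_hdf5 = h5py.is_hdf5(path)  # the same check, done by the library
print(signature, is_hdf5)
```
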


# HDF Group

- started in the 1980s
- 18 years at the University of Illinois National Center
  for Supercomputing Applications (NCSA)
- spun out from NCSA in July 2006
- non-profit organisation
- Intellectual property:
  + The HDF Group owns HDF4 and HDF5
  + HDF formats and libraries to remain open
  + BSD-style license 

## Mission Statement

<div class="alert alert-block alert-info">
   To ensure long-term accessibility of HDF
   data through sustainable development
   and support of HDF technologies. 
</div>

More:

 https://www.hdfgroup.org/the-hdf-group-mission/


## Users

List of HDF5 users, maintained by the HDF Group:

    https://support.hdfgroup.org/HDF5/users5.html
    

# What does *large* mean?

- in principle, no limits on file size (according to the *HDF Group*)
- the size of a single dataset is limited, but the limit is currently out of reach in practice:

  > The library currently allows up to 32 dimension dataspaces, and 
  > each dimension can have up to an unsigned 64-bit value. Each 
  > datatype can be somewhat arbitrarily large, since you can have 
  > array datatypes or compound datatypes that are very large.
  > Multiplying those two factors together gives a theoretical upper 
  > limit probably in the thousands of bits of for a dataset's size.  
  > However, the library currently only supports 64-bit offsets 
  > (although that is easily adjustable when files over 16 exabytes are
  > needed!), so that is the practical upper limit.
   
  https://support.hdfgroup.org/HDF5/faq/limits.html
  
  https://en.wikipedia.org/wiki/Exabyte
  
- heavy use in NASA's Earth Observing System (*EOS*) project:    
 
    http://hdfeos.org/
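
The numbers in the FAQ quote can be checked with plain integer arithmetic. This is only a back-of-the-envelope sketch: it counts elements and addressable bytes and ignores the size of the datatype. The "16 exabytes" in the quote corresponds to binary units (exbibytes).

```python
# Figures taken from the FAQ quote: up to 32 dimensions, each with an
# unsigned 64-bit extent, and 64-bit file offsets.
MAX_DIMS = 32
MAX_EXTENT = 2**64 - 1
OFFSET_BITS = 64

# Theoretical number of elements in one dataset, and how many bits are
# needed to write that number down ("thousands of bits").
max_elements = MAX_EXTENT ** MAX_DIMS
element_count_bits = max_elements.bit_length()

# Practical limit: the largest range addressable with 64-bit offsets.
max_addressable_bytes = 2 ** OFFSET_BITS
EIB = 2 ** 60                                  # one exbibyte
limit_eib = max_addressable_bytes // EIB       # binary units
limit_eb = max_addressable_bytes / 10**18      # decimal exabytes

print(f"element count needs {element_count_bits} bits")
print(f"64-bit offsets address {limit_eib} EiB (about {limit_eb:.2f} EB)")
```
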
    

  





# Criticism

A critical discussion of HDF5:

 http://cyrille.rossant.net/moving-away-hdf5/
 
Response by Konrad Hinsen:

 http://blog.khinsen.net/posts/2016/01/07/on-hdf5-and-the-future-of-data-management/
 
 