# Data IO (input/output)


# Introduction

ESRF data comes in (too many) different formats:

* Specfile
* EDF
* HDF5

And specific detector formats:

* MarCCD
* Pilatus CBF
* Dectris Eiger
* …


HDF5 is now the standard ESRF data format, so we will focus only on it today.

# HDF5

![hdf_group](images/HDF_logo.png "HDF group")

## What is HDF5?

[HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) (Hierarchical Data Format, version 5) is a file format for structuring and storing large volumes of complex data.

## Why HDF5?

* Hierarchical collection of data (like directories and files, with UNIX-like paths)
* High-performance (binary)
* Portable file format (standard exchange format for heterogeneous data)
* Self-describing extensible types, rich metadata
* Supports data compression
* Free (and open source)
* Adopted by a large number of institutes (NASA, LIGO, ...)
* Adopted by most synchrotrons (ESRF, SOLEIL, DESY, ...)
* Ensures [forward and backward compatibility](https://support.hdfgroup.org/HDF5/doc/ADGuide/CompatFormat180.html)

**Data can be almost anything: images, tables, graphs, documents**

## HDF5 description

The container is mostly structured with:

* **File**: the root of the container
* **Group**: a grouping structure containing groups or datasets
* **Dataset**: a multidimensional array of data elements
* And other features (links, attributes, datatypes)

![hdf5_class_diag](images/hdf5_model.png "hdf5 class diagram")
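
The building blocks listed above can be sketched with h5py. This is a minimal illustration, not a prescribed layout: the file name, the group name `measurement`, the dataset name `image` and the `units` attribute are all invented for the example.

```python
import os
import tempfile

import h5py
import numpy

# Hypothetical file name, created in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "structure_demo.h5")

with h5py.File(path, "w") as h5file:
    # Group: behaves like a directory
    group = h5file.create_group("measurement")
    # Dataset: a multidimensional array stored inside the group
    dataset = group.create_dataset("image", data=numpy.zeros((4, 3)))
    # Attribute: small metadata attached to a group or a dataset
    dataset.attrs["units"] = "counts"

names = []
with h5py.File(path, "r") as h5file:
    # visit() walks the hierarchy and calls the function with each name
    h5file.visit(names.append)
    shape = h5file["/measurement/image"].shape
    units = h5file["/measurement/image"].attrs["units"]

print(names)
print(shape, units)
```

The file object itself acts as the root group, which is why `h5file.create_group` and `h5file["/measurement/image"]` work directly on it.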


## HDF5 example

Here is an example of a file generated by [pyFAI](https://github.com/silx-kit/pyFAI):

![hdf5_example](images/hdf5_example.png "hdf5 example")

## Useful tools for HDF5

* `h5ls`, `h5dump`, `hdfview`

```bash
$ h5ls -r my_first_one.h5
/                        Group
/data1                   Dataset {100, 100}
/group1                  Group
/group1/data2            Dataset {100, 100}
```

* `silx view`

```bash
$ pip install silx
$ silx view my_file.h5
```

* `h5glance`: File browser for jupyter

==> The HDF Group provides a web page listing more tools: https://support.hdfgroup.org/HDF5/doc/RM/Tools.html

## h5py

![h5py book](images/h5py.gif "h5py book")

[h5py](https://www.h5py.org/) is the Python binding for HDF5. It was originally written by [Andrew Collette](http://shop.oreilly.com/product/0636920030249.do).

Over time, the project has worked more and more closely with the HDF Group.

HDF5 maps naturally onto Python: files and groups are accessed like dictionaries.
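
As a rough sketch of this dictionary-like behaviour, the example below writes a small file and then browses it with the usual `dict` operations. The file name and the `scan/counts` path are made up for the illustration.

```python
import os
import tempfile

import h5py
import numpy

# Hypothetical file name, created in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "dict_demo.h5")

with h5py.File(path, "w") as h5file:
    # Assigning an array to a key creates a dataset (and the missing group)
    h5file["scan/counts"] = numpy.arange(5)

with h5py.File(path, "r") as h5file:
    top_level = list(h5file.keys())      # names at the root
    has_scan = "scan" in h5file          # membership test, like a dict
    scan = h5file["scan"]                # indexing returns a Group
    children = [name for name in scan]   # iterating yields child names
    missing = h5file.get("not_there")    # get() returns None if absent
    counts = scan["counts"][()]          # [()] reads the whole dataset

print(top_level, has_scan, children, missing, counts)
```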

### How to read an hdf5 file with h5py

First, open a file using a [File object](http://docs.h5py.org/en/stable/high/file.html):
```
h5py.File('myfile.hdf5', opening_mode)
```

[opening modes](http://docs.h5py.org/en/stable/high/file.html#opening-creating-files) are:

| Mode    | Description                                        |
|---------|----------------------------------------------------|
| r       | Read-only, file must exist (default since h5py 3.0) |
| r+      | Read/write, file must exist                        |
| w       | Create file, truncate if exists                    |
| w- or x | Create file, fail if exists                        |
| a       | Read/write if exists, create otherwise             |
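
The effect of the opening modes can be checked with a short experiment. This is only a sketch on a throwaway file (the file name and dataset names are invented); the mode is always passed explicitly so the behaviour does not depend on the library default.

```python
import os
import tempfile

import h5py

# Hypothetical file name, created in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

# "w" creates the file (or truncates an existing one)
with h5py.File(path, "w") as h5file:
    h5file["value"] = 1

# "x" (or "w-") refuses to overwrite an existing file
try:
    with h5py.File(path, "x"):
        pass
    exclusive_failed = False
except OSError:
    exclusive_failed = True

# "r" opens the existing file for reading only
with h5py.File(path, "r") as h5file:
    value = int(h5file["value"][()])

# "a" opens read/write and keeps the existing content
with h5py.File(path, "a") as h5file:
    h5file["other"] = 2
    names = sorted(h5file.keys())

print(exclusive_failed, value, names)
```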
   

# Accessing ESRF data

## Library


h5py is already installed on most ESRF computers.

It is cross-platform (Windows, Linux, macOS).

Its source code is also available (BSD license):

* https://github.com/h5py/h5py

## Read a HDF5 file

```python
import h5py

h5file = h5py.File('data/test.h5', "r")

# print available names at the first level
print("First children:", h5file['/'].keys())
```

```python
# Get a dataset from a sub group
dataset = h5file['/diff_map_0004/data/map']

# Here we only read metadata from the dataset
print("Dataset:", dataset.shape, dataset.size, dataset.dtype)
```

```python
# Remember to close the file
h5file.close()
```

```python
# Or better, use a context manager
# The file is closed for you
with h5py.File('data/test.h5', "r") as h5file:
    print(h5file['/'].keys())
```

## HDF5 mimics numpy-array

An h5py dataset can be indexed and sliced like a NumPy array, but the data is read from the file only when it is needed.

```python
import h5py
h5file = h5py.File('data/test.h5', "r")
dataset = h5file['/diff_map_0004/data/map']
```

```python
# Read and apply an operation
print(dataset[5, 5, 0:5])
print(2 * dataset[0, 5, 0:5])
```

```python
# copy the data and store it as a numpy-array
b = dataset[...]
b[0, 0, 0:5] = 0
print(dataset[0, 0, 0:5])
print(b[0, 0, 0:5])
```

```python
h5file.close()
```

## Write a HDF5 file

* http://docs.h5py.org/en/stable/high/group.html
* http://docs.h5py.org/en/stable/high/dataset.html

```python
import numpy
import h5py

# Create a 2D array
data = numpy.arange(100 * 100).reshape(100, 100)

# Notice the mode='w', as 'write'
with h5py.File('my_first_one.h5', mode='w') as h5file:

    # write data into a dataset from the root
    h5file['/data1'] = data

    # write data into a dataset from group1
    # (the group is created automatically)
    h5file['/group1/data2'] = data

    # Or with a functional API
    g = h5file.create_group("/group2")
    g.create_dataset("data3", data=data)
```
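
The functional API also accepts storage options. As a hedged sketch, the example below writes the same kind of array with gzip compression and attaches an attribute, then reads it back. The file name, the dataset name `data_gzip` and the `description` attribute are invented for the illustration.

```python
import os
import tempfile

import h5py
import numpy

# Hypothetical file name, created in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "compressed_demo.h5")
data = numpy.arange(100 * 100).reshape(100, 100)

with h5py.File(path, "w") as h5file:
    # gzip compression implies chunked storage; h5py picks a chunk shape
    dataset = h5file.create_dataset("data_gzip", data=data, compression="gzip")
    # Metadata travels with the dataset as attributes
    dataset.attrs["description"] = "ramp image"

with h5py.File(path, "r") as h5file:
    dataset = h5file["data_gzip"]
    compression = dataset.compression
    description = dataset.attrs["description"]
    # Decompression is transparent when reading
    roundtrip_ok = bool(numpy.array_equal(dataset[...], data))

print(compression, description, roundtrip_ok)
```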

# Exercise: Flat field correction

Flat-field correction is a technique used to improve quality in digital imaging.

The goal is to normalize images and remove artifacts caused by variations in the pixel-to-pixel sensitivity of the detector and/or by distortions in the optical path. (see https://en.wikipedia.org/wiki/Flat-field_correction)

$$ normalized = \frac{raw - dark}{flat - dark} $$

* `normalized`: Image after flat field correction
* `raw`: Raw image. It is acquired with the sample.
* `flat`: Flat field image. It is the response given out by the detector for a uniform input signal. This image is acquired without the sample.
* `dark`: Also named `background` or `dark current`. It is the response given out by the detector when there is no signal. This image is acquired without the beam.
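
The formula above translates almost directly into NumPy. The function below is one possible definition, given as a sketch: the name `flatfield_correction` matches the call used in the exercise skeleton, but the body and the small synthetic images are assumptions made for illustration.

```python
import numpy


def flatfield_correction(raw, flat, dark):
    """Return (raw - dark) / (flat - dark), computed in floating point."""
    # Convert to float first: detector images are often unsigned integers,
    # and subtracting those directly would wrap around instead of going negative
    raw = numpy.asarray(raw, dtype=numpy.float64)
    flat = numpy.asarray(flat, dtype=numpy.float64)
    dark = numpy.asarray(dark, dtype=numpy.float64)
    return (raw - dark) / (flat - dark)


# Synthetic 2x2 images, invented for illustration only
raw = numpy.array([[60, 110], [35, 10]], dtype=numpy.uint16)
flat = numpy.array([[110, 210], [60, 110]], dtype=numpy.uint16)
dark = numpy.array([[10, 10], [10, 10]], dtype=numpy.uint16)

normalized = flatfield_correction(raw, flat, dark)
print(normalized)
```

Pixels where `flat` equals `dark` lead to a division by zero; this sketch does not handle that case, so real data may need masking of such pixels.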

# Exercise 1

1. Browse the file ``data/ID16B_diatomee.h5``
2. Read a single raw image, a flat field image and a dark image from this file
3. Apply the flat field correction
4. Save the result into a new HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise1.py](./solutions/exercise1.py)

```python
from h5glance import H5Glance
H5Glance("data/ID16B_diatomee.h5")
```

```python
import h5py

# Read the data

...

# Compute the result

normalized = flatfield_correction(raw, flat, dark)

# Save the result

...
```


# Exercise 2

1. Apply the flat field correction to all raw data available (use the same flat and dark for all the images)
2. Save each result into different datasets of the same HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise2.py](./solutions/exercise2.py)


# Exercise 3

From the previous exercise, we can see that the flat field correction was not very good for the last images.

Another flat field was acquired at the end of the acquisition.

We could use this information to compute a flat field closer to the image we want to normalize. It can be done with a linear interpolation of the flat images by using the name of the image as the interpolation factor (which varies between 0 and 500 in this case).
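
Written as a formula, and assuming the two flat fields are named after the indices 0 and 500 while $i$ is the index taken from the name of the raw image, the interpolated flat field would be:

$$ flat_i = flat_{0} + \frac{i}{500} \, (flat_{500} - flat_{0}) $$

With this definition, $i = 0$ gives back the first flat field and $i = 500$ gives back the last one.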

1. For each raw image, compute the corresponding flat field using linear interpolation (between `flatfield/0000` and `flatfield/0500`)
2. Save each result into different datasets in a single HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise3.py](./solutions/exercise3.py)

# Conclusion

Recommended libraries, by use case and file format:

| Formats              | Read            | Write |
|----------------------|-----------------|-------|
| HDF5                 | silx/h5py       | h5py  |
| Specfile             | silx            |       |
| EDF                  | silx/fabio      | fabio |
| Other raster formats | silx/fabio      | fabio |

# Utils: Conversion tools


- `fabio-convert`: To convert raster images
- `silx convert`: To convert EDF or SPEC files to HDF5

# NeXus

[NeXus](https://www.nexusformat.org/) is a data format for neutron, x-ray, and muon science.

It defines a common way to represent datasets.
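
In an HDF5 file, these conventions mostly take the form of attributes on groups. The sketch below writes a minimal NeXus-style layout with h5py, using the `NX_class`, `default` and `signal` attributes. The file name and the group and dataset names (`entry`, `plot`, `x`, `intensity`) are invented, and a real application definition would require more fields.

```python
import os
import tempfile

import h5py
import numpy

# Hypothetical file name, created in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "nexus_demo.h5")

with h5py.File(path, "w") as h5file:
    # The root points to the default entry
    h5file.attrs["NX_class"] = "NXroot"
    h5file.attrs["default"] = "entry"

    # The entry points to the default plottable group
    entry = h5file.create_group("entry")
    entry.attrs["NX_class"] = "NXentry"
    entry.attrs["default"] = "plot"

    # NXdata names the dataset to plot through its "signal" attribute
    nxdata = entry.create_group("plot")
    nxdata.attrs["NX_class"] = "NXdata"
    nxdata.attrs["signal"] = "intensity"
    nxdata["x"] = numpy.linspace(0.0, 1.0, 11)
    nxdata["intensity"] = numpy.linspace(0.0, 1.0, 11) ** 2

with h5py.File(path, "r") as h5file:
    # Follow the "default" attributes down to the signal dataset
    entry = h5file[h5file.attrs["default"]]
    nxdata = entry[entry.attrs["default"]]
    signal_name = nxdata.attrs["signal"]
    signal = nxdata[signal_name][()]

print(signal_name, signal.shape)
```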