# Data IO (input/output)


# Introduction

ESRF data comes in (too many) different formats:

* Specfile
* EDF
* HDF5

And specific detector formats:

* MarCCD
* Pilatus CBF
* Dectris Eiger
* …


HDF5 is now the standard ESRF data format so we will only focus on it today.

Methods for accessing other file formats are described in the [io_spec_edf.ipynb](io_spec_edf.ipynb) notebook.

# HDF5

![hdf_group](images/HDF_logo.png "HDF group")

## What is hdf5 ?

[HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) (for Hierarchical Data Format) is a file format to structure and store complex, high-volume data.

## Why hdf5 ?

* Hierarchical collection of data (directory and file, UNIX-like path)
* High-performance (binary)
* Portable file format (Standard exchange format for heterogeneous data)
* Self-describing extensible types, rich metadata
* Supports data compression
* Free (and open source)
* Adopted by a large number of institutes (NASA, LIGO, ...)
* Adopted by most of the synchrotrons (ESRF, SOLEIL, DESY...)

**Data can be mostly anything: image, table, graphs, documents**

## HDF5 description

The container is mostly structured with:

* **File**: the root of the container
* **Group**: a grouping structure containing groups or datasets
* **Dataset**: a multidimensional array of data elements
* And other features (links, attributes, datatypes)
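The elements listed above can be sketched with h5py as follows. This is only an illustration: the file name, the node names (`measurement`, `image`) and the `units` attribute are hypothetical and do not come from any ESRF file.

```python
import os
import tempfile

import h5py
import numpy

# Hypothetical file and node names, chosen only for this illustration
tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "structure_demo.h5")

with h5py.File(filename, "w") as h5file:
    # Group: a container for other groups or datasets
    group = h5file.create_group("measurement")
    # Dataset: a multidimensional array stored in the file
    dataset = group.create_dataset("image", data=numpy.zeros((4, 5)))
    # Attribute: a small piece of metadata attached to a node
    dataset.attrs["units"] = "counts"

names = []
with h5py.File(filename, "r") as h5file:
    # visit() calls the function with the path of every node below the root
    h5file.visit(names.append)
    units = h5file["measurement/image"].attrs["units"]
    shape = h5file["/measurement/image"].shape

print(names, units, shape)
```

Nodes are addressed with UNIX-like paths, either relative to a group or absolute from the root (`/`).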

![hdf5_class_diag](images/hdf5_model.png "hdf5 class diagram")


## HDF5 example

Here is an example of the file generated by [pyFAI](https://github.com/silx-kit/pyFAI)

![hdf5_example](images/hdf5_example.png "hdf5 example")

## Useful tools for HDF5

* `h5ls`, `h5dump`, `hdfview`
```bash
$ h5ls -r my_first_one.h5
/                        Group
/data1                   Dataset {100, 100}
/group1                  Group
/group1/data2            Dataset {100, 100}
```

* `silx view`

```bash
$ pip install silx
$ silx view my_file.h5
```

* `h5glance`: File browser for jupyter

* `h5py`: Access HDF5 files from python

==> The HDF group provides a web page with more tools https://support.hdfgroup.org/HDF5/doc/RM/Tools.html

# h5py

![h5py book](images/h5py.png "h5py book")

[h5py](https://www.h5py.org/) is the Python binding for accessing HDF5. It was originally written by [Andrew Collette](http://shop.oreilly.com/product/0636920030249.do).

h5py is already available on most ESRF computers.
It is cross-platform and can be installed using, for example:

- apt:
```bash
apt-get install python3-h5py
```
- pip
```bash
pip install h5py
```
- Also available from source code (under a BSD license)

    * https://github.com/h5py/h5py

## How to read an hdf5 file with h5py ?

* first open a file using a [File Object](http://docs.h5py.org/en/stable/high/file.html)
```
h5py.File('myfile.hdf5', opening_mode)
```

   [opening modes](http://docs.h5py.org/en/stable/high/file.html#opening-creating-files) are:

| Mode    | Description                                                     |
|---------|-----------------------------------------------------------------|
| r       | Read-only, file must exist (default since h5py 3.0)             |
| r+      | Read/write, file must exist                                     |
| w       | Create file, truncate if exists                                 |
| w- or x | Create file, fail if exists                                     |
| a       | Read/write if exists, create otherwise (default before h5py 3.0) |

* then you will be able to access your data from the root node. [Groups](http://docs.h5py.org/en/stable/high/group.html) operate as dictionaries.
   

In [None]:
import h5py

h5file = h5py.File('data/test.h5', "r")

# print available names at the first level
print("First children:", h5file.keys())

In [None]:
# Get a dataset from a sub group
dataset = h5file['/diff_map_0004/data/map']
dataset

In [None]:
# Remember to close the file
h5file.close()
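The opening modes from the table above can be tried on a throwaway file. This is a minimal sketch: the file name and dataset names are hypothetical, and the exception raised when writing to a read-only file is caught broadly because its exact class is not assumed here.

```python
import os
import tempfile

import h5py

tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "modes_demo.h5")  # hypothetical file name

# "w" creates the file (and would truncate an existing one)
with h5py.File(filename, "w") as h5file:
    h5file["value"] = 1

# "x" (or "w-") refuses to overwrite an existing file
try:
    h5py.File(filename, "x")
    exclusive_failed = False
except OSError:
    exclusive_failed = True

# "r" gives read-only access: trying to write raises an error
with h5py.File(filename, "r") as h5file:
    try:
        h5file["other"] = 2
        write_failed = False
    except Exception:  # the exact exception class is not assumed here
        write_failed = True

# "a" opens the file read/write and keeps the existing content
with h5py.File(filename, "a") as h5file:
    h5file["other"] = 2
    names = sorted(h5file.keys())

print(exclusive_failed, write_failed, names)
```

Passing the mode explicitly avoids depending on the default, which differs between h5py versions.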

### Context manager

* A context manager allocates and releases resources automatically when needed.
* It is usually used through the `with` statement.

So to access a file safely, instead of opening and closing it by hand as above, do:

In [None]:
# Or better, use a context manager
# The file is closed for you
with h5py.File('data/test.h5', "r") as h5file:
    print(h5file.keys())
    dataset = h5file['/diff_map_0004/data/map']
    print(dataset)
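To see what "closed for you" means in practice, here is a small sketch on a temporary file (the file name and the `data` dataset are hypothetical). It checks the state of the file object inside and outside the `with` block.

```python
import os
import tempfile

import h5py
import numpy

tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "context_demo.h5")  # hypothetical file name

with h5py.File(filename, "w") as h5file:
    h5file["data"] = numpy.arange(10)

with h5py.File(filename, "r") as h5file:
    dataset = h5file["data"]
    # Copy the content into memory while the file is still open
    in_memory = dataset[()]
    open_inside = bool(h5file)

# The file was closed when leaving the with block,
# even though no close() was called explicitly
open_outside = bool(h5file)

# The in-memory numpy copy is still usable
total = int(in_memory.sum())
print(open_inside, open_outside, total)
```

The file is also closed if an exception is raised inside the block, which is the main reason to prefer this form.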

## h5py datasets mimic numpy arrays

The data is read from the file only when it is needed.

In [None]:
import h5py
h5file = h5py.File('data/test.h5', "r")
dataset = h5file['/diff_map_0004/data/map']

In [None]:
# Here we only read metadata from the dataset
print("Dataset:", dataset.shape, dataset.size, dataset.dtype)

In [None]:
# Read and apply an operation
print(dataset[5, 5, 0:5])
print(2 * dataset[0, 5, 0:5])

In [None]:
# copy the data and store it as a numpy-array
# if no copy is done, the data will not be accessible once the file is closed 
b = dataset[()]
b[0, 0, 0:5] = 0
print(dataset[0, 0, 0:5])
print(b[0, 0, 0:5])

In [None]:
h5file.close()
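The same behaviour can be reproduced without `data/test.h5`. The sketch below uses a small synthetic 3D dataset (hypothetical name `cube`) to show that a dataset is not itself a numpy array, and that slicing it returns a numpy array holding only the selected part.

```python
import os
import tempfile

import h5py
import numpy

tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "slicing_demo.h5")  # hypothetical file name

cube = numpy.arange(4 * 5 * 6).reshape(4, 5, 6)
with h5py.File(filename, "w") as h5file:
    h5file["cube"] = cube

with h5py.File(filename, "r") as h5file:
    dataset = h5file["cube"]
    # The dataset object is a handle to the data in the file
    is_numpy = isinstance(dataset, numpy.ndarray)
    # Slicing reads the selection and returns a numpy array
    frame = dataset[2]
    line = dataset[1, 3, :]

print(is_numpy, type(frame), frame.shape, line)
```

Selecting only the needed frames keeps memory usage low when the dataset is much larger than the available RAM.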

## How to write in a HDF5 file with h5py ?

* *There are several ways to write groups and datasets. Here we mainly focus on the dictionary-like API.*
* http://docs.h5py.org/en/stable/high/group.html
* http://docs.h5py.org/en/stable/high/dataset.html

In [None]:
import numpy
import h5py

data = numpy.random.random(10000)
data.shape = 100, 100

# write
h5file = h5py.File('my_first_one.h5', mode='w')

# write data into a dataset from the root
h5file['/data1'] = data

# write data into a dataset from group1
h5file['/group1/data2'] = data

h5file.close()

The same operation with a context manager

In [None]:
import numpy
import h5py

# Create 2D data
data = numpy.arange(100 * 100)
data.shape = 100, 100

# Notice the mode='w', as 'write'
with h5py.File('my_first_one.h5', mode='w') as h5file:

    # write data into a dataset from the root
    h5file['/data1'] = data

    # write data into a dataset from group1
    h5file['/group1/data2'] = data

The same with a context manager, using `create_dataset` and `create_group` instead of the dictionary API:

In [None]:
import numpy
import h5py

# Create 2D data
data = numpy.arange(100 * 100)
data.shape = 100, 100

# Notice the mode='w', as 'write'
with h5py.File('my_first_one.h5', mode='w') as h5file:

    # write data into a dataset from the root
    h5file.create_dataset('data1', data=data)

    # create the group explicitly, then a dataset inside it
    grp1 = h5file.create_group("group1")
    grp1.create_dataset("data2", data=data)
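`create_dataset` also gives access to features listed earlier, such as compression and metadata. The following is a minimal sketch, assuming gzip compression and a chunk shape of 10 rows; the file name, the chunk shape and the `description` attribute are arbitrary choices for illustration.

```python
import os
import tempfile

import h5py
import numpy

tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "compressed_demo.h5")  # hypothetical file name

data = numpy.arange(100 * 100)
data.shape = 100, 100

with h5py.File(filename, mode="w") as h5file:
    # Compressed datasets are stored in chunks; here each chunk holds 10 rows
    dataset = h5file.create_dataset(
        "data1", data=data, chunks=(10, 100), compression="gzip"
    )
    # Attributes hold small metadata next to the data
    dataset.attrs["description"] = "demo array"

with h5py.File(filename, mode="r") as h5file:
    dataset = h5file["data1"]
    compression = dataset.compression
    chunks = dataset.chunks
    description = dataset.attrs["description"]
    # Decompression is transparent when reading
    restored = dataset[()]

print(compression, chunks, description)
```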

# Exercise: Flat field correction

Flat-field correction is a technique used to improve quality in digital imaging.

The goal is to normalize images and remove artifacts caused by variations in the pixel-to-pixel sensitivity of the detector and/or by distortions in the optical path. (see https://en.wikipedia.org/wiki/Flat-field_correction)

$$ normalized = \frac{raw - dark}{flat - dark} $$

* `normalized`: Image after flat field correction
* `raw`: Raw image. It is acquired with the sample.
* `flat`: Flat field image. It is the response given out by the detector for a uniform input signal. This image is acquired without the sample.
* `dark`: Also named `background` or `dark current`. It is the response given out by the detector when there is no signal. This image is acquired without the beam.

Here is a function implementing the flat field correction:

*note: make sure you execute the cell for this function to be defined*

In [None]:
import numpy

def flatfield_correction(raw, flat, dark):
    """
    Apply a flat-field correction to raw data using a flat and a dark image.
    """
    # Make sure that the computation is done using float
    # to avoid type overflow or loss of precision
    raw = raw.astype(numpy.float32)
    flat = flat.astype(numpy.float32)
    dark = dark.astype(numpy.float32)
    # Do the computation
    return (raw - dark) / (flat - dark)
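To check the function on known values, here is a small example with synthetic 2x2 images (the pixel values are made up). It also shows why the conversion to float matters: with unsigned integers, subtracting a dark value larger than the raw value wraps around.

```python
import numpy


def flatfield_correction(raw, flat, dark):
    """Same function as above, repeated to keep this example self-contained."""
    raw = raw.astype(numpy.float32)
    flat = flat.astype(numpy.float32)
    dark = dark.astype(numpy.float32)
    return (raw - dark) / (flat - dark)


# Synthetic detector images stored as unsigned 16-bit integers
dark = numpy.full((2, 2), 10, dtype=numpy.uint16)
flat = numpy.full((2, 2), 110, dtype=numpy.uint16)
raw = numpy.array([[60, 110], [10, 5]], dtype=numpy.uint16)

normalized = flatfield_correction(raw, flat, dark)
print(normalized)

# Without the float conversion, 5 - 10 wraps around in uint16
wrapped = (raw - dark)[1, 1]
print(wrapped)
```

A pixel equal to the flat gives 1, a pixel equal to the dark gives 0, and a pixel below the dark gives a small negative value instead of a huge wrapped one.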

# Exercise 1

1. Browse the file ``data/ID16B_diatomee.h5``
2. Get a single raw dataset, a flat field dataset and a dark image dataset from this file
3. Apply the flat field correction
4. Save the result into a new HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise1.py](./solutions/exercise1.py)

In [None]:
from h5glance import H5Glance
H5Glance("data/ID16B_diatomee.h5")

In [None]:
import h5py

# this is a comment

# step1: Read the data

# raw_data_path = ...
# raw_data = ...

# flat_path = ...
# flat = ...

# dark_path = ...
# dark = ...

# step2: Compute the result

# normalized = flatfield_correction(raw_data, flat, dark)

# step3: Save the result

# ...


*Note: to plot an image you can use the `imshow` command. `%pylab` must be called once before calling the `imshow` function.*

In [None]:
%pylab inline

In [None]:
import numpy
imshow(numpy.random.random((20, 60)))

# Exercise 2

1. Apply the flat field correction to all raw data available (use the same flat and dark for all the images)
2. Save each result into different datasets of the same HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise2.py](./solutions/exercise2.py)


# Exercise 3

From the previous exercise, we can see that the flat field correction was not very good for the last images.

Another flat field was acquired at the end of the acquisition.

We could use this information to compute a flat field closer to the image we want to normalize. It can be done with a linear interpolation of the flat images by using the name of the image as the interpolation factor (which varies between 0 and 500 in this case).

1. For each raw dataset, compute the corresponding flat field using linear interpolation (between `flatfield/0000` and `flatfield/0500`)
2. Save each result into different datasets in a single HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise3.py](./solutions/exercise3.py)
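The interpolation itself can be sketched on synthetic arrays, independently of the file. This is not the provided solution: `interpolate_flat` is a hypothetical helper, and the flat values are made up. The weight is the image index rescaled to the range 0 to 1.

```python
import numpy


def interpolate_flat(flat_start, flat_end, index, index_start=0, index_end=500):
    """Linearly interpolate between two flat fields (hypothetical helper)."""
    weight = (index - index_start) / (index_end - index_start)
    flat_start = flat_start.astype(numpy.float32)
    flat_end = flat_end.astype(numpy.float32)
    return (1.0 - weight) * flat_start + weight * flat_end


# Synthetic flats standing in for flatfield/0000 and flatfield/0500
flat_0000 = numpy.full((2, 2), 100, dtype=numpy.uint16)
flat_0500 = numpy.full((2, 2), 200, dtype=numpy.uint16)

# The dataset name gives the index, e.g. "0250" -> 250
index = int("0250")
flat_0250 = interpolate_flat(flat_0000, flat_0500, index)
print(flat_0250)
```

The interpolated flat can then be passed to `flatfield_correction` in place of a fixed flat.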

# Conclusion

Recommended libraries, depending on the use case and the file format:

| Formats              | Read            | Write |
|----------------------|-----------------|-------|
| HDF5                 | silx/h5py       | h5py  |
| Specfile             | silx            |       |
| EDF                  | silx/fabio      | fabio |
| Other raster formats | silx/fabio      | fabio |

# Utils: Conversion tools


- `fabio-convert`: To convert raster images
- `silx convert`: To convert EDF or spec files to HDF5

# Nexus

[Nexus](https://www.nexusformat.org/) is a data format for neutron, x-ray, and muon science.

It aims to be a common data format for scientists for greater collaboration.

If you intend to share your data, it gives you a standard way to store it.

The main advantage is to ensure compatibility between your data files and existing software (if it respects the Nexus format), and between your software and different datasets.

* an example on [how to store tomography raw data](http://download.nexusformat.org/doc/html/classes/applications/NXtomo.html?highlight=tomography)
* an example to store [tomography application (3D reconstruction)](http://download.nexusformat.org/doc/html/classes/applications/NXtomoproc.html?highlight=tomography)
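In an HDF5 file, Nexus classes are declared with an `NX_class` attribute on groups. The sketch below writes a minimal `NXentry`/`NXdata` layout with h5py. It is only an illustration of the convention, with hypothetical group and dataset names, and it is not a complete or validated `NXtomo` file.

```python
import os
import tempfile

import h5py
import numpy

tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "nexus_demo.h5")  # hypothetical file name

with h5py.File(filename, "w") as h5file:
    # An NXentry group holds one measurement
    entry = h5file.create_group("entry")
    entry.attrs["NX_class"] = "NXentry"

    # An NXdata group describes what to plot
    nxdata = entry.create_group("data")
    nxdata.attrs["NX_class"] = "NXdata"
    nxdata.attrs["signal"] = "image"  # name of the dataset to display
    nxdata["image"] = numpy.zeros((4, 4))

    # "default" attributes point readers to the default plot
    h5file.attrs["default"] = "entry"
    entry.attrs["default"] = "data"

with h5py.File(filename, "r") as h5file:
    entry_class = h5file["entry"].attrs["NX_class"]
    signal = h5file["entry/data"].attrs["signal"]

print(entry_class, signal)
```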
