update README.md
vigji committed Mar 27, 2021
1 parent 9393743 commit 5f618fd
Showing 1 changed file with 24 additions and 12 deletions.
36 changes: 24 additions & 12 deletions README.md
@@ -10,28 +10,40 @@



A package for saving and reading large HDF5-based chunked arrays.
A minimal package for saving and reading large HDF5-based chunked arrays.

This package has been developed in the [`Portugues lab`](http://www.portugueslab.com) for volumetric calcium imaging data. `split_dataset` is extensivly used in the calcium imaging analysis package [`fimpy`](https://github.com/portugueslab/fimpy); SplitDatasets are saved by the microscope control libraries [`sashimi`](https://github.com/portugueslab/sashimi) and [`brunoise`](https://github.com/portugueslab/brunoise).
This package has been developed in the [`Portugues lab`](http://www.portugueslab.com) for volumetric calcium imaging data. `split_dataset` is extensively used in the calcium imaging analysis package [`fimpy`](https://github.com/portugueslab/fimpy); the microscope control libraries [`sashimi`](https://github.com/portugueslab/sashimi) and [`brunoise`](https://github.com/portugueslab/brunoise) save files as split datasets.

[`napari-split-dataset`](https://github.com/portugueslab/napari-split-dataset) support the visualization of SplitDatasets in `napari`
[`napari-split-dataset`](https://github.com/portugueslab/napari-split-dataset) supports the visualization of SplitDatasets in `napari`.

# Features
The package contains the definition of `SplitDataset` objects, that save large arrays in memory as separate `.h5` files. Any n of dimensions and block sizes are supported in principle, but the package has been used only with 3D and 4D arrays.
Numpy-style indexing can then be used to retrieve data from a `SplitDataset` object.
## Why use a split dataset?
Split datasets are numpy-like arrays saved over multiple h5 files. The concept of split datasets is not different from e.g. [zarr arrays](https://zarr.readthedocs.io/en/stable/); however, relying on h5 files allows for partial reading even within the same file, which is crucial for visualizing volumetric time series, the main application `split_dataset` has been developed for (see [this discussion](https://github.com/zarr-developers/zarr-python/issues/521) on the limitations of zarr arrays).

# Minimal example
# Structure of a split dataset
A split dataset is stored in a folder containing multiple, numbered h5 files (one file per chunk) and a metadata json file with information on the shape of the full dataset and of its chunks.
The h5 files are saved using the [flammkuchen](https://github.com/portugueslab/flammkuchen) library (formerly [deepdish](https://deepdish.readthedocs.io/en/latest/)). Each file contains a dictionary with the data under the `stack` key.

`SplitDataset` objects can then be instantiated from the dataset path, and numpy-style indexing can be used to load data as numpy arrays. Any number of dimensions and block sizes are supported in principle; the package has been used mainly with 3D and 4D arrays.



## Minimal example
```python
# Load a SplitDataset via a SplitDataset object:
from split_dataset import SplitDataset
from split_dataset import SplitDataset
ds = SplitDataset(path_to_dataset)

# Retrieve data in an interval:
ds[n_start:n_end, :, :, :]
data_array = ds[n_start:n_end, :, :, :]
```
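
To give an intuition for what interval indexing over a chunked array involves, here is a minimal sketch that uses only numpy. It is not the package's implementation: the chunks are kept in an in-memory list instead of h5 files, chunking is done along the first axis only, and the helper names `split_into_chunks` and `read_interval` are hypothetical.

```python
import numpy as np


def split_into_chunks(full, chunk_len):
    """Split `full` along axis 0 into blocks of at most `chunk_len` entries."""
    return [full[i:i + chunk_len] for i in range(0, full.shape[0], chunk_len)]


def read_interval(chunks, chunk_len, start, stop):
    """Assemble the equivalent of full[start:stop], touching only overlapping chunks.

    Assumes 0 <= start and stop <= total length along axis 0.
    """
    if stop <= start:
        return chunks[0][:0]  # empty array with the right trailing shape
    first = start // chunk_len       # index of the first chunk involved
    last = (stop - 1) // chunk_len   # index of the last chunk involved
    pieces = []
    for i in range(first, last + 1):
        offset = i * chunk_len  # position of this chunk in the full array
        lo = max(start - offset, 0)
        hi = min(stop - offset, chunks[i].shape[0])
        pieces.append(chunks[i][lo:hi])
    return np.concatenate(pieces, axis=0)


# Toy 4D array (time, z, y, x) split into blocks of 4, 4 and 2 time points.
full = np.arange(10 * 2 * 3 * 4).reshape(10, 2, 3, 4)
chunks = split_into_chunks(full, chunk_len=4)
data_array = read_interval(chunks, 4, 3, 9)
```

In this toy case the interval `3:9` spans all three chunks, but only one time point is taken from the first and from the last of them.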

## Creating split datasets
New split datasets can be created with the `split_dataset.save_to_split_dataset` function, provided that the original data is fully loaded in memory. Alternatively, e.g. for time acquisitions, a split dataset can be saved one chunk at a time. It is enough to save with `flammkuchen` correctly formatted .h5 files and the corresponding json metadata file describing the full split dataset shape (this is [what happens in sashimi](https://github.com/portugueslab/sashimi/blob/01046f2f24483ab702be379843a1782ababa7d2d/sashimi/processes/streaming_save.py#L186)).
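
The chunk-at-a-time pattern can be sketched as follows. This is an illustration under loud assumptions, not the format that `split_dataset` or `sashimi` actually write: `np.save` stands in for the `flammkuchen` call that would store each chunk in an .h5 file, and the file names, the `metadata.json` name and the metadata keys (`full_shape`, `chunk_shape`) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_in_chunks(chunk_iter, folder):
    """Write chunks to numbered files as they arrive, then write the metadata.

    Hypothetical layout: np.save is a stand-in for the real .h5 writer, and
    the metadata keys below are illustrative names only.
    """
    folder = Path(folder)
    total_len = 0
    chunk_shape = None
    for i, chunk in enumerate(chunk_iter):
        np.save(folder / f"{i:04d}.npy", chunk)  # one numbered file per chunk
        if chunk_shape is None:
            chunk_shape = chunk.shape
        total_len += chunk.shape[0]
    if chunk_shape is None:
        raise ValueError("no chunks were provided")
    # The full shape is only known once the acquisition has finished.
    metadata = {
        "full_shape": [total_len, *chunk_shape[1:]],
        "chunk_shape": list(chunk_shape),
    }
    with open(folder / "metadata.json", "w") as f:
        json.dump(metadata, f)
    return metadata


with tempfile.TemporaryDirectory() as tmp:
    # Simulated acquisition: three chunks of 5 time points each.
    acquired = (np.zeros((5, 2, 3, 4)) for _ in range(3))
    metadata = save_in_chunks(acquired, tmp)
    saved_files = sorted(p.name for p in Path(tmp).iterdir())
```

Writing the metadata last means the full array never has to be held in memory, which is the point of saving one chunk at a time.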


# TODO
* provide utilities for
* support for more advanced indexing (support for step and vector indexing)
* support for cropping a `SplitDataset`
* support for resolution and frequency metadata

@@ -52,6 +64,6 @@ Credits

Part of this package was inspired by [Cookiecutter](https://github.com/audreyr/cookiecutter) and [this](https://github.com/audreyr/cookiecutter-pypackage) template.

.. _`Portugues lab`:
.. _Cookiecutter:
.. _this:
.. _`Portugues lab`:
.. _Cookiecutter:
.. _this:
