update README.md

portugueslab · Mar 27, 2021 · 5f618fd
1 parent 9393743
commit 5f618fd
Showing 1 changed file with 24 additions and 12 deletions.
diff --git a/README.md b/README.md
@@ -10,28 +10,40 @@
 
 
 
-A package for saving and reading large HDF5-based chunked arrays. 
+A minimal package for saving and reading large HDF5-based chunked arrays.
 
-This package has been developed in the [`Portugues lab`](http://www.portugueslab.com) for volumetric calcium imaging data. `split_dataset` is extensivly used in the calcium imaging analysis package [`fimpy`](https://github.com/portugueslab/fimpy); SplitDatasets are saved by the microscope control libraries [`sashimi`](https://github.com/portugueslab/sashimi) and [`brunoise`](https://github.com/portugueslab/brunoise).
+This package has been developed in the [`Portugues lab`](http://www.portugueslab.com) for volumetric calcium imaging data. `split_dataset` is extensively used in the calcium imaging analysis package [`fimpy`](https://github.com/portugueslab/fimpy); the microscope control libraries [`sashimi`](https://github.com/portugueslab/sashimi) and [`brunoise`](https://github.com/portugueslab/brunoise) save files as split datasets.
 
-[`napari-split-dataset`](https://github.com/portugueslab/napari-split-dataset) support the visualization of SplitDatasets in `napari`
+[`napari-split-dataset`](https://github.com/portugueslab/napari-split-dataset) supports the visualization of SplitDatasets in `napari`.
 
-# Features
-The package contains the definition of  `SplitDataset` objects, that save large arrays in memory as separate  `.h5` files. Any n of dimensions and block sizes are supported in principle, but the package has been used only with 3D and 4D arrays.
-Numpy-style indexing can then be used to retrieve data from a `SplitDataset` object.
+## Why use a split dataset?
+Split datasets are numpy-like arrays saved over multiple h5 files. The concept of split datasets is not different from e.g. [zarr arrays](https://zarr.readthedocs.io/en/stable/); however, relying on h5 files allows for partial reading even within the same file, which is crucial for visualizing volumetric time series, the main application `split_dataset` has been developed for (see [this discussion](https://github.com/zarr-developers/zarr-python/issues/521) on the limitations of zarr arrays).
 
-# Minimal example
+# Structure of a split dataset
+A split dataset is a folder containing multiple numbered h5 files (one file per chunk) and a JSON metadata file with information on the shape of the full dataset and of its chunks.
+The h5 files are saved using the [flammkuchen](https://github.com/portugueslab/flammkuchen) library (a fork of [deepdish](https://deepdish.readthedocs.io/en/latest/)). Each file contains a dictionary with the data under the `stack` key.
+
+`SplitDataset` objects can then be instantiated from the dataset path, and numpy-style indexing can be used to load data as numpy arrays. Any number of dimensions and block sizes are supported in principle; the package has been used mainly with 3D and 4D arrays.
+
+
+
+## Minimal example
 ```python
 # Load a  SplitDataset via a SplitDataset object:
-from split_dataset import SplitDataset 
+from split_dataset import SplitDataset
 ds = SplitDataset(path_to_dataset)
 
 # Retrieve data in an interval:
-ds[n_start:n_end, :, :, :]
+data_array = ds[n_start:n_end, :, :, :]
 ```
 
+## Creating split datasets
+New split datasets can be created with the `split_dataset.save_to_split_dataset` function, provided that the original data is fully loaded in memory. Alternatively, e.g. for time acquisitions, a split dataset can be saved one chunk at a time. It is enough to save correctly formatted `.h5` files with `flammkuchen`, together with the corresponding JSON metadata file describing the full split dataset shape (this is [what happens in sashimi](https://github.com/portugueslab/sashimi/blob/01046f2f24483ab702be379843a1782ababa7d2d/sashimi/processes/streaming_save.py#L186)).
+
 
 # TODO
+* provide utilities for
+* support for more advanced indexing (step and vector indexing)
 * support for cropping a `SplitDataset`
 * support for resolution and frequency metadata
 
@@ -52,6 +64,6 @@ Credits
 
 Part of this package was inspired by  [Cookiecutter](https://github.com/audreyr/cookiecutter) and [this](https://github.com/audreyr/cookiecutter-pypackage) template.
 
-.. _`Portugues lab`: 
-.. _Cookiecutter: 
-.. _this: 
+.. _`Portugues lab`:
+.. _Cookiecutter:
+.. _this:
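
A note on the chunk layout this README change describes (one numbered h5 file per chunk): the minimal sketch below shows how an interval request such as `ds[n_start:n_end]` could be mapped onto per-chunk reads. It is not the `split_dataset` implementation. The function name `chunks_for_interval` and the parameter `chunk_len` are hypothetical, and the sketch assumes non-negative indices and equal-length chunks along the first axis only.

```python
def chunks_for_interval(n_start, n_end, chunk_len):
    """Map the interval [n_start, n_end) along the first axis to per-chunk reads.

    Hypothetical helper, not part of split_dataset. Returns a list of
    (chunk_index, local_start, local_end) tuples, one for each chunk
    that overlaps the requested interval.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    reads = []
    if n_end <= n_start:
        # Empty interval: no chunk needs to be opened.
        return reads
    first_chunk = n_start // chunk_len
    last_chunk = (n_end - 1) // chunk_len
    for i in range(first_chunk, last_chunk + 1):
        chunk_origin = i * chunk_len  # global index of the chunk's first entry
        local_start = max(n_start, chunk_origin) - chunk_origin
        local_end = min(n_end, chunk_origin + chunk_len) - chunk_origin
        reads.append((i, local_start, local_end))
    return reads


# Example: entries 15..41 of a dataset split in chunks of 20 along axis 0.
reads = chunks_for_interval(15, 42, chunk_len=20)
```

Each tuple names a chunk and the slice to read from it, so only the overlapping part of each file has to be loaded, which is the partial reading that h5 files allow.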