[Feature] Add support for using select user-defined zarr stores (#62)
* Add support for using select user-defined zarr stores
* Update resolution of references to work also for file-based Zarr stores
* Update test_io_zarr.py to allow file-based Zarr stores
* Updated changelog
* Add ZarrIO.file property to ease implementation of tests
* Refactored ZarrIO tests for consistency and to run all backends via dedicated test classes
* Update NWBZarrIO to support the new path options from ZarrIO
* Update test_io_convert.py to test with all supported zarr.storage backends
* Added docs on how to integrate new backend stores with ZarrIO
* Update storage docs to add missing reserved links and groups
* Add DEFAULT_SPEC_LOC_DIR and SUPPORTED_ZARR_STORES module variables to backend.py
* Add Mixin and test cases to test conversion between Zarr and Zarr
* Update ZarrIO tutorial to describe using custom data stores
* Increase HDMF version to 3.5
* Removed filepath param from get_builder_exists_on_disk
* Consistently close file in test when explicitly opened

Co-authored-by: Ryan Ly <rly@lbl.gov>
oruebel and rly committed Jan 18, 2023
1 parent 42134e2 commit f59da74
Showing 13 changed files with 1,153 additions and 426 deletions.
26 changes: 24 additions & 2 deletions CHANGELOG.md
@@ -1,6 +1,28 @@
# HDMF-ZARR Changelog

## 0.2.0 (Latest)
## 0.3.0 (Upcoming)

### New Features
* Added support, tests, and docs for using ``DirectoryStore``, ``TempStore``, and
``NestedDirectoryStore`` Zarr storage backends with ``ZarrIO`` and ``NWBZarrIO``
@oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### Minor enhancements
* Updated handling of references on read to simplify future integration of file-based Zarr
stores (e.g., ZipStore or database stores) @oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### Test suite enhancements
* Modularized unit tests to simplify running tests for multiple Zarr storage backends
@oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### Docs
* Added developer documentation on how to integrate new storage backends with ZarrIO
[#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### API Changes
* Removed unused ``filepath`` argument from ``ZarrIO.get_builder_exists_on_disk`` [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

## 0.2.0 (January 6, 2023)

### Bugs
* Updated the storage of links/references to use paths relative to the current Zarr file to avoid breaking
@@ -22,7 +44,7 @@
* Removed dependency on ``dandi`` library for data download in the conversion tutorial by storing the NWB files as
local resources @oruebel [#61](https://github.com/hdmf-dev/hdmf-zarr/pull/61)

## 0.1.0
## 0.1.0 (August 23, 2022)

### New features

46 changes: 46 additions & 0 deletions docs/gallery/plot_zarr_io.py
@@ -87,6 +87,7 @@
#
zarr_io.close()


###############################################################################
# Converting to/from HDF5 using ``export``
# ----------------------------------------
@@ -137,3 +138,48 @@
intable_from_zarr = zarr_read_io.read()
intable_zarr_df = intable_from_zarr.to_dataframe()
intable_zarr_df # display the table in the gallery output


###############################################################################
# Using custom Zarr storage backends
# -----------------------------------
#
# :py:class:`~hdmf_zarr.backend.ZarrIO` supports a subset of the data stores available
# for Zarr, e.g., :py:class:`~zarr.storage.DirectoryStore`, :py:class:`~zarr.storage.TempStore`,
# and :py:class:`~zarr.storage.NestedDirectoryStore`. The supported stores are defined
# in :py:attr:`~hdmf_zarr.backend.SUPPORTED_ZARR_STORES`. The main obstacle to supporting
# all possible Zarr stores in :py:class:`~hdmf_zarr.backend.ZarrIO` is that
# Zarr does not natively support links and references.
#
# .. note::
#
#     See :ref:`sec-integrating-zarr-data-stores` for details on how to integrate
#     new stores with :py:class:`~hdmf_zarr.backend.ZarrIO`.
#
# To use a store other than the default, we simply need to instantiate the store
# and pass it to :py:class:`~hdmf_zarr.backend.ZarrIO` via the ``path`` parameter.
# Here we use a :py:class:`~zarr.storage.NestedDirectoryStore` to write a simple
# :py:class:`hdmf.common.CSRMatrix` container to disk.
#

from zarr.storage import NestedDirectoryStore
from hdmf.common import CSRMatrix

zarr_nsd_dir = "example_nested_store.zarr"
# Create the store for the new directory (not for the earlier ``zarr_dir``)
store = NestedDirectoryStore(zarr_nsd_dir)
csr_container = CSRMatrix(
    name=ROOT_NAME,
    data=[1, 2, 3, 4, 5, 6],
    indices=[0, 2, 2, 0, 1, 2],
    indptr=[0, 2, 3, 6],
    shape=(3, 3))

# Write the csr_container to Zarr using a NestedDirectoryStore
with ZarrIO(path=store, manager=get_manager(), mode='w') as zarr_io:
    zarr_io.write(csr_container)

# Read the CSR matrix to confirm the data was written correctly
with ZarrIO(path=store, manager=get_manager(), mode='r') as zarr_io:
    csr_read = zarr_io.read()
    print(" data=%s\n indices=%s\n indptr=%s\n shape=%s" %
          (str(csr_read.data), str(csr_read.indices), str(csr_read.indptr), str(csr_read.shape)))
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -34,6 +34,7 @@ Citing hdmf-zarr
:caption: For Developers:

storage
integrating_data_stores
hdmf_zarr

Indices and tables
143 changes: 143 additions & 0 deletions docs/source/integrating_data_stores.rst
@@ -0,0 +1,143 @@
.. _sec-integrating-zarr-data-stores:

================================
Integrating New Zarr Data Stores
================================

:py:class:`~hdmf_zarr.backend.ZarrIO` by default uses the Zarr
:zarr-docs:`DirectoryStore <api/storage.html#zarr.storage.DirectoryStore>` via
the :py:meth:`zarr.convenience.open` method. :py:class:`~hdmf_zarr.backend.ZarrIO` further
supports all stores listed in :py:attr:`~hdmf_zarr.backend.SUPPORTED_ZARR_STORES`.
Users can specify a particular store using the ``path`` parameter when creating a new
:py:class:`~hdmf_zarr.backend.ZarrIO` instance. This document discusses key steps towards
integrating other data stores available for Zarr with :py:class:`~hdmf_zarr.backend.ZarrIO`.


Updating ZarrIO
===============

1. Import and add the new storage class to :py:attr:`~hdmf_zarr.backend.SUPPORTED_ZARR_STORES`.
   This in turn allows instances of your new storage class to be passed as the ``path`` parameter
   to :py:meth:`~hdmf_zarr.backend.ZarrIO.__init__`
   and :py:meth:`~hdmf_zarr.backend.ZarrIO.load_namespaces` and to pass
   :py:meth:`~hdmf.utils.docval` validation for these functions.

   * If your store has a ``.path`` property, then the :py:attr:`~hdmf.backends.io.HDMFIO.source` property
     will be set accordingly in ``__init__`` in :py:class:`~hdmf_zarr.backend.ZarrIO`. Otherwise,
     ``__init__`` may need to be updated to set a correct ``source`` (used, e.g., to define links).

2. Update :py:meth:`~hdmf_zarr.backend.ZarrIO.open` and :py:meth:`~hdmf_zarr.backend.ZarrIO.close`
   as necessary.

3. Depending on the type of data store, it may also be necessary to update the handling of links
   and references in :py:class:`~hdmf_zarr.backend.ZarrIO`. In principle, reading and writing of
   links should not need to change. However, the
   :py:meth:`~hdmf_zarr.backend.ZarrIO.__resolve_ref` and
   :py:meth:`~hdmf_zarr.backend.ZarrIO.get_builder_exists_on_disk`
   methods may need to be updated to ensure that
   references are opened correctly on read for files stored with your new store. The
   :py:meth:`~hdmf_zarr.backend.ZarrIO.__get_ref` function may also need to be updated, in
   particular if links to your store also modify the storage schema for links
   (e.g., if you need to store additional metadata in order to resolve links to your store).
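
As a rough illustration of step 1, the sketch below shows one way a ``path`` argument that is either a string or a store object could be mapped to a ``source`` string. ``resolve_source`` and ``FakeStore`` are hypothetical names used only for this example; they are not part of ``hdmf_zarr``, and this is not the actual ``ZarrIO.__init__`` logic.

```python
import os


class FakeStore:
    """Hypothetical stand-in for a Zarr store that exposes a ``path`` attribute."""

    def __init__(self, path):
        self.path = path


def resolve_source(path):
    """Return an absolute source path for a string or a store-like object.

    Assumption: a store without a string ``path`` attribute needs custom
    handling, which this sketch signals by raising a ValueError.
    """
    if isinstance(path, str):
        return os.path.abspath(path)
    store_path = getattr(path, "path", None)
    if not isinstance(store_path, str):
        raise ValueError("cannot determine source for %s" % type(path).__name__)
    return os.path.abspath(store_path)


print(resolve_source("example.zarr"))
print(resolve_source(FakeStore("example_nested_store.zarr")))
```

A store that keeps its data somewhere other than a file-system path (e.g., a database) would hit the ``ValueError`` branch here, which corresponds to the case where ``__init__`` needs to be updated to set a correct ``source``.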

Updating NWBZarrIO
==================

In most cases we should not need to update :py:class:`~hdmf_zarr.nwb.NWBZarrIO` as it inherits
directly from :py:class:`~hdmf_zarr.backend.ZarrIO`. However, in particular if the interface for
``__init__`` has changed for :py:class:`~hdmf_zarr.backend.ZarrIO`,
then we may also need to modify :py:class:`~hdmf_zarr.nwb.NWBZarrIO` accordingly.

Updating Unit Tests
===================

Much of the core test harness of ``hdmf_zarr`` is modularized to simplify running existing
tests with new storage backends. In this way, we can quickly create a collection of common tests
for new backends, and new test cases added to the test suite can be run with all backends.
The relevant test classes are located in the `/tests/unit <https://github.com/hdmf-dev/hdmf-zarr/tree/dev/tests/unit>`_
directory of the hdmf_zarr repository.

test_zarrio.py
--------------
`base_tests_zarrio.py <https://github.com/hdmf-dev/hdmf-zarr/blob/dev/tests/unit/base_tests_zarrio.py>`_
provides a collection of base classes that define common
test cases to test basic functionality of :py:class:`~hdmf_zarr.backend.ZarrIO`. Using these base classes, the
`test_zarrio.py <https://github.com/hdmf-dev/hdmf-zarr/blob/dev/tests/unit/test_io_zarr.py>`_ module
then implements concrete tests for various backends. To create tests for a new data store, follow the steps below.

1. **Create tests for new data store:** Add the following main classes (``<MyStore>`` in the code below
   needs to be replaced with the class name of the new data store):

   .. code-block:: python

       #########################################
       #  <MyStore> tests
       #########################################
       class TestZarrWriter<MyStore>(BaseTestZarrWriter):
           """Test writing of builder with Zarr using a custom <MyStore>"""
           def setUp(self):
               super().setUp()
               self.store = <MyStore>()
               self.store_path = self.store.path

       class TestZarrWriteUnit<MyStore>(BaseTestZarrWriteUnit):
           """Unit test for individual write functions using a custom <MyStore>"""
           def setUp(self):
               super().setUp()
               self.store = <MyStore>()
               self.store_path = self.store.path

       class TestExportZarrToZarr<MyStore>(BaseTestExportZarrToZarr):
           """Test exporting Zarr to Zarr using <MyStore>."""
           def setUp(self):
               super().setUp()
               self.stores = [<MyStore>() for i in range(len(self.store_path))]
               self.store_paths = [s.path for s in self.stores]

   .. note::

       In the case of ``BaseTestZarrWriter`` and ``BaseTestZarrWriteUnit`` the ``self.store`` variable defines
       the data store to use with :py:class:`~hdmf_zarr.backend.ZarrIO` while running tests.
       ``self.store_path`` is used during ``tearDown`` to clean up files as well as in some cases
       to set up links in test ``Builders`` or if a test case requires opening a file with Zarr directly.
       ``BaseTestExportZarrToZarr`` tests exporting between Zarr data stores but requires 4 stores and
       paths to be specified via the ``self.store`` and ``self.store_path`` variables. To test export
       between instances of your new backend, you can simply set all 4 instances to the new store while using
       different storage paths for the different instances (which are saved in ``self.store_paths``).

2. **Update ``base_tests_zarrio.reopen_store``** If our new data store cannot be reused after
   it has been closed via :py:meth:`~hdmf_zarr.backend.ZarrIO.close`, then update this function
   to either reopen or create a new equivalent data store that can be used for read.
   The function is used in tests that write data, then close the ZarrIO, and
   create a new ZarrIO to read and validate the data.

3. **Run and update tests** Depending on your data store, some test cases in ``BaseTestZarrWriter``,
   ``BaseTestZarrWriteUnit``, or ``BaseTestExportZarrToZarr`` may need to be updated to work correctly
   with our data store. Run the test suite to check whether the ``setUp`` in your
   test classes or any specific test cases need to be updated.
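
The steps above can be sketched with a small, self-contained example that uses only the standard library. Everything in it is hypothetical: ``OneShotStore`` models a store that cannot be reused once closed, ``reopen_store`` here is a simplified stand-in for the helper of the same name in ``base_tests_zarrio.py``, and ``BaseTestWriteRead`` is a plain mixin rather than one of the actual ``hdmf_zarr`` base test classes.

```python
import unittest


class OneShotStore:
    """Hypothetical store that cannot be used again once it has been closed."""

    _disk = {}  # class-level dict standing in for persistent storage, keyed by path

    def __init__(self, path):
        self.path = path
        self.closed = False
        OneShotStore._disk.setdefault(path, {})

    def _check_open(self):
        if self.closed:
            raise RuntimeError("store has been closed")

    def write(self, key, value):
        self._check_open()
        OneShotStore._disk[self.path][key] = value

    def read(self, key):
        self._check_open()
        return OneShotStore._disk[self.path][key]

    def close(self):
        self.closed = True


def reopen_store(store):
    """Return a store that can be read after ``store`` was closed."""
    if isinstance(store, OneShotStore):
        return OneShotStore(store.path)  # create an equivalent, fresh store
    return store  # assume other stores (or plain paths) remain usable


class BaseTestWriteRead:
    """Backend-independent test case; concrete subclasses set ``self.store``."""

    def test_write_then_read(self):
        self.store.write("data", [1, 2, 3])
        self.store.close()
        read_store = reopen_store(self.store)
        self.assertEqual(read_store.read("data"), [1, 2, 3])


class TestWriteReadOneShotStore(BaseTestWriteRead, unittest.TestCase):
    """Run the shared test case with the hypothetical OneShotStore."""

    def setUp(self):
        self.store = OneShotStore("test_one_shot.zarr")
        self.store_path = self.store.path

    def tearDown(self):
        OneShotStore._disk.pop(self.store_path, None)  # clean up the fake "files"


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWriteReadOneShotStore)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The shared test logic lives in one place, so supporting another backend only requires a new subclass with its own ``setUp`` and, if the store cannot be reused after closing, an extra branch in ``reopen_store``.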

test_io_convert.py
------------------
`test_io_convert.py <https://github.com/hdmf-dev/hdmf-zarr/blob/dev/tests/unit/test_io_convert.py>`_
uses a collection of mixin classes to define custom test classes to test export from one IO backend
to another. As such, the test cases here typically first write to one target, then export to
another target, and then verify that the data in the two files is consistent.

1. **Update ``MixinTestHDF5ToZarr``, ``MixinTestZarrToHDF5``, and ``MixinTestZarrToZarr``**
   mixin classes to add the new backend to the ``WRITE_PATHS`` (if Zarr is the initial write
   target) and/or ``EXPORT_PATHS`` (if Zarr is the export target) variables to define our
   store as a write or export store for :py:class:`~hdmf_zarr.backend.ZarrIO`, respectively.
   Once we have added our new store as write/export targets to these mixins, all test cases
   defined in the module will be run with our new backend. Specifically, we commonly
   need to add an instance of our new data store to:

   * ``MixinTestHDF5ToZarr.EXPORT_PATHS``
   * ``MixinTestZarrToHDF5.WRITE_PATHS``
   * ``MixinTestZarrToZarr.WRITE_PATHS`` and ``MixinTestZarrToZarr.EXPORT_PATHS``

2. **Update tests and ZarrIO as necessary** Run the test suite and fix any identified issues.
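
The idea of driving the same test cases from ``WRITE_PATHS`` and ``EXPORT_PATHS`` lists can be sketched as follows. This is a simplified illustration using only the standard library, not the code in ``test_io_convert.py``: ``write``, ``export``, ``MyStore``, and ``MixinTestRoundTrip`` are hypothetical names, and the write and export targets are modeled as plain dictionaries.

```python
import unittest


def write(target, data):
    """Hypothetical writer: store a copy of ``data`` in the dict-like ``target``."""
    target.clear()
    target.update(data)


def export(source, target):
    """Hypothetical export: copy everything from ``source`` into ``target``."""
    write(target, dict(source))


class MyStore(dict):
    """Hypothetical new backend, modeled as a dict for illustration."""


class MixinTestRoundTrip:
    """Shared test logic; subclasses list the targets to test."""

    WRITE_PATHS = []   # targets used for the initial write
    EXPORT_PATHS = []  # targets used as the export destination

    def test_export_is_consistent(self):
        data = {"indices": [0, 2, 2], "shape": (3, 3)}
        # Every write target is combined with every export target
        for write_target in self.WRITE_PATHS:
            for export_target in self.EXPORT_PATHS:
                with self.subTest(write=type(write_target).__name__,
                                  export=type(export_target).__name__):
                    write(write_target, data)
                    export(write_target, export_target)
                    self.assertEqual(dict(export_target), data)


class TestRoundTrip(MixinTestRoundTrip, unittest.TestCase):
    # Adding an instance of the new store to both lists is enough to
    # run the shared test with it as write and as export target.
    WRITE_PATHS = [dict(), MyStore()]
    EXPORT_PATHS = [dict(), MyStore()]


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRoundTrip)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the combinations are generated from the two lists, a new backend is covered by every existing and future test in the mixin without writing additional test methods.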

42 changes: 33 additions & 9 deletions docs/source/storage.rst
@@ -1,11 +1,11 @@
.. _sec-zarr-storage:

========
Storage
========
=====================
Storage Specification
=====================

hdmf-zarr currently uses the Zarr :zarr-docs:`DirectoryStory <api/storage.html#zarr.storage.DirectoryStore>`,
which uses directories and files on a standard file system to serialize data. Below we describe how
hdmf-zarr currently uses the Zarr :zarr-docs:`DirectoryStore <api/storage.html#zarr.storage.DirectoryStore>`,
which uses directories and files on a standard file system to serialize data.

Format Mapping
==============
@@ -62,6 +62,14 @@ Groups
object ID Attribute ``object_id`` on the Zarr Group
============================ ======================================================================================

.. _sec-zarr-storage-groups-reserved:

Reserved groups
----------------

The :py:class:`~hdmf_zarr.backend.ZarrIO` backend typically caches the schema used to create a file in the
group ``/specifications`` (see also :ref:`sec-zarr-caching-specifications`).

.. _sec-zarr-storage-datasets:

Datasets
@@ -127,8 +135,9 @@ Reserved attributes
-------------------

The :py:class:`~hdmf_zarr.backend.ZarrIO` backend defines a set of reserved attribute names defined in
py:attr:`~hdmf_zarr.backend.ZarrIO.__reserve_attribute`. These reserved attributes are used to implement
functionality (e.g., links and object references) that are not natively supported by Zarr.
:py:attr:`~hdmf_zarr.backend.ZarrIO.__reserve_attribute`. These reserved attributes are used to implement
functionality (e.g., links and object references, which are not natively supported by Zarr) and may be
added on any Group or Dataset in the file.

============================ ======================================================================================
Reserved Attribute Name Usage
@@ -139,6 +148,16 @@ functionality (e.g., links and object references) that are not natively supporte
See :ref:`sec-zarr-storage-references`
============================ ======================================================================================

In addition, the following reserved attributes are added to the root Group of the file only:

============================ ======================================================================================
Reserved Attribute Name      Usage
============================ ======================================================================================
.specloc                     Attribute storing the path to the Group where the schema for the file is
                             cached. See :py:attr:`~hdmf_zarr.backend.SPEC_LOC_ATTR`
============================ ======================================================================================
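
To make the role of the ``.specloc`` attribute concrete, the following minimal sketch resolves the group in which a given namespace version would be cached. ``spec_group_path`` is a hypothetical helper, the root attributes are modeled as a plain dictionary rather than Zarr attributes, and the value ``"specifications"`` for ``DEFAULT_SPEC_LOC_DIR`` is an assumption based on the ``/specifications`` group name used in this document.

```python
SPEC_LOC_ATTR = ".specloc"               # reserved root attribute name (see table above)
DEFAULT_SPEC_LOC_DIR = "specifications"  # assumed default name of the cache group


def spec_group_path(root_attrs, namespace, version):
    """Build the path of the group holding one cached namespace version."""
    # Fall back to the default location if the root attribute is missing
    spec_loc = root_attrs.get(SPEC_LOC_ATTR, DEFAULT_SPEC_LOC_DIR)
    return "/" + "/".join([spec_loc.strip("/"), namespace, version])


root_attrs = {SPEC_LOC_ATTR: DEFAULT_SPEC_LOC_DIR}
print(spec_group_path(root_attrs, "core", "2.0.1"))  # /specifications/core/2.0.1
```

Reading the location from the root attribute, rather than hard-coding it, is what would allow a file to cache its schema in a differently named group.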


.. _sec-zarr-storage-links:

Links
@@ -337,6 +356,8 @@ The mappings of data types is as follows
+--------------------------+------------------------------------+----------------+


.. _sec-zarr-caching-specifications:

Caching format specifications
=============================

@@ -345,8 +366,11 @@ directly in the Zarr file. Caching the specification in the file ensures that us
the specification directly if necessary without requiring external resources.
For the Zarr backend, caching of the schema is implemented as follows.

The Zarr backend adds the reserved top-level group ``/specifications`` in which all format specifications (including
extensions) are cached. The ``/specifications`` group contains for each specification namespace a subgroup
The :py:class:`~hdmf_zarr.backend.ZarrIO` backend adds the reserved top-level group ``/specifications``
in which all format specifications (including extensions) are cached. The default name for this group is
defined in :py:attr:`~hdmf_zarr.backend.DEFAULT_SPEC_LOC_DIR` and caching of
specifications is implemented in ``ZarrIO.__cache_spec``.
The ``/specifications`` group contains for each specification namespace a subgroup
``/specifications/<namespace-name>/<version>`` in which the specification for a particular version of a namespace
is stored (e.g., ``/specifications/core/2.0.1`` in the case of the NWB core namespace at version 2.0.1).
The actual specification data is then stored as a JSON string in scalar datasets with a binary, variable-length string
2 changes: 1 addition & 1 deletion requirements-min.txt
@@ -1,4 +1,4 @@
hdmf==3.4.0
hdmf==3.5.0
zarr==2.11.0
numcodecs==0.9.1
pynwb==2.0.0
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,5 +1,5 @@
# pinned dependencies to reproduce an entire development environment to use HDMF-ZARR
hdmf==3.4.0
hdmf==3.5.0
zarr==2.11.0
numcodecs==0.9.1
pynwb==2.0.1
2 changes: 1 addition & 1 deletion setup.py
@@ -17,7 +17,7 @@


reqs = [
'hdmf>=3.4.0',
'hdmf>=3.5.0',
'zarr>=2.11.0',
'numcodecs>=0.9.1',
'pynwb>=2.0.0',