[Feature] Add support for using select user-defined zarr stores #62

Merged
merged 42 commits into from
Jan 18, 2023
9d26290
Add support for using select user-defined zarr stores
oruebel Jan 5, 2023
0995543
Update resolution of references to work also for file-based Zarr stores
oruebel Jan 5, 2023
1a94f17
Update test_io_zarr.py to allow file-based Zarr stores
oruebel Jan 5, 2023
88d5dfb
Add SQLLite test draft
oruebel Jan 5, 2023
7076b82
Updated changelog
oruebel Jan 5, 2023
2bb8d78
Merge branch 'dev' into add/alternate_stores
oruebel Jan 6, 2023
d9a3a75
Merge branch 'dev' into add/alternate_stores
oruebel Jan 6, 2023
1dca7df
Add ZarrIO.file property ease implementation of tests
oruebel Jan 6, 2023
022cf46
Refactored ZarrIO tests for consistency and to run all backends via d…
oruebel Jan 7, 2023
0a59d9c
Update NWBZarrIO to support the new path options from ZarrIO
oruebel Jan 7, 2023
9b77f40
Minor changes to tests and comments
oruebel Jan 7, 2023
6a6b8f0
Update test_io_convert.py to test with all supported zarr.storage bac…
oruebel Jan 7, 2023
4c11f8c
Added docs on how to integrate new backends stores with ZarrIO
oruebel Jan 7, 2023
9a28380
Clarify the docs to integrate stores
oruebel Jan 7, 2023
6fc99cb
Update storage docs to add missing reserved links and groups
oruebel Jan 7, 2023
c85052c
Add DEFAULT_SPEC_LOC_DIR and SUPPORTED_ZARR_STORES module variable of…
oruebel Jan 7, 2023
a03bc32
Minor fixes to TempStore tests
oruebel Jan 7, 2023
e3e06a3
Add Mixin and test cases to test convertion between Zarr and Zarr
oruebel Jan 7, 2023
40c23ca
Update ZarrIO tutorial to describe using custom data stores
oruebel Jan 7, 2023
03b16d4
Update Changelog
oruebel Jan 7, 2023
55daea8
Attempt to fix Windows tests
oruebel Jan 7, 2023
fd00185
Add note on why we set dir on TempStore
oruebel Jan 7, 2023
954cf4f
Remove commented code
oruebel Jan 8, 2023
18191d5
Fix bad test setup
oruebel Jan 8, 2023
bc4386f
Set store paths in child classes
oruebel Jan 8, 2023
c9d7261
Merge branch 'dev' into add/alternate_stores
oruebel Jan 11, 2023
7595087
Merge branch 'dev' into add/alternate_stores
oruebel Jan 11, 2023
8d61358
Minor text fixes
rly Jan 17, 2023
c96aad6
Minor text fixes
rly Jan 17, 2023
4a316c8
Minor text fixes
rly Jan 17, 2023
dcbae13
Minor text fixes
rly Jan 17, 2023
2f36fd7
Minor text fixes
rly Jan 17, 2023
5dbff7e
Minor text edits
rly Jan 17, 2023
220ad36
Increase HDMF version to 3.5
oruebel Jan 17, 2023
86cfef2
Removed filepath param from get_builder_exists_on_disk
oruebel Jan 17, 2023
d629f9d
Update integrate new store docs
oruebel Jan 17, 2023
92ee3e0
Fix bad documentation of class members of mixins
oruebel Jan 17, 2023
10c983f
Remove references to SQLite store
oruebel Jan 18, 2023
077875d
Add missing message to assert in MixinTestCaseConvert
oruebel Jan 18, 2023
f373727
Consistenlty close file in test when explicitly opened
oruebel Jan 18, 2023
2bce016
Simplify test to reuse IO object
oruebel Jan 18, 2023
cfd0788
Updates dates in changelog
oruebel Jan 18, 2023
26 changes: 24 additions & 2 deletions CHANGELOG.md
@@ -1,6 +1,28 @@
# HDMF-ZARR Changelog

## 0.2.0 (Latest)
## 0.3.0 (Upcoming)

### New Features
* Added support, tests, and docs for using ``DirectoryStore``, ``TempStore``, and
``NestedDirectoryStore`` Zarr storage backends with ``ZarrIO`` and ``NWBZarrIO``
@oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### Minor enhancements
* Updated handling of references on read to simplify future integration of file-based Zarr
stores (e.g., ZipStore or database stores) @oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### Test suite enhancements
* Modularized unit tests to simplify running tests for multiple Zarr storage backends
@oruebel [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### Docs
* Added developer documentation on how to integrate new storage backends with ZarrIO
[#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

### API Changes
* Removed unused ``filepath`` argument from ``ZarrIO.get_builder_exists_on_disk`` [#62](https://github.com/hdmf-dev/hdmf-zarr/pull/62)

## 0.2.0 (January 6, 2023)

### Bugs
* Updated the storage of links/references to use paths relative to the current Zarr file to avoid breaking
@@ -22,7 +44,7 @@
* Removed dependency on ``dandi`` library for data download in the conversion tutorial by storing the NWB files as
local resources @oruebel [#61](https://github.com/hdmf-dev/hdmf-zarr/pull/61)

## 0.1.0
## 0.1.0 (August 23, 2022)

### New features

46 changes: 46 additions & 0 deletions docs/gallery/plot_zarr_io.py
@@ -87,6 +87,7 @@
#
zarr_io.close()


###############################################################################
# Converting to/from HDF5 using ``export``
# ----------------------------------------
@@ -137,3 +138,48 @@
intable_from_zarr = zarr_read_io.read()
intable_zarr_df = intable_from_zarr.to_dataframe()
intable_zarr_df # display the table in the gallery output


###############################################################################
# Using custom Zarr storage backends
# -----------------------------------
#
# :py:class:`~hdmf_zarr.backend.ZarrIO` supports a subset of data stores available
# for Zarr, e.g., :py:class:`~zarr.storage.DirectoryStore`, :py:class:`~zarr.storage.TempStore`,
# and :py:class:`~zarr.storage.NestedDirectoryStore`. The supported stores are defined
# in :py:attr:`~hdmf_zarr.backend.SUPPORTED_ZARR_STORES`. The main obstacle to supporting
# all possible Zarr stores in :py:class:`~hdmf_zarr.backend.ZarrIO` is that Zarr does not
# natively support links and references.
#
# .. note::
#
#     See :ref:`sec-integrating-zarr-data-stores` for details on how to integrate
#     new stores with :py:class:`~hdmf_zarr.backend.ZarrIO`.
#
# To use a store other than the default, we simply need to instantiate the store
# and pass it to :py:class:`~hdmf_zarr.backend.ZarrIO` via the ``path`` parameter.
# Here we use a :py:class:`~zarr.storage.NestedDirectoryStore` to write a simple
# :py:class:`hdmf.common.CSRMatrix` container to disk.
#

from zarr.storage import NestedDirectoryStore
from hdmf.common import CSRMatrix

zarr_nsd_dir = "example_nested_store.zarr"
# Create the store on the new directory (not on the earlier ``zarr_dir``)
store = NestedDirectoryStore(zarr_nsd_dir)
csr_container = CSRMatrix(
    name=ROOT_NAME,
    data=[1, 2, 3, 4, 5, 6],
    indices=[0, 2, 2, 0, 1, 2],
    indptr=[0, 2, 3, 6],
    shape=(3, 3))

# Write the csr_container to Zarr using a NestedDirectoryStore.
# The store object itself is passed via ``path`` so that it is actually used.
with ZarrIO(path=store, manager=get_manager(), mode='w') as zarr_io:
    zarr_io.write(csr_container)

# Read the CSR matrix to confirm the data was written correctly
with ZarrIO(path=store, manager=get_manager(), mode='r') as zarr_io:
    csr_read = zarr_io.read()
    print(" data=%s\n indices=%s\n indptr=%s\n shape=%s" %
          (str(csr_read.data), str(csr_read.indices), str(csr_read.indptr), str(csr_read.shape)))
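
To make the difference between the directory-based stores more concrete, the following minimal sketch shows how chunk keys differ between a flat ``DirectoryStore`` layout and a ``NestedDirectoryStore`` layout in Zarr v2. The helper ``chunk_key`` is hypothetical and only mimics the naming scheme; it is not part of ``zarr`` or ``hdmf_zarr``.

```python
def chunk_key(array_path: str, chunk_index: tuple, nested: bool) -> str:
    """Return the key under which one chunk of an array would be stored.

    Hypothetical helper for illustration only. A flat directory layout
    joins the chunk indices with "." while a nested layout uses "/",
    so each index becomes its own sub-directory.
    """
    separator = "/" if nested else "."
    chunk_name = separator.join(str(i) for i in chunk_index)
    return f"{array_path}/{chunk_name}"


flat_key = chunk_key("data", (0, 2), nested=False)
nested_key = chunk_key("data", (0, 2), nested=True)
print(flat_key)    # data/0.2
print(nested_key)  # data/0/2
```

A nested layout can help on file systems that handle very large flat directories poorly, since the chunks of an array are spread over several sub-directories.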
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -34,6 +34,7 @@ Citing hdmf-zarr
:caption: For Developers:

storage
integrating_data_stores
hdmf_zarr

Indices and tables
143 changes: 143 additions & 0 deletions docs/source/integrating_data_stores.rst
@@ -0,0 +1,143 @@
.. _sec-integrating-zarr-data-stores:

================================
Integrating New Zarr Data Stores
================================

:py:class:`~hdmf_zarr.backend.ZarrIO` by default uses the Zarr
:zarr-docs:`DirectoryStore <api/storage.html#zarr.storage.DirectoryStore>` via
the :py:meth:`zarr.convenience.open` method. :py:class:`~hdmf_zarr.backend.ZarrIO` further
supports all stores listed in :py:attr:`~hdmf_zarr.backend.SUPPORTED_ZARR_STORES`.
Users can specify a particular store using the ``path`` parameter when creating a new
:py:class:`~hdmf_zarr.backend.ZarrIO` instance. This document discusses key steps towards
integrating other data stores available for Zarr with :py:class:`~hdmf_zarr.backend.ZarrIO`.


Updating ZarrIO
===============

1. Import and add the new storage class to :py:attr:`~hdmf_zarr.backend.SUPPORTED_ZARR_STORES`.
This will in turn allow instances of your new storage class to be passed as a ``path`` parameter
to :py:meth:`~hdmf_zarr.backend.ZarrIO.__init__`
and :py:meth:`~hdmf_zarr.backend.ZarrIO.load_namespaces` and pass
:py:meth:`~hdmf.utils.docval` validation for these functions.

* If your store has a ``.path`` property then the :py:attr:`~hdmf.backends.io.HDMFIO.source` property
will be set accordingly in ``__init__`` in :py:class:`~hdmf_zarr.backend.ZarrIO`, otherwise
``__init__`` may need to be updated to set a correct ``source`` (used, e.g., to define links).

2. Update :py:meth:`~hdmf_zarr.backend.ZarrIO.open` and :py:meth:`~hdmf_zarr.backend.ZarrIO.close`
as necessary.

3. Depending on the type of data store, it may also be necessary to update the handling of links
   and references in :py:class:`~hdmf_zarr.backend.ZarrIO`. In principle, reading and writing of
   links should not need to change. However, the
   :py:meth:`~hdmf_zarr.backend.ZarrIO.__resolve_ref` and
   :py:meth:`~hdmf_zarr.backend.ZarrIO.get_builder_exists_on_disk`
   methods may need to be updated to ensure that
   references are opened correctly on read for files stored with your new store. The
   :py:meth:`~hdmf_zarr.backend.ZarrIO.__get_ref` function may also need to be updated, in
   particular if links to your store also modify the storage schema for links
   (e.g., if you need to store additional metadata in order to resolve links to your store).
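
The role of the ``.path`` property described in step 1 can be sketched as follows. This is an illustration under stated assumptions, not the actual ``ZarrIO.__init__`` code: ``HypotheticalStore``, ``SUPPORTED_STORES``, and ``resolve_source`` are made-up names standing in for a new store class, for ``SUPPORTED_ZARR_STORES``, and for the logic that derives ``source`` from the ``path`` argument.

```python
import os


class HypotheticalStore:
    """Stand-in for a new Zarr store class that exposes a ``.path`` attribute."""

    def __init__(self, path):
        self.path = path


# Illustrative stand-in for hdmf_zarr.backend.SUPPORTED_ZARR_STORES
SUPPORTED_STORES = (HypotheticalStore,)


def resolve_source(path):
    """Derive a ``source`` string from either a file path or a supported store."""
    if isinstance(path, str):
        return os.path.abspath(path)
    if isinstance(path, SUPPORTED_STORES):
        if hasattr(path, "path"):
            return os.path.abspath(path.path)
        # A store without ``.path`` would need custom handling here
        raise ValueError("Cannot determine source for store without a .path")
    raise TypeError("Unsupported type for path: %s" % type(path).__name__)


print(resolve_source(HypotheticalStore("my_file.zarr")))
```

Stores that are not backed by a directory (e.g., database stores) would take the ``ValueError`` branch in this sketch, which corresponds to the case where ``__init__`` needs to be updated to set a correct ``source``.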

Updating NWBZarrIO
==================

In most cases we should not need to update :py:class:`~hdmf_zarr.nwb.NWBZarrIO` as it inherits
directly from :py:class:`~hdmf_zarr.backend.ZarrIO`. However, in particular if the interface for
``__init__`` has changed for :py:class:`~hdmf_zarr.backend.ZarrIO`,
then we may also need to modify :py:class:`~hdmf_zarr.nwb.NWBZarrIO` accordingly.

Updating Unit Tests
===================

Much of the core test harness of ``hdmf_zarr`` is modularized to simplify running existing
tests with new storage backends. In this way, we can quickly create a collection of common tests
for new backends, and new test cases added to the test suite can be run with all backends.
The relevant test classes are located in the `/tests/unit <https://github.com/hdmf-dev/hdmf-zarr/tree/dev/tests/unit>`_
directory of the hdmf_zarr repository.

test_zarrio.py
--------------
`base_tests_zarrio.py <https://github.com/hdmf-dev/hdmf-zarr/blob/dev/tests/unit/base_tests_zarrio.py>`_
provides a collection of base classes that define common
test cases to test basic functionality of :py:class:`~hdmf_zarr.backend.ZarrIO`. Using these base classes, the
`test_zarrio.py <https://github.com/hdmf-dev/hdmf-zarr/blob/dev/tests/unit/test_io_zarr.py>`_ module
then implements concrete tests for various backends. To create tests for a new data store, follow
these steps:

1. **Create tests for new data store:** Add the following main classes (replacing ``<MyStore>`` in the code below with the class name of the new data store):

   .. code-block:: python

       #########################################
       #  <MyStore> tests
       #########################################
       class TestZarrWriter<MyStore>(BaseTestZarrWriter):
           """Test writing of builder with Zarr using a custom <MyStore>"""
           def setUp(self):
               super().setUp()
               self.store = <MyStore>()
               self.store_path = self.store.path

       class TestZarrWriteUnit<MyStore>(BaseTestZarrWriteUnit):
           """Unit test for individual write functions using a custom <MyStore>"""
           def setUp(self):
               super().setUp()
               self.store = <MyStore>()
               self.store_path = self.store.path

       class TestExportZarrToZarr<MyStore>(BaseTestExportZarrToZarr):
           """Test exporting Zarr to Zarr using <MyStore>."""
           def setUp(self):
               super().setUp()
               self.stores = [<MyStore>() for i in range(len(self.store_path))]
               self.store_paths = [s.path for s in self.stores]

   .. note::

       In the case of ``BaseTestZarrWriter`` and ``BaseTestZarrWriteUnit`` the ``self.store`` variable defines
       the data store to use with :py:class:`~hdmf_zarr.backend.ZarrIO` while running tests.
       ``self.store_path`` is used during ``tearDown`` to clean up files as well as in some cases
       to set up links in test ``Builders`` or if a test case requires opening a file with Zarr directly.
       ``BaseTestExportZarrToZarr`` tests exporting between Zarr data stores but requires 4 stores and
       paths to be specified via the ``self.store`` and ``self.store_path`` variable. To test export
       between your new backend, you can simply set up all 4 instances to the new store while using different
       storage paths for the different instances (which are saved in ``self.store_paths``).

2. **Update ``base_tests_zarrio.reopen_store``** If our new data store cannot be reused after
   it has been closed via :py:meth:`~hdmf_zarr.backend.ZarrIO.close`, then update the method
   to either reopen or create a new equivalent data store that can be used for read.
   The function is used in tests that write data, then close the ZarrIO, and
   create a new ZarrIO to read and validate the data.

3. **Run and update tests** Depending on your data store, some test cases in ``BaseTestZarrWriter``, ``BaseTestZarrWriteUnit``
   or ``BaseTestExportZarrToZarr`` may need to be updated to work correctly with your data store.
   Run the test suite and check for failing cases to determine whether the ``setUp`` in your
   test classes or any specific test cases need to be updated.
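
The idea behind step 2 can be sketched as follows. This is a simplified illustration, not the actual ``reopen_store`` implementation in ``base_tests_zarrio.py``: the store classes and the free function ``reopen_store`` below are hypothetical and only model a store that becomes unusable once it has been closed.

```python
class ReusableStore:
    """Hypothetical store that can still be used after close()."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def close(self):
        self.closed = True


class SingleUseStore(ReusableStore):
    """Hypothetical store that must be recreated once it has been closed."""


def reopen_store(store):
    """Return a store that is safe to use for reading after a write/close cycle."""
    if isinstance(store, SingleUseStore) and store.closed:
        # Create a new, equivalent store that points to the same location
        return SingleUseStore(store.path)
    # Stores that survive close() can be reused as-is
    return store


written = SingleUseStore("test_io.zarr")
written.close()
for_read = reopen_store(written)
print(for_read is written, for_read.path, for_read.closed)  # False test_io.zarr False
```

Keeping this logic in one helper means that individual test cases do not need to know whether a given store survives ``close``.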

test_io_convert.py
------------------
`test_io_convert.py <https://github.com/hdmf-dev/hdmf-zarr/blob/dev/tests/unit/test_io_convert.py>`_
uses a collection of mixin classes to define custom test classes to test export from one IO backend
to another. As such, the test cases here typically first write to one target, then export to
another target, and finally verify that the data in the two files is consistent.

1. **Update ``MixinTestHDF5ToZarr``, ``MixinTestZarrToHDF5``, and ``MixinTestZarrToZarr``**
mixin classes to add the new backend to the ``WRITE_PATHS`` (if Zarr is the initial write
target) and/or ``EXPORT_PATHS`` (if Zarr is the export target) variables to define our
store as a write or export store for :py:class:`~hdmf_zarr.backend.ZarrIO`, respectively.
Once we have added our new store as write/export targets to these mixins, all test cases
defined in the module will be run with our new backend. Specifically, we here commonly
need to add an instance of our new data store to:

* ``MixinTestHDF5ToZarr.EXPORT_PATHS``
* ``MixinTestZarrToHDF5.WRITE_PATHS``
* ``MixinTestZarrToZarr.WRITE_PATHS`` and ``MixinTestZarrToZarr.EXPORT_PATHS``

2. **Update tests and ZarrIO as necessary** Run the test suite and fix any identified issues.

42 changes: 33 additions & 9 deletions docs/source/storage.rst
@@ -1,11 +1,11 @@
.. _sec-zarr-storage:

========
Storage
========
=====================
Storage Specification
=====================

hdmf-zarr currently uses the Zarr :zarr-docs:`DirectoryStory <api/storage.html#zarr.storage.DirectoryStore>`,
which uses directories and files on a standard file system to serialize data. Below we describe how
hdmf-zarr currently uses the Zarr :zarr-docs:`DirectoryStore <api/storage.html#zarr.storage.DirectoryStore>`,
which uses directories and files on a standard file system to serialize data.

Format Mapping
==============
@@ -62,6 +62,14 @@ Groups
object ID Attribute ``object_id`` on the Zarr Group
============================ ======================================================================================

.. _sec-zarr-storage-groups-reserved:

Reserved groups
----------------

The :py:class:`~hdmf_zarr.backend.ZarrIO` backend typically caches the schema used to create a file in the
group ``/specifications`` (see also :ref:`sec-zarr-caching-specifications`).

.. _sec-zarr-storage-datasets:

Datasets
@@ -127,8 +135,9 @@ Reserved attributes
-------------------

The :py:class:`~hdmf_zarr.backend.ZarrIO` backend defines a set of reserved attribute names defined in
py:attr:`~hdmf_zarr.backend.ZarrIO.__reserve_attribute`. These reserved attributes are used to implement
functionality (e.g., links and object references) that are not natively supported by Zarr.
:py:attr:`~hdmf_zarr.backend.ZarrIO.__reserve_attribute`. These reserved attributes are used to implement
functionality (e.g., links and object references, which are not natively supported by Zarr) and may be
added on any Group or Dataset in the file.

============================ ======================================================================================
Reserved Attribute Name Usage
@@ -139,6 +148,16 @@ functionality (e.g., links and object references) that are not natively supporte
See :ref:`sec-zarr-storage-references`
============================ ======================================================================================

In addition, the following reserved attributes are added to the root Group of the file only:

============================ ======================================================================================
Reserved Attribute Name      Usage
============================ ======================================================================================
.specloc                     Attribute storing the path to the Group where the schema for the file is
                             cached. See :py:attr:`~hdmf_zarr.backend.SPEC_LOC_ATTR`
============================ ======================================================================================
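
As a rough illustration of how the ``.specloc`` attribute could be inspected for a file written with a directory-based store, the sketch below reads the root group's attributes directly from disk. It assumes the Zarr v2 on-disk layout, in which group attributes are stored as JSON in a ``.zattrs`` file; the helper ``read_spec_location`` and the example value ``"specifications"`` are illustrative assumptions, not part of the ``hdmf_zarr`` API.

```python
import json
import os
import tempfile

SPEC_LOC_ATTR = ".specloc"  # reserved attribute name from the table above


def read_spec_location(zarr_root_dir):
    """Return the value of the ``.specloc`` root attribute, or None if absent."""
    attrs_file = os.path.join(zarr_root_dir, ".zattrs")
    with open(attrs_file) as f:
        attrs = json.load(f)
    return attrs.get(SPEC_LOC_ATTR)


# Build a minimal stand-in for the root of a Zarr file (assumed example value)
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, ".zattrs"), "w") as f:
        json.dump({SPEC_LOC_ATTR: "specifications"}, f)
    spec_loc = read_spec_location(root)

print(spec_loc)  # specifications
```

In practice the attribute would normally be read through Zarr or :py:class:`~hdmf_zarr.backend.ZarrIO` rather than by parsing ``.zattrs`` by hand; the direct read is used here only to keep the sketch self-contained.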


.. _sec-zarr-storage-links:

Links
@@ -337,6 +356,8 @@ The mappings of data types is as follows
+--------------------------+------------------------------------+----------------+


.. _sec-zarr-caching-specifications:

Caching format specifications
=============================

@@ -345,8 +366,11 @@ directly in the Zarr file. Caching the specification in the file ensures that us
the specification directly if necessary without requiring external resources.
For the Zarr backend, caching of the schema is implemented as follows.

The Zarr backend adds the reserved top-level group ``/specifications`` in which all format specifications (including
extensions) are cached. The ``/specifications`` group contains for each specification namespace a subgroup
The :py:class:`~hdmf_zarr.backend.ZarrIO` backend adds the reserved top-level group ``/specifications``
in which all format specifications (including extensions) are cached. The default name for this group is
defined in :py:attr:`~hdmf_zarr.backend.DEFAULT_SPEC_LOC_DIR` and caching of
specifications is implemented in ``ZarrIO.__cache_spec``.
For each specification namespace, the ``/specifications`` group contains a subgroup
``/specifications/<namespace-name>/<version>`` in which the specification for a particular version of the namespace
is stored (e.g., ``/specifications/core/2.0.1`` in the case of the NWB core namespace at version 2.0.1).
The actual specification data is then stored as a JSON string in scalar datasets with a binary, variable-length string
2 changes: 1 addition & 1 deletion requirements-min.txt
@@ -1,4 +1,4 @@
hdmf==3.4.0
hdmf==3.5.0
zarr==2.11.0
numcodecs==0.9.1
pynwb==2.0.0
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,5 +1,5 @@
# pinned dependencies to reproduce an entire development environment to use HDMF-ZARR
hdmf==3.4.0
hdmf==3.5.0
zarr==2.11.0
numcodecs==0.9.1
pynwb==2.0.1
2 changes: 1 addition & 1 deletion setup.py
@@ -17,7 +17,7 @@


reqs = [
'hdmf>=3.4.0',
'hdmf>=3.5.0',
'zarr>=2.11.0',
'numcodecs>=0.9.1',
'pynwb>=2.0.0',