TermSetWrapper and write support (#950)
* working concept

* minor cleaning

* foo file

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* checkpoint

* checkpoint

* Update src/hdmf/utils.py

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>

* clean up

* checkpoint

* tests placeholders

* checkpoint

* placeholder

* placeholder

* placeholder

* working write and herd

* cleanup

* checkpoint on updating append

* integrate append

* test checkpoint

* test checkpoint

* test fixes

* termset tests

* termset tests

* termset tests

* checkpoint/remove field_name

* cleanup

* make sure things pass without bad tests

* cleanup

* temp fix for test

* termset tutorial

* tests and bug fix on write

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests and bug fix on write

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ruff

* bug fix

* doc

* doc

* Update test_docval.py

* tests

* tests

* tests

* Update utils.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update utils.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update utils.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* ryan feedback

* Update src/hdmf/build/objectmapper.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* tutorial

* Update CHANGELOG.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test next

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* format

* validation changes

* Update tests/unit/test_term_set.py

* clean up

* Update io.py

* Update CHANGELOG.md

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>

* tuple change

* Update tests/unit/test_term_set.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update src/hdmf/term_set.py

* test feedback

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>
Co-authored-by: Ryan Ly <rly@lbl.gov>
4 people committed Sep 28, 2023
1 parent ddc842b commit e1105e4
Showing 18 changed files with 522 additions and 267 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,9 @@
 
 ## HDMF 3.9.1 (Upcoming)
 
+### Enhancements
+- Updated `TermSet` to be used with `TermSetWrapper`, allowing for general use of validation for datasets and attributes. This also brings updates to `HERD` integration and updates on `write` to easily add references for wrapped datasets/attributes. @mavaylon1 [#950](https://github.com/hdmf-dev/hdmf/pull/950)
+
 ### Minor improvements
 - Removed warning when namespaces are loaded and the attribute marking where the specs are cached is missing. @bendichter [#926](https://github.com/hdmf-dev/hdmf/pull/926)
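A minimal sketch of the wrapper usage this entry describes, mirroring the tutorial changes below (the schema path is illustrative):

from hdmf.common import VectorData
from hdmf.term_set import TermSet, TermSetWrapper

terms = TermSet(term_schema_path='example_term_set.yaml')  # hypothetical schema file
data = VectorData(
    name='species',
    description='...',
    data=TermSetWrapper(value=['Homo sapiens'], termset=terms),  # validated against `terms`
)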
110 changes: 59 additions & 51 deletions docs/gallery/plot_term_set.py
@@ -3,8 +3,9 @@
 =======
 This is a user guide for interacting with the
-:py:class:`~hdmf.term_set.TermSet` class. The :py:class:`~hdmf.term_set.TermSet` type
-is experimental and is subject to change in future releases. If you use this type,
+:py:class:`~hdmf.term_set.TermSet` and :py:class:`~hdmf.term_set.TermSetWrapper` classes.
+The :py:class:`~hdmf.term_set.TermSet` and :py:class:`~hdmf.term_set.TermSetWrapper` types
+are experimental and are subject to change in future releases. If you use these types,
 please provide feedback to the HDMF team so that we can improve the structure and
 overall capabilities.
@@ -14,15 +15,18 @@
 set of terms from brain atlases, species taxonomies, and anatomical, cell, and
 gene function ontologies.
 :py:class:`~hdmf.term_set.TermSet` serves two purposes: data validation and external reference
-management. Users will be able to validate their data to their own set of terms, ensuring
+Users will be able to validate their data and attributes to their own set of terms, ensuring
 clean data to be used inline with the FAIR principles later on.
-The :py:class:`~hdmf.term_set.TermSet` class allows for a reusable and sharable
-pool of metadata to serve as references to any dataset.
+The :py:class:`~hdmf.term_set.TermSet` class allows for a reusable and sharable
+pool of metadata to serve as references for any dataset or attribute.
 The :py:class:`~hdmf.term_set.TermSet` class is used closely with
-:py:class:`~hdmf.common.resources.ExternalResources` to more efficiently map terms
-to data. Please refer to the tutorial on ExternalResources to see how :py:class:`~hdmf.term_set.TermSet`
-is used with :py:class:`~hdmf.common.resources.ExternalResources`.
+:py:class:`~hdmf.common.resources.HERD` to more efficiently map terms
+to data.
+In order to actually use a :py:class:`~hdmf.term_set.TermSet`, users will use the
+:py:class:`~hdmf.term_set.TermSetWrapper` to wrap data and attributes. The
+:py:class:`~hdmf.term_set.TermSetWrapper` uses a user-provided :py:class:`~hdmf.term_set.TermSet`
+to perform validation.
 :py:class:`~hdmf.term_set.TermSet` is built upon the resources from LinkML, a modeling
 language that uses YAML-based schema, giving :py:class:`~hdmf.term_set.TermSet`
@@ -68,7 +72,7 @@
     import linkml_runtime  # noqa: F401
 except ImportError as e:
     raise ImportError("Please install linkml-runtime to run this example: pip install linkml-runtime") from e
-from hdmf.term_set import TermSet
+from hdmf.term_set import TermSet, TermSetWrapper
 
 try:
     dir_path = os.path.dirname(os.path.abspath(__file__))
@@ -114,71 +118,75 @@
 terms['Homo sapiens']
 
 ######################################################
-# Validate Data with TermSet
+# Validate Data with TermSetWrapper
 # ----------------------------------------------------
-# :py:class:`~hdmf.term_set.TermSet` has been integrated so that :py:class:`~hdmf.container.Data` and its
-# subclasses support a term_set attribute. By having this attribute set, the data will be validated
-# and all new data will be validated.
+# :py:class:`~hdmf.term_set.TermSetWrapper` can be wrapped around data.
+# To validate data, the user will set the data to the wrapped data, in which validation must pass
+# for the data object to be created.
 data = VectorData(
     name='species',
     description='...',
-    data=['Homo sapiens'],
-    term_set=terms)
+    data=TermSetWrapper(value=['Homo sapiens'], termset=terms)
+)
 
 ######################################################
-# Validate on append with TermSet
+# Validate Attributes with TermSetWrapper
 # ----------------------------------------------------
-# As mentioned prior, when the term_set attribute is set, then all new data is validated. This is true for both
-# append and extend methods.
+# Similar to wrapping datasets, :py:class:`~hdmf.term_set.TermSetWrapper` can be wrapped around any attribute.
+# To validate attributes, the user will set the attribute to the wrapped value, in which validation must pass
+# for the object to be created.
+data = VectorData(
+    name='species',
+    description=TermSetWrapper(value='Homo sapiens', termset=terms),
+    data=['Human']
+)
+
+######################################################
+# Validate on append with TermSetWrapper
+# ----------------------------------------------------
+# As mentioned prior, when using a :py:class:`~hdmf.term_set.TermSetWrapper`, all new data is validated.
+# This is true for adding new data with append and extend.
 data = VectorData(
     name='species',
     description='...',
-    data=['Homo sapiens'],
-    term_set=terms)
+    data=TermSetWrapper(value=['Homo sapiens'], termset=terms)
+)
 
 data.append('Ursus arctos horribilis')
 data.extend(['Mus musculus', 'Myrmecophaga tridactyla'])
 
 ######################################################
-# Validate Data in a DynamicTable with TermSet
+# Validate Data in a DynamicTable
 # ----------------------------------------------------
-# Validating data with :py:class:`~hdmf.common.table.DynamicTable` is determined by which columns were
-# initialized with the term_set attribute set. The data is validated when the columns are created or
-# modified. Since adding the columns to a DynamicTable does not modify the data, validation is
-# not being performed at that time.
+# Validating data for :py:class:`~hdmf.common.table.DynamicTable` is determined by which columns were
+# initialized with a :py:class:`~hdmf.term_set.TermSetWrapper`. The data is validated when the columns
+# are created and modified using ``DynamicTable.add_row``.
 col1 = VectorData(
     name='Species_1',
     description='...',
-    data=['Homo sapiens'],
-    term_set=terms,
+    data=TermSetWrapper(value=['Homo sapiens'], termset=terms),
 )
 col2 = VectorData(
     name='Species_2',
     description='...',
-    data=['Mus musculus'],
-    term_set=terms,
+    data=TermSetWrapper(value=['Mus musculus'], termset=terms),
 )
 species = DynamicTable(name='species', description='My species', columns=[col1,col2])
 
-######################################################
-# Validate new rows in a DynamicTable with TermSet
-# ----------------------------------------------------
+##########################################################
+# Validate new rows in a DynamicTable with TermSetWrapper
+# --------------------------------------------------------
 # Validating new rows to :py:class:`~hdmf.common.table.DynamicTable` is simple. The
 # :py:func:`~hdmf.common.table.DynamicTable.add_row` method will automatically check each column for a
-# :py:class:`~hdmf.term_set.TermSet` (via the term_set attribute). If the attribute is set, the the data will be
-# validated for that column using that column's :py:class:`~hdmf.term_set.TermSet`. If there is invalid data, the
+# :py:class:`~hdmf.term_set.TermSetWrapper`. If a wrapper is being used, then the data will be
+# validated for that column using that column's :py:class:`~hdmf.term_set.TermSet` from the
+# :py:class:`~hdmf.term_set.TermSetWrapper`. If there is invalid data, the
 # row will not be added and the user will be prompted to fix the new data in order to populate the table.
 species.add_row(Species_1='Mus musculus', Species_2='Mus musculus')
 
-######################################################
-# Validate new columns in a DynamicTable with TermSet
-# ----------------------------------------------------
-# As mentioned prior, validating in a :py:class:`~hdmf.common.table.DynamicTable` is determined
-# by the columns. The :py:func:`~hdmf.common.table.DynamicTable.add_column` method has a term_set attribute
-# as if you were making a new instance of :py:class:`~hdmf.common.table.VectorData`. When set, this attribute
-# will be used to validate the data. The column will not be added if there is invalid data.
-col1 = VectorData(
-    name='Species_1',
-    description='...',
-    data=['Homo sapiens'],
-    term_set=terms,
-)
-species = DynamicTable(name='species', description='My species', columns=[col1])
-species.add_column(name='Species_2',
-                   description='Species data',
-                   data=['Mus musculus'],
-                   term_set=terms)
+#############################################################
+# Validate new columns in a DynamicTable with TermSetWrapper
+# -----------------------------------------------------------
+# To add a column that is validated using :py:class:`~hdmf.term_set.TermSetWrapper`,
+# wrap the data in the :py:func:`~hdmf.common.table.DynamicTable.add_column`
+# method as if you were making a new instance of :py:class:`~hdmf.common.table.VectorData`.
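A minimal sketch of the wrapped add_column call the new text describes (not part of the diff; it assumes a fresh table that does not yet contain a Species_2 column, reusing col1 and terms from above):

species = DynamicTable(name='species', description='My species', columns=[col1])
species.add_column(
    name='Species_2',
    description='Species data',
    data=TermSetWrapper(value=['Mus musculus'], termset=terms),
)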
2 changes: 1 addition & 1 deletion src/hdmf/__init__.py
@@ -3,7 +3,7 @@
 from .container import Container, Data, DataRegion, HERDManager
 from .region import ListSlicer
 from .utils import docval, getargs
-from .term_set import TermSet
+from .term_set import TermSet, TermSetWrapper
 
 
 @docval(
12 changes: 10 additions & 2 deletions src/hdmf/backends/hdf5/h5tools.py
@@ -17,6 +17,7 @@
 from ...build import (Builder, GroupBuilder, DatasetBuilder, LinkBuilder, BuildManager, RegionBuilder,
                       ReferenceBuilder, TypeMap, ObjectMapper)
 from ...container import Container
+from ...term_set import TermSetWrapper
 from ...data_utils import AbstractDataChunkIterator
 from ...spec import RefSpec, DtypeSpec, NamespaceCatalog
 from ...utils import docval, getargs, popargs, get_data_shape, get_docval, StrDataset
@@ -63,7 +64,7 @@ def can_read(path):
              'doc': 'a pre-existing h5py.File, S3File, or RemFile object', 'default': None},
             {'name': 'driver', 'type': str, 'doc': 'driver for h5py to use when opening HDF5 file', 'default': None},
             {'name': 'herd_path', 'type': str,
-             'doc': 'The path to the HERD', 'default': None},)
+             'doc': 'The path to read/write the HERD file', 'default': None},)
     def __init__(self, **kwargs):
         """Open an HDF5 file for IO.
         """
@@ -359,7 +360,10 @@ def copy_file(self, **kwargs):
              'default': True},
             {'name': 'exhaust_dci', 'type': bool,
              'doc': 'If True (default), exhaust DataChunkIterators one at a time. If False, exhaust them concurrently.',
-             'default': True})
+             'default': True},
+            {'name': 'herd', 'type': 'HERD',
+             'doc': 'A HERD object to populate with references.',
+             'default': None})
     def write(self, **kwargs):
         """Write the container to an HDF5 file."""
         if self.__mode == 'r':
@@ -1096,6 +1100,10 @@ def write_dataset(self, **kwargs):  # noqa: C901
             data = data.data
         else:
             options['io_settings'] = {}
+        if isinstance(data, TermSetWrapper):
+            # This is for when the wrapped item is a dataset
+            # (refer to objectmapper.py for wrapped attributes)
+            data = data.value
         attributes = builder.attributes
         options['dtype'] = builder.dtype
         dset = None
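A hedged usage sketch of the new herd keyword (not part of the diff; file names are illustrative, and species is the wrapped table from the tutorial above):

from hdmf.common import HERD, get_manager
from hdmf.backends.hdf5 import HDF5IO

herd = HERD()  # a pre-existing HERD to extend with references from wrapped fields
with HDF5IO('test.h5', mode='w', manager=get_manager(), herd_path='./HERD.zip') as io:
    io.write(species, herd=herd)  # herd_path is where the populated HERD is saved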
31 changes: 20 additions & 11 deletions src/hdmf/backends/io.py
@@ -22,7 +22,7 @@ def can_read(path):
             {"name": "source", "type": (str, Path),
              "doc": "the source of container being built i.e. file path", 'default': None},
             {'name': 'herd_path', 'type': str,
-             'doc': 'The path to the HERD', 'default': None},)
+             'doc': 'The path to read/write the HERD file', 'default': None},)
     def __init__(self, **kwargs):
         manager, source, herd_path = getargs('manager', 'source', 'herd_path', kwargs)
         if isinstance(source, Path):
@@ -74,20 +74,29 @@ def read(self, **kwargs):
 
         return container
 
-    @docval({'name': 'container', 'type': Container, 'doc': 'the Container object to write'}, allow_extra=True)
+    @docval({'name': 'container', 'type': Container, 'doc': 'the Container object to write'},
+            {'name': 'herd', 'type': 'HERD',
+             'doc': 'A HERD object to populate with references.',
+             'default': None}, allow_extra=True)
     def write(self, **kwargs):
-        """Write a container to the IO source."""
         container = popargs('container', kwargs)
-        f_builder = self.__manager.build(container, source=self.__source, root=True)
-        self.write_builder(f_builder, **kwargs)
+        herd = popargs('herd', kwargs)
 
         """Optional: Write HERD."""
         if self.herd_path is not None:
-            herd = container.get_linked_resources()
-            if herd is not None:
-                herd.to_zip(path=self.herd_path)
-            else:
-                msg = "Could not find linked HERD. Container was still written to IO source."
-                warn(msg)
+            # If HERD is not provided, create a new one, else extend existing one
+            if herd is None:
+                from hdmf.common import HERD
+                herd = HERD(type_map=self.manager.type_map)
+
+            # add_ref_term_set to search for and resolve the TermSetWrapper
+            herd.add_ref_term_set(container)  # container would be the NWBFile
+            # write HERD
+            herd.to_zip(path=self.herd_path)
+
+        """Write a container to the IO source."""
+        f_builder = self.__manager.build(container, source=self.__source, root=True)
+        self.write_builder(f_builder, **kwargs)
 
     @docval({'name': 'src_io', 'type': 'HDMFIO', 'doc': 'the HDMFIO object for reading the data to export'},
             {'name': 'container', 'type': Container,
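And a sketch of the path through this code when no herd is passed (paths illustrative): with herd_path set, write creates a fresh HERD, resolves every TermSetWrapper in the container via add_ref_term_set, and writes the sidecar zip.

with HDF5IO('test.h5', mode='w', manager=get_manager(), herd_path='./HERD.zip') as io:
    io.write(species)  # no herd given: a new HERD is created internally and saved to ./HERD.zip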
4 changes: 3 additions & 1 deletion src/hdmf/build/objectmapper.py
@@ -12,6 +12,7 @@
 from .manager import Proxy, BuildManager
 from .warnings import MissingRequiredBuildWarning, DtypeConversionWarning, IncorrectQuantityBuildWarning
 from ..container import AbstractContainer, Data, DataRegion
+from ..term_set import TermSetWrapper
 from ..data_utils import DataIO, AbstractDataChunkIterator
 from ..query import ReferenceResolver
 from ..spec import Spec, AttributeSpec, DatasetSpec, GroupSpec, LinkSpec, RefSpec
@@ -564,6 +565,8 @@ def get_attr_value(self, **kwargs):
                 msg = ("%s '%s' does not have attribute '%s' for mapping to spec: %s"
                        % (container.__class__.__name__, container.name, attr_name, spec))
                 raise ContainerConfigurationError(msg)
+            if isinstance(attr_val, TermSetWrapper):
+                attr_val = attr_val.value
             if attr_val is not None:
                 attr_val = self.__convert_string(attr_val, spec)
             spec_dt = self.__get_data_type(spec)
@@ -937,7 +940,6 @@ def __add_attributes(self, builder, attributes, container, build_manager, source
             if attr_value is None:
                 self.logger.debug(" Skipping empty attribute")
                 continue
-
             builder.set_attribute(spec.name, attr_value)
 
     def __set_attr_to_ref(self, builder, attr_value, build_manager, spec):
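For illustration, a sketch of what this unwrapping means for a wrapped attribute (values reuse terms from the tutorial above):

data = VectorData(
    name='species',
    description=TermSetWrapper(value='Homo sapiens', termset=terms),
    data=['Human'],
)
# During build, get_attr_value() substitutes the wrapper's .value, so the
# builder (and the written file) stores the plain string 'Homo sapiens'.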
