TermSet Integration with up-to-date HDMF (#880)
* cleaned up

* path/tests

* path

* ruff

* coverage and update

* link

* link

* clean

* Update CHANGELOG.md

* er gallery

* Update src/hdmf/container.py

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>

* feedback

* termset simplify

* Update requirements.txt

* Update requirements-min.txt

* Update requirements-dev.txt

* head

* Update requirements-dev.txt

* Update requirements-min.txt

* Update requirements.txt

* Update requirements-min.txt

* gallery

* gallery

* Update requirements-doc.txt

* Update conf.py

* Update conf.py

* sys

* sys

* import

* docc

* termset

* path

* req_doc

* source

* path

* path

* path

* path

* Update docs/gallery/plot_external_resources.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/example_term_set.yaml

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

* Update docs/gallery/plot_term_set.py

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

* Update requirements-min.txt

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_external_resources.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_external_resources.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update requirements-opt.txt

* Update docs/gallery/plot_external_resources.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update docs/gallery/plot_term_set.py

Co-authored-by: Ryan Ly <rly@lbl.gov>

* updates

* updates

* updates

* Update pyproject.toml

* Update requirements-min.txt

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update requirements-dev.txt

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update requirements-doc.txt

Co-authored-by: Ryan Ly <rly@lbl.gov>

* Update requirements-doc.txt

* updates

* Update requirements.txt

* Update tests/unit/test_term_set.py

* Update docs/gallery/plot_term_set.py

* Update docs/gallery/plot_term_set.py

* Update CHANGELOG.md

* Update src/hdmf/term_set.py

* Update docs/gallery/plot_term_set.py

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feedback

* ruff

* Update src/hdmf/term_set.py

---------

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>
Co-authored-by: Ryan Ly <rly@lbl.gov>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
4 people committed Jul 6, 2023
1 parent 82baf69 commit 0c01dd7
Showing 18 changed files with 824 additions and 10 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,11 @@
### New features and minor improvements
- Updated `ExternalResources` to have EntityKeyTable with updated tests/documentation and minor bug fix to ObjectKeyTable. @mavaylon1 [#872](https://github.com/hdmf-dev/hdmf/pull/872)
- Added warning for DynamicTableRegion links that are not added to the same parent as the original container object. @mavaylon1 [#891](https://github.com/hdmf-dev/hdmf/pull/891)
- Added the `TermSet` class along with integrated validation methods for any child of `AbstractContainer`, e.g., `VectorData`, `Data`, `DynamicTable`. @mavaylon1 [#880](https://github.com/hdmf-dev/hdmf/pull/880)

### Documentation and tutorial enhancements:

- Added tutorial for the new `TermSet` class @mavaylon1 [#880](https://github.com/hdmf-dev/hdmf/pull/880)

## HDMF 3.6.1 (May 18, 2023)

24 changes: 24 additions & 0 deletions docs/gallery/example_term_set.yaml
@@ -0,0 +1,24 @@
id: termset/species_example
name: Species
version: 0.0.1
prefixes:
NCBI_TAXON: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=
imports:
- linkml:types
default_range: string

enums:
Species:
permissible_values:
Homo sapiens:
description: the species is human
meaning: NCBI_TAXON:9606
Mus musculus:
description: the species is a house mouse
meaning: NCBI_TAXON:10090
Ursus arctos horribilis:
description: the species is a grizzly bear
meaning: NCBI_TAXON:116960
Myrmecophaga tridactyla:
description: the species is an anteater
meaning: NCBI_TAXON:71006
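The `meaning` CURIEs above combine with the `prefixes` block to form resolvable term URIs. A minimal plain-Python sketch of that expansion — it illustrates the schema structure only and is not part of the TermSet API:

```python
# Expand a CURIE such as NCBI_TAXON:9606 into a full URI using the
# schema's prefix map (values copied from the YAML above).
prefixes = {
    "NCBI_TAXON": "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=",
}

permissible_values = {
    "Homo sapiens": "NCBI_TAXON:9606",
    "Mus musculus": "NCBI_TAXON:10090",
}

def term_uri(meaning: str) -> str:
    """Split the CURIE into prefix and local id, then join with the base URI."""
    prefix, local_id = meaning.split(":", 1)
    return prefixes[prefix] + local_id

print(term_uri(permissible_values["Homo sapiens"]))
```

This is why the schema requires the prefix URI to point directly at the terms: appending the local id must yield a valid link to the term's source page.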
35 changes: 34 additions & 1 deletion docs/gallery/plot_external_resources.py
@@ -35,8 +35,10 @@
:py:class:`~hdmf.common.resources.Key`
* :py:class:`~hdmf.common.resources.FileTable` where each row describes a
:py:class:`~hdmf.common.resources.File`
* :py:class:`~hdmf.common.resources.EntityTable` where each row describes an
:py:class:`~hdmf.common.resources.Entity`
* :py:class:`~hdmf.common.resources.EntityKeyTable` where each row describes an
:py:class:`~hdmf.common.resources.EntityKey`
* :py:class:`~hdmf.common.resources.ObjectTable` where each row describes an
:py:class:`~hdmf.common.resources.Object`
* :py:class:`~hdmf.common.resources.ObjectKeyTable` where each row describes an
@@ -209,6 +211,7 @@ def __init__(self, **kwargs):
er.entities.to_dataframe()
er.keys.to_dataframe()
er.object_keys.to_dataframe()
er.entity_keys.to_dataframe()

###############################################################################
# Using the get_key method
@@ -320,3 +323,33 @@ def __init__(self, **kwargs):

er_read = ExternalResources.from_norm_tsv(path='./')
os.remove('./er.zip')

###############################################################################
# Using TermSet with ExternalResources
# ------------------------------------------------
# :py:class:`~hdmf.term_set.TermSet` provides an easier way to add references to
# :py:class:`~hdmf.common.resources.ExternalResources`. The term set's enumerations take the place
# of the entity_id and entity_uri parameters. :py:class:`~hdmf.common.resources.Key` values must
# match the name of a term in the :py:class:`~hdmf.term_set.TermSet`.
from hdmf.term_set import TermSet

try:
dir_path = os.path.dirname(os.path.abspath(__file__))
yaml_file = os.path.join(dir_path, 'example_term_set.yaml')
except NameError:
dir_path = os.path.dirname(os.path.abspath('.'))
yaml_file = os.path.join(dir_path, 'gallery/example_term_set.yaml')

terms = TermSet(term_schema_path=yaml_file)
col1 = VectorData(
name='Species_Data',
description='...',
data=['Homo sapiens', 'Ursus arctos horribilis'],
term_set=terms,
)

species = DynamicTable(name='species', description='My species', columns=[col1],)
er.add_ref_term_set(file=file,
container=species,
attribute='Species_Data',
)
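According to the implementation added in this commit, `add_ref_term_set` returns `True` when every value resolves against the term set, and a dictionary of unresolved terms otherwise. A hedged sketch of handling that return value — the helper name is invented for illustration, and the results are simulated since calling the real method needs a populated `ExternalResources` instance:

```python
def handle_add_ref_result(result):
    # add_ref_term_set (per this commit) returns True on full success, or
    # {"Missing Values in TermSet": [...]} listing terms it could not resolve.
    if result is True:
        return []
    return result["Missing Values in TermSet"]

# Simulated return values for both outcomes.
ok = handle_add_ref_result(True)
missing = handle_add_ref_result({"Missing Values in TermSet": ["Canis lupus"]})
print(ok, missing)
```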
146 changes: 146 additions & 0 deletions docs/gallery/plot_term_set.py
@@ -0,0 +1,146 @@
"""
TermSet
=======
This is a user guide for interacting with the
:py:class:`~hdmf.term_set.TermSet` class. The :py:class:`~hdmf.term_set.TermSet` type
is experimental and is subject to change in future releases. If you use this type,
please provide feedback to the HDMF team so that we can improve the structure and
overall capabilities.
Introduction
-------------
The :py:class:`~hdmf.term_set.TermSet` class provides a way for users to create their own
set of terms from brain atlases, species taxonomies, and anatomical, cell, and
gene function ontologies.
:py:class:`~hdmf.term_set.TermSet` serves two purposes: data validation and external reference
management. Users can validate their data against their own set of terms, ensuring
clean data that can later be used in line with the FAIR principles.
The :py:class:`~hdmf.term_set.TermSet` class allows for a reusable and sharable
pool of metadata to serve as references to any dataset.
The :py:class:`~hdmf.term_set.TermSet` class is used closely with
:py:class:`~hdmf.common.resources.ExternalResources` to more efficiently map terms
to data. Please refer to the tutorial on ExternalResources to see how :py:class:`~hdmf.term_set.TermSet`
is used with :py:class:`~hdmf.common.resources.ExternalResources`.
:py:class:`~hdmf.term_set.TermSet` is built upon resources from LinkML, a modeling
language that uses YAML-based schemas, giving :py:class:`~hdmf.term_set.TermSet`
a standardized structure and a variety of tools to help users manage their references.
How to make a TermSet Schema
----------------------------
Before users can take advantage of the :py:class:`~hdmf.term_set.TermSet` class, they need to create a
LinkML schema (YAML) that provides all the permissible term values. Please refer to
https://linkml.io/linkml/intro/tutorial06.html to learn more about how LinkML structures its schemas.
1. The name of the schema is up to the user, e.g., the name could be "Species" if the term set will
contain species terms.
2. The prefixes map the standardized prefix of each source to the base URI of its terms.
For example, the NCBI Taxonomy is abbreviated as NCBI_TAXON, and Ensembl is simply Ensembl.
As mentioned above, the URI needs to point to the terms; this allows the URI to later be coupled
with a term's source id to create a valid link to the term's source page.
3. The schema uses LinkML enumerations to list all the possible terms. Currently, users will need to
manually outline the terms within the enumeration's permissible values.
For a concrete example, please view the
`example_term_set.yaml <https://github.com/hdmf-dev/hdmf/blob/dev/docs/gallery/example_term_set.yaml>`_
used in this tutorial, which shows what a term set schema looks like.
"""
######################################################
# Creating an instance of the TermSet class
# ----------------------------------------------------
from hdmf.common import DynamicTable, VectorData
import os

try:
dir_path = os.path.dirname(os.path.abspath(__file__))
yaml_file = os.path.join(dir_path, 'example_term_set.yaml')
except NameError:
dir_path = os.path.dirname(os.path.abspath('.'))
yaml_file = os.path.join(dir_path, 'gallery/example_term_set.yaml')

######################################################
# Viewing TermSet values
# ----------------------------------------------------
# :py:class:`~hdmf.term_set.TermSet` has methods to retrieve terms. The :py:attr:`~hdmf.term_set.TermSet.view_set`
# property returns a dictionary of all the terms and the corresponding information for each term.
# Users can also index specific terms from the :py:class:`~hdmf.term_set.TermSet`. The LinkML runtime
# needs to be installed first; you can do so by running ``pip install linkml-runtime``.
from hdmf.term_set import TermSet
terms = TermSet(term_schema_path=yaml_file)
print(terms.view_set)

# Retrieve a specific term
terms['Homo sapiens']
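Indexing a term returns its associated information. The exact shape shown below is an assumption inferred from how `add_ref_term_set` indexes the result elsewhere in this commit (`term_info[0]` is used as the entity id and `term_info[2]` as the entity URI); the middle field is assumed to be the description:

```python
# Hypothetical lookup result for terms['Homo sapiens'], matching the
# indexing used by add_ref_term_set in this commit.
term_info = (
    "NCBI_TAXON:9606",       # entity id
    "the species is human",  # description (assumed position)
    "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606",  # entity URI
)
entity_id, description, entity_uri = term_info
print(entity_id, entity_uri)
```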

######################################################
# Validate Data with TermSet
# ----------------------------------------------------
# :py:class:`~hdmf.term_set.TermSet` has been integrated so that :py:class:`~hdmf.container.Data` and its
# subclasses support a term_set attribute. When this attribute is set, the initial data is validated,
# and all data added later is validated as well.
data = VectorData(
name='species',
description='...',
data=['Homo sapiens'],
term_set=terms)

######################################################
# Validate on append with TermSet
# ----------------------------------------------------
# As mentioned above, when the term_set attribute is set, all new data is validated. This is true for both
# the append and extend methods.
data.append('Ursus arctos horribilis')
data.extend(['Mus musculus', 'Myrmecophaga tridactyla'])
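The validation semantics described above can be sketched with a stand-in class — this is not the real `VectorData`, only an illustration of the assumed behavior: appends outside the term set raise `ValueError` and leave the stored data unchanged.

```python
class ValidatedList:
    """Stand-in for a term-set-validated column (illustration only)."""

    def __init__(self, data, allowed):
        self.allowed = set(allowed)
        self.data = []
        self.extend(data)  # initial data is validated too

    def append(self, value):
        if value not in self.allowed:
            raise ValueError(f"{value!r} is not in the term set")
        self.data.append(value)

    def extend(self, values):
        for v in values:
            self.append(v)

col = ValidatedList(["Homo sapiens"], allowed=["Homo sapiens", "Mus musculus"])
col.append("Mus musculus")
try:
    col.append("Canis lupus")  # not a permissible value
except ValueError as err:
    print(err)
```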

######################################################
# Validate Data in a DynamicTable with TermSet
# ----------------------------------------------------
# Whether data in a :py:class:`~hdmf.common.table.DynamicTable` is validated is determined by which
# columns were initialized with the term_set attribute set. The data is validated when the columns are
# created or modified. Since adding the columns to a DynamicTable does not modify the data, validation
# is not performed at that time.
col1 = VectorData(
name='Species_1',
description='...',
data=['Homo sapiens'],
term_set=terms,
)
col2 = VectorData(
name='Species_2',
description='...',
data=['Mus musculus'],
term_set=terms,
)
species = DynamicTable(name='species', description='My species', columns=[col1,col2])

######################################################
# Validate new rows in a DynamicTable with TermSet
# ----------------------------------------------------
# Validating new rows in a :py:class:`~hdmf.common.table.DynamicTable` is simple. The
# :py:func:`~hdmf.common.table.DynamicTable.add_row` method automatically checks each column for a
# :py:class:`~hdmf.term_set.TermSet` (via the term_set attribute). If the attribute is set, the data for
# that column is validated using that column's :py:class:`~hdmf.term_set.TermSet`. If any data is invalid,
# the row is not added, and the user is prompted to fix the new data in order to populate the table.
species.add_row(Species_1='Mus musculus', Species_2='Mus musculus')
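A sketch of the per-row validation described above, using plain dictionaries rather than the real `DynamicTable` (assumed behavior: every column is checked before any value is stored, so one invalid value rejects the whole row):

```python
# Per-column permissible values, mirroring the two columns built above.
term_sets = {
    "Species_1": {"Homo sapiens", "Mus musculus"},
    "Species_2": {"Homo sapiens", "Mus musculus"},
}
table = {name: [] for name in term_sets}

def add_row(**values):
    # Validate every column first so a bad value rejects the whole row.
    bad = [v for name, v in values.items() if v not in term_sets[name]]
    if bad:
        raise ValueError(f"invalid terms: {bad}")
    for name, v in values.items():
        table[name].append(v)

add_row(Species_1="Mus musculus", Species_2="Mus musculus")
try:
    add_row(Species_1="Canis lupus", Species_2="Homo sapiens")
except ValueError as err:
    print(err)
print(table)
```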

######################################################
# Validate new columns in a DynamicTable with TermSet
# ----------------------------------------------------
# As mentioned above, validation in a :py:class:`~hdmf.common.table.DynamicTable` is determined
# by its columns. The :py:func:`~hdmf.common.table.DynamicTable.add_column` method accepts a term_set
# argument, just as when creating a new instance of :py:class:`~hdmf.common.table.VectorData`. When set,
# this argument is used to validate the data, and the column will not be added if any data is invalid.
col1 = VectorData(
name='Species_1',
description='...',
data=['Homo sapiens'],
term_set=terms,
)
species = DynamicTable(name='species', description='My species', columns=[col1])
species.add_column(name='Species_2',
description='Species data',
data=['Mus musculus'],
term_set=terms)
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -43,6 +43,8 @@ dynamic = ["version"]

[project.optional-dependencies]
zarr = ["zarr>=2.12.0"]
tqdm = ["tqdm>=4.41.0"]
linkml = ["linkml-runtime>=1.5.0"]

[project.urls]
"Homepage" = "https://github.com/hdmf-dev/hdmf"
1 change: 1 addition & 0 deletions requirements-doc.txt
@@ -4,3 +4,4 @@ sphinx>=4 # improved support for docutils>=0.17
sphinx_rtd_theme>=1 # <1 does not work with docutils>=0.17
sphinx-gallery
sphinx-copybutton
linkml-runtime==1.5.0
5 changes: 4 additions & 1 deletion requirements-min.txt
@@ -2,8 +2,11 @@
h5py==2.10 # support for selection of datasets with list of indices added in 2.10
importlib-metadata==4.2.0; python_version < "3.8" # TODO: remove when minimum python version is 3.8
importlib-resources==5.12.0; python_version < "3.9" # TODO: remove when minimum python version is 3.9
jsonschema==2.6.0
jsonschema==3.2.0
numpy==1.16 # numpy>=1.16,<1.18 does not provide wheels for python 3.8 and does not build well on windows
pandas==1.0.5 # when this is changed to >=1.5.0, see TODO items referenced in #762
ruamel.yaml==0.16
scipy==1.1 # scipy>=1.1,<1.4 does not provide wheels for python 3.8 and building scipy can fail due to incompatibilities with numpy
linkml-runtime==1.5.0
tqdm==4.41.0
zarr==2.12.0
1 change: 1 addition & 0 deletions requirements-opt.txt
@@ -1,3 +1,4 @@
# pinned dependencies that are optional. used to reproduce an entire development environment to use HDMF
tqdm==4.65.0
zarr==2.14.2
linkml-runtime==1.5.0
1 change: 1 addition & 0 deletions src/hdmf/__init__.py
@@ -3,6 +3,7 @@
from .container import Container, Data, DataRegion, ExternalResourcesManager
from .region import ListSlicer
from .utils import docval, getargs
from .term_set import TermSet


@docval(
76 changes: 75 additions & 1 deletion src/hdmf/common/resources.py
@@ -2,9 +2,11 @@
import numpy as np
from . import register_class, EXP_NAMESPACE
from . import get_type_map
from ..container import Table, Row, Container, AbstractContainer, ExternalResourcesManager
from ..container import Table, Row, Container, AbstractContainer, Data, ExternalResourcesManager
from ..data_utils import DataIO
from ..utils import docval, popargs, AllowPositional
from ..build import TypeMap
from ..term_set import TermSet
from glob import glob
import os
import zipfile
@@ -405,6 +407,78 @@ def _get_file_from_container(self, **kwargs):
msg = 'Could not find file. Add container to the file.'
raise ValueError(msg)

@docval({'name': 'file', 'type': ExternalResourcesManager, 'doc': 'The file associated with the container.',
'default': None},
{'name': 'container', 'type': (str, AbstractContainer), 'default': None,
'doc': ('The Container/Data object that uses the key or '
'the object_id for the Container/Data object that uses the key.')},
{'name': 'attribute', 'type': str,
'doc': 'The attribute of the container for the external reference.', 'default': None},
{'name': 'field', 'type': str, 'default': '',
'doc': ('The field of the compound data type using an external resource.')},
{'name': 'key', 'type': (str, Key), 'default': None,
'doc': 'The name of the key or the Key object from the KeyTable for the key to add a resource for.'},
{'name': 'term_set', 'type': TermSet, 'default': None,
'doc': 'The TermSet to be used if the container/attribute does not have one.'}
)
def add_ref_term_set(self, **kwargs):
file = kwargs['file']
container = kwargs['container']
attribute = kwargs['attribute']
key = kwargs['key']
field = kwargs['field']
term_set = kwargs['term_set']

if term_set is None:
if attribute is None:
try:
term_set = container.term_set
except AttributeError:
msg = "Cannot Find TermSet"
raise AttributeError(msg)
else:
term_set = container[attribute].term_set
if term_set is None:
msg = "Cannot Find TermSet"
raise ValueError(msg)

if file is None:
file = self._get_file_from_container(container=container)

# if key is provided then add_ref proceeds as normal
# use key provided as the term in the term_set for entity look-up
if key is not None:
data = [key]
else:
if attribute is None:
data_object = container
else:
data_object = container[attribute]
if isinstance(data_object, (Data, DataIO)):
data = data_object.data
elif isinstance(data_object, (list, np.ndarray)):
data = data_object
missing_terms = []
for term in data:
try:
term_info = term_set[term]
except ValueError:
missing_terms.append(term)
continue
entity_id = term_info[0]
entity_uri = term_info[2]
self.add_ref(file=file,
container=container,
attribute=attribute,
key=term,
field=field,
entity_id=entity_id,
entity_uri=entity_uri)
if len(missing_terms)>0:
return {"Missing Values in TermSet": missing_terms}
else:
return True

@docval({'name': 'key_name', 'type': str, 'doc': 'The name of the Key to get.'},
{'name': 'file', 'type': ExternalResourcesManager, 'doc': 'The file associated with the container.',
'default': None},
