Warning
This repository is deprecated. Please file issues in one of the following repositories instead:
- NMDC Schema for schema changes/issues.
- NMDC Ontology for issues related to producing terms or term subsets.
- NMDC Runtime for issues related to ETL.
Metadata management for the National Microbiome Data Collaborative
The purpose of this repository is to manage metadata for the National Microbiome Data Collaborative (NMDC). The NMDC is a multi-organizational effort to enable integrated microbiome data across diverse areas in medicine, agriculture, bioenergy, and the environment. This integrated platform facilitates comprehensive discovery of and access to multidisciplinary microbiome data in order to unlock new possibilities with microbiome data science.
Tasks managed by the repository are:
- Generating the schema
- Deploying the documentation
- Integrating metadata from multiple environmental data repositories
The NMDC Introduction to metadata and ontologies primer describes the context for this project.
See the slides describing the schema
The NMDC schema is used during the translation process to specify how metadata elements are related.
The schema is also available as:
Documentation for the NMDC schema can be browsed here:
A zipped file of the NMDC can be downloaded here (JSON format).
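As a minimal sketch of how a zipped JSON download could be read without unpacking it to disk, the snippet below uses only the Python standard library. The helper name `read_json_from_zip` and the member name `hypothetical_metadata.json` are assumptions made for illustration; the actual archive layout is not described here, so the demo builds a small in-memory archive instead of downloading anything.

```python
import io
import json
import zipfile


def read_json_from_zip(zip_source, member=None):
    """Load one JSON document from a zip archive.

    zip_source may be a path or a file-like object. If member is None,
    the first archive entry ending in ".json" is used.
    """
    with zipfile.ZipFile(zip_source) as archive:
        if member is None:
            json_names = [n for n in archive.namelist() if n.endswith(".json")]
            if not json_names:
                raise ValueError("archive contains no .json member")
            member = json_names[0]
        with archive.open(member) as handle:
            return json.load(handle)


# Demo with a hypothetical in-memory archive (not real NMDC data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "hypothetical_metadata.json",
        json.dumps({"records": [{"id": "example:1"}]}),
    )
buffer.seek(0)

data = read_json_from_zip(buffer)
print(len(data["records"]))
```

To use this against a downloaded file, pass its local path in place of `buffer`.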
We use SSSOM to map fields in primary data sources to standard terms. The mapping between the GOLD data and MIxS terms is recorded in this SSSOM file.
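An SSSOM mapping set is commonly serialized as a tab-separated table, with set-level metadata in `#`-prefixed header lines and one mapping per row in columns such as `subject_id`, `predicate_id`, and `object_id`. The following is a rough sketch of turning such a table into a field lookup. The sample rows and identifiers (`GOLD:hypothetical_depth` and so on) are invented for illustration and are not taken from the project's mapping file; the helper names are likewise hypothetical.

```python
import csv

# Hypothetical SSSOM TSV content; identifiers and rows are illustrative only.
SAMPLE_SSSOM = """\
# mapping_set_id: https://example.org/hypothetical-gold-to-mixs
subject_id\tsubject_label\tpredicate_id\tobject_id\tobject_label\tmapping_justification
GOLD:hypothetical_depth\tdepth\tskos:exactMatch\tMIXS:hypothetical_depth\tdepth\tsemapv:ManualMappingCuration
GOLD:hypothetical_habitat\thabitat\tskos:closeMatch\tMIXS:hypothetical_env\tenvironment\tsemapv:ManualMappingCuration
"""


def load_sssom(text):
    """Return mapping rows as dicts, skipping '#'-prefixed metadata lines."""
    data_lines = [
        line for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    return list(csv.DictReader(data_lines, delimiter="\t"))


def exact_matches(rows):
    """Build a source-field to standard-term lookup from exact matches only."""
    return {
        row["subject_id"]: row["object_id"]
        for row in rows
        if row["predicate_id"] == "skos:exactMatch"
    }


rows = load_sssom(SAMPLE_SSSOM)
lookup = exact_matches(rows)
print(lookup)
```

Restricting the lookup to `skos:exactMatch` is a design choice for this sketch; a translation step might treat weaker predicates such as `skos:closeMatch` differently or send them to manual review.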
Entities in the schema are annotated with characteristics. When possible, we use standard terminologies and ontologies to define these characteristics. These standards include:
We are actively involved in updating the MIxS standards (mixs-ng) and creating an RDF version of MIxS (mixs-rdf).
See also our analysis of MIxS descriptors
At present, we ingest metadata from the Joint Genome Institute (JGI) and the Environmental Molecular Sciences Lab (EMSL).
The NMDC schema and translation process will be modified as more metadata sources become available.
We use Jupyter notebooks to integrate the metadata sources. This allows us to iterate quickly in a transparent and interactive manner as new metadata sources become available.
Development of a more comprehensive ETL pipeline will progress as the metadata sources and schema become more concrete.
See identifiers documentation