A BIOM to Linked Data mapping and dataset
data
doc
lib
ontology
result
sparql
src/eu/genomic/resources/biom2ld
test/eu/genomic/resources/biom2ld/HDF5
.gitignore
BIOMhdf52rdf.jar
BIOMhdf52rdf.sh
README.md
VIRTUOSO_HOWTO.md

BIOM-LD

Summary

This is a first hacky attempt at a full mapping of the BIOM format to Linked Data, and beyond.

Rationale

The Biological Observation Matrix (BIOM) format is an HDF5 and JSON based specification for representing biological observation tables. For example, the Earth Microbiome Project uses it to store environmental metagenomics data in a common and efficient format. A BIOM table can be converted to a Linked Data dataset (from HDF5+JSON to RDF+OWL with HTTP resolvable URIs), yielding a representation that can be plugged directly into the web of data.

Moving to a Linked Data representation has the following advantages, especially from the publication and interoperability point of view (though not from the point of view of efficient storage):

  • Publishing our data in Linked Data means that other datasets can be linked to ours (i.e. our dataset becomes more "discoverable" over the web) or we can link our dataset to other datasets and integrate information easily.
  • Since we are using RDF, it is easy to merge our datasets with other datasets. This is especially interesting if common vocabularies like EnvO or NCBI taxonomy are used to represent row and column metadata, and/or SPARQL federated queries or SPARQL R are used to query the data.
  • Since the BIOM specification is represented as an OWL ontology, the specification, rather than being pure text, becomes computationally explicit: programs that consume BIOM data can be more easily written, reasoning can be used to check validity, specific validators (e.g. for metadata, value ranges, ... ) can be more easily written, and in general any programmatic endeavour becomes easier and more maintainable in the long term.
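The merging point above can be illustrated with a minimal sketch in plain Java, with no RDF library involved. It treats each dataset as a set of N-Triples lines, so merging is a set union (this simplification only holds for graphs without blank nodes). All URIs except the EnvO PURL pattern are hypothetical placeholders, the EnvO term ID is used purely for illustration, and `envTerm` and `collectedFrom` are made-up property names, not part of the BIOM-LD ontology.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: merging two RDF datasets that share a vocabulary term. */
class RdfMergeSketch {

    /** Union of two graphs given as N-Triples lines (assumes no blank nodes). */
    static Set<String> merge(Collection<String> first, Collection<String> second) {
        Set<String> merged = new LinkedHashSet<>(first);
        merged.addAll(second); // duplicate statements collapse automatically
        return merged;
    }

    /** Returns every statement that mentions the given URI. */
    static List<String> mentioning(Set<String> graph, String uri) {
        List<String> hits = new ArrayList<>();
        for (String line : graph) {
            if (line.contains("<" + uri + ">")) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Illustrative EnvO term ID; both datasets point at the same URI.
        String term = "http://purl.obolibrary.org/obo/ENVO_00002149";

        // Hypothetical statement from a converted BIOM table.
        List<String> ours = List.of(
            "<http://example.org/dataset/table1/sample/S1> "
                + "<http://example.org/biom#envTerm> <" + term + "> .");

        // Hypothetical statement from an unrelated dataset.
        List<String> theirs = List.of(
            "<http://example.org/other/sample/9> "
                + "<http://example.org/other#collectedFrom> <" + term + "> .");

        Set<String> merged = merge(ours, theirs);
        mentioning(merged, term).forEach(System.out::println);
    }
}
```

Because both datasets refer to the same term URI, a single lookup on the merged graph finds samples from both sources without any schema alignment step.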

The mapping

A BIOM file is basically a sparse table with some metadata attached to it: in this mapping, it becomes an instance of one of the subclasses of biom:Table, for example biom:OTUTable. The information about the table (generated by, table type, etc.) is translated to triples whose subject is the table instance. The cell values are also translated to triples that are linked to the table instance. An example is included in the ontology.
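As a rough illustration of the shape of this mapping (not the converter shipped in this repository), the following plain Java sketch turns a tiny sparse table into N-Triples lines. Only `biom:Table` and `biom:OTUTable` come from the description above; the namespace URIs and the `generatedBy`, `inTable` and `value` property names are assumptions made for this example, and the real ontology may name them differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: a sparse observation table rendered as N-Triples lines. */
class BiomTableSketch {

    // Placeholder namespaces, NOT the ones used by the actual ontology.
    static final String BIOM = "http://example.org/biom#";
    static final String DATA = "http://example.org/dataset/";
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    static final String XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double";

    /** One non-zero cell of the sparse matrix. */
    static final class Cell {
        final int row;
        final int column;
        final double value;

        Cell(int row, int column, double value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    static String uriTriple(String s, String p, String o) {
        return "<" + s + "> <" + p + "> <" + o + "> .";
    }

    static String doubleTriple(String s, String p, double v) {
        return "<" + s + "> <" + p + "> \"" + v + "\"^^<" + XSD_DOUBLE + "> .";
    }

    /** Table metadata and cells all hang off the table instance. */
    static List<String> toTriples(String tableId, String generatedBy, List<Cell> cells) {
        List<String> out = new ArrayList<>();
        String table = DATA + tableId;
        // The table becomes an instance of a subclass of biom:Table.
        out.add(uriTriple(table, RDF_TYPE, BIOM + "OTUTable"));
        // Table-level metadata; literal escaping is omitted for brevity.
        out.add("<" + table + "> <" + BIOM + "generatedBy> \"" + generatedBy + "\" .");
        for (Cell c : cells) {
            String cell = table + "/cell/" + c.row + "_" + c.column;
            out.add(uriTriple(cell, BIOM + "inTable", table)); // link cell to table
            out.add(doubleTriple(cell, BIOM + "value", c.value));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(new Cell(0, 2, 5.0), new Cell(1, 0, 1.0));
        toTriples("table1", "example-tool", cells).forEach(System.out::println);
    }
}
```

Only non-zero cells produce triples, so the sparsity of the original table carries over to the RDF output.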

Linked Data

  • Normal triple store for table metadata and rows and columns, SADI (BerkeleyDB) for cells, SHARE as client.
  • Linked Data Fragments as server (possibly decompose into normal triple store + BerkeleyDB as above) and client.
  • Normal setting: triple store, some special LD server + AJAX client a la LODestar.
  • Binary RDF (HDT).