# HDF5 and ontologies

HDF5 is considered self-describing because metadata (attributes) can be stored alongside the raw data. However, this is only a prerequisite. Achieving (easy) re-usability in particular requires standardized metadata that is publicly defined and accessible. This means that data must be describable with persistent identifiers, as known from linked data solutions.

One way to describe data is to use controlled vocabularies or, even better, ontologies. In fact, an [ontology exists](http://purl.allotrope.org/ontologies/hdf5/1.8#) that allows describing the structural content of an HDF5 file (groups, datasets, attributes, properties, etc.). The ``h5rdmtoolbox`` implements a conversion function that translates an HDF5 file into a JSON-LD file. This is outlined below.
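As a rough illustration of what such a structural description looks like, independent of the toolbox, the following hand-written sketch expresses a small file layout with terms of the HDF5 ontology. It is not how ``h5rdmtoolbox`` performs the conversion: the `layout` dictionary and the `describe` helper are hypothetical, and only the namespace IRI and the term names (`hdf5:File`, `hdf5:rootGroup`, `hdf5:Group`, `hdf5:Dataset`, `hdf5:member`, `hdf5:name`, `hdf5:size`) are taken from this page.

```python
import json

HDF5_NS = "http://purl.allotrope.org/ontologies/hdf5/1.8#"

# Hypothetical in-memory description of a file layout:
# a dict stands for a group, an int for a dataset with that many elements.
layout = {"ds": 3, "grp": {"sub_grp": {"ds": 1}}}


def describe(name, node):
    """Return a JSON-LD node for a group (dict) or a dataset (int)."""
    if isinstance(node, dict):
        prefix = name.rstrip("/")  # "/" becomes "" so children read "/ds"
        return {
            "@type": "hdf5:Group",
            "hdf5:name": name,
            "hdf5:member": [
                describe(f"{prefix}/{child}", sub) for child, sub in node.items()
            ],
        }
    return {"@type": "hdf5:Dataset", "hdf5:name": name, "hdf5:size": node}


doc = {
    # the context maps the short prefix "hdf5" to the persistent ontology IRI
    "@context": {"hdf5": HDF5_NS},
    "@type": "hdf5:File",
    "hdf5:rootGroup": describe("/", layout),
}
print(json.dumps(doc, indent=2))
```

The `@context` entry is what turns a plain key such as `hdf5:name` into a globally resolvable identifier, which is the linked data property the paragraph above refers to.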

In [1]:
import h5rdmtoolbox as h5tbx

In [8]:
with h5tbx.File(h5tbx.utils.generate_temporary_filename(), 'w') as h5:
    h5.dump()

In [2]:
with h5tbx.File(mode='w') as h5:
    h5.create_dataset('ds', shape=(3, ))
    grp = h5.create_group('grp')
    sub_grp = grp.create_group('sub_grp')
    sub_grp.create_dataset('ds', data=3.4)
    sub_grp['ds'].attrs['units'] = 'm/s'
    h5.dump()

In [3]:
print(h5tbx.jsonld.dumps(h5.hdf_filename, indent=2))

file://C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_14\tmp0.hdf does not look like a valid URI, trying to serialize this will break.


{
  "@context": {
    "hdf": "http://purl.allotrope.org/ontologies/hdf5/1.8#"
  },
  "@graph": [
    {
      "@id": "N6b2255e43af944fea47f853d81dbbdfa",
      "@type": "https://schema.org/SoftwareSourceCode",
      "https://schema.org/softwareVersion": "1.2.3a1"
    },
    {
      "@id": "N2978dd3b885f44f0a657c1ef533584bc",
      "@type": "http://www.molmod.info/semantics/pims-ii.ttl#Variable"
    },
    {
      "@id": "N7d2008daea2442e68031f64f9c39e472",
      "@type": "http://www.molmod.info/semantics/pims-ii.ttl#Variable",
      "units": "m/s"
    },
    {
      "@id": "Nb5c31fbf72004b71a2eb1ce03bddff9c",
      "@type": "hdf:rootGroup"
    },
    {
      "@id": "file://C:\\Users\\da4323\\AppData\\Local\\h5rdmtoolbox\\h5rdmtoolbox\\tmp\\tmp_14\\tmp0.hdf",
      "@type": "hdf:File"
    },
    {
      "@id": "_:N394bb9b3975848be847f5e258e1d296a",
      "http://w3id.org/nfdi4ing/metadata4ing#hasParameter": {
        "@id": "N2978dd3b885f44f0a657c1ef533584bc"
      }
    },
    {
      "

In [4]:
hdf_jsonld = h5tbx.dump_jsonld(h5.hdf_filename, skipND=None)

In [5]:
print(hdf_jsonld)

{
    "@context": {
        "owl": "http://www.w3.org/2002/07/owl#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "hdf5": "http://purl.allotrope.org/ontologies/hdf5/1.8#"
    },
    "@type": "hdf5:File",
    "hdf5:rootGroup": {
        "@type": "hdf5:Group",
        "hdf5:attribute": [],
        "hdf5:name": "/",
        "hdf5:member": [
            {
                "@type": "hdf5:Group",
                "hdf5:attribute": [
                    {
                        "@type": "hdf5:Attribute",
                        "hdf5:name": "@type",
                        "hdf5:value": "https://schema.org/SoftwareSourceCode",
                        "@id": "Nbb963b373a6a487da4afe3b6f1b6c224"
                    },
                    {
                        "@type": "hdf5:Attribute",
                        "hdf5:name": "__h5rdmtoolbox_version__",
                        "hdf5:value": "1.2.3a1",
                        "@id": "N858b4c26824e478088632ca9afc5ee03"
         

## Query the HDF-JSONLD file

In [6]:
sparql_query = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX hdf5: <http://purl.allotrope.org/ontologies/hdf5/1.8#>
        
        SELECT ?ds_name ?ds_size
        WHERE {
            ?ds rdf:type hdf5:Dataset .
            ?ds hdf5:name ?ds_name .
            ?ds hdf5:size ?ds_size .
        }
        """

In [7]:
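import pandas as pd
import rdflib

# parse the JSON-LD string into an RDF graph and run the SPARQL query on it
g = rdflib.Graph()
g.parse(data=hdf_jsonld, format='json-ld')
results = g.query(sparql_query)

# convert the query results (one set of variable bindings per row) to a dataframe
df = pd.DataFrame(results.bindings)
df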

|   | ds_name | ds_size |
|---|---|---|
| 0 | /ds | 3 |
| 1 | /grp/sub_grp/ds | 1 |
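For small documents, the same name and size pairs can also be collected without a triple store by walking the nested JSON tree directly. The sketch below assumes that dataset nodes appear inside `hdf5:member` lists and carry `hdf5:name` and `hdf5:size` keys; the truncated output above does not confirm this, so the `example` document is a hypothetical, shortened stand-in for the real `hdf_jsonld` string, and `iter_datasets` is an illustrative helper rather than part of the toolbox.

```python
import json

# Hypothetical, shortened stand-in for the nested JSON-LD string shown above.
example = json.dumps({
    "@context": {"hdf5": "http://purl.allotrope.org/ontologies/hdf5/1.8#"},
    "@type": "hdf5:File",
    "hdf5:rootGroup": {
        "@type": "hdf5:Group",
        "hdf5:name": "/",
        "hdf5:member": [
            {"@type": "hdf5:Dataset", "hdf5:name": "/ds", "hdf5:size": 3},
            {
                "@type": "hdf5:Group",
                "hdf5:name": "/grp",
                "hdf5:member": [
                    {
                        "@type": "hdf5:Group",
                        "hdf5:name": "/grp/sub_grp",
                        "hdf5:member": [
                            {
                                "@type": "hdf5:Dataset",
                                "hdf5:name": "/grp/sub_grp/ds",
                                "hdf5:size": 1,
                            }
                        ],
                    }
                ],
            },
        ],
    },
})


def iter_datasets(node):
    """Yield (name, size) for every hdf5:Dataset node at or below `node`."""
    if node.get("@type") == "hdf5:Dataset":
        yield node.get("hdf5:name"), node.get("hdf5:size")
    # groups list their children under hdf5:member; datasets have none
    for member in node.get("hdf5:member", []):
        yield from iter_datasets(member)


root = json.loads(example)["hdf5:rootGroup"]
rows = list(iter_datasets(root))
print(rows)  # [('/ds', 3), ('/grp/sub_grp/ds', 1)]
```

This shortcut depends on the exact key spelling of the compacted document (for example the `hdf5` prefix), whereas the SPARQL query operates on the parsed triples and is therefore independent of how the JSON-LD happens to be serialized.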
