# HDF5 and ontologies

HDF5 is considered self-describing because metadata (attributes) can be stored alongside the raw data. However, this is only a prerequisite. Achieving (easy) re-usability additionally requires standardized metadata that is publicly defined and accessible. This means that data must be describable with persistent identifiers, as known from linked data solutions.

One way to describe data is to use controlled vocabularies or, even better, ontologies. In fact, an [ontology exists](http://purl.allotrope.org/ontologies/hdf5/1.8#) that describes the structural content of an HDF5 file (groups, datasets, attributes, properties, etc.). The ``h5rdmtoolbox`` implements a conversion function that translates an HDF5 file into a JSON-LD file. This is outlined here.
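
To make the idea of persistent identifiers concrete, the following minimal sketch builds a JSON-LD node for one dataset using only the standard library. The class `hdf5:Dataset` and the properties `hdf5:name` and `hdf5:size` are the terms used in the query further down this page. The values and the helper `expand_term` are hypothetical and only illustrate how a prefixed term resolves to a full IRI. This is not the output of the toolbox.

```python
import json

# Namespace of the HDF5 ontology linked above
HDF5 = "http://purl.allotrope.org/ontologies/hdf5/1.8#"

# Illustrative node; the values are made up for this sketch
node = {
    "@context": {"hdf5": HDF5},
    "@type": "hdf5:Dataset",
    "hdf5:name": "/test_dataset",
    "hdf5:size": 3,
}


def expand_term(term: str, context: dict) -> str:
    """Resolve a prefixed term such as 'hdf5:Dataset' to a full IRI (hypothetical helper)."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term  # already absolute or unknown prefix: leave unchanged


expanded_type = expand_term(node["@type"], node["@context"])
print(json.dumps(node, indent=2))
print(expanded_type)
```

Because the expanded type is a globally unique IRI, any consumer of the file can look up what "Dataset" means without knowing anything about the tool that wrote it.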

In [None]:
import h5rdmtoolbox as h5tbx

Let's create a sample HDF5 file first:

In [None]:
with h5tbx.File(mode='w') as h5:
    h5.create_dataset('test_dataset', shape=(3, ))
    grp = h5.create_group('grp')
    sub_grp = grp.create_group('Fan')
    sub_grp.create_dataset('D3', data=300)
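    # attach the attribute 'units' and bind it to an RDF predicate (IRI)
    sub_grp['D3'].attrs['units', 'http://w3id.org/nfdi4ing/metadata4ing#hasUnits'] = 'mm'
    # additionally point the attribute value to an IRI (the RDF object)
    sub_grp['D3'].rdf['units'].object = 'https://qudt.org/vocab/unit/MilliM'
    sub_grp['D3'].attrs['standard_name', 'https://matthiasprobst.github.io/ssno/#standard_name'] = 'blade_diameter3'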
    h5.dump(False)
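
Conceptually, each annotated attribute above corresponds to one RDF triple whose subject is the dataset. The sketch below spells these triples out as plain tuples. It is an assumption about how the annotations are meant to be read, not a dump of the toolbox internals. The subject label `/grp/Fan/D3` is a placeholder for whatever node identifier the serializer assigns, and the `Triple` type is hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


DATASET = "/grp/Fan/D3"  # placeholder subject, not a real node identifier

triples = [
    # 'units' has a predicate and an object IRI, so the object is the QUDT unit
    Triple(DATASET,
           "http://w3id.org/nfdi4ing/metadata4ing#hasUnits",
           "https://qudt.org/vocab/unit/MilliM"),
    # 'standard_name' only has a predicate, so the object is the literal value
    Triple(DATASET,
           "https://matthiasprobst.github.io/ssno/#standard_name",
           "blade_diameter3"),
]

for t in triples:
    print(f"<{t.subject}> <{t.predicate}> {t.obj!r}")
```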

## Dump the semantic metadata to JSON-LD format

The semantic metadata is stored in the RDF dictionaries of the HDF5 file, which the `h5rdmtoolbox` can work with. Call `h5tbx.serialize()`, as in the cell below, to extract it:

In [None]:
print(h5tbx.serialize(h5.hdf_filename,
                      indent=2,
                      context={'schema': 'http://schema.org/',
                               "ssno":  "https://matthiasprobst.github.io/ssno/#",
                               "m4i": "http://w3id.org/nfdi4ing/metadata4ing#"}))

## Dump the structural metadata to JSON-LD format

The structural (or organizational) metadata describes the internal layout of the HDF5 file: groups, datasets, attributes, and their properties, including how they relate to each other:

In [None]:
hdf_jsonld = h5tbx.dump_jsonld(h5.hdf_filename, skipND=None)
print(hdf_jsonld)
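
As a rough picture of what such a structural description contains, the sketch below walks a nested dictionary that mirrors the layout of the sample file and emits one node per group or dataset. The dictionary, the function `describe`, and the class name `hdf5:Group` are assumptions made for illustration; `hdf5:Dataset`, `hdf5:name`, and `hdf5:size` are the terms used in the query in the next section. The actual output of the toolbox contains more detail.

```python
# Hypothetical stand-in for the sample file: dicts are groups,
# integers are dataset sizes (number of elements)
layout = {
    "test_dataset": 3,
    "grp": {"Fan": {"D3": 1}},
}


def describe(name, item, path=""):
    """Yield one JSON-LD-like node per group or dataset (illustrative only)."""
    full = f"{path}/{name}" if name else "/"
    if isinstance(item, dict):
        yield {"@type": "hdf5:Group", "hdf5:name": full}
        for child_name, child in item.items():
            # avoid a double slash for children of the root group
            child_path = "" if full == "/" else full
            yield from describe(child_name, child, child_path)
    else:
        yield {"@type": "hdf5:Dataset", "hdf5:name": full, "hdf5:size": item}


nodes = list(describe("", layout))
for node in nodes:
    print(node)
```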

## Query the HDF-JSONLD file

The resulting JSON-LD data can be searched for specific information. In the example below, all datasets are extracted together with their sizes:

In [None]:
sparql_query = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX hdf5: <http://purl.allotrope.org/ontologies/hdf5/1.8#>

SELECT ?ds_name ?ds_size
WHERE {
    ?dataset rdf:type hdf5:Dataset .
    ?dataset hdf5:name ?ds_name .
    ?dataset hdf5:size ?ds_size .
}
"""
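
The query is a conjunction of three triple patterns that share one subject variable. To show what that means without a SPARQL engine, the sketch below evaluates the same three patterns over a small hand-written list of triples. The triples and the function `select_datasets` are illustrative assumptions; in the notebook, `rdflib` performs this matching on the real graph.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HDF5 = "http://purl.allotrope.org/ontologies/hdf5/1.8#"

# Hand-written example triples (subject, predicate, object)
triples = [
    ("_:b0", RDF_TYPE, HDF5 + "Dataset"),
    ("_:b0", HDF5 + "name", "/test_dataset"),
    ("_:b0", HDF5 + "size", 3),
    ("_:b1", RDF_TYPE, HDF5 + "Group"),
    ("_:b1", HDF5 + "name", "/grp"),
]


def select_datasets(triples):
    """Return (name, size) pairs for every subject matching all three patterns."""
    rows = []
    subjects = [s for s, p, o in triples if p == RDF_TYPE and o == HDF5 + "Dataset"]
    for subject in subjects:
        names = [o for s, p, o in triples if s == subject and p == HDF5 + "name"]
        sizes = [o for s, p, o in triples if s == subject and p == HDF5 + "size"]
        # every combination of matches yields one result row, as in SPARQL
        rows.extend((name, size) for name in names for size in sizes)
    return rows


rows = select_datasets(triples)
print(rows)
```

The group node is skipped because it fails the first pattern; a dataset without a size would be skipped as well, since all three patterns must match.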

In [None]:
import rdflib
g = rdflib.Graph()
g.parse(data=hdf_jsonld, format='json-ld')
results = g.query(sparql_query)

In [None]:
for b in results.bindings:
    print(b)

In [None]:
# convert results to dataframe:
import pandas as pd
df = pd.DataFrame(results.bindings)
df