# HDF5 and RDF: Toward FAIR Attributes

HDF5 files are often described as *self-describing*, meaning they contain internal metadata about their structure, such as groups, datasets, datatypes, and attributes. Tools like `h5py`, `HDFView`, or `h5dump` can parse and display this structure without external documentation.

However, this self-description is:
- **Structural**
- **Syntactic**
- **Low-level**

It tells you **what is stored**, but not **what it means**. For example, an attribute named `"units"` with the value `"counts"` says nothing about whether it refers to photon counts, electrical pulses, or normalized integers — and it's unlikely to align with shared standards or ontologies.

To make HDF5 data **understandable**, **interoperable**, and **reusable**, especially by machines, **semantic annotation** is essential.

---

## FAIR Principle F1: Use Globally Unique and Persistent Identifiers

According to [F1 of the *FAIR Principles*](https://www.go-fair.org/fair-principles/f1-meta-data-assigned-globally-unique-persistent-identifiers/):

> *"Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset. In this context, identifiers consist of an internet link (e.g., a URL that resolves to a web page that defines the concept such as a particular human protein). Many data repositories will automatically generate globally unique and persistent identifiers to deposited datasets. Identifiers can help other people understand exactly what you mean, and they allow computers to interpret your data in a meaningful way (i.e., computers that are searching for your data or trying to automatically integrate them). Identifiers are essential to the human-machine interoperation that is key to the vision of Open Science. In addition, identifiers will help others to properly cite your work when reusing your data."*

In this context, each concept or attribute — such as a physical unit, method, instrument, or material — should be identified using a **globally resolvable identifier**, such as a URI or IRI.

---

### Using `h5rdmtoolbox` for Semantic Annotation

The `h5rdmtoolbox` provides functionality to **link HDF5 attributes to identifiers**, enabling FAIR-compliant metadata. For each attribute, both the **name** and the **value** can be associated with an **IRI (Internationalized Resource Identifier)**, connecting the metadata to shared vocabularies, ontologies, or data catalogs.

The following section demonstrates how to annotate HDF5 attributes with semantic identifiers using `h5rdmtoolbox`.


## Concept: Representing HDF5 Metadata as RDF Triples

HDF5 metadata can be interpreted in terms of [RDF triples](https://en.wikipedia.org/wiki/Semantic_triple), the foundational structure of the Semantic Web. An RDF triple consists of:

- a **subject** – the thing being described (e.g., a group or dataset)
- a **predicate** – the property or relationship (e.g., an attribute name)
- an **object** – the value or target of the property (e.g., an attribute value)

So, each attribute in an HDF5 file can naturally be viewed as a semantic statement:

> **subject** = HDF5 object (group or dataset)  
> **predicate** = attribute name  
> **object** = attribute value

This interpretation enables the transformation of binary HDF5 metadata into structured, queryable, and machine-interpretable knowledge.

---

### From Human Understanding to Machine Interpretability

Humans may be able to understand the contents of an HDF5 file based on its naming conventions or documentation. For example, a dataset of random data might include an attribute like `"creator": "Alice"`, and we understand that "Alice" refers to a person.

However, machines cannot reliably interpret such informal metadata. To make meaning explicit and unambiguous, we must associate **globally unique identifiers**, such as [URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier), with HDF5 components.

> For example, the attribute `"contact"` is ambiguous: is it a person or an organization? A well-chosen URI can clarify this by linking to a concept from a standard vocabulary or ontology.

---

In the following section, we will semantically annotate an HDF5 file that contains a dataset of random values, along with metadata about its creator — identified using an ORCID iD. This demonstrates how to convert conventional metadata into **machine-readable RDF**.

In [1]:
import h5rdmtoolbox as h5tbx

## Describing an HDF5 file with persistent metadata

### Example part 1: A contact person

In this example, we create a HDF5 group, that contains all relevant contact data of the author of the file. The content if the group thus describes the contact person and therefore *is* a person. The group itself, gets the predicate *has author* and relates the HDF5 to the author:

In [2]:
with h5tbx.File() as h5:
    grp = h5.create_group('contact', attrs=dict(orcid='https://orcid.org/0000-0001-8729-0482'))   
    grp.rdf.predicate = 'https://schema.org/author'
    grp.rdf.type = 'http://xmlns.com/foaf/0.1/Person'  # what the content of group is, namely a foaf:Person
    grp.rdf.subject = 'https://orcid.org/0000-0001-8729-0482'  # corresponds to @ID in JSON-LD
    grp.rdf.predicate['orcid'] =  'http://w3id.org/nfdi4ing/metadata4ing#orcidId'
    grp.attrs['first_name', 'http://xmlns.com/foaf/0.1/firstName'] = 'Matthias'

    o = grp.rdf.predicate['orcid']
    
    h5.dump(collapsed=False)

hdf_filename = h5.hdf_filename

Using the `rdf` accessory, we can assign the objects (dataset, groups, attributes) with the internationalized resource identifier (IRI). An IRI is a web resource and points to the definition in an ontology, e.g. "contact" is a "Person" and is defined in the ontology FOAF: 'http://xmlns.com/foaf/0.1/Person'. The person "has a researcher ID". This predicate is described in the M4i ([metadata4ing](https://nfdi4ing.pages.rwth-aachen.de/metadata4ing/metadata4ing/)) ontology: 'http://w3id.org/nfdi4ing/metadata4ing#orcid'

### Assigning metadata to the file rather than the root group

If we want to describe the file using attributes, the root group "/" is the way to go. However, there we might want to distinguish between the actual file and the root group. For this, we can also use the accessory `frdf`, which allows assigning RDF triples to the file.

In the following example, we add the creation date as a root group attribute but explain it as a file attribute rather than a group attribute.

Using the method `serialize()` the content is displayed as a Linked Data form:

In [3]:
from datetime import datetime

with h5tbx.File(mode='w') as h5:
    h5.frdf.subject = "https://example.org/myfile-id" # the ID of this file container
    
    h5.attrs["creation_date"] = datetime.today()
    h5.frdf["creation_date"].predicate = "http://purl.org/dc/terms/created"

print(h5tbx.serialize(h5.hdf_filename, format="ttl", structural=False, semantic=True, indent=2))

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/myfile-id> dcterms:created "20250803182520237447"^^xsd:string .




From now on, let's same some work and use the package `namespacelib`, which simplifies the work with the namespaces, so that we don't have to type the full IRI address. Some popular ones are implemented in the `rdflib` package, too:

In [4]:
from ontolutils.namespacelib import M4I, OBO, QUDT_UNIT, QUDT_KIND
from rdflib.namespace import FOAF

As a result, we can type the following:

In [5]:
M4I.orcidId  # equal to http://w3id.org/nfdi4ing/metadata4ing#orcidId

rdflib.term.URIRef('http://w3id.org/nfdi4ing/metadata4ing#orcidId')

### Example part 2: A random data dataset
Next, we add the random data dataset with units. We can even describe what type the data is. In our case it shall be velocity data. Without this specification it would otherwise not be clear to the user (or a machine):

In [6]:
import numpy as np

with h5tbx.File(hdf_filename, mode='r+') as h5:    
    ds = h5.create_dataset('grp/random_velocity', data=np.random.random(100))
    ds.attrs.create('units',
                    rdf_predicate=M4I.hasUnit,
                    data='m/s',
                    rdf_object=QUDT_UNIT.M_PER_SEC)
    ds.attrs.create('quantity_kind',
                     data='velocity',
                     rdf_predicate=M4I.hasKindOfQuantity,
                     rdf_object=QUDT_KIND.Velocity)

    h5.dump(collapsed=False)

Now, let's go further and describe how the random dataset was created and that the contact was involved in it:

In [7]:
from datetime import datetime

with h5tbx.File(hdf_filename, mode='r+') as h5:  
    proc = h5.create_group('processing_info')
    proc.rdf.subject = M4I.ProcessingStep
    proc.attrs['has_participants', OBO.has_participant] = h5['contact']
    start_time = datetime.today()
    end_time = datetime.today()
    proc.attrs.create('start_time', data=start_time,
                      rdf_predicate='https://schema.org/startTime')
    proc.attrs.create('end_time', data=end_time,
                      rdf_predicate='https://schema.org/startTime')
    proc.attrs['output', 'http://purl.obolibrary.org/obo/RO_0002234'] = h5['grp/random_velocity'].name

In [8]:
h5tbx.dump(hdf_filename, collapsed=False)

### Example part 3: Associating an entity

Until now, we used IRIs to assign meaning to HDF5 attributes, e.g. `proc.rdf.subject = M4I.ProcessingStep`.

Sometimes, data cannot be expressed by a single IRI, because there is no globally unique identifier. We might describe the object by an entity with its properties.

In the example below, the attribute "standard_name" of the dataset "u" refers to "x_velocity" being the [Standard name](https://matthiasprobst.github.io/ssno#StandardName) of the HDF5 dataset "u". A Standard name has a name, description and SI unit and may be associated to a Standard Name Table in which it is listed. In our case, the Standard name "x_velocity" has no globally unique identifier, hence we need to describe it by a JOSN-LD string:

In [9]:
sn_xvel = """{
    "@context": {
        "ssno": "https://matthiasprobst.github.io/ssno#"
    },
    "@type": "ssno:StandardName",
    "ssno:standardName": "x_velocity",
    "ssno:unit": "http://qudt.org/vocab/unit/M-PER-SEC",
    "ssno:description": "X-component of a velocity vector."
}"""

Let's assign this entity to the attribute "standard_name". Note, that when dumping the data, the "LD"-icon appears (linked data):

In [10]:
with h5tbx.File() as h5:
    h5.create_dataset("u", data=[1,2,3], attrs={"standard_name": "x_velocity"})
    h5.u.rdf["standard_name"].predicate = "https://matthiasprobst.github.io/ssno#hasStandardName"
    # h5.u.rdf["standard_name"].object = sn_xvel
    h5.u.rdf["standard_name"].object = sn_xvel
    h5.dump(False)
    
    serialization = h5.serialize(fmt="ttl", structural=False)

The JSON-LD dump shows that "standard_name" is correctly associated with our JSON-LD string for the `ssno:StandardName`:

In [11]:
print(serialization)

@prefix ssno: <https://matthiasprobst.github.io/ssno#> .

[] ssno:hasStandardName [ a ssno:StandardName ;
            ssno:description "X-component of a velocity vector." ;
            ssno:standardName "x_velocity" ;
            ssno:unit "http://qudt.org/vocab/unit/M-PER-SEC" ] .




## How to make use of the FAIR HDF5 file?

There are three ways, how the above IRI assignments help us and how we might want to use the information:
1. Visual inspection by dumping the content to screen: This will outline the file (meta) content and we can click on the attributes with IRIs, which will explain the attribute (data)
2. We can extract a *JSON-LD* file. This is useful for other processes. We can also investigate this file further with tools like [JSON-LD-playground](https://json-ld.org/playground/).
3. Access IRI in (Python) code

### 1. Visual inspection

The *dump()* method will now add IRI-icons. Click on it and get redirected to the resources:

In [12]:
h5tbx.dump(hdf_filename, collapsed=False)

### 2. JSON-LD extraction

Write the JSON-LD or Turtle (ttl) file and share it with others or a repository. The toolbox provides `dump`-methods through the `jsonld` module or - in the newer version of h5tbx - the `serialize` method, allowing to write various linked data formats.

It might look a bit overwelming, however dedicated scripts can perfectly work with it while humans still can read it (with a bit of practice and patience...):

In [13]:
print(
    h5tbx.serialize(
        hdf_filename,
        format="ttl",
        indent=2,
        context={'m4i': 'http://w3id.org/nfdi4ing/metadata4ing#',
                 'foaf': 'http://xmlns.com/foaf/0.1/',
                 'obo': 'http://purl.obolibrary.org/obo/'}
    )
)

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix ns1: <http://purl.obolibrary.org/obo/> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_IEEE_F64LE a hdf:Datatype .

m4i:ProcessingStep ns1:RO_0000057 "/contact"^^xsd:string ;
    ns1:RO_0002234 "/grp/random_velocity"^^xsd:string ;
    schema:startTime "2025-08-03T18:25:20.424739"^^xsd:string .

<https://orcid.org/0000-0001-8729-0482> a foaf:Person ;
    m4i:orcidId "https://orcid.org/0000-0001-8729-0482"^^xsd:string ;
    foaf:firstName "Matthias"^^xsd:string .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:member [ a hdf:Group ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "Matthias"^^xsd:string ;
                            hdf:name "first_name

## 3. Access IRI in code

You may want to access the IRI of an attribute with Python within the HDF5 file. E.g. while working with the file, you may ask "Hey, what is 'contact' exactly?" or "What does the attribute 'orcid' mean?"

In [14]:
with h5tbx.File(hdf_filename) as h5:
    person_iri = h5.contact.rdf.subject
    orcid_iri = h5.contact.rdf.predicate['orcid']

... Well "contact" is a "Person" defined by the FOAF ontology:

In [15]:
person_iri

'https://orcid.org/0000-0001-8729-0482'

... and "orcid" is a predicate defined by the metadata4ing ontology:

In [16]:
orcid_iri

'http://w3id.org/nfdi4ing/metadata4ing#orcidId'

### 3.1 Find data based on IRIs

In [17]:
import rdflib.graph as g

graph = g.Graph()
graph.parse('hdf_meta.jsonld', format='json-ld')

<Graph identifier=N837dcf5545cd41e09d7f3ca22a5b9703 (<class 'rdflib.graph.Graph'>)>

Note, that we need to provide the PREFIXES, if the json-ld data/file does not include the context.

In [18]:
res = graph.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX m4i: <http://w3id.org/nfdi4ing/metadata4ing#>

SELECT ?id ?orcid
WHERE {
    ?id a foaf:Person .
    ?id m4i:orcid ?orcid .
    }
""")

In [19]:
for r in res:
    print(r)

## 4. Examples:

### 4.1 Read metadata from JSON and write to HDF5

Suppose we want to store information about the used software to the HDF5 file. It exists as a JSON-LD file based on the codemeta ontolog. For this example, we use the *h5rdmtoolbox* codemeta.json file from the github repository:

In [20]:
from h5rdmtoolbox.utils import download_file
from pprint import pprint

Download the file:

In [21]:
codemeta_url = 'https://raw.githubusercontent.com/matthiasprobst/h5RDMtoolbox/main/codemeta.json'
dowloaded_filename = download_file(codemeta_url)



Read the data with `ontolutils.dquery`:

In [22]:
from ontolutils import dquery

In [23]:
data = dquery(subject='schema:SoftwareSourceCode',
              source=dowloaded_filename,
              context={"schema": "http://schema.org/"})
pprint(data[0])

{'@context': {'applicationCategory': 'http://schema.org/applicationCategory',
              'author': 'http://schema.org/author',
              'codeRepository': 'http://schema.org/codeRepository',
              'description': 'http://schema.org/description',
              'license': 'http://schema.org/license',
              'name': 'http://schema.org/name',
              'operatingSystem': 'http://schema.org/operatingSystem',
              'programmingLanguage': 'http://schema.org/programmingLanguage',
              'version': 'http://schema.org/version'},
 '@id': '_:N07fb9e49070e43828cc76cdb49b1d7d2',
 '@type': 'http://schema.org/SoftwareSourceCode',
 'applicationCategory': 'file:///C:/Users/matth/AppData/Local/h5rdmtoolbox/h5rdmtoolbox/Cache/2.2.0/Engineering',
 'author': [{'@id': 'https://orcid.org/0000-0002-4116-0065',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://sc

The data are written into the HDF5 file by using `jsonld.to_hdf()`:

In [24]:
from h5rdmtoolbox.wrapper import jsonld

In [25]:
with h5tbx.File('test.hdf', 'w') as h5:
    jsonld.to_hdf(data=data[0],
                 grp=h5.create_group('software_code'))
    h5.dump(False)

In [26]:
import ontolutils

In [27]:
with h5tbx.File(mode='w') as h5:
    _ = h5.create_dataset('test_dataset', data=np.array([[1, 2], [3, 4], [5.4, 1.9]]))
    h5.create_dataset('grp/subgrp/vel', data=4)
    h5.attrs['name', ontolutils.SCHEMA.name] = 'test attr'

    ttl = h5tbx.serialize(h5.filename, structural=True, format="ttl")
print(ttl)

@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_IEEE_F64LE a hdf:Datatype .

hdf:H5T_INTEL_I32 a hdf:Datatype .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:attribute [ a hdf:StringAttribute ;
                    hdf:data "test attr"^^xsd:string ;
                    hdf:name "name" ] ;
            hdf:member [ a hdf:Group ;
                    hdf:member [ a hdf:Group ;
                            hdf:member [ a hdf:Dataset ;
                                    hdf:dataspace [ a hdf:ScalarDataspace ] ;
                                    hdf:datatype hdf:H5T_INTEL_I32,
                                        "H5T_INTEGER" ;
                                    hdf:layout hdf:H5D_CONTIGUOUS ;
                                    hdf:maximumSize -1 ;
                                    hdf:name "/grp/subgrp/vel" ;
                                    hdf

## Describing attribute meanings without RDF

Sometimes, there is no IRI (yet) defined but the need to give an additional comment on the attribute. This can be done by as follows:

In [28]:
with h5tbx.File() as h5:
    grp = h5.create_group('contact')

    # Set an attribute as usual
    grp.attrs['type'] = 'Contact'

    # Update the attribute definition afterwards:
    grp.rdf['type'].definition = 'The role of the Person'

    # Alternatively, it can be assigned simultaneously via h5tbx.Attribute:
    grp.attrs['fname'] = h5tbx.Attribute(value='Matthias',
                                        definition='The first name of the contact')
    h5.dump(False)

    jdict = h5.dump_jsonld(indent=2)

In [29]:
print(jdict)

{
  "@context": {
    "hdf": "http://purl.allotrope.org/ontologies/hdf5/1.8#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  },
  "@graph": [
    {
      "@id": "_:tmp4.hdf",
      "@type": "hdf:File",
      "hdf:rootGroup": {
        "@id": "_:tmp4.hdf/"
      }
    },
    {
      "@id": "_:tmp4.hdf/",
      "@type": "hdf:Group",
      "hdf:member": {
        "@id": "_:tmp4.hdf/contact"
      },
      "hdf:name": "/"
    },
    {
      "@id": "_:tmp4.hdf/contact",
      "@type": "hdf:Group",
      "hdf:attribute": [
        {
          "@id": "_:tmp4.hdf/contact@fname"
        },
        {
          "@id": "_:tmp4.hdf/contact@type"
        }
      ],
      "hdf:name": "/contact"
    },
    {
      "@id": "_:tmp4.hdf/contact@fname",
      "@type": "hdf:StringAttribute",
      "hdf:data": "Matthias",
      "hdf:name": "fname"
    },
    {
      "@id": "_:tmp4.hdf/contact@type",
      "@type": "hdf:StringAttribute",
      "hdf:data": "Contact",
      "hdf:name": "type"
    }