# HDF5 and RDF: FAIR Attributes

According to [F1 of the *FAIR Principles*](https://www.go-fair.org/fair-principles/f1-meta-data-assigned-globally-unique-persistent-identifiers/), (meta)data shall be assigned globally unique and persistent identifiers.

Here's what www.go-fair.org says about it:

*"Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset. In this context, identifiers consist of an internet link (e.g., a URL that resolves to a web page that defines the concept such as a particular human protein). Many data repositories will automatically generate globally unique and persistent identifiers to deposited datasets. Identifiers can help other people understand exactly what you mean, and they allow computers to interpret your data in a meaningful way (i.e., computers that are searching for your data or trying to automatically integrate them). Identifiers are essential to the human-machine interoperation that is key to the vision of Open Science. In addition, identifiers will help others to properly cite your work when reusing your data."*

The *h5rdmtoolbox* allows assigning attributes (and their data) to identifiers. For this, each name and value of an attribute may obtain an IRI (internationalized resource identifier). The following outlines how this is done.

## Concept

We can interpret HDF5 objects, their attribute names and attribute values as [RDF triples](https://en.wikipedia.org/wiki/Semantic_triple) (subject-predicate-object), where...
- ... a group or dataset is a *subject*
- ... the attribute name is a <u>predicate</u>
- ... and the attribute value is an **object**
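
The mapping can be sketched in plain Python without any HDF5 library. The helper `attrs_to_triples` and the dictionaries below are hypothetical and only illustrate the idea; they are not part of the *h5rdmtoolbox* API:

```python
# Hypothetical illustration: turn the attributes of one HDF5 object into RDF-like triples.
def attrs_to_triples(subject, attrs, predicate_iris):
    """Return (subject, predicate, object) tuples for all attributes that have a predicate IRI."""
    triples = []
    for name, value in attrs.items():
        predicate = predicate_iris.get(name)
        if predicate is None:
            continue  # attribute without an IRI: no triple can be built
        triples.append((subject, predicate, value))
    return triples


# The group is the subject (here identified by an ORCID), its attributes become predicate/object pairs
subject = "https://orcid.org/0000-0001-8729-0482"
attrs = {"first_name": "Matthias", "comment": "no IRI assigned"}
predicate_iris = {"first_name": "http://xmlns.com/foaf/0.1/firstName"}

triples = attrs_to_triples(subject, attrs, predicate_iris)
for s, p, o in triples:
    print(s, p, o)
```

Only the attribute with an assigned predicate IRI ends up in a triple. The attribute "comment" stays readable for humans but carries no machine-interpretable meaning.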

In the following, we would like to describe the content of an HDF5 file. There will be a dataset of random data generated by a person, who can be identified/described by a researcher ID (ORCID).

We as humans may understand the content of such an HDF5 file. For machines to interpret the data, we need to associate [URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier) with the HDF5 objects. In fact, sometimes it may not be very clear to humans either what is meant by a certain attribute. A URI removes this ambiguity. Think of the attribute "contact", which we will define. Is it a person or an organization? Note that URI and [IRI](https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier) may be used synonymously here: IRI is built on URI by expanding the set of permitted characters.

Let's build the example step by step. We start by importing the toolbox; the group "contact" is created in the first part of the example:

In [None]:
import h5rdmtoolbox as h5tbx

## Describing an HDF5 file with persistent metadata

### Example part 1: A contact person
The file is written by an author. We create a group. It contains all relevant contact data, i.e. the ORCID. The content of the group thus describes the contact person and therefore *is* a person. The group itself, however, gets the predicate *has author*:

In [None]:
with h5tbx.File(mode='w') as h5:
    grp = h5.create_group('contact', attrs=dict(orcid='https://orcid.org/0000-0001-8729-0482'))   
    grp.rdf.predicate = 'https://schema.org/author'
    grp.rdf.type = 'http://xmlns.com/foaf/0.1/Person'  # what the content of group is, namely a foaf:Person
    grp.rdf.subject = 'https://orcid.org/0000-0001-8729-0482'  # corresponds to @ID in JSON-LD
    grp.rdf.predicate['orcid'] =  'http://w3id.org/nfdi4ing/metadata4ing#orcidId'
    grp.attrs['first_name', 'http://xmlns.com/foaf/0.1/firstName'] = 'Matthias'

    o = grp.rdf.predicate['orcid']
    
    h5.dump(collapsed=False)

hdf_filename = h5.hdf_filename

Using the `rdf` accessor, we can assign internationalized resource identifiers (IRIs) to the objects (datasets, groups, attributes). An IRI identifies a web resource and points to the definition in an ontology, e.g. "contact" is a "Person", which is defined in the FOAF ontology: 'http://xmlns.com/foaf/0.1/Person'. The person "has a researcher ID". This predicate is described in the M4i ([metadata4ing](https://nfdi4ing.pages.rwth-aachen.de/metadata4ing/metadata4ing/)) ontology: 'http://w3id.org/nfdi4ing/metadata4ing#orcidId'

### Assigning metadata to the file rather than the root group

If we want to describe the file using attributes, the root group "/" is the way to go. However, we might want to distinguish between the actual file and the root group. For this, we can use the accessor `frdf`, which allows assigning RDF triples to the file.

In the following example, we add the creation date as a root group attribute but explain it as a file attribute rather than a group attribute.

Using the function `serialize()`, the content is displayed in a Linked Data format (here Turtle):

In [None]:
from datetime import datetime

with h5tbx.File(mode='w') as h5:
    h5.attrs["creation_date"] = datetime.today()
    h5.frdf["creation_date"].predicate = "http://purl.org/dc/terms/created"

print(h5tbx.serialize(h5.hdf_filename, format="ttl", structural=False, semantic=True, indent=2))

From now on, let's save some work and use the module `namespacelib` of the package `ontolutils`, which simplifies working with namespaces, so that we don't have to type the full IRI. Some popular namespaces are implemented in the `rdflib` package, too:

In [None]:
from ontolutils.namespacelib import M4I, OBO, QUDT_UNIT, QUDT_KIND
from rdflib.namespace import FOAF

As a result, we can type the following:

In [None]:
M4I.orcidId  # equal to http://w3id.org/nfdi4ing/metadata4ing#orcidId
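
To illustrate why attribute access on a namespace object yields a full IRI, here is a minimal sketch. The class `Namespace` below is hypothetical and written only for this explanation; the classes in `ontolutils` and `rdflib` do more than this:

```python
class Namespace:
    """Hypothetical, minimal namespace helper: attribute access appends the term to a base IRI."""

    def __init__(self, base_iri):
        self._base_iri = base_iri

    def __getattr__(self, term):
        # Only called for names that are not regular attributes, e.g. "orcidId"
        if term.startswith("_"):
            raise AttributeError(term)
        return self._base_iri + term


m4i = Namespace("http://w3id.org/nfdi4ing/metadata4ing#")
foaf = Namespace("http://xmlns.com/foaf/0.1/")

print(m4i.orcidId)
print(foaf.Person)
```

Note that this sketch accepts any term, including misspelled ones. Checking terms against the ontology is left out here.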

### Example part 2: A random data dataset
Next, we add the random data dataset with units. We can even describe what type the data is. In our case it shall be velocity data. Without this specification it would otherwise not be clear to the user (or a machine):

In [None]:
import numpy as np

with h5tbx.File(hdf_filename, mode='r+') as h5:    
    ds = h5.create_dataset('grp/random_velocity', data=np.random.random(100))
    ds.attrs.create('units',
                    rdf_predicate=M4I.hasUnit,
                    data='m/s',
                    rdf_object=QUDT_UNIT.M_PER_SEC)
    ds.attrs.create('quantity_kind',
                     data='velocity',
                     rdf_predicate=M4I.hasKindOfQuantity,
                     rdf_object=QUDT_KIND.Velocity)

    h5.dump(collapsed=False)
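
If many datasets carry a unit string, the matching IRI can be looked up instead of being typed each time. The following is only a sketch under the assumption that a small, hand-written table is sufficient; `UNIT_IRIS` and `unit_to_iri` are hypothetical names and not part of the toolbox:

```python
# Hypothetical lookup table; only the units listed here are covered
UNIT_IRIS = {
    "m/s": "http://qudt.org/vocab/unit/M-PER-SEC",
    "m": "http://qudt.org/vocab/unit/M",
    "s": "http://qudt.org/vocab/unit/SEC",
}


def unit_to_iri(unit):
    """Return the QUDT IRI for a unit string, or None if the unit is not in the table."""
    return UNIT_IRIS.get(unit.strip())


print(unit_to_iri("m/s"))      # IRI of metre per second
print(unit_to_iri("furlong"))  # None: unknown units stay without an IRI
```

Returning `None` for unknown units keeps the decision explicit: such an attribute is still stored, just without a machine-readable object IRI.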

Now, let's go further and describe how the random dataset was created and that the contact was involved in it:

In [None]:
from datetime import datetime

with h5tbx.File(hdf_filename, mode='r+') as h5:  
    proc = h5.create_group('processing_info')
    proc.rdf.subject = M4I.ProcessingStep
    proc.attrs['has_participants', OBO.has_participant] = h5['contact']
    start_time = datetime.today()
    end_time = datetime.today()
    proc.attrs.create('start_time', data=start_time,
                      rdf_predicate='https://schema.org/startTime')
    proc.attrs.create('end_time', data=end_time,
                      rdf_predicate='https://schema.org/endTime')
    proc.attrs['output', 'http://purl.obolibrary.org/obo/RO_0002234'] = h5['grp/random_velocity'].name

In [None]:
h5tbx.dump(hdf_filename, collapsed=False)

### Example part 3: Assigning JSON-LD to describe data

Until now, we used IRIs to assign meaning to HDF5 attributes, e.g. `proc.rdf.subject = M4I.ProcessingStep`.

Sometimes, data cannot be expressed by a single IRI, because there is no globally unique identifier. Let's examine this case by using the [SSNO](https://matthiasprobst.github.io/ssno/) Ontology.

In the example below, the attribute "standard_name" of the dataset "u" refers to "x_velocity" being the [Standard name](https://matthiasprobst.github.io/ssno#StandardName) of the HDF5 dataset "u". A Standard name has a name, description and SI unit and may be associated with a Standard Name Table in which it is listed. In our case, the Standard name "x_velocity" has no globally unique identifier, hence we need to describe it by a JSON-LD string:

In [None]:
sn_xvel = """{
    "@context": {
        "ssno": "https://matthiasprobst.github.io/ssno#"
    },
    "@type": "ssno:StandardName",
    "ssno:standardName": "x_velocity",
    "ssno:unit": "http://qudt.org/vocab/unit/M-PER-SEC",
    "ssno:description": "X-component of a velocity vector."
}"""
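
To see what the prefix in the `@context` does, the string can be inspected with the standard library alone. The helper `expand` below is a hypothetical, strongly simplified stand-in for real JSON-LD expansion; it only resolves prefixes that are listed in the context:

```python
import json

sn_xvel = """{
    "@context": {
        "ssno": "https://matthiasprobst.github.io/ssno#"
    },
    "@type": "ssno:StandardName",
    "ssno:standardName": "x_velocity",
    "ssno:unit": "http://qudt.org/vocab/unit/M-PER-SEC",
    "ssno:description": "X-component of a velocity vector."
}"""


def expand(term, context):
    """Replace a known prefix ("ssno:...") by its namespace IRI; leave other terms unchanged."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term


doc = json.loads(sn_xvel)
context = doc["@context"]

# Expand all keys and the value of "@type"; literal values are kept as they are
expanded = {
    expand(key, context): (expand(value, context) if key == "@type" else value)
    for key, value in doc.items()
    if key != "@context"
}
print(json.dumps(expanded, indent=2))
```

After the expansion, every key is a full IRI, so the description no longer depends on the local prefix "ssno".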

In [None]:
with h5tbx.File() as h5:
    h5.create_dataset("u", data=[1,2,3], attrs={"standard_name": "x_velocity"})
    h5.u.rdf["standard_name"].predicate = "https://matthiasprobst.github.io/ssno#hasStandardName"
    h5.u.rdf["standard_name"].object = sn_xvel
    h5.dump(False)
    h5jld = h5.dump_jsonld(indent=2, structural=False)

The JSON-LD dump shows that "standard_name" is correctly associated with our JSON-LD string for the `ssno:StandardName`:

In [None]:
print(h5jld)

## How to make use of the FAIR HDF5 file?

There are three ways in which the above IRI assignments help us and in which we might want to use the information:
1. Visual inspection by dumping the content to screen: This outlines the file (meta) content, and we can click on the attributes with IRIs, which explain the attribute (data).
2. Extraction of a *JSON-LD* file: This is useful for other processes. We can also investigate this file further with tools like [JSON-LD-playground](https://json-ld.org/playground/).
3. Access to the IRIs in (Python) code.

### 1. Visual inspection

The *dump()* method now adds IRI icons. Click on one to get redirected to the resource:

In [None]:
h5tbx.dump(hdf_filename, collapsed=False)

### 2. JSON-LD extraction

Write the JSON-LD or Turtle (ttl) file and share it with others or upload it to a repository. The toolbox provides `dump` methods through the `jsonld` module or, in newer versions of h5tbx, the `serialize` function, which writes various linked data formats.

It might look a bit overwhelming. However, dedicated scripts can work with it perfectly, while humans can still read it (with a bit of practice and patience...):

In [None]:
print(
    h5tbx.serialize(
        hdf_filename,
        format="ttl",
        indent=2,
        context={'m4i': 'http://w3id.org/nfdi4ing/metadata4ing#',
                 'foaf': 'http://xmlns.com/foaf/0.1/',
                 'obo': 'http://purl.obolibrary.org/obo/'}
    )
)

### 3. Access IRI in code

You may want to access the IRI of an attribute with Python within the HDF5 file. E.g. while working with the file, you may ask "Hey, what is 'contact' exactly?" or "What does the attribute 'orcid' mean?"

In [None]:
with h5tbx.File(hdf_filename) as h5:
    person_iri = h5.contact.rdf.type
    orcid_iri = h5.contact.rdf.predicate['orcid']

... Well "contact" is a "Person" defined by the FOAF ontology:

In [None]:
person_iri

... and "orcid" is a predicate defined by the metadata4ing ontology:

In [None]:
orcid_iri

### 3.1 Find data based on IRIs

In [None]:
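import rdflib

# Export the metadata first so that 'hdf_meta.jsonld' exists
with h5tbx.File(hdf_filename) as h5, open('hdf_meta.jsonld', 'w') as f:
    f.write(h5.dump_jsonld(indent=2))

graph = rdflib.Graph()
graph.parse('hdf_meta.jsonld', format='json-ld')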

Note that we need to provide the prefixes in the query if the JSON-LD data/file does not include the context.

In [None]:
res = graph.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX m4i: <http://w3id.org/nfdi4ing/metadata4ing#>

SELECT ?id ?orcid
WHERE {
    ?id a foaf:Person .
    ?id m4i:orcidId ?orcid .
    }
""")

In [None]:
for r in res:
    print(r)
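
What such a query does can be illustrated without a SPARQL engine: it matches a type pattern and a property pattern that share the variable `?id`. The triples below are hypothetical and only resemble what the file above would export; a real graph is queried with `rdflib` as shown:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
M4I_ORCID_ID = "http://w3id.org/nfdi4ing/metadata4ing#orcidId"
ORCID = "https://orcid.org/0000-0001-8729-0482"

# Hypothetical triples (subject, predicate, object)
triples = [
    (ORCID, RDF_TYPE, FOAF_PERSON),
    (ORCID, M4I_ORCID_ID, ORCID),
    (ORCID, "http://xmlns.com/foaf/0.1/firstName", "Matthias"),
    ("https://example.org/processing_info", RDF_TYPE,
     "http://w3id.org/nfdi4ing/metadata4ing#ProcessingStep"),
]

# First pattern, "?id a foaf:Person": collect all subjects typed as foaf:Person
persons = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF_PERSON}

# Second pattern: the ORCID predicate, joined with the first pattern on the same subject
results = [(s, o) for s, p, o in triples if p == M4I_ORCID_ID and s in persons]

for person_id, orcid in results:
    print(person_id, orcid)
```

The processing step is not returned because it fails the type pattern, which is the same filtering the `WHERE` clause expresses.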

## 4. Examples:

### 4.1 Read metadata from JSON and write to HDF5

Suppose we want to store information about the software used in the HDF5 file. It exists as a JSON-LD file based on the codemeta ontology. For this example, we use the *h5rdmtoolbox* codemeta.json file from the GitHub repository:

In [None]:
from h5rdmtoolbox.utils import download_file
from pprint import pprint

Download the file:

In [None]:
codemeta_url = 'https://raw.githubusercontent.com/matthiasprobst/h5RDMtoolbox/main/codemeta.json'
dowloaded_filename = download_file(codemeta_url)

Read the data with `ontolutils.dquery`:

In [None]:
from ontolutils import dquery

In [None]:
data = dquery(subject='schema:SoftwareSourceCode',
              source=dowloaded_filename,
              context={"schema": "http://schema.org/"})
pprint(data[0])
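
Conceptually, selecting by subject type means picking those nodes of a JSON-LD document whose `@type` matches. The following sketch shows this idea with the standard library only. It is not how `dquery` is implemented, and both the document and the helper `select_by_type` are hypothetical:

```python
import json

# Hypothetical, shortened codemeta-like document (not the real codemeta.json)
raw = """{
    "@context": {"schema": "http://schema.org/"},
    "@graph": [
        {"@type": "schema:SoftwareSourceCode", "schema:name": "example-package", "schema:version": "0.1.0"},
        {"@type": "schema:Person", "schema:givenName": "Jane"}
    ]
}"""


def select_by_type(document, type_name):
    """Return all nodes of the "@graph" list whose "@type" equals or contains type_name."""
    selected = []
    for node in document.get("@graph", []):
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]  # "@type" may be a single string or a list
        if type_name in types:
            selected.append(node)
    return selected


software = select_by_type(json.loads(raw), "schema:SoftwareSourceCode")
print(software[0]["schema:name"])
```

The sketch compares compact names literally, so it assumes that the document and the query use the same prefix for the same namespace.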

The data are written into the HDF5 file by using `jsonld.to_hdf()`:

In [None]:
from h5rdmtoolbox.wrapper import jsonld

In [None]:
with h5tbx.File('test.hdf', 'w') as h5:
    jsonld.to_hdf(data=data[0],
                 grp=h5.create_group('software_code'))
    h5.dump(False)

In [None]:
import ontolutils

In [None]:
with h5tbx.File(mode='w') as h5:
    _ = h5.create_dataset('test_dataset', data=np.array([[1, 2], [3, 4], [5.4, 1.9]]))
    h5.create_dataset('grp/subgrp/vel', data=4)
    h5.attrs['name', ontolutils.SCHEMA.name] = 'test attr'

    ttl = h5tbx.serialize(h5.filename, structural=True, format="ttl")
print(ttl)

## Describing attribute meanings without RDF

Sometimes, no IRI is defined (yet), but an additional comment on the attribute is still needed. This can be done as follows:

In [None]:
with h5tbx.File() as h5:
    grp = h5.create_group('contact')

    # Set an attribute as usual
    grp.attrs['type'] = 'Contact'

    # Update the attribute definition afterwards:
    grp.rdf['type'].definition = 'The role of the Person'

    # Alternatively, it can be assigned simultaneously via h5tbx.Attribute:
    grp.attrs['fname'] = h5tbx.Attribute(value='Matthias',
                                        definition='The first name of the contact')
    h5.dump(False)

    jdict = h5.dump_jsonld(indent=2)

In [None]:
print(jdict)