# FAIR Attributes

According to [F1 of the *FAIR Principles*](https://www.go-fair.org/fair-principles/f1-meta-data-assigned-globally-unique-persistent-identifiers/), (meta)data, and thus HDF5 attributes, shall be assigned globally unique and persistent identifiers.

Here's what www.go-fair.org says about it:

*"Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset. In this context, identifiers consist of an internet link (e.g., a URL that resolves to a web page that defines the concept such as a particular human protein). Many data repositories will automatically generate globally unique and persistent identifiers to deposited datasets. Identifiers can help other people understand exactly what you mean, and they allow computers to interpret your data in a meaningful way (i.e., computers that are searching for your data or trying to automatically integrate them). Identifiers are essential to the human-machine interoperation that is key to the vision of Open Science. In addition, identifiers will help others to properly cite your work when reusing your data."*

The *h5rdmtoolbox* allows assigning identifiers to attributes (and their data). For this, both the name and the value of an attribute may be given an IRI (internationalized resource identifier). The following outlines how this is done.

## Concept

We can interpret HDF5 objects, their attribute names and attribute values as [RDF triples](https://en.wikipedia.org/wiki/Semantic_triple) (subject-predicate-object), where...
- ... a group or dataset is a *subject*
- ... the attribute name is a <u>predicate</u>
- ... and the attribute value is an **object**
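
The mapping above can be sketched in plain Python. This is only an illustration of the triple idea and not how the toolbox stores anything: the dictionary is a hypothetical stand-in for an HDF5 group, and `to_triples` is a helper name made up for this example.

```python
# Minimal sketch: an HDF5 object with attributes, viewed as RDF-style triples.
# The dictionary is a hypothetical stand-in for a real HDF5 group.
hdf_object = {
    "name": "/contact",
    "attrs": {"orcid": "https://orcid.org/0000-0001-8729-0482"},
}


def to_triples(obj):
    """Return one (subject, predicate, object) tuple per attribute."""
    return [(obj["name"], key, value) for key, value in obj["attrs"].items()]


triples = to_triples(hdf_object)
for subject, predicate, value in triples:
    print(subject, predicate, value)
```

Here the group name plays the role of the subject, the attribute name the predicate and the attribute value the object. None of the three is globally unique yet, which is what the IRIs introduced below are for.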

In the following, we would like to describe the content of an HDF5 file. There will be a dataset of random data generated by a person, who can be identified/described by a researcher ID (ORCID).

We as humans may understand the content of such an HDF5 file. For machines to interpret the data, we need to associate [URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier) with the HDF5 objects. In fact, sometimes it may not even be clear to humans what is meant by a certain attribute. A URI removes this ambiguity. Think of the attribute "contact", which we will define below: is it a person or an organization? Note that the terms URI and [IRI](https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier) are used synonymously here. An IRI extends a URI by expanding the set of permitted characters.
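
As a small illustration of the relation between the two, the sketch below maps an IRI containing non-ASCII characters to a URI by percent-encoding the UTF-8 bytes. The address `https://example.org/vocab/größe` is a made-up example, and only the path is handled (internationalized host names would need separate treatment).

```python
from urllib.parse import quote, unquote

# Hypothetical IRI with non-ASCII characters in the path
iri = "https://example.org/vocab/größe"

# Percent-encode everything outside the URI character set;
# ':' and '/' are kept so that scheme and path separators survive.
uri = quote(iri, safe=":/")
print(uri)

# Decoding the URI gives back the original IRI
assert unquote(uri) == iri
```

For identifiers that only use ASCII characters, such as all IRIs in this tutorial, IRI and URI are the same string.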

Let's build the example step by step. We start with **creating the group "contact"**:

In [21]:
import h5rdmtoolbox as h5tbx

## Describing an HDF5 file with persistent metadata

### Example part 1: A contact person
The file is written by an author. We create a group that contains all relevant contact data, i.e. the ORCID. The content of the group thus describes the contact person and therefore *is* a person. The group itself, however, gets the predicate *has author*:

In [22]:
with h5tbx.File(mode='w') as h5:
    grp = h5.create_group('contact', attrs=dict(orcid='https://orcid.org/0000-0001-8729-0482'))
    # enrich with URIs:
    # 1. the group is related to the file by the predicate "has author":
    grp.iri.predicate = 'https://schema.org/author'
    # 2. the content of the group describes a foaf:Person:
    grp.iri.subject = 'http://xmlns.com/foaf/0.1/Person'
    # 3. the attribute name "orcid" gets a predicate IRI:
    grp.iri.predicate['orcid'] = 'http://w3id.org/nfdi4ing/metadata4ing#orcid'
    # 4. name and IRI can also be given together when writing an attribute:
    grp.attrs['first_name', 'http://xmlns.com/foaf/0.1/firstName'] = 'Matthias'
    h5.dump(collapsed=False)

hdf_filename = h5.hdf_filename

Using the `iri` accessor, we can assign internationalized resource identifiers (IRIs) to the objects (datasets, groups, attributes). An IRI identifies a web resource and points to a definition in an ontology, e.g. "contact" is a "Person", which is defined in the FOAF ontology: 'http://xmlns.com/foaf/0.1/Person'. The person "has a researcher ID". This predicate is described in the M4i ([metadata4ing](https://nfdi4ing.pages.rwth-aachen.de/metadata4ing/metadata4ing/)) ontology: 'http://w3id.org/nfdi4ing/metadata4ing#orcid'

From now on, let's save some work and use the package `namespacelib`, which simplifies working with namespaces so that we don't have to type the full IRI. Some popular namespaces are implemented in the `rdflib` package, too:

In [23]:
from namespacelib import M4I, OBO, QUDT_UNIT, QUDT_KIND
from rdflib.namespace import FOAF

As a result, we can type the following:

In [24]:
M4I.orcidId  # equal to http://w3id.org/nfdi4ing/metadata4ing#orcidId

rdflib.term.URIRef('http://w3id.org/nfdi4ing/metadata4ing#orcidId')
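
To make the idea behind such namespace objects tangible, here is a minimal sketch in plain Python. It is not how `namespacelib` or `rdflib` are implemented; the class `PrefixNamespace` and the name `FOAF_SKETCH` are hypothetical and only show how attribute access can be turned into a full IRI.

```python
class PrefixNamespace:
    """Hypothetical helper: builds full IRIs from a base IRI and a term name."""

    def __init__(self, base):
        self._base = base

    def __getattr__(self, term):
        # Only called for names that are not regular attributes,
        # e.g. FOAF_SKETCH.Person
        if term.startswith("_"):
            raise AttributeError(term)
        return self._base + term


FOAF_SKETCH = PrefixNamespace("http://xmlns.com/foaf/0.1/")
person_iri = FOAF_SKETCH.Person
print(person_iri)
```

Unlike this sketch, real namespace packages usually know which terms an ontology defines and can therefore warn about or reject misspelled names.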

### Example part 2: A random data dataset
Next, we add the random data dataset with units. We can even describe what type the data is. In our case it shall be velocity data. Without this specification it would otherwise not be clear to the user (or a machine):

In [25]:
import numpy as np

with h5tbx.File(hdf_filename, mode='r+') as h5:    
    ds = h5.create_dataset('random_velocity', data=np.random.random(100))
    ds.attrs.create('units',
                    predicate=M4I.hasUnit,
                    data='m/s',
                    object=QUDT_UNIT.M_PER_SEC)
    ds.attrs.create('quantity_kind',
                     data='velocity',
                     predicate=M4I.hasKindOfQuantity,
                     object=QUDT_KIND.Velocity)

    h5.dump(collapsed=False)

Now, let's go further and describe how the random dataset was created and that the contact was involved in it:

In [26]:
from datetime import datetime

with h5tbx.File(hdf_filename, mode='r+') as h5:  
    proc = h5.create_group('processing_info')
    proc.iri.subject = M4I.ProcessingStep
    proc.attrs['has_participants', OBO.has_participant] = h5['contact']
    start_time = datetime.today()
    end_time = datetime.today()
    proc.attrs.create('start_time', data=start_time,
                      predicate='https://schema.org/startTime')
    proc.attrs.create('end_time', data=end_time,
                      predicate='https://schema.org/endTime')
    proc.attrs['output', 'http://purl.obolibrary.org/obo/RO_0002234'] = h5['random_velocity'].name

In [27]:
h5tbx.dump(hdf_filename, collapsed=False)

## How to make use of the FAIR HDF5 file?

The IRI assignments above help us in three ways:
1. Visual inspection by dumping the content to screen: this outlines the file (meta) content, and clicking on an attribute with an IRI leads to the definition of that attribute (or its data).
2. Extraction of a *JSON-LD* file: this is useful for other processes. We can also investigate this file further with tools like [JSON-LD-playground](https://json-ld.org/playground/).
3. Access to the IRIs in (Python) code.

### 1. Visual inspection

The *dump()* method now adds IRI icons. Click on an icon to get redirected to the respective resource:

In [28]:
h5tbx.dump(hdf_filename, collapsed=False)

### 2. JSON-LD extraction

Write the JSON-LD file and share it with others or upload it to a repository. The toolbox provides `dump` methods through the `jsonld` module. The output might look a bit overwhelming, but dedicated scripts can work with it perfectly well, while humans can still read it (with a bit of practice and patience...):

In [29]:
from h5rdmtoolbox import jsonld

In [30]:
print(
    jsonld.dumps(
        hdf_filename,
        indent=2,
        context={'m4i': 'http://w3id.org/nfdi4ing/metadata4ing#',
                 'foaf': 'http://xmlns.com/foaf/0.1/'}
    )
)

{
  "@context": {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "m4i": "http://w3id.org/nfdi4ing/metadata4ing#"
  },
  "@graph": [
    {
      "@id": "https://www.local-domain.org:/contact",
      "@type": "foaf:Person",
      "foaf:firstName": "Matthias",
      "m4i:orcid": "https://orcid.org/0000-0001-8729-0482"
    },
    {
      "@id": "https://www.local-domain.org:/",
      "https://w3id.org/okn/o/sd#SoftwareVersion": "1.2.3",
      "m4i:hasParameter": {
        "@id": "https://www.local-domain.org:/random_velocity"
      }
    },
    {
      "@id": "https://www.local-domain.org:/random_velocity",
      "@type": "http://www.molmod.info/semantics/pims-ii.ttl#Variable",
      "m4i:hasKindOfQuantity": "velocity",
      "m4i:hasUnit": "m/s"
    },
    {
      "@id": "https://www.local-domain.org:/processing_info",
      "@type": "m4i:ProcessingStep",
      "http://purl.obolibrary.org/obo/RO_0000057": {
        "@id": "https://www.local-domain.org:/contact"
      },
      "http://purl.

Dump it to a file rather than to the screen:

In [31]:
with open('hdf_meta.jsonld', 'w') as f:
    jsonld.dump(hdf_filename, f, indent=2,
                context={'m4i': 'http://w3id.org/nfdi4ing/metadata4ing#',
                         'foaf': 'http://xmlns.com/foaf/0.1/'})
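
The `context` argument used above only affects how IRIs are written: entries such as `foaf:Person` are short forms of full IRIs. The following sketch shows this prefix mechanism with plain Python. The helpers `expand` and `compact` are hypothetical names for this illustration and cover only simple prefix mappings, not the full JSON-LD algorithms.

```python
context = {
    "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(term, context):
    """Expand a compact term like 'foaf:Person' into a full IRI."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term  # already a full IRI or an unknown prefix


def compact(iri, context):
    """Shorten a full IRI with the first matching prefix of the context."""
    for prefix, base in context.items():
        if iri.startswith(base):
            return f"{prefix}:{iri[len(base):]}"
    return iri  # no prefix applies, keep the full IRI


print(expand("foaf:Person", context))
print(compact("http://w3id.org/nfdi4ing/metadata4ing#orcid", context))
```

IRIs without a matching prefix, such as the `purl.obolibrary.org` predicates in the output above, stay in their full form.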

### 3. Access IRI in code

You may want to access the IRI of an attribute with Python within the HDF5 file. E.g. while working with the file, you may ask "Hey, what is 'contact' exactly?" or "What does the attribute 'orcid' mean?"

In [32]:
with h5tbx.File(hdf_filename) as h5:
    person_iri = h5.contact.iri.subject
    orcid_iri = h5.contact.iri['orcid']

... Well "contact" is a "Person" defined by the FOAF ontology:

In [33]:
person_iri

rdflib.term.URIRef('http://xmlns.com/foaf/0.1/Person')

... and "orcid" is a predicate defined by the metadata4ing ontology:

In [34]:
orcid_iri

{'predicate': 'http://w3id.org/nfdi4ing/metadata4ing#orcid', 'object': None}

#### 3.1 Find data based on IRIs

The JSON-LD file written above can be loaded into an `rdflib` graph and queried with SPARQL:

In [35]:
import rdflib.graph as g

graph = g.Graph()
graph.parse('hdf_meta.jsonld', format='json-ld')

<Graph identifier=N09a0501397c849dbac05ca6cd92f8bb9 (<class 'rdflib.graph.Graph'>)>

Note that we need to provide the prefixes in the query if the JSON-LD data/file does not include the context.

In [36]:
res = graph.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX m4i: <http://w3id.org/nfdi4ing/metadata4ing#>

SELECT ?id ?orcid
WHERE {
    ?id a foaf:Person .
    ?id m4i:orcid ?orcid .
}
""")

In [37]:
for r in res:
    print(r)

(rdflib.term.URIRef('https://www.local-domain.org:/contact'), rdflib.term.Literal('https://orcid.org/0000-0001-8729-0482'))
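
What the query does can be mimicked with plain Python on a hand-written list of triples. This is only a sketch of the matching logic, not of how `rdflib` evaluates SPARQL. The triples are copied from the JSON-LD output above, and all variable names are chosen for this example.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
M4I = "http://w3id.org/nfdi4ing/metadata4ing#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # the "a" in SPARQL

triples = [
    ("https://www.local-domain.org:/contact", RDF_TYPE, FOAF + "Person"),
    ("https://www.local-domain.org:/contact", M4I + "orcid",
     "https://orcid.org/0000-0001-8729-0482"),
    ("https://www.local-domain.org:/processing_info", RDF_TYPE,
     M4I + "ProcessingStep"),
]

# first pattern: ?id a foaf:Person
persons = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF + "Person"}

# second pattern: ?id m4i:orcid ?orcid (same ?id as in the first pattern)
matches = [(s, o) for s, p, o in triples if s in persons and p == M4I + "orcid"]

for match in matches:
    print(match)
```

Both patterns share the variable `?id`, so only subjects that are a `foaf:Person` and also have an `m4i:orcid` value are returned.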


## 4. Examples

### 4.1 Read metadata from JSON and write to HDF5

Suppose we want to store information about the software used in the HDF5 file. It exists as a JSON-LD file based on the codemeta ontology. For this example, we use the *h5rdmtoolbox* codemeta.json file from the GitHub repository:

In [38]:
codemeta_url = 'https://raw.githubusercontent.com/matthiasprobst/h5RDMtoolbox/main/codemeta.json'

The interface class is called *Metadata*. It allows reading from JSON(-LD) files:

In [39]:
from h5rdmtoolbox.convention import Metadata

from h5rdmtoolbox.utils import download_file

In [40]:
downloaded_filename = download_file(codemeta_url, None)
m = Metadata.from_json(filename=downloaded_filename)

AttributeError: type object 'Metadata' has no attribute 'from_json'

Simply open an HDF5 file and call the *write()* method:

In [None]:
with h5tbx.File() as h5:
    grp = h5.create_group('software_info')
    m.write(grp)
    h5.dump(False)    

### 4.2 Fill out a metadata template (MetadataModel)

In the next example, we don't yet have data stored in a file, but we got a template file. The template defines which fields are required. We should rather call it *model*, because it technically uses the *BaseModel* class from *pydantic*.

The most common use case is that a metadata model is provided as a JSON file. We first need to create one. For this example, we want to define metadata fields to store personal information. We expect the following fields (also shown are the types and defaults):


| field name | type | default |
|------------|:----:|:-------:|
| first_name | str | |
| last_name | str | |
| age | a positive integer | |
| mailbox | a valid email string | None |
| website | a valid http url | None |
| interests | string or a list of strings | 'programming' |
| orcidid | str | None |

Note that a field without a default is required. If the default is None, the field is optional. If a default value is given (see *interests*), this value is used if no other is provided.

In [None]:
from pydantic import EmailStr, HttpUrl, PositiveInt
from typing import List
from rdflib.namespace import FOAF

#### Construction of a metadata model file

Below, the construction of a metadata model for our "User"-example is given. The entries will be explained afterwards.

In [None]:
user_model = {
    '@context': {
        'first_name': str(FOAF.firstName),
        'last_name': str(FOAF.lastName)
    },
    '@type': str(FOAF.Person),
    'orcidid': ['str', None],  # syntax: [TYPE, DEFAULT]
    'first_name': 'str',
    'last_name': 'str',
    'interests': ['Union[str, List[str]]', 'programming'],
    'age': 'PositiveInt',
    'mailbox': ['EmailStr', None],
    'website': ['HttpUrl', None]
}

**1. Data type**<br>
The general data type to define a model is a JSON dictionary.

**2. Special fields**<br>
Two special fields can be found (while they are optional, it is recommended to provide them!):
- @context: Allows to define IRIs for data keys (like in a JSON-LD file)
- @type: The IRI for the model

All other fields are the expected fields for the user (first name, ...)

**3. Types and defaults**<br>
The value of each metadata field (e.g. *first_name*) must at least be a string, which states the type (e.g. "str"). A default value can also be defined. For this, a tuple or list of two entries needs to be given: [`<type>`, `<default>`]. The default can be None, which makes the field optional.

**3.1 Special types**<br>
Beyond *int*, *str* or *float*, specific types defined in the packages *typing* or *pydantic* can be used. Examples from the above code are "Union" and "EmailStr". We can also use our own models. This is shown in a later example.
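
The `[TYPE, DEFAULT]` syntax described above can be read as follows. This is a sketch of the rule as stated in this section, not the parser used by the toolbox, and `parse_field_spec` is a hypothetical helper name.

```python
def parse_field_spec(spec):
    """Return (type_name, default, required) for one entry of a model dictionary.

    A plain string only states the type, so the field is required.
    A two-entry list or tuple states type and default, so the field can be omitted.
    """
    if isinstance(spec, str):
        return spec, None, True
    type_name, default = spec
    return type_name, default, False


print(parse_field_spec("str"))
print(parse_field_spec(["EmailStr", None]))
print(parse_field_spec(["Union[str, List[str]]", "programming"]))
```

The type is still only a name at this point. Turning names like "EmailStr" into actual types and validating values against them is left to the model class.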

To mimic the use case, which expects JSON files rather than dictionaries, let's write it to a file:

In [None]:
import json
from pprint import pprint

fname = h5tbx.utils.generate_temporary_filename(suffix='.json')
with open(fname, 'w') as f:
    json.dump(user_model, f)

#### Create a model class

To create a model class from the JSON file, call the class method `from_json` and pass the filename and the name of the new class:

In [None]:
from h5rdmtoolbox.convention import MetadataModel

In [None]:
UserName = MetadataModel.from_json(fname, 'UserName')

#### Create a user

To instantiate the metadata model (so a user in our case), call the model class and pass the fields as keyword arguments:

In [None]:
john_doe = UserName(first_name='John', last_name='Doe', age=32, orcidid='https://orcid.org/0000-0001-8729-0482')
john_doe

#### Validation
We provided types for a reason: the metadata fields are validated (pydantic does this in the background). Examples of invalid users are:

In [None]:
# invalid age!
try:
    UserName(first_name='John', last_name='Doe', age=-3)
except Exception as e:
    print(e)

In [None]:
# invalid website!
try:
    UserName(first_name='John', last_name='Doe', age=30, website='invalid')
except Exception as e:
    print(e)

#### Writing the metadata model to HDF5

Let's write the data to an HDF5 file:

In [None]:
with h5tbx.File() as h5:
    grp = h5.create_group('john')
    john_doe.write(grp)
    h5.dump(False)

### 4.3 Reusing models

In the previous example, we created a simple user. Now we want to create a new user class with the field *affiliation*, whose type is itself defined by a model. For this, we first create the affiliation model. For the sake of simplicity, we will not use @context and @type and only use a few fields:

In [None]:
affiliation_model_dict = {
    'name': "str",
    'url': ["HttpUrl", None],
}

aff_fname = h5tbx.utils.generate_temporary_filename(suffix='.json')
with open(aff_fname, 'w') as f:
    json.dump(affiliation_model_dict, f)

In [None]:
Affiliation = MetadataModel.from_json(aff_fname, 'Affiliation')

Now that we have created the *Affiliation* model, we can create a new *User* model and add the field *affiliation*:

In [None]:
user_model_dict = {
    'first_name': "str",
    'affiliation': "Affiliation",
}

user_fname = h5tbx.utils.generate_temporary_filename(suffix='.json')
with open(user_fname, 'w') as f:
    json.dump(user_model_dict, f)

The user model is created as always. However, we now need to pass the user-defined types, in our case "Affiliation".

In [None]:
user = MetadataModel.from_json(user_fname, 'User', user_types={'Affiliation': Affiliation})

Test user creations:

In [None]:
user(first_name='John', affiliation={'name': 'My Institution'})

In [None]:
try:
    user(first_name='John')
except Exception as e:
    print(e)  # We are missing the affiliation entry!

**Note** that the affiliation is created automatically as a sub-group when data is written to HDF5:

In [None]:
john_with_affiliation = user(first_name='John', affiliation={'name': 'My Institution'})
with h5tbx.File() as h5:
    grp = h5.create_group('users')
    john_with_affiliation.write(grp)
    h5.dump(False)
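
The sub-group behaviour can be pictured with a small plain-Python sketch. It assumes that simple values end up as attributes of the group and that a nested model becomes a sub-group named after the field. The helper `flatten` is hypothetical and does not reproduce the toolbox's implementation.

```python
def flatten(values, path):
    """Map nested dictionaries to {group_path: {attribute_name: value}}."""
    groups = {path: {}}
    for key, value in values.items():
        if isinstance(value, dict):
            # a nested model becomes a sub-group named after the field
            groups.update(flatten(value, f"{path}/{key}"))
        else:
            # simple values become attributes of the current group
            groups[path][key] = value
    return groups


layout = flatten(
    {"first_name": "John", "affiliation": {"name": "My Institution"}},
    "/users",
)
for group_path, attrs in layout.items():
    print(group_path, attrs)
```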