# FAIR Attributes

According to [F1 of the *FAIR Principles*](https://www.go-fair.org/fair-principles/f1-meta-data-assigned-globally-unique-persistent-identifiers/) attributes shall be assigned to globally unique and persistent identifiers.

Here's what www.go-fair.org says about it:

*"Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset. In this context, identifiers consist of an internet link (e.g., a URL that resolves to a web page that defines the concept such as a particular human protein). Many data repositories will automatically generate globally unique and persistent identifiers to deposited datasets. Identifiers can help other people understand exactly what you mean, and they allow computers to interpret your data in a meaningful way (i.e., computers that are searching for your data or trying to automatically integrate them). Identifiers are essential to the human-machine interoperation that is key to the vision of Open Science. In addition, identifiers will help others to properly cite your work when reusing your data."*

The *h5rdmtoolbox* allows assigning attributes (and their data) to identifiers. For this, each name and value of an attribute may obtain an IRI (internationalized resource identifier). The following outlines, how it is done.

## Concept

We understand HDF5 objects, attributes and attribute values as RDF triples:
- obj (dataset/group) $\rightarrow$ *subject*
- attribute $\rightarrow$ <u>predicate</u>
- data $\rightarrow$ **object**

Then, we can "explain" the data in the following way:
- The *group "contact"* <u>is</u> **Person**
- The *group "contact"* <u>has ORCiD</u> **\<value\>**
- The *dataset "u"* <u>hasUnit</u> **"m/s"**
- *"m/s"* <u>is</u> **"https://qudt.org/vocab/unit/M-PER-SEC"**
- The *dataset "u"* <u>has kind of quantity</u> **Velocity** (defined by qudt)
- etc.

Let's build such a file:

In [1]:
import h5rdmtoolbox as h5tbx

## Associate IRI to attributes

An IRI can be assigned during or after attribute creation. Various possibilities are shown below.

Note, that you can type the IRIs yourself, however, it is safer to use implemented namespace objects as provided by *rdflib* (e.g. FOAF). The toolbox provides the NFDI4Ing-supported ontology **metadata4ing (m4i)** which is very useful for engineering data. The toolbox implements it in the same way as *rdflib* does:

In [2]:
from h5rdmtoolbox.namespace import M4I, OBO, QUDT_UNIT, QUDT_QUANTITYKIND
from rdflib.namespace import FOAF

In [3]:
import numpy as np
import time
from datetime import datetime

hdf_filename = h5tbx.utils.generate_temporary_filename()

with h5tbx.File(hdf_filename) as h5:
    person = h5.create_group('contact')
    person.iri = FOAF.Person
    person.attrs.create(name='first_name',
                        predicate=FOAF.firstName,
                        data='Matthias')
    person.attrs.create(name='orcid',
                        predicate=M4I.orcidId,
                        data='https://orcid.org/0000-0001-8729-0482')

    st = datetime.now()
    time.sleep(2)
    ds = h5.create_dataset('random_velocity', data=np.random.random(100))
    et = datetime.now()
    ds.attrs.create('units', predicate=M4I.hasUnit,
                    data='m/s',
                    object=QUDT_UNIT.M_PER_SEC)
    ds.attrs.create('quantity_kind',
                     data='velocity',
                     predicate=M4I.hasKindOfQuantity,
                     object=QUDT_QUANTITYKIND.Velocity)

    proc = h5.create_group('proc_random_number')
    proc.iri = M4I.ProcessingStep
    proc.attrs['has_participants', OBO.has_participant] = h5['contact']
    proc.attrs.create('start_time', data=st,
                      predicate='https://schema.org/startTime')
    proc.attrs.create('end_time', data=et,
                      predicate='https://schema.org/startTime')
    proc.attrs['output', 'http://purl.obolibrary.org/obo/RO_0002234'] = ds
    

## Make use of FAIR metadata

There are three ways, how the above assignment helps us and how we might want to use IRIs:
1. Visual inspection by dumping the content to screen: This will outline the file (meta) content and we can click on the attributes with IRIs, which will explain the attribute (data)
2. We can extract a *JSON-LD* file. This is useful for other processes. We can also investigate this file further with tools like [JSON-LD-playground](https://json-ld.org/playground/).
3. Access IRI in (Python) code

### 1. Visual inspection

The *dump()* method will now add IRI-icons. Click on it and get redirected to the resources:

In [4]:
h5tbx.dump(hdf_filename, collapsed=False)

### 2. JSON-LD extraction

In [5]:
from h5rdmtoolbox import jsonld

In [6]:
print(jsonld.dumps(hdf_filename, indent=2))

[
  {
    "@id": "h5name:/proc_random_number",
    "@type": [
      "http://w3id.org/nfdi4ing/metadata4ing#ProcessingStep"
    ],
    "http://purl.obolibrary.org/obo/RO_0000057": [
      {
        "@id": "h5name:/contact"
      }
    ],
    "http://purl.obolibrary.org/obo/RO_0002234": [
      {
        "@id": "h5name:/random_velocity"
      }
    ],
    "https://schema.org/startTime": [
      {
        "@value": "2024-01-09T16:33:24.446128"
      },
      {
        "@value": "2024-01-09T16:33:22.439550"
      }
    ]
  },
  {
    "@id": "h5name:/random_velocity",
    "@type": [
      "http://www.molmod.info/semantics/pims-ii.ttl#Variable"
    ],
    "http://w3id.org/nfdi4ing/metadata4ing#hasKindOfQuantity": [
      {
        "@value": "velocity"
      }
    ],
    "http://w3id.org/nfdi4ing/metadata4ing#hasUnit": [
      {
        "@value": "m/s"
      }
    ]
  },
  {
    "@id": "h5name:/",
    "http://w3id.org/nfdi4ing/metadata4ing#hasParameter": [
      {
        "@id": "h5name:/rand

## 3. Access IRI in code

You may want to access the IRI of an attribute with the Python:

In [7]:
with h5tbx.File(hdf_filename) as h5:
    person_iri = h5.contact.iri.subject
    orcid_iri = h5.contact.iri['orcid']

In [8]:
person_iri

rdflib.term.URIRef('http://xmlns.com/foaf/0.1/Person')

In [9]:
orcid_iri

{'predicate': 'http://w3id.org/nfdi4ing/metadata4ing#orcidId', 'object': None}

## 4. Examples:

### 4.1 Read metadata from JSON and write to HDF5

Suppose we want to store information about the used software to the HDF5 file. It exists as a JSON-LD file based on the codemeta ontolog. For this example, we use the *h5rdmtoolbox* codemeta.json file from the github repository:

In [10]:
codemeta_url = 'https://raw.githubusercontent.com/matthiasprobst/h5RDMtoolbox/main/codemeta.json'

The interface class is called *Metadata*. It allows to read from JSON(-LD) files:

In [12]:
from h5rdmtoolbox.convention import Metadata

from h5rdmtoolbox.utils import download_file

In [13]:
dowloaded_filename = download_file(codemeta_url, None)
m = Metadata.from_json(filename=dowloaded_filename)



Simply open an HDF5 file and call the *write()* method:

In [14]:
with h5tbx.File() as h5:
    grp = h5.create_group('software_info')
    m.write(grp)
    h5.dump(False)    

### 4.2 Fill out a metadata template (MetadataModel)

In the next example, we don't yet have data stored in a file, but we got a template file. The template defines which fields are required. We should rather call it *model*, because it technically uses the *BaseModel* class from *pydantic*.

The most common use case is, that a metadata model is provided as a JSON file. We first need to create one. For this example, we want to define metadata fields to store personal information. We expect the following fields (also shown are the types and defaults):


|   field name  | type | default | 
|---------|:-:|:-:|
| first_name | str |   |
| last_name     | str |  |
| age    | a positive integer |  | 
| mailbox    | A valid email string | None   |
| website   | A valid http url |None|
| interests   | string or a list of strings | 'programming' |

Note, if the default is None, the field is optional. If a value is given (see *interests*), this value is used if no other is given.

In [15]:
from pydantic import EmailStr, HttpUrl, PositiveInt
from typing import List
from rdflib.namespace import FOAF

#### Construction of a metadata model file

Below, the construction of a metadata model for our "User"-example is given. The entries will be explained afterwards.

In [16]:
user_model = {
    '@context': {
        'first_name': str(FOAF.firstName),
        'last_name': str(FOAF.lastName)
    },
    '@type': str(FOAF.Person),
    'orcidid': ['str', None],  # syntax: [TYPE, DEFAULT]
    'first_name': 'str',
    'last_name': 'str',
    'interests': ['Union[str, List[str]]', 'programming'],
    'age': 'PositiveInt',
    'mailbox': ['EmailStr', None],
    'website': ['HttpUrl', None]
}

**1. Data type**<br>
The general data type to define a model is a JSON dictionary.

**2. Special fields**<br>
Two special fields can be found (while they are optional, it is recommended to provide them!):
- @context: Allows to define IRIs for data keys (like in a JSON-LD file)
- @type: The IRI for the model

All other fields are the expected fields for the user (first name, ...)

**3. Types and defaults**<br>
The value of each metadata field (e.g. *first_name*) must at least be a string, which states the type (e.g. "str"). A default value can also be defined. For this, a tuple or list or two entries need to be given [`<type>`, `<default>`]. The default can be None, which makes the field optional

**3.1 Special types**
Beyond *int*, *str* or *float* specific types defined in the package *typing* or *pydantic* can be used. Examples from the above code are "Union", or "EmailStr". We can also use our own models. This is shown in a later example.

To mimic the use case, which expects JSON files rather than dictionaries, let's write it to a file:

In [17]:
import json
from pprint import pprint

fname = h5tbx.utils.generate_temporary_filename(suffix='.json')
with open(fname, 'w') as f:
    json.dump(user_model, f)

#### Create a model class

To instantiate a metadata model, call the class method `from_json`:

In [19]:
from h5rdmtoolbox.convention import MetadataModel

In [20]:
UserName = MetadataModel.from_json(fname, 'UserName')

#### Create a user

To instantiate a metadata model (so a user in our case), call the class method `from_json`:

In [21]:
john_doe = UserName(first_name='John', last_name='Doe', age=32, orcidid='https://orcid.org/0000-0001-8729-0482')
john_doe

{'orcidid': 'https://orcid.org/0000-0001-8729-0482', 'first_name': 'John', 'last_name': 'Doe', 'interests': 'programming', 'age': 32}

#### Validation
We did provide types for a reason: The metadata fields are validated (pydantic does it in the background). Examples for invalid users are:

In [22]:
# invalid age!
try:
    UserName(first_name='John', last_name='Doe', age=-3)
except Exception as e:
    print(e)

1 validation error for UserName
age
  Input should be greater than 0 [type=greater_than, input_value=-3, input_type=int]
    For further information visit https://errors.pydantic.dev/2.5/v/greater_than


In [23]:
# invalid website!
try:
    UserName(first_name='John', last_name='Doe', age=30, website='invalid')
except Exception as e:
    print(e)

1 validation error for UserName
website
  Input should be a valid URL, relative URL without a base [type=url_parsing, input_value='invalid', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/url_parsing


#### Writing the metadata model to HDF5

Let's write the data to an HDF5 file:

In [24]:
with h5tbx.File() as h5:
    grp = h5.create_group('john')
    john_doe.write(grp)
    h5.dump(False)

### 4.3 Reusing models

In the previous example, we created a simple user. Now we want to create a new user class, which has the field *affiliation* which shall be a type, which we defined by a model. For this, first create the affiliation model. For the sake of simplicity, we will not use @context and @type and only use a few fields:

In [25]:
affiliation_model_dict = {
    'name': "str",
    'url': ["HttpUrl", None],
}

aff_fname = h5tbx.utils.generate_temporary_filename(suffix='.json')
with open(aff_fname, 'w') as f:
    json.dump(affiliation_model_dict, f)

In [26]:
Affiliation = MetadataModel.from_json(aff_fname, 'Affiliation')

Now, that we create the *Affiliation* model, we can create a new *User* model and add the field *affiliation*:

In [27]:
user_model_dict = {
    'first_name': "str",
    'affiliation': "Affiliation",
}

user_fname = h5tbx.utils.generate_temporary_filename(suffix='.json')
with open(user_fname, 'w') as f:
    json.dump(user_model_dict, f)

The user model is created as always. However, we now need to parse user-define types, in our case "Affiliation".

In [28]:
user = MetadataModel.from_json(user_fname, 'User', user_types={'Affiliation': Affiliation})

parsing Affiliation with <h5rdmtoolbox.convention.metadata.MetadataModel object at 0x000001BB3F0A1B80>


Test user creations:

In [29]:
user(first_name='John', affiliation={'name': 'My Institution'})

{'first_name': 'John', 'affiliation': {'name': 'My Institution'}}

In [30]:
try:
    user(first_name='John')
except Exception as e:
    print(e)  # We are missing the affiliation entry!

1 validation error for User
affiliation
  Field required [type=missing, input_value={'first_name': 'John'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/missing


**Note**, that the affiliation is created automatically as a sub-group when data is written to HDF5:

In [31]:
john_with_affiliation = user(first_name='John', affiliation={'name': 'My Institution'})
with h5tbx.File() as h5:
    grp = h5.create_group('users')
    john_with_affiliation.write(grp)
    h5.dump(False)