This tutorial shows you:
- The basic workflow with this package
- The contents and possibilities of the `Schema` class
- What the dictionary with metadata should look like
- What throws errors or warnings in validation

# Basic workflow
In the most basic use case of the package, you create a `Schema` instance with the path to your JSON schema (which should follow a specific format), you have a dictionary with name-value pairs for metadata, and you apply them to one or more iRODS objects (data objects or collections).

In this tutorial we won't show how to apply them (see the README at the top of the repository) but how to check that the metadata is compatible with the schema, what requirements are checked and what the consequences of mismatches are.

In [3]:
!pip install python-irodsclient



In [1]:
#| echo: false
import sys
sys.path.append('../')

First we import what we need.
The `Schema` class is the most important tool in this package: it reads a schema from file, validates it and lets you validate and apply metadata.
`check_metadata()` is a function called by the `apply()` method of a `Schema` but you can use it to validate a dictionary of metadata against a schema.

In [2]:
from mango_mdschema import Schema, check_metadata


Create a schema by providing the path to the file.
(In the future, `pathlib.Path` objects will also be accepted.)
See below for more information about this class.

In [4]:
my_schema = Schema('book-v2.0.0-published.json')

Provide the metadata as a dictionary of attribute names _without namespacing_ and their values (see below for specifications). If you have multiple values for the same attribute name (i.e. in a repeatable field or a multiple-value multiple-choice field), you should provide them as a list.

In [5]:
my_metadata = {
    'title' : "A book not written yet",
    'author' : {
        'name' : "Fulano De Tal",
        'email' : "fulano.detal@kuleuven.be"
    },
    'ebook' : 'Available',
    'publishing_date' : '2015-02-01'
}

Validate the metadata against the schema with `check_metadata()`. Setting the `verbose` argument to `True` also prints warnings when non-required fields or required fields with default values are not provided.
The output of this function is a list of `irods.meta.iRODSMeta` objects with namespaced attribute names.

You can now assign them to a data object or collection with atomic operations, or let this package do it by running `my_schema.apply(my_object, my_metadata)` instead. The `apply()` method also checks whether there is already metadata linked to the schema, replaces it, and updates the metadata that records the schema version, so it is more in line with what the ManGO portal does.

In [6]:
check_metadata(my_schema, my_metadata)

[<iRODSMeta None mgs.book.title A book not written yet None>,
 <iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.ebook Available None>,
 <iRODSMeta None mgs.book.publishing_date 2015-02-01 None>,
 <iRODSMeta None mgs.book.publisher Tor None>]

## The `Schema` class

The `Schema` class represents a schema. As such, it has `name` and `version` attributes, as well as the `prefix` used for all AVU names. The prefix is the combination of a prefix given in the constructor (by default 'mgs') and the name of the schema. For example, initializing with `Schema('book-v2.0.0-published.json', 'irods')` would generate the prefix 'irods.book' for all the metadata related to this schema.

In [7]:
f"Metadata annotated with the schema '{my_schema.name}' (current version: {my_schema.version}) carry the prefix '{my_schema.prefix}'."

"Metadata annotated with the schema 'book' (current version: 2.0.0) carry the prefix 'mgs.book'."

When instantiating a `Schema`, some basic validation is performed. For example, only 'published' schemas are accepted.

In [8]:
#| error: true
import json
with open('book-v3.0.0-draft.json', 'r') as f:
    draft_schema = json.load(f)
    print(draft_schema['status'])
Schema('book-v3.0.0-draft.json')

draft


ValueError: The schema is not published: it cannot be used to apply metadata!

The code also checks that the fields make sense and have the necessary fields in the right format.

In [None]:
#| error: true
with open('bad-schema.json', 'r') as f:
    bad_schema = json.load(f)
    print(bad_schema['properties'])
Schema('bad-schema.json')

{'title': {'title': 'Book title', 'type': 'title', 'required': True}}


ValueError: The type of the 'title' field is not valid.

If you are not entirely familiar with your schema, you can check its contents by printing it or listing its `required_fields` attribute. This attribute is a dictionary with the names of the required fields as keys and their default values, if available, as values. This is particularly important because you will only get errors if a required field _without default_ is not provided or the value provided for it is wrong. For required fields with defaults and non-required fields, missing values are simply ignored, and you only get warnings about them if you set `verbose` to `True`.

In [None]:
print(my_schema)

In [None]:
my_schema.required_fields # note: 'author' is required because it contains required fields

{'title': None, 'publishing_date': None, 'publisher': 'Tor', 'author': None}

A schema also has a method to check the requirements of a specific field, namely whether it is required and has a default, whether it is repeatable, and any other characteristic used in validation.

In [None]:
my_schema.check_requirements('title')

Type: text.
Required: True. Default: None.
Repeatable: False.


In [None]:
my_schema.check_requirements('cover_colors')

Type: select.
Required: False.
Repeatable: False.
Choose at least one of the following values:
- red
- blue
- green
- yellow


When checking the requirements of a composite field, it also lists the requirements of its subfields.

In [None]:
my_schema.check_requirements('author')

Type: object.
Required: True. (2 of its 3 fields are required.)
Repeatable: True.

Composed of the following fields:
name
Type: text.
Required: True. Default: None.
Repeatable: False.

age
Type: integer.
Required: False.
Repeatable: False.
integer between 12 and 99.

email
Type: email.
Required: True. Default: None.
Repeatable: True.
matching the following regex: @kuleuven.be$.


Composite fields also have `required_fields` attributes and, like schemas, a `fields` attribute listing all the fields.

In [None]:
my_schema.fields['author'].required_fields

{'name': None, 'email': None}

## Metadata format
The `metadata` argument of `check_metadata()` and `Schema.apply()` (which calls `check_metadata()`) must be a dictionary in which the keys are the names/IDs of the fields _without namespacing_ and the values are the values of the AVUs to add.

If the field is a checkbox for which multiple values have been selected _or_ a repeatable field with multiple values, then the value in the dictionary should be a list of such values. For example, the code below includes metadata for a checkbox. As you can see, this generates multiple AVUs with the same name and different values.

In [None]:
my_metadata.update({'cover_colors' : ['red', 'blue']})
check_metadata(my_schema, my_metadata)

[<iRODSMeta None mgs.book.title A book not written yet None>,
 <iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.ebook Available None>,
 <iRODSMeta None mgs.book.publishing_date 2015-02-01 None>,
 <iRODSMeta None mgs.book.cover_colors red None>,
 <iRODSMeta None mgs.book.cover_colors blue None>,
 <iRODSMeta None mgs.book.publisher Tor None>]

For composite fields, the value should be a dictionary with the same format: the keys are field names without namespacing and the values are the values of the corresponding subfields. In this case, we are providing the following values for 'author', which is a composite field:

```python
{ 'name' : 'Fulano De Tal', 'email' : 'fulano.detal@kuleuven.be' }
```

This results in two AVUs with `mgs.book.author.name` and `mgs.book.author.email` as name, respectively, the corresponding values, and `1` as unit.
The goal of the unit is to keep AVUs within the same composite field together, particularly when the composite field is repeatable. For example, we could submit two authors by providing a list with two dictionaries.
As a result, we get two AVUs with `mgs.book.author.name` and two with `mgs.book.author.email`, and the unit indicates which email goes with each name.
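
The role of the unit can be illustrated with a minimal sketch that flattens a nested dictionary into plain `(name, value, unit)` tuples. This is not the package's implementation: `flatten` and the tuple representation are hypothetical stand-ins for `check_metadata()` and `iRODSMeta`, validation is skipped, and units are assumed to be numbered from 1 as in the outputs of this tutorial.

```python
from typing import Any, Optional

AVU = tuple[str, str, Optional[str]]  # (name, value, unit); hypothetical stand-in


def flatten(prefix: str, metadata: dict[str, Any]) -> list[AVU]:
    """Turn a nested metadata dictionary into flat (name, value, unit) tuples."""
    avus: list[AVU] = []
    for field, value in metadata.items():
        name = f"{prefix}.{field}"
        # Treat a single value and a list of values uniformly.
        items = value if isinstance(value, list) else [value]
        if all(isinstance(item, dict) for item in items):
            # Composite field: each instance gets its own 1-based unit,
            # shared by all of its subfields.
            for index, instance in enumerate(items, start=1):
                for subfield, subvalue in instance.items():
                    subvalues = subvalue if isinstance(subvalue, list) else [subvalue]
                    avus.extend(
                        (f"{name}.{subfield}", str(v), str(index)) for v in subvalues
                    )
        else:
            # Simple field: one AVU per value, no unit.
            avus.extend((name, str(item), None) for item in items)
    return avus


example = {
    "title": "A book not written yet",
    "author": [
        {"name": "Fulano De Tal", "email": "fulano.detal@kuleuven.be"},
        {"name": "Jane Doe", "email": ["jane_doe@kuleuven.be", "sweetdoe@kuleuven.be"]},
    ],
}
avus = flatten("mgs.book", example)
for avu in avus:
    print(avu)
```

Because both of Jane Doe's emails carry the unit `'2'`, they can be matched back to her name even though all author AVUs share the same attribute names.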

In [None]:
my_metadata['author']

{'name': 'Fulano De Tal', 'email': 'fulano.detal@kuleuven.be'}

In [None]:
my_metadata['author'] = [
    {'name': 'Fulano De Tal', 'email': 'fulano.detal@kuleuven.be'},
    {'name': 'Jane Doe', 'email': 'jane_doe@kuleuven.be'}
]
checked_metadata = check_metadata(my_schema, my_metadata)
[x for x in checked_metadata if x.name.startswith('mgs.book.author')]

[<iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.name Jane Doe 2>,
 <iRODSMeta None mgs.book.author.email jane_doe@kuleuven.be 2>]

Actually, the email of the author is also a repeatable field, so we could get more instances of `mgs.book.author.email`, always with the unit indicating who it belongs to.

In [None]:
my_metadata['author'][1]['email'] = ['jane_doe@kuleuven.be', 'sweetdoe@kuleuven.be']
checked_metadata = check_metadata(my_schema, my_metadata)
[x for x in checked_metadata if x.name.startswith('mgs.book.author')]

[<iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.name Jane Doe 2>,
 <iRODSMeta None mgs.book.author.email jane_doe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.author.email sweetdoe@kuleuven.be 2>]

## Metadata validation

There are two levels of validation for metadata: presence and appropriateness. They are applied both at the level of the schema and on each composite field.

- If a required field is missing, and there is no default value, it will throw an error.

In [None]:
#| error: true
check_metadata(my_schema, {'title' : 'I only have a title'})

KeyError: 'The following required fields are missing and there is no default: mgs.book.publishing_date, mgs.book.author.'

- If a required field is missing and there is a default value, you will only get a warning if `verbose` is `True`; the default value is then used.
- If a non-required field is missing, you will only get a warning if `verbose` is `True`.

In [None]:
check_metadata(my_schema, my_metadata, verbose = True)



[<iRODSMeta None mgs.book.title A book not written yet None>,
 <iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.name Jane Doe 2>,
 <iRODSMeta None mgs.book.author.email jane_doe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.author.email sweetdoe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.ebook Available None>,
 <iRODSMeta None mgs.book.publishing_date 2015-02-01 None>,
 <iRODSMeta None mgs.book.cover_colors red None>,
 <iRODSMeta None mgs.book.cover_colors blue None>,
 <iRODSMeta None mgs.book.publisher Tor None>]

- If a field is provided that is not included in a schema, it will be ignored, and you will only get a warning if `verbose` is `True`.

In [9]:
mini_md = {'title' : 'I only have a title', 'publishing_date' : '2023-04-28',
           'author' : {'name' : 'Name Surname', 'email' : 'rightemail@kuleuven.be'},
           'publishing_house' : 'Oxford'}
check_metadata(my_schema, mini_md, verbose = True)



[<iRODSMeta None mgs.book.title I only have a title None>,
 <iRODSMeta None mgs.book.publishing_date 2023-04-28 None>,
 <iRODSMeta None mgs.book.author.name Name Surname 1>,
 <iRODSMeta None mgs.book.author.email rightemail@kuleuven.be 1>,
 <iRODSMeta None mgs.book.publisher Tor None>]

In [10]:
del mini_md['publishing_house']
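
The presence rules described above can be summarized in a short sketch. It is only an illustration under stated assumptions: `check_presence`, its arguments and its messages are hypothetical and do not come from the package; only the shape of `required_fields` (names mapped to defaults, with `None` meaning "no default") follows the tutorial.

```python
import warnings
from typing import Any, Optional


def check_presence(
    required_fields: dict[str, Optional[Any]],
    known_fields: set[str],
    metadata: dict[str, Any],
    verbose: bool = False,
) -> dict[str, Any]:
    """Apply the presence rules to a flat metadata dictionary (illustrative only)."""
    # Required fields without a default must be provided: otherwise an error.
    missing = [
        name
        for name, default in required_fields.items()
        if default is None and name not in metadata
    ]
    if missing:
        raise KeyError(f"Required fields missing and no default: {', '.join(missing)}.")

    result: dict[str, Any] = {}
    for name, value in metadata.items():
        if name in known_fields:
            result[name] = value
        elif verbose:
            # Fields that are not in the schema are dropped.
            warnings.warn(f"'{name}' is not in the schema and is ignored.")

    for name in sorted(known_fields - set(metadata)):
        if name in required_fields:
            # Required field with a default: fall back to the default.
            result[name] = required_fields[name]
        if verbose:
            warnings.warn(f"'{name}' was not provided.")
    return result


required = {"title": None, "publishing_date": None, "publisher": "Tor"}
known = {"title", "publishing_date", "publisher", "ebook"}
provided = {"title": "I only have a title", "publishing_date": "2023-04-28",
            "publishing_house": "Oxford"}
print(check_presence(required, known, provided))
```

With `verbose=False` nothing is reported: the unknown `publishing_house` is dropped silently and `publisher` falls back to its default.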

Once the presence of fields has been checked, we move on to appropriateness: are the values ok based on the requirements of the different fields?
The `validators` package is used to validate the range of numbers as well as emails and URLs.

On this level, there is no effect of `verbose`, because if you are providing a value you will most certainly want to know that it has failed:

- when required fields _with no default value_ are wrong, an error is thrown.
- when required fields with default values are wrong, the default is used and a warning is printed.
- when non-required fields are wrong, they are ignored and a warning is printed.
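
As a rough sketch of these three outcomes, the function below decides what to keep for a single field once its values have been tested. The name `resolve` and its signature are hypothetical, and the `is_valid` callback stands in for whatever check applies to the field type; the package itself may organize this differently.

```python
import warnings
from typing import Any, Callable, Optional


def resolve(
    name: str,
    values: list[Any],
    is_valid: Callable[[Any], bool],
    required: bool,
    default: Optional[Any] = None,
) -> list[Any]:
    """Return the values to keep for one field (illustrative only)."""
    valid = []
    for value in values:
        if is_valid(value):
            valid.append(value)
        else:
            # Wrong values are always reported, regardless of `verbose`.
            warnings.warn(f"The value {value!r} for `{name}` is not valid and is ignored.")
    if valid:
        return valid
    if not required:
        return []          # non-required field: simply left out
    if default is not None:
        return [default]   # required field with default: use the default
    raise ValueError(f"None of the values provided for `{name}` are valid.")


def in_range(value: int) -> bool:
    return 12 <= value <= 99


print(resolve("age", [30], in_range, required=False))
```

A repeatable field with one good and one bad value keeps the good one, which is why a single wrong value among several only produces a warning.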

### Numbers
Integer and float simple fields must be of type `int` or `float` respectively or something that can be converted to such format.
`validators.between()` is used to make sure that the number is within the provided range.
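
A minimal sketch of the conversion and range check for integers, using only the standard library. The helper `to_integer` is hypothetical, plain comparisons replace `validators.between()`, and the bounds are assumed to be inclusive.

```python
from typing import Any, Optional


def to_integer(
    value: Any, minimum: Optional[int] = None, maximum: Optional[int] = None
) -> Optional[int]:
    """Return `value` as an int within the range, or None if it is not acceptable."""
    try:
        # int('30') gives 30, int(30.5) truncates to 30, int('thirty') raises ValueError.
        number = int(value)
    except (TypeError, ValueError, OverflowError):
        return None
    # Bounds are optional and assumed inclusive here.
    if minimum is not None and number < minimum:
        return None
    if maximum is not None and number > maximum:
        return None
    return number


for candidate in (30, "30", 30.5, 103, "thirty"):
    print(repr(candidate), "->", to_integer(candidate, minimum=12, maximum=99))
```

Note that converting with `int()` truncates floats, so `30.5` is stored as `30` rather than rejected.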

In [None]:
my_metadata['author'][0]['age'] = 30
checked_metadata = check_metadata(my_schema, my_metadata)
[x for x in checked_metadata if x.name.startswith('mgs.book.author') and x.units == '1']

[<iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.age 30 1>]

In [None]:
# provided as a string that can be converted with `int()`
my_metadata['author'][0]['age'] = '30'
checked_metadata = check_metadata(my_schema, my_metadata)
[x for x in checked_metadata if x.name.startswith('mgs.book.author') and x.units == '1']

[<iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.age 30 1>]

In [None]:
# provided as a float that can be converted with `int()`
my_metadata['author'][0]['age'] = 30.5
checked_metadata = check_metadata(my_schema, my_metadata)
[x for x in checked_metadata if x.name.startswith('mgs.book.author') and x.units == '1']

[<iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.age 30 1>]

In [None]:
#| error: true
# wrong range: it should be between 12 and 99
my_metadata['author'][0]['age'] = 103
checked_metadata = check_metadata(my_schema, my_metadata)
[x for x in checked_metadata if x.name.startswith('mgs.book.author') and x.units == '1']



[<iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>]

In [None]:
# wrong format
my_metadata['author'][0]['age'] = 'thirty'
checked_metadata = check_metadata(my_schema, my_metadata)
[x for x in checked_metadata if x.name.startswith('mgs.book.author') and x.units == '1']



[<iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>]

Values are also checked when no minimum or maximum is specified.

In [12]:
mini_md['market_price'] = 9.99
checked_metadata = check_metadata(my_schema, mini_md)
[x for x in checked_metadata if x.name.endswith('price')]

[<iRODSMeta None mgs.book.market_price 9.99 None>]

### Dates, times and datetimes
Dates, times and datetimes can be provided as `datetime.date`, `datetime.time` or `datetime.datetime` objects, or as values that can be converted to such objects via the `fromisoformat()` or `fromtimestamp()` methods of those classes. The final value is a string in ISO format.
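
For date fields, the conversion could look roughly like the sketch below. The helper `to_iso_date` is hypothetical; it only illustrates how `date.fromisoformat()` and `date.fromtimestamp()` from the standard library behave, not how the package calls them.

```python
from datetime import date, datetime
from typing import Any, Optional


def to_iso_date(value: Any) -> Optional[str]:
    """Return an ISO-formatted date string, or None if `value` cannot be converted."""
    if isinstance(value, datetime):
        # datetime is a subclass of date, so it must be tested first.
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    try:
        if isinstance(value, str):
            return date.fromisoformat(value).isoformat()
        if isinstance(value, (int, float)):
            # Interpreted as seconds since the epoch, in local time.
            return date.fromtimestamp(value).isoformat()
    except (ValueError, OverflowError, OSError):
        return None
    return None


print(to_iso_date(date(2015, 2, 1)))
print(to_iso_date("2015-02-01"))
print(to_iso_date("03/11/1990"))
```

A string such as `'03/11/1990'` is not in ISO format, so `fromisoformat()` rejects it; it would have to be written as `'1990-11-03'` or `'1990-03-11'`, depending on the intended date.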

In [None]:
from datetime import date
import time

In [None]:
mini_md['publishing_date'] = date.today()
check_metadata(my_schema, mini_md)

[<iRODSMeta None mgs.book.title I only have a title None>,
 <iRODSMeta None mgs.book.publishing_date 2023-06-01 None>,
 <iRODSMeta None mgs.book.author.name Name Surname 1>,
 <iRODSMeta None mgs.book.author.email rightemail@kuleuven.be 1>,
 <iRODSMeta None mgs.book.publisher Tor None>]

In [None]:
mini_md['publishing_date'] = date.fromtimestamp(time.time())
check_metadata(my_schema, mini_md)

[<iRODSMeta None mgs.book.title I only have a title None>,
 <iRODSMeta None mgs.book.publishing_date 2023-06-01 None>,
 <iRODSMeta None mgs.book.author.name Name Surname 1>,
 <iRODSMeta None mgs.book.author.email rightemail@kuleuven.be 1>,
 <iRODSMeta None mgs.book.publisher Tor None>]

In [None]:
#| error: true
mini_md['publishing_date'] = '03/11/1990'
check_metadata(my_schema, mini_md)

ValueError: None of the values provided for `mgs.book.publishing_date` are valid.

### URLs and emails


In [None]:
my_metadata['author'][0]['age'] = 30
my_metadata['author'][1]['email'].append('bademail@whatevs')
check_metadata(my_schema, my_metadata)



[<iRODSMeta None mgs.book.title A book not written yet None>,
 <iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.age 30 1>,
 <iRODSMeta None mgs.book.author.name Jane Doe 2>,
 <iRODSMeta None mgs.book.author.email jane_doe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.author.email sweetdoe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.ebook Available None>,
 <iRODSMeta None mgs.book.publishing_date 2015-02-01 None>,
 <iRODSMeta None mgs.book.cover_colors red None>,
 <iRODSMeta None mgs.book.cover_colors blue None>,
 <iRODSMeta None mgs.book.publisher Tor None>]

In [None]:
my_metadata['author'][1]['email'].pop()

'bademail@whatevs'

In the example above, _one_ of the provided emails is wrong, and therefore it is just ignored.
In the section on composite fields below we see that if the value is missing or the only provided value is wrong, an error is thrown,
because the field is required.

### Regular expressions

URLs, emails and simple text can also have a `pattern` attribute providing a regular expression that checks its appropriateness. In the case of these emails, we have additional validation to make sure that the domain is "kuleuven.be":

In [None]:
my_schema.check_requirements('author')

Type: object.
Required: True. (2 of its 3 fields are required.)
Repeatable: True.

Composed of the following fields:
name
Type: text.
Required: True. Default: None.
Repeatable: False.

age
Type: integer.
Required: False.
Repeatable: False.
integer between 12 and 99.

email
Type: email.
Required: True. Default: None.
Repeatable: True.
matching the following regex: @kuleuven.be$.


In [None]:
my_metadata['author'][0]['age'] = 30
my_metadata['author'][1]['email'].append('wrong_domain@gmail.com')
check_metadata(my_schema, my_metadata)



[<iRODSMeta None mgs.book.title A book not written yet None>,
 <iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.age 30 1>,
 <iRODSMeta None mgs.book.author.name Jane Doe 2>,
 <iRODSMeta None mgs.book.author.email jane_doe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.author.email sweetdoe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.ebook Available None>,
 <iRODSMeta None mgs.book.publishing_date 2015-02-01 None>,
 <iRODSMeta None mgs.book.cover_colors red None>,
 <iRODSMeta None mgs.book.cover_colors blue None>,
 <iRODSMeta None mgs.book.publisher Tor None>]

In [None]:
my_metadata['author'][1]['email'].pop()

'wrong_domain@gmail.com'
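
To see why `wrong_domain@gmail.com` is rejected, the pattern from the schema can be tried directly with Python's `re` module. This sketch assumes the pattern is searched for anywhere in the value (`re.search`), which is a guess about the package's behavior; the helper `matches_pattern` is hypothetical.

```python
import re

# Pattern as listed for the 'email' subfield of 'author'.
PATTERN = re.compile(r"@kuleuven.be$")


def matches_pattern(value: str) -> bool:
    """Check whether the pattern occurs in `value`; `$` pins it to the end."""
    return PATTERN.search(value) is not None


for email in ("jane_doe@kuleuven.be", "wrong_domain@gmail.com", "x@kuleuvenXbe"):
    print(email, matches_pattern(email))
```

The last value also matches because the dot in the pattern is not escaped and therefore stands for any character; a stricter schema would use `@kuleuven\.be$`. In the tutorial such a value would presumably still be caught by the separate email validation.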

### Composite fields

Within composite fields, the same rules apply as for schemas. First, presence is checked: required values without default throw an error when they are missing, while other cases of missing or extra fields throw warnings only when `verbose` is `True`.

Moreover, composite fields are never required themselves based on the schema, but they are required if any of their fields are required.

As shown above, a bad value is ignored with a warning; if that leaves a required subfield without default with no valid value at all, an error is thrown.

In [None]:
#| error: true
my_metadata['author'].append({'name' : 'etal'})
check_metadata(my_schema, my_metadata)

KeyError: 'The following required fields are missing and there is no default: mgs.book.author.email.'

In [None]:
#| error: true
my_metadata['author'][2]['email'] = 'bademail.com'
check_metadata(my_schema, my_metadata)

ValueError: None of the values provided for `mgs.book.author.email` are valid.

In [None]:
my_metadata['author'][2]['email'] = 'bademail@kuleuven.be'
check_metadata(my_schema, my_metadata)

[<iRODSMeta None mgs.book.title A book not written yet None>,
 <iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.age 30 1>,
 <iRODSMeta None mgs.book.author.name Jane Doe 2>,
 <iRODSMeta None mgs.book.author.email jane_doe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.author.email sweetdoe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.author.name etal 3>,
 <iRODSMeta None mgs.book.author.email bademail@kuleuven.be 3>,
 <iRODSMeta None mgs.book.ebook Available None>,
 <iRODSMeta None mgs.book.publishing_date 2015-02-01 None>,
 <iRODSMeta None mgs.book.cover_colors red None>,
 <iRODSMeta None mgs.book.cover_colors blue None>,
 <iRODSMeta None mgs.book.publisher Tor None>]

# Final notes on implementation

The metadata can be applied to an object or collection `item` with `my_schema.apply(item, my_metadata)`, which basically calls `check_metadata()` and then provides the list to `item.metadata.apply_atomic_operations()`, adding each of the AVUs.
In addition, another AVU is added with name `{prefix}.__version__` (e.g. `mgs.book.__version__`) indicating the version of the schema used for annotation (in this case, "2.0.0").

However, before actually adding the metadata, `apply()` does two things:

- It checks whether there already is a `{prefix}.__version__` AVU and prints a warning if it's different from the version of the current schema.
- It removes all existing metadata with the same prefix.

This is the same behavior as in the ManGO portal: it replaces all existing metadata linked to a schema with the metadata provided in this instance.
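
The replacement logic can be mimicked on a plain list of `(name, value, unit)` tuples, without an iRODS connection. The sketch below is a simplified model of the steps listed above, not the code of `apply()`: the function `replace_schema_metadata` is hypothetical, and matching on the prefix followed by a dot is an assumption.

```python
from typing import Optional

AVU = tuple[str, str, Optional[str]]  # (name, value, unit); stand-in for iRODSMeta


def replace_schema_metadata(
    existing: list[AVU], new: list[AVU], prefix: str, version: str
) -> list[AVU]:
    """Model the effect of applying schema metadata to an item (illustrative only)."""
    version_name = f"{prefix}.__version__"
    # 1. Warn if the item was annotated with another version of the schema.
    previous = [value for name, value, _ in existing if name == version_name]
    if previous and previous[0] != version:
        print(f"Warning: replacing metadata from version {previous[0]} with {version}.")
    # 2. Drop everything that belongs to this schema, keep the rest untouched.
    kept = [avu for avu in existing if not avu[0].startswith(f"{prefix}.")]
    # 3. Add the new AVUs plus the version marker.
    return kept + new + [(version_name, version, None)]


existing = [
    ("mgs.book.title", "Old title", None),
    ("mgs.book.__version__", "1.0.0", None),
    ("other.key", "unrelated", None),
]
new = [("mgs.book.title", "A book not written yet", None)]
updated = replace_schema_metadata(existing, new, "mgs.book", "2.0.0")
for avu in updated:
    print(avu)
```

Metadata outside the schema's namespace, such as `other.key` here, is left alone, while everything under `mgs.book` is replaced as a whole.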