# Adding URL to `processing:software`

We've been planning to use the [processing extension](https://github.com/stac-extensions/processing) to keep a record of software used in processing. The relevant part of the schema looks like this:

In [1]:
from json import dumps, load
from urllib.request import urlopen
from IPython.display import JSON

processing_schema_url = "https://stac-extensions.github.io/processing/v1.0.0/schema.json"
with urlopen(processing_schema_url) as file_pointer:
    processing_schema = load(file_pointer)

JSON(data=processing_schema["definitions"]["fields"]["properties"]["processing:software"], expanded=True)


The current structure maps each software name to a version string, so it can't easily be extended to include other values, like the URL of the software. One possibility would be to include the URL inside the key:

```json
{
    "Sentinel-1 IPF <https://example.org/sentinel-1-ipf>": "002.71"
}
```

Another option would be to link directly to the version:

```json
{
    "Sentinel-1 IPF": "[002.71](https://example.org/sentinel-1-ipf/002.71)"
}
```

Both options have several issues:

- Anyone who wants *just* the name, version, or URL now needs to parse the field, which is complicated and error-prone.
- There is no single obvious way to format the URL and its title; the above shows only email style and Markdown. MediaWiki and HTML are other popular possibilities. Each format needs its own parser, which makes the contents of the field much harder to validate.
- Including more than one version of the same software, for example if the data has been processed more than once, would make the values even harder to parse.
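To make the parsing burden concrete, here is a minimal sketch of what a consumer would need just to handle the two formats shown above. The regular expressions and the `parse_entry` helper are hypothetical, written only for illustration; they are not part of any STAC tooling.

```python
import re

# Hypothetical patterns for the two formats shown above; real data could use others
ANGLE_BRACKET_KEY = re.compile(r"^(?P<name>.+?)\s*<(?P<link>[^<>]+)>$")
MARKDOWN_VALUE = re.compile(r"^\[(?P<version>[^\]]+)\]\((?P<link>[^)]+)\)$")


def parse_entry(key, value):
    """Split one key/value pair into name, version and link (None if not found)."""
    name, version, link = key, value, None

    key_match = ANGLE_BRACKET_KEY.match(key)
    if key_match:
        name, link = key_match["name"], key_match["link"]

    value_match = MARKDOWN_VALUE.match(value)
    if value_match:
        version, link = value_match["version"], value_match["link"]

    return {"name": name, "version": version, "link": link}


email_style = parse_entry("Sentinel-1 IPF <https://example.org/sentinel-1-ipf>", "002.71")
markdown_style = parse_entry("Sentinel-1 IPF", "[002.71](https://example.org/sentinel-1-ipf/002.71)")
# An HTML-style value matches neither pattern, so it passes through unparsed
html_style = parse_entry("Sentinel-1 IPF", '<a href="https://example.org">002.71</a>')

print(email_style)
print(markdown_style)
print(html_style)
```

Every additional format needs another pattern, and a value in an unanticipated format silently comes back with the markup still embedded in the version.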

The obvious conclusion is that we need a breaking change to the schema to allow a more flexible structure. All of the above issues go away if `processing:software` becomes a list of objects. Let's try building a list-of-objects structure and validating it:

In [2]:
from sys import stderr
from jsonschema import Draft7Validator

def validate(schema, instance):
    found_error = False

    for error in Draft7Validator(schema).iter_errors(instance):
        found_error = True
        print(error.message, file=stderr)

    if not found_error:
        print("Validated successfully")

example = [
    {
        "name": "only name"
    },
    {
        "link": "https://example.org/only-link"
    },
    {
        "name": "all properties",
        "link": "https://example.org",
        "version": "1.0"
    },
    {
        "name": "other properties",
        "licensor": "ACME Industries"
    },
# Uncomment to check error handling
#     {
#         "name": 1,
#         "link": 2,
#         "version": 3
#     },
#     {
#         "licensor": "ACME Industries"
#     }
]

software_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "anyOf": [
            {"required": ["name"]},
            {"required": ["link"]}
        ],
        "properties": {
            "name": {
                "type": "string"
            },
            "link": {
                "type": "string",
                "format": "iri"
            },
            "version": {
                "type": "string"
            }
        }
    }
}

validate(software_schema, example)

Validated successfully


Because the schema does not forbid additional properties, users can add other fields they feel are relevant, like processing output, contact persons, license, and so on. This seems to be the default across STAC extensions.
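Since this would be a breaking change, existing name-to-version mappings would need converting. A minimal sketch of such a conversion follows; the `to_software_list` helper is a hypothetical name, not an existing function.

```python
def to_software_list(software_mapping):
    """Convert {"name": "version"} pairs to the proposed list of objects."""
    return [
        {"name": name, "version": version}
        for name, version in software_mapping.items()
    ]


old_value = {"Sentinel-1 IPF": "002.71"}
new_value = to_software_list(old_value)

# A link, or any other property, can then be added per entry
new_value[0]["link"] = "https://example.org/sentinel-1-ipf"

# The same software can appear more than once without any string parsing
new_value.append({"name": "Sentinel-1 IPF", "version": "003.10"})

# Consumers read plain fields; `.get` covers entries without a version
versions = [entry.get("version") for entry in new_value]
print(new_value)
print(versions)
```

The version `003.10` is made up for the example.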

Why I used `link` rather than `url`:

- As [IRI, URI, URL, URN and their differences](https://fusion.cs.uni-jena.de/fusion/2016/11/18/iri-uri-url-urn-and-their-differences/) explains, IRI is a superset of URI, and URI is a superset of URL. Since the field accepts any IRI, calling it `url` would not be correct.
- `link` is more universally understood than `iri`.

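I used `iri` as the format rather than `uri` because IRIs allow non-ASCII characters. This would allow things like macronated place names in filenames, which is especially useful when an ASCII-only version of the name would be ambiguous. Note that `Draft7Validator` only checks `format` when it is given a format checker, so the validation above does not verify that `link` is a valid IRI.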
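As a rough illustration of the difference, the sketch below takes a hypothetical link containing a macron and percent-encodes it into an ASCII-only URI. The URL is invented for the example, and `quote` with `safe=":/"` is a simplification that is only adequate for a plain path like this one; a full IRI-to-URI mapping treats the host and other components differently.

```python
from urllib.parse import quote

# Hypothetical asset link with a macronated name
iri = "https://example.org/imagery/ōtaki.tiff"
print(iri.isascii())  # False: not expressible as a plain URI

# Percent-encode the UTF-8 bytes of the non-ASCII characters, keeping ":" and "/"
uri = quote(iri, safe=":/")
print(uri)  # https://example.org/imagery/%C5%8Dtaki.tiff
```

With `iri` as the format, the readable form can be stored directly instead of the percent-encoded one.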

Applying the above to the original extension and LINZ extension collection example:

In [3]:
processing_schema["definitions"]["fields"]["properties"]["processing:software"] = software_schema

with urlopen("https://stac.linz.govt.nz/v0.0.10/linz/examples/collection.json") as file_pointer:
    collection = load(file_pointer)

collection["stac_extensions"].append(processing_schema_url)
collection["assets"]["example"]["processing:software"] = example

# Work around bug in original schema, where a valid `providers` object means the `assets` processing properties are not validated
del collection["providers"]

validate(processing_schema, collection)

# print(dumps(processing_schema, indent=2))
# print(dumps(collection, indent=2))

Validated successfully
