# Prerequisites 

To run this pipeline, the `ogc-na` Python package must be installed in your environment (for example, by running `pip install ogc-na`).

In [1]:
from ogc.na import download, ingest_json, update_vocabs
from ogc.na.domain_config import DomainConfiguration
import json
from rdflib import Graph

# Running the pipeline

This section walks step by step through the OGC Rainbow pipeline: data download, semantic uplift, entailment and validation.

We will run all the steps in the current directory.

## Download Google Spreadsheet as CSV

Google Spreadsheets can be downloaded by using a specially-crafted URL:

In [3]:
GS_ID = '1zOGLWpTr784nTzBO-S_Es_WUyRsK7650'
GS_URL = f"https://docs.google.com/spreadsheets/d/{GS_ID}/export?format=csv"
CSV_DEST = 'iso19156-3-examples.csv'
download.download_file(GS_URL, CSV_DEST, object_diff=False)
print("Downloaded", GS_URL)

Downloaded https://docs.google.com/spreadsheets/d/1zOGLWpTr784nTzBO-S_Es_WUyRsK7650/export?format=csv


The `object_diff` parameter only applies when working with JSON files, so we disable it for this CSV download.
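The export URL used above can be assembled by a small helper. This is only a sketch: `google_sheet_csv_url` is a hypothetical name, and the optional `gid` query parameter (commonly used to pick one worksheet of a multi-sheet document) is an assumption that does not come from this notebook.

```python
from urllib.parse import urlencode


def google_sheet_csv_url(sheet_id, gid=None):
    """Build the CSV export URL for a Google Spreadsheet (hypothetical helper)."""
    params = {"format": "csv"}
    if gid is not None:
        # Assumption: `gid` selects a single worksheet of the spreadsheet
        params["gid"] = str(gid)
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?{urlencode(params)}"


url = google_sheet_csv_url("1zOGLWpTr784nTzBO-S_Es_WUyRsK7650")
print(url)
```

The resulting string can be passed to `download.download_file` exactly like `GS_URL` above.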

## Convert CSV to JSON

Once we have our `iso19156-3-examples.csv` file, we need to convert it to JSON.

`ingest_json`, the module [used to perform semantic uplifts](https://opengeospatial.github.io/ogc-na-tools/reference/ogc/na/ingest_json/) (turning plain JSON into JSON-LD and/or RDF/Turtle), has an [input filter that can work with CSV files](https://opengeospatial.github.io/ogc-na-tools/reference/ogc/na/input_filters/csv/). Normally, we would create a whole semantic uplift definition (following the steps in [this tutorial](https://opengeospatial.github.io/ogc-na-tools/tutorials/#how-to-create-a-json-ld-uplift-context-definition)), but in this case we just want a JSON version of our spreadsheet, so we can use an extremely simple uplift definition:

In [4]:
csv_to_json_def = '''
input-filter:
  csv:
'''
with open('csv_to_json.yml', 'w') as f:
    f.write(csv_to_json_def)
print('csv_to_json.yml created')

csv_to_json.yml created


Then we can run the `ingest_json` module with [that definition](csv_to_json.yml) to obtain our JSON document:

In [5]:
result = ingest_json.process_file(input_fn='iso19156-3-examples.csv',
                                  jsonld_fn='iso19156-3-examples.csv.json',
                                  context_fn='csv_to_json.yml')
print('iso19156-3-examples.csv.json created')

iso19156-3-examples.csv.json created


As you can see, the newly-created [iso19156-3-examples.csv.json](iso19156-3-examples.csv.json) document contains an object with two keys:

* `metadata`, which contains metadata about the input file, the filter that was used, etc.
* `data`, with an array of objects representing the rows in the spreadsheet.
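To get a quick feel for that structure, a document of this shape can be summarized with a few lines of plain Python. The sample below is a hypothetical stand-in (its column names and metadata fields are invented for illustration); the real file holds one object per spreadsheet row.

```python
import json

# Hypothetical stand-in for iso19156-3-examples.csv.json
sample = json.loads('''
{
  "metadata": {"filter": "csv"},
  "data": [
    {"Name": "example-1", "Definition": "First row"},
    {"Name": "example-2", "Definition": "-"}
  ]
}
''')


def describe(doc):
    """Return the top-level keys, the number of rows and the column names."""
    rows = doc["data"]
    columns = sorted({key for row in rows for key in row})
    return sorted(doc), len(rows), columns


keys, n_rows, columns = describe(sample)
print(keys, n_rows, columns)
```

For the real output, load the file with `json.load` and pass the result to `describe`.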

## Semantic uplift

Once our data is in JSON format, we can perform a semantic uplift on it, which is basically converting plain old JSON into JSON-LD and/or RDF in Turtle format.

For this step, we will use [an already existing uplift definition](https://raw.githubusercontent.com/avillar/iso19157-3-sample/master/properties-uplift.yml):

In [6]:
download.download_file('https://raw.githubusercontent.com/avillar/iso19157-3-sample/master/properties-uplift.yml',
                       'properties-uplift.yml',
                       object_diff=False)
print('properties-uplift.yml downloaded')

properties-uplift.yml downloaded


Let us review what [this uplift definition](properties-uplift.yml) does:

1. It contains 4 `transform`s ([jq](https://stedolan.github.io/jq/) expressions to manipulate the input document):
    1. We take the value of the `data` key and discard the rest.
    2. We walk through the data tree and remove (set to `null`) all values that are empty strings, strings made up of blank space characters only, or strings that are just a dash (`-`).
    3. Since some headers contained colons, we remove all colons inside property keys in the object.
    4. We add a `skos:ConceptScheme` with some metadata as the top-level object, and put all of the row objects inside its `concepts` property.
2. After the `transform`s run, we add the `skos:Concept` type for all of the rows (remember that they are now inside the `concepts` property of the top-level object).
3. Finally, we add the JSON-LD context at the root level of the document (indicated by using the `$` JSON path).
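The steps above can be approximated in plain Python to make their effect concrete. This is a sketch of the described behavior, not the jq expressions from `properties-uplift.yml`: the function names, the sample row and the concept scheme metadata are all hypothetical, and the final step (adding the JSON-LD context) is left out.

```python
def blank_to_none(value):
    """Transform 2: recursively replace empty, whitespace-only or '-' strings with None."""
    if isinstance(value, dict):
        return {key: blank_to_none(val) for key, val in value.items()}
    if isinstance(value, list):
        return [blank_to_none(item) for item in value]
    if isinstance(value, str) and (value.strip() == "" or value == "-"):
        return None
    return value


def strip_colons(row):
    """Transform 3: remove colons from the property keys of a row object."""
    return {key.replace(":", ""): val for key, val in row.items()}


def uplift(doc, scheme_metadata):
    rows = doc["data"]                             # transform 1: keep only `data`
    rows = blank_to_none(rows)                     # transform 2
    rows = [strip_colons(row) for row in rows]     # transform 3
    scheme = dict(scheme_metadata, concepts=rows)  # transform 4: wrap rows in a scheme
    for concept in scheme["concepts"]:             # step 2: type every row
        concept["@type"] = "skos:Concept"
    return scheme


# Hypothetical input and scheme metadata, for illustration only
doc = {"metadata": {}, "data": [{"Name:": "a", "Note": "  ", "Status": "-"}]}
result = uplift(doc, {"@type": "skos:ConceptScheme", "label": "Hypothetical scheme"})
print(result)
```

The real uplift definition expresses the same operations declaratively, and `ingest_json` applies them for us.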

We run it like so:

In [6]:
result = ingest_json.process_file(input_fn='iso19156-3-examples.csv.json',
                                  jsonld_fn='iso19156-3-examples.csv.jsonld',
                                  ttl_fn='iso19156-3-examples.csv.ttl',
                                  context_fn='properties-uplift.yml')
print('Semantic uplift done')

The above command will generate two files:

* [iso19156-3-examples.csv.jsonld](iso19156-3-examples.csv.jsonld), which is the uplifted JSON-LD version after running the `transform`s, setting the `types` and adding the `context`.
* [iso19156-3-examples.csv.ttl](iso19156-3-examples.csv.ttl), its equivalent version in Turtle.

You will note that the Turtle version does not have all of the properties that the JSON-LD document does; this is because, from the RDF side of things, any property that is not linked to an RDF predicate is simply ignored.
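The effect can be pictured with a toy model: only keys that the context maps to an IRI produce statements. This is a deliberate simplification and not a JSON-LD processor (real processing also handles `@vocab`, nesting, typed values and more); the context and the node below are hypothetical.

```python
# Hypothetical context: only `label` is mapped to an RDF predicate
context = {"label": "http://www.w3.org/2004/02/skos/core#prefLabel"}

# Hypothetical node with one mapped and one unmapped property
node = {"label": "Example", "internalNote": "not mapped to any predicate"}


def mapped_statements(node, context):
    """Yield (predicate, value) pairs for the keys that the context maps to an IRI."""
    for key, value in node.items():
        if key in context:
            yield context[key], value


statements = list(mapped_statements(node, context))
print(statements)
```

Here `internalNote` stays in the JSON-LD document but yields no statement, which mirrors why it would be absent from the Turtle output.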

## Entailment and validation

At this point, we have an RDF graph version of our initial spreadsheet inside the [iso19156-3-examples.csv.ttl](iso19156-3-examples.csv.ttl) file. The next step involves performing entailment (inferring new data that we can add to our graph) and validation (verifying that our data is "up to code"). We can do this by leveraging a couple of technologies:

* [SHACL](https://www.w3.org/TR/shacl/) allows us to write entailment rules and validation constraints for RDF data.
* [The profiles vocabulary](https://www.w3.org/TR/dx-prof/) can be used to create profiles, and link those with entailment and validation resources.
  * The OGC defines several profiles that can be used for validation and entailment of RDF resources.

To associate our own Turtle files with the profiles, we first need to create a [DomainConfiguration](https://opengeospatial.github.io/ogc-na-tools/reference/ogc/na/domain_config/#ogc.na.domain_config.DomainConfiguration) (you can find a full example with comments [here](https://opengeospatial.github.io/ogc-na-tools/examples/#sample-domain-configuration)):

In [7]:
domain_cfg_content = '''
@prefix dcfg: <http://www.example.org/ogc/domain-cfg#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix profiles: <http://www.opengis.net/def/metamodel/profiles/> .

_:iso19157-3-sample a dcat:Catalog ;
  dct:title "ISO19157-3 Sample" ;
  dcat:dataset _:examples ;
  dcfg:hasProfileSource "sparql:http://defs-dev.opengis.net:8080/rdf4j-server/repositories/profiles" ;
  dcfg:ignoreProfileArtifactErrors true ;
.

_:examples a dcat:Dataset, dcfg:DomainConfiguration ;
  dct:identifier "examples" ;
  dct:description "Entailment and validation for examples" ;
  dcfg:glob "*.ttl" ;
  dct:conformsTo profiles:skos_shared, profiles:skos_conceptscheme, profiles:skos_conceptscheme_ogc, profiles:vocprez_ogc ;
.
'''
domain_cfg = DomainConfiguration(Graph().parse(data=domain_cfg_content))
profile_registry = domain_cfg.profile_registry
print('Found profiles:')
print('\n'.join(str(profile_uri) for profile_uri in sorted(profile_registry.profiles)))

Found profiles:
http://www.opengis.net/def/metamodel/profiles/dcatprov
http://www.opengis.net/def/metamodel/profiles/dcatqb
http://www.opengis.net/def/metamodel/profiles/json_ld_context
http://www.opengis.net/def/metamodel/profiles/ogcapi-bbox
http://www.opengis.net/def/metamodel/profiles/ogcapi-common
http://www.opengis.net/def/metamodel/profiles/ogcapi-edr
http://www.opengis.net/def/metamodel/profiles/ogcapi-features
http://www.opengis.net/def/metamodel/profiles/ogcapi-geopose
http://www.opengis.net/def/metamodel/profiles/ogcapi-geopose-euler
http://www.opengis.net/def/metamodel/profiles/owl2skos
http://www.opengis.net/def/metamodel/profiles/skos_conceptscheme
http://www.opengis.net/def/metamodel/profiles/skos_conceptscheme_ogc
http://www.opengis.net/def/metamodel/profiles/skos_shared
http://www.opengis.net/def/metamodel/profiles/vocprez_ogc
http://www.w3.org/2004/02/skos/core
http://www.w3.org/ns/dcat


The `profile_registry` above is a [ProfileRegistry](https://opengeospatial.github.io/ogc-na-tools/reference/ogc/na/profile/#ogc.na.profile.ProfileRegistry). It reads the values of the catalog's `dcfg:hasProfileSource` (in our case `sparql:http://defs-dev.opengis.net:8080/rdf4j-server/repositories/profiles`, a SPARQL endpoint) and obtains all the profile definitions from them, including their resources, artifacts and dependencies.

Apart from that, we define a `dcfg:DomainConfiguration` to run entailment and validation processes on our Turtle document (which will fall inside of the `dcfg:glob`'s scope) with resources from 4 profiles: `skos_shared`, `skos_conceptscheme`, `skos_conceptscheme_ogc` and `vocprez_ogc`.
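How a glob ties a file to a set of profiles can be sketched with the standard library. The entry structure and the `find_entry` helper below are hypothetical simplifications, and matching against the base file name is an assumption; the actual lookup logic in `ogc-na` may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical, simplified view of the configuration entry defined above
entries = [
    {
        "identifier": "examples",
        "glob": "*.ttl",
        "conforms_to": ["skos_shared", "skos_conceptscheme",
                        "skos_conceptscheme_ogc", "vocprez_ogc"],
    },
]


def find_entry(filename, entries):
    """Return the first entry whose glob matches the file's base name, else None."""
    name = PurePosixPath(filename).name
    for entry in entries:
        if fnmatchcase(name, entry["glob"]):
            return entry
    return None


match = find_entry("iso19156-3-examples.csv.ttl", entries)
print(match["identifier"], match["conforms_to"])
```

A file such as `iso19156-3-examples.csv.json` matches no entry in this sketch, so no profiles would apply to it.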

Since the previous uplift `result` has a `graph` property with the RDF, we can run entailments directly on it:

In [9]:
cfg_entry = domain_cfg.entries.find_entry_for_file('iso19156-3-examples.csv.ttl')
entailed_graph, entail_artifacts = profile_registry.entail(result.graph, cfg_entry.conforms_to, inplace=False)
entailed_graph.serialize('iso19156-3-examples.csv-entailed.ttl', format='ttl')
print('Entailment done. Artifacts used:')
print('-', '\n- '.join(str(artifact) for artifact in entail_artifacts))

The above code finds the configuration entry that matches the `iso19156-3-examples.csv.ttl` file name, and then runs the entailments for the profiles listed in that entry's `dct:conformsTo`. The entailment result (`entailed_graph`) is serialized to `iso19156-3-examples.csv-entailed.ttl`, and the list of entailment artifacts that were found is printed to the console.

It is important to note that in our example we could have skipped creating the full `DomainConfiguration`, instantiating a `ProfileRegistry` directly with the profiles SPARQL endpoint instead, but as we will see later, it is much easier to work with the former in an automated CI/CD environment.

Validation is done similarly:

In [None]:
validation_result = profile_registry.validate(entailed_graph, cfg_entry.conforms_to, log_artifact_errors=True)
print('Validation done')

The validation process logs warnings when profiles and/or artifacts are missing; if `log_artifact_errors` is `True`, it takes a *best-effort* approach.

The `validation_result` will then contain a summary of all the validation errors found, both globally and per profile:

In [None]:
print("Global:", validation_result.result)
for profile_report in validation_result.reports:
    print(f"{profile_report.profile_uri}: {profile_report.report.result}")

We also get full plain text error reports:

In [None]:
print(validation_result.text)

As well as an RDF representation:

In [None]:
print(validation_result.graph.serialize(format='ttl'))

In our example, we can see that the validation for `skos_shared` fails because the SHACL resource it uses has errors, while `vocprez_ogc` detects several errors in our data.

## Uploading the results

Finally, if we want to upload our entailed graphs to a [SPARQL Graph Store Protocol](https://www.w3.org/TR/sparql11-http-rdf-update/)-compatible service, we can use `update_vocabs.load_vocab` to do so:

In [None]:
if False: # disable upload in this notebook
    update_vocabs.load_vocab(entailed_graph,
                             'http://example.com/graph-identifier',
                             'http://example.com/sparql/graph-store',
                             ('username', 'password'))

The RDF data will be uploaded using the [HTTP PUT mechanism](https://www.w3.org/TR/sparql11-http-rdf-update/#http-put), which replaces all data in the specified graph URI (`http://example.com/graph-identifier` above) with the contents of the provided graph.
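To make the protocol concrete, the following sketch builds (but does not send) a Graph Store Protocol PUT request with the standard library. It is not how `load_vocab` is implemented; `build_graph_store_put` is a hypothetical helper, the Turtle payload is a placeholder, and authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_graph_store_put(graph_store, graph_uri, turtle_data):
    """Build a PUT request that replaces the named graph with the given Turtle data."""
    # The target graph is identified through the `graph` query parameter
    url = f"{graph_store}?{urlencode({'graph': graph_uri})}"
    return Request(
        url,
        data=turtle_data.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="PUT",
    )


req = build_graph_store_put(
    "http://example.com/sparql/graph-store",
    "http://example.com/graph-identifier",
    "<urn:example:s> <urn:example:p> <urn:example:o> .",  # placeholder payload
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would overwrite the target graph, so the sketch stops at building it.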

# CI/CD environments

All of the ogc-na modules can be run directly from the command line, which means that in a CI/CD environment all of them can be run as shell commands.

Take the [iso19157-3-sample](https://github.com/avillar/iso19157-3-sample) GitHub repository as an example. It contains several files that should be quite familiar to us by now:

* [csv2python.yml](https://github.com/avillar/iso19157-3-sample/blob/master/csv2python.yml) is the uplift definition to turn CSV files into JSON.
* [properties-uplift.yml](https://github.com/avillar/iso19157-3-sample/blob/master/properties-uplift.yml) is the uplift definition that converts our JSON rows into JSON-LD/Turtle.
* [.ogc/catalog.ttl](https://github.com/avillar/iso19157-3-sample/blob/master/.ogc/catalog.ttl) contains a `DomainConfiguration` like the one we created before, as well as an `UpliftConfiguration` to map `*.csv.json` files to `properties-uplift.yml`.

Apart from that, we have [.ogc/config.yml](https://github.com/avillar/iso19157-3-sample/blob/master/.ogc/config.yml) with the file download configuration (which we typed in manually above).

The pipeline is then run using the [download-and-uplift.yaml](https://github.com/avillar/iso19157-3-sample/blob/master/.github/workflows/download-and-uplift.yaml) GitHub workflow configuration; the important bits in that file are the chain of `python` commands that appear after the environment is set up:

```shell
# Download file(s)
# Inputs:
#  - .ogc/config.yml (file download spec)
# Outputs:
#  - iso19156-3-examples.csv
python -m ogc.na.download --spec .ogc/config.yml

# Search for CSV files and convert them to JSON
# Inputs:
#  - iso19156-3-examples.csv (input file)
#  - csv2python.yml (uplift definition)
# Outputs:
#  - iso19156-3-examples.csv.json
find . -name '*.csv' | while read CSV_FILE; do python -m ogc.na.ingest_json \
--skip-on-missing-context --json-ld --context csv2python.yml "${CSV_FILE}" > "${CSV_FILE}.json"
done

# Do the properties-uplift.yml semantic uplift. The --use-git-status flag instructs the script to
# take its input file names from whatever is modified/added in the current git working directory
# Inputs:
#  - .ogc/catalog.ttl (domain configurations)
#  - iso19156-3-examples.csv.json (input file)
#  - properties-uplift.yml (uplift definition found in catalog)
# Outputs:
#  - iso19156-3-examples.csv.jsonld (uplifted JSON-LD)
#  - iso19156-3-examples.csv.ttl (uplifted Turtle)
python -m ogc.na.ingest_json --batch --use-git-status --skip-on-missing-context \
--json-ld --ttl --work-dir . --domain-config .ogc/catalog.ttl

# Run the entailment, validation and upload in a single step
# Inputs:
#  - .ogc/catalog.ttl (domain configurations)
#  - iso19156-3-examples.csv.ttl (input file)
# Outputs:
#  - entailed/iso19156-3-examples.csv.jsonld (entailed expanded JSON-LD)
#  - entailed/iso19156-3-examples.csv.ttl (entailed Turtle)
#  - entailed/iso19156-3-examples.csv.rdf (entailed RDF/XML)
#  - entailed/iso19156-3-examples.csv.txt (validation report)
python -m ogc.na.update_vocabs -w . .ogc/catalog.ttl --use-git-status \
--base-uri https://raw.githubusercontent.com/${{github.repository}}/${{github.ref_name}} \
--update --graph-store http://defs-dev.opengis.net:8061/fuseki-hosted/data
```
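The idea behind `--use-git-status` (taking input files from whatever is modified or added in the working directory) can be illustrated by parsing the output of `git status --porcelain`. This is only a sketch of the concept: how `ogc-na` actually obtains the list is not shown in this document and may differ, `changed_files` is a hypothetical helper, and the sample output is invented.

```python
def changed_files(porcelain_output):
    """Return the paths that `git status --porcelain` reports as added or modified."""
    files = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        # Each line is a two-character status, a space, then the path
        status, path = line[:2], line[3:]
        if "D" in status:
            continue  # deleted files cannot be used as inputs
        if " -> " in path:
            path = path.split(" -> ", 1)[1]  # renamed: keep the new name
        files.append(path)
    return files


# Invented sample of `git status --porcelain` output
sample = (
    " M iso19156-3-examples.csv\n"
    "?? iso19156-3-examples.csv.json\n"
    " D old-examples.ttl\n"
)
inputs = changed_files(sample)
print(inputs)
```

In a real workflow the text would come from running `git status --porcelain` (for instance through `subprocess.run`), and paths containing special characters would need unquoting, which this sketch does not handle.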

