# Mapping - DCAT

Testing approaches to mapping DCAT to schema.org

For SPREP, the profile and its mapping to schema.org are documented at https://resources.data.gov/resources/podm-field-mapping/#field-mappings and the source data is at https://pacific-data.sprep.org/sites/default/files/pod_data/data.json

## Current thinking

* JSON-LD Frame with default values
* SPARQL CONSTRUCT on the resulting frame to generate the new triples

## Additional resources
The [FAIRsharing RDA Working Group](https://www.rd-alliance.org/group/fairsharing-registry-connecting-data-policies-standards-databases.html) has released several resources worth highlighting:
* a [crosswalk](https://fairsharing.org/3641) of DCAT, DataCite, ISO 19115, and several other metadata standards to schema.org. (This is actually from the RDA Research Metadata Schemas WG …).
* a new [registry of standards, databases, and data policies](https://fairsharing.org/)
* an API for harvesting and modifying the records in the registry; see also some [documentation](https://fairsharing.gitbook.io/fairsharing/).

Each of these is quite broad and worth perusing.


## Mapping references

* https://www.w3.org/2015/spatial/wiki/ISO_19115_-_DCAT_-_Schema.org_mapping
* https://ec-jrc.github.io/dcat-ap-to-schema-org/
* https://data.gov.au/data/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c.jsonld

Note: Where possible, we should use the subjectOf property to link the generated schema.org record back to 
the source DCAT record.
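
As a minimal sketch of the `subjectOf` idea, the snippet below attaches a link from a generated schema.org record to the DCAT document it was derived from. The helper name `link_to_source` and the choice of `CreativeWork` as the type of the source record are assumptions made for illustration; the two URLs are the data.gov.au example listed in the mapping references.

```python
import json


def link_to_source(sdo_record: dict, dcat_url: str) -> dict:
    """Return a copy of a schema.org JSON-LD record that points at its DCAT source.

    Hypothetical helper: typing the source as CreativeWork is an assumption.
    """
    linked = dict(sdo_record)
    linked["subjectOf"] = {"@id": dcat_url, "@type": "CreativeWork"}
    return linked


sdo = {
    "@context": {"@vocab": "https://schema.org/"},
    "@id": "https://data.gov.au/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c",
    "@type": "Dataset",
    "identifier": "67ca5de1-8774-4678-9d1b-8b1cb70ab33c",
}
source_url = "https://data.gov.au/data/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c.jsonld"

linked = link_to_source(sdo, source_url)
print(json.dumps(linked, indent=2))
```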

## Methodology

We will load the DCAT JSON-LD example and explore approaches to converting it to a form that can be used for 
schema.org.

Possible approaches include

* Inferencing
    * ref: https://derwen.ai/docs/kgl/infer/
* SPARQL CONSTRUCT
    * https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html
    * https://derwen.ai/docs/kgl/ex4_0/
* JSON-LD APIs
    * https://w3c.github.io/json-ld-framing/#omit-default-flag
* Context modification
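
The context modification approach could look roughly like the sketch below: the record keeps its keys, and a replacement `@context` aliases them to schema.org IRIs. The specific key-to-property pairs and the helper name `with_sdo_context` are illustrative assumptions, not the agreed mapping.

```python
import json

# Illustrative aliases from Project Open Data style keys to schema.org IRIs.
# These pairings are assumptions for the sketch, not a reviewed crosswalk.
POD_TO_SDO_CONTEXT = {
    "title": "https://schema.org/name",
    "description": "https://schema.org/description",
    "keyword": "https://schema.org/keywords",
    "modified": "https://schema.org/dateModified",
    "identifier": "https://schema.org/identifier",
}


def with_sdo_context(record: dict) -> dict:
    """Replace any existing @context with the schema.org aliasing context."""
    body = {k: v for k, v in record.items() if k != "@context"}
    return {"@context": POD_TO_SDO_CONTEXT, **body}


record = {
    "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
    "title": "State of the world's sea turtles",
    "keyword": ["turtles"],
    "accessLevel": "public",
}
print(json.dumps(with_sdo_context(record), indent=2))
```

Because the new context declares no `@vocab`, a JSON-LD processor would drop keys it does not define (here `accessLevel`) when expanding the document, so unmapped fields need to be handled separately if they matter.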

In [10]:
import kglab
import json
import rdflib
import numpy as np

In [11]:
# load our JSON into a var to use later
with open('./data/dcatEx.json') as f:
    j = json.load(f)

## JSON-LD

Use a frame to pull the elements we want to map, then alter the context for that 
frame or otherwise cast to a new namespace.

Frame with defaults, then convert to the new namespace with SPARQL CONSTRUCT.
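
A frame with default values might be sketched as follows. The frame selects `dcat:Dataset` nodes, keeps only the listed properties, and uses `@default` to supply a value when the source record has none. The chosen properties and default strings are assumptions, and the sketch assumes the PyLD package (`pip install PyLD`), which the notebook does not otherwise use; rdflib does not implement framing.

```python
# Hypothetical frame: property choices and default values are illustrative.
FRAME = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "@explicit": True,  # emit only the properties named in the frame
    "dct:title": {"@default": "Untitled dataset"},
    "dct:license": {"@default": "unknown"},
    "dct:description": {},
}


def frame_document(doc, frame=FRAME):
    """Apply the frame to a parsed JSON-LD document using PyLD."""
    from pyld import jsonld  # imported lazily; PyLD is an assumed dependency

    return jsonld.frame(doc, frame)
```

With the JSON loaded earlier into `j`, usage would be `framed = frame_document(j)`, and the framed result can then be parsed into a graph for the CONSTRUCT step.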

## SPARQL CONSTRUCT example

Refs:
* https://derwen.ai/docs/kgl/ex4_0/

In [12]:
from icecream import ic
from pathlib import Path

txt = Path('./data/dcatEx.json').read_text()

g = rdflib.Graph()
g.parse(data=txt, format="json-ld")

<Graph identifier=N11e7a8aed3264f7c921e88a4dd434ff4 (<class 'rdflib.graph.Graph'>)>

In [13]:
sparql = """
    SELECT ?s ?p ?o 
    WHERE {
        ?s ?p ?o .
    }
    LIMIT 1
"""

In [14]:
for row in g.query(sparql):
    ic(row.asdict())

ic| row.asdict(): {'o': rdflib.term.Literal('Maranoa-Balonne-Condamine subregion'),
                   'p': rdflib.term.URIRef('http://www.w3.org/ns/dcat#keyword'),
                   's': rdflib.term.URIRef('https://data.gov.au/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c')}


In [15]:
sparqlc = """
 PREFIX dbpedia: <http://dbpedia.org/resource/>
 PREFIX foaf: <http://xmlns.com/foaf/0.1/>
 PREFIX dc: <http://purl.org/dc/elements/1.1/>
 PREFIX dct: <http://purl.org/dc/terms/>
 PREFIX mo: <http://purl.org/ontology/mo/>
 PREFIX schema: <https://schema.org/>

CONSTRUCT { 
       ?s schema:identifier ?o .
 }
 WHERE { 
       ?s dct:identifier ?o .
 }
"""

qres = g.query(sparqlc)
context = {"@vocab": "https://schema.org/", "@language": "en"}
print(qres.serialize(format='json-ld', context=context, indent=4))

# g.parse(qres, format="nt")
    
# for row in qres:
#     print("-----")
#     print(row)

b'{\n    "@context": {\n        "@language": "en",\n        "@vocab": "https://schema.org/"\n    },\n    "@id": "https://data.gov.au/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c",\n    "identifier": {\n        "@value": "67ca5de1-8774-4678-9d1b-8b1cb70ab33c"\n    }\n}'


In [16]:
import kglab

namespaces = {
    "adms": "http://www.w3.org/ns/adms#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gsp": "http://www.opengis.net/ont/geosparql#",
    "locn": "http://www.w3.org/ns/locn#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "http://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "time": "http://www.w3.org/2006/time#",  # the OWL-Time namespace ends in '#'
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  }

kg = kglab.KnowledgeGraph(
    name = "DCAT example",
    base_uri = "https://www.example.org/",
    namespaces = namespaces,
    )

kg.load_jsonld("./data/dcatEx.json")

<kglab.kglab.KnowledgeGraph at 0x7fda4f11b790>

In [17]:
sparql2 = """
    SELECT ?s  ?o 
    WHERE {
        ?s dct:description ?o .
    }
"""

In [18]:
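import pandas as pd
# The bare "max_rows" pattern matches more than one pandas option and raises
# "OptionError: Pattern matched multiple keys"; use the full option name.
pd.set_option("display.max_rows", None)

df = kg.query_as_df(sparql2)
df.head(20)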

In [19]:
# pyvis_graph = kg.visualize_query(sparql2, notebook=True)
# pyvis_graph.force_atlas_2based()
# pyvis_graph.show("tmp.fig06.html")

## SHACL Rules

In [20]:
import pyshacl
from pyshacl import validate


In [21]:
conforms, v_graph, v_text = validate(data_graph="./data/learning.jsonld", 
                shacl_graph='./shapes/oih_learning.ttl', 
                data_graph_format="json-ld", 
                shacl_graph_format="ttl",  # pyshacl's keyword is shacl_graph_format
                inference='none', 
                serialize_report_graph="json-ld")

Failed to convert Literal lexical form to value. Datatype=http://www.w3.org/2001/XMLSchema#dateTime, Converter=<function parse_datetime at 0x7fdaaee87880>
Traceback (most recent call last):
  File "/home/fils/.local/lib/python3.10/site-packages/isodate/isodatetime.py", line 51, in parse_datetime
    datestring, timestring = datetimestring.split('T')
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fils/.conda/envs/dev/lib/python3.10/site-packages/rdflib/term.py", line 2084, in _castLexicalToPython
    return conv_func(lexical)  # type: ignore[arg-type]
  File "/home/fils/.local/lib/python3.10/site-packages/isodate/isodatetime.py", line 53, in parse_datetime
    raise ISO8601Error("ISO 8601 time designator 'T' missing. Unable to"
isodate.isoerror.ISO8601Error: ISO 8601 time designator 'T' missing. Unable to parse datetime string '2019-03-21'
Failed to convert

In [22]:
print(conforms)
print(v_graph)
print(v_text)

True
b'[\n  {\n    "@id": "_:N16694cb1bc104293bde5a11afa24cda0",\n    "@type": [\n      "http://www.w3.org/ns/shacl#ValidationReport"\n    ],\n    "http://www.w3.org/ns/shacl#conforms": [\n      {\n        "@value": true\n      }\n    ]\n  }\n]'
Validation Report
Conforms: True



In [23]:
from pyshacl import Validator

# v = Validator(data_graph=dg_basin, shacl_graph=rule, options={"inference": "rdfs"},ont_graph=ont)
# conforms, report_graph, report_text = v.run()
# expanded_graph = v.target_graph 

df = Path('./data/data.ttl').read_text()
dg = rdflib.Graph()
dg.parse(data=df, format="ttl")

sf = Path('./shapes/shape.ttl').read_text()
sg = rdflib.Graph()
sg.parse(data=sf, format="ttl")

v = Validator(data_graph=dg, shacl_graph=sg,  options={"inference": "none", "advanced": True})  # turn off rdfs inferencing
conforms, report_graph, report_text = v.run()
expanded_graph = v.target_graph 

In [24]:
# print(conforms)
# print(v_graph)
# print("------------")
# print(v_text)
# print(expanded_graph)

In [25]:
print(expanded_graph.serialize(format="ttl")) #.decode("utf-8"))

@prefix ex: <http://example.com/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:InvalidRectangle a ex:Rectangle .

ex:NonSquareRectangle a ex:Rectangle ;
    ex:height 2 ;
    ex:width 3 .

ex:SquareRectangle a ex:Rectangle,
        ex:Square ;
    ex:height 4 ;
    ex:width 4 .




## Notes on SHACL AF Rules

We need to add PROV triples in this process to record the generation of these triples: 
the source IRI that results in the product IRI, and the actor (?reference).

Maybe review: https://www.w3.org/TR/2013/REC-prov-o-20130430/#qualifiedPrimarySource

In [26]:
df = Path('./data/dcat.ttl').read_text()
dg = rdflib.Graph()
dg.parse(data=df, format="turtle")

sf = Path('./shapes/dcatsdoOLD.ttl').read_text()
sg = rdflib.Graph()
sg.parse(data=sf, format="ttl")

v = Validator(data_graph=dg, shacl_graph=sg,  options={"inference": "none", "advanced": True})  # turn off rdfs inferencing
conforms, report_graph, report_text = v.run()
expanded_graph = v.target_graph 

output = str(expanded_graph.serialize(format="ttl")) #.decode("utf-8"))

print(output)

#save file
# file = open("output.txt", "w")
# file.write(output)
# file.close()

@prefix dct: <http://purl.org/dc/terms/> .
@prefix ns1: <http://www.w3.org/ns/dcat#> .
@prefix ns2: <http://www.w3.org/ns/locn#> .
@prefix ns3: <http://xmlns.com/foaf/0.1/> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://data.gov.au/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c> a ns1:Dataset ;
    dct:description "a descriptiono of this dataset" ;
    dct:identifier "67ca5de1-8774-4678-9d1b-8b1cb70ab33c" ;
    dct:issued "2016-03-23T05:08:17.991412"^^xsd:dateTime ;
    dct:language "eng" ;
    dct:modified "2019-11-19T23:18:49.871451"^^xsd:dateTime ;
    dct:publisher <https://data.gov.au/organization/69f37b4c-bdf0-4c85-bd56-82fa6d6b087a> ;
    dct:spatial [ a dct:Location ;
            ns2:geometry "POLYGON ((110.0012 -10.0012, 115.0080 -10.0012, 155.0080 -45.0036, 110.0012 -45.0036, 110.0012 -10.0012))"^^<http://www.opengis.net/ont/geosparql#wktLiteral>,
                "{\"type\": \"Polygon\", \"coordinates\": [[[110.0012, -10.0

## Test with Pacific group style data

In [27]:
df = Path('./data/pacificTest.json').read_text()
dg = rdflib.Graph()
dg.parse(data=df, format="json-ld")

sf = Path('./shapes/dcatsdoOLD.ttl').read_text()
sg = rdflib.Graph()
sg.parse(data=sf, format="ttl")

v = Validator(data_graph=dg, shacl_graph=sg,  options={"inference": "none", "advanced": True})  # turn off rdfs inferencing
conforms, report_graph, report_text = v.run()
expanded_graph = v.target_graph 

output = str(expanded_graph.serialize(format="ttl")) #.decode("utf-8"))

print(output)

@prefix : <https://project-open-data.cio.gov/v1.1/schema/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix schema: <https://schema.org/> .

<https://pacific-data.sprep.org/data.json> a dcat:Catalog ;
    :conformsTo "https://project-open-data.cio.gov/v1.1/schema" ;
    :dataset [ a dcat:Dataset ;
            :accessLevel "public" ;
            :accrualPeriodicity "" ;
            :conformsTo "" ;
            :contactPoint [ :fn "pacific@dmin" ;
                    :hasEmail "" ] ;
            :describedByType "" ;
            :description "Reports on the state of the world's sea turtles" ;
            :distribution [ :description "In contrast to the properly grim outlook of just a few decades ago, these are pretty good times for sea turtles. In a 2017 paper titled “Global Sea Turtle Conservation Successes,” Antonio Mazaris and colleagues reported that published estimates of sea turtle populations tend to be increasing rather than decreasing globally. We have also seen the status

## Work on full collection


In [33]:
df = Path('./data/datahub.json').read_text()
dg = rdflib.Graph()
dg.parse(data=df, format="json-ld")

sf = Path('./shapes/dcatsdoOLD.ttl').read_text()
sg = rdflib.Graph()
sg.parse(data=sf, format="ttl")

v = Validator(data_graph=dg, shacl_graph=sg,  options={"inference": "none", "advanced": True})  # turn off rdfs inferencing
conforms, report_graph, report_text = v.run()
expanded_graph = v.target_graph 

output = str(expanded_graph.serialize(format="ttl")) #.decode("utf-8"))

#save file
with open("output.txt", "w") as file:
    file.write(output)

## JSON-LD frame to pull out the newly generated triples

We need a frame that selects only the newly generated triples. This is not strictly required, but it would be useful in some cases and would let us test the queries more simply. The frame could also fill in default values, which may be harder to do with SHACL.