# DCAT to Schema.org via SHACL AF

Testing approaches to mapping DCAT to schema.org

Current thinking

* JSON-LD Frame with default values
* SPARQL CONSTRUCT on the resulting frame to generate the new triples

Mapping references

* https://www.w3.org/2015/spatial/wiki/ISO_19115_-_DCAT_-_Schema.org_mapping
* https://ec-jrc.github.io/dcat-ap-to-schema-org/
* https://data.gov.au/data/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c.jsonld


## Methodology

We will load the DCAT JSON-LD example and explore approaches to converting it into a form that can be used for 
schema.org.  

Possible approaches include

* Inferencing
    * ref: https://derwen.ai/docs/kgl/infer/
* SPARQL CONSTRUCT
    * https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html
    * https://derwen.ai/docs/kgl/ex4_0/
* JSON-LD APIs
    * https://w3c.github.io/json-ld-framing/#omit-default-flag
* Context modification

In [1]:
!pip install -q kglab

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
boto3 1.17.102 requires botocore<1.21.0,>=1.20.102, but you have botocore 1.20.49 which is incompatible.


In [2]:
import kglab
import json
import rdflib

In [3]:
# load our JSON into a variable to use later
with open('dcatEx.json') as f:
    j = json.load(f)

## JSON-LD

Use a frame to pull out the elements we want to map, then alter the context for that 
frame or otherwise cast to a new namespace.

Frame with defaults, then convert to the new namespace with SPARQL CONSTRUCT.
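A frame with default values might look like the sketch below. It only builds the frame document; the default strings are placeholders, and the choice of `dct:title` and `dct:description` as the properties to default is an assumption. The frame would then be handed to a JSON-LD processor (for example `jsonld.frame(doc, frame)` from the `pyld` package, which is not called here).

```python
import json

# Illustrative frame: select dcat:Dataset nodes and supply defaults for
# properties that may be missing. The default strings are placeholders.
frame = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": {"@default": "Untitled dataset"},
    "dct:description": {"@default": "No description provided"},
}

print(json.dumps(frame, indent=2))
```

With defaults in place, every framed dataset node carries the same set of keys, which keeps the later CONSTRUCT patterns from silently skipping records with missing properties.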

## SPARQL CONSTRUCT example

Refs:
* https://derwen.ai/docs/kgl/ex4_0/

In [4]:
from icecream import ic
from pathlib import Path

txt = Path('dcatEx.json').read_text()

g = rdflib.Graph()
g.parse(data=txt, format="json-ld")

<Graph identifier=Nad1628ac8eb84e5881b07f2b9ac96afd (<class 'rdflib.graph.Graph'>)>

In [5]:
sparql = """
    SELECT ?s ?p ?o 
    WHERE {
        ?s ?p ?o .
    }
    LIMIT 1
"""

In [6]:
for row in g.query(sparql):
    ic(row.asdict())

ic| row.asdict(): {'o': rdflib.term.Literal('LAND-Cover'),
                   'p': rdflib.term.URIRef('http://www.w3.org/ns/dcat#keyword'),
                   's': rdflib.term.URIRef('https://data.gov.au/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c')}


In [7]:
sparqlc = """
 PREFIX dbpedia: <http://dbpedia.org/resource/>
 PREFIX foaf: <http://xmlns.com/foaf/0.1/>
 PREFIX dc: <http://purl.org/dc/elements/1.1/>
 PREFIX dct: <http://purl.org/dc/terms/>
 PREFIX mo: <http://purl.org/ontology/mo/>
 PREFIX schema: <https://schema.org/>

CONSTRUCT { 
       ?s schema:identifier ?o .
 }
 WHERE { 
       ?s dct:identifier ?o .
 }
"""

qres = g.query(sparqlc)
context = {"@vocab": "https://schema.org/", "@language": "en"}
print(qres.serialize(format='json-ld', context=context, indent=4))

# g.parse(qres, format="nt")
    
# for row in qres:
#     print("-----")
#     print(row)

b'{\n    "@context": {\n        "@language": "en",\n        "@vocab": "https://schema.org/"\n    },\n    "@id": "https://data.gov.au/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c",\n    "identifier": {\n        "@value": "67ca5de1-8774-4678-9d1b-8b1cb70ab33c"\n    }\n}'


In [8]:
import kglab

namespaces =  {
    "adms": "http://www.w3.org/ns/adms#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gsp": "http://www.opengis.net/ont/geosparql#",
    "locn": "http://www.w3.org/ns/locn#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "http://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "time": "http://www.w3.org/2006/time#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  }

kg = kglab.KnowledgeGraph(
    name = "DCAT example",
    base_uri = "https://www.example.org/",
    namespaces = namespaces,
    )

kg.load_jsonld("dcatEx.json")

<kglab.kglab.KnowledgeGraph at 0x7f702ccf86a0>

In [9]:
sparql2 = """
    SELECT ?s  ?o 
    WHERE {
        ?s dct:description ?o .
    }
"""

In [10]:
import pandas as pd
pd.set_option("display.max_rows", None)

df = kg.query_as_df(sparql2)
df.head(20)

|   | s | o |
|---|---|---|
| 0 | <https://data.gov.au/dataset/67ca5de1-8774-467... | Data File |
| 1 | <https://data.gov.au/dataset/67ca5de1-8774-467... | ## **Abstract** \n\nThis dataset and its metad... |


In [11]:
pyvis_graph = kg.visualize_query(sparql2, notebook=True)

pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig06.html")

## SHACL Rules

In [12]:
import pyshacl

In [13]:
from pyshacl import validate

# pyshacl names the shapes-graph format argument `shacl_graph_format`
conforms, v_graph, v_text = validate(data_graph="./learning.jsonld", 
                shacl_graph='./oih_learning.ttl', 
                data_graph_format="json-ld", 
                shacl_graph_format="ttl", 
                inference='none', 
                serialize_report_graph="json-ld")

In [14]:
print(conforms)
print(v_graph)
print(v_text)

True
b'[\n  {\n    "@id": "_:N8be15736f0b945d19b55b266f8475c54",\n    "@type": [\n      "http://www.w3.org/ns/shacl#ValidationReport"\n    ],\n    "http://www.w3.org/ns/shacl#conforms": [\n      {\n        "@value": true\n      }\n    ]\n  }\n]'
Validation Report
Conforms: True



In [15]:
from pyshacl import Validator

# v = Validator(data_graph=dg_basin, shacl_graph=rule, options={"inference": "rdfs"},ont_graph=ont)
# conforms, report_graph, report_text = v.run()
# expanded_graph = v.target_graph 

df = Path('data.ttl').read_text()
dg = rdflib.Graph()
dg.parse(data=df, format="ttl")

sf = Path('shape.ttl').read_text()
sg = rdflib.Graph()
sg.parse(data=sf, format="ttl")

v = Validator(data_graph=dg, shacl_graph=sg,  options={"inference": "none", "advanced": True})  # turn off rdfs inferencing
conforms, report_graph, report_text = v.run()
expanded_graph = v.target_graph 

In [16]:
# print(conforms)
# print(v_graph)
# print("------------")
# print(v_text)
# print(expanded_graph)

In [17]:
print(expanded_graph.serialize(format="ttl").decode("utf-8"))

@prefix ex: <http://example.com/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:InvalidRectangle a ex:Rectangle .

ex:NonSquareRectangle a ex:Rectangle ;
    ex:height 2 ;
    ex:width 3 .

ex:SquareRectangle a ex:Rectangle,
        ex:Square ;
    ex:height 4 ;
    ex:width 4 .




## Notes on SHACL AF Rules

We need to add PROV triples in this process to record the generation of these triples:
the source IRI that results in the product IRI, and the actor (?reference).

Maybe review: https://www.w3.org/TR/2013/REC-prov-o-20130430/#qualifiedPrimarySource

In [18]:
df = Path('dcat.ttl').read_text()
dg = rdflib.Graph()
dg.parse(data=df, format="ttl")

sf = Path('dcatsdo.ttl').read_text()
sg = rdflib.Graph()
sg.parse(data=sf, format="ttl")

v = Validator(data_graph=dg, shacl_graph=sg,  options={"inference": "none", "advanced": True})  # turn off rdfs inferencing
conforms, report_graph, report_text = v.run()
expanded_graph = v.target_graph 

print(expanded_graph.serialize(format="ttl").decode("utf-8"))



@prefix dct: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.com/ns#> .
@prefix ns1: <http://www.w3.org/ns/dcat#> .
@prefix ns2: <http://xmlns.com/foaf/0.1/> .
@prefix ns3: <http://www.w3.org/ns/locn#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://data.gov.au/dataset/67ca5de1-8774-4678-9d1b-8b1cb70ab33c> a ns1:Dataset ;
    ex:area2 "a descriptiono of this dataset" ;
    dct:description "a descriptiono of this dataset" ;
    dct:identifier "67ca5de1-8774-4678-9d1b-8b1cb70ab33c" ;
    dct:issued "2016-03-23T05:08:17.991412"^^xsd:dateTime ;
    dct:language "eng" ;
    dct:modified "2019-11-19T23:18:49.871451"^^xsd:dateTime ;
    dct:publisher <https://data.gov.au/organization/69f37b4c-bdf0-4c85-bd56-82fa6d6b087a> ;
    dct:spatial [ a dct:Location ;
            ns3:geometry "POLYGON ((110.0012 -10.0012, 115.0080 -10.0012, 155.0080 -45.0036, 110.0012 -45.0036, 110.0012 -10.0012))"^^<http://www.opengis.net/ont/geosparql#wktLiteral>,
                "{\"type\": \