<a href="https://colab.research.google.com/github/labra/RailDataForum_ValidationWorkshop/blob/main/RailDataForumExamples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rail data forum - Validation workshop examples

We will use [rdflib](https://rdflib.readthedocs.io/) to run the examples in Python.

In [3]:
!pip install rdflib

Collecting rdflib
  Downloading rdflib-7.1.4-py3-none-any.whl.metadata (11 kB)
Downloading rdflib-7.1.4-py3-none-any.whl (565 kB)
Installing collected packages: rdflib
Successfully installed rdflib-7.1.4


In [25]:
from rdflib import Graph, Namespace

Create an example RDF graph

In [5]:
g = Graph()

In [6]:
person_ttl_str = """
prefix : <http://example.org/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

:timbl a :Human ;
       :name       "Tim Berners-Lee" ;
       :birthPlace :london ;
       :birthDate  "1955-06-08"^^xsd:date ;
       :employer   :CERN ;
       :knows     :jose .

:london a :City, :Metropolis ;
        :country :UK .

:CERN a :Organization ;
      :name "CERN"    .

:jose :name       "Jose" ;
      :birthPlace :Oviedo .

:Oviedo a :City ;
        :country :Spain .
"""

Parse and load the example into the RDF graph

In [7]:
g.parse(data=person_ttl_str, format="ttl")

<Graph identifier=N6caa421ae062412fb3e9826f50b30577 (<class 'rdflib.graph.Graph'>)>

Now we can run SPARQL queries, such as one that finds the name, birth date, birth place, and country of birth of every human in the graph (here, only Tim Berners-Lee).

In [8]:
query_str = """
PREFIX : <http://example.org/>

SELECT ?person ?name ?birthDate ?birthPlace ?country WHERE {
  ?person a :Human ;
    :name ?name ;
    :birthDate ?birthDate ;
    :birthPlace ?birthPlace .
  ?birthPlace :country ?country
}
"""

Run a SPARQL query

In [9]:
result = g.query(query_str)

In [11]:
for row in result:
    print(f"Name: {row.name}, birth date: {row.birthDate}, birthPlace: {row.birthPlace}")

Name: Tim Berners-Lee, birth date: 1955-06-08, birthPlace: http://example.org/london


RDF is very flexible, so data can easily contain errors. For example, the following RDF data contains several: humans without a name, with more than one name, with a name that is a number, or that know something that is not a human.

In [16]:
rdf_with_errors_str ="""
prefix : <http://example.org/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

:timbl a :Human ;
       :name       "Tim Berners-Lee" ;
       :birthPlace :london ;
       :birthDate  "1955-06-08"^^xsd:date ;
       :employer   :CERN ;
       :knows     :jose .

:london a :City, :Metropolis ;
        :country :UK .

:CERN a :Organization .

:jose :name       "Jose" ;
      :birthPlace :Oviedo .

:Oviedo a :City ;
        :country :Spain .

:alice a :Human ;
      :name 234 .

:bob a :Human ;
     :name "John" , "Jane" .

:carol a :Human ;
     :birthPlace :london .

:dave a :Human ;
     :name "John" ;
     :knows :CERN .
"""

In [17]:
g = Graph()

In [18]:
g.parse(data=rdf_with_errors_str, format="ttl")

<Graph identifier=Nf67b302ff38941c48524bd8fa14755f7 (<class 'rdflib.graph.Graph'>)>

As you can see, RDF tools are quite permissive and don't complain about those errors, following the principle that Anyone can say Anything about Any topic.

## Using SPARQL to validate

It is possible (but can be tedious) to validate the data using SPARQL. For example, the following query finds humans that have a name which isn't a string, or that don't have exactly one name.

In [19]:
sparql_check_str = """
PREFIX : <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?person WHERE {
  ?person a :Human .
  {
    SELECT ?person ?name WHERE {
    ?person :name ?name .
    FILTER (!isLiteral(?name) || datatype(?name) != xsd:string)
    }
  } UNION {
    SELECT ?person (COUNT(?name) AS ?nameCount)
    WHERE {
        ?person a :Human .
        OPTIONAL { ?person :name ?name . }
    }
    GROUP BY ?person
    HAVING (COUNT(?name) != 1)
  }
}
"""


In [20]:
result = g.query(sparql_check_str)


In [21]:
for row in result:
    print(row)

(rdflib.term.URIRef('http://example.org/alice'),)
(rdflib.term.URIRef('http://example.org/bob'),)
(rdflib.term.URIRef('http://example.org/carol'),)


As an exercise, you can try to define another SPARQL query that checks that a human only knows other humans.

## Validating using ShEx

[ShEx](https://shex.io/) has been designed as a concise, human-readable language to describe and validate RDF data. The following ShEx schema specifies a shape for humans:

In [22]:
person_schema = """
prefix : <http://example.org/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

:Person {
 :name       xsd:string       ;
 :birthPlace @:Place        ? ;
 :birthDate  xsd:date       ? ;
 :employer   @:Organization * ;
 :knows      @:Person       *
}

:Place {
 :country . ?
}

:Organization {
    a [ :Organization ]
}
"""

In [23]:
!pip install pyshex

Collecting pyshex
  Downloading PyShEx-0.8.1-py3-none-any.whl.metadata (1.0 kB)
Collecting cfgraph>=0.2.1 (from pyshex)
  Downloading CFGraph-0.2.1.tar.gz (2.6 kB)
  Preparing metadata (setup.py) ... done
Collecting pyshexc==0.9.1 (from pyshex)
  Downloading PyShExC-0.9.1-py2.py3-none-any.whl.metadata (940 bytes)
Collecting rdflib-shim (from pyshex)
  Downloading rdflib_shim-1.0.3-py3-none-any.whl.metadata (918 bytes)
Collecting shexjsg>=0.8.2 (from pyshex)
  Downloading ShExJSG-0.8.2-py2.py3-none-any.whl.metadata (997 bytes)
Collecting sparqlslurper>=0.5.1 (from pyshex)
  Downloading sparqlslurper-0.5.1-py3-none-any.whl.metadata (430 bytes)
Collecting sparqlwrapper>=1.8.5 (from pyshex)
  Downloading SPARQLWrapper-2.0.0-py3-none-any.whl.metadata (2.0 kB)
Collecting jsonasobj>=1.2.1 (from pyshexc==0.9.1->pyshex)
  Downloading jsonasobj-1.3.1-py3-none-any.whl.metadata (995 bytes)
Collecting pyjsg>=0.11.10 (from pyshexc==0.9.1->pyshex)
  Downloading PyJSG-0.11.10-py3-none-any.

In [24]:
from pyshex import ShExEvaluator

In [26]:
ex = Namespace("http://example.org/")

In [28]:
nodes = [ ex.timbl, ex.alice, ex.bob, ex.carol, ex.dave ]

In [30]:
for node in nodes:
  rs = ShExEvaluator().evaluate(g, person_schema, focus=node, start=ex.Person)
  for r in rs:
    if r.result:
      print(f"Node {node} Passed as Person")
    else:
      print(f"Node {node} Failed as Person: {r.reason}")

Node http://example.org/timbl Passed as Person
Node http://example.org/alice Failed as Person:   Testing :alice against shape http://example.org/Person
    Datatype mismatch - expected: http://www.w3.org/2001/XMLSchema#string actual: http://www.w3.org/2001/XMLSchema#integer
Node http://example.org/bob Failed as Person:   Testing :bob against shape http://example.org/Person
    Triples:
      :bob :name "Jane" .
      :bob :name "John" .
   2 triples exceeds max {1,1}
Node http://example.org/carol Failed as Person:   Testing :carol against shape http://example.org/Person
       No matching triples found for predicate :name
Node http://example.org/dave Failed as Person:   Testing :dave against shape http://example.org/Person
    Testing :CERN against shape http://example.org/Person
         No matching triples found for predicate :name


## Validating with SHACL

In [31]:
shacl_shapes_str = """
prefix :       <http://example.org/>
prefix sh:     <http://www.w3.org/ns/shacl#>
prefix xsd:    <http://www.w3.org/2001/XMLSchema#>
prefix schema: <http://schema.org/>

:PersonShape a sh:NodeShape ;
   sh:targetClass :Human ;
   sh:property [
    sh:path      :name ;
    sh:minCount 1; sh:maxCount 1;
    sh:datatype xsd:string ;
  ] ;
  sh:property [
    sh:path :birthPlace ;
    sh:node :PlaceShape
  ] ;
  sh:property [
   sh:path     :birthDate ;
   sh:maxCount 1;
   sh:datatype xsd:date;
  ] ;
  sh:property [
    sh:path :employer ;
    sh:node :OrganizationShape
  ] ;
  sh:property [
    sh:path :knows ;
    sh:node :PersonShape
  ] .

:PlaceShape a sh:NodeShape ;
  sh:property [
    sh:path :country ;
    sh:nodeKind sh:IRI ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
  ] .

:OrganizationShape a sh:NodeShape .
"""

In [32]:
!pip install pyshacl

Collecting pyshacl
  Downloading pyshacl-0.30.1-py3-none-any.whl.metadata (35 kB)
Collecting owlrl<8,>=7.1.2 (from pyshacl)
  Downloading owlrl-7.1.3-py3-none-any.whl.metadata (3.6 kB)
Collecting html5rdf<2,>=1.2 (from rdflib[html]!=7.1.2,<8.0,>=7.1.1->pyshacl)
  Downloading html5rdf-1.2.1-py2.py3-none-any.whl.metadata (7.5 kB)
Downloading pyshacl-0.30.1-py3-none-any.whl (1.3 MB)
Downloading owlrl-7.1.3-py3-none-any.whl (51 kB)
Downloading html5rdf-1.2.1-py2.py3-none-any.whl (109 kB)
Installing collected packages: html5rdf, owlrl, pyshacl
Successfully installed html5rdf-1.2.1 owlrl-7.1.3 pyshacl-0.30.1


In [33]:
from pyshacl import validate

In [34]:
shapes = Graph()

In [35]:
shapes.parse(data=shacl_shapes_str, format="ttl")

<Graph identifier=N43a1d79d203240bf925beb2a453bba3c (<class 'rdflib.graph.Graph'>)>

In [37]:
r = validate(g, shacl_graph=shapes)

In [38]:
conforms, results_graph, results_text = r
print(f"Conforms: {conforms}")
print(results_graph.serialize(format="turtle"))

Conforms: False
@prefix : <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [ a sh:ValidationResult ;
            sh:focusNode :bob ;
            sh:resultMessage "More than 1 values on :bob->:name" ;
            sh:resultPath :name ;
            sh:resultSeverity sh:Violation ;
            sh:sourceConstraintComponent sh:MaxCountConstraintComponent ;
            sh:sourceShape _:n46a9a97b32fc413a92a6bb02cd0f0d03b1 ],
        [ a sh:ValidationResult ;
            sh:detail [ a sh:ValidationResult ;
                    sh:focusNode :CERN ;
                    sh:resultMessage "Less than 1 values on :CERN->:name" ;
                    sh:resultPath :name ;
                    sh:resultSeverity sh:Violation ;
                    sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
                    sh:sourceShape _:n46a9a97b32fc413a92a6bb02cd0f0