# The SPARQL CONSTRUCT query and inferencing

In this Notebook you will learn how to use the SPARQL CONSTRUCT query that enables the user to create new triples. This is different from simply adding known triples, as can be done with the INSERT command, because the triples to be added are discovered by querying an existing graph.

You will then see how the CONSTRUCT query can be used to provide some simple inferencing.

As usual we begin by importing the appropriate packages and setting up helper functions.


In [None]:
# Import the necessary packages
import rdflib
from SPARQLWrapper import SPARQLWrapper, JSON


# Add some helper functions

# Print the triples in a given graph
def printtriples(agraph): 
    for subj, pred, obj in agraph:
        print(subj)
        print(pred)
        print(obj)
        print('')

## The CONSTRUCT query



The CONSTRUCT query creates a set of new triples, i.e. a new graph.

The following code downloads an RDF graph, performs a query on it, constructs a new graph from the results of the query and saves the resulting graph in a file.

The query finds every subject of type `Person` in the Berners-Lee card dataset, then matches every triple that has one of those subjects. The subject, predicate and object of each match are bound to the variables `?subj`, `?pred` and `?obj`, which form the triple pattern in the curly braces following the CONSTRUCT keyword.

Finally, all found triples are written to the file `person.ttl` in Turtle format.

In [None]:
# Create a new empty graph object (mygraph), then download an RDF graph (the Berners-Lee card)
# and parse it into the graph object
mygraph = rdflib.Graph()
mygraph.parse("http://www.w3.org/People/Berners-Lee/card.rdf")

# Build a CONSTRUCT query
q = '''CONSTRUCT { ?subj ?pred ?obj . }
        WHERE {
            # Find all individuals in the graph who are of type Person
            ?subj
                <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
                <http://xmlns.com/foaf/0.1/Person> .
            # Find all triples in the graph with a subject that is an individual 
            # found in the previous pattern
            ?subj ?pred ?obj .
        }'''


# Run the CONSTRUCT query against mygraph. The result object referenced by newgraph
# can be iterated over as triples and serialized like a graph
newgraph = mygraph.query(q)

# Print the triples in the constructed graph
print("*** Facts about persons ***")

printtriples(newgraph)
    
# Save in turtle format
newgraph.serialize("person.ttl", format="turtle")



The CONSTRUCT query has two parts. The first part is a pattern, enclosed in curly braces, following the CONSTRUCT keyword. Values for the identifiers in this pattern are found by matching patterns in the WHERE clause, the second part of the query.

The WHERE clause is identical in syntax and purpose to the WHERE clause used in a SELECT query.

The aim is to search graphs for triples that match the patterns in the WHERE clause and bind values to the variables in those patterns. At the end of the processing of the WHERE clause, the bound values are used to create new triples according to the pattern following the CONSTRUCT keyword.

It is permitted to have more than one pattern following the CONSTRUCT keyword (hence the use of the curly braces). The CONSTRUCT query will build triples for each pattern in the braces (the patterns must be separated by full stops).

### Activity 1

Create a file containing large countries and their populations obtained from the file `European Geography.ttl` in the `data/` folder. In this context, a large country has a population of at least 10 million.


In [None]:
# Insert your solution here.

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


In [None]:
# Create a new empty graph
geog = rdflib.Graph()

# File is in the data/ folder
geog.parse('data/European Geography.ttl', format="turtle")

q = '''CONSTRUCT { ?country ?pred ?population . }
        WHERE {
            ?country <http://www.example.org/hasPopulation> ?population .
            ?country ?pred ?population
            FILTER (?population >= 10000000)
        }'''

newgraph = geog.query(q)

# Print the triples in the constructed graph
print("*** Populations of Countries ***")

printtriples(newgraph)
    
# Save in turtle format
newgraph.serialize("populations.ttl", format="turtle")

## Merging two graphs

When building a mash-up in which an application draws data from several sources, you may wish to combine (merge) a number of graphs into a single graph. This can be done quite simply using rdflib's '`+`' operator on graph objects, as illustrated in the next example.

Here, the code builds two graphs - one relating to Germany and one relating to France - each with three triples, and then merges the two graphs.

In [None]:
# Create two separate graphs: one for Germany and one for France
germg = rdflib.Graph()
frang = rdflib.Graph()

# Create namespace object
geogNS = rdflib.Namespace("http://www.example.org/geography/")

# Create URI references for property resources
locatedIn = rdflib.URIRef("http://www.example.org/locatedIn")
hasBorders = rdflib.URIRef("http://www.example.org/hasBorders")

# Add triples to Germany graph
germg.add((geogNS["Germany"], rdflib.RDF["type"], geogNS["country"]))
germg.add((geogNS["Germany"], locatedIn, geogNS["Europe"]))
germg.add((geogNS["Germany"], hasBorders, geogNS["France"]))

# Add triples for France
frang.add((geogNS["France"], rdflib.RDF["type"], geogNS["country"]))
frang.add((geogNS["France"], locatedIn, geogNS["Europe"]))
frang.add((geogNS["Germany"], hasBorders, geogNS["France"]))

# Create a new graph for Europe
eurog = rdflib.Graph()

#*******************************************************
# Merge graphs for Germany and France into Europe graph
eurog = germg + frang
#*******************************************************

# Print number of triples in each graph
print("No of triples in Germany graph = ", len(germg))
print("No of triples in France graph = ", len(frang))
print("No of triples in Combined graph = ", len(eurog))
print('')

# Print each triple in Europe graph
printtriples(eurog)

The '`+`' operator between graphs is like the union of two sets because triples occurring more than once are not duplicated in the merged graph.

## Inferencing with the CONSTRUCT query

The first step is to create a graph to experiment with (`geog`).

This will be a reduced version of the European Geography dataset that you used above.

In [None]:
# Create a new empty graph
geog = rdflib.Graph()

# Create a namespace
geogNS = rdflib.Namespace("http://www.example.org/geography/")

# Create resources with this namespace
germany = geogNS["Germany"]  
france = geogNS["France"] 
austria = geogNS["Austria"]
belgium = geogNS["Belgium"]
europe = geogNS["Europe"]
country = geogNS["country"]
continent = geogNS["continent"]


# Create some properties
hasBorder = rdflib.URIRef("http://www.example.org/hasBorders")
locatedIn = rdflib.URIRef("http://www.example.org/locatedIn")
hasPopulation = rdflib.URIRef("http://www.example.org/hasPopulation")
hasName = rdflib.URIRef("http://www.example.org/name")

# A property for saying that Europe contains countries
contains = rdflib.URIRef("http://www.example.org/contains")

# Add some data
geog.add((germany, rdflib.RDF["type"], country))
geog.add((france, rdflib.RDF["type"], country))
geog.add((austria, rdflib.RDF["type"], country))
geog.add((belgium, rdflib.RDF["type"], country))
geog.add((europe, rdflib.RDF["type"], continent))
geog.add((germany, locatedIn, europe))
geog.add((france, locatedIn, europe))
geog.add((austria, locatedIn, europe))
geog.add((belgium, locatedIn, europe))
geog.add((germany, hasBorder, france))
geog.add((germany, hasBorder, austria))
geog.add((germany, hasPopulation, rdflib.Literal(82000000)))
geog.add((germany, hasName, rdflib.Literal("Deutschland")))

printtriples(geog)

Before running a query, examine the `geog` triples above. Can you say what results should be obtained if you were to ask what triples have the property `contains`?

Confirm your suspicions by running the following query to find all countries contained in Europe using the property `contains`.

In [None]:
from SPARQLWrapper import SPARQLWrapper

# A routine to print out the countries located in a continent, if any.
# The first argument, results, contains triples.
# The argument, elem, should be set to 0 if the continent is the subject of a triple,
# or 2 if the continent is the object of a triple.
# The argument continent should be set to the name of a continent (string).
def printCountriesInContinent(results, elem, continent):
    if len(results) == 0:
        print("None found")
    else:
        continent = "http://www.example.org/geography/" + continent
        for row in results:
            if (str(row[elem]) == continent):
                print (row[0], row[1], row[2], sep="\n")
                print()
                
# A routine to print out the countries located in a continent, if any
def printCountriesLocatedInContinent(results, continent):
    print("*** Countries located in", continent, "***")
    printCountriesInContinent(results, 2, continent)
    
# A routine to print out the countries that a continent contains, if any
def printContinentContainsCountries(results, continent):
    print("*** " + continent + " contains countries ***")
    printCountriesInContinent(results, 0, continent)
 
   
# Set up query string
# Query searches for triples with Europe as subject and with the property contains
q1 = '''SELECT ?country
    WHERE {
        <http://www.example.org/geography/Europe>  
        <http://www.example.org/contains> 
        ?country.
    }
'''
# Run query against geog graph  
r1 = geog.query(q1)

# Print results
printContinentContainsCountries(r1, "Europe")

Naturally, no countries were found because there are no triples in the graph with `contains` as their predicate, which is what the SPARQL query was trying to match.

Suppose we add a further triple which states that the property `contains` is the inverse of the property `locatedIn`. This is how it would be done with OWL using Manchester syntax:

    ObjectProperty: contains InverseOf: locatedIn

Then, if we used a query engine that was capable of inferencing, it would be able to infer, for example, that the triple `(germany locatedIn europe)` would be equivalent to `(europe contains germany)`. Note that there would not be any explicit triples in the graph with `contains` as their property.

Here is an attempt to do this using rdflib.

In [None]:
# Introduce the inverse property as it would be done in OWL 2
# This is the inverseOf property characteristic
inverseOf = rdflib.URIRef("http://www.example.org/inverseOf")

# Declare that *contains* is a (type of) property
geog.add((contains, rdflib.RDF["type"], rdflib.RDF["Property"]))

# Declare that *contains* is the inverse property of *locatedIn*
geog.add((contains, inverseOf, locatedIn))

# Re-run query against geog graph to see what the query engine does 
r1 = geog.query(q1)

# Print out results
printContinentContainsCountries(r1, "Europe")



Still no results.

If the query had been presented to a query engine that does have an inferencing facility, it would know that those triples involving `locatedIn` are the inverse of `contains` and would be able to tell you, for example, that since `germany` is located in `europe`, `europe` must contain `germany`. Sadly, the query engine associated with rdflib does not have an inferencing capability.

However, if we now add the following SPARQL CONSTRUCT query, something amazing happens.

In [None]:
q2 = '''CONSTRUCT { ?resource2 ?property1 ?resource1 . }
    WHERE {
        ?property1 <http://www.example.org/inverseOf> ?property2 .
        ?resource1 ?property2 ?resource2 .
    }
'''

r2 = geog.query(q2)

printContinentContainsCountries(r2, "Europe")

Suddenly, there appear to be triples in the graph with `contains` as their property. How has this happened?

Before reading our explanation, can you explain from the query what triples are found by the WHERE clause and hence what triples are constructed? One way to go about this is to find one triple that matches the pattern in the WHERE clause and then determine what triple would be constructed that matches the pattern following the CONSTRUCT keyword.

The CONSTRUCT query has clearly created triples. To see how, here is a walk-through of the actions taken by the SPARQL query engine when running this query.

The first pattern in the CONSTRUCT's WHERE clause is:

    (?property1 http://www.example.org/inverseOf ?property2)

There is only one triple in the graph that matches this pattern:

    (contains, inverseOf, locatedIn)

Therefore, `?property1` is bound to the value `contains` and `?property2` is bound to the value `locatedIn`.

Moving on to the second triple pattern:

    (?resource1 ?property2 ?resource2)

The above bindings result in this pattern becoming:

    (?resource1 locatedIn ?resource2)
    
There are several triples in the graph that match with this pattern, one of which is:

    (germany, locatedIn, europe)

With this binding, the triple CONSTRUCTed `(?resource2 ?property1 ?resource1)` is:

    (europe contains germany)

which is one of the query's results that is printed out.

So here is a way of using SPARQL to achieve a form of inferencing even when the query engine does not have an inferencing capability.
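
The walk-through above can be mimicked in plain Python, without rdflib, to make the two matching steps explicit. This is only a sketch of the idea, not how a SPARQL engine is implemented; triples are represented as tuples of short strings that stand in for the full URIs.

```python
# Triples as (subject, predicate, object) tuples; the short names stand in for URIs
triples = {
    ("contains", "inverseOf", "locatedIn"),
    ("germany", "locatedIn", "europe"),
    ("france", "locatedIn", "europe"),
    ("germany", "hasBorders", "france"),
}

constructed = set()

# First pattern: ?property1 inverseOf ?property2
for property1, pred, property2 in triples:
    if pred != "inverseOf":
        continue
    # Second pattern: ?resource1 ?property2 ?resource2
    for resource1, prop, resource2 in triples:
        if prop == property2:
            # Template: ?resource2 ?property1 ?resource1
            constructed.add((resource2, property1, resource1))

for triple in sorted(constructed):
    print(triple)
# ('europe', 'contains', 'france')
# ('europe', 'contains', 'germany')
```

The `hasBorders` triple is ignored because no `inverseOf` triple mentions that property.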

Several similar inferencing rules can be written in SPARQL in the same way as the `inverseOf` rule in the CONSTRUCT query shown above. Indeed, there is the SPARQL Inferencing Notation (SPIN) that takes this idea much further.

Based on its CONSTRUCT keyword, SPARQL can be considered to be a rule language. The SPIN library at http://topbraid.org/spin/owlrl contains the complete OWL 2 RL specification in executable form, formalised in SPARQL CONSTRUCT rules.

If you are interested, more information about SPIN can be found at http://www.topquadrant.com/technology/sparql-rules-spin/.


### Activity 2

Here is some more data about the geographical relationships between the countries of the United Kingdom. Note that only the United Kingdom is asserted to be in Europe.

In [None]:
england = geogNS["England"]  
scotland = geogNS["Scotland"]
wales = geogNS["Wales"]
northernIreland = geogNS["Northern_Ireland"]
uk = geogNS["United_Kingdom"]
gb = geogNS["Great_Britain"]

geog.add((england, locatedIn, gb))
geog.add((scotland, locatedIn, gb))
geog.add((wales, locatedIn, gb))
geog.add((gb, locatedIn, uk))
geog.add((northernIreland, locatedIn, uk))
geog.add((uk, locatedIn, europe))

printtriples(geog)
print('')



It is possible to infer from this data that, for example, Wales is located in Europe because Wales is located in Great Britain and Great Britain is located in the United Kingdom which is located in Europe. This is an example of the transitive property of the predicate `locatedIn`.

The following asserts that the property `locatedIn` is transitive. A triple to this effect is added to the graph.

In [None]:
# This is the transitive property characteristic for locatedIn
transitive = rdflib.URIRef("http://www.example.org/transitive")

# Declare that locatedIn is transitive
geog.add((locatedIn, rdflib.RDF["type"], transitive))

Complete the following CONSTRUCT query that implements the transitive property. That is,

* if `resource1` is related to `resource2`, and `resource2` is related in the same way to `resource3`, then `resource1` is related in the same way to `resource3`.


In [None]:
# A query that implements transitivity property
q3 = '''CONSTRUCT { ?resource1 ?property1 ?resource3 . }
    WHERE {
        #Find a property from the class transitive
        ?property1  a <http://www.example.org/transitive> .
        
        #Additional code goes here
        
        
    }
    '''

# Run query (r3 is a query result that can be iterated over as triples)
r3 = geog.query(q3)

# Output the triples in graph
printCountriesLocatedInContinent(r3, "Europe")

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


In [None]:
q3 = '''CONSTRUCT { ?resource1 ?property1 ?resource3 . }
    WHERE {
        #Find a property from the class transitive
        ?property1  a <http://www.example.org/transitive> .
        
        #Find two resources that are related by this property
        ?resource1 ?property1 ?resource2 .
        
        #Find a third resource that is related in the same way to resource2
        ?resource2 ?property1 ?resource3 .
        
        #If bindings for resource1 and resource3 have been found
        #the query will create a triple.
    }
'''

r3 = geog.query(q3)

printCountriesLocatedInContinent(r3, "Europe")

The following code simply merges the newly found triples with the existing graph.

In [None]:
# Create a new, empty graph
newGraph = rdflib.Graph()

# Add the contents of the results to the new graph
for row in r3:
    newGraph.add(row)
    
printtriples(newGraph)

# Merge new graph with existing graph
newGraph = newGraph + geog

print("Extended graph")
printtriples(newGraph)

## Summary

In this Notebook you have seen:

1. How the CONSTRUCT query searches a graph for matches to patterns in its WHERE clause, binding variables as it does so. The values bound to the variables are used to create new triples according to the pattern(s) given after the CONSTRUCT keyword.

2. How two graphs can be merged into a third graph using the '`+`' operator.

3. How it is possible to use the CONSTRUCT query to perform some simple inferencing. This is achieved by:

    (a) defining a name for a property characteristic, e.g.

        inverseOf = rdflib.URIRef("http://www.example.org/inverseOf")

    (b) ensuring that the properties (one or more) that are to have this characteristic are declared within the graph, e.g. for binary relations:

        geog.add((contains, rdflib.RDF["type"], rdflib.RDF["Property"]))
        geog.add((locatedIn, rdflib.RDF["type"], rdflib.RDF["Property"]))

    (c) adding a triple that denotes the fact that, in the case of binary relationships, the properties are related by the characteristic, e.g.

        geog.add((contains, inverseOf, locatedIn))

    or, in the case of a unary relationship, the property is defined to be of the appropriate type, e.g.

         geog.add((locatedIn, rdflib.RDF["type"], transitive))

    (d) defining a CONSTRUCT query whose WHERE clause:

         selects a property with the required characteristic;
         selects triple(s) involving the selected property

    (e) defining a triple pattern that expresses the new relationship (placed in braces after the CONSTRUCT keyword).

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `26.3 Visualisation`.