# Exploring the RDFLib package

***Before you start working through this notebook please take note of the following.***

*When attempting to access data from a remote website you may find that the website is down for maintenance. In this circumstance, you are likely to see a great deal of output from the Python system culminating in an error message such as 'URLError'. The only way forward is to leave what you are doing and check back later to see whether the website has become available.*

*However, there is another scenario in which a similar URLError occurs. If you have allowed your computer to sleep or hibernate, it is possible that your virtual machine (VM) will lose information about the ports used to access resources across the Web. In this case you should restart the VM as follows:*

    close all browser tabs relating to the Notebooks
    
*then, either*

    double-click on the shortcut 'vagrant reload'
    
*or, at a command prompt, execute the commands, in order:*

    vagrant halt

    vagrant up

    vagrant provision

*You can then reopen your Notebooks.*

## 1 Loading data from a file

This Notebook will take you through the basic steps for writing Python code to read, create and save RDF graphs (sets of triples).

The first thing to do is import the appropriate library, in this case rdflib.

In [1]:
import rdflib

Next, create an empty graph in memory and assign it to a variable of your choice.

In [2]:
mygraph = rdflib.Graph()

Now you are ready to read triples from a dataset, located at a specified Web address, into the graph. 

The format of the dataset might be Turtle, XML or some other encoding, but the rdflib `parse` routine will attempt to convert the dataset into a Python data structure whatever the format. However, it is a good idea to specify the encoding using the `format` argument, e.g. `format="xml"`.

**Important:** Normally, the result should appear quite quickly (within a few seconds). If the call to `parse` takes a long time and/or an error such as *URLError* is reported, it is likely that the dataset is currently unavailable. Your only choice is to try again at a later time. Unfortunately, in such circumstances the error message is not very helpful in telling you what the underlying problem is.

If you find that the dataset is unavailable, skip to 'Alternative example' below which uses a dataset that is stored in your Notebooks file; you can return to the main thread of this Notebook later.

In [3]:
mygraph.parse("http://www.w3.org/People/Berners-Lee/card.rdf", format="xml")

<Graph identifier=N36526bbea28f434989c299dd34c31639 (<class 'rdflib.graph.Graph'>)>

To check that all is well, compute the number of triples in the graph.

In [4]:
len(mygraph)

87

rdflib graphs emulate Python's container types and are best thought of as a *set* of unordered triples.

The rdflib library also arranges for some of Python's built-in functions and operators to behave in ways appropriate to RDF triples; the `len` function is one.

Putting this all together:

In [5]:
import rdflib

# Create a new graph named mygraph
mygraph = rdflib.Graph()

# Get data from a dataset 
mygraph.parse("http://www.w3.org/People/Berners-Lee/card.rdf", format="xml")

# Query the data (find the number of triples)
len(mygraph)

87

### Alternative example

Try the following which uses a dataset that is stored in the `data/` folder.

In [6]:
import rdflib

# Create a new graph named mygraph
mygraph = rdflib.Graph()

# Get data from a dataset 
mygraph.parse("data/European Geography.ttl", format="turtle")

# Query the data (find the number of triples)
len(mygraph)

533

### Summary

To read a graph into memory: import rdflib, create a new empty graph using `rdflib.Graph()`, and then read in the data from a file (specified as a URI) using the `parse()` method.

## 2 Viewing the triples

You can iterate through the triples in a graph using a `for` loop (in the following code, `trip` is an arbitrary variable name).

Examine a few of the triples (there are quite a few) to get a feeling for the format of the output. The output will depend on whether you attempted the alternative example or not.

In [7]:
# A routine to print out the first few triples
def printtriples(graph, limit):
    n = 0
    for trip in graph:
        print(trip)
        print('')
        n = n+1
        if n >= limit:
            break
        
printtriples(mygraph, 10)

(rdflib.term.URIRef('http://www.example.org/geography/Georgia'), rdflib.term.URIRef('http://www.example.org/locatedIn'), rdflib.term.URIRef('http://www.example.org/geography/Asia'))

(rdflib.term.URIRef('http://www.example.org/geography/Serbia'), rdflib.term.URIRef('http://www.example.org/hasArea'), rdflib.term.Literal('77474', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))

(rdflib.term.URIRef('http://www.example.org/geography/Austria'), rdflib.term.URIRef('http://www.example.org/hasBorder'), rdflib.term.URIRef('http://www.example.org/geography/Switzerland'))

(rdflib.term.URIRef('http://www.example.org/geography/Slovenia'), rdflib.term.URIRef('http://www.example.org/hasBorder'), rdflib.term.URIRef('http://www.example.org/geography/Italy'))

(rdflib.term.URIRef('http://www.example.org/geography/Austria'), rdflib.term.URIRef('http://www.example.org/hasName'), rdflib.term.Literal('Oesterreich'))

(rdflib.term.URIRef('http://www.example.org/geography/Hungary'),

The first few triples illustrate all you need to know about the representation of a triple using rdflib. At this stage just concentrate on the structure of the triples.

In the output, each triple is enclosed in parentheses and its three elements - subject, predicate and object - are separated by commas. As expected, each subject and predicate is represented by a URI and an object is either a URI or a literal. For example, if you were to lay out the subject, predicate and object of one of the triples printed above on separate lines you would see something like:

    (
    rdflib.term.URIRef('https://www.w3.org/People/Berners-Lee/card#i'),

    rdflib.term.URIRef('http://xmlns.com/foaf/0.1/givenname'),

    rdflib.term.Literal('Timothy')
    )

Or, if you attempted the alternative example:

    (
    rdflib.term.URIRef('http://www.example.org/geography/Bosnia_Hertzogovena'), 

    rdflib.term.URIRef('http://www.example.org/hasName'),

    rdflib.term.Literal('Bosnia and Herzegovina')
    )

Note: You may not actually see either of these particular triples in your output as there is no guarantee that you will obtain the same 10 triples each time you print them out (the triples in a graph are unordered, like the elements of a Python set).

In our examples, the meaning of the first triple is 'the given name of the person is Timothy', and the second is 'the country Bosnia_Hertzogovena has the name Bosnia and Herzegovina' but it is the structure of a triple that we want you to focus on at this stage.

Take care here. In this discussion the term 'object' refers to the third element of an RDF triple. To distinguish this use of the word 'object' from that used in an object-oriented programming language (OOPL) we shall refer to the OOPL use of 'object' as a 'Python object'.

In rdflib, each element that is represented by a URI is an `rdflib.term.URIRef` Python object with the actual URI given as a string argument as in:

    rdflib.term.URIRef('http://www.w3.org/People/Berners-Lee/card#i')

and

    rdflib.term.URIRef('http://www.example.org/geography/Bosnia_Hertzogovena')

A literal is an `rdflib.term.Literal` Python object with the value given as a string argument. For example,

    rdflib.term.Literal('Timothy')

or

    rdflib.term.Literal('Bosnia and Herzegovina')

The encoding of some literal arguments can be quite large!

There is a third type of Python object in rdflib, `rdflib.term.BNode`, which you may observe as either the subject or the object of a triple. For example,

    rdflib.term.BNode('N9e3a51b39d754597995f08f84f7962c8')

A BNode is a *blank node* which has several uses within RDF. One use of a blank node is to represent a resource that has not yet been identified either because we don't know what it should be (it is missing in the data) or it will never exist (it isn't meaningful). 

A BNode has a (long) reference string as argument (two BNodes with the same argument represent the same missing item of data). For now, accept that a specific BNode reference string can appear both as the subject of one triple and the object of another. To discover more about blank nodes see Part 25, Section 3.1 of the module text.

## 3  Getting to the content of the triples

A better way of printing out a triple is to print the subject, predicate and object on separate lines.  In this format, the URIs are easier to read and the literal values and BNodes are easier to identify. The following function prints out a limited number of triples in a given graph (if the argument `limit` is set to zero, all triples in the graph are printed).

In [8]:
def printtriples(agraph, limit): 
    n = 0
    for subj, pred, obj in agraph:
        print(subj)
        print(pred)
        print(obj)
        print('')
        if limit > 0:
            n = n+1
            if n == limit:
                break
            
        
# Try it out (print only the first 5 triples)
printtriples(mygraph, 5)


http://www.example.org/geography/Georgia
http://www.example.org/locatedIn
http://www.example.org/geography/Asia

http://www.example.org/geography/Serbia
http://www.example.org/hasArea
77474

http://www.example.org/geography/Austria
http://www.example.org/hasBorder
http://www.example.org/geography/Switzerland

http://www.example.org/geography/Slovenia
http://www.example.org/hasBorder
http://www.example.org/geography/Italy

http://www.example.org/geography/Austria
http://www.example.org/hasName
Oesterreich



## 4  Saving a graph to a file

An rdflib graph can be saved to a file very easily using the `serialize` method. Serializing a graph means transforming the graph into a linear text format, which can then be written to a file. As you already know, there are several standard syntaxes for RDF including Turtle, NTriples and JSON, so the call to `serialize` should specify both the name of the file to be written to and the format of the triples. 

Try the following, which should create three new files in the same folder as this Notebook with different formats. Run the code and then examine the contents of your Notebook folder.

In [9]:
mygraph.serialize("mygraph.ttl", format="turtle")
mygraph.serialize("mygraph.nt", format="nt") # nt stands for NTriples
mygraph.serialize("mygraph.xml", format="xml") # RDF/XML, named explicitly as the default differs between rdflib versions

The names given to the files in the first argument of `serialize` are your choice, although it is best to use the standard file extensions (`ttl`, `nt` and `xml`). 

You should give a full pathname rather than just a filename when you want files to be stored in a different folder (than the one containing your Notebooks).

To specify the formats you should use the strings shown in the examples above, which are rdflib standards. It is safest to name the format explicitly: older releases of rdflib write RDF/XML by default, whereas rdflib 6 and later write Turtle. RDF/XML can also be written in a more compact, nested layout by specifying `format="pretty-xml"`.


## 5  Adding triples to a graph

An RDF graph consists of nodes (representing resources) related via properties (predicates). You can create individual nodes in rdflib and then combine them into triples which can then be added to a graph.

As you have seen, in rdflib there are three kinds of node: URIRef, Literal and BNode. The following code creates three nodes and adds them in the form of a triple to a new, empty graph named `geog`. The variable names `germany`, `hasPopulation` and `germanyPopulation` are arbitrary but hopefully meaningful.

The URIs used in this example use the domain www.example.org which is a real domain that has been established for use in illustrative examples in documents. You may use this domain in examples without prior consent.

Notice the two sets of parentheses after `add`. The inner set surrounds the three nodes, as in `(subject,predicate,object)`, to form a triple. The triple, with its parentheses, is then an argument to the `add` method and is placed inside the outer set of parentheses.

In [10]:
geog = rdflib.Graph()

# Create a node with the URI for the subject 'Germany'
germany = rdflib.URIRef("http://www.example.org/geography/Germany")
# Create a node with the URI for the predicate 'population'
hasPopulation = rdflib.URIRef("http://www.example.org/population")
# Create a node with the literal value '82000000'
germanyPopulation = rdflib.Literal(82000000)

# Add the triple consisting of a subject, predicate and object to the graph
geog.add((germany, hasPopulation, germanyPopulation))

printtriples(geog, 0)

http://www.example.org/geography/Germany
http://www.example.org/population
82000000



The next step adds several triples to the graph `geog`. This requires having to repeat the same URI many times. You already know that the same kind of problem occurred when building RDF graphs in Part 24. The answer there was to use the idea of a prefix: define a short name to stand for the initial part of a URI. You can do something similar in rdflib where you use the rdflib class `Namespace` to create a variable representing the initial part of the URI. You can then create nodes quite easily with a shorter syntax as illustrated in the code below.

In [12]:
# Create a namespace
geogNS = rdflib.Namespace("http://www.example.org/geography/")

# Create a resource with this namespace
germany = geogNS["Germany"]  
# This is shorter than rdflib.URIRef("http://www.example.org/geography/Germany")

# See if it works
# print(germany)

# Create more resources with the same namespace
france = geogNS["France"] 
austria = geogNS["Austria"]
europe = geogNS["Europe"]
country = geogNS["country"]
continent = geogNS["continent"]

# see if it works still
print(country)

http://www.example.org/geography/country


In this example, you will use properties representing the concepts of `hasBorder` (a country that borders another country), `locatedIn` (the continent that a country is within), `hasPopulation` (the size of a country's population), `hasName` (the common name of a country) and `hasCapital` (the name of a country's capital city). 

To do this, create the appropriate URIRef nodes for the properties (predicates).

In [13]:
hasBorder = rdflib.URIRef("http://www.example.org/hasBorder")
locatedIn = rdflib.URIRef("http://www.example.org/locatedIn")
hasPopulation = rdflib.URIRef("http://www.example.org/hasPopulation")
hasName = rdflib.URIRef("http://www.example.org/name")
hasCapital = rdflib.URIRef("http://www.example.org/hasCapital")


It's always a good idea to define the domains of the subjects. This can be done by using the RDF `type` predicate. In rdflib this is easily done by using `rdflib.RDF["type"]` (which stands for http://www.w3.org/1999/02/22-rdf-syntax-ns#type).

Hence, to state that 'Germany is a country' you construct the triple:

    (germany, rdflib.RDF["type"], country)
    
Aside: rdflib contains predefined namespaces for the most common RDF schemas (ontologies) including RDF and FOAF. For example, `rdflib.FOAF["knows"]` is equivalent to http://xmlns.com/foaf/0.1/knows.

We can put all this together to construct a set of triples representing assertional data about some countries that we know about.

In [14]:
# germany, france, and austria are countries
geog.add((germany, rdflib.RDF["type"], country))
geog.add((france, rdflib.RDF["type"], country))
geog.add((austria, rdflib.RDF["type"], country))
# europe is a continent
geog.add((europe, rdflib.RDF["type"], continent))
# these countries are located in europe
geog.add((germany, locatedIn, europe))
geog.add((france, locatedIn, europe))
geog.add((austria, locatedIn, europe))
# france and austria border germany
geog.add((germany, hasBorder, france))
geog.add((germany, hasBorder, austria))
# the current population of germany
geog.add((germany, hasPopulation, rdflib.Literal(82000000)))
# the names by which germany is commonly known
geog.add((germany, hasName, rdflib.Literal("Deutschland")))
geog.add((germany, hasName, rdflib.Literal("Germany")))
# the capital city of germany
geog.add((germany, hasCapital, rdflib.Literal("Berlin")))


We have used many variables in this example which isn't strictly necessary. We could have written, for example,

    geog.add((geogNS["Germany"], rdflib.RDF["type"], geogNS["country"]))

and avoided introducing the variables `germany` and `country`. However, we think that using variables with well-chosen identifiers makes for easier reading.

Now let's see what has been created and store it in a file.

In [15]:
# Output the number of triples in the store
print(len(geog))
print('')
# Pretty-print each triple
printtriples(geog, 0)
 
# Save in Turtle format
geog.serialize("geog.ttl", format="turtle")


14

http://www.example.org/geography/Germany
http://www.example.org/locatedIn
http://www.example.org/geography/Europe

http://www.example.org/geography/Germany
http://www.example.org/hasCapital
Berlin

http://www.example.org/geography/Germany
http://www.example.org/hasPopulation
82000000

http://www.example.org/geography/Germany
http://www.example.org/name
Germany

http://www.example.org/geography/France
http://www.example.org/locatedIn
http://www.example.org/geography/Europe

http://www.example.org/geography/Germany
http://www.example.org/hasBorder
http://www.example.org/geography/Austria

http://www.example.org/geography/Germany
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.example.org/geography/country

http://www.example.org/geography/Austria
http://www.example.org/locatedIn
http://www.example.org/geography/Europe

http://www.example.org/geography/Germany
http://www.example.org/population
82000000

http://www.example.org/geography/Germany
http://www.example.org/hasBord

You should now have a new file named `geog.ttl` stored in your Notebooks folder. 

### Activity 1

1. Add more triples to the `geog` dataset. Use the following data:

    France borders Germany and Belgium    
    The population of France is 66030000    
    The capital of France is Paris    
    Belgium borders France and Germany    
    The population of Belgium is 11200000    
    The capital of Belgium is Brussels    
    Belgium has several names: België, Belgique and Belgien


2. Print the contents of the updated `geog` dataset.

3. Save the updated `geog` dataset to a file.

4. Open the saved file to check that all worked well. 

Add, and run, your code in the cell below.

In [18]:
belgium = geogNS["Belgium"]

geog.add((france, hasBorder, belgium))
geog.add((france, hasBorder, germany))

geog.add((france, hasPopulation, rdflib.Literal(66030000)))
geog.add((france, hasCapital, rdflib.Literal("Paris")))

geog.add((belgium, rdflib.RDF["type"], country))

geog.add((belgium, hasBorder, france))
geog.add((belgium, hasBorder, germany))

geog.add((belgium, hasPopulation, rdflib.Literal(11200000)))
geog.add((belgium, hasCapital, rdflib.Literal("Brussels")))

geog.add((belgium, hasName, rdflib.Literal('België')))
geog.add((belgium, hasName, rdflib.Literal('Belgique')))
geog.add((belgium, hasName, rdflib.Literal('Belgien')))


In [21]:
printtriples(geog, 0)

http://www.example.org/geography/Germany
http://www.example.org/locatedIn
http://www.example.org/geography/Europe

http://www.example.org/geography/Germany
http://www.example.org/hasCapital
Berlin

http://www.example.org/geography/France
http://www.example.org/locatedIn
http://www.example.org/geography/Europe

http://www.example.org/geography/Germany
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.example.org/geography/country

http://www.example.org/geography/Austria
http://www.example.org/locatedIn
http://www.example.org/geography/Europe

http://www.example.org/geography/France
http://www.example.org/hasBorder
http://www.example.org/geography/Belgium

http://www.example.org/geography/Belgium
http://www.example.org/name
België

http://www.example.org/geography/Belgium
http://www.example.org/name
Belgique

http://www.example.org/geography/Germany
http://www.example.org/population
82000000

http://www.example.org/geography/Austria
http://www.w3.org/1999/02/22-rdf-syntax-ns#ty

In [22]:
# Save in Turtle format
geog.serialize("geog.ttl", format="turtle")



#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


In [None]:
# Create a resource for Belgium in the geography namespace
belgium = geogNS["Belgium"]

# Add the data about Belgium
geog.add((belgium, rdflib.RDF["type"], country))
geog.add((belgium, locatedIn, europe))
geog.add((belgium, hasPopulation, rdflib.Literal(11200000)))

geog.add((belgium, hasName, rdflib.Literal("België")))
geog.add((belgium, hasName, rdflib.Literal("Belgique")))
geog.add((belgium, hasName, rdflib.Literal("Belgien")))

geog.add((belgium, hasBorder, germany))
geog.add((belgium, hasBorder, france))
geog.add((belgium, hasCapital, rdflib.Literal("Brussels")))

# Add the data about France
geog.add((france, hasBorder, germany))
geog.add((france, hasBorder, belgium))
geog.add((france, hasPopulation, rdflib.Literal(66030000)))
geog.add((france, hasCapital, rdflib.Literal("Paris")))

Check that all has worked well:

In [None]:
print(len(geog))
print('')
# Pretty-print each triple
printtriples(geog, 0)

# Save in Turtle format
geog.serialize("geog.ttl", format="turtle")
geog.serialize("geog.xml", format="xml")

In [23]:
!head geog.ttl

@prefix ns1: <http://www.example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.example.org/geography/Austria> a <http://www.example.org/geography/country> ;
    ns1:locatedIn <http://www.example.org/geography/Europe> .

<http://www.example.org/geography/Belgium> a <http://www.example.org/geography/country> ;


### Activity 2

Write code that will read the data stored in the file `geog.ttl` and print out 10 triples and verify that these are indeed triples that you saved to the file in the previous activity.

In [25]:
# Insert your solution here.
agraph = rdflib.Graph()

agraph.parse("geog.ttl", format="turtle")

printtriples(agraph, 10)

http://www.example.org/geography/France
http://www.example.org/hasBorder
http://www.example.org/geography/Germany

http://www.example.org/geography/Germany
http://www.example.org/hasBorder
http://www.example.org/geography/France

http://www.example.org/geography/Belgium
http://www.example.org/name
Belgien

http://www.example.org/geography/France
http://www.example.org/hasPopulation
66030000

http://www.example.org/geography/Austria
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.example.org/geography/country

http://www.example.org/geography/Austria
http://www.example.org/locatedIn
http://www.example.org/geography/Europe

http://www.example.org/geography/Germany
http://www.example.org/population
82000000

http://www.example.org/geography/France
http://www.example.org/hasBorder
http://www.example.org/geography/Belgium

http://www.example.org/geography/Belgium
http://www.example.org/name
Belgique

http://www.example.org/geography/France
http://www.example.org/hasCapital
Paris



#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.


In [None]:
agraph = rdflib.Graph()

agraph.parse("geog.ttl", format="turtle")

printtriples(agraph, 10)

## Summary

In this Notebook you have seen how to create a graph of triples using the features of the Python library rdflib.

To create a new, empty graph, use: 

    mygraph = rdflib.Graph()

To copy data from an existing graph held in a file use the `parse` method as in:

    mygraph.parse("geog.ttl", format="turtle")

where the first argument of `parse` is the URL of the file and the second parameter is the format of the triples in the file.

The data held in memory can be written to a file using the `serialize` method as in:

    mygraph.serialize("mygraph.ttl", format="turtle")

where the first parameter is the name of the file and the second parameter is the format of the triples in the file.

It is a good idea to create a function that will print out (some of) the triples held in the graph so that you can check the contents of the graph.

A new triple can be added to a graph using the `add` method as in:

    geog.add((germany, hasPopulation, germanyPopulation))

Here the variables `germany`, `hasPopulation` and `germanyPopulation` refer to the subject, predicate and object of the triple. The variables representing the subject and predicate must be Python objects of type `rdflib.URIRef` specifying the appropriate URI as in:

    germany = rdflib.URIRef("http://www.example.org/geography/Germany")
    
    hasPopulation = rdflib.URIRef("http://www.example.org/population")

The third element of the triple, the object, must be either of type `rdflib.URIRef` or a literal value of type `rdflib.Literal` as in:

    germanyPopulation = rdflib.Literal(82000000)

In any realistic situation there will be many triples utilising the same URIs so, to make the job of creating the data easier and to make the resulting graph easier to read, use the class `Namespace` to create a namespace as in:

    geogNS = rdflib.Namespace("http://www.example.org/geography/")

It is then possible to use the shorter construct:

    geogNS["Germany"]

in place of:

    rdflib.URIRef("http://www.example.org/geography/Germany")

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to 25.2 Querying using SPARQL.