# Knowledge and Data: Practical Assignment 2
## Manipulate local and external RDF Knowledge Graphs 

YOUR NAME: STEFANIA DIAMANTE CONTE

YOUR VUNetID: sce760 2739767

*(If you do not provide your name and VUNetID we will not accept your submission).*

### Learning objectives

At the end of this exercise you should be able to perform some simple manipulations of RDF Data using the rdflib library. You should be able to: 

1. Add and retrieve information from a local RDF database
2. Represent RDF data in other formats, such as the .dot format for graph visualisation
3. Retrieve information from an RDF database created from Web Data
4. Query information from the Web with SPARQL

### Practicalities

Follow this Notebook step-by-step. 

Of course, you can do the exercises in any Programming Editor of your liking. 
But you do not have to. Feel free to simply write code in the Notebook. When 
everything is filled in and works, save the Notebook and submit it 
as a Jupyter Notebook, i.e. with an .ipynb extension. Please use as name of the 
Notebook your studentID+Assignment2.ipynb.  

Unlike in courses dedicated to programming, we will not evaluate the style
of your programs. However, we will test them on data other than the data we provide, 
and they should give the correct answers on that test data as well. 

# A. Tasks related to local RDF Knowledge Graphs

The following cells open the file 'example-from-slides.ttl' using the rdflib library. The first Practical Assignment should have taught you that manipulating symbols as strings is a major pain. 

Programming libraries, such as **rdflib**, help you with this mess once and for all, by parsing the files, creating appropriate data structures (Graph()) and providing useful functions (such as serialize(), save() and much more). 
Check the website of rdflib http://rdflib.readthedocs.io/: this library does most of the hard work for you.

In [78]:
# Before starting with the tasks of this assignment, do not forget to install **rdflib** so we can start using it. 
%pip install rdflib

Note: you may need to restart the kernel to use updated packages.


In [79]:
from rdflib import Graph, RDF, Namespace, Literal, URIRef

g = Graph()

EX = Namespace('http://example.com/kad0/')
g.bind('ex',EX)

def serialize_graph():
    # g.serialize() returns a string
    print(g.serialize(format='turtle'))

def save_graph(filename):
    # let rdflib write the file itself; a text-mode file handle can fail because serialize() emits bytes
    g.serialize(destination=filename, format='nt')
        
def load_graph(filename):
    with open(filename, 'r') as f:
        g.parse(f, format='turtle')   

The file 'example-from-slides.ttl' formalises the knowledge base from the slides from Module 1, and a bit more. 

Here is how it looks when you load it into your program and serialise it with rdflib in turtle. 

In [80]:
load_graph('example-from-slides.ttl')
serialize_graph()

@prefix ex1: <http://example.com/kad/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex1:Germany a ex1:EuropeanCountry .

ex1:Netherlands a ex1:Country ;
    ex1:hasCapital ex1:Amsterdam ;
    ex1:hasName "The Netherlands" ;
    ex1:neighbours ex1:Belgium .

ex1:hasCapital rdfs:range ex1:Capital ;
    rdfs:subPropertyOf ex1:containsCity .

ex1:Amsterdam a ex1:Capital .

ex1:Belgium a ex1:Country .

ex1:EuropeanCountry rdfs:subClassOf ex1:Country .

ex1:containsCity rdfs:domain ex1:Country ;
    rdfs:range ex1:City .

ex1:Capital rdfs:subClassOf ex1:City .




Now, we can manipulate the graph very easily, e.g. with the following very simple loop, which prints the predicate(s) that relate a subject to a literal object: 

In [81]:
for s,p,o in g:
    if type(o) is Literal:
        print(p)

http://example.com/kad/hasName


### - Task 1: (1 Point) Add information to an RDF graph

Add triples to the knowledge graph. Make sure that they have the right namespaces. 

Similarly to the triples already present in the file 'example-from-slides.ttl':
- add at least three new countries with their name and capital 
- add at least one triple with the neighbour predicate

Check: http://rdflib.readthedocs.io/en/stable/intro_to_creating_rdf.html

Remember that ```a``` is Turtle shorthand for ```rdf:type```.

In [82]:
ex = Namespace("http://example.com/kad/")
owl = Namespace("http://www.w3.org/2002/07/owl#")
rdf = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")
rdfs = Namespace("http://www.w3.org/2000/01/rdf-schema#")


# add triples here to the graph 'g' (do not forget the namespaces).
g.add((ex.Belgium, rdf.type, ex.Country))
g.add((ex.Belgium, rdf.hasCapital, ex.Brussel))
g.add((ex.Belgium, rdf.hasName, Literal("Belgium")))
g.add((ex.Belgium, rdf.Neighbours, ex.France))

g.add((ex.France, rdf.type, ex.Country))
g.add((ex.France, rdf.hasCapital, ex.Paris))
g.add((ex.France, rdf.hasName, Literal("France")))
g.add((ex.France, rdf.Neighbours, ex.Spain))

g.add((ex.Germany, rdf.type, ex.Country))
g.add((ex.Germany, rdf.hasCapital, ex.Berlin))
g.add((ex.Germany, rdf.hasName, Literal("Germany")))
g.add((ex.Germany, rdf.Neighbours, ex.France))


serialize_graph()

@prefix ex1: <http://example.com/kad/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex1:Germany a ex1:Country,
        ex1:EuropeanCountry ;
    rdf:Neighbours ex1:France ;
    rdf:hasCapital ex1:Berlin ;
    rdf:hasName "Germany" .

ex1:Netherlands a ex1:Country ;
    ex1:hasCapital ex1:Amsterdam ;
    ex1:hasName "The Netherlands" ;
    ex1:neighbours ex1:Belgium .

ex1:hasCapital rdfs:range ex1:Capital ;
    rdfs:subPropertyOf ex1:containsCity .

ex1:Amsterdam a ex1:Capital .

ex1:Belgium a ex1:Country ;
    rdf:Neighbours ex1:France ;
    rdf:hasCapital ex1:Brussel ;
    rdf:hasName "Belgium" .

ex1:EuropeanCountry rdfs:subClassOf ex1:Country .

ex1:containsCity rdfs:domain ex1:Country ;
    rdfs:range ex1:City .

ex1:Capital rdfs:subClassOf ex1:City .

ex1:France a ex1:Country ;
    rdf:Neighbours ex1:Spain ;
    rdf:hasCapital ex1:Paris ;
    rdf:hasName "France" .




*After you have run the previous code (adding triples), the next cells will be executed on your extended graph. That is ok.*

### - Task 2a: (1 Point) Get structured information from an RDF graph (all Literals)

Use the functions available in the RDFLib library. Write a small function to print all Literals. 

Hint: there is a function in rdflib to test the type of an object (check previous examples in this notebook)

In [83]:
def print_literals(graph):
    for s, p, o in graph:
        if isinstance(o, Literal):  # keep only literal objects
            print(o)

print_literals(g)

France
Belgium
Germany
The Netherlands


### - Task 2b: (1 Point) Get structured information from an RDF graph (all unique Predicates)

Please provide another function that gives a **unique** list of the predicates, ordered by occurrence (most occurring first). The answer will look similar to this: 
<br>http://www.w3.org/2000/01/rdf-schema#label
<br>http://www.w3.org/1999/02/22-rdf-syntax-ns#type
<br>http://example.com/sw2016/locatedIn
<br>http://www.w3.org/2000/01/rdf-schema#range

In [84]:

from collections import Counter

def print_unique_predicates(graph):
    # count how often each predicate occurs in the graph
    counts = Counter(p for s, p, o in graph)
    # most_common() yields every predicate once, most frequent first
    for predicate, _ in counts.most_common():
        print(predicate)

print_unique_predicates(g)


http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/1999/02/22-rdf-syntax-ns#hasName
http://www.w3.org/1999/02/22-rdf-syntax-ns#hasCapital
http://www.w3.org/1999/02/22-rdf-syntax-ns#Neighbours
http://www.w3.org/2000/01/rdf-schema#subClassOf
http://www.w3.org/2000/01/rdf-schema#range
http://www.w3.org/2000/01/rdf-schema#subPropertyOf
http://example.com/kad/neighbours
http://www.w3.org/2000/01/rdf-schema#domain
http://example.com/kad/hasName
http://example.com/kad/hasCapital


# B. Tasks related to Graph visualisations 

### - Task 3a: (2 Points) From RDF to .dot 


In the lecture, we have seen two ways of writing a knowledge graph (simple n-triples, and simple turtle). Let us consider a 3rd syntax, this time a syntax that is useful for visualisation. One standard for visualising graphs is the .dot format.

Print the knowledge graph in .dot file format. Check https://graphviz.gitlab.io/documentation/ and https://graphviz.readthedocs.io/en/stable/ for the documentation. You will only need very little of this information, and the most relevant information can be found in the examples that are given. 

<br>Basically, an RDF graph in .dot format starts with 
<br>digraph G { 
<br>followed by a list of links of the form 
<br>s -> o [label="p"]
<br>for every (s p o) in the KG, separated by ;
<br>Do not forget to end with a closing bracket: }

An example is 
     
     digraph G { s1 -> o1 [label="p1"] ; s2 -> o2 [label="p2"] } 
     
for an RDF graph {(s1 p1 o1),(s2 p2 o2)}
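
As a minimal sketch of this format, the snippet below builds the .dot text by hand from a list of plain `(s, p, o)` tuples. The helper name `triples_to_dot` and the sample triples are illustrative assumptions; node names are quoted so that names containing spaces stay valid.

```python
def triples_to_dot(triples):
    """Return a .dot digraph with one labelled edge per (s, p, o) triple."""
    edges = ['"{}" -> "{}" [label="{}"]'.format(s, o, p) for s, p, o in triples]
    return "digraph G { " + " ; ".join(edges) + " }"

# Hypothetical sample triples, mirroring the example above
sample = [("s1", "p1", "o1"), ("s2", "p2", "o2")]
print(triples_to_dot(sample))
```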

In [85]:
# install and import the graphviz library
%pip install graphviz
import graphviz

Note: you may need to restart the kernel to use updated packages.


First, create an auxiliary function which strips the namespaces from URIs. This is necessary to make the node names readable when visualizing the .dot graph.

In [86]:
def strip(string):
    if '#' in string:
        string = string.split('#')[-1]
    else: 
        string = string.split('/')[-1]
    return string 



Next, convert your graph to the .dot format

In [87]:
dot = graphviz.Digraph(strict=True, graph_attr={"dpi":"52"}, node_attr={"shape":"box"}) # adjust dpi to scale graph

for s,p,o in g:
    # dot.node(strip(s))
    # dot.node(strip(p))
    # dot.node(strip(o))
    # specify edges 
    dot.edge(strip(s),strip(o), label=str(strip(p)))
    


View the end result as .dot syntax and as a graph:

In [None]:
print(dot.source)
dot.render(directory='Graph', view=True)  # writes the file and opens it in the default viewer


strict digraph {
	graph [dpi=52]
	node [shape=box]
	EuropeanCountry -> Country [label=subClassOf]
	hasCapital -> containsCity [label=subPropertyOf]
	Germany -> EuropeanCountry [label=type]
	France -> Country [label=type]
	Netherlands -> Country [label=type]
	France -> France [label=hasName]
	Belgium -> Brussel [label=hasCapital]
	Belgium -> France [label=Neighbours]
	Germany -> France [label=Neighbours]
	hasCapital -> Capital [label=range]
	Amsterdam -> Capital [label=type]
	containsCity -> City [label=range]
	Belgium -> Belgium [label=hasName]
	Capital -> City [label=subClassOf]
	Netherlands -> Belgium [label=neighbours]
	France -> Paris [label=hasCapital]
	Germany -> Berlin [label=hasCapital]
	Belgium -> Country [label=type]
	Germany -> Germany [label=hasName]
	containsCity -> Country [label=domain]
	Netherlands -> "The Netherlands" [label=hasName]
	Netherlands -> Amsterdam [label=hasCapital]
	France -> Spain [label=Neighbours]
	Germany -> Country [label=type]
}



'Graph\\Digraph.gv.pdf'

### - Task 3b: (1 Point) From RDF to .dot with "semantic information"

There is a conceptual distinction between properties, instances and classes (sets of instances). A simple way of checking is the following:

1. in a triple (s a o), with predicate a (which is a special abbreviation for the predicate rdf:type), the s is an Instance, and o is a Class. 
2. in a triple (s rdfs:subClassOf o) both s and o are Classes. 
3. in a triple (p rdfs:domain o) p is a Property and o is a Class. 
4. in a triple (p rdfs:range o)  p is a Property and o is a Class. 
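
The four rules above can be sketched on plain tuples, without rdflib. The function name `classify_nodes` and the prefixed string constants are assumptions made for illustration; the sample triples are taken from the example graph.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"

def classify_nodes(triples):
    """Sort terms into classes, instances and properties using the four rules."""
    classes, instances, properties = set(), set(), set()
    for s, p, o in triples:
        if p == RDF_TYPE:            # rule 1: s is an instance, o is a class
            instances.add(s)
            classes.add(o)
        elif p == SUBCLASS_OF:       # rule 2: s and o are both classes
            classes.update((s, o))
        elif p in (DOMAIN, RANGE):   # rules 3 and 4: s is a property, o is a class
            properties.add(s)
            classes.add(o)
    return classes, instances, properties

sample = [
    ("ex:Amsterdam", RDF_TYPE, "ex:Capital"),
    ("ex:Capital", SUBCLASS_OF, "ex:City"),
    ("ex:containsCity", DOMAIN, "ex:Country"),
    ("ex:containsCity", RANGE, "ex:City"),
]
classes, instances, properties = classify_nodes(sample)
print(sorted(classes), sorted(instances), sorted(properties))
```

Once each term has a category, choosing a node color becomes a simple lookup.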

Update the .dot representation of your RDF graph so that it distinguishes between types of links (RDF vocabulary vs others) and types of nodes (Classes versus Entities versus Literals) via different colors. Hint: you can use the 'color' attribute in the ```node``` and ```edge``` function.

Check how your graph looks once finished.

In [89]:
print("digraph G {")
for s, p, o in g:
    s, p, o = strip(s), strip(p), strip(o)

    color = "black"  # default: edges that are not RDF(S) vocabulary
    if p == "type":
        # (s a o): s is an instance, o is a class
        color = "red"
        print('"' + s + '"' + "[color=brown]")
        print('"' + o + '"' + "[color=yellow]")
    elif p == "subClassOf":
        # both s and o are classes
        color = "purple"
        print('"' + s + '"' + "[color=yellow]")
        print('"' + o + '"' + "[color=yellow]")
    elif p == "domain":
        # s is a property, o is a class
        color = "yellow"
        print('"' + s + '"' + "[color=orange]")
        print('"' + o + '"' + "[color=yellow]")
    elif p == "range":
        # s is a property, o is a class
        color = "gray"
        print('"' + s + '"' + "[color=orange]")
        print('"' + o + '"' + "[color=yellow]")
    # quote node names so that names containing spaces remain valid .dot
    print('"{}" -> "{}" [label="{}", color={}];'.format(s, o, p, color))
print("}")

digraph G {
"EuropeanCountry"[color=yellow]
"Country"[color=yellow]
"EuropeanCountry" -> "Country" [label="subClassOf", color=purple];
"hasCapital" -> "containsCity" [label="subPropertyOf", color=black];
"Germany"[color=brown]
"EuropeanCountry"[color=yellow]
"Germany" -> "EuropeanCountry" [label="type", color=red];
"France"[color=brown]
"Country"[color=yellow]
"France" -> "Country" [label="type", color=red];
"Netherlands"[color=brown]
"Country"[color=yellow]
"Netherlands" -> "Country" [label="type", color=red];
"France" -> "France" [label="hasName", color=black];
"Belgium" -> "Brussel" [label="hasCapital", color=black];
"Belgium" -> "France" [label="Neighbours", color=black];
"Germany" -> "France" [label="Neighbours", color=black];
"hasCapital"[color=orange]
"Capital"[color=yellow]
"hasCapital" -> "Capital" [label="range", color=gray];
"Amsterdam"[color=brown]
"Capital"[color=yellow]
"Amsterdam" -> "Capital" [label="type", color=red];
"containsCity"[color=orange]
"City"[color=yellow]
"containsCity" -> "City" [label="range", color=gray];
"Belgium" -> "Belgium" [label="hasName", color=black];
"Capital"[color=yellow]
"City"[color=yellow

### - Task 4: (1 Point) Deriving implicit knowledge (a bit of schema)

We will look into Schema information in the later modules, but let us already try to find some implicit information with a first bit of inferencing: whenever there are two statements (s a o) and (o rdfs:subClassOf o2) we can derive (and later prove) that (s a o2). 

Write a procedure that adds all implied triples to our knowledge graph. 

In [91]:
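from rdflib import RDFS

# collect (instance, class) and (subclass, superclass) pairs up front,
# so the graph is not modified while it is being iterated
s_o = [(s, o) for s, p, o in g if p == RDF.type]
sub_sup = [(s, o) for s, p, o in g if p == RDFS.subClassOf]
derived = []

for instance, cls in s_o:
    for sub, sup in sub_sup:
        if cls == sub:
            # (instance a sub) and (sub subClassOf sup) imply (instance a sup)
            g.add((instance, RDF.type, sup))
            derived.append((instance, RDF.type, sup))

print(s_o)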

[(rdflib.term.URIRef('http://example.com/kad/Germany'), rdflib.term.URIRef('http://example.com/kad/EuropeanCountry')), (rdflib.term.URIRef('http://example.com/kad/France'), rdflib.term.URIRef('http://example.com/kad/Country')), (rdflib.term.URIRef('http://example.com/kad/Netherlands'), rdflib.term.URIRef('http://example.com/kad/Country')), (rdflib.term.URIRef('http://example.com/kad/Amsterdam'), rdflib.term.URIRef('http://example.com/kad/Capital')), (rdflib.term.URIRef('http://example.com/kad/Belgium'), rdflib.term.URIRef('http://example.com/kad/Country')), (rdflib.term.URIRef('http://example.com/kad/Amsterdam'), rdflib.term.URIRef('http://example.com/kad/City')), (rdflib.term.URIRef('http://example.com/kad/Germany'), rdflib.term.URIRef('http://example.com/kad/Country'))]
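
When subclass relations form a chain, the rule may have to be applied more than once. The following is a minimal sketch on plain tuples rather than on an rdflib graph; the function name `infer_types`, the string constants and the class `ex:Place` are assumptions made for illustration. It repeats the rule until no new triple is derived.

```python
TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def infer_types(triples):
    """Add (s a o2) for every (s a o) and (o subClassOf o2), until nothing changes."""
    closed = set(triples)
    while True:
        typed = [(s, o) for s, p, o in closed if p == TYPE]
        subclass = [(s, o) for s, p, o in closed if p == SUBCLASS_OF]
        new = {(inst, TYPE, sup)
               for inst, cls in typed
               for sub, sup in subclass
               if cls == sub} - closed
        if not new:
            return closed
        closed |= new

sample = {
    ("ex:Amsterdam", TYPE, "ex:Capital"),
    ("ex:Capital", SUBCLASS_OF, "ex:City"),
    ("ex:City", SUBCLASS_OF, "ex:Place"),  # hypothetical extra class
}
result = infer_types(sample)
print(sorted(result - sample))
```

The second derived triple only appears because the loop runs again after the first one has been added.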


# C. Tasks related to local copies of external RDF Datasets using SPARQL

Until now, we have manipulated local knowledge graphs, but as we claimed in the lectures, the advantage of knowledge graphs is that they can easily be linked with other datasets on the Web. 

In the remaining 3 tasks, we will manipulate data from the Web, and ask complex queries over this web data. 

In the first task, we will access web data, make a local copy of it, and then query it. In the other two tasks, we will query live data directly from web Knowledge Graphs (in this case, the SPARQL endpoint of DBpedia). 

### - Task 5: (1 Point) Show and manipulate data about RDF resources on the Web 

With rdflib we can easily load a local graph, but we can just as well retrieve a graph from the Web. Here, we will do so using the *requests* library, which allows us to send a request to any server and/or SPARQL endpoint and to capture the response. The following snippet does so for the resource Netherlands from DBpedia: it uses the 'DESCRIBE' keyword to retrieve all triples about The Netherlands and then loads them into an rdflib Graph object. Note that, in the next assignment, we will learn a more high-level approach that hides most of the raw request details.

In [92]:
# install the library
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [93]:
import requests

endpoint = "https://dbpedia.org/sparql"
query = 'DESCRIBE <http://dbpedia.org/resource/Netherlands>'

payload = {'query':query, 'format':'text/turtle'}
response = requests.post(endpoint, data = payload)

g = Graph()
g.parse(data=response.text, format='ttl')

Now do the same for Belgium

In [None]:
query = 'DESCRIBE <http://dbpedia.org/resource/Belgium>'

payload = {'query':query, 'format':'text/turtle'}
response = requests.post(endpoint, data = payload)

g.parse(data=response.text, format='ttl')  # parsing into the existing graph merges the new triples into it

<Graph identifier=Na0e3cfefb5a44c69bc48cb8498023ce4 (<class 'rdflib.graph.Graph'>)>

Let us start by showing diverse bits of information about The Netherlands and Belgium in DBpedia. It is very similar to task 1, but now with Web Data. 

First, query the graph g (now containing the DBpedia information about both countries) and check which motorways cross both countries.

In [None]:
qres = g.query(
   """
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?s
        WHERE {
            ?s dbo:country dbr:Netherlands .
            ?s dbo:country dbr:Belgium .
        }
        LIMIT 10
       """)
for row in qres:
    print("%s" % row)


Write a query to check whether you can find someone who was born in The Netherlands and died in Belgium. You need to look at the data to know which property you should check for. 

To get an intuition of what is in the knowledge graph, you might want to look at the human-readable rendering at: http://dbpedia.org/resource/Netherlands

In [None]:
qres = g.query(
   """
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?s
        WHERE {
            ?s dbo:birthPlace dbr:Netherlands .
            ?s dbo:deathPlace dbr:Belgium .
        }
        LIMIT 10
       """)
for row in qres:
    print("%s" % row)

### - Task 6: (2 Points) Ask SPARQL against live data using Yasgui

Yasgui (https://yasgui.triply.cc) is a nice graphical interface for asking queries.

Run a new query against http://dbpedia.org/sparql that does the following:

- Find all languages spoken in countries that are not official languages of that country
- The query should return two columns: the country, and the number of languages.
- Order the countries by the number of unofficial languages, from high to low.

In [None]:
'''
SPARQL Query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country (COUNT(DISTINCT ?language) AS ?unofficialLanguages)
WHERE {
    ?country rdf:type dbo:Country .
    ?language dbo:spokenIn ?country .
    # keep only languages that are not listed as an official language of this country
    FILTER NOT EXISTS { ?country dbo:officialLanguage ?language }
}
GROUP BY ?country
ORDER BY DESC(?unofficialLanguages)
'''