# Knowledge and Data 2018: Second practical assignment 
## Manipulate local and external RDF Data 

Your name: Laura Went

Your VUNetID: lwt480 (studentNr.: 2570654)

If you do not provide your name and VUNetID we will not accept your submission. 

### Learning objectives

At the end of this exercise you should be able to perform some simple manipulations of RDF Data using the rdflib library. You should be able to:

1. add and retrieve information from a local RDF database
2. represent RDF data in other formats, such as the .dot format for graph visualisation
3. retrieve information from an RDF database created from Web Data
4. query information from the Web with SPARQL


### Practicalities

Follow this Notebook step-by-step. 

Of course, you can do the exercises in any programming editor of your liking, 
but you do not have to. Feel free to simply write code in the Notebook. When 
everything is filled in and works, save the Notebook and submit it 
as a Jupyter Notebook, i.e. with an ipynb extension. Please use 
your studentID+Assignment2.ipynb as the name of the Notebook.  

Unlike in courses dedicated to programming, we will not evaluate the style
of the programs. But we will test your programs on data other than what we provide, 
and your program should give the correct answers for that test data as well. 

## Exercises related to local RDF

This first cell will open a file 'example-from-slides.ttl' using the rdflib library. The first Practical Assignment should have taught you that manipulating symbols as strings is a major pain. Programming libraries, such as rdflib, help you with this mess once and for all, by parsing the files, creating appropriate data structures (Graph()) and providing useful functions (such as serialize(), save() and much more). Check the website of rdflib, http://rdflib.readthedocs.io/: this library does most of the hard work for you. 

In [3]:
from rdflib import Graph, RDF, Namespace, Literal, URIRef

g = Graph()

EX = Namespace('http://example.com/kad2017/')
g.bind('ex',EX)

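def serialize():
    # g.serialize() returns a byte string (b'...') in older rdflib versions
    # and a regular python3 string in newer ones; decode only when needed
    data = g.serialize(format='turtle')
    print(data.decode("utf-8") if isinstance(data, bytes) else data)

def save(filename):
    # passing the filename lets rdflib open the file in the right mode itself
    g.serialize(destination=filename, format='turtle')
        
def load(filename):
    # parse() reads the file and adds its triples to g
    g.parse(filename, format='turtle')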

The file 'example-from-slides.ttl' formalises the knowledge base from the slides from Module 1, and a bit more. 

Here is how it looks when you load it into your program and serialise it with rdflib in turtle. 

In [4]:
load('example-from-slides.ttl')
serialize()

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex: <http://example.com/kad2017/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Germany a ex:EuropeanCountry .

ex:Netherlands a ex:Country ;
    ex:has_Capital ex:Amsterdam ;
    ex:has_Name "TheNetherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

ex:Amsterdam a ex:Capital .

ex:Belgium a ex:Country .

ex:EuropeanCountry rdfs:subClassOf ex:Country .

ex:containsCity rdfs:domain ex:Country ;
    rdfs:range ex:City .

ex:Capital rdfs:subClassOf ex:City .




Now we can manipulate the graph very easily, e.g. as in the following very simple loop: 

In [5]:
for s,p,o in g:
    if type(o) == Literal:
        print(p)

http://example.com/kad2017/has_Name


### Task 1a) (1 Point) Add information to an RDF graph

Add 10 triples to the knowledge graph. Make sure that they have the right namespaces: 

http://rdflib.readthedocs.io/en/stable/intro_to_creating_rdf.html

Add at least a new country and its capital, as well as at least one triple with a new predicate. 

In [67]:
from rdflib import Graph, RDF, RDFS, Namespace, Literal, URIRef
g = Graph()
EX = Namespace('http://example.com/kad2017/')
g.bind('ex',EX)

country = URIRef("http://example.com/kad2017/Country")
europeanCountry = URIRef("http://example.com/kad2017/EuropeanCountry")
CzechRepublic = EX["CzechRepublic"] 

g.add( (CzechRepublic, RDF.type, country) )
g.add( (CzechRepublic, EX["has_Name"], Literal("CzechRepublic")) )
g.add( (country, RDFS.subClassOf, europeanCountry) )
g.add( (CzechRepublic, EX["same_As"], EX["CZ"]) )
g.add( (CzechRepublic, RDF.type, EX["Country"]) )

g.add( (EX["Prague"], RDF.type, EX["City"]) )
g.add( (EX["Prague"], EX["has_Name"], Literal("Prague")) )
g.add( (EX["CZ"], EX["has_Capital"], EX["Praque"]) )
g.add( (EX["Prague"], EX["capital_Of"], EX["CzechRepublic"]) )

g.add( (EX["Czech"], EX["language_Of"], EX["CzechRepublic"]))


print(g.serialize(format='turtle').decode("utf-8"))




@prefix ex: <http://example.com/kad2017/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Czech ex:language_Of ex:CzechRepublic .

ex:Prague a ex:City ;
    ex:capital_Of ex:CzechRepublic ;
    ex:has_Name "Prague" .

ex:CZ ex:has_Capital ex:Praque .

ex:Country rdfs:subClassOf ex:EuropeanCountry .

ex:CzechRepublic a ex:Country ;
    ex:has_Name "CzechRepublic" ;
    ex:same_As ex:CZ .




After you have run this code (adding triples), the next cells will be executed on your extended graph. That is OK. 

### Task 1b) (1 Point) Get structured information from an RDF graph  

Use the functions available in the RDFLib library. Write a small function to print all Literals. Hint: there is a function in rdflib to test the type of an object. 

In [68]:
for s,p,o in g:
    if isinstance(o, Literal):
        print(o)

CzechRepublic
Prague


Please provide another function that gives a unique list of the predicates, ordered by number of occurrences (most frequent first). The answer will look like this: 
<br>http://www.w3.org/2000/01/rdf-schema#label
<br>http://www.w3.org/1999/02/22-rdf-syntax-ns#type
<br>http://example.com/sw2016/locatedIn
<br>http://www.w3.org/2000/01/rdf-schema#range

In [69]:
predicateCounts = {}
for s,p,o in g:
    if p in predicateCounts:
        predicateCounts[p] += 1
    else:
        predicateCounts[p] = 1
predicateCountsSorted = sorted(predicateCounts.items(), key=lambda x: x[1])
predicateCountsSorted.reverse()

for x,y in predicateCountsSorted:
    print(str(y) + " times:  ", end='')
    print(x)

2 times:  http://example.com/kad2017/has_Name
2 times:  http://www.w3.org/1999/02/22-rdf-syntax-ns#type
1 times:  http://www.w3.org/2000/01/rdf-schema#subClassOf
1 times:  http://example.com/kad2017/has_Capital
1 times:  http://example.com/kad2017/same_As
1 times:  http://example.com/kad2017/capital_Of
1 times:  http://example.com/kad2017/language_Of
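
The same ranking can also be sketched with `collections.Counter`. This is only an illustration, not part of the submitted answer: the triples are plain tuples of strings (a hypothetical stand-in for the rdflib graph, so the snippet runs without rdflib), and `predicates_by_frequency` is a made-up helper name. With rdflib, the same function could be handed the graph `g` directly, since iterating over it yields (s, p, o) triples.

```python
from collections import Counter

# Hypothetical stand-in for the rdflib graph: plain (s, p, o) string tuples.
triples = [
    ("ex:Prague", "rdf:type", "ex:City"),
    ("ex:CzechRepublic", "rdf:type", "ex:Country"),
    ("ex:CzechRepublic", "rdf:type", "ex:EuropeanCountry"),
    ("ex:Prague", "ex:has_Name", "Prague"),
    ("ex:CzechRepublic", "ex:has_Name", "CzechRepublic"),
    ("ex:Prague", "ex:capital_Of", "ex:CzechRepublic"),
]

def predicates_by_frequency(triples):
    """Return the distinct predicates, most frequent first."""
    counts = Counter(p for _, p, _ in triples)
    # most_common() orders the (predicate, count) pairs by descending count
    return [p for p, _ in counts.most_common()]

ranked = predicates_by_frequency(triples)
for p in ranked:
    print(p)
```

Because the function returns only the predicates, the printed list has the shape asked for in the task (one predicate per line, no counts).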


### Task 2) An RDF syntax for graph visualisation 

### Task 2a) (1 Point) From RDF to .dot 


In the lecture, we have seen two ways of writing a knowledge graph (simple n-triples, and simple turtle). Let us consider a 3rd syntax, this time a syntax that is useful for visualisation. One standard for visualising graphs is the .dot format.

Print the knowledge graph in .dot file format. Check https://graphviz.gitlab.io/documentation/ for the documentation. You will only need very little of this information, and the most relevant information can be found in the examples that are given. 

<br>Basically, an RDF graph in .dot format starts with 
<br>digraph G { 
    and then a list of links of the following form 
<br>s -> o [label="p"]
    for every (s p o) in the KG (separated by ;).
<br>Do not forget to end with a closing bracket: }

An example is 
     
     digraph G { s1 -> o1 [label="p1"] ; s2 -> o2 [label="p2"] } 
     
for an RDF graph {(s1 p1 o1),(s2 p2 o2)}

In [70]:
def shorten(uri):
    if isinstance(uri, Literal):
        return uri

    components = g.namespace_manager.compute_qname(uri)
    return '"%s:%s"'%(components[0], components[2])
def rdfDot():
    printGraph = "digraph G {"
    for s,p,o in g:
        printGraph += "\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
    printGraph = printGraph[:-1] + "\n}" 
    print(printGraph)
        
rdfDot()

digraph G {
"ex:Czech" -> "ex:CzechRepublic" [label="ex:language_Of"] ;
"ex:Prague" -> "ex:CzechRepublic" [label="ex:capital_Of"] ;
"ex:CzechRepublic" -> "ex:Country" [label="rdf:type"] ;
"ex:CzechRepublic" -> CzechRepublic [label="ex:has_Name"] ;
"ex:CzechRepublic" -> "ex:CZ" [label="ex:same_As"] ;
"ex:CZ" -> "ex:Praque" [label="ex:has_Capital"] ;
"ex:Prague" -> Prague [label="ex:has_Name"] ;
"ex:Country" -> "ex:EuropeanCountry" [label="rdfs:subClassOf"] ;
"ex:Prague" -> "ex:City" [label="rdf:type"] 
}


You can check what your graph looks like. Just copy&paste your output into, e.g.,  http://www.webgraphviz.com/
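
One thing to watch when writing .dot by hand is quoting: a node name or label that contains a space or a colon has to be wrapped in double quotes. Below is a minimal sketch of a writer that quotes every node and label. It works on plain tuples of strings rather than on an rdflib graph, and the helper names `dot_quote` and `triples_to_dot` as well as the sample data are assumptions made for this illustration.

```python
def dot_quote(text):
    """Wrap text in double quotes, escaping any embedded double quote."""
    return '"' + str(text).replace('"', '\\"') + '"'

def triples_to_dot(triples):
    """Build a .dot digraph with one labelled edge per (s, p, o) triple."""
    lines = ["digraph G {"]
    for s, p, o in triples:
        lines.append("%s -> %s [label=%s] ;" % (dot_quote(s), dot_quote(o), dot_quote(p)))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical sample data; with rdflib one would pass the shortened
# names of every triple in g instead.
sample = [
    ("ex:Netherlands", "ex:has_Name", "The Netherlands"),
    ("ex:Netherlands", "ex:has_Capital", "ex:Amsterdam"),
]
dot_text = triples_to_dot(sample)
print(dot_text)
```

Collecting the lines in a list and joining them at the end avoids having to cut off a trailing separator afterwards; a semicolon after the last edge is accepted by the .dot grammar.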

### Task 2b) (1 Point) From RDF to .dot with "semantic information"

There is a conceptual distinction between properties, instances and classes (sets of instances). A simple way of checking is the following:

1. in a triple (s a o), with predicate a (which is a special abbreviation for the predicate rdf:type), the s is an Instance, and o is a Class. 
2. in a triple (s rdfs:subClassOf o) both s and o are Classes. 
3. in a triple (p rdfs:domain o) p is a Property and o is a Class. 
4. in a triple (p rdfs:range o)  p is a Property and o is a Class. 

Make a .dot representation for an RDF graph that distinguishes between types of links (RDF vocabulary vs others) and types of nodes (Classes versus Instances) via different colors. 
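
The four rules above can be sketched as a first pass over the triples that collects classes, instances and properties, followed by a colour lookup per node. This is a sketch under assumptions: triples are plain tuples of prefixed names (not rdflib terms), the colour palette is arbitrary, and `classify` and `node_colour` are hypothetical helper names.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"

def classify(triples):
    """Collect classes, instances and properties using the four rules."""
    classes, instances, properties = set(), set(), set()
    for s, p, o in triples:
        if p == RDF_TYPE:              # rule 1: s is an instance, o a class
            instances.add(s)
            classes.add(o)
        elif p == SUBCLASS:            # rule 2: both s and o are classes
            classes.update((s, o))
        elif p in (DOMAIN, RANGE):     # rules 3 and 4: s is a property, o a class
            properties.add(s)
            classes.add(o)
    return classes, instances, properties

def node_colour(node, classes, instances, properties):
    """Pick a colour; a node found in several sets gets the first match."""
    if node in classes:
        return "green"
    if node in instances:
        return "red"
    if node in properties:
        return "yellow"
    return "black"  # no rule applied, so the role is unknown

# Hypothetical sample data in the spirit of the example knowledge base.
sample = [
    ("ex:Amsterdam", "rdf:type", "ex:Capital"),
    ("ex:Capital", "rdfs:subClassOf", "ex:City"),
    ("ex:containsCity", "rdfs:domain", "ex:Country"),
    ("ex:Netherlands", "ex:neighbours", "ex:Belgium"),
]
classes, instances, properties = classify(sample)
for node in ("ex:Amsterdam", "ex:City", "ex:containsCity", "ex:Belgium"):
    print(node, node_colour(node, classes, instances, properties))
```

Classifying in a separate pass means every node gets one colour based on the whole graph, rather than being recoloured by each triple it appears in.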

In [71]:
# instance red, class green , RDF blue, links(ex) orange, property yellow

def rdfDot2():
    printGraph = "digraph G {"
    for s,p,o in g:
        if shorten(p) == '"rdf:type"':
            printGraph += "\n" + shorten(s) + " [color=red] ;\n" + shorten(o) + " [color=green] ;\n edge [color=blue] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
        elif shorten(p) == '"rdfs:subClassOf"':
            printGraph += "\n" + shorten(s) + " [color=green] ;\n" + shorten(o) + " [color=green] ;\n edge [color=blue] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
        elif shorten(p) == '"rdfs:domain"':
            printGraph += "\n" + shorten(s) + " [color=yellow] ;\n" + shorten(o) + " [color=green] ;\n edge [color=blue] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
        elif shorten(p) == '"rdfs:range"':
            printGraph += "\n" + shorten(s) + " [color=yellow] ;\n" + shorten(o) + " [color=green] ;\n edge [color=blue] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
        else:
            printGraph += "\n" + shorten(s) + " [color=green] ;\n" + shorten(o) + " [color=green] ;\n edge [color=orange] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
    printGraph = printGraph[:-1] + "\n}" 
    print(printGraph)
        
rdfDot2()


digraph G {
"ex:Czech" [color=green] ;
"ex:CzechRepublic" [color=green] ;
 edge [color=orange] ;
"ex:Czech" -> "ex:CzechRepublic" [label="ex:language_Of"] ;
"ex:Prague" [color=green] ;
"ex:CzechRepublic" [color=green] ;
 edge [color=orange] ;
"ex:Prague" -> "ex:CzechRepublic" [label="ex:capital_Of"] ;
"ex:CzechRepublic" [color=red] ;
"ex:Country" [color=green] ;
 edge [color=blue] ;
"ex:CzechRepublic" -> "ex:Country" [label="rdf:type"] ;
"ex:CzechRepublic" [color=green] ;
CzechRepublic [color=green] ;
 edge [color=orange] ;
"ex:CzechRepublic" -> CzechRepublic [label="ex:has_Name"] ;
"ex:CzechRepublic" [color=green] ;
"ex:CZ" [color=green] ;
 edge [color=orange] ;
"ex:CzechRepublic" -> "ex:CZ" [label="ex:same_As"] ;
"ex:CZ" [color=green] ;
"ex:Praque" [color=green] ;
 edge [color=orange] ;
"ex:CZ" -> "ex:Praque" [label="ex:has_Capital"] ;
"ex:Prague" [color=green] ;
Prague [color=green] ;
 edge [color=orange] ;
"ex:Prague" -> Prague [label="ex:has_Name"] ;
"ex:Country" [color=green] ;
"

### Task 3) (2 Points) Visualising implicit knowledge (a bit of schema)

We will look into schema information in later modules, but let us already try to find some implicit information with a first bit of inferencing: whenever there are two statements (s a o) and (o rdfs:subClassOf o2) we can derive (and later prove) that (s a o2). 

Write a procedure that adds all implied triples for our knowledge base. 

Produce a .dot version of the graph that includes those implied triples, and mark their edges with a different color or arrow style. 
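
The derivation rule can be sketched as a loop that keeps applying it until nothing new appears, so that chains of subclasses are followed as well. This is a minimal sketch, assuming triples are plain tuples of prefixed names; `infer_types` and `to_dot` are hypothetical helper names, and the class `ex:Place` is invented for the example.

```python
def infer_types(triples):
    """Return the rdf:type triples implied by rdfs:subClassOf but not stated."""
    known = set(triples)
    inferred = set()
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        current = known | inferred
        subclass_of = [(s, o) for s, p, o in current if p == "rdfs:subClassOf"]
        for s, p, o in current:
            if p != "rdf:type":
                continue
            for sub, sup in subclass_of:
                if o == sub and (s, "rdf:type", sup) not in current:
                    inferred.add((s, "rdf:type", sup))
                    changed = True
    return inferred

def to_dot(stated, inferred):
    """Render stated edges plainly and inferred edges dashed and cyan."""
    lines = ["digraph G {"]
    for s, p, o in stated:
        lines.append('"%s" -> "%s" [label="%s"] ;' % (s, o, p))
    for s, p, o in sorted(inferred):
        lines.append('"%s" -> "%s" [label="%s", color=cyan, style=dashed] ;' % (s, o, p))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical sample data; ex:Place is made up to show a two-step chain.
stated = [
    ("ex:Germany", "rdf:type", "ex:EuropeanCountry"),
    ("ex:EuropeanCountry", "rdfs:subClassOf", "ex:Country"),
    ("ex:Country", "rdfs:subClassOf", "ex:Place"),
]
implied = infer_types(stated)
dot_text = to_dot(stated, implied)
print(dot_text)
```

Setting `color` and `style` on each inferred edge, instead of changing the default with an `edge [...]` statement, keeps the styling from leaking onto edges that are written later in the file.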

In [76]:
def implicit():
    typeDict = {}
    extra = ""
    for s,p,o in g:
        if shorten(p) == '"rdf:type"':
            typeDict[shorten(s)] = shorten(o)
    for s,p,o in g:
        if shorten(p) == '"rdfs:subClassOf"':
            for key,val in typeDict.items():
                if shorten(s) == val:
                    extra += "\n edge [color=cyan] ;\n" + key + " -> " + shorten(o) + " [label=" + '"rdf:type"' + "] ;"
    printGraph = "digraph G {"
    for s,p,o in g:
        if shorten(p) == '"rdf:type"':
            printGraph += "\n" + shorten(s) + " [color=red] ;\n" + shorten(o) + " [color=green] ;\n edge [color=blue] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
        elif shorten(p) == '"rdfs:subClassOf"':
            printGraph += "\n" + shorten(s) + " [color=green] ;\n" + shorten(o) + " [color=green] ;\n edge [color=blue] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
        elif shorten(p) == '"rdfs:domain"':
            printGraph += "\n" + shorten(s) + " [color=yellow] ;\n" + shorten(o) + " [color=green] ;\n edge [color=blue] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
        elif shorten(p) == '"rdfs:range"':
            printGraph += "\n" + shorten(s) + " [color=yellow] ;\n" + shorten(o) + " [color=green] ;\n edge [color=blue] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
        else:
            printGraph += "\n" + shorten(s) + " [color=green] ;\n" + shorten(o) + " [color=green] ;\n edge [color=orange] ;\n" + shorten(s) + " -> " + shorten(o) + " [label=" + shorten(p) + "] ;" 
    printGraph += extra
    printGraph = printGraph[:-1] + "\n}" 
    print(printGraph)        
    
    
implicit()


digraph G {
"ex:Czech" [color=green] ;
"ex:CzechRepublic" [color=green] ;
 edge [color=orange] ;
"ex:Czech" -> "ex:CzechRepublic" [label="ex:language_Of"] ;
"ex:Prague" [color=green] ;
"ex:CzechRepublic" [color=green] ;
 edge [color=orange] ;
"ex:Prague" -> "ex:CzechRepublic" [label="ex:capital_Of"] ;
"ex:CzechRepublic" [color=red] ;
"ex:Country" [color=green] ;
 edge [color=blue] ;
"ex:CzechRepublic" -> "ex:Country" [label="rdf:type"] ;
"ex:CzechRepublic" [color=green] ;
CzechRepublic [color=green] ;
 edge [color=orange] ;
"ex:CzechRepublic" -> CzechRepublic [label="ex:has_Name"] ;
"ex:CzechRepublic" [color=green] ;
"ex:CZ" [color=green] ;
 edge [color=orange] ;
"ex:CzechRepublic" -> "ex:CZ" [label="ex:same_As"] ;
"ex:CZ" [color=green] ;
"ex:Praque" [color=green] ;
 edge [color=orange] ;
"ex:CZ" -> "ex:Praque" [label="ex:has_Capital"] ;
"ex:Prague" [color=green] ;
Prague [color=green] ;
 edge [color=orange] ;
"ex:Prague" -> Prague [label="ex:has_Name"] ;
"ex:Country" [color=green] ;
"

## Tasks related to local copies of external RDF Datasets using SPARQL

Until now, we have manipulated local knowledge, but as we claimed in the lectures, the advantage of knowledge graphs is that they can easily be linked with other datasets on the web. 

In the remaining 3 tasks, we will manipulate web data and ask complex queries over it. In the first task, we will access web data, make a local copy and then query it. 

In the other two tasks, we will query live data directly from web databases (in this case, the SPARQL endpoint of DBPedia). 

### Task 4) (1 Point) Show and manipulate data about RDF resources on the Web 

With rdflib we can easily load a graph, even if it comes from a source on the Web. The following snippet loads as a graph the information about the resources Amsterdam and Rotterdam from DBPedia.

In [77]:
import rdflib
from rdflib import Literal, URIRef
g=rdflib.Graph()
# parse() accepts a URL as well as a local file name
g.parse('http://dbpedia.org/resource/Amsterdam')
g.parse('http://dbpedia.org/resource/Rotterdam')

Let us start by showing diverse bits of information w.r.t. Amsterdam and Rotterdam in DBPedia. It is a very similar task to Task 1, but now with Web Data. 

First, query the graph g (now containing the DBPedia information about Amsterdam and Rotterdam) and check whether you can find someone who was born in Amsterdam (is dbo:birthPlace of) and died in Rotterdam (is dbo:deathPlace of).

In [79]:
# THIS code is an example to make you get going. It shows how to run a sparql query on g. 
# qres is then a list of results, which you can print. 
#
# You should adapt the code for your purposes.

qres = g.query(
    """PREFIX dbr:<http://dbpedia.org/resource/>
        PREFIX dbo:<http://dbpedia.org/ontology/>
        SELECT ?value
        WHERE {
        ?value dbo:birthPlace dbr:Amsterdam .
        ?value dbo:deathPlace dbr:Rotterdam .
        }
        LIMIT 10
       """)

for row in qres:
    print("%s" % row)

http://dbpedia.org/resource/Haya_van_Someren
http://dbpedia.org/resource/Anthony_Sweijs
http://dbpedia.org/resource/Jan_Stolker
http://dbpedia.org/resource/Jan_van_Beveren


Write a query to check whether there is an album that was recorded in both Rotterdam and Amsterdam. You need to look at the data to know which property you should check for. To get an intuition of what is in the knowledge graph, you might want to look at the human-readable rendering at http://dbpedia.org/resource/Amsterdam

In [80]:
#see above

### Task 5) (2 points) Ask SPARQL against live data using Yasgui

Yasgui (http://yasgui.org/) is a nice graphical interface for asking queries. 
Run a new query against http://dbpedia.org/sparql that does the following:

    Find all languages spoken in countries that are not official languages of that country.
    The query should return two columns: the country, and the number of languages.
    Order the countries by the number of unofficial languages, from high to low.
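
The logic of such a query (remove the official languages, count what is left per country, sort by the count) can be sketched in plain Python. This is only meant to make the three steps concrete and is not a substitute for the SPARQL query. The data is a hypothetical toy example with placeholder names, not real DBPedia facts, and `unofficial_language_counts` is a made-up helper name. Using a set per country corresponds to counting distinct languages.

```python
from collections import defaultdict

# Hypothetical toy data: (country, language) pairs with placeholder names.
spoken = [
    ("ex:A", "ex:lang1"), ("ex:A", "ex:lang2"), ("ex:A", "ex:lang3"),
    ("ex:B", "ex:lang1"), ("ex:B", "ex:lang4"),
    ("ex:C", "ex:lang5"),
]
official = [("ex:A", "ex:lang1"), ("ex:B", "ex:lang1"), ("ex:B", "ex:lang4")]

def unofficial_language_counts(spoken, official):
    """Return (country, number of unofficial languages), highest count first."""
    official_pairs = set(official)
    per_country = defaultdict(set)
    for country, language in spoken:
        if (country, language) not in official_pairs:  # the "minus" step
            per_country[country].add(language)         # grouping per country
    counts = [(c, len(langs)) for c, langs in per_country.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

result = unofficial_language_counts(spoken, official)
for country, n in result:
    print(country, n)
```

Note that a country whose spoken languages are all official (ex:B here) does not appear in the result at all, because no pair survives the subtraction.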

In [85]:
# Add here the SPARQL query (not Python) code. (copy&paste from Yasgui)
# When executing against Yasgui you should get an answer. 
# Don't worry that executing this cell will give an error message. 

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?x (COUNT(?language) AS ?count)
WHERE {
  ?x a dbo:Country ;
     a dbo:Location .
  { ?x dbo:language|dbp:languages ?language } MINUS { ?x dbo:officialLanguage ?language } .
}
GROUP BY ?x
ORDER BY DESC(?count)

SyntaxError: invalid syntax (<ipython-input-85-023ff33efc96>, line 5)

### Task 6) (2 Points) SPARQL 

Write a SPARQL query which returns all relationship(s) between the series "Homeland" and "Claire Danes" (literally).

Use Yasgui to design the correct SPARQL query, and copy and paste it into the cell below. 
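
Since a relationship can point in either direction, both (Homeland p Claire) and (Claire p Homeland) have to be checked. In plain Python the idea looks as follows; this is a sketch on hypothetical tuples with invented predicate names, not real DBPedia data, and `relations_between` is a made-up helper name.

```python
def relations_between(triples, a, b):
    """Return the sorted predicates that link a and b in either direction."""
    return sorted({p for s, p, o in triples if (s, o) in ((a, b), (b, a))})

# Hypothetical sample data; the predicates are invented for illustration.
sample = [
    ("ex:Homeland", "ex:starring", "ex:Claire_Danes"),
    ("ex:Claire_Danes", "ex:knownFor", "ex:Homeland"),
    ("ex:Homeland", "ex:starring", "ex:Someone_Else"),
]
links = relations_between(sample, "ex:Homeland", "ex:Claire_Danes")
print(links)
```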

In [1]:
# Add here the SPARQL query (not Python) code. (copy&paste from Yasgui)
# When executing against Yasgui you should get an answer. 
# Don't worry that executing this cell will give an error message. 

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?p
WHERE {
  ?Homeland foaf:name "Homeland"@en .
  ?Claire foaf:name "Claire Danes"@en .
  { ?Homeland ?p ?Claire } UNION { ?Claire ?p ?Homeland } .
}


SyntaxError: invalid syntax (<ipython-input-1-5ad9a369df70>, line 5)