# Access and query remote data 

 1. Execute SPARQL queries on a local RDF file
 2. Query a remote SPARQL endpoint
 3. Data integration
 

## 1. Execute SPARQL queries on a local RDF file

RDFLib allows us to iterate over triples, and also to run SPARQL queries on local data (e.g. a file):

 * parse the RDF file into an RDFLib graph
 * prepare a SPARQL query as a string
 * use the RDFLib method `query` to query the graph
 * iterate over results of the query
 
Results are organised as a **list of tuples**. Every tuple includes as many values as there are query variables, in the same order as they appear in the `SELECT` clause. 
Values in a tuple can be accessed by position (e.g. `query_res[0]`) or by variable name (e.g. `query_res["class"]`).


In [1]:
#uncomment if colab
#!pip install rdflib 
import rdflib

# create an empty Graph
g = rdflib.ConjunctiveGraph()

# parse an RDF file by specifying the format (a URL here; a local path works the same way)
result = g.parse("https://raw.githubusercontent.com/marilenadaquino/information_visualization/main/2022-2023/resources/artchives.nq", format='nquads')

# perform the query over the graph
query_results = g.query(
    """SELECT ?class (COUNT(?individual) AS ?tot)
       WHERE { ?individual a ?class .}
       GROUP BY ?class""")

# retrieve a list of tuples
for query_res in query_results:
    print(query_res[0], query_res["tot"]) # notice the two alternative ways to recall values in the tuple

http://www.wikidata.org/entity/Q9388534 25
http://www.wikidata.org/entity/Q5 24
http://www.wikidata.org/entity/Q31855 5


## 2. Query a remote SPARQL endpoint

To query an endpoint you have several strategies:

 * **query the endpoint from a GUI**, download the results (e.g. in JSON format) and parse them with Python (time-consuming, but better for big queries)
 * **query the endpoint REST API** with Python, retrieve data in the desired format, and parse it as you like (faster to repeat, but ineffective for big queries, because the endpoint may impose limits)
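
To give an idea of what "querying the REST API" means in practice, here is a minimal sketch that builds (without sending) an HTTP GET request following the SPARQL Protocol: the query goes in the `query` URL parameter and the `Accept` header asks for SPARQL-JSON. The function name `build_sparql_request` and the `User-Agent` string are hypothetical, and we assume the endpoint accepts GET requests of this kind.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request


def build_sparql_request(endpoint, query, user_agent="example-notebook/0.1"):
    """Build (but do not send) a SPARQL Protocol GET request for JSON results."""
    # the query text is URL-encoded into the `query` parameter
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={
        "Accept": "application/sparql-results+json",  # ask for SPARQL-JSON
        "User-Agent": user_agent,                     # hypothetical identifier
    })


rest_query = (
    "SELECT ?class (COUNT(?individual) AS ?tot) "
    "WHERE { ?individual a ?class . } GROUP BY ?class"
)
rest_request = build_sparql_request(
    "http://artchives.fondazionezeri.unibo.it/sparql", rest_query
)

print(rest_request.full_url)
print(rest_request.get_header("Accept"))
```

To actually send the request you could pass `rest_request` to `urllib.request.urlopen` and parse the response body with `json`. Libraries such as SPARQLWrapper take care of these details for you.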


### 2.1 Query the endpoint from a GUI

A SPARQL endpoint usually offers a number of operations that you can perform on results:

 * **display** data in several formats (tabular, json raw response, charts)
 * **download** data as JSON objects, CSV tables, and XML documents.

**Note for your future self: which format should you choose for download?** It depends. JSON is usually better for manipulating results (deduplication, cleaning, enrichment). 

Let's perform the previous example query on the ARTchives endpoint, and have a look at the JSON result of the query (select the tab raw response).

In [None]:
query = """SELECT ?class (COUNT(?individual) AS ?tot)
       WHERE { ?individual a ?class .}
       GROUP BY ?class"""

endpoint = "http://artchives.fondazionezeri.unibo.it/sparql"

results = """
{
  "head" : {
    "vars" : [ "class", "tot" ]
  },
  "results" : {
    "bindings" : [ {
      "class" : {
        "type" : "uri",
        "value" : "http://www.wikidata.org/entity/Q31855"
      },
      "tot" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#integer",
        "type" : "literal",
        "value" : "6"
      }
    }, {
      "class" : {
        "type" : "uri",
        "value" : "http://www.wikidata.org/entity/Q5"
      },
      "tot" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#integer",
        "type" : "literal",
        "value" : "25"
      }
    }, {
      "class" : {
        "type" : "uri",
        "value" : "http://www.wikidata.org/entity/Q9388534"
      },
      "tot" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#integer",
        "type" : "literal",
        "value" : "26"
      }
    } ]
  }
}
"""

The JSON returned by **any SPARQL `SELECT` query** always has the same structure: a dictionary with two keys, `head` and `results`.

<pre>
{
  <span style="color:blue">"head"</span> : {
    "vars" : [ "class", "tot" ]
  },
  <span style="color:blue">"results"</span> : {
    "bindings" : [ {
      "class" : {
...
</pre>

**HEADINGS** 

The value of the key `head` is a dictionary with only one key called `vars`. The value of `vars` is a list including all the query variables (in the same order as in the query). In our case the query variables are `["class", "tot"]`, and these correspond to the column names in the tabular view of results.
 
<pre>
{
  <b>"head"</b> : {
    <span style="color:blue">"vars" : [ "class", "tot" ]</span>
  },
  "results" : {
    "bindings" : [ {
      "class" : {
...
</pre>

**RESULTS GROUPING** 

The value of the key `results` is a dictionary with only one key called `bindings`, whose value is a list of dictionaries. 

<pre>
{
  "head" : {
    "vars" : [ "class", "tot" ]
  },
  <b>"results"</b> : {
    <span style="color:blue">"bindings"</span> : [ 
        {...},
        {...},
        {...}
    ]
  }
}    
</pre>

**SINGLE RESULT** 

Each dictionary in the list corresponds to a row of the tabular results.

Each dictionary/row includes as many dictionaries as there are query variables (in our case: `class` and `tot`). The keys in the dictionary/row are the names of the columns/query variables.

<pre>
{
  "head" : {
    "vars" : [ "class", "tot" ]
  },
  <b>"results"</b> : {
    "bindings" : [ 
        <span style="color:blue">{
            "class": {...},
            "tot": {...}
        }</span>,
        <span style="color:blue">{
            "class": {...},
            "tot": {...}
        }</span>,
        <span style="color:blue">{
            "class": {...},
            "tot": {...}
        }</span>
    ]
  }
}    
</pre>

**PARTS OF A SINGLE RESULT** 

The value of each key/column (`class`, `tot`) corresponds to a cell of the tabular result. For every cell, two or three keys are recorded, depending on the type of value:

 * the `type`, which here is either `uri` or `literal` (blank nodes are reported as `bnode`)

 <pre>
      "class" : {
        <span style="color:blue">"type" : "uri",</span>
        "value" : "http://www.wikidata.org/entity/Q31855"
      },
      "tot" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#integer",
        <span style="color:blue">"type" : "literal",</span>
        "value" : "6"
      }
 </pre>

 * the actual `value` (either the http URI or the string)
 
 <pre>
      "class" : {
        "type" : "uri",
        <span style="color:blue">"value" : "http://www.wikidata.org/entity/Q31855"</span>
      },
      "tot" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#integer",
        "type" : "literal",
        <span style="color:blue">"value" : "6"</span>
      }
 </pre>
 
 * if the value is a typed literal, the `datatype` of the literal:
 
 <pre>
      "class" : {
        "type" : "uri",
        "value" : "http://www.wikidata.org/entity/Q31855"
      },
      "tot" : {
        <span style="color:blue">"datatype" : "http://www.w3.org/2001/XMLSchema#integer",</span>
        "type" : "literal",
        "value" : "6"
      }
</pre>
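
The structure described above can be walked with a few lines of code. The following is a minimal sketch that turns a SPARQL-JSON document into a header and a list of row tuples, using `head.vars` for the column order. The function name `to_table` is hypothetical, and the second row deliberately lacks `tot` only to illustrate that a variable left unbound in a row is simply absent from its dictionary.

```python
import json

# Hand-written sample in SPARQL-JSON shape (second row is illustrative only)
raw_json = """
{
  "head": {"vars": ["class", "tot"]},
  "results": {"bindings": [
    {"class": {"type": "uri", "value": "http://www.wikidata.org/entity/Q31855"},
     "tot": {"datatype": "http://www.w3.org/2001/XMLSchema#integer",
             "type": "literal", "value": "6"}},
    {"class": {"type": "uri", "value": "http://www.wikidata.org/entity/Q5"}}
  ]}
}
"""


def to_table(data):
    """Return (header, rows): column names and one tuple of cell values per row."""
    header = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        # keep the column order of `head.vars`; use None for unbound variables
        row = tuple(
            binding[var]["value"] if var in binding else None
            for var in header
        )
        rows.append(row)
    return header, rows


header, rows = to_table(json.loads(raw_json))
print(header)
for row in rows:
    print(row)
```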




### 2.2 Parse a SPARQL-JSON result

**Download the results** of the query into a file called `sparql_query_result.json`, or access the online version of the same file in our repository.

To parse SPARQL-JSON results we need a couple of Python libraries:

 * use `json` to parse the object into a dictionary
 * if the data are online, use `requests` to access the remote file


In [7]:
import json , pprint

pp = pprint.PrettyPrinter(indent=1) # just to pretty print results

# uncomment if you run the code locally
# with open('../resources/sparql_query_result.json','r') as results:
#    data = json.load(results)  
#    pprint.pprint(data)

# if you run the code in colab
import requests
response = requests.get("https://raw.githubusercontent.com/marilenadaquino/information_visualization/main/2022-2023/resources/sparql_query_result.json")
data = json.loads(response.text)
pprint.pprint(data)

{'head': {'vars': ['class', 'tot']},
 'results': {'bindings': [{'class': {'type': 'uri',
                                     'value': 'http://www.wikidata.org/entity/Q31855'},
                           'tot': {'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
                                   'type': 'literal',
                                   'value': '6'}},
                          {'class': {'type': 'uri',
                                     'value': 'http://www.wikidata.org/entity/Q5'},
                           'tot': {'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
                                   'type': 'literal',
                                   'value': '25'}},
                          {'class': {'type': 'uri',
                                     'value': 'http://www.wikidata.org/entity/Q9388534'},
                           'tot': {'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
                                   'type': 'literal',
       

Get column names.

In [6]:
for column in data["head"]["vars"]: # enter the list
    print(column)

class
tot


Get the results and print the values.

In [8]:
for result in data["results"]["bindings"]:  # enter the list of dictionaries // do you remember "for row in rows"?
    res_class = result['class']['value']    # the value of the cell under column "class"
    res_tot = result['tot']['value']        # the value of the cell under column "tot"
    print('The class', res_class,'has', res_tot, 'individuals')

The class http://www.wikidata.org/entity/Q31855 has 6 individuals
The class http://www.wikidata.org/entity/Q5 has 25 individuals
The class http://www.wikidata.org/entity/Q9388534 has 26 individuals
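
Note that every `value` in SPARQL-JSON is a string, even when the `datatype` says it is an integer. The following sketch (with the three rows above copied inline, and hypothetical variable names) shows why it may be worth casting counts to `int` before sorting them.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# The three rows of the example result, copied inline
sample_bindings = [
    {"class": {"type": "uri", "value": "http://www.wikidata.org/entity/Q31855"},
     "tot": {"datatype": XSD_INTEGER, "type": "literal", "value": "6"}},
    {"class": {"type": "uri", "value": "http://www.wikidata.org/entity/Q5"},
     "tot": {"datatype": XSD_INTEGER, "type": "literal", "value": "25"}},
    {"class": {"type": "uri", "value": "http://www.wikidata.org/entity/Q9388534"},
     "tot": {"datatype": XSD_INTEGER, "type": "literal", "value": "26"}},
]

# Sorting the raw strings compares them character by character
as_text = sorted(b["tot"]["value"] for b in sample_bindings)

# Casting to int first gives the numeric order (here: largest class first)
ranked = sorted(
    ((b["class"]["value"], int(b["tot"]["value"])) for b in sample_bindings),
    key=lambda pair: pair[1],
    reverse=True,
)

print(as_text)
for class_uri, total in ranked:
    print(class_uri, total)
```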


### 2.3 Query a SPARQL endpoint with RDFLib and SPARQLWrapper libraries

Querying data via the GUI of a SPARQL endpoint is not always convenient: 

 * it is highly discouraged when you need to present your work as a one-click application. 
 * in many cases, downloading data locally and parsing them into an RDFLib graph is not possible or convenient, e.g. dumping the entire Wikidata graph would require ~60GB of storage (for the zipped file alone!). 
 * while the online data keep being updated, the local copy easily goes **out of date**.

[SPARQLWrapper](https://sparqlwrapper.readthedocs.io/en/latest/main.html) is a Python library maintained by the RDFLib project that allows you to **query remote SPARQL endpoints** and to get up-to-date results in the same JSON format you'd get by downloading them from the GUI.

To query a remote SPARQL endpoint, follow these steps:

 * (on a Mac, *you may need* to relax certificate verification by using an unverified SSL context before querying an external service)
 * get the URL of the API of the SPARQL endpoint (sometimes it is the same as the URL of the GUI)
 * prepare the SPARQL query 
 * create the wrapper around the SPARQL API (via the SPARQLWrapper library)
 * send the query and get the JSON results

In [12]:
# uncomment if you run on colab
#!pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

# if you have mac and you run this locally, uncomment
#ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query: 10 arbitrary triples
my_SPARQL_query = """
SELECT *
WHERE {?s ?p ?o}
LIMIT 10
"""

# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    print(result["s"]["value"], result["p"]["value"], result["o"]["value"])

http://wikiba.se/ontology#Dump http://creativecommons.org/ns#license http://creativecommons.org/publicdomain/zero/1.0/
http://wikiba.se/ontology#Dump http://schema.org/softwareVersion 1.0.0
http://wikiba.se/ontology#Dump http://schema.org/dateModified 2021-09-24T23:00:01Z
http://wikiba.se/ontology#Dump http://schema.org/dateModified 2021-09-24T23:00:04Z
http://wikiba.se/ontology#Dump http://schema.org/dateModified 2021-09-24T23:00:08Z
http://wikiba.se/ontology#Dump http://schema.org/dateModified 2021-09-24T23:00:12Z
http://wikiba.se/ontology#Dump http://schema.org/dateModified 2021-09-24T23:00:17Z
http://wikiba.se/ontology#Dump http://schema.org/dateModified 2021-09-24T23:00:21Z
http://wikiba.se/ontology#Dump http://schema.org/dateModified 2021-09-24T23:00:26Z
http://wikiba.se/ontology#Dump http://schema.org/dateModified 2021-09-24T23:00:29Z


## 3. Data integration 

### An example: integrate art historians' birth places from Wikidata

In the last tutorial we saw how to add triples about historians' birthplaces (`wdt:P19`). We looked up the URI of Federico Zeri's birthplace in Wikidata and manually added the new triple to the graph. 

Now we can do this operation **systematically**:
*For every art historian in ARTchives that is also in Wikidata, we get their birthplaces and we add this information to our graph*

To do that we need to:

 1. get the list of historians in ARTchives that are also available in Wikidata
 2. prepare a SPARQL query that returns the birthplaces of a list of art historians 
 3. send the query to Wikidata
 4. add a triple to our graph whenever there is a result
 

### 3.1 Get the list of historians

We need to take into account a couple of caveats.
 
 * How do we distinguish historians from other people mentioned in ARTchives? We use the pattern `?collection wdt:P170 ?creator` (`wdt:P170` means "has creator"), which is the only mandatory predicate that distinguishes historians from other people.
 * How do we select only the historians that are both in ARTchives and Wikidata? We match a substring in the URI: if it includes the substring "wikidata.org/entity/", we are sure this is a Wikidata entity.


In [13]:
from rdflib import Namespace , Literal , URIRef
from rdflib.namespace import RDF , RDFS

# define the namespaces we need
wd = Namespace("http://www.wikidata.org/entity/") # remember that a prefix matches a URI up to the last slash (or hash #)
wdt = Namespace("http://www.wikidata.org/prop/direct/")
art = Namespace("https://w3id.org/artchives/")

# Get the list of art historians in our graph "g"
arthistorians_list = set()

# iterate over the triples in the graph
for s,p,o in g.triples(( None, wdt.P170, None)):   # people "o" are the creator "wdt.P170" of a collection "s"
    if "wikidata.org/entity/" in str(o):           # look for the substring to filter wikidata entities only
        arthistorians_list.add('<' + str(o) + '>')     # remember to convert them into strings! 
    
print(arthistorians_list)

{'<http://www.wikidata.org/entity/Q537874>', '<http://www.wikidata.org/entity/Q19997512>', '<http://www.wikidata.org/entity/Q55453618>', '<http://www.wikidata.org/entity/Q1373290>', '<http://www.wikidata.org/entity/Q1629748>', '<http://www.wikidata.org/entity/Q41616785>', '<http://www.wikidata.org/entity/Q1641821>', '<http://www.wikidata.org/entity/Q1271052>', '<http://www.wikidata.org/entity/Q60185>', '<http://www.wikidata.org/entity/Q1296486>', '<http://www.wikidata.org/entity/Q457739>', '<http://www.wikidata.org/entity/Q2824734>', '<http://www.wikidata.org/entity/Q88907>', '<http://www.wikidata.org/entity/Q1715096>', '<http://www.wikidata.org/entity/Q18935222>', '<http://www.wikidata.org/entity/Q995470>', '<http://www.wikidata.org/entity/Q1712683>', '<http://www.wikidata.org/entity/Q3051533>', '<http://www.wikidata.org/entity/Q1089074>', '<http://www.wikidata.org/entity/Q3057287>', '<http://www.wikidata.org/entity/Q90407>', '<http://www.wikidata.org/entity/Q85761254>', '<http://www.

Explanation of the code above:

 * We iterate over the triples that match the pattern to get the URIs of the historians. 
 * We convert the URIs (which RDFLib represents as `rdflib.URIRef` objects) to strings. 
 * We match the substring of Wikidata URIs in our strings.
 * We wrap the URI strings in angle brackets `<>`, because they will be listed in a SPARQL query with the `VALUES` keyword.
 * We add these URIs to a set (a collection of unique values).


### 3.2 Prepare the query

We have two options:
 * for each URI, we prepare a query to be sent to Wikidata. However, this implies sending many small queries to an external service that may (reasonably) have imposed a query limit. If you ever get the error `429: Too Many Requests`, see [here](https://stackoverflow.com/questions/62396801/how-to-handle-too-many-requests-on-wikidata-using-sparqlwrapper) for the reason.
 * **BETTER**: we send only one (bigger) query to the Wikidata endpoint, where all the historians are listed in a `VALUES` block. Every row of the result table will include (1) the URI of the historian, (2) the URI of the birthplace, and (3) the label of the birthplace *in English only!* (be aware that Wikidata has labels in many languages!)
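
The second option can be sketched as a small helper that builds the whole query from a collection of URIs. This is only an illustration under a few assumptions: the function name `build_birthplace_query` is hypothetical, the URIs are passed *without* angle brackets (the helper adds them), and the two Wikidata URIs are taken from the set of historians collected in section 3.1.

```python
def build_birthplace_query(uris):
    """Return one SPARQL query asking for the birthplaces of all the given URIs."""
    for uri in uris:
        # refuse strings that would break the VALUES block
        if any(char in uri for char in '<> "'):
            raise ValueError("not a plain URI: " + repr(uri))
    # sorted() makes the query text reproducible, since sets have no fixed order
    values = " ".join("<" + uri + ">" for uri in sorted(uris))
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?historian ?birthplace ?birthplace_label\n"
        "WHERE {\n"
        "    VALUES ?historian { " + values + " }\n"
        "    ?historian wdt:P19 ?birthplace .\n"
        "    ?birthplace rdfs:label ?birthplace_label .\n"
        '    FILTER (langMatches(lang(?birthplace_label), "EN"))\n'
        "}\n"
    )


values_demo_query = build_birthplace_query({
    "http://www.wikidata.org/entity/Q60185",
    "http://www.wikidata.org/entity/Q537874",
})
print(values_demo_query)
```

Keeping the query construction in one function makes it easier to reuse the same pattern for other properties or other lists of entities.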


In [14]:
# prepare the values to be queried
historians = ' '.join(arthistorians_list) # <uri1> <uri2> <uri3> ... <uriN>

# prepare the query
birthplace_query = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?historian ?birthplace ?birthplace_label
WHERE {
    VALUES ?historian {"""+historians+"""} . # look how we include a variable in a query string!
    ?historian wdt:P19 ?birthplace . 
    ?birthplace rdfs:label ?birthplace_label .
    FILTER (langMatches(lang(?birthplace_label), "EN"))
    } 
"""

 
### 3.3 Send the query to Wikidata


In [15]:
# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(birthplace_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

**INTEGRATE THE DATA INTO THE GRAPH** Once the query has been sent and the results are back, we process them:
 
 * only if a birthplace is found do we also look for its label 
 * only for birthplaces that have both a URI and a label do we create new triples to be added to our graph.
 
**STORE** Once all the new triples have been added to our in-memory graph, we store the data in a new file.

In [None]:
# manipulate the result
for result in results["results"]["bindings"]:
    historian_uri = result["historian"]["value"]
    print("historian:", historian_uri)
    if "birthplace" in result: # defensive check: historians with no birthplace in Wikidata return no row at all, since the pattern is not OPTIONAL
        birthplace = result["birthplace"]["value"]
        if "birthplace_label" in result: 
            birthplace_label = result["birthplace_label"]["value"]
            print("found:", birthplace, birthplace_label)
            
            # only if both uri and label are found we add them to the graph
            g.add(( URIRef(historian_uri) , URIRef(wdt.P19) , URIRef(birthplace) ))
            g.add(( URIRef(birthplace) , RDFS.label , Literal(birthplace_label) ))
    else:
        print("nothing found in wikidata :(")

g.serialize(destination='artchives_birthplaces.nq', format='nquads')

## For your project

If you want to use ARTchives in your project, you can:
 
 * query and manipulate ARTchives locally (use the `artchives.nq` file) via RDFLib methods or local SPARQL queries
 * query external sources remotely (Wikidata and others) using SPARQLWrapper
 * save the data extracted from external sources along with ARTchives data locally (create a new file)

## Resources 

Some SPARQL tutorials:
  
 * [SPARQL Tutorial - Apache](https://jena.apache.org/tutorials/sparql.html)
 * [SPARQL tutorial - stardog](https://www.stardog.com/tutorials/sparql/)
 * [SPARQL tutorial - Programming historian](https://programminghistorian.org/en/lessons/retired/graph-databases-and-SPARQL)
 * [SPARQL tutorial - Wikidata](https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial) - very useful for your project! It teaches you how to query data on Wikidata 
 
SPARQL and RDFLib:
   
 * [RDFLib documentation on SPARQL](https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html)
 
SPARQLWrapper:
 
 * [SPARQLWrapper documentation](https://sparqlwrapper.readthedocs.io/en/latest/main.html)
 
Wikidata resources:
 * [index of categories](https://www.wikidata.org/wiki/Category:Wikidata:SPARQL_query_service)