<a href="https://colab.research.google.com/github/momo54/Sage-Jupy/blob/main/Sage_Jupy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running [SaGe](https://sage.univ-nantes.fr) in a Jupyter Notebook

SaGe is a SPARQL query engine for public Linked Data providers that implements Web preemption. It consists of a smart SaGe client and a SaGe SPARQL query server that hosts RDF datasets (stored in PostgreSQL or HDT). The server suspends a SPARQL query after a fixed time quantum and resumes it when the client sends a new request. With Web preemption, SaGe ensures stable response times for query execution and complete results under high load.

The complete approach and the experimental results are described in a research paper accepted at The Web Conference 2019: Thomas Minier, Hala Skaf-Molli and Pascal Molli. "SaGe: Web Preemption for Public SPARQL Query services" in Proceedings of the 2019 World Wide Web Conference (WWW'19), San Francisco, USA, May 13-17, 2019.

We welcome feedback, comments, and questions on our mailing list or on our issue tracker on [GitHub](https://github.com/sage-org).

## Installation

We install SaGe with the HDT backend only. Other backends can store and update data, but they are not directly supported in a Jupyter Notebook.

In [1]:
!pip install sage-engine
!pip install pybind11
!pip install hdt


Collecting hdt
  Using cached https://files.pythonhosted.org/packages/51/82/41f1e4a131881da64a1ab2c4675dd93020a1a7109be08a2eb790cb6b92c6/hdt-2.3.tar.gz
Collecting pybind11==2.2.4
  Using cached https://files.pythonhosted.org/packages/f2/7c/e71995e59e108799800cb0fce6c4b4927914d7eada0723dd20bae3b51786/pybind11-2.2.4-py2.py3-none-any.whl
Building wheels for collected packages: hdt
  Building wheel for hdt (setup.py) ... done
  Created wheel for hdt: filename=hdt-2.3-cp37-cp37m-linux_x86_64.whl size=5278053 sha256=4832150c39f2a3363e0aec2197d4cadebd3725697e68e34ab93b38c57ed52eb7
  Stored in directory: /root/.cache/pip/wheels/c6/64/28/ee2f54a78b64368f3e633637a0707549ba7a6e1c30079d966b
Successfully built hdt
Installing collected packages: pybind11, hdt
  Found existing installation: pybind11 2.6.2
    Uninstalling pybind11-2.6.2:
      Successfully uninstalled pybind11-2.6.2
Successfully installed hdt-2.3 pybind11-2.2.4


## Configuration



We need a dataset, and we need to configure the server to use it.


*   `config.yaml` is a simple configuration file for SaGe. It sets two parameters:


1.   The quantum is fixed to 75 ms (`quota: 75`).
2.   The maximum page size is 2000 results (`max_results: 2000`).


*   `swdf-2012-11-28.hdt` is the Semantic Web Dog Food dataset in HDT format. SaGe can use an HDT file, a PostgreSQL backend, or a SQLite backend. HDT is convenient when running in a Jupyter Notebook.




In [2]:
!wget http://gaia.infor.uva.es/hdt/swdf-2012-11-28.hdt.gz
!gunzip -f swdf-2012-11-28.hdt.gz
## download a ready-made config.yaml and display it
!wget -q "https://raw.githubusercontent.com/momo54/Sage-Jupy/main/config.yaml" -O config.yaml
!cat config.yaml

--2021-06-06 08:43:39--  http://gaia.infor.uva.es/hdt/swdf-2012-11-28.hdt.gz
Resolving gaia.infor.uva.es (gaia.infor.uva.es)... 157.88.123.104
Connecting to gaia.infor.uva.es (gaia.infor.uva.es)|157.88.123.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2403825 (2.3M) [application/x-gzip]
Saving to: ‘swdf-2012-11-28.hdt.gz’


2021-06-06 08:43:41 (1.72 MB/s) - ‘swdf-2012-11-28.hdt.gz’ saved [2403825/2403825]

name: SaGe Test server
maintainer: Chuck Norris
quota: 75
max_results: 2000
default_graph_uri: http://localhost:8000/sparql/part
graphs:
-
  name: swdf
  uri: http://example.org/swdf
  description: Semantic Web Dog Food in HDT
  backend: hdt-file
  file: swdf-2012-11-28.hdt
-
  name: swdfsq
  uri: http://example.org/swdf-sq
  description: Semantic Web Dog Food in SQlite
  backend: sqlite
  database: data-sq.db
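
The configuration printed above declares two graphs, but this notebook only downloads the HDT file. As a minimal sketch, the snippet below writes a variant that keeps only the HDT graph. It reuses the keys and values shown above; the file name `config-hdt.yaml` and the helper `write_config` are hypothetical, and whether the server needs any other setting is not verified here.

```python
from pathlib import Path

# Same keys and values as the config.yaml printed above, minus the SQLite graph.
HDT_ONLY_CONFIG = """\
name: SaGe Test server
maintainer: Chuck Norris
quota: 75
max_results: 2000
default_graph_uri: http://localhost:8000/sparql/part
graphs:
-
  name: swdf
  uri: http://example.org/swdf
  description: Semantic Web Dog Food in HDT
  backend: hdt-file
  file: swdf-2012-11-28.hdt
"""


def write_config(path="config-hdt.yaml", text=HDT_ONLY_CONFIG):
    """Write the configuration text to `path` (a hypothetical file name) and return the path."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target


path = write_config()
print(f"wrote {path}")
```

The resulting file could then be passed to `sage` in place of `config.yaml`.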


## Starting the server

The SaGe server is started with 2 workers. The quantum of 75 ms and the maximum page size of 2000 results come from `config.yaml`.

In [3]:
%%bash --bg --out script_out
sage config.yaml -p 8000 -w 2 -h "0.0.0.0" > server_out

Starting job # 0 in a separate thread.
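
The `%%bash --bg` cell magic only exists inside IPython. Outside a notebook, roughly the same launch could be sketched with the standard `subprocess` module. The command-line flags are copied from the cell above; the helper names `sage_command` and `start_sage` are hypothetical, and this sketch also captures stderr, which the original cell does not.

```python
import subprocess


def sage_command(config="config.yaml", port=8000, workers=2, host="0.0.0.0"):
    """Build the same command line as the %%bash cell above."""
    return ["sage", config, "-p", str(port), "-w", str(workers), "-h", host]


def start_sage(log_path="server_out", **options):
    """Launch the server in the background and send its output to `log_path`.

    The log file is left open for the lifetime of the child process.
    """
    log = open(log_path, "w")
    return subprocess.Popen(sage_command(**options), stdout=log, stderr=subprocess.STDOUT)


print(" ".join(sage_command()))
```

Calling `start_sage()` returns a `Popen` object, so the server can later be stopped with its `terminate()` method.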


In [4]:
## print server output
!tail server_out

Test whether the SaGe server is running. You should see "The SaGe SPARQL query server is running!"

---



In [5]:
## just testing the server is running...
!curl http://0.0.0.0:8000

curl: (7) Failed to connect to 0.0.0.0 port 8000: Connection refused
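
A "Connection refused" error like the one above may simply mean that the background server had not finished starting when `curl` ran. That is an assumption; the contents of `server_out` would confirm or rule it out. A minimal sketch of a polling helper, using only the standard library, is shown below. The name `wait_for_server` and its default timings are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=30.0, interval=0.5):
    """Poll `url` until it answers or `timeout` seconds elapse.

    Returns True as soon as the server responds, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval + 1.0):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or not listening yet: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_server("http://0.0.0.0:8000")` could be called before running any query. If it returns `False`, the server log is the next place to look.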


## Running queries

As a web server, SaGe can be queried from any language.
Below, we show how to do that in Python (as we are in a Jupyter Notebook). We also provide a JS client and a Java client.

First, we call the SaGe server for a single quantum. The server interrupts the query once the quantum is exhausted or the maximum number of results is reached.

In [12]:
import requests
from json import dumps

## the query to execute
query = 'select * where {?s a ?o . ?o a ?s}'

entrypoint = 'http://0.0.0.0:8000/sparql'
default_graph_uri = 'http://example.org/swdf'
headers = {"accept": "text/html",
           "content-type": "application/json",
           "next": None}
payload = {"query": query,
           "defaultGraph": default_graph_uri}
has_next = True
count = 0
nbResults = 0
nbCalls = 0
limit = 10

## call the server
response = requests.post(entrypoint, headers=headers, data=dumps(payload))

## the results
json_response = response.json()
nbResults += len(json_response['bindings'])
print(f'got:{nbResults}')

## print some results
for bindings in json_response['bindings']:
    print(str(bindings))
    count += 1
    if count >= limit:
        break

## the link to continue the execution
has_next = json_response['next']
payload["next"] = json_response["next"]
nbCalls += 1

print(f'and the next link is {json_response["next"]}')

got:0
and the next link is EucEGuQECvcBClIKAj9vEi9odHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjdHlwZRoCP3MiF2h0dHA6Ly9leGFtcGxlLm9yZy9zd2RmGkkKAj9vEkNodHRwOi8vZGF0YS5zZW1hbnRpY3dlYi5vcmcvY29uZmVyZW5jZS9scmVjLzIwMDgvcGFwZXJzLzEwMS9hdXRob3JzGjQKAj9zEi5odHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjU2VxIgQ0NTA0KhoyMDIxLTA2LTA2VDA4OjQ1OjM0LjE1ODgxMTLtAQpSCgI/cxIvaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zI3R5cGUaAj9vIhdodHRwOi8vZXhhbXBsZS5vcmcvc3dkZhJBCgI/bxI7aHR0cDovL2RhdGEuc2VtYW50aWN3ZWIub3JnL2NvbmZlcmVuY2UvbHJlYy8yMDA4L3BhcGVycy8xMDESNQoCP3MSL2h0dHA6Ly9zd3JjLm9udG93YXJlLm9yZy9vbnRvbG9neSNJblByb2NlZWRpbmdzIgEwKhoyMDIxLTA2LTA2VDA4OjQ1OjM0LjE1ODgxMVpBCgI/bxI7aHR0cDovL2RhdGEuc2VtYW50aWN3ZWIub3JnL2NvbmZlcmVuY2UvbHJlYy8yMDA4L3BhcGVycy8xMDFaNQoCP3MSL2h0dHA6Ly9zd3JjLm9udG93YXJlLm9yZy9vbnRvbG9neSNJblByb2NlZWRpbmdz


## Inspecting the Next Link

We can decode the value of the next link. As you can see, the next link contains the state of the suspended query.

In [13]:
from sage.http_server.utils import decode_saved_plan, encode_saved_plan
from sage.query_engine.protobuf.iterators_pb2 import (RootTree,SavedProjectionIterator,SavedScanIterator)
import sys

next_link=json_response["next"]

print(f'next link size:{sys.getsizeof(next_link)} bytes')
print(f'next link:{next_link}')
if next_link is not None:
  saved_plan = next_link
  plan = decode_saved_plan(saved_plan)
  root = RootTree()
  root.ParseFromString(plan)
  print(root)

next link size:873 bytes
next link:EucEGuQECvcBClIKAj9vEi9odHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjdHlwZRoCP3MiF2h0dHA6Ly9leGFtcGxlLm9yZy9zd2RmGkkKAj9vEkNodHRwOi8vZGF0YS5zZW1hbnRpY3dlYi5vcmcvY29uZmVyZW5jZS9scmVjLzIwMDgvcGFwZXJzLzEwMS9hdXRob3JzGjQKAj9zEi5odHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjU2VxIgQ0NTA0KhoyMDIxLTA2LTA2VDA4OjQ1OjM0LjE1ODgxMTLtAQpSCgI/cxIvaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zI3R5cGUaAj9vIhdodHRwOi8vZXhhbXBsZS5vcmcvc3dkZhJBCgI/bxI7aHR0cDovL2RhdGEuc2VtYW50aWN3ZWIub3JnL2NvbmZlcmVuY2UvbHJlYy8yMDA4L3BhcGVycy8xMDESNQoCP3MSL2h0dHA6Ly9zd3JjLm9udG93YXJlLm9yZy9vbnRvbG9neSNJblByb2NlZWRpbmdzIgEwKhoyMDIxLTA2LTA2VDA4OjQ1OjM0LjE1ODgxMVpBCgI/bxI7aHR0cDovL2RhdGEuc2VtYW50aWN3ZWIub3JnL2NvbmZlcmVuY2UvbHJlYy8yMDA4L3BhcGVycy8xMDFaNQoCP3MSL2h0dHA6Ly9zd3JjLm9udG93YXJlLm9yZy9vbnRvbG9neSNJblByb2NlZWRpbmdz
proj_source {
  join_source {
    scan_left {
      pattern {
        subject: "?o"
        predicate: "http://www.w3.org/1999/02/22-rdf-syn
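
The next link printed above looks like Base64 text. Assuming it is standard Base64 wrapping a serialized protobuf message (which the use of `decode_saved_plan` followed by `RootTree.ParseFromString` suggests), the standard library alone is enough to see that the triple pattern travels inside the link. The sketch below decodes only the first 88 characters of the link shown above.

```python
import base64

# First 88 characters of the next link printed above.
# The length is a multiple of 4, so no Base64 padding is needed.
prefix = (
    "EucEGuQECvcBClIKAj9vEi9odHRwOi8vd3d3LnczLm9yZy8xOTk5"
    "LzAyLzIyLXJkZi1zeW50YXgtbnMjdHlwZRoC"
)

raw = base64.b64decode(prefix)

# The raw bytes mix protobuf framing with readable strings from the query plan.
rdf_type = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
print(f"decoded {len(raw)} bytes")
print(f"contains the variable ?o: {b'?o' in raw}")
print(f"contains the rdf:type predicate: {rdf_type in raw}")
```

This only inspects the bytes. Parsing them into a structured plan still requires the `RootTree` message class used in the cell above.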

Sending the next link back to the server resumes the query from where it was stopped. In practice, the client repeats next/next/next until no more results are available.

In [15]:
if has_next :
  response = requests.post(entrypoint, headers=headers, data=dumps(payload))
  json_response = response.json()
  has_next = json_response['next']
  payload["next"] = json_response["next"]
  nbResults += len(json_response['bindings'])
  nbCalls += 1
  count=0
  for bindings in json_response['bindings']:
    print(str(bindings))
    count += 1
    if count >= limit:
      break


## Observing progression

If we decode the next link again, we can see that the scan has progressed (compare the 'last_read' fields).

In [16]:
from sage.http_server.utils import decode_saved_plan, encode_saved_plan
from sage.query_engine.protobuf.iterators_pb2 import (RootTree,SavedProjectionIterator,SavedScanIterator)
next_link=json_response["next"]
print(f'the next link {next_link} contains')
if next_link is not None:
  saved_plan = next_link
  plan = decode_saved_plan(saved_plan)
  root = RootTree()
  root.ParseFromString(plan)
  print(root)

the next link EpIEGo8ECtwBClIKAj9vEi9odHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjdHlwZRoCP3MiF2h0dHA6Ly9leGFtcGxlLm9yZy9zd2RmGjsKAj9vEjVodHRwOi8vZGF0YS5zZW1hbnRpY3dlYi5vcmcvcGVyc29uL2VyaWthLWRlLWZyYW5jZXNjbxomCgI/cxIgaHR0cDovL3htbG5zLmNvbS9mb2FmLzAuMS9QZXJzb24iBTE0MjkyKhoyMDIxLTA2LTA2VDA4OjQ1OjM0LjE1ODgxMTLQAQpSCgI/cxIvaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zI3R5cGUaAj9vIhdodHRwOi8vZXhhbXBsZS5vcmcvc3dkZhImCgI/cxIgaHR0cDovL3htbG5zLmNvbS9mb2FmLzAuMS9QZXJzb24SMwoCP28SLWh0dHA6Ly9kYXRhLnNlbWFudGljd2ViLm9yZy9wZXJzb24vZXJpay13aWxkZSIBMCoaMjAyMS0wNi0wNlQwODo0NTozNC4xNTg4MTFaJgoCP3MSIGh0dHA6Ly94bWxucy5jb20vZm9hZi8wLjEvUGVyc29uWjMKAj9vEi1odHRwOi8vZGF0YS5zZW1hbnRpY3dlYi5vcmcvcGVyc29uL2VyaWstd2lsZGU= contains
proj_source {
  join_source {
    scan_left {
      pattern {
        subject: "?o"
        predicate: "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
        object: "?s"
        graph: "http://example.org/swdf"
      }
      mu {
        key: "?o"
        valu

## Iterate until termination

Now we iterate until the query terminates.

In [17]:
while has_next :
  response = requests.post(entrypoint, headers=headers, data=dumps(payload))
  json_response = response.json()
  has_next = json_response['next']
  payload["next"] = json_response["next"]
  nbResults += len(json_response['bindings'])
  nbCalls += 1

## print some bindings...
count=0
for bindings in json_response['bindings']:
  print(str(bindings))
  count += 1
  if count >= limit:
    break

print(f'got {nbResults} results')
print(f'made {nbCalls} calls')

got 0 results
made 5 calls
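
The loop above can also be packaged as a generator, which keeps the paging logic in one place. The sketch below is not part of SaGe: `iter_pages` and `make_fake_server` are hypothetical names, and the fake server only imitates the `bindings`/`next` shape of the responses seen above so that the example runs without a live server.

```python
def iter_pages(post, query, graph):
    """Yield each page of bindings until the server stops returning a next link.

    `post` is any callable that takes the payload dict and returns the decoded JSON response.
    """
    payload = {"query": query, "defaultGraph": graph}
    while True:
        page = post(payload)
        yield page["bindings"]
        if not page.get("next"):
            break
        # Send the saved state back so the server resumes where it stopped.
        payload["next"] = page["next"]


def make_fake_server(rows, page_size):
    """Stand-in for the HTTP call: pages through `rows`, using the offset as the next link."""
    def post(payload):
        start = int(payload.get("next") or 0)
        end = start + page_size
        return {"bindings": rows[start:end],
                "next": str(end) if end < len(rows) else None}
    return post


rows = [{"?s": f"s{i}"} for i in range(5)]
post = make_fake_server(rows, page_size=2)
pages = list(iter_pages(post, "select * where {?s ?p ?o}", "http://example.org/swdf"))
print(f"made {len(pages)} calls, got {sum(len(p) for p in pages)} results")
```

Against the real server, `post` could be `lambda payload: requests.post(entrypoint, headers=headers, data=dumps(payload)).json()`, reusing the variables defined in the cells above.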
