Errors in SPARQL query results #8

Closed
ceteri opened this issue Mar 20, 2022 · 8 comments

@ceteri

ceteri commented Mar 20, 2022

Hi @Tpt ,

First, thank you very much for the excellent oxrdflib library. We've had multiple requests to integrate this with our kglab project, and in some cases (e.g., with unions and axes) we see queries that have ~2 orders of magnitude better performance than with the default RDFlib.Store implementation.

One of our use cases at BASF has identified a couple of issues, and we wanted to provide a minimal code example to replicate these errors. The following script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import itertools
import sys
import time
import traceback
import typing

from icecream import ic
import oxrdflib
import rdflib


TTL_DATA = """
@prefix ex: <https://example.com/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Foo a owl:Class ;
    rdfs:label "Call me Foo"^^xsd:string ;
    rdfs:comment "A foo-like substance, commonly found in dimethyloxsorbate"^^xsd:string
.

ex:Bar a owl:Class ;
    rdfs:label "My name is Bar"
.
"""

QUERIES = {
    "BASE": """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label ?comment
  WHERE {
    OPTIONAL { ?item rdfs:comment ?comment } .
    OPTIONAL { ?item rdfs:label ?label } .
    FILTER(?item != owl:Nothing)
  }
    """,

    "NO_OPTIONAL" : """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label
  WHERE {
    ?item rdfs:label ?label .
    FILTER(?item != owl:Nothing)
  }
    """,

    "NO_PREFIX" : """
SELECT ?label
  WHERE {
    OPTIONAL { ?item rdfs:label ?label } .
    FILTER(?item != owl:Nothing)
  }
    """
}


def run_query (
    data: str,
    plugin: typing.Optional[str],
    query: str,
    item: str,
    bind: bool,
    ) -> None:
    """measure the timing and behavior for a SPARQL query"""
    ic(plugin, query, item, bind)

    if plugin is not None:
        g = rdflib.Graph(store=plugin)
    else:
        g = rdflib.Graph()

    g.parse(data=data, format="ttl")

    sparql = QUERIES[query].strip()

    bindings = {
        "item": rdflib.term.URIRef("https://example.com/" + item),
    }

    if not bind:
        for var, val in bindings.items():
            bind_var = "?" + var
            bind_val = "<" + str(val) + ">"
            ic(bind_var, bind_val)
            sparql = sparql.replace(bind_var, bind_val)

    print(sparql)

    # query init
    init_time = time.time()

    if bind:
        query_iter = g.query(sparql, initBindings=bindings)
    else:
        query_iter = g.query(sparql)

    duration = time.time() - init_time
    print(f"query init: {duration:10.3f} sec")

    # query exec
    count = 0
    init_time = time.time()

    for row in query_iter:
        ic(row)
        print(row.asdict())
        count += 1

    duration = time.time() - init_time
    print(f"query exec: {duration:10.3f} sec")

    if count < 1:
        print("MISSING RESULT")

    print()


if __name__ == "__main__":
    PLUGIN_LIST = [ None, "Oxigraph", ]
    QUERY_LIST =  [ "BASE", "NO_OPTIONAL", "NO_PREFIX", ]
    ITEM_LIST = [ "Foo", "Bar", ]
    BIND_LIST = [ True, False, ]

    for plugin, query, item, bind in itertools.product(PLUGIN_LIST, QUERY_LIST, ITEM_LIST, BIND_LIST):
        try:
            run_query(
                data = TTL_DATA,
                plugin = plugin,
                query = query,
                item = item,
                bind = bind,
            )
        except SyntaxError as ex:
            traceback.print_exc()

... was run with Python 3.8.10 on macOS with oxrdflib 0.3.0 installed from the repo (not PyPI) and produces these results:

ic| plugin: None, query: 'BASE', item: 'Foo', bind: True
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label ?comment
  WHERE {
    OPTIONAL { ?item rdfs:comment ?comment } .
    OPTIONAL { ?item rdfs:label ?label } .
    FILTER(?item != owl:Nothing)
  }
query init:      0.196 sec
ic| row: (rdflib.term.Literal('Call me Foo', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')),
          rdflib.term.Literal('A foo-like substance, commonly found in dimethyloxsorbate', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')))
{'label': rdflib.term.Literal('Call me Foo', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), 'comment': rdflib.term.Literal('A foo-like substance, commonly found in dimethyloxsorbate', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string'))}
query exec:      0.005 sec

ic| plugin: None, query: 'BASE', item: 'Foo', bind: False
ic| bind_var: '?item', bind_val: '<https://example.com/Foo>'
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label ?comment
  WHERE {
    OPTIONAL { <https://example.com/Foo> rdfs:comment ?comment } .
    OPTIONAL { <https://example.com/Foo> rdfs:label ?label } .
    FILTER(<https://example.com/Foo> != owl:Nothing)
  }
query init:      0.008 sec
ic| row: (rdflib.term.Literal('Call me Foo', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')),
          rdflib.term.Literal('A foo-like substance, commonly found in dimethyloxsorbate', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')))
{'label': rdflib.term.Literal('Call me Foo', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')), 'comment': rdflib.term.Literal('A foo-like substance, commonly found in dimethyloxsorbate', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string'))}
query exec:      0.001 sec

ic| plugin: None, query: 'BASE', item: 'Bar', bind: True
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label ?comment
  WHERE {
    OPTIONAL { ?item rdfs:comment ?comment } .
    OPTIONAL { ?item rdfs:label ?label } .
    FILTER(?item != owl:Nothing)
  }
query init:      0.007 sec
ic| row: (rdflib.term.Literal('My name is Bar'), None)
{'label': rdflib.term.Literal('My name is Bar')}
query exec:      0.001 sec

ic| plugin: None, query: 'BASE', item: 'Bar', bind: False
ic| bind_var: '?item', bind_val: '<https://example.com/Bar>'
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label ?comment
  WHERE {
    OPTIONAL { <https://example.com/Bar> rdfs:comment ?comment } .
    OPTIONAL { <https://example.com/Bar> rdfs:label ?label } .
    FILTER(<https://example.com/Bar> != owl:Nothing)
  }
query init:      0.009 sec
ic| row: (rdflib.term.Literal('My name is Bar'), None)
{'label': rdflib.term.Literal('My name is Bar')}
query exec:      0.001 sec

ic| plugin: None, query: 'NO_OPTIONAL', item: 'Foo', bind: True
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label
  WHERE {
    ?item rdfs:label ?label .
    FILTER(?item != owl:Nothing)
  }
query init:      0.006 sec
ic| row: (rdflib.term.Literal('Call me Foo', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')),)
{'label': rdflib.term.Literal('Call me Foo', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string'))}
query exec:      0.001 sec

ic| plugin: None, query: 'NO_OPTIONAL', item: 'Foo', bind: False
ic| bind_var: '?item', bind_val: '<https://example.com/Foo>'
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?label
  WHERE {
    <https://example.com/Foo> rdfs:label ?label .
    FILTER(<https://example.com/Foo> != owl:Nothing)
  }
query init:      0.005 sec
ic| row: (rdflib.term.Literal('Call me Foo', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string')),)
{'label': rdflib.term.Literal('Call me Foo', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#string'))}
query exec:      0.001 sec

Based on the TTL input and SPARQL queries used, each query should have one result row. The issues appear to be:

  • SPARQL queries must have a PREFIX defined for each of the namespaces referenced in the query, or oxrdflib will throw a SyntaxError exception
  • the OPTIONAL clause does not appear to work correctly; see the case that prints MISSING RESULT where the ?comment variable was within an OPTIONAL clause
  • the initBindings parameter does not appear to work correctly; see the same MISSING RESULT case, which fails when the variable is passed through initBindings but produces correct results when we do an explicit string replace in the query string

Please let us know if we can help troubleshoot any further?

cc: @paoespinozarias @neobernad @jelisf @Mec-iS @davidshumway

@Tpt
Contributor

Tpt commented Mar 20, 2022

Hi! Thank you for trying oxrdflib.

We see queries that have ~2 orders of magnitude better performance than with the default RDFlib.Store implementation.

It's amazing \o/. Oxigraph has no real query optimizer yet, so I am hoping for even better performance in the future.

SPARQL queries must have a PREFIX defined for each of the namespaces referenced in the query, or oxrdflib will throw a SyntaxError exception

It seems I missed a behavior of the rdflib query evaluator, which binds some namespaces by default. oxrdflib does not do this yet; I'm going to do a patch release that fixes this limitation.
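As a stopgap until such a release, a caller could prepend the missing declarations before handing the query to the store. The following is a minimal sketch of that idea, not oxrdflib's or rdflib's actual code: the helper name `add_missing_prefixes` and the `DEFAULT_NAMESPACES` table are hypothetical, and the set of namespaces rdflib really binds by default may differ between versions.

```python
import re

# Assumed defaults for illustration only; check your rdflib version
# for the namespaces it actually binds.
DEFAULT_NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def add_missing_prefixes(sparql: str, namespaces: dict = DEFAULT_NAMESPACES) -> str:
    """Prepend PREFIX lines for known prefixes that are used but not declared."""
    declared = {
        m.group(1)
        for m in re.finditer(r"\bPREFIX\s+([A-Za-z][\w-]*):", sparql, flags=re.IGNORECASE)
    }
    # Blank out IRIs and quoted strings so "https:" or literal text
    # is not mistaken for a prefixed name.
    stripped = re.sub(r"<[^>\s]*>|\"[^\"]*\"|'[^']*'", " ", sparql)
    used = {
        m.group(1)
        for m in re.finditer(r"(?<![\w?$])([A-Za-z][\w-]*):", stripped)
    }
    missing = sorted(p for p in used - declared if p in namespaces)
    header = "".join(f"PREFIX {p}: <{namespaces[p]}>\n" for p in missing)
    return header + sparql
```

Applied to the NO_PREFIX query from the script above, this would add declarations for `owl` and `rdfs` and leave the rest of the query untouched. It is regex-based, so unusual queries (for example, comments containing colons) may need a real parser.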

the OPTIONAL clause does not appear to work correctly; see the case that prints MISSING RESULT where the ?comment variable was within an OPTIONAL clause
the initBindings parameter does not appear to work correctly; see the same MISSING RESULT case, which fails when the variable is passed through initBindings but produces correct results when we do an explicit string replace in the query string

Thank you! I have not had time to investigate it yet. I believe it's likely to be something related to the way initBindings are managed in oxrdflib (Oxigraph does not provide a similar option, so initBindings is converted to a VALUES clause).

@Tpt
Contributor

Tpt commented Mar 20, 2022

Thank you! I have not had time to investigate it yet. I believe it's likely to be something related to the way initBindings are managed in oxrdflib (Oxigraph does not provide a similar option, so initBindings is converted to a VALUES clause).

After investigation I have found the cause of the error: initBindings in your test is setting a variable that is only used inside the query and not in the SELECT output. The join then fails and no results are returned. I believe I should move the VALUES clause to the beginning of the WHERE clause. It might be "fun" to do that properly without having to parse the query.
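To make the idea concrete, here is a rough sketch of placing a VALUES clause at the start of the WHERE clause using plain string handling. It is an illustration under stated assumptions, not the oxrdflib implementation: `values_clause` and `inject_values` are hypothetical names, only IRI values are handled, and the regular expression simply targets the first `WHERE {` it finds.

```python
import re


def values_clause(bindings: dict) -> str:
    """Render {"item": "https://example.com/Foo"} as a one-row VALUES clause."""
    names = " ".join(f"?{name}" for name in bindings)
    row = " ".join(f"<{iri}>" for iri in bindings.values())
    return f"VALUES ({names}) {{ ({row}) }}"


def inject_values(sparql: str, bindings: dict) -> str:
    """Naively insert the VALUES clause right after the first 'WHERE {'."""
    clause = values_clause(bindings)
    result, count = re.subn(
        r"(WHERE\s*\{)",
        lambda m: f"{m.group(1)} {clause}",
        sparql,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        # WHERE is optional in SPARQL, so a robust version needs a parser.
        raise ValueError("no 'WHERE {' found in query")
    return result
```

The shortcuts here hint at why doing this "properly" is hard without parsing: the WHERE keyword may be omitted, the text `WHERE {` can occur inside a string literal or comment, and subqueries introduce several WHERE clauses.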

@ceteri
Author

ceteri commented Mar 23, 2022

Thank you kindly @Tpt !

Yes, our queries have lots of UNIONs and binding variables that work with OWL restrictions.

For now, as a workaround, we have a function to expand binding variables into a query explicitly: https://github.com/DerwenAI/kglab/blob/274064e2c096d5778ab980ce3f2730e262ba1c6c/kglab/kglab.py#L1185
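For readers who do not want to follow the link, the general shape of such an expansion can be sketched as below. This is not the kglab function itself: `expand_bindings` is a hypothetical helper, it assumes every bound value is an IRI given as a string, and it only treats ASCII letters, digits and underscores as variable-name characters.

```python
import re


def expand_bindings(sparql: str, bindings: dict) -> str:
    """Replace each bound variable with its IRI, matching whole variable names only."""
    for name, iri in bindings.items():
        # The lookahead keeps ?item from matching inside ?items or ?item_2.
        pattern = r"[?$]" + re.escape(name) + r"(?![A-Za-z0-9_])"
        sparql = re.sub(pattern, lambda _m, iri=iri: f"<{iri}>", sparql)
    return sparql
```

Compared with a plain `str.replace`, as used in the script above, the lookahead avoids corrupting longer variable names that share a prefix. The sketch still rewrites occurrences inside string literals, and substituting a variable that appears in the SELECT list would yield an invalid query, so it is only suitable for variables used inside the WHERE clause.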

Using that (on an enterprise use case) we were able to get through our suite of regression tests with quite a variety of SPARQL queries, so it appears to be working well.

A typical editing session / user workflow in that app requires 60-90 SPARQL queries, some of which had required several minutes each to run. Now the time for an entire workflow has dropped to seconds instead!

Please let us know if we can help with test or evaluation in any way. We'll be watching the new releases closely :)

@Tpt
Contributor

Tpt commented Mar 23, 2022

A typical editing session / user workflow in that app requires 60-90 SPARQL queries, some of which had required several minutes each to run. Now the time for an entire workflow has dropped to seconds instead!

Using that (on an enterprise use case) we were able to get through our suite of regression tests with quite a variety of SPARQL queries, so it appears to be working well.

It is amazing!

For now, as a workaround, we have a function to expand binding variables into a query explicitly: https://github.com/DerwenAI/kglab/blob/274064e2c096d5778ab980ce3f2730e262ba1c6c/kglab/kglab.py#L1185

I am considering adding a similar workaround to oxrdflib, but I would also have to support CONSTRUCT, DESCRIBE and ASK queries, so it's a bit harder to do properly. "Fun" thing: rdflib provides a built-in storage system (SPARQLStore) that has the same behavior as oxrdflib. I have opened a task on rdflib about this discrepancy. So avoiding the initBindings parameter seems to me the safest way to stay storage independent.

Please let us know if we can help with test or evaluation in any way. We'll be watching the new releases closely :)

Thank you! It would be amazing! If you have automated tests related to rdflib, I would love to integrate them in this repository's test suite (I use the Python unittest lib for now). Benchmarks would also be very welcome, to track performance when making changes. If these are not rdflib-specific, I would love to integrate them in the main Oxigraph repository to track changes more closely. I already have a benchmark related to SPARQL and benchmarks for some common operations, but nothing targeting the Python bindings.
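One possible starting point for such a benchmark is a small store-agnostic timing harness that keeps the same split between query initialization and result iteration as the script at the top of this issue. The sketch below uses only the standard library; `time_query` is a hypothetical name, and the choice of the median over a handful of repeats is an assumption rather than an established convention of either project.

```python
import statistics
import time
from typing import Callable, Iterable


def time_query(run: Callable[[], Iterable], repeats: int = 5) -> dict:
    """Time query setup and result iteration separately over several runs."""
    init_times, exec_times = [], []
    rows = 0
    for _ in range(repeats):
        start = time.perf_counter()
        result = run()  # e.g. building the lazy result object
        init_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        rows = sum(1 for _ in result)  # force full evaluation
        exec_times.append(time.perf_counter() - start)
    return {
        "rows": rows,
        "init_median": statistics.median(init_times),
        "exec_median": statistics.median(exec_times),
    }
```

With rdflib, the callable would presumably be something like `lambda: g.query(sparql)`, where `g` and `sparql` are built as in the script above; running it once with the default store and once with `store="Oxigraph"` gives comparable numbers.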

@ceteri
Author

ceteri commented Mar 23, 2022

Many thanks @Tpt !

Hi @Mec-iS are there ways in which we could share or collab with @Tpt on SPARQL-related tests in kglab ?

@Mec-iS

Mec-iS commented Mar 23, 2022

sure. let's collect the requirements and examples DerwenAI/kglab#248

@Tpt
Contributor

Tpt commented Apr 2, 2022

I have released v0.3.1 with a fix for namespace support.

I have also opened #11 about the initBindings parameter.

Feel free to close this issue if you think everything is now fixed or covered by other issues or to keep it open if you encounter other problems.

@ceteri
Author

ceteri commented Apr 3, 2022

Looks good, many thanks @Tpt !

ceteri closed this as completed Apr 3, 2022