Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARQL join tests & fixes #127

Closed
wants to merge 1 commit into from

Conversation

gsvarovsky
Copy link
Contributor

Trying to track down some issues with SPARQL joins. There are two problems; I have only fixed one. See test/sparql/sparql.select.join.js:

  1. Doing a basic join with two quad patterns (should join quad patterns) passes its test only in MemDOWN. I have scratched my head about why this might be but to no avail. @jacoscaz it would be great if you could have a quick look.
  2. Doing a join where some of the candidate matches considered during the query have a literal in the Subject position was causing an Error to be thrown inside quadstore. I have fixed this by checking for Term Types in unexpected positions. In doing so I noted that the match API in quadstore is more restrictive than that in the RDFJS Store API. It appears that Comunica relies upon the looser definition taking Terms for all positions.

Fix for numeric candidate Subject join via conformance to RDFJS Store spec
should(items[0]['?s']['value']).equal('http://ex.com/bob');
});

it.skip('should join quad patterns with numeric candidate', async function () {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to mention, this test is skipped because it also fails on LevelDOWN and RocksDB. However without the fix in quadstore it fails everywhere.

@jacoscaz
Copy link
Owner

jacoscaz commented Jan 6, 2021

This is very weird! When run separately, the two SELECT queries with ?s <http://ex.com/knows> ?s2 and ?s2 <http://ex.com/greeting> "hi" seem to work all right on all backends. Furthermore, Comunica doesn't seem to be doing a nested join the way I had expected. I'll look into this and let you know.

@jacoscaz
Copy link
Owner

jacoscaz commented Jan 6, 2021

In the meantime, the RDF 1.1 spec doesn't seem to allow Literal nodes as subjects: https://www.w3.org/TR/rdf11-concepts/#section-triples

@gsvarovsky
Copy link
Contributor Author

RDF 1.1 spec doesn't seem to allow Literal nodes as subjects

Indeed not. However, it seems that when the variables candidate bindings are being explored, Comunica nevertheless can end up calling match with a literal as the first argument. Hence, my fix is to make quadstore allow that, but simply return an empty iterator. According to the strict Store API spec, any Term can be used, so I think my fix is correct. Maybe @rubensworks can confirm.

@jacoscaz
Copy link
Owner

jacoscaz commented Jan 6, 2021

@rubensworks a couple of questions for you:

  1. When using patterns like this one
?s2 <http://ex.com/greeting> "hi" .
?s ?p ?s2 .

Comunica seems to invoke the match() method of RDF/JS sources even when s2 is bound to a Literal, which should make it unqualified to be used as a subject if I understand correctly. Leaving aside that quadstore is making this harder to debug because there's something preventing the resulting errors from bubbling up to (or through) Comunica, how should I address cases like this? Should I just return an empty set/iterator of quads?

  1. We have a test case in which the following query results in an empty set of quads even though it shouldn't. This only happens with disk-based storage backends and doesn't happen with the in-memory backend.
      SELECT * {
          ?s <http://ex.com/knows> ?s2 .
          ?s2 <http://ex.com/greeting> "hi"
        }

What's strange is that if I run two separate queries, one for each of those patterns, I get the expected results and I am able to join them by hand. As all the backends are abstracted via the AbstractLevelDOWN interface, could there be a timing issue somewhere? Have you ever encountered anything like this?

@gsvarovsky
Copy link
Contributor Author

Thanks for having a look @jacoscaz. With regard to the Leveldown behaviour being different to Memdown, I've been experimenting with a hypothesis that there is a race condition – in Memdown the race is always won because of the very fast query response, on the microtask queue. In particular, I thought it was suspicious that quadstore's match implementation has to wait for a promise to return before setting the source on the return iterator here:
https://github.com/beautifulinteractions/node-quadstore/blob/560b28204150ff1dc047788c37cf8d8e8c58dbd2/lib/quadstore.ts#L178

TransformIterator is able to take a promise as a first argument, so I tried using the promise from the getStream call. All other tests pass, but it doesn't fix the problem, sadly. Investigations continue!

@jacoscaz
Copy link
Owner

jacoscaz commented Jan 7, 2021

Quick update on this - I've tried refactoring getStream() into a synchronous method so that match() immediately gets the actual iterator rather than the pass-through transform iterator but doing so did not make any difference WRT to the join issue. I wonder if there's something in the lib/get/leveliterator.ts wrapper that is causing this?

@rubensworks
Copy link

Comunica seems to invoke the match() method of RDF/JS sources even when s2 is bound to a Literal, which should make it unqualified to be used as a subject if I understand correctly.

Indeed, strictly speaking, RDF does not allow literals in subject position.
Comunica deliberately does not strictly check this, as some users may want to use it for querying under generalized RDF mode.

Furthermore, always checking term types before binding may introduce some performance overhead, so Comunica expects that the store simply returns empty results in those cases. This should be in line with the semantics of RDF/JS, where generalized RDF is also possible by default.

We have a test case in which the following query results in an empty set of quads even though it shouldn't.

In case you are manually creating asynciterators, you could try enabling the autoStart flag.
This may have something to do with it.
If you're not creating such iterators, then I don't really have an idea what could be causing this.

@jacoscaz
Copy link
Owner

jacoscaz commented Jan 7, 2021

so Comunica expects that the store simply returns empty results in those cases

Sounds good! I was not aware of the generalized RDF mode.

you could try enabling the autoStart flag

Good idea, I'll test this shortly. Thanks!

@jacoscaz
Copy link
Owner

jacoscaz commented Jan 7, 2021

I think I've found the culprit. There are cases in which store.db.approximateSize() rounds the actual non-zero size to 0. This leads Quadstore.prototype.getApproximateSize() to return 0, which in turn causes Quadstore.prototype.countQuads() to return 0, which in turn tricks Comunica into thinking there are no quads available.

@jacoscaz
Copy link
Owner

jacoscaz commented Jan 7, 2021

@gsvarovsky I've just pushed the fix for the broken JOIN queries to master, including your tests. Can you test it?

@jacoscaz
Copy link
Owner

jacoscaz commented Jan 7, 2021

@gsvarovsky just pushed the fix for breaking queries when in generalized rdf mode with a literal subject. I'll release tomorrow or whenever you get a chance to validate these fixes. Just FYI, a more definitive fix (and potentially support for generalized rdf mode, not sure yet) will come in v8.0, for which I'm planning a significant upheaval of the (de)serialization mechanism that will save a ton of storage space and should get us much closer to the raw performance of the AbstractLevelDOWN backend.

@gsvarovsky
Copy link
Contributor Author

Super! I've tested locally using the quadstore unit tests and m-ld's unit test suite with Memdown. I'm buried in a branch which is not ready for integration tests (where persistence is tested), so if you don't mind, I'll take it on faith that Leveldown works too, for now.

@gsvarovsky gsvarovsky closed this Jan 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants