sparql: counts don't seem to be reliable #112

Open
joernhees opened this Issue Dec 3, 2013 · 39 comments


@joernhees

I'm trying to get a top type count for DBpedia (Virtuoso version 07.00.3207 on Linux (x86_64-redhat-linux-gnu), Single Server Edition):

select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
limit 50

returns (among other rows) this row:
http://dbpedia.org/ontology/Place 89498

Out of curiosity I checked this again with the query below:

select count(distinct ?s) where { ?s a <http://dbpedia.org/ontology/Place> }

which tells me it's 754450.

There's an order of magnitude difference in these 2 counts. Please tell me I'm doing it wrong.

PS: I tried the first query without the group by, order by and limit clauses; it doesn't make a difference.

@indeyets

@kidehen timeouts are understandable. But giving a wrong result because of a timeout is a whole different story.

Shouldn't it report failure instead? That's what Fuseki does, for example.

Neither outcome is helpful, but Fuseki doesn't provide false results.

@joernhees

Thanks, the time limit explains a bit, but this "feature" is highly confusing, if not dangerous, because the user (in this case me, and I'm not exactly a novice) might not be aware that all the counts might be terribly wrong.

Is there any way to distinguish a "cut-off" result from one which is accurate?

I had assumed that a query which hits a timeout limit would return with an error (something like a 408, even though I'm not sure that's actually the right one) instead of silently returning wrong results.

@joernhees

@kidehen neither of the links you provided describes or warns of the reported problem: that counts can be wrong if a timeout is hit.

I don't seem to have gotten my point across, let me try again:

I'm not arguing against fair use, timeouts or limits that let you serve more users. I'm a fan!

I'm arguing about the way you're treating a timeout.
If a query takes too long, there are two ways of dealing with it:
1. return an error, not a result
2. return a result with a BIG WARNING

  1. makes a developer, user or scientist (with a quick one-off SPARQL query in your web interface) look into it again. They are in no danger of using a wrong result, as there is none!

  2. probably leads to the warning being lost somewhere in the process, never being shown to the user, and the numbers being taken for granted in the end. This is what happened here. Not even the HTML result page of your own SPARQL web interface shows the tiniest hint to the user that he should be careful. Even if you argue that this is not the prime "end user"... can you name a widely used SPARQL client/library which handles this correctly?

I can see your point of view, trying to answer the query as well as you can in the given time, but as this report demonstrates it is more dangerous than just returning an error.

@joernhees

One addendum, sorry:
getting partial results should be opt-in, not an implicit default that you then have to check for.

@iv-an-ru iv-an-ru was assigned by openlink Feb 13, 2014
@joernhees

@iv-an-ru any update on this?

@sebastianthelen

http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSScalableInference contains a paragraph about partial query answering.

Apparently you get a hint that a query result is incomplete when executing it in isql (I haven't tested it, though).

@jindrichmynarz

What is the custom HTTP header that is returned for partial results? Where is it documented?

@jindrichmynarz

@kidehen: The only headers I see in responses from Virtuoso are Accept-Ranges, Cache-Control, Expires, Server, Connection, Content-Length, Content-Type and Date. I don't see any custom header that would indicate partial results. This is what I get when running SELECT * WHERE { ?s ?p ?o . }, which is trimmed by ResultSetMaxRows set to 10000 in virtuoso.ini, on the latest develop version of Virtuoso.

@jindrichmynarz

Thanks for the explanation. However, I wasn't able to reproduce it on any other Virtuoso endpoint. For example, using the public DBpedia endpoint to execute SELECT * WHERE { ?s ?p ?o . }:

curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt
wc -l results.csv
# => 10001 lines (CSV header + 10000 rows), i.e. trimmed results
cat headers.txt
# HTTP/1.1 200 OK
# Date: Thu, 11 Dec 2014 14:33:28 GMT
# Content-Type: text/csv; charset=UTF-8
# Content-Length: 1484509
# Connection: keep-alive
# Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu  VDB
# Expires: Thu, 18 Dec 2014 14:33:28 GMT
# Cache-Control: max-age=604800
# Accept-Ranges: bytes
# => i.e. no X-SQL-State header

Is the custom header only sent:

  • For specific versions of Virtuoso?
  • For specific configurations of Virtuoso?
  • With additional query parameters (e.g., CXML_redir_for_subjs, CXML_redir_for_hrefs, timeout and debug=on from your example)?
@jindrichmynarz

Kingsley, it seems you didn't get my question. I'm well aware of the effect of the ResultSetMaxRows configuration. Let me try to clarify. My question was about the missing HTTP header indicating partial results. The problem is:

  1. Do curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt.
  2. Receive partial results (exactly because of the ResultSetMaxRows=10000).
  3. cat headers.txt => No header indicating partial results is there.

So, this indicates that receiving partial results is not a sufficient condition for Virtuoso to provide the HTTP header informing that it indeed sent partial results. What are the necessary conditions for Virtuoso to send a response with the HTTP header indicating partial results?

@jindrichmynarz

OK, I see that if I run curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt, I find X-SPARQL-MaxRows: 10000 in the response headers. This is useful, but it doesn't tell me whether I have received partial results, because it may be the case that the total number of results is the same as ResultSetMaxRows.

In order to tell if I received partial results I need to execute an additional query, which is my original query wrapped in SELECT (COUNT(*) AS ?count) WHERE { { ... } }:

curl "http://dbpedia.org/sparql?query=SELECT+%28COUNT%28*%29+AS+%3Fcount%29+WHERE+%7B+%7B+SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D+%7D+%7D&format=text%2Fcsv"

The response for this query tells me there are 943138267 results for my query in total. Given that I know this number, I can compare it with the number from the X-SPARQL-MaxRows header and conclude that I have indeed received partial results.

As you can see, executing twice as many queries just to be sure one's not receiving partial results is hardly optimal from the developer's perspective. I think a more developer-friendly solution might be to have an HTTP response header serving as a boolean flag indicating whether results are partial, irrespective of the cause of incompleteness (e.g., ResultSetMaxRows or the timeout query parameter).
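The double-query workaround described above can be packaged into a small helper. A sketch in Python (the function names are illustrative, not part of any existing API; only the COUNT-wrapping pattern itself comes from the discussion):

```python
from urllib.parse import urlencode


def count_wrapper(query):
    """Wrap an arbitrary SPARQL SELECT query so it returns the total
    number of solutions instead of the solutions themselves."""
    return "SELECT (COUNT(*) AS ?count) WHERE { { %s } }" % query


def count_request_url(endpoint, query):
    """Build the GET URL for the wrapped COUNT query, asking for CSV output."""
    params = urlencode({"query": count_wrapper(query), "format": "text/csv"})
    return "%s?%s" % (endpoint, params)
```

Comparing the number of rows actually received against this total tells you whether the first response was partial, at the price of a second (potentially expensive) query.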

@joernhees

The whole point of this issue was that the current treatment with its 200 status code and additional headers is too implicit for end users, as well as most developers and libraries.

I'm begging you: can we please not serve timeouts / cut-off result sets with a 200 HTTP status code?
Rather serve them with a 206 status code or some self-invented 555 (server reached some limits, partial result only), and add the headers on top of that, so one can find out what happened.
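To illustrate the proposal, here is a sketch of what client-side handling could look like if such a convention existed. The 206/555 mapping is hypothetical (Virtuoso currently returns 200 in all of these cases):

```python
def classify_response(status_code):
    """Map a hypothetical status-code convention onto client actions.

    200 -- complete result set, safe to use
    206 -- partial because the client itself asked for a bounded result
    555 -- hypothetical "server reached some limits" code; warn loudly
    """
    if status_code == 200:
        return "complete"
    if status_code == 206:
        return "partial"
    if status_code == 555:
        return "partial-forced"
    if status_code >= 400:
        return "error"
    return "unknown"
```

The point of the proposal is that libraries which only check for "success" would stop silently treating truncated results as complete ones.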

@IvanMikhailov

I'm sorry I failed to push the idea of warnings into the SPARQL Protocol spec. "Anytime queries" did not exist at that moment, but I was sure that "OK" and "error" is too black-and-white for the real world. Now I don't know any good solution. X-this and X-that in headers are informative, but the method has a fatal flaw: one doesn't look at those hidden texts until a rude error appears.

@jindrichmynarz

@kidehen: So your recommended solution to determine if a query has partial results is to execute an additional COUNT query?

I don't believe reporting that a response to a query has partial results has a significant cost for Virtuoso. This can be added just in the case where Virtuoso trims the result size (e.g., due to ResultSetMaxRows or timeout). No additional computation is needed.

BTW, SPARQL is usually (e.g., 1, 2) said to have semantics based on the closed-world assumption.

@joernhees and @iv-an-ru: I agree that a non-200 HTTP code would be nicer, but having some way to tell partial results (e.g., a custom header) is better than no way.

@jindrichmynarz

I get that if SPARQL had cursors this would be solved differently.

What you don't do, as the server provider is implement that in a way that simply introduces performance overhead that isn't understood by clients.

Sorry, I have trouble parsing this sentence.

You want us to tell you that the hard LIMIT is X out of a total of Y.

No. This is already provided by the X-SPARQL-MaxRows header that you have recently introduced into Virtuoso. What I would like to know instead is whether this hard limit was applied. If X-SPARQL-MaxRows = 10000 and I receive 10000 results, there's no way of telling if it's a complete or partial result set without executing an additional COUNT query. Is there another way I'm missing?
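The ambiguity described here can at least be narrowed down on the client side. A minimal sketch in Python of the check being discussed (whether X-SPARQL-MaxRows is emitted at all depends on the Virtuoso version and configuration):

```python
def is_truncated(headers, row_count):
    """Guess whether a SPARQL result set was cut off by ResultSetMaxRows.

    headers   -- dict of HTTP response headers, with keys already
                 normalized to lower case by the caller
    row_count -- number of result rows actually received

    Returns True (the limit was reached, so the set *may* be truncated:
    the total could coincidentally equal the limit), False (below the
    limit), or None (no limit header present, nothing can be concluded).
    """
    max_rows = headers.get("x-sparql-maxrows")
    if max_rows is None:
        return None  # server gave no hint; results may still be partial
    return row_count >= int(max_rows)
```

This is exactly the weak spot being pointed out: the True case still needs a follow-up COUNT query to become a certainty.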

@joernhees

I sense some anger in this discussion again. I think this is coming from different points of view rather than anyone attacking Virtuoso. You guys are doing an awesome job. So awesome that we developers come to you as the de facto lead in public SPARQL endpoints to give feedback and ask for things which would make our lives easier / reduce misunderstandings in development.

I guess all our feedback in this issue boils down to this: we as developers want to be able to handle partial results better when communicating with Virtuoso endpoints.

I think there are several dimensions to this, which are entangled in our discussion:

  • scope (open / closed world?)
  • halting problem: can the server tell that a query hit its configured boundaries (I'm avoiding the word LIMIT so it's not confused with the SPARQL clause)? (timeout / max result size / ...?) Can it tell if any of the other endpoints it asks hit theirs?
  • form of presentation (visibility to the developer?)

Thinking about all three, I was reminded of a tiny but powerful rule from the Zen of Python: "explicit is better than implicit".

What could this mean for the dimensions above (just as thoughts):

  • scope: isn't it OK to treat a query under the closed-world assumption unless it is a federated query / asks for sponging? (So closed world until explicitly stated otherwise?)
  • halting problem:
    • if the scope was closed world, wouldn't the server know if it hits some configured boundary and could just tell us? (I'm not sure how complex this would be internally, but I guess whatever detects that a boundary is exceeded and stops execution could probably also add that information to the result.)
    • Obviously an open-world assumption is a different story, but shouldn't the server still be able to inform us when it hits its own boundaries / is waiting for a third party for too long / a third party maybe exceeded its boundaries? (chicken & egg problem)
  • form of presentation:
    • If the client-side query explicitly states the boundary which is exceeded, I guess a 200 status code with a partial result is OK.
    • If the server runs into boundaries the client query didn't explicitly state (e.g., some defaults, fairness-of-use boundaries, etc.), then the result should rather not be a 200, as that doesn't force developers to deal with them correctly. Still, the partial content could be delivered, either as a 206 or in the 5xx range...
    • In both cases (as @jindrichmynarz seems to suggest): headers which explain what happened and potentially reduce follow-up queries "just to find out if a result was partial" / which boundaries were hit would be great.

Chicken & egg problem:

You can read the first paragraph of this post another way: because you're the de facto lead for public SPARQL endpoints, your defaults are pretty close to becoming the standard. If your default treatment of partial results is not informative for the closed-world case, then it can never be for federated queries.

@jindrichmynarz

I am telling you that we have the following distinct items:

  1. query solution
  2. query resultset retrieval.

They are not the same thing.

I don't think I ever confused these two.

Basically, Virtuoso will not do that for you as it is an expensive operation that totally skews what it is doing.

I don't think we understand each other. Let me try to clarify. When Virtuoso trims the result set size to ResultSetMaxRows, it can as well add an additional header indicating the result set is trimmed. No additional computation is needed. You can hook this into the existing logic that decides whether to trim result sets or not.

@jindrichmynarz

OK, I see I may have used confusing terms (e.g., "trimming"). @kidehen, thank you for pointing that out.

I never meant to imply that Virtuoso first counts the size of a query result set and then trims it to ResultSetMaxRows. What I asked about is that when Virtuoso reaches ResultSetMaxRows or a timeout and stops the query execution, it can make this explicit by, e.g., adding an HTTP header indicating a partial result set. This is what I meant when I said that no additional computation is needed.
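To make the "no additional computation" argument concrete, here is a hypothetical sketch of such a server-side hook in Python. This is not Virtuoso code; the function name and headers are invented for illustration only:

```python
def finish_result_set(rows, limit_hit, timeout_hit, headers):
    """Hypothetical hook called when result delivery stops.

    The server already knows *why* it stopped (limit or timeout), so
    flagging the response as partial reuses that state -- no extra pass
    over the data is required.
    """
    if limit_hit or timeout_hit:
        headers["X-SPARQL-Partial"] = "true"  # invented header name
        headers["X-SPARQL-Reason"] = "maxrows" if limit_hit else "timeout"
    return rows, headers
```

The point is that the branch deciding to stop execution is also the natural place to set the flag.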

@joernhees

@kidehen I think all that @jindrichmynarz suggests is that Virtuoso could add a header if (speaking in your terminology): no timeout condition arises but Virtuoso has prepared the solution items for transportation (so to speak) via a conveyor that holds more than ResultSetMaxRows capacity.

In that case the server knows without additional work that the client won't get all that's on the conveyor belt, so a somehow partial/truncated/limited/whatever-word result.

The thing is that if I write a SPARQL query with LIMIT 100 and 100 results are returned, I know I probably should try to continue... (but I'm not sure)... With that header I could be sure that I need to if it's present... the bad thing is that I can't be sure I don't need to if it's not present.

But the header would be even more meaningful in other cases: what if I don't specify a limit in my query? I don't see the ResultSetMaxRows set in the virtuoso.ini as a client (or do I?).

With that "ResultSetLimitHit" header i could know that there is maybe more.

Why maybe? (Please correct me if this is wrong.) I think you pointed this out before: the conveyor belt could be empty by coincidence when the next chunk isn't prepared yet, but the result set size limited by an explicit LIMIT clause or by ResultSetMaxRows is reached.

If that's how it works, one could even think of two headers, "ResultSetLimitHit" / "ResultSetLimitExceeded", or one header with two values: "ResultSetLimit: Hit" / "ResultSetLimit: Exceeded".
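On the client side, the one-header/two-values variant could be consumed like this. ResultSetLimit is the hypothetical header from this proposal, not something Virtuoso actually sends:

```python
def interpret_result_set_limit(headers):
    """Interpret the proposed (hypothetical) ResultSetLimit header.

    'Hit'      -- the limit was reached; more results *may* exist
    'Exceeded' -- the server knows more results exist beyond the limit
    absent     -- the result set was not limited at all
    """
    value = headers.get("resultsetlimit")  # keys assumed lower-cased
    if value is None:
        return "complete"
    if value.lower() == "hit":
        return "maybe-more"
    if value.lower() == "exceeded":
        return "definitely-more"
    return "unknown"
```

With this distinction, only the "maybe-more" case would still need a follow-up COUNT query; "definitely-more" could go straight to pagination.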
