sparql: counts don't seem to be reliable #112
Comments
Extending the timeout parameter increases the time allotted to producing a solution. See the different results produced when I doubled the processing time. Note, there are also hard limits configured on the server that override the timeout requested by the client.
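(A minimal sketch of how a client might pass a longer timeout to the endpoint, to see how the returned counts change. It assumes the Virtuoso SPARQL endpoint's `timeout` request parameter, given in milliseconds, and reuses the type-count query from this issue; parameter handling is not verified against a specific Virtuoso version.)

```python
# Sketch: re-run the same query with a doubled timeout and compare row counts.
# Assumes the endpoint honours a "timeout" URL parameter in milliseconds
# (still subject to the server-side hard limits mentioned above).
import requests

ENDPOINT = "http://dbpedia.org/sparql"
QUERY = """
SELECT ?type (COUNT(DISTINCT ?s) AS ?c)
WHERE { ?s a ?type . }
GROUP BY ?type
ORDER BY DESC(?c)
"""

def run(timeout_ms):
    r = requests.get(ENDPOINT, params={
        "query": QUERY,
        "timeout": timeout_ms,
        "format": "application/sparql-results+json",
    })
    r.raise_for_status()
    return r.json()["results"]["bindings"]

print(len(run(30000)), "rows with a 30s timeout")
print(len(run(60000)), "rows with a 60s timeout")  # may differ: anytime queries
```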
@kidehen timeouts are understandable. But giving a wrong result because of a timeout is a whole different story. Shouldn't it report failure instead? That's what Fuseki does, for example. Neither outcome is helpful, but Fuseki doesn't provide false results.
This isn't a false result. This is a solution to the query within the constraints of a timeout. This is a feature of Virtuoso.
Thanks, the time limit explains a bit, but this "feature" is highly confusing if not dangerous, because the user (in this case me, and I'm not exactly a novice) might not be aware that all the counts might be terribly wrong. Is there any way to distinguish a "cut-off" result from one which is accurate? I had assumed that a query which hits a timeout limit would return with an error (something like a 408, even though I'm not sure it's actually the right one) instead of silently returning wrong results.
On 12/3/13 9:18 AM, Jörn Hees wrote:
There is a DBpedia fair-use document [1],[2] about this matter. You won't
[1] http://dbpedia.org/OnlineAccess -- search on "Fair Use Policy"
This should be part of the response headers. Note:
Yes, there has to be HTTP response metadata indicating the state of the solution. The closest analogy here is a quiz contest where you have X seconds to answer.
@kidehen neither of the links you provide describes or warns of the reported problem: that counts can be wrong if a timeout is hit. I don't seem to have gotten my point across, so let me try again: I'm not arguing against fair use, timeouts or limits to be able to satisfy more users. I'm a fan! I'm arguing against the way you're treating a timeout.
I can see your point of view of trying to answer the query as well as you can in the given time, but as this report demonstrates, it is more dangerous than just returning an error.
One addendum, sorry:
On 12/4/13 6:50 AM, Jörn Hees wrote:
No, for the public DBpedia instance. We are deliberately not giving
There are also other instances of DBpedia data across:
On 12/4/13 6:49 AM, Jörn Hees wrote:
What you are not getting from my comment is the fact that there isn't a
With DBpedia and the Web, everything is unpredictable. Data is
We return an indicator via HTTP response (which you can test for) re. partial results.
DBpedia isn't a gospel of any kind. That isn't the purpose here. Please
It is dangerous to attempt the opposite, i.e., have no restrictions and
We have made a choice to make DBpedia available to the world, backed up
On 4 Dec 2013, at 14:09, Kingsley Idehen notifications@github.com wrote:
Are you sure that this is your statement? I just wanted the correct counts for types used on the DBpedia endpoint. All I'm asking for are its counts at the current point in time. Partial results are cool when you ask for them (explicitly); I didn't, and most people don't. If not explicitly asked for a partial result, it's more dangerous to report it in a fashion very similar to a complete result than to report an error. Ask a couple of developers what they would expect to happen… Cheers,
On 12/9/13 12:37 PM, Jörn Hees wrote:
My statement is this: DBpedia is going to produce solutions to SPARQL queries subject to
DBpedia isn't about marketing. I am making a statement about the
You are reporting the fact that you are executing a specific query that
[1] http://lod.openlinksw.com/sparql -- all you have to do is simply
[2] http://dbpedia-live.openlinksw.com -- which doesn't match LOD for
The DBpedia Endpoint is one point of access for the DBpedia dataset. An
That doesn't eradicate entailment, transitive closures, and related
You are one of many.
There are other endpoints, change the hostname part of the URL, as I've
I guess Google gives you complete results?
For you and your use case. You are but one agent.
It has been so since 2007, so I don't understand what you are making a fuss about.
Regards, Kingsley Idehen
On 9 Dec 2013, at 20:18, Kingsley Idehen notifications@github.com wrote:
Wrong, read again.
I'm fine with that and always have been.
I report that how you treat a timeout is bad, not that there is a timeout.
End of my feedback. Thanks for your time,
j
On 12/9/13 2:56 PM, Jörn Hees wrote:
I am not claiming that the timeout treatment is perfect. I've told you
This isn't a 408 or 500 condition.
@iv-an-ru any update on this?
http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSScalableInference contains a paragraph about partial query answering. Apparently you get a hint that a query result is incomplete when executing it in isql (haven't tested it though).
What is the custom HTTP header that is returned for partial results? Where is it documented?
@kidehen: The only headers I see in responses from Virtuoso are
On 12/11/14 3:25 AM, Jindřich Mynarz wrote:
We do provide a number of response headers, of which X-SQL-State is one. Example:
Links:
[1] http://docs.openlinksw.com/virtuoso/anytimequeries.html
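(A minimal, hedged sketch of checking the response headers for the partial-result indicator discussed here. The header name `X-SQL-State` is taken from later comments in this thread; its exact value and the conditions under which it is emitted are not confirmed here, so treat this purely as illustration.)

```python
# Sketch: inspect Virtuoso's response headers for a partial-result indicator.
# "X-SQL-State" is assumed from this thread; verify against your instance/docs.
import requests

r = requests.get(
    "http://dbpedia.org/sparql",
    params={"query": "SELECT * WHERE { ?s ?p ?o . }", "format": "text/csv"},
)
r.raise_for_status()

sql_state = r.headers.get("X-SQL-State")
if sql_state is not None:
    print("Server signalled an anytime/partial condition:", sql_state)
else:
    print("No partial-result header present (which, per this thread, "
          "does not guarantee the result is complete).")
```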
Thanks for the explanation. However, I wasn't able to reproduce it on any other Virtuoso endpoint. For example, using the public DBpedia endpoint to execute:

```
curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt

wc -l results.csv
# => 10001, i.e. trimmed results

cat headers.txt
# HTTP/1.1 200 OK
# Date: Thu, 11 Dec 2014 14:33:28 GMT
# Content-Type: text/csv; charset=UTF-8
# Content-Length: 1484509
# Connection: keep-alive
# Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu VDB
# Expires: Thu, 18 Dec 2014 14:33:28 GMT
# Cache-Control: max-age=604800
# Accept-Ranges: bytes
# => i.e. no X-SQL-State header
```

Is the custom header only sent:
Because the DBpedia instance has the following in its virtuoso.ini:
Meaning: the maximum SPARQL solution size for this instance is 10,000 records.

Links:
[1]
[2]
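(For reference, the setting being alluded to is presumably the `ResultSetMaxRows` limit named later in this thread. The snippet below is an assumed illustration of what such a `virtuoso.ini` entry looks like, not a copy of the DBpedia configuration.)

```ini
; virtuoso.ini (illustrative): cap the number of rows returned per SPARQL request
[SPARQL]
ResultSetMaxRows = 10000
```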
Kingsley, it seems you haven't got my question. I'm well aware of the effect of the ResultSetMaxRows setting.
So, this indicates that receiving partial results is not a sufficient condition for Virtuoso to provide the HTTP header informing that it indeed sent partial results. What are the necessary conditions of a SPARQL request in order for Virtuoso to send a response with the HTTP header indicating partial results?
Arriving at a resultset size, for the solution, within the timeout. For your
Action Item:
OK, I see that if I run
In order to tell if I received partial results, I need to execute an additional query, which is my original query wrapped in a COUNT:

```
curl "http://dbpedia.org/sparql?query=SELECT+%28COUNT%28*%29+AS+%3Fcount%29+WHERE+%7B+%7B+SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D+%7D+%7D&format=text%2Fcsv"
```

The response for this query tells me there are 943138267 results for my query in total. Given that I know this number, I can compare it with the number from the previous response. As you can see, executing twice as many queries just to be sure one's not receiving partial results is hardly optimal from the developer's perspective.

I think a more developer-friendly solution might be to have an HTTP response header serving as a boolean flag indicating whether results are partial or not, irrespective of the cause of incompleteness (e.g., ResultSetMaxRows or a timeout).
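(A minimal sketch of the double-query workaround described above: run the query, run its COUNT-wrapped variant, and compare the two numbers. The endpoint URL and queries mirror the curl examples in this thread; the approach is only a heuristic, as discussed below.)

```python
# Sketch: detect (probable) partial results by comparing the number of rows
# actually returned against a COUNT(*) over the same query.
# NB: the COUNT itself is subject to the same anytime/timeout behaviour.
import csv
import io
import requests

ENDPOINT = "http://dbpedia.org/sparql"
QUERY = "SELECT * WHERE { ?s ?p ?o . }"
COUNT_QUERY = "SELECT (COUNT(*) AS ?count) WHERE { { %s } }" % QUERY

def fetch_csv(query):
    r = requests.get(ENDPOINT, params={"query": query, "format": "text/csv"})
    r.raise_for_status()
    rows = list(csv.reader(io.StringIO(r.text)))
    return rows[1:]  # drop the CSV header row

returned = len(fetch_csv(QUERY))
total = int(fetch_csv(COUNT_QUERY)[0][0])

if returned < total:
    print("Partial results: got %d of %d rows" % (returned, total))
else:
    print("Got all %d rows (as far as COUNT(*) can tell)" % total)
```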
The whole point of this issue was that the current treatment, with its 200 status code and additional headers, is too implicit for end users, as well as for most developers and libraries. I'm begging you: can we please not serve timeouts / cut-off result sets with a 200 HTTP status code?
On 12/13/14 9:17 AM, Jindřich Mynarz wrote:
1-2 is the scope of the query timeout, while
Basically,
Err... it is, in the context of what you are trying to emulate i.e., a
This so-called developer cost burden isn't for Virtuoso to bear, it is
Performance optimizations in Virtuoso enable you (the developer) to get
Kingsley
I'm sorry I've failed to push the idea of warnings into the SPARQL Protocol spec. "Anytime queries" did not exist at that moment, but I was sure that "OK" and "error" is too black-and-white for the real world. Now I don't know any good solution. X-this and X-that in headers are informative, but the method has a fatal flaw: one looks at those hidden texts no earlier than when a rude error appears.
@kidehen: So your recommended solution to determine if a query has partial results is to execute an additional COUNT query?

I don't believe reporting that a response to a query has partial results has a significant cost for Virtuoso. This can be added just in the case when Virtuoso trims the result size (e.g., due to ResultSetMaxRows).

BTW, SPARQL is usually (e.g., 1, 2) said to have semantics based on the closed-world assumption.

@joernhees and @iv-an-ru: I agree that a non-HTTP-200 code would be nicer, but having some way to tell partial results (e.g., a custom header) is better than no way.
On 12/15/14 6:15 AM, Jörn Hees wrote:
For a query solution that has a fixed resultset size, based on a hard
We can make these configurable by the instance owner, should they not
Conditionally (by way of instance config), as indicated above.
On 12/15/14 8:33 AM, Jindřich Mynarz wrote:
I am saying to you that the issue of cursors is common, not new. It is
You want us to tell you that the hard
If you were running your own Virtuoso instance, you can opt to not have
"Anytime Query" is about query solution preparation and resultset
Even if it did, you are seeking something that isn't offered by other
See my response to the HTTP 200 matter.
I get that if SPARQL had cursors this would be solved differently.
Sorry, I have trouble parsing this sentence.
No. This is already provided by the
I sense some anger in this discussion again. I think this is coming from different points of view rather than from anyone attacking Virtuoso. You guys are doing an awesome job. So awesome that we developers come to you as the de facto lead in public SPARQL endpoints to give feedback and ask for things which would make our lives easier / reduce misunderstandings in development. I guess all our feedback in this issue boils down to this: we as developers want to be able to handle partial results better when communicating with Virtuoso endpoints. I think there are several dimensions to this, which are entangled in our discussion:
Thinking about all three, I was reminded of a tiny but powerful rule from the Zen of Python: "explicit is better than implicit". What could this mean for the dimensions above (just as thoughts):
Chicken & egg problem:You can read the first paragraph of this post another way: Because you're the defacto lead for public SPARQL endpoints, your defaults are pretty close to becoming the standard. If your default treatment of partial results is not informative for the closed world case, then it can never be for federated queries. |
On 12/15/14 9:42 AM, Jindřich Mynarz wrote:
I am telling you that we have the following distinct items:
They are not the same thing. We set a limit from which you fetch the solution in batches. This is
You have to perform the additional count query because these heuristics
In SQL, Scrollable Cursors are a feature. Net effect, they are distinct
Recap:
Partial condition arises when Virtuoso can't produce a complete solution
Query Solution Size != Query Results Retrieval Max Items Size, at least
Do you have an example of a DBMS product that offers what you are
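(To make the "fetch the solution in batches" point above concrete: a hedged sketch of the usual LIMIT/OFFSET paging pattern for retrieving a large solution in chunks no larger than the instance's result-set cap. The page size and endpoint are assumptions, and, as discussed throughout this thread, each page is still subject to anytime/timeout behaviour.)

```python
# Sketch: page through a large solution in batches using LIMIT/OFFSET,
# staying under the instance's per-request result-set cap (assumed 10,000 here).
import requests

ENDPOINT = "http://dbpedia.org/sparql"
PAGE_SIZE = 10000
# ORDER BY gives a stable ordering so OFFSET-based paging stays consistent.
BASE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o . } ORDER BY ?s ?p ?o"

def pages():
    offset = 0
    while True:
        query = "%s LIMIT %d OFFSET %d" % (BASE_QUERY, PAGE_SIZE, offset)
        r = requests.get(ENDPOINT, params={
            "query": query,
            "format": "application/sparql-results+json",
        })
        r.raise_for_status()
        rows = r.json()["results"]["bindings"]
        if not rows:
            break
        yield rows
        offset += PAGE_SIZE

for batch in pages():
    print("fetched", len(batch), "rows")
```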
I don't think I ever confused these two.
I don't think we understand each other. Let me try to clarify. When Virtuoso trims the result set size to
On 12/15/14 11:18 AM, Jörn Hees wrote:
Yes, explicit is better than implicit for sure. But we also have to
In the SQL realm, you would do one of the following:
Yes, but even if it's "closed world" you have the issue of data volume
Yes, which is what it is doing. It tells you when it wasn't able to
Thus, given:
It will indicate a partial resultset return via HTTP if it couldn't
It is doing that.
Re. SPARQL-FED we should have the same thing re. timeouts which can
Which is why we can improve things here by making 20X configurable by
Yes, if you configure your instance that way, when we add this feature
Yes, but he isn't distinguishing the solution size from the resultset
Our short-term option is for these 20X responses to be configurable. In
Another possibility, when we have the time, is to publish a guide for
On 12/15/14 12:02 PM, Jindřich Mynarz wrote:
It doesn't
I politely disagree with your assumptions. I don't know if you have experience with scrollable cursors in the
There is a reason why there are no live ad-hoc SQL RDBMS engines on the
[1] http://demo.openlinksw.com/XMLAexplorer/XMLAexplorer.html -- example
Kingsley
OK, I see I may have used confusing terms (e.g., "trimming"). @kidehen, thank you for pointing that out. I never meant to imply that Virtuoso first counts the size of a query result set and then trims it to the ResultSetMaxRows limit.
On 12/15/14 12:36 PM, Jindřich Mynarz wrote:
Virtuoso doesn't stop query execution. The preparation of a solution
We have to move the items associated with the solution from Virtuoso's
In ODBC/JDBC (where these matters are handled with better clarity), you
The bottom-line issue here is that we are paging (cursoring) through the
The SPARQL Query Protocol, which for all intents and purposes is the
We are going to need
If we are going to do scrollable cursors, it should be done right, even
@kidehen I think all that @jindrichmynarz suggests is that Virtuoso could add a header if (speaking in your terminology): no timeout condition arises, but Virtuoso has prepared solution items for transportation (so to speak) via a conveyor that holds more than ResultSetMaxRows capacity. In that case the server knows, without additional work, that the client won't get all that's on the conveyor belt, so a somehow partial/truncated/limited/whatever-word result.

The thing is that if I write a SPARQL query with LIMIT 100 and 100 results are returned, I know I probably should try to continue... (but I'm not sure)... With that header I could be sure that I need to if it's present... the bad thing is that I can't be sure I don't need to if it's not present.

But the header would be even more meaningful in other cases: what if I don't specify a limit in my query? I don't see the ResultSetMaxRows set in the virtuoso.ini as a client (or do I?). With that "ResultSetLimitHit" header I could know that there is maybe more. Why maybe? (Please correct if this is wrong.) I think you pointed this out before: the conveyor belt could be empty by coincidence when the next chunk isn't prepared yet, but the result set size limited by an explicit LIMIT clause or by ResultSetMaxRows is reached.

If that's how it works, one could even think of two headers, "ResultSetLimitHit" / "ResultSetLimitExceeded", or one header with two values: "ResultSetLimit: Hit" / "ResultSetLimit: Exceeded".
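(A hedged sketch of how a client might use the hypothetical header proposed above. Neither `ResultSetLimit` nor `ResultSetLimitHit` is an existing Virtuoso header — they are the suggestion from this comment — so the code falls back to the current "row count equals my LIMIT" heuristic when the header is absent.)

```python
# Sketch: client-side handling of a *hypothetical* partial-result header
# ("ResultSetLimit: Hit|Exceeded", as proposed above), with the current
# heuristic as a fallback. Do not rely on this header existing today.
import requests

ENDPOINT = "http://dbpedia.org/sparql"
LIMIT = 100
QUERY = "SELECT ?s WHERE { ?s a ?type . } LIMIT %d" % LIMIT

r = requests.get(ENDPOINT, params={
    "query": QUERY,
    "format": "application/sparql-results+json",
})
r.raise_for_status()
rows = r.json()["results"]["bindings"]

limit_flag = r.headers.get("ResultSetLimit")  # hypothetical, not implemented
if limit_flag in ("Hit", "Exceeded"):
    print("Server says more results exist beyond this page:", limit_flag)
elif len(rows) == LIMIT:
    print("Got exactly LIMIT rows; there *may* be more (heuristic only).")
else:
    print("Fewer than LIMIT rows returned; result is probably complete.")
```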
We can do two things here:
Re. #1, this is closer to your message above. I'll pick these items up with my development team. |
@kidehen: I recently stumbled upon this issue when sending a query over the JDBC interface. However, according to the Virtuoso documentation this should only apply to the SPARQL web service:
Is this the intended behaviour?
Any update on this?
For what it's worth: today, the original query returns just 3 results:

```
select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
```

If you remove the distinct:

```
select ?type (count(*) as ?c) where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
```

you get a lot more results, and the count for

If you ask only for:

```
select (count(*) as ?c) where {
  ?s a dbo:Place
}
```

Note: you get a different count for
I'm trying to get a top type count for DBpedia (Virtuoso version 07.00.3207 on Linux (x86_64-redhat-linux-gnu), Single Server Edition):
returns (apart from other rows) this row:
http://dbpedia.org/ontology/Place 89498
Out of curiosity I checked this again with the query below:
tells me it's
754450
There's an order of magnitude difference in these 2 counts. Please tell me I'm doing it wrong.
PS: I tried the first query without the group by, order by and limit clauses; it doesn't make a difference.