sparql: counts don't seem to be reliable #112

Open
joernhees opened this Issue Dec 3, 2013 · 41 comments

@joernhees

joernhees commented Dec 3, 2013

I'm trying to get a top type count for DBpedia (Virtuoso version 07.00.3207 on Linux (x86_64-redhat-linux-gnu), Single Server Edition):

select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
limit 50

returns (apart from other rows) this row:
http://dbpedia.org/ontology/Place 89498

Out of curiosity I checked this again with the query below:

select count(distinct ?s) where { ?s a <http://dbpedia.org/ontology/Place> }

tells me it's 754450.

There's an order of magnitude difference in these 2 counts. Please tell me I'm doing it wrong.

PS: I tried the first query without the group by, order by and limit clauses; it doesn't make a difference.
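
For anyone who wants to reproduce this, here is a minimal sketch, assuming Python with the third-party requests library (endpoint URL and queries as above; the variable names are mine):

import requests

ENDPOINT = "http://dbpedia.org/sparql"

GROUPED = """
select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
limit 50
"""
DIRECT = "select count(distinct ?s) where { ?s a <http://dbpedia.org/ontology/Place> }"

def run(query):
    # Ask for SPARQL JSON results so the bindings are easy to inspect.
    r = requests.get(ENDPOINT, params={
        "query": query,
        "format": "application/sparql-results+json",
    })
    r.raise_for_status()
    return r.json()["results"]["bindings"]

# The grouped query reported ~89,498 for dbo:Place; the direct count ~754,450.
print([b for b in run(GROUPED) if "Place" in b["type"]["value"]])
print(run(DIRECT))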

@kidehen


kidehen commented Dec 3, 2013

On 12/2/13 7:53 PM, Jörn Hees wrote:

I'm trying to get a top type count for DBpedia (Virtuoso version
07.00.3207 on Linux (x86_64-redhat-linux-gnu), Single Server Edition):

select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
limit 50

returns (apart from other rows) this row [1]:
http://dbpedia.org/ontology/Place 89498

[1] http://dbpedia.org/sparql?default-graph-uri=&qtxt=select+%3Ftype+count%28distinct+%3Fs%29+as+%3Fc+where+%7B%0D%0A++%3Fs+a+%3Ftype.%0D%0A%7D%0D%0Agroup+by+%3Ftype%0D%0Aorder+by+desc%28%3Fc%29%0D%0Alimit+50&format=text%2Fhtml&timeout=30000&debug=on

Out of curiosity i checked this again with the query below:

select count(distinct ?s) where { ?s a <http://dbpedia.org/ontology/Place> }

tells me it's 754450 [2]

[2] http://dbpedia.org/sparql?default-graph-uri=&qtxt=select+count%28distinct+%3Fs%29+where+%7B+%3Fs+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FPlace%3E+%7D&format=text%2Fhtml&timeout=30000&debug=on

There's an order of magnitude difference in these 2 counts. Please
tell me I'm doing it wrong.

PS: i tried the first query without the group by, order by and limit
clause, doesn't make a difference.



Extending the timeout parameter increases the time allotted to producing
the query solution. This feature is critical to enabling the whole world
to use DBpedia, rather than letting specific queries monopolize processing time.

See the different results produced when I doubled the processing time:
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+%3Ftype+count%28distinct+%3Fs%29+as+%3Fc+where+%7B%0D%0A++%3Fs+a+%3Ftype.%0D%0A%7D%0D%0Agroup+by+%3Ftype%0D%0Aorder+by+desc%28%3Fc%29%0D%0Alimit+50&format=text%2Fhtml&timeout=60000&debug=on

Note that there are also hard limits configured on the server that override
whatever may come in from an HTTP client.
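
A small sketch of passing that larger timeout parameter programmatically; Python with the requests library is assumed, with values taken from the URL above:

import requests

# timeout here is Virtuoso's query-timeout parameter in milliseconds,
# not an HTTP client-side timeout; 60000 doubles the 30000 used earlier.
r = requests.get("http://dbpedia.org/sparql", params={
    "default-graph-uri": "http://dbpedia.org",
    "query": "select ?type count(distinct ?s) as ?c where { ?s a ?type. } "
             "group by ?type order by desc(?c) limit 50",
    "format": "text/html",
    "timeout": "60000",
})
print(r.status_code, len(r.text))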

Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen

@indeyets


indeyets commented Dec 3, 2013

@kidehen timeouts are understandable. But giving a wrong result because of a timeout is a whole different story.

Shouldn't it report failure instead? That's what Fuseki does, for example.

Neither outcome is helpful, but Fuseki doesn't provide false results.

@kidehen


kidehen commented Dec 3, 2013

On 12/3/13 6:01 AM, Alexey Zakhlestin wrote:

@kidehen https://github.com/kidehen timeouts are understandable. But
giving a wrong result because of a timeout is a whole different story.

shouldn't it report failure instead? that's what fuseki does, for example.

both outcomes are not helpful, but fuseki doesn't provide false results

This isn't a false result.
This is a solution to the query within the constraints of a timeout.
The server should indicate via HTTP response metadata the nature of the
solution i.e., partial or complete.
This is a feature of Virtuoso.

@joernhees


joernhees commented Dec 3, 2013

Thanks, the time limit explains a bit, but this "feature" is highly confusing, if not dangerous, because the user (in this case me, and I'm not exactly a novice) might not be aware that all the counts might be terribly wrong.

Is there any way to distinguish a "cut-off" result from one which is accurate?

I had assumed that a query which hits a timeout limit would return with an error (something like a 408, even though I'm not sure it's actually the right one) instead of silently returning wrong results.

@kidehen


kidehen commented Dec 3, 2013

On 12/3/13 9:18 AM, Jörn Hees wrote:

Thanks, the time limit explains a bit, but this "feature" is highly
confusing if not dangerous because the user (in this case me and i'm
not exactly a novice) might not be aware that all the counts might be
terribly wrong.

There is a DBpedia fair-use document [1][2] about this matter. You won't
have this issue if you are running your own Virtuoso instance with the
DBpedia dataset. Please remember, on the World Wide Web we have to cater
for everyone. The Web presents unique challenges to DBMS technology that
we address in Virtuoso, specifically.

Is there any way to distinguish a "cut-off" result from one which is
accurate?

This should be part of the response headers. Note:
X-SQL-State: S1TAT

I had assumed that a query which hits a timeout limit would return
with an error (something like a 408, even though i'm not sure it's
actually the right one) instead of silently returning wrong results.

Yes, there has to be HTTP response metadata indicating the state of
affairs, to the degree possible. The problem right now is that we don't
have any standardization here. 408 doesn't cut it because it implies the
request couldn't be completed. In our case, we are completing a task
within a set time that has been reached.

The closest analogy here is a quiz contest where you have X seconds to
answer a question; this is the very model around which Virtuoso's query
engine has been developed.



Links:

[1] http://dbpedia.org/OnlineAccess -- search on "Fair Use Policy"
[2] http://lists.w3.org/Archives/Public/public-lod/2011Aug/0028.html
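
A minimal client-side sketch of testing for that header (Python with the requests library assumed; the header name is as given above, and the query is illustrative):

import requests

r = requests.get("http://dbpedia.org/sparql", params={
    "query": "select ?type count(distinct ?s) as ?c where { ?s a ?type. } group by ?type",
    "format": "text/csv",
    "timeout": "30000",
})

# Virtuoso signals an anytime-query cutoff via a custom SQL-state header.
if r.headers.get("X-SQL-State") == "S1TAT":
    print("partial result:", r.headers.get("X-SQL-Message"))
else:
    print("no partial-result indicator in the response headers")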


@joernhees


joernhees commented Dec 4, 2013

@kidehen neither of the links you provide describes or warns of the reported problem: that counts can be wrong if a timeout is hit.

I don't seem to have gotten my point across, let me try again:

I'm not arguing against fair use, timeouts or limits that help you satisfy more users. I'm a fan!

I'm arguing against the way you're treating a timeout.
If a query takes too long there are two ways of dealing with this:

  1. Return an error, not a result. This makes a developer, user or scientist (with a quick one-off sparql query in your web interface) look into it again. They will definitely not run the risk of using a wrong result, as there is none!
  2. Return a result with a BIG WARNING. This probably leads to the warning being lost somewhere in the process, never being shown to the user, and the numbers being taken for granted in the end. This is what happened here. Not even your own HTML result page of your SPARQL Web Interface shows the tiniest hint to the user that he should be careful. Even if you're arguing that this is not the prime "end user"... can you name a widely used sparql client / lib which handles this correctly?

I can see your point of view, trying to answer the query as well as you can in the given time, but as this report demonstrates it is more dangerous than just returning an error.

@joernhees


joernhees commented Dec 4, 2013

One addendum, sorry:
it should be optional to get partial results, not an implicit default that you then have to check for

@kidehen


kidehen commented Dec 4, 2013

On 12/4/13 6:50 AM, Jörn Hees wrote:

one addendum, sorry:
it should be optional to get partial results, not an implicit default
that you then have to check for

Not for the public DBpedia instance. We are deliberately not giving
anyone the ability to hog the instance. The instance has to be
accessible to the whole world; that's the basic requirement. Again, for
those who want to make specific use of DBpedia there is a range of
options for running your own instance:

  1. local setup
  2. cloud setup -- e.g., Amazon AMI.

There are also other instances of the DBpedia data:

  1. http://lod.openlinksw.com/sparql -- LOD Cloud cache (this setup has
    more computing power behind it)
  2. http://dbpedia-live.openlinksw.com/sparql
  3. http://live.openlinksw.com/sparql.




@kidehen


kidehen commented Dec 4, 2013

On 12/4/13 6:49 AM, Jörn Hees wrote:

@kidehen https://github.com/kidehen neither of the links you provide
describe/warn of the reported problem: that counts can be wrong if a
timeout is hit.

I don't seem to have gotten my point across, let me try again:

I'm /not/ arguing with fair use, timeouts or limits to be able to
satisfy more users. I'm a fan!

I'm arguing with the way you're treating a timeout.
If a query takes too long there are two ways of dealing with this:

What you are not getting from my comment is the fact that there isn't a
notion of "a query taking too long"; the notion is "what solution can be
produced in X amount of seconds, for a given query". There's a world of
difference here. The technical challenge is old; SQL DBMS engines never
even got to tackling this issue, since their usage context (closed world)
doesn't expose the problem.

With DBpedia and the Web, everything is unpredictable. Data is
fundamentally time-variant.

  1. return an error, not a result
  2. return a result with a BIG WARNING

We return an indicator via HTTP response (which you can test for) re.
partial results.

makes a developer, user or scientist (with a quick one-off sparql
query in your web interface) look into it again. They will
definitely not run the risk of using a wrong result, as there is
none!

DBpedia isn't a gospel of any kind. That isn't the purpose here. Please
think about your request a little. You want a full Transitive Closure
intermingled with entailments for all the entity relationship semantics
in the data space? I am sure (as you digest that last sentence and its
implications) you get the point re. the nature of the pursuit and its
fundamental impracticalities, at Web scale.

probably leads to the warning being lost somewhere in the process,
never be shown to the user and the numbers taken for granted in
the end. This is what happened here. Not even your own HTML result
page of your SPARQL Web Interface shows the tiniest hint /to the
user/ that he should be careful. Even if you're arguing that this
is not the prime "end user"... can you name a widely used sparql
client / lib which handles this correctly?

I can see your point of view trying to answer the query as good as you
can in the given time, but as this report demonstrates it is more
dangerous than just returning an error.

It is dangerous to attempt the opposite i.e., have no restrictions and
let clients deliberately or inadvertently deprive others of use. As I
said, there are other options for special use of DBpedia. It isn't right
to assume DBpedia is there to produce complete solutions for any kind
of query, issued by any kind of client, at any given point in time.

We have made a choice to make DBpedia available to the world, backed up
with usage restrictions that defend the goal :-)




@joernhees


joernhees commented Dec 9, 2013

On 4 Dec 2013, at 14:09, Kingsley Idehen notifications@github.com wrote:

It isn't right
to assume DBpedia is there to produce complete solutions for any kind
of query, issued by any kind of client, at any given point in time.

Are you sure that this is your statement?
It's a marketing disaster.
And it's not what I'm asking for / reporting here as a problem.

I just wanted the correct counts for types used on the DBpedia endpoint.
There is no open world assumption in my query:
I'm neither asking the SPARQL endpoint to resolve redirects, nor is this a federated query.

All I'm asking for are its counts at the current point in time.
Nothing fancy, and I could happily live with an error due to exceeded time.

Partial results are cool when you ask for them (explicitly); I didn't, and most people don't.

If a partial result wasn't explicitly asked for, reporting it in a fashion very similar to a complete result is more dangerous than reporting an error.

Ask a couple of developers what they would expect to happen…
Would they rather get an error, or a result that looks quite right but isn't?

Cheers,
Jörn

@kidehen


kidehen commented Dec 9, 2013

On 12/9/13 12:37 PM, Jörn Hees wrote:

On 4 Dec 2013, at 14:09, Kingsley Idehen notifications@github.com
wrote:

It isn't right
to assume DBpedia is there to produce complete solutions for any kind
of query, issued by any kind of client, at any given point in time.

Are you sure that this is your statement?

My statement is this:

DBpedia is going to produce solutions to SPARQL queries subject to
timeout limits and other constraints that have been deliberately
configured to ensure global access, in line with its fair use policy.
This is how DBpedia's SPARQL endpoint has been configured to run since
inception.

It's a marketing disaster.

DBpedia isn't about marketing. I am making a statement about the
technical infrastructure behind the DBpedia SPARQL endpoint.

And it's not what i'm asking for / reporting here as a problem.

You are reporting the fact that you are executing a specific query that
(in the form you are seeking) exceeds some of the fair use constraints.
There are other instances of the DBpedia dataset, associated with
different infrastructure, that will give you more computing power per
timeout restriction, etc. Examples include:

[1] http://lod.openlinksw.com/sparql -- all you have to do is simply
change the host part of your SPARQL Protocol URL to see what I mean re.
this cluster edition of Virtuoso which also has more computing power
behind it.

[2] http://dbpedia-live.openlinksw.com -- which doesn't match LOD for
capacity but can have less concurrent traffic than the main dbpedia.org
SPARQL endpoint.

I just wanted the correct counts for types used on the DBpedia endpoint.

The DBpedia Endpoint is one point of access for the DBpedia dataset. An
Endpoint != a Dataset. It is a service that provides access to a
dataset. There are other services providing access to the same dataset
that are configured for more intensive use of the data. The main
endpoint is for the whole world, and that's the focus of its
configuration, i.e., everyone (human or machine) has fair use of the
endpoint.

There is no open world assumption in my query:
i'm neither asking the SPARQL endpoint to resolve redirects, nor is
this a federated query.

That doesn't eradicate entailment, transitive closures, and related
matters. Even if you aren't actually de-referencing HTTP URIs, does that
apply to all other agents (human or machine) ?

All i'm asking for are its counts at the current point in time.

You are one of many.

Nothing fancy and i could happily live with an error due to time
exceeded.

There are other endpoints, change the hostname part of the URL, as I've
already told you.

Partial results are cool when you ask for them (explicitly), i didn't
and most people don't.

I guess Google gives you complete results?

If not explicitly asked for a partial result it's more dangerous to
report them in a very similar fashion to a complete result than
reporting an error.

For you and your use case. You are but one agent.

Ask a couple of developers what they would expect to happen…
Rather get an error or a result that looks quite right but isn't?

It has been so since 2007, so I don't understand what you are making a fuss
about, especially when the lod.openlinksw.com/sparql instance will more
than likely get you a complete answer, based on the nature of its
configuration.

Cheers,
Jörn




@joernhees


joernhees commented Dec 9, 2013

On 9 Dec 2013, at 20:18, Kingsley Idehen notifications@github.com wrote:

You are reporting the fact that you are executing a specific query that
(in the form you are seeking) exceeds some of the fair use constraints.

Wrong, read again.

I'm fine with that and always have been.

I report that how you treat a timeout is bad, not that there is a timeout.

End of my feedback.

Thanks for your time.

j

@kidehen


kidehen commented Dec 9, 2013

On 12/9/13 2:56 PM, Jörn Hees wrote:

On 9 Dec 2013, at 20:18, Kingsley Idehen notifications@github.com wrote:

You are reporting the fact that you are executing a specific query that
(in the form you are seeking) exceeds some of the fair use constraints.

wrong, read again

i'm fine with that and always have been

i report that how you treat a timeout is bad, not that there is a
timeout

end of my feedback

thanks for your time

j



I am not claiming that the timeout treatment is perfect. I've told you
repeatedly that we are using a custom HTTP header due to the lack of a
standard header for this situation.

This isn't a 408 or 500 condition.

@joernhees


joernhees commented Aug 6, 2014

@iv-an-ru any update on this?

@sebastianthelen


sebastianthelen commented Sep 18, 2014

http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSScalableInference contains a paragraph about partial query answering.

Apparently you get a hint that a query result is incomplete when executing it in isql (haven't tested it though).

@jindrichmynarz


jindrichmynarz commented Dec 10, 2014

What is the custom HTTP header that is returned for partial results? Where is it documented?

@jindrichmynarz


jindrichmynarz commented Dec 11, 2014

@kidehen: The only headers I see in responses from Virtuoso are Accept-Ranges, Cache-Control, Expires, Server, Connection, Content-Length, Content-Type and Date. I don't see any custom header that would indicate partial results. This is what I get when running SELECT * WHERE { ?s ?p ?o . }, which is trimmed by ResultSetMaxRows set to 10000 in virtuoso.ini, on the latest develop version of Virtuoso.

@kidehen


kidehen commented Dec 11, 2014

On 12/11/14 3:25 AM, Jindřich Mynarz wrote:

@kidehen : The only headers I see in
responses from Virtuoso are Accept-Ranges, Cache-Control,
Expires, Server, Connection, Content-Length, Content-Type
and Date. I don't see any custom header, which would indicate
partial results. This is what I get when running SELECT * WHERE { ?s ?p ?o . }, which is trimmed by ResultSetMaxRows set to 10000 in
virtuoso.ini, on the latest develop version of Virtuoso.

We do provide a number of response headers of which X-SQL-State: S1TAT
is our fundamental partial results indicator.

Example:

curl -I "http://lod.openlinksw.com/sparql?default-graph-uri=&query=select+distinct+*+where+%7B%5B%5D+a+%3Fo%7D+limit+50&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on"

HTTP/1.1 200 OK
Date: Thu, 11 Dec 2014 12:26:30 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 72
Connection: keep-alive
Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu  VDB
Accept-Ranges: bytes
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout.  Activity:     17 rnd    120K seq      0 same seg   0 same pg      0 same par      0 disk 0 spec disk  856.6KB /     72 mes
X-Exec-Milliseconds: 31315
X-Exec-DB-Activity: 17 rnd    120K seq      0 same seg       0 same pg      0 same par      0 disk      0 spec disk  856.6KB /     72 messages     11 fork

Links:

[1] http://docs.openlinksw.com/virtuoso/anytimequeries.html
[2] http://lists.w3.org/Archives/Public/public-lod/2013Jun/0004.html

@jindrichmynarz


jindrichmynarz commented Dec 11, 2014

Thanks for the explanation. However, I wasn't able to reproduce it on any other Virtuoso endpoint. For example, using the public DBpedia endpoint to execute SELECT * WHERE { ?s ?p ?o . }:

curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt
wc -l results.csv
# => 10001, i.e. trimmed results
cat headers.txt
# HTTP/1.1 200 OK
# Date: Thu, 11 Dec 2014 14:33:28 GMT
# Content-Type: text/csv; charset=UTF-8
# Content-Length: 1484509
# Connection: keep-alive
# Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu  VDB
# Expires: Thu, 18 Dec 2014 14:33:28 GMT
# Cache-Control: max-age=604800
# Accept-Ranges: bytes
# => i.e. no X-SQL-State header

Is the custom header only sent:

  • For specific versions of Virtuoso?
  • For specific configurations of Virtuoso?
  • With additional query parameters (e.g., CXML_redir_for_subjs, CXML_redir_for_hrefs, timeout and debug=on from your example)?
@kidehen


kidehen commented Dec 11, 2014

On 12/11/14 9:38 AM, Jindřich Mynarz wrote:

Thanks for the explanation. However, I wasn't able to reproduce it on
any other Virtuoso endpoint. For example, using the public DBpedia
endpoint to execute SELECT * WHERE { ?s ?p ?o . }:

curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt
wc -l results.csv
# => 10001, i.e. trimmed results
cat headers.txt
# HTTP/1.1 200 OK
# Date: Thu, 11 Dec 2014 14:33:28 GMT
# Content-Type: text/csv; charset=UTF-8
# Content-Length: 1484509
# Connection: keep-alive
# Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu VDB
# Expires: Thu, 18 Dec 2014 14:33:28 GMT
# Cache-Control: max-age=604800
# Accept-Ranges: bytes
# => i.e. no X-SQL-State header

Is the custom header only sent:

  • For specific versions of Virtuoso?
  • For specific configurations of Virtuoso?
  • With additional query parameters (e.g., CXML_redir_for_subjs,
    CXML_redir_for_hrefs, timeout and debug=on from your example)?

Because the DBpedia instance has the following in its [SPARQL] INI section:
ResultSetMaxRows = 10000

Meaning:
The maximum SPARQL solution size for this instance is 10,000 records
(for SELECT) [1], 10,000 entity description triples (for DESCRIBE,
which is the most taxing) [2], and 10,000 triples (for CONSTRUCT) [3].
This limit, combined with the query timeout, is what determines invocation
of the "anytime query" feature, which is what leads to partial results,
while processing of the solution continues within the next processing
timeout cycle.

Links:

[1]
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+*+where+%7B%3Fs+a+%3Fo%7D+limit+1&format=text%2Fhtml&timeout=30000&debug=on
-- SELECT with LIMIT 1

[2]
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=describe+%3Fs+where+%7B%3Fs+a+%3Fo%7D+limit+1&format=application%2Fx-nice-turtle&timeout=30000&debug=on
-- DESCRIBE with LIMIT 1

[3]
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=construct+%7B%3Fs+a+%3Fo%7D+where+%7B%3Fs+a+%3Fo%7D+limit+1&format=application%2Fx-nice-turtle&timeout=30000&debug=on
-- CONSTRUCT with LIMIT 1.


@jindrichmynarz


jindrichmynarz commented Dec 12, 2014

Kingsley, it seems you haven't got my question. I'm well aware of the effect of the ResultSetMaxRows configuration. Let me try to clarify. My question was about the missing HTTP header indicating partial results. The problem is:

  1. Do curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt.
  2. Receive partial results (exactly because of the ResultSetMaxRows=10000).
  3. cat headers.txt => No header indicating partial results is there.

So, this indicates that receiving partial results is not a sufficient condition for Virtuoso to provide the HTTP header informing that it indeed sent partial results. What are the necessary conditions for a SPARQL request in order for Virtuoso to send a response with the HTTP header indicating partial results?

@kidehen


kidehen commented Dec 12, 2014

On 12/12/14 3:50 AM, Jindřich Mynarz wrote:

Kingsley, it seems you haven't got my question. I'm well aware of the
effect of the |ResultSetMaxRows| configuration. Let me try to clarify.
My question was about the missing HTTP header indicating partial
results. The problem is:

  1. Do curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt.
  2. Receive partial results (exactly because of ResultSetMaxRows=10000).
  3. cat headers.txt => No header indicating partial results is there.

So, this indicates that receiving partial results is not sufficient
condition for Virtuoso to provide the HTTP header informing that it
indeed sent partial results. What are the necessary conditions of a
SPARQL request in order for Virtuoso to send a response with the HTTP
header indicating partial results?

Arriving at the resultset size, for the solution, within the timeout. For your
example, we already have a solution, and a 10K resultset, within 30000
msec. Thus, no response headers. Put differently, Virtuoso found 10K
triples in less than 30,000 msec.

Action Item:
A new custom header is being added for this scenario (it will be live by
the time you read this mail), so as to provide additional information in
this situation. Basically, X-MaxRows: {ini-hard-limit-value}, which in
this case would be 10,000. Note the maximum is 2,000,000 for Virtuoso.


@jindrichmynarz


jindrichmynarz commented Dec 13, 2014

OK, I see that if I run curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt, I find X-SPARQL-MaxRows: 10000 in the response headers. This is useful, but it doesn't tell me if I have received partial results, because it may be the case that the total number of results is the same as ResultSetMaxRows.

In order to tell if I received partial results I need to execute an additional query, which is my original query wrapped in SELECT (COUNT(*) AS ?count) WHERE { { ... } }:

curl "http://dbpedia.org/sparql?query=SELECT+%28COUNT%28*%29+AS+%3Fcount%29+WHERE+%7B+%7B+SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D+%7D+%7D&format=text%2Fcsv"

The response for this query tells me there are 943138267 results for my query in total. Given that I know this number I can compare it with the number from the X-SPARQL-MaxRows header and conclude that I have indeed received partial results.

As you can see, executing twice as many queries just to be sure one's not receiving partial results is hardly optimal from the developer's perspective. I think a more developer-friendly solution might be to have an HTTP response header serving as a boolean flag indicating whether results are partial, irrespective of the cause of incompleteness (e.g., ResultSetMaxRows or the timeout query parameter).
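
Sketched as code, that two-query check might look like this (Python with the requests library assumed; header and wrapper query as described above):

import csv
import io
import requests

ENDPOINT = "http://dbpedia.org/sparql"
QUERY = "SELECT * WHERE { ?s ?p ?o . }"

r = requests.get(ENDPOINT, params={"query": QUERY, "format": "text/csv"})
rows = list(csv.reader(io.StringIO(r.text)))[1:]  # drop the CSV header row
max_rows = r.headers.get("X-SPARQL-MaxRows")

if max_rows is not None and len(rows) >= int(max_rows):
    # Ambiguous case: the result may be complete or cut off at the limit.
    # Resolve it by asking for the total via the wrapped COUNT query.
    count_q = "SELECT (COUNT(*) AS ?count) WHERE { { %s } }" % QUERY
    c = requests.get(ENDPOINT, params={"query": count_q, "format": "text/csv"})
    # Assumes the count comes back as a bare number on the second CSV line.
    total = int(c.text.splitlines()[1])
    print("partial" if total > len(rows) else "complete", "of", total)
else:
    print("complete:", len(rows), "rows")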

@joernhees


joernhees commented Dec 15, 2014

The whole point of this issue was that the current treatment with its 200 status code and additional headers is too implicit for end users, as well as most developers and libraries.

I'm begging you: can we please not serve timeouts / cut-off result sets with a 200 HTTP status code?
Rather, serve them with a 206 status code or some self-invented 555 (server reached some limits, partial result only), and then add the headers on top of that, so one can find out what happened?

@kidehen


kidehen commented Dec 15, 2014

On 12/13/14 9:17 AM, Jindřich Mynarz wrote:

OK, I see that if I run curl
"http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv"
-o results.csv -D headers.txt, I find X-SPARQL-MaxRows: 10000 in
the response headers. This is useful, but it doesn't tell if I have
received partial results, because it may be the case that the total
number of results is the same as ResultSetMaxRows.

It is indicating to you that you have a resultset size of 10000.

SELECT * FROM {Some-Table} in the "closed world" SQL RDBMS realm and
SELECT * WHERE {?s ?p ?o} in the "open world" RDF RDBMS realm both involve:

  1. query parsing
  2. solution preparation
  3. result set retrieval.

Steps 1-2 are the scope of the query timeout, while LIMIT is the resultset
size, as regards Virtuoso.

Basically, LIMIT indicates the maximum size of the resultset for retrieval. In a
SQL RDBMS setup, you scroll through the resultset using "scrollable
cursors" (which have modalities such as snapshot, static, keyset,
dynamic, and mixed [keyset and dynamic]).

In order to tell if I received partial results I need to execute an
additional query, which is my original query wrapped in SELECT
(COUNT(*) AS ?count) WHERE { { ... } }:

curl "http://dbpedia.org/sparql?query=SELECT+%28COUNT%28*%29+AS+%3Fcount%29+WHERE+%7B+%7B+SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D+%7D+%7D&format=text%2Fcsv"

The response for this query tells me there are 943138267 results for
my query in total. Given that I know this number I can compare it with
the number from the |X-SPARQL-MaxRows| header and conclude that I have
indeed received partial results.

As you can see, executing twice as many queries just to be sure one's
not receiving partial results is hardly optimal from the developer's
perspective.

Err... it is, in the context of what you are trying to emulate, i.e., a
scrollable cursor. Even when doing this on the SQL side of things, the
DBMS will build one of the following, which have costs:

  1. a keyset from all the keys in the tables of a query, in advance
  2. a keyset created dynamically per cursor scroll
  3. a partial keyset that is replenished during scrolling.

I think a more developer-friendly solution might be to have a HTTP
response header serving as a boolean flag indicating if results are
partial or not, irrespective of the cause of incompleteness (e.g.,
|ResultSetMaxRows| or |timeout| query parameter).

This so-called developer cost burden isn't for Virtuoso to bear; it is
for the developer, until SPARQL has some cursor-like mechanism
specified. Right now, we could just decide to return false, since an "open
world" query doesn't (theoretically) have a known complete solution, let
alone a solution size.

Performance optimizations in Virtuoso enable you (the developer) to get
your count returned quickly. Basically, to each client their heuristic
for paging through data.

Kingsley




@IvanMikhailov


IvanMikhailov commented Dec 15, 2014

I'm sorry I've failed to push the idea of warnings into the SPARQL Protocol spec. "Anytime queries" did not exist at that moment, but I was sure that "OK" and "error" is too black-and-white for the real world. Now I don't know any good solution. X-this and X-that headers are informative, but the method has a fatal flaw: one doesn't look at those hidden texts until a rude error appears.

@jindrichmynarz


jindrichmynarz commented Dec 15, 2014

@kidehen: So your recommended solution to determine if a query has partial results is to execute an additional COUNT query?

I don't believe reporting that a response to a query has partial results has a significant cost for Virtuoso. This can be added just in the cases where Virtuoso trims the result size (e.g., due to ResultSetMaxRows or timeout). No additional computation is needed.

BTW, SPARQL is usually (e.g., [1], [2]) said to have semantics based on a closed-world assumption.

@joernhees and @iv-an-ru: I agree that a non-200 HTTP code would be nicer, but having some way to tell partial results (e.g., a custom header) is better than no way.

@kidehen


kidehen commented Dec 15, 2014

On 12/15/14 6:15 AM, Jörn Hees wrote:

The whole point of this issue was that the current treatment with its
200 status code and additional headers is too implicit for end users,
as well as most developers and libraries.

I'm begging you: can we please not serve timeouts / cut off result
sets with a 200 http status code?

For a query solution that has a fixed resultset size, based on a hard
limit, a 200 OK status is accurate. Now what we could consider is some
override modality under which an instance owner sets alternative
response codes for timeout being exceeded. What we can't do is just move
away from 200 OK when:

  1. we have a resource for the "open world" query in question
  2. there are many HTTP clients that treat anything other than 200 OK as
    a fault.

Rather serve them with a 206 status code or some other self invented
555 (server reached some limits, partial result only).

We can make these configurable by the instance owner, should they not
want to work with our defaults.

Then add the headers on top of that, so one can find out what happened?

Conditionally (by way of instance config), as indicated above.


@kidehen


kidehen commented Dec 15, 2014

On 12/15/14 8:33 AM, Jindřich Mynarz wrote:

@kidehen https://github.com/kidehen: So your recommended solution to
determine if a query has partial results is to execute an additional
|COUNT| query?

I am saying to you that the issue of cursors is common, not new. It is
solved by a spec having a notion of cursors, or by a developer implementing
that client-side. What you don't do, as the server provider, is implement
it in a way that simply introduces performance overhead that isn't
understood by clients.

I don't believe reporting that response to a query has partial results
has a significant cost for Virtuoso.

You want us to tell you that the hard LIMIT is X out of a total of Y.
And I am saying you can figure that out, as you alluded to, on the client
side, in your code. That isn't a cost for the server to bear for your
specific use-case scenario.

If you were running your own Virtuoso instance, you can opt to not have
a hard limit in the INI, or set it to the max of 2 million.

This can be added just in case Virtuoso trims results size (e.g., due
|ResultSetMaxRows| or |timeout|). No additional computation is needed.

"Anytime Query" is about query solution preparation and resultset
retrieval within configurable time limits, and max resultset sizes.

BTW, SPARQL is usually (e.g., 1
http://web.ing.puc.cl/%7Emarenas/publications/pods11b.pdf, 2
http://ceur-ws.org/Vol-1272/paper_50.pdf) said to have semantics
based on closed-world assumption.

Even if it did, you are seeking something that isn't offered by other
DBMS engines, i.e., an ability to provide a complete response to SELECT *
FROM {Source}, irrespective of the size of the source, and irrespective of
how many clients are performing the very same query. It cannot work,
i.e., you cannot have such a thing on the Web, which is why there aren't
any SQL query endpoints that allow ad-hoc queries to massive tables. Or
am I missing some new public DBMS instance that offers such a capability?

@joernhees https://github.com/joernhees and @iv-an-ru
https://github.com/iv-an-ru: I agree that a non-HTTP 200 code would
be nicer, but there having /some/ way how to tell partial results
(e.g., custom header) is better than no way.

See my response to the HTTP 200 matter.




@jindrichmynarz


jindrichmynarz commented Dec 15, 2014

I get that if SPARQL had cursors this would be solved differently.

What you don't do, as the server provider is implement
that in a way that simply introduces performance overhead that isn't
understood by clients.

Sorry, I have trouble parsing this sentence.

You want us to tell you that the hard LIMIT is X out of a total of Y.

No. This is already provided by the X-SPARQL-MaxRows header that you have recently introduced into Virtuoso. What I would like to know instead is whether this hard limit was applied. If X-SPARQL-MaxRows = 10000 and I receive 10000 results, there's no way of telling if it's a complete or a partial result set without executing an additional COUNT query. Is there another way I'm missing?

@joernhees


joernhees commented Dec 15, 2014

I sense some anger in this discussion again. I think this is coming from different points of view rather than anyone attacking Virtuoso. You guys are doing an awesome job. So awesome that we developers come to you, as the de facto lead in public SPARQL endpoints, to give feedback and ask for things which would make our lives easier / reduce misunderstandings in development.

I guess all our feedback in this issue boils down to this: we as developers want to be able to handle partial results better when communicating with Virtuoso endpoints.

I think there are several dimensions to this, which are entangled in our discussion:

  • scope (open / closed world?)
  • halting problem: can the server tell that a query was hitting its configured boundaries (I'm avoiding the word LIMIT so it's not confused with the SPARQL clause)? (timeout / max result size / ...?) Can it tell if any of the other endpoints it asks hit them?
  • form of presentation (visibility to the developer?)

Thinking about all three I was reminded of a tiny but powerful rule from the Python zen: "explicit is better than implicit".

What could this mean for the dimensions above (just as thoughts):

  • scope: isn't it OK to treat a query under a closed-world assumption unless it is a federated query / asks for sponging? (So closed world until explicitly stated otherwise?)
  • halting problem:
    • if the scope was closed world, wouldn't the server know if it hits some configured boundary, and couldn't it just tell us? (I'm not sure how complex this would be internally, but I guess whatever detects that a boundary is exceeded and stops execution could probably also add that information to the result.)
    • Obviously an open world assumption is a different story, but shouldn't the server still be able to inform us when it hits its own boundaries / is waiting for a third party for too long / a third party maybe exceeded its boundaries? (chicken & egg problem)
  • form of presentation:
    • If the client-side query explicitly states the boundary which is exceeded, I guess a 200 status code with a partial result is OK.
    • If the server runs into boundaries the client query didn't explicitly state (e.g., some defaults, fairness-of-use boundaries, etc.), then the result should rather not be a 200, as it doesn't force developers to deal with them correctly. Still, the partial content could be delivered, either as 206 or in the 5xx range; see the sketch below.
    • In both cases (as @jindrichmynarz seems to suggest): headers which explain what happened, and which potentially reduce follow-up queries "just to find out if a result was partial" / tell which boundaries were hit, would be great.
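
To make the status-code point concrete, a hypothetical sketch of client handling under the proposed scheme; the 206/555 behaviour is a suggestion in this thread, not current Virtuoso behaviour, and Python with the requests library is assumed:

import requests

def fetch(endpoint, query):
    r = requests.get(endpoint, params={"query": query, "format": "text/csv"})
    if r.status_code == 200:
        return r.text, False           # complete result
    if r.status_code in (206, 555):    # hypothetical: partial result served
        return r.text, True            # with a distinct code plus headers
    r.raise_for_status()               # 4xx/5xx become exceptions
    raise RuntimeError("unexpected status %d" % r.status_code)

body, partial = fetch("http://dbpedia.org/sparql",
                      "SELECT * WHERE { ?s ?p ?o . }")
if partial:
    print("warning: result set is incomplete")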

Chicken & egg problem:

You can read the first paragraph of this post another way: because you're the de facto lead for public SPARQL endpoints, your defaults are pretty close to becoming the standard. If your default treatment of partial results is not informative for the closed world case, then it can never be for federated queries.

@kidehen


kidehen commented Dec 15, 2014

On 12/15/14 9:42 AM, Jindřich Mynarz wrote:

I get that if SPARQL had cursors this would be solved differently.

What you don't do, as the server provider is implement
that in a way that simply introduces performance overhead that isn't
understood by clients.

Sorry, I have trouble parsing this sentence.

You want us to tell you that the hard LIMIT is X out of a total of Y.

I am telling you that we have the following distinct items:

  1. query solution
  2. query resultset retrieval.

They are not the same thing.

We set a limit from which you fetch the solution in batches. This is
why we have the following INI excerpt, re. DBpedia:

[SPARQL]

ResultSetMaxRows           = 10000  ; maximum number of items allowed per resultset retrieval, associated with a query solution; use this to page through the solution
MaxQueryCostEstimationTime = 120    ; in seconds
MaxQueryExecutionTime      = 30     ; in seconds; time allowed to retrieve 10,000 items (for SELECT, CONSTRUCT, or DESCRIBE queries)
DefaultQuery               = select distinct * where {?s ?p ?o} limit 50  ; default query presented by the endpoint page

No. This is already provided by the |X-SPARQL-MaxRows| header that you
have recently introduced into Virtuoso. What I would like to know
instead is if this hard limit was applied. If |X-SPARQL-MaxRows =
10000| and I receive 10000 results, there's no way of telling if it's
complete or partial result set without executing additional |COUNT|
query. Is there another way I'm missing?

You have to perform the additional count query because these heuristics
are yours, not the DBMS's. Basically, Virtuoso will not do that for
you, as it is an expensive operation that totally skews what it is doing.
Can you point me to a DBMS that does that, available on the Web
anywhere? Do you for one second think even Google's results pages contain N
number of matches out of a total solution size?

In SQL, scrollable cursors are a feature. Net effect: they are distinct
from basic operations, i.e., you don't conflate
select * from table without cursors and the same query with cursors.
APIs like ODBC enable you to fetch data with or without scrollable cursors.

Recap:

A partial condition arises when Virtuoso can't produce a complete solution,
for a resultset of 10,000, within the timeouts outlined in the [SPARQL] INI
section (stanza) above.

Query Solution Size != Query Results Retrieval Max Items Size, at least
not in the case of Virtuoso.

Do you have an example of a DBMS product that offers what you are
seeking? Maybe we can make more progress based on such an example.




@jindrichmynarz


jindrichmynarz commented Dec 15, 2014

I am telling you that we have the following distinct items:

  1. query solution
  2. query resultset retrieval.

They are not the same thing.

I don't think I ever confused these two.

Basically, Virtuoso will not do that for you as it is an expensive operation that totally skews what it is doing.

I don't think we understand each other. Let me try to clarify. When Virtuoso trims the result set size to ResultSetMaxRows, it can just as well add an additional header indicating that the result set is trimmed. No additional computation is needed. You can hook this into the existing logic, which decides whether to trim result sets or not.

@kidehen


kidehen commented Dec 15, 2014

On 12/15/14 11:18 AM, Jörn Hees wrote:

I sense some anger in this discussion again. I think this is coming
from different points of view rather than anyone attacking Virtuoso.
You guys are doing an awesome job. So awesome that we developers come
to you as the defacto lead in public SPARQL endpoints to give feedback
and ask for things which would make our lives easier / reduce
misunderstandings in development.

I guess all our feedback in this issue boils down to that we as
developers want to be able to handle partial results better when
communicating with virtuoso endpoints.

I think there are several dimensions to this, which are entangled in
our discussion:

  • scope (open / closed world?)
  • halting problem: can the server tell that a query was hitting its
    configured boundaries (i'm avoiding the word |LIMIT| so it's not
    confused with the SPARQL clause)? (timeout / max result size /
    ...?) Can it tell if any of the other endpoints it asks hit them?
  • form of presentation (visibility to the developer?)

Thinking about all three i was reminded of a tiny but powerful rule
from the python zen: "explicit is better than implicit"
https://www.python.org/dev/peps/pep-0020/.

Yes, explicit is better than implicit for sure. But we also have to
understand the boundaries.

In the SQL realm, you would do one of the following:

  1. Use scrollable cursors -- the APIs differ per SQL RDBMS
  2. Use a generic API like ODBC or JDBC -- scrollable cursor
    implementations vary per driver (re. types supported and actual
    performance)
  3. Make your own cursoring -- this is how it was done pre-ODBC and JDBC.

What could this mean for the dimensions above (just as thoughts):

  • scope: isn't it OK to treat a query as closed world assumption
    unless it is a federated query / asks for sponging? (So closed
    world until explicitly stated otherwise?)

Yes, but even if it's "closed world" you have the issue of data volume
and access frequency to deal with, at Web scale.

  • halting problem:
    • if the scope was closed world, wouldn't the server know if it
    hits some configured boundary and could just tell us?

Yes, which is what it is doing. It tells you when it wasn't able to
complete results retrieval, based on the combination of the following
factors:

  1. query cost estimation
  2. query solution production.

Thus, given:

[SPARQL]
ResultSetMaxRows = 10000
MaxQueryCostEstimationTime = 120 ; in seconds
MaxQueryExecutionTime = 30 ; in seconds

It will indicate a partial resultset return via HTTP if it couldn't
prepare a resultset of 10,000 items within 30,000 msecs. What it isn't
doing is first making a count of the solution (or solution set, for
possible additional clarity) and then concluding that, because its
retrieval threshold per resultset is 10,000, this is a partial
solution when it isn't.

    • (I'm not sure how complex this would be internally, but I guess
    whatever detects that a boundary is exceeded and stops
    execution could probably also add that information to the result.)

It is doing that.

    • Obviously an open world assumption is a different story, but
    shouldn't the server still be able to inform us when it hits
    its own boundaries / is waiting for a third party for too long
    / a third party maybe exceeded its boundaries? (chicken & egg
    problem)

Re. SPARQL-FED we should have the same thing re. timeouts which can
affect all sorts of things e.g., unions of SERVICE based query patterns
in a query. Ditto unions of SQL queries of SQL Tables attached to Virtuoso.

  • form of presentation:
    • If the client-side query explicitly states the limit which is
      exceeded, I guess a 200 status code with a partial result is OK.
    • If the server runs into limits the client query didn't
      explicitly state (e.g., some defaults, fairness-of-use
      limits), then the result should rather not be a 200, as it
      doesn't force developers to deal with them correctly.

Which is why we can improve things here by making 20X configurable by
the instance owner. I say that because there are HTTP clients that could
fault on 20X because they are coded for 200 OK only.

    • Still, the partial content could be delivered, either as 206 or
      in the 5xx range...

Yes, if you configure your instance that way, when we add this feature
to the [SPARQL] INI section.

    • In both cases (as @jindrichmynarz seems to suggest):
      headers which explain and potentially reduce follow-up queries
      "just to find out if a result was partial" / which boundaries
      were hit would be great.

Yes, but he isn't distinguishing the solution size from the resultset
retrieval size, as implemented in Virtuoso. He would like delta
existence and size to be determined by Virtuoso and then used as the
basis for the notion of a "partial result" re. this "anytime query"
feature.

  Chicken & egg problem:

You can read the first paragraph of this post another way: Because
you're the defacto lead for public SPARQL endpoints, your defaults are
pretty close to becoming the standard. If your default treatment of
partial results is not informative for the closed world case, then it
can never be for federated queries.

Our short-term option is for these 20X responses to be configurable. In
addition, we need folks to accept the fact that Virtuoso distinguishes:

  1. query solution
  2. query solution result set retrieval size -- i.e., you can retrieve
    all the items associated with a solution in batches (each batch has a
    max results retrieval size), not in one go.

Another possibility, when we have the time, is to publish a guide for
emulating scrollable cursors via SPARQL, i.e., provide the SPARQL-client
heuristic for dealing with massive data, using SPARQL, at Web scale.
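
One client-side heuristic for such cursoring, sketched under the assumption that plain LIMIT/OFFSET paging with a fixed ORDER BY is acceptable (Python with the requests library assumed):

import requests

ENDPOINT = "http://dbpedia.org/sparql"
PAGE = 10000  # stay at or below the server's ResultSetMaxRows

def pages(pattern):
    # Yields successive batches of CSV rows; ORDER BY keeps pages stable,
    # at the cost of making each page query more expensive for the server.
    offset = 0
    while True:
        q = ("SELECT * WHERE { %s } ORDER BY ?s LIMIT %d OFFSET %d"
             % (pattern, PAGE, offset))
        r = requests.get(ENDPOINT, params={"query": q, "format": "text/csv"})
        rows = r.text.splitlines()[1:]  # drop the CSV header line
        if not rows:
            return
        yield rows
        offset += PAGE

total = 0
for batch in pages("?s a <http://dbpedia.org/ontology/Place>"):
    total += len(batch)
print(total)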




@kidehen


kidehen commented Dec 15, 2014

On 12/15/14 12:02 PM, Jindřich Mynarz wrote:

I am telling you that we have the following distinct items:

 1. query solution
 2. query resultset retrieval.

They are not the same thing.

I don't think I ever confused these two.

Basically, Virtuoso will not do that for you as it is an expensive
operation that totally skews what it is doing.

I don't think we understand each other. Let me try to clarify. When
Virtuoso trims the results set size to |ResultSetMaxRows|, it can as
well add an additional header indicating the results set is trimmed.

It doesn't TRIM the result set. It stops fetching data from addresses
(internal to the engine) associated with the solution. The fact that you
use the term "TRIM" indeed reveals the confusion. You trim from a
physical whole. That isn't what's happening here.

No additional computation is needed. You can hook this into the
existing logic, which decides whether to trim results sets or not.

I politely disagree with your assumptions.

I don't know if you have any experience with scrollable cursors in the
realm of SQL; if not, gaining some would help with this conversation. I know what
you want, but you don't seem to be accepting the paradoxical nature of
what you seek, from a DBMS perspective.

There is a reason why there are no live ad-hoc SQL RDBMS engines on the
Web (bar ours [1]), for any client to query.

[1] http://demo.openlinksw.com/XMLAexplorer/XMLAexplorer.html -- example
of an ad-hoc query service for SPARQL and SQL that's live on the Web.

Kingsley




@jindrichmynarz


jindrichmynarz commented Dec 15, 2014

OK, I see I may have used confusing terms (e.g., "trimming"). @kidehen, thank you for pointing that out.

I never meant to imply that Virtuoso first counts the size of a query result set and then trims its size to ResultSetMaxRows. What I asked about is that when Virtuoso reaches ResultSetMaxRows or the timeout and stops the query execution, it can make this explicit by, e.g., adding an HTTP header indicating a partial result set. This is what I meant when I said that no additional computation is needed.

@kidehen


kidehen commented Dec 15, 2014


Virtuoso doesn't stop query execution. The preparation of a solution
takes seconds. It's the retrieval of the items associated with the
solution that poses challenges re. transportation from the DBMS to the
client.

We have to move the items associated with the solution from Virtuoso's
internal space to that of a Virtuoso client. Our timeout condition
arises when we haven't prepared the solution items for transportation
(so to speak) via a conveyor that holds <= ResultSetMaxRows capacity.

In ODBC/JDBC (where these matters are handled with better clarity),
query resultset fetching is distinct from query solution preparation. A
client fetches resultset items from the DB server until there's nothing
left. If using cursors, you build keysets of different kinds: all keys
for the tables in the query, prepared prior to fetching; keys prepared
dynamically prior to each fetch; or a partial keyset of fixed size
that's only regenerated when that size is exceeded during a fetch.
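
For readers without that background, a rough sketch of the fetch
pattern being described, assuming pyodbc and a hypothetical ODBC DSN
named "VirtuosoDemo":

# Sketch of the ODBC-style pattern: the solution is prepared once by
# execute(), then the client fetches resultset items in batches until
# nothing is left. "VirtuosoDemo" is a hypothetical DSN.
import pyodbc

conn = pyodbc.connect("DSN=VirtuosoDemo")
cur = conn.cursor()
# Virtuoso accepts SPARQL over its SQL channel via the SPARQL keyword.
cur.execute("SPARQL SELECT ?s ?type WHERE { ?s a ?type }")

while True:
    batch = cur.fetchmany(500)  # retrieval size per fetch, not per query
    if not batch:
        break                   # solution exhausted
    for row in batch:
        print(row)              # stand-in for real client-side handling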

The bottom-line issue here is that we are paging (cursoring) through
the items that constitute a query solution. This matter isn't as
trivial as it might appear at first blush. Ultimately, we can make time
to provide an example that outlines a heuristic for clients trying to
work at this level of granularity.

The SPARQL Query Protocol, which is for all intents and purposes the
ODBC/JDBC equivalent for SPARQL queries, is what's lacking here.

We are going to need "Link:" headers on both the client and the server
to make this really work right, in a generic way, at Web scale. A
client has to indicate to the server that it wants to work with a
cursor, and the type of cursor should be negotiated between client and
server. Once negotiated, the keyset mechanism and size will be known,
and retrieval of results can be much smarter.

If we are going to do scrollable cursors, it should be done right, even
if this is via HTTP headers, without enhancing the SPARQL Protocol
directly. How about that?
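
To make the idea concrete, a purely speculative sketch; none of these
Link relations or cursor parameters exist, they only illustrate the
negotiation being proposed:

# Speculative: client asks for cursor-style retrieval via a Link header
# and follows the server's "next" link. No such relations exist today.
import requests

resp = requests.get(
    "http://dbpedia.org/sparql",
    params={"query": "SELECT ?s WHERE { ?s a ?type }"},
    # Hypothetical relation announcing the desired cursor type.
    headers={"Link": '<http://example.org/cursor#keyset>; rel="cursor-type"'},
)
# A cooperating server might answer with its chosen keyset size and a
# continuation link, e.g.:  Link: <...?cursor=abc123>; rel="next"
next_page = resp.links.get("next", {}).get("url")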




@joernhees


joernhees commented Jan 5, 2015

@kidehen I think all that @jindrichmynarz suggests is that Virtuoso could add a header if (speaking in your terminology) no timeout condition arises, but Virtuoso has prepared solution items for transportation (so to speak) via a conveyor that holds more than ResultSetMaxRows capacity.

In that case the server knows, without additional work, that the client won't get everything that's on the conveyor belt: a partial/truncated/limited (whatever word fits) result.

The thing is that if I write a SPARQL query with LIMIT 100 and 100 results are returned, I know I should probably try to continue (but I can't be sure). With that header I could be sure that I need to continue when it is present; the bad part is that I can't be sure I don't need to when it is absent.

But the header would be even more meaningful in other cases: what if I don't specify a LIMIT in my query at all? As a client I don't see the ResultSetMaxRows setting in virtuoso.ini (or do I?).

With that "ResultSetLimitHit" header I could at least know that there is maybe more.

Why maybe? (Please correct me if this is wrong.) I think you pointed this out before: the conveyor belt could by coincidence be empty, because the next chunk isn't prepared yet, at the very moment the result set size limit (an explicit LIMIT clause or ResultSetMaxRows) is reached.

If that's how it works, one could even think of two headers, "ResultSetLimitHit" / "ResultSetLimitExceeded", or one header with two values: "ResultSetLimit: Hit" and "ResultSetLimit: Exceeded".
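
A hypothetical sketch of how a client could act on those signals
(neither header exists; they are only the proposal above):

# Hypothetical: interpret the proposed "ResultSetLimit" header. Works
# on any object with a .headers mapping, e.g. a requests.Response.
def more_results_expected(response):
    state = response.headers.get("ResultSetLimit")  # proposed, not real
    if state == "Exceeded":
        return True   # server knows more items were left on the conveyor
    if state == "Hit":
        return None   # limit reached exactly; there *may* be more
    return False      # no signal: assume the result set is complete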

@kidehen


kidehen commented Jan 5, 2015


We can do two things here:

  1. add more headers
  2. add service parameters that indicate to Virtuoso the need to perform
    a count as part of the workload -- using this additional parameter
    prevents a costly heuristic from skewing the query solution and
    retrieval times.

Re. #1, this is closer to your message above.

I'll pick these items up with my development team.


@rnavarropiris


rnavarropiris commented Oct 24, 2016

@kidehen: I recently stumbled upon this issue when sending a query over the JDBC interface. However, according to the Virtuoso documentation, this should only apply to the SPARQL web service:

[SPARQL]
The SPARQL section sets parameters and limits for the SPARQL query protocol web service.
This section should stay commented out as long as SPARQL is not in use.
Section RDF Data Access and Data Management contains a detailed description of this functionality.

Is this the intended behaviour?
Is there a way of bypassing this limit (e.g. ResultSetMaxRows=0 as in 'no limit')?
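
For reference, the stanza in question looks roughly like the sketch
below; the values are illustrative, not shipped defaults, and whether 0
really means "no limit" is exactly the open question:

; illustrative virtuoso.ini fragment -- example values only
[SPARQL]
ResultSetMaxRows           = 10000  ; server-side cap on returned rows
MaxQueryExecutionTime      = 60     ; seconds before the anytime cutoff
MaxQueryCostEstimationTime = 400    ; reject queries estimated above this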

@kanihal


kanihal commented Apr 7, 2018

Any update on this?
Has the new header indicator for partial results been implemented?
