Incomplete number of Open Archives datasets #831
In the logs it says:
@ddeboer Which component throws this error, Comunica? In the logs I see a start item and an error, with more than 2 minutes in between. What is taking so long?
Line 105 in 8d8082b
As the provider of this dataset, I still do not understand why this (undocumented) limit is reached, given that the dataset requirements only mention the number of datasets after which pagination should be used:
But the Open Archives datacatalog contains only 91 datasets. So do I need to alter the datacatalog/descriptions in some way, or is the issue in Line 74 in 8d8082b
50,000 is the limit on the number of result bindings, not the number of triples. It’s good practice to have some limit on your SPARQL queries, although of course we could raise this to another (arbitrary) number. Querying just a single dataset gives a ridiculous number of bindings, perhaps due to multilingual labels, distribution blank nodes, OPTIONALs and/or bugs in Comunica: ade.json (18 MB!). Should we perhaps consider splitting into two stages?
That would of course mean ~8000 separate queries in the case of NA.
I've analyzed the ade.json file and found 10,752 variants of the dataset description (see https://validator.schema.org/#url=https%3A%2F%2Fwww.openarchieven.nl%2Fdatasets%2Fade for an easy look at the source). The dataset description has:
The ade.json file seems to be some kind of Cartesian product of all these "multiple value" property values (multilingual values or "arrays" such as distributions and keywords): 14 * 6 * 2 * 2 * 2 * 2 * 2 * 2 * 2 = 10,752. I hope this analysis makes sense and leads to identifying the bug!
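To double-check the arithmetic, here is a minimal Python sketch. Only the counts (14, 6, and seven 2s) come from the analysis above; the property names are hypothetical placeholders, and this does not reproduce what Comunica actually does. It only illustrates why joining independent multi-valued properties in one query multiplies the result rows, while fetching each property on its own would only add them up.

```python
from itertools import product
from math import prod

# Hypothetical property names; only the counts are taken from the analysis.
value_counts = {
    "prop_a": 14,
    "prop_b": 6,
    **{f"prop_{i}": 2 for i in range(1, 8)},  # seven properties with 2 values each
}

# One result row per combination of values: the counts multiply.
rows_joined = prod(value_counts.values())

# Enumerate the combinations explicitly to confirm the multiplication.
combinations = product(*(range(n) for n in value_counts.values()))
rows_enumerated = sum(1 for _ in combinations)

# Fetching each property separately: the counts add up instead.
rows_separate = sum(value_counts.values())

print(rows_joined, rows_enumerated, rows_separate)  # 10752 10752 34
```

If the counts are right, the same description could be represented with a few dozen rows when each property is retrieved separately, instead of more than ten thousand rows when they are all joined.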
The following query via Comunica produces the expected output, no problem:
The following query seems to produce a Cartesian-product-like result (6,696 triples):
The same query on GraphDB (repository oa-datacatalog) gives 100 triples.
The datacatalog https://www.openarchieven.nl/datasets/ (registration URL https://www.openarchieven.nl/.well-known/datacatalog, or via direct link https://www.openarchieven.nl/datasets/datacatalog.ttl) contains 91 datasets. Yet only 6 are present in the Dataset Register?
Check via query: