Incomplete number of Open Archives datasets #831

Closed
coret opened this issue Nov 20, 2023 · 7 comments · Fixed by #937

Comments

@coret
Contributor

coret commented Nov 20, 2023

The datacatalog https://www.openarchieven.nl/datasets/ (registration URL https://www.openarchieven.nl/.well-known/datacatalog or via direct link https://www.openarchieven.nl/datasets/datacatalog.ttl) contains 91 datasets. Yet, only 6 are present in the Dataset Register?

Check via query:

PREFIX dct: <http://purl.org/dc/terms/>
SELECT * WHERE {
    ?dataset dct:isPartOf "https://www.openarchieven.nl/datasets/" .
}
@ddeboer
Member

ddeboer commented Dec 1, 2023

In the logs it says:

{"level":50,"time":1701406951923,"pid":24,"hostname":"registry-crawler-54f948cd59-vzw2s","msg":"SPARQL query result for https://www.openarchieven.nl/.well-known/datacatalog reached the SPARQL limit of 50000"}

@coret
Contributor Author

coret commented Dec 7, 2023

@ddeboer Which component throws this error, Comunica?

The command curl -L -H "Accept: application/n-triples" https://www.openarchieven.nl/.well-known/datacatalog gives 7349 N-triples. Where does the > 50000 (what?) come from?
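For anyone who wants to repeat this check outside the shell, here is a minimal TypeScript sketch of the same measurement. It assumes the response really is N-Triples (one triple per line) and that a global `fetch` is available (Node 18+ or a browser); the function names `countNTriples` and `fetchTripleCount` are made up for this illustration and are not part of the registry code.

```typescript
// Count triples in an N-Triples document: every non-empty line that is
// not a comment holds exactly one triple.
function countNTriples(body: string): number {
  return body
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith('#')).length;
}

// Rough equivalent of: curl -L -H "Accept: application/n-triples" <url>
// Assumes a global fetch; redirects are followed, as with curl -L.
async function fetchTripleCount(url: string): Promise<number> {
  const response = await fetch(url, {
    headers: { Accept: 'application/n-triples' },
    redirect: 'follow',
  });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  return countNTriples(await response.text());
}
```

Calling `fetchTripleCount('https://www.openarchieven.nl/.well-known/datacatalog')` should give a number comparable to the curl count, but only as long as the server honours the `Accept` header in the same way.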

In the logs I see a start entry and an error more than 2 minutes apart. What is taking so long?

@ddeboer
Member

ddeboer commented Dec 7, 2023

export const sparqlLimit = 50000;

@coret
Contributor Author

coret commented Dec 11, 2023

As the provider of this dataset I still do not understand why this (undocumented) limit is reached, given that curl -L -H "Accept: application/n-triples" https://www.openarchieven.nl/.well-known/datacatalog gives 7349 triples.

The dataset requirements only mention the number of datasets after which pagination should be used:

Therefore, publishers SHOULD split large data catalogs in parts of at most a 1000 datasets, using the Hydra Core Vocabulary.

But the Open Archives datacatalog contains only 91 datasets. So do I need to alter the datacatalog or its descriptions in some way, or is the issue in the following function?

async function query(url: URL): Promise<DatasetExt[]> {

@ddeboer
Member

ddeboer commented Dec 12, 2023

50,000 is the limit on the number of result bindings, not on the number of triples. It’s good practice to have some limit on SPARQL queries, although we could of course raise this to another (arbitrary) number.

Querying just a single dataset gives a ridiculous number of bindings, perhaps due to multi-lingual labels, distribution blank nodes, OPTIONALs and/or bugs in Comunica: ade.json (18 MB!).

Should we perhaps consider splitting into two stages?

  1. Identify all dataset URIs.
  2. For each individual dataset URI, execute our query.

That would of course mean ~8000 separate queries in the case of NA.
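The two stages could be sketched roughly as follows. This is an illustration under stated assumptions, not the registry's actual code: the query executor is injected as a hypothetical `SelectFn` (in practice it might wrap Comunica), both SPARQL strings are simplified placeholders, and `harvestInTwoStages` is an invented name. The only value taken from this thread is the 50000 limit, which is applied per dataset here so that one verbose description cannot crowd out the others.

```typescript
// One result row: variable name -> value (simplified to strings).
type Row = Record<string, string>;

// Hypothetical executor that runs a SELECT against a source URL.
type SelectFn = (sourceUrl: string, sparql: string) => Promise<Row[]>;

// Stage 1 (placeholder query): identify all dataset URIs.
const datasetListQuery = `
  PREFIX schema: <http://schema.org/>
  SELECT DISTINCT ?dataset WHERE { ?dataset a schema:Dataset }`;

// Stage 2 (placeholder query): fetch the description of one dataset.
// The URI is interpolated as-is; real code should validate it first.
function datasetDetailQuery(datasetUri: string): string {
  return `SELECT * WHERE { <${datasetUri}> ?p ?o }`;
}

interface HarvestResult {
  descriptions: Map<string, Row[]>; // dataset URI -> its result rows
  skipped: string[];                // dataset URIs that hit the limit
}

async function harvestInTwoStages(
  sourceUrl: string,
  select: SelectFn,
  limit = 50000,
): Promise<HarvestResult> {
  const descriptions = new Map<string, Row[]>();
  const skipped: string[] = [];

  const datasets = await select(sourceUrl, datasetListQuery);
  for (const row of datasets) {
    const uri = row['dataset'];
    if (uri === undefined) continue;

    // One query per dataset: ~8000 round trips for a very large catalog.
    const rows = await select(sourceUrl, datasetDetailQuery(uri));
    if (rows.length >= limit) {
      skipped.push(uri);
    } else {
      descriptions.set(uri, rows);
    }
  }
  return { descriptions, skipped };
}
```

The trade-off is the one noted above: many small queries instead of one large one, in exchange for failures that stay local to a single dataset.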

@coret
Contributor Author

coret commented Dec 12, 2023

Querying just a single dataset gives a ridiculous number of bindings, perhaps due to multi-lingual labels, distribution blank nodes, OPTIONALs and/or bugs in Comunica: ade.json (18 MB!).

I've analyzed the ade.json file and found 10,752 variants of the dataset description (see https://validator.schema.org/#url=https%3A%2F%2Fwww.openarchieven.nl%2Fdatasets%2Fade for an easy look at the source).

The dataset description has:

  • name in 2 languages
  • spatial coverage in 2 languages
  • (an array of) 6 distributions each with a description in 2 languages
  • (an array of) 14 keywords (no @lang)
  • description in 2 languages
  • publisher name in 2 languages
  • creator name in 2 languages
  • (for my theory I'm missing something that would account for another factor of 2)

The ade.json seems to be some kind of Cartesian product of all these "multiple value" property values (multilanguage or "arrays" like distributions and keywords): 14 * 6 * 2 * 2 * 2 * 2 * 2 * 2 * 2 = 10,752

Hope this analysis makes sense and leads to identifying the bug!
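To make the arithmetic concrete, here is a small TypeScript sketch. It assumes that the query binds each multi-valued property to its own variable, so that the result contains one row per combination of values; the registry's actual query is not shown in this thread. The property labels, `expectedBindings` and `cartesian` are names invented for this example, and the last factor of 2 is the unexplained one mentioned above.

```typescript
// Number of values per multi-valued property, taken from the list above.
const valueCounts: Array<[string, number]> = [
  ['keywords', 14],
  ['distributions', 6],
  ['name languages', 2],
  ['spatial coverage languages', 2],
  ['distribution description languages', 2],
  ['description languages', 2],
  ['publisher name languages', 2],
  ['creator name languages', 2],
  ['unexplained factor', 2], // not accounted for by the analysis
];

// With one variable per property, the row count is the product of the counts.
function expectedBindings(counts: Array<[string, number]>): number {
  let product = 1;
  for (const [, n] of counts) {
    product *= n;
  }
  return product;
}

// Generic Cartesian product, to show the blow-up on toy data.
function cartesian<T>(lists: T[][]): T[][] {
  let rows: T[][] = [[]];
  for (const list of lists) {
    const next: T[][] = [];
    for (const row of rows) {
      for (const value of list) {
        next.push([...row, value]);
      }
    }
    rows = next;
  }
  return rows;
}

// Two names and three keywords already give 2 * 3 = 6 rows.
const toyRows = cartesian([
  ['name@nl', 'name@en'],
  ['keyword 1', 'keyword 2', 'keyword 3'],
]);
console.log(toyRows.length); // 6
console.log(expectedBindings(valueCounts)); // 10752
```

At roughly 10,752 rows per dataset, five such datasets would already exceed the 50000 limit, which may help explain why only a handful of the 91 datasets were registered.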

@coret
Contributor Author

coret commented Dec 12, 2023

The following query via Comunica produces the expected output without problems:

$ comunica-sparql https://www.openarch.nl/.well-known/datacatalog "CONSTRUCT WHERE { <https://www.openarchieven.nl/id/dataset_ade> ?p ?o }"
<https://www.openarchieven.nl/id/dataset_ade> a <http://schema.org/Dataset>;
    <http://schema.org/name> "Dataset genealogische metadata Archief Delft via Open Archieven"@nl, "Dataset genealogical metadata Archive Delft via Open Archives"@en;
    <http://schema.org/publisher> <https://www.openarchieven.nl/>;
    <http://schema.org/creator> <https://www.openarchieven.nl/>;
    <http://schema.org/dateCreated> "2023-02-22"^^<http://schema.org/Date>;
    <http://schema.org/dateModified> "2023-02-23"^^<http://schema.org/Date>;
    <http://schema.org/description> "De open data bestaat uit de metadata van 853.880 akten van Archief Delft, met daarop 2.299.475 historische persoonsvermeldingen. De brontypes omvatten bevolkingsregisters, geboorten, huwelijken, overlijdens. Deze dataset kan doorzocht worden via https://www.openarchieven.nl/ade"@nl, "The open data consists of metadata from 853,880 records of Archive Delft, with 2.299.475 historical person observations. The source types included population registers, births, marriages, deaths. This dataset can be searched via https://www.openarchieven.nl/ade"@en;
    <http://schema.org/distribution> _:bc_0_b0_genid-19, _:bc_0_b0_genid-210, _:bc_0_b0_genid-311, _:bc_0_b0_genid-412, _:bc_0_b0_genid-513, _:bc_0_b0_genid-614;
    <http://schema.org/identifier> "https://www.openarchieven.nl/id/dataset_ade";
    <http://schema.org/inLanguage> "nl-NL";
    <http://schema.org/includedInDataCatalog> "https://www.openarchieven.nl/datasets/";
    <http://schema.org/isBasedOn> <https://www.stadsarchiefdelft.nl/collecties/open-data/>;
    <http://schema.org/keywords> "Open Archieven", "Historische persoonsvermeldingen", "Genealogie", "Bevolkingsregisters", "Geboorten", "Huwelijken", "Overlijdens", "Open Archives", "Historical personal data", "Genealogy", "Population registers", "Births", "Marriages", "Deaths";
    <http://schema.org/license> <http://creativecommons.org/publicdomain/zero/1.0/>;
    <http://schema.org/mainEntityOfPage> <https://www.openarchieven.nl/datasets/ade>;
    <http://schema.org/spatialCoverage> "Nederland"@nl, "Netherlands"@en;
    <http://schema.org/thumbnailUrl> <https://www.openarchieven.nl/img/search/ade-oa-nl.png>.

The following query, however, seems to produce a Cartesian-product-like result (6,696 triples):

$ comunica-sparql https://www.openarch.nl/.well-known/datacatalog "CONSTRUCT WHERE { <https://www.openarchieven.nl/id/dataset_ade> ?p ?o ; <http://schema.org/distribution> ?d . ?d ?e ?f .}"

The same query on GraphDB (repository oa-datacatalog) gives 100 triples.
