
Not all datasets from datacatalog processed #279

Closed
coret opened this issue Oct 5, 2021 · 6 comments · Fixed by #353
@coret
Contributor

coret commented Oct 5, 2021

From #252 (comment)

the dataset descriptions which are already in the Dataset Register aren't updated because no datasets are found

How did you reach that conclusion? This query shows us that the registration URL is valid and was read last night.

This query seems to show the expected encoding formats.

Upon closer examination, it seems that some datasets were updated but some - like dataset_wba (see query 1) - were not updated.

In the triplestore there are 36 datasets in the data catalog Open Archives Data Catalog (see query 2). That's 46 short of the 82 datasets the following query finds at the source:

comunica-sparql https://www.openarch.nl/datasets/ "SELECT * WHERE { ?s <http://schema.org/dataset> ?o . }"
[
{"?s":"https://www.openarch.nl/id/datacatalog","?o":"https://www.openarch.nl/id/dataset_ade"},
{"?s":"https://www.openarch.nl/id/datacatalog","?o":"https://www.openarch.nl/id/dataset_bhi"},
{"?s":"https://www.openarch.nl/id/datacatalog","?o":"https://www.openarch.nl/id/dataset_elo"},
...
{"?s":"https://www.openarch.nl/id/datacatalog","?o":"https://www.openarch.nl/id/dataset_wba"},
...
]

The dataset https://www.openarch.nl/id/dataset_wba is one of the datasets missing from query 2, which is why it wasn't updated.

Additionally, I guess our crawler doesn't yet remove dataset graphs from our triplestore when they are no longer present in the data catalog that introduced them?
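The clean-up I'm describing boils down to a set difference between the dataset IRIs we currently store for a catalog and the IRIs found in the latest crawl. A minimal sketch, assuming in-memory sets of IRIs; `staleGraphs` is a hypothetical helper, not part of the actual crawler:

```typescript
// Given the dataset IRIs currently stored for a catalog and the IRIs
// found during the latest crawl, compute which graphs are stale and
// could be removed from the triplestore.
function staleGraphs(stored: Set<string>, crawled: Set<string>): string[] {
  return [...stored].filter((iri) => !crawled.has(iri));
}

const stored = new Set([
  'https://www.openarch.nl/id/dataset_ade',
  'https://www.openarch.nl/id/dataset_wba',
]);
const crawled = new Set(['https://www.openarch.nl/id/dataset_ade']);
// dataset_wba is no longer in the catalog, so its graph could be dropped:
console.log(staleGraphs(stored, crawled)); // [ 'https://www.openarch.nl/id/dataset_wba' ]
```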

@ddeboer
Member

ddeboer commented Oct 5, 2021

Are we running into the limit on the number of triples? How many triples are on https://www.openarch.nl/datasets/? Our query is capped at 10,000. We can of course raise that limit, but that comes at the cost of CPU and memory usage. And what if catalogs grow even further, both in breadth (number of datasets) and depth (amount of data per dataset)?
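For context, a cap like this is typically applied by appending a LIMIT clause to the SELECT query before evaluating it. Whether the Register does exactly this is an assumption on my part; `withLimit` is a hypothetical helper for illustration:

```typescript
// Append a LIMIT clause so at most `max` results come back.
// Naive string concatenation for illustration only; a real implementation
// would manipulate the parsed query instead of the raw string.
function withLimit(query: string, max: number): string {
  return `${query.trim()} LIMIT ${max}`;
}

console.log(withLimit('SELECT * WHERE { ?s ?p ?o . }', 10000));
// SELECT * WHERE { ?s ?p ?o . } LIMIT 10000
```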

Additionally, I guess our crawler doesn't yet remove dataset graphs from our triplestore when they are no longer present in the data catalog that introduced them?

That’s right. If this is desired behaviour (it wasn’t yet in the design) please create a separate issue.

@coret
Contributor Author

coret commented Oct 5, 2021

How many triples are on https://www.openarch.nl/datasets/?

 $ comunica-sparql https://www.openarch.nl/datasets/ "SELECT * WHERE { ?s ?p ?o . }" | wc -l
15660

Is this maximum of 10,000 triples based on best practice or metrics?

We can of course raise that limit, but that comes at the cost of CPU and memory usage. And what if catalogs grow even further, both in breadth (number of datasets) and depth (amount of data per dataset)?

I guess we can learn from https://www.openarch.nl/datasets/ what the cost is. I previously generated a data catalog of all customers of De Ree, which was much larger. So I certainly feel we must be ready for growth, and not fail silently.

@ddeboer
Member

ddeboer commented Oct 5, 2021

Is this maximum of 10,000 triples based on best practice or metrics?

Nope, we just had to set some limit.

And not fail silently.

Do you have a suggestion as to how we can improve this behaviour?

@coret
Contributor Author

coret commented Oct 6, 2021

I previously generated a data catalog of all customers of De Ree, which was much larger.

We want to add this data too. It's available in https://github.com/netwerk-digitaal-erfgoed/dataset-register-entries/tree/main/ANL (not as one data catalog, but a data catalog per organization). The biggest one (95 MB) is https://github.com/netwerk-digitaal-erfgoed/dataset-register-entries/blob/main/ANL/zar.ttl, which contains about 700K triples (42,840 datasets).

Do you have a suggestion as to how we can improve this behaviour?

Is it possible to do some sort of count of triples or datasets, and issue a warning in the API and crawler if this count exceeds our limit or is 0?

Another approach: what instructions can we give publishers of data catalogs to limit the size of a data catalog? Splitting up a data catalog might be a possibility (hydra:PagedCollection, #151).

@ddeboer
Member

ddeboer commented Oct 7, 2021

Can we split up the query into multiple small ones? And if so, does that lower memory/CPU usage or not?
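One way to split it up would be consecutive LIMIT/OFFSET pages over the same query. A sketch, assuming this approach; note that SPARQL result order is undefined without an ORDER BY clause, so the base query would need one for the pages to be stable:

```typescript
// Split one unbounded SELECT into consecutive LIMIT/OFFSET pages.
// The base query must contain an ORDER BY for deterministic paging,
// since SPARQL result order is otherwise unspecified.
function pagedQueries(query: string, pageSize: number, pages: number): string[] {
  return Array.from(
    { length: pages },
    (_, i) => `${query.trim()} LIMIT ${pageSize} OFFSET ${i * pageSize}`,
  );
}

for (const q of pagedQueries('SELECT ?s WHERE { ?s ?p ?o } ORDER BY ?s', 5000, 2)) {
  console.log(q);
}
```

Whether evaluating many small pages actually lowers peak memory/CPU would depend on how the query engine buffers results, so that part of the question still needs measuring.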

@ddeboer
Member

ddeboer commented Nov 24, 2021

Is it possible to do some sort of count of triples or datasets and issue a warning in API and crawler if this count exceeds our limit or is 0?

While we cannot know for sure how many triples our query would return if it were unlimited, we can know whether the SPARQL result size matches our limit. That indicates we have to raise the limit further (or is pure coincidence, which is rather unlikely). Because this is valuable information, I log it in #353.
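The check amounts to comparing the result count against the cap. A minimal sketch of that idea; `checkResultSize` and the messages are illustrative, not the actual code in #353:

```typescript
const LIMIT = 10_000; // assumed crawler cap on query results

// After running the capped query, flag suspicious result sizes:
// exactly at the cap probably means the data was truncated,
// and zero means no datasets were found at all.
function checkResultSize(count: number): string | null {
  if (count === LIMIT) {
    return `Result size equals the limit of ${LIMIT}; data may be truncated`;
  }
  if (count === 0) {
    return 'Query returned no results';
  }
  return null; // nothing suspicious
}

console.log(checkResultSize(10_000)); // warns about truncation
console.log(checkResultSize(36));     // null: nothing to report
```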
