Not all datasets from datacatalog processed #279
Are we running into the limit on the number of triples? How many triples are on https://www.openarch.nl/datasets/? Our query is capped at 10000. We can of course raise that limit, but that comes at the cost of CPU and memory usage. And what if catalogs grow even further, both in breadth (number of datasets) and depth (amount of data per dataset)?
That’s right. If this is desired behaviour (it wasn’t yet in the design), please create a separate issue.
Is this maximum of 10,000 triples based on best practice or on metrics?
I guess we can learn from https://www.openarch.nl/datasets/ what the cost is. I previously generated a data catalog of all De Ree customers, which was much larger. So I certainly feel we must be ready for growth, and not fail silently.
Nope, we just had to set some limit.
Do you have a suggestion as to how we can improve this behaviour?
We want to add this data too. It’s available in https://github.com/netwerk-digitaal-erfgoed/dataset-register-entries/tree/main/ANL (not as one data catalog, but as a data catalog per organization). The biggest file (95 MB) is https://github.com/netwerk-digitaal-erfgoed/dataset-register-entries/blob/main/ANL/zar.ttl, which contains about 700K triples (42,840 datasets).
Is it possible to do some sort of count of triples or datasets, and issue a warning in the API and crawler if this count exceeds our limit or is 0? Another approach: what instructions can we give publishers of data catalogs to limit the size of a data catalog? Splitting up a data catalog might be a possibility (hydra:PagedCollection, #151).
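The count-and-warn idea above could look like the following sketch. The SPARQL query, the 10,000 cap, and the `check_catalog_size` helper are illustrative assumptions, not the project's actual code:

```python
# Sketch: warn when a data catalog is empty or too large to crawl fully.
# The query, the cap, and the helper name are assumptions for illustration.

TRIPLE_LIMIT = 10_000  # assumed current cap on triples fetched per catalog

COUNT_DATASETS_QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT (COUNT(?dataset) AS ?count) WHERE {
  ?catalog a dcat:Catalog ;
           dcat:dataset ?dataset .
}
"""

def check_catalog_size(dataset_count: int, triple_count: int) -> list[str]:
    """Return warnings for catalogs that are empty or may exceed our limit."""
    warnings = []
    if dataset_count == 0:
        warnings.append("catalog contains no datasets")
    if triple_count >= TRIPLE_LIMIT:
        warnings.append(
            f"catalog has {triple_count} triples, at or above the "
            f"{TRIPLE_LIMIT}-triple limit; results may be truncated"
        )
    return warnings
```

Both the API and the crawler could surface these warnings, so a publisher learns immediately that their catalog was only partially processed.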
Can we split up the query into multiple smaller ones? And if so, does that lower memory/CPU usage or not?
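Splitting the query could be done with SPARQL `LIMIT`/`OFFSET` pagination. A minimal sketch, where `fetch_page` is a hypothetical stand-in for the actual SPARQL client:

```python
# Sketch: page through a large result set in fixed-size chunks instead of one
# big query. fetch_page is a hypothetical callable that runs a SPARQL query
# and returns a list of result rows.

PAGE_SIZE = 1_000  # assumed page size

def paged_query(template: str, fetch_page) -> list:
    """Fetch all results in PAGE_SIZE chunks; stop when a page comes back short."""
    results = []
    offset = 0
    while True:
        page = fetch_page(f"{template} LIMIT {PAGE_SIZE} OFFSET {offset}")
        results.extend(page)
        if len(page) < PAGE_SIZE:
            return results
        offset += PAGE_SIZE
```

Note that per the SPARQL spec, `OFFSET` is only deterministic when the query has an `ORDER BY`, so a stable sort key would be needed for correct pagination. Whether this lowers peak memory depends on the endpoint; it does bound the size of each individual response.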
While we cannot know for sure how many triples our query would return if it were unlimited, we can tell when the SPARQL result size matches our limit. This indicates that we have to raise the limit further (or is pure coincidence, which is rather unlikely). Because this is valuable information, I log it in #353.
From #252 (comment)
Upon closer examination, it seems that some datasets were updated, but others, like dataset_wba (see query 1), were not.
In the triplestore there are 36 datasets in the datacatalog Open Archives Data Catalog (see query 2). That's 46 short, as the following query shows 82 datasets at the source:
The dataset https://www.openarch.nl/id/dataset_wba is one of the datasets missing from query 2, which is why it wasn’t updated.
Additionally, I guess our crawler doesn’t yet remove dataset graphs from our triplestore when they are no longer present in the data catalog that introduced them?
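Removing stale datasets amounts to a set difference between what the triplestore holds and what the catalog currently lists, followed by dropping each orphaned graph. A sketch under those assumptions, where `drop_graph` stands in for an actual triplestore operation such as a SPARQL `DROP GRAPH` update:

```python
# Sketch: after crawling a catalog, drop the graph of every dataset that is
# still in the triplestore but no longer listed in the catalog.
# drop_graph is a hypothetical callable, e.g. issuing `DROP GRAPH <iri>`.

def remove_stale(store_datasets: set[str],
                 catalog_datasets: set[str],
                 drop_graph) -> set[str]:
    """Drop graphs of datasets absent from the catalog; return their IRIs."""
    stale = store_datasets - catalog_datasets
    for iri in stale:
        drop_graph(iri)
    return stale
```

Applied to the numbers above this would cut both ways: it removes leftovers, while the 46 datasets missing from query 2 are the opposite problem (present at the source but absent from the store) and would need the crawl itself to be fixed first.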