
Not all datasets from datacatalog processed #279

Closed
coret opened this issue Oct 5, 2021 · 6 comments · Fixed by #353
@coret
Contributor

coret commented Oct 5, 2021

From #252 (comment)

the dataset descriptions which are already in the Dataset Register aren't updated because no datasets are found

How did you reach that conclusion? This query shows us that the registration URL is valid and was read last night.

This query seems to show the expected encoding formats.

Upon closer examination, it seems that some datasets were updated but some - like dataset_wba (see query 1) - were not updated.

In the triplestore there are 36 datasets in the data catalog Open Archives Data Catalog (see query 2). That's 46 short of the 82 datasets the following query finds at the source:

comunica-sparql https://www.openarch.nl/datasets/ "SELECT * WHERE { ?s <http://schema.org/dataset> ?o . }"
[
{"?s":"https://www.openarch.nl/id/datacatalog","?o":"https://www.openarch.nl/id/dataset_ade"},
{"?s":"https://www.openarch.nl/id/datacatalog","?o":"https://www.openarch.nl/id/dataset_bhi"},
{"?s":"https://www.openarch.nl/id/datacatalog","?o":"https://www.openarch.nl/id/dataset_elo"},
...
{"?s":"https://www.openarch.nl/id/datacatalog","?o":"https://www.openarch.nl/id/dataset_wba"},
...
]

The dataset https://www.openarch.nl/id/dataset_wba is one of the datasets missing from query 2, which is why it wasn't updated.

Additionally, I guess our crawler doesn't yet remove dataset graphs from our triplestore when they are no longer present in the data catalog that introduced them?
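The clean-up I'm describing boils down to a set difference between the dataset IRIs we currently store for a catalog and the IRIs found in the latest crawl. A minimal sketch, assuming in-memory sets of IRIs; `staleGraphs` is a hypothetical helper, not part of the actual crawler:

```typescript
// Given the dataset IRIs currently stored for a catalog and the IRIs
// found during the latest crawl, compute which graphs are stale and
// could be removed from the triplestore.
function staleGraphs(stored: Set<string>, crawled: Set<string>): string[] {
  return [...stored].filter((iri) => !crawled.has(iri));
}

const stored = new Set([
  'https://www.openarch.nl/id/dataset_ade',
  'https://www.openarch.nl/id/dataset_wba',
]);
const crawled = new Set(['https://www.openarch.nl/id/dataset_ade']);
// dataset_wba is no longer in the catalog, so its graph could be dropped:
console.log(staleGraphs(stored, crawled)); // [ 'https://www.openarch.nl/id/dataset_wba' ]
```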

@ddeboer
Member

ddeboer commented Oct 5, 2021

Are we running into the limit on the number of triples? How many triples are on https://www.openarch.nl/datasets/? Our query is capped at 10,000. We can of course raise that limit, but that comes at the cost of CPU and memory usage. And what if catalogs grow even further, both in breadth (number of datasets) and depth (amount of data per dataset)?
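For context, a cap like this is typically applied by appending a LIMIT clause to the SELECT query before evaluating it. Whether the Register does exactly this is an assumption on my part; `withLimit` is a hypothetical helper for illustration:

```typescript
// Append a LIMIT clause so at most `max` results come back.
// Naive string concatenation for illustration only; a real implementation
// would manipulate the parsed query instead of the raw string.
function withLimit(query: string, max: number): string {
  return `${query.trim()} LIMIT ${max}`;
}

console.log(withLimit('SELECT * WHERE { ?s ?p ?o . }', 10000));
// SELECT * WHERE { ?s ?p ?o . } LIMIT 10000
```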

Additionally, I guess our crawler doesn't yet remove dataset graphs from our triplestore when they are no longer present in the data catalog that introduced them?

That’s right. If this is desired behaviour (it wasn’t yet in the design) please create a separate issue.

@coret
Contributor Author

coret commented Oct 5, 2021

How many triples are on https://www.openarch.nl/datasets/?

 $ comunica-sparql https://www.openarch.nl/datasets/ "SELECT * WHERE { ?s ?p ?o . }" | wc -l
15660

Is this maximum of 10,000 triples based on best practice or metrics?

We can of course raise that limit, but that comes at the cost of CPU and memory usage. And what if catalogs grow even further, both in breadth (number of datasets) and depth (amount of data per dataset)?

I guess we can learn from https://www.openarch.nl/datasets/ what the cost is. I previously generated a data catalog of all customers of De Ree, which was much larger. So I certainly feel we must be ready for growth, and not fail silently.

@ddeboer
Member

ddeboer commented Oct 5, 2021

Is this maximum of 10,000 triples based on best practice or metrics?

Nope, we just had to set some limit.

And not fail silently.

Do you have a suggestion as to how we can improve this behaviour?

@coret
Contributor Author

coret commented Oct 6, 2021

I previously generated a data catalog of all customers of De Ree, which was much larger.

We want to add this data too. It's available in https://github.com/netwerk-digitaal-erfgoed/dataset-register-entries/tree/main/ANL (not as one data catalog, but a data catalog per organization). The biggest one (95 MB) is https://github.com/netwerk-digitaal-erfgoed/dataset-register-entries/blob/main/ANL/zar.ttl, which contains about 700K triples (42,840 datasets).

Do you have a suggestion as to how we can improve this behaviour?

Is it possible to do some sort of count of triples or datasets, and issue a warning in the API and crawler if this count exceeds our limit or is 0?

Another approach: what instructions can we give publishers of data catalogs to limit the size of a data catalog? Splitting up a data catalog might be a possibility (hydra:PagedCollection, #151).

@ddeboer
Member

ddeboer commented Oct 7, 2021

Can we split up the query into multiple small ones? And if so, does that lower memory/CPU usage or not?
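One way to split it up would be consecutive LIMIT/OFFSET pages over the same query. A sketch, assuming this approach; note that SPARQL result order is undefined without an ORDER BY clause, so the base query would need one for the pages to be stable:

```typescript
// Split one unbounded SELECT into consecutive LIMIT/OFFSET pages.
// The base query must contain an ORDER BY for deterministic paging,
// since SPARQL result order is otherwise unspecified.
function pagedQueries(query: string, pageSize: number, pages: number): string[] {
  return Array.from(
    { length: pages },
    (_, i) => `${query.trim()} LIMIT ${pageSize} OFFSET ${i * pageSize}`,
  );
}

for (const q of pagedQueries('SELECT ?s WHERE { ?s ?p ?o } ORDER BY ?s', 5000, 2)) {
  console.log(q);
}
```

Whether evaluating many small pages actually lowers peak memory/CPU would depend on how the query engine buffers results, so that part of the question still needs measuring.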

@ddeboer
Member

ddeboer commented Nov 24, 2021

Is it possible to do some sort of count of triples or datasets and issue a warning in API and crawler if this count exceeds our limit or is 0?

While we cannot know for sure how many triples our query would return if it were unlimited, we can know whether the SPARQL result size matches our limit. That indicates we have to raise the limit further (or is pure coincidence, which is rather unlikely). Because this is valuable information, I log it in #353.
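The check amounts to comparing the result count against the cap. A minimal sketch of that idea; `checkResultSize` and the messages are illustrative, not the actual code in #353:

```typescript
const LIMIT = 10_000; // assumed crawler cap on query results

// After running the capped query, flag suspicious result sizes:
// exactly at the cap probably means the data was truncated,
// and zero means no datasets were found at all.
function checkResultSize(count: number): string | null {
  if (count === LIMIT) {
    return `Result size equals the limit of ${LIMIT}; data may be truncated`;
  }
  if (count === 0) {
    return 'Query returned no results';
  }
  return null; // nothing suspicious
}

console.log(checkResultSize(10_000)); // warns about truncation
console.log(checkResultSize(36));     // null: nothing to report
```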
