
+ local cache on occurrence requests #53

Open
7yl4r opened this issue Jun 7, 2019 · 5 comments
7yl4r commented Jun 7, 2019

I'm doing repeated queries of occurrence records which can return large amounts of data.
Rather than downloading it all every time, I plan to save to a cache file and update it with only the newer records.

I think this can be accomplished by saving the df as a .rds file whose name includes the "after" occurrence id.

Example:

# fetch data and create file `Abra_alba_<LAST_OCCURRENCE_UUID>.rds`
records <- occurrence("Abra alba", cache=TRUE)

# 2nd call finds & loads `Abra_alba_<LAST_OCCURRENCE_UUID>.rds`,
#     includes `after=LAST_OCCURRENCE_UUID` in the request,
#     and appends the result to the locally cached df.
records <- occurrence("Abra alba", cache=TRUE)

This will speed things up a lot for me and reduce load on OBIS servers.
As a bonus, a filepath could be passed to the cache param, giving the user control over where the cache file is stored.
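As a rough sketch, the proposal could look like the wrapper below. Note this is hypothetical: the `after` argument to `occurrence()` and the file-naming scheme are part of the proposal, not existing robis features.

```r
# Hypothetical cache wrapper around robis::occurrence().
# Assumes an `after` parameter (proposed above, not in robis today)
# and that records carry a stable `id` column usable as a cursor.
library(robis)

cached_occurrence <- function(scientificname, cache_dir = ".") {
  slug <- gsub(" ", "_", scientificname)
  files <- list.files(cache_dir, pattern = paste0("^", slug, "_.*\\.rds$"),
                      full.names = TRUE)
  if (length(files) == 0) {
    # no cache yet: full download
    records <- occurrence(scientificname)
  } else {
    cached <- readRDS(files[1])
    # recover the cursor from the filename: <slug>_<LAST_ID>.rds
    last_id <- sub("\\.rds$", "", sub(paste0("^", slug, "_"), "", basename(files[1])))
    # fetch only records after the cached cursor (hypothetical argument)
    new_records <- occurrence(scientificname, after = last_id)
    records <- rbind(cached, new_records)
    file.remove(files[1])
  }
  last_id <- tail(records$id, 1)
  saveRDS(records, file.path(cache_dir, paste0(slug, "_", last_id, ".rds")))
  records
}
```

Embedding the cursor in the filename keeps the cache self-describing, so a second call needs no extra bookkeeping beyond locating the file.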

prereq: #7

@pieterprovoost

@7yl4r, I'm not sure this is going to work, as the occurrence identifiers are not sequential and not persistent across dataset updates. When a dataset is updated, the old version is completely removed and replaced with the new version where the occurrences will have new identifiers.

The occurrences we receive from data providers often do not have globally unique identifiers, so it's not trivial to determine which individual records have been added, removed or edited.

7yl4r commented Jun 7, 2019

That's unfortunate.
Without a sequential value for pagination in the API I'm not sure where to go with this.
I think using startdate would be equally problematic, since occurrence records are not necessarily added chronologically, and historical additions would then be missing from the local cache.

Maybe something could be done with the /statistics endpoint to skip re-downloading when the number of records is unchanged?
It's not the level of improvement I was hoping for, but it's something.
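The count-check idea might be sketched as below. The endpoint path, query parameter, and response field used here are assumptions for illustration, not verified against the OBIS API.

```r
# Sketch of a count-based cache check (hypothetical endpoint details).
library(httr)
library(robis)

fetch_if_changed <- function(scientificname, cache_file) {
  # assumed endpoint and response field; check the OBIS API docs
  resp <- GET("https://api.obis.org/statistics",
              query = list(scientificname = scientificname))
  remote_count <- content(resp)$records
  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    # unchanged record count: reuse the cache without downloading
    if (nrow(cached) == remote_count) return(cached)
  }
  records <- occurrence(scientificname)
  saveRDS(records, cache_file)
  records
}
```

As noted below, an unchanged count does not guarantee unchanged records, so this only avoids the download in the happy path.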

@pieterprovoost

Using statistics will cause other problems: e.g. when serious quality issues (coordinates, etc.) are fixed without any change in the number of records, you will miss those updates. It's not an easy problem, especially because records are removed as well. I can look into adding a published_after parameter to occurrence(), but that doesn't solve the problem of stale records in your cache.

Perhaps you could follow this workflow:

  • first fetch IDs for all datasets that have been updated since date X (parameter to be added to dataset())
  • remove all records belonging to these datasets from your cache
  • restrict your next download to the updated datasets (occurrence(datasetid=...))

I suppose this would offer some improvement, although you need to be aware that some nodes regularly regenerate their whole IPT, which makes it look like all datasets have been updated.
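The three-step workflow might be sketched like this. The `published_after` argument to `dataset()` is hypothetical (proposed in this thread, not yet implemented), and the `dataset_id` column name is an assumption about the occurrence data frame.

```r
# Sketch of the dataset-level cache refresh (hypothetical parameters).
library(robis)

refresh_cache <- function(cache_file, since) {
  cached <- readRDS(cache_file)
  # 1. find datasets updated since `since` (hypothetical argument)
  updated <- dataset(published_after = since)$id
  # 2. drop cached records belonging to the updated datasets
  kept <- cached[!(cached$dataset_id %in% updated), ]
  # 3. re-download only the updated datasets
  fresh <- occurrence(datasetid = paste(updated, collapse = ","))
  out <- rbind(kept, fresh)
  saveRDS(out, cache_file)
  out
}
```

Removing and re-fetching whole datasets sidesteps the lack of stable per-record identifiers, at the cost of re-downloading every record in a touched dataset.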

pieterprovoost self-assigned this Jun 7, 2019
7yl4r commented Jun 7, 2019

Thank you Pieter. 🙌

If I'm ambitious in the coming weeks I may try implementing this and submit a pull request.
Until then I'll just keep hammering OBIS's API. 😅

I am very glad I asked before assuming occurrence ids were sequential.

@pieterprovoost

@7yl4r Ok, I'm pretty busy right now but I'll try to add the necessary published date parameter to dataset() soonish (this needs to happen at the API level as well).
