
+ local cache on occurrence requests #53

Open
7yl4r opened this issue Jun 7, 2019 · 5 comments
7yl4r commented Jun 7, 2019

I'm doing repeated queries of occurrence records which can return large amounts of data.
Rather than downloading it all every time, I plan to save to a cache file and update it with only the newer records.

I think this can be accomplished by saving the df as a .rds file whose name includes the "after" occurrence id.

Example:

# fetch data and create file `Abra_alba_<LAST_OCCURRENCE_UUID>.rds`
records <- occurrence("Abra alba", cache=TRUE)

# 2nd call finds & loads `Abra_alba_<LAST_OCCURRENCE_UUID>.rds`,
#     includes `after=LAST_OCCURRENCE_UUID` in the request,
#     and appends the result to the locally cached df.
records <- occurrence("Abra alba", cache=TRUE)

This will speed things up a lot for me and reduce load on OBIS servers.
As a bonus, a filepath could be passed to the cache param, giving the user control over where the cache file is stored.
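As a rough sketch, the proposal could look like the wrapper below. Note this is hypothetical: the `after` argument to `occurrence()` and the file-naming scheme are part of the proposal, not existing robis features.

```r
# Hypothetical cache wrapper around robis::occurrence().
# Assumes an `after` parameter (proposed above, not in robis today)
# and that records carry a stable `id` column usable as a cursor.
library(robis)

cached_occurrence <- function(scientificname, cache_dir = ".") {
  slug <- gsub(" ", "_", scientificname)
  files <- list.files(cache_dir, pattern = paste0("^", slug, "_.*\\.rds$"),
                      full.names = TRUE)
  if (length(files) == 0) {
    # no cache yet: full download
    records <- occurrence(scientificname)
  } else {
    cached <- readRDS(files[1])
    # recover the cursor from the filename: <slug>_<LAST_ID>.rds
    last_id <- sub("\\.rds$", "", sub(paste0("^", slug, "_"), "", basename(files[1])))
    # fetch only records after the cached cursor (hypothetical argument)
    new_records <- occurrence(scientificname, after = last_id)
    records <- rbind(cached, new_records)
    file.remove(files[1])
  }
  last_id <- tail(records$id, 1)
  saveRDS(records, file.path(cache_dir, paste0(slug, "_", last_id, ".rds")))
  records
}
```

Embedding the cursor in the filename keeps the cache self-describing, so a second call needs no extra bookkeeping beyond locating the file.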

prereq: #7

@pieterprovoost

@7yl4r, I'm not sure this is going to work, as the occurrence identifiers are not sequential and not persistent across dataset updates. When a dataset is updated, the old version is completely removed and replaced with the new version where the occurrences will have new identifiers.

The occurrences we receive from data providers often do not have globally unique identifiers, so it's not trivial to determine which individual records have been added, removed or edited.

7yl4r commented Jun 7, 2019

That's unfortunate.
Without a sequential value for pagination in the API I'm not sure where to go with this.
I think using startdate would be equally problematic, since occurrence records are not necessarily added chronologically, and historical additions would then be missing from the local cache.

Maybe something could be done with the /statistics endpoint to skip re-downloading when the number of records is unchanged?
It's not the level of improvement I was hoping for, but it's something.
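The count-check idea might be sketched as below. The endpoint path, query parameter, and response field used here are assumptions for illustration, not verified against the OBIS API.

```r
# Sketch of a count-based cache check (hypothetical endpoint details).
library(httr)
library(robis)

fetch_if_changed <- function(scientificname, cache_file) {
  # assumed endpoint and response field; check the OBIS API docs
  resp <- GET("https://api.obis.org/statistics",
              query = list(scientificname = scientificname))
  remote_count <- content(resp)$records
  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    # unchanged record count: reuse the cache without downloading
    if (nrow(cached) == remote_count) return(cached)
  }
  records <- occurrence(scientificname)
  saveRDS(records, cache_file)
  records
}
```

As noted below, an unchanged count does not guarantee unchanged records, so this only avoids the download in the happy path.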

@pieterprovoost

Using statistics will cause other problems: e.g. when serious quality issues (coordinates, etc.) are fixed without any change in the number of records, you will miss those updates. It's not an easy problem, especially because records are removed as well. I can look into adding a published_after parameter to occurrence(), but that doesn't solve the problem of stale records in your cache.

Perhaps you could follow this workflow:

  • first fetch IDs for all datasets that have been updated since date X (parameter to be added to dataset())
  • remove all records belonging to these datasets from your cache
  • restrict your next download to the updated datasets (occurrence(datasetid=...))

I suppose this would offer some improvement, although you need to be aware that some nodes regularly regenerate their whole IPT, which makes it look like all datasets have been updated.
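The three-step workflow might be sketched like this. The `published_after` argument to `dataset()` is hypothetical (proposed in this thread, not yet implemented), and the `dataset_id` column name is an assumption about the occurrence data frame.

```r
# Sketch of the dataset-level cache refresh (hypothetical parameters).
library(robis)

refresh_cache <- function(cache_file, since) {
  cached <- readRDS(cache_file)
  # 1. find datasets updated since `since` (hypothetical argument)
  updated <- dataset(published_after = since)$id
  # 2. drop cached records belonging to the updated datasets
  kept <- cached[!(cached$dataset_id %in% updated), ]
  # 3. re-download only the updated datasets
  fresh <- occurrence(datasetid = paste(updated, collapse = ","))
  out <- rbind(kept, fresh)
  saveRDS(out, cache_file)
  out
}
```

Removing and re-fetching whole datasets sidesteps the lack of stable per-record identifiers, at the cost of re-downloading every record in a touched dataset.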

pieterprovoost self-assigned this Jun 7, 2019
7yl4r commented Jun 7, 2019

Thank you Pieter. 🙌

If I'm ambitious in the coming weeks I may try implementing this and submit a pull request.
Until then I'll just keep hammering OBIS's API. 😅

I am very glad I asked before assuming occurrence ids were sequential.

@pieterprovoost

@7yl4r Ok, I'm pretty busy right now but I'll try to add the necessary published date parameter to dataset() soonish (this needs to happen at the API level as well).
