Data version API #7

Open
rufuspollock opened this Issue Apr 29, 2016 · 5 comments

rufuspollock commented Apr 29, 2016

From @trickvi on June 20, 2013 12:37

I would like to be able to use data.okfn.org as an intermediary between my software and the data packages it uses and be able to quickly check whether there's a new version available of the data (e.g. if I've cached the package on a local machine).

There are ways to do it with the current setup:

  1. Download the datapackage.json descriptor file, parse it and get the version there and check it against my local version. Problems:
    • This solution relies on humans updating the version, and there might not be any consistency in how they do so, since the data package standard describes the version attribute as: "a version string conforming to the Semantic Versioning requirement"
    • I have to fetch the whole datapackage.json (it's not big I know but why download all that extra data I might not even want)
  2. Go around data.okfn.org and look directly at the github repository. Problems:
    • I have to find out where the repo is, use git and do a lot of extra stuff (I don't care how the data packages are stored, I just want a simple interface to fetch them)
    • What would be the point of data.okfn.org/data? In my mind it collects data packages and provides a consistent interface to get the data packages irrespective of how they are stored.
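Option 1 above comes down to comparing two version strings. A minimal sketch of that comparison, assuming plain MAJOR.MINOR.PATCH versions with no pre-release or build suffixes (the helper names `parse_version` and `is_newer` are made up for illustration):

```python
import json


def parse_version(version):
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_newer(remote_descriptor_text, local_version):
    """Return True if the descriptor's version is greater than the cached one."""
    descriptor = json.loads(remote_descriptor_text)
    remote_version = descriptor.get("version")
    if remote_version is None:
        # No version published: nothing to compare, so treat as unchanged.
        return False
    return parse_version(remote_version) > parse_version(local_version)
```

Comparing tuples rather than raw strings matters because "1.10.0" sorts before "1.9.3" as text. The check still only works if the maintainer actually bumps the version, which is the first problem listed above.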

I propose that data.okfn.org provide an internal mechanism that lets users quickly check whether a new version has been released. This does not have to be an API. We could leverage HTTP's caching mechanism using an ETag header that would contain some hash value. This hash value could, for example, be the SHA of the head ref object served via the GitHub API:

https://api.github.com/repos/datasets/cpi/git/refs/heads/master

Software that works with data packages could then implement a caching strategy and just send a request with an If-None-Match header along with a GET request for datapackage.json to either get a new version of the descriptor (and look at the version in that file) or just serve the data from its cache.
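On the client side, the caching strategy described here could look roughly like the sketch below. It uses only Python's standard `urllib`; the function name `fetch_if_changed` and the `opener` parameter (there so the logic can be exercised without a network) are assumptions for illustration, not part of any existing data package tooling.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, cached_etag=None, opener=urllib.request.urlopen):
    """Conditionally GET `url`.

    Returns (body, etag). body is None when the server answers
    304 Not Modified, meaning the local cache is still current.
    """
    request = urllib.request.Request(url)
    if cached_etag:
        # Ask the server to skip the body if our copy is still current.
        request.add_header("If-None-Match", cached_etag)
    try:
        with opener(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError rather than a normal response.
        if err.code == 304:
            return None, cached_etag
        raise
```

A caller would store the returned ETag next to its cached copy of datapackage.json and pass it back on the next check. A `None` body means the cache can be served as-is.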

Copied from original issue: frictionlessdata/project#51

rufuspollock commented Apr 29, 2016

So the big questions here are:

  • How does this work for data packages that aren't stored on GitHub? (We could force this assumption, but it seems strong.)
  • How do I get this info and serve it? Do I make a call to the GitHub API on every view request to a page? Alternatively we could have a dedicated API, but either way we have to hit the GitHub API every time (and the GitHub API is pretty restrictive re anonymous usage, though obviously we can do non-anonymous). If we don't do it every time, we need a caching mechanism.
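For the "serve it" half of the second question, here is a rough sketch of what the server would do once it has a head SHA in hand. It assumes the GitHub refs response carries the SHA under `object.sha`; the function names `head_sha` and `respond` are hypothetical, and splitting If-None-Match on commas is a simplification that is fine for hex SHAs but not for arbitrary ETag values.

```python
import json


def head_sha(ref_payload_text):
    """Pull the commit SHA out of a GitHub 'get a reference' response body."""
    return json.loads(ref_payload_text)["object"]["sha"]


def _strip_weak(tag):
    """If-None-Match uses weak comparison, so ignore a leading W/ marker."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def respond(if_none_match, current_sha, descriptor_bytes):
    """Return (status, headers, body) for a datapackage.json request."""
    etag = '"%s"' % current_sha  # ETag values are quoted strings
    headers = {"ETag": etag}
    if if_none_match is not None:
        candidates = [_strip_weak(tag) for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            # Client already has this version: send no body.
            return 304, headers, b""
    return 200, headers, descriptor_bytes
```

This leaves open where `current_sha` comes from on each request, which is exactly the rate-limit concern raised above.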

rufuspollock commented Apr 29, 2016

From @trickvi on June 30, 2013 9:28

Are there any data packages on data.okfn.org that aren't stored on GitHub? If so, is there some other way to generate a hash value, timestamps, etc.? I think the ETag value will always be closely tied to the underlying storage mechanism (and how that tracks changes to resources). If it isn't possible to get some value you could always fall back on hashing/serving the version number (which works OK, but might not be updated) or do an MD5 on the resources (computation heavy for data.okfn.org).
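The fallback chain described here might be sketched as follows. The order of preference (storage-level SHA first, then an MD5 over the resources, then a hash of the version string) is an assumption, as is the function name `choose_etag`; the comment above names the options but does not rank the last two.

```python
import hashlib


def choose_etag(git_sha=None, resource_paths=(), version=None):
    """Pick the strongest change indicator available, in order of preference."""
    if git_sha:
        return git_sha
    if resource_paths:
        digest = hashlib.md5()
        for path in sorted(resource_paths):
            with open(path, "rb") as handle:
                # Read in chunks so large resources are not loaded at once.
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
        return digest.hexdigest()
    if version:
        # Cheapest option, but only changes when someone bumps the version.
        return hashlib.md5(version.encode("utf-8")).hexdigest()
    return None
```

Hashing every resource is the computation-heavy path, so a server would want to compute it once per update rather than once per request.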

Then there is the issue about checking for updates. I think this depends on the data but I don't think any data package is being updated real time so we don't have to check for updates in real time. I think it might be enough to check for this once a day or even once a week/month (I haven't looked at the existing data packages to infer what time period is necessary). This value could be saved in memory as some sort of a caching mechanism, but looking at github's caching mechanism I wouldn't rely on it as a caching strategy.
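Keeping the value in memory and refreshing it once a day could be as small as the sketch below. The class name `CachedValue`, the one-day default, and the injectable `clock` are illustrative assumptions; `fetch` stands in for whatever lookup produces the hash (for example a call to the GitHub API).

```python
import time


class CachedValue:
    """Remember a value in memory and refresh it at most once per `ttl` seconds."""

    def __init__(self, fetch, ttl=24 * 60 * 60, clock=time.monotonic):
        self._fetch = fetch  # callable that does the expensive lookup
        self._ttl = ttl
        self._clock = clock  # injectable so the expiry logic can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # First call, or the cached value is older than ttl: refresh it.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

With a daily TTL the upstream API is hit once per package per day regardless of how many clients ask, at the cost of clients seeing an update up to a day late.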

Now these kinds of checks could of course be done by those who use the data but I don't think that it's the role of data package users. They just want to use the data, not create a mechanism to check for updates. They will most likely use some intermediate software that's unaware of the data context so that software package needs a generic caching strategy.

rufuspollock commented Apr 29, 2016

Let's agree on the assumption of GitHub only for now, and you're suggesting you're happy with the data being irregularly updated. In that case it shouldn't be too hard to do.

Last question is what field this goes into. I'm guessing we use lastmodified or similar.

rufuspollock commented Apr 29, 2016

From @trickvi on June 30, 2013 21:10

I'm rather fond of the ETag header instead of Last-Modified because you can do more with strings (e.g. hashes) than dates (the ETag value can even be a date if you want to fall back on datapackage.json's last_modified value).

One thing I've been thinking about is to suggest an optional checksum value for resources in data packages (since they are retrieved separately). This value could also be used as the ETag value in case it gets added to the standard (a discussion outside this issue). So I vote for ETag instead of Last-Modified.

rufuspollock commented Apr 29, 2016

@tryggvib how important is this? At the moment I've given this only 1 star -- you haven't been bugging me about it ;-) However, if this is a real blocker for your potential use case, it may be worth upgrading its priority :-)
