Data version API #7
From @trickvi on June 20, 2013 12:37
I would like to be able to use data.okfn.org as an intermediary between my software and the data packages it uses, and to quickly check whether a new version of the data is available (e.g. if I've cached the package on a local machine).
There are ways to do it with the current setup:
I propose that data.okfn.org provide an internal system that lets users quickly check whether a new version has been released. This does not have to be an API. We could leverage HTTP's caching mechanism using an ETag header that would contain some hash value. This hash value could, for example, be the sha value of the heads ref objects served via the GitHub API:
Software that works with data packages could then implement a caching strategy: send a GET request for datapackage.json with an If-None-Match header, and either receive a new version of the descriptor (and look at the version in that file) or just serve the data from its cache.
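The client side of that strategy could be sketched roughly as follows. This is only an illustration using Python's standard library: the function name `fetch_descriptor`, the dict-based `cache`, and the injectable `opener` parameter are assumptions made for the example, not part of data.okfn.org or any existing data package tool.

```python
import urllib.error
import urllib.request


def fetch_descriptor(url, cache, opener=urllib.request.urlopen):
    """Return the descriptor body, revalidating any cached copy via ETag.

    `cache` is a plain dict mapping url -> (etag, body); hypothetical,
    a real client would probably persist this to disk.
    """
    request = urllib.request.Request(url)
    cached = cache.get(url)
    if cached is not None:
        # Ask the server to answer 304 if our copy is still current.
        request.add_header("If-None-Match", cached[0])
    try:
        with opener(request) as response:
            body = response.read()
            etag = response.headers.get("ETag")
            if etag is not None:
                cache[url] = (etag, body)
            return body
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304 and cached is not None:
            return cached[1]
        raise
```

On the first call the cache is empty, so a plain GET is sent and the returned ETag is stored. Later calls send the stored ETag and fall back to the cached body when the server answers 304.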
Copied from original issue: frictionlessdata/project#51
So the big question here is:
From @trickvi on June 30, 2013 9:28
Are there any data packages on data.okfn.org that aren't stored in GitHub? If so, is there some other way to generate a hash value, timestamps, etc.? I think the ETag value will always be closely tied to the underlying storage mechanism (and how that tracks changes to resources). If it isn't possible to get such a value, you could always fall back on hashing/serving the version number (which works OK, but might not be updated) or compute an MD5 hash of the resources (computationally heavy for data.okfn.org).
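The fallback order described above could be sketched like this. The function name `derive_etag`, its parameters, and the `version-` and `md5-` prefixes are all hypothetical, chosen only to make the example readable; nothing here reflects how data.okfn.org actually works.

```python
import hashlib


def derive_etag(storage_revision=None, version=None, resources=None):
    """Pick an ETag value, preferring the cheapest reliable source.

    storage_revision: a change token from the storage backend
                      (e.g. a git commit sha), if one is available.
    version:          the version string from datapackage.json.
    resources:        an iterable of bytes chunks covering the resources.
    """
    if storage_revision:
        return '"%s"' % storage_revision
    if version:
        # Cheap, but only changes if the publisher bumps the version.
        return '"version-%s"' % version
    if resources is not None:
        # Always reflects the content, but requires reading everything.
        digest = hashlib.md5()
        for chunk in resources:
            digest.update(chunk)
        return '"md5-%s"' % digest.hexdigest()
    return None  # nothing to derive a validator from
```

The surrounding double quotes are part of the value because HTTP entity tags are quoted strings.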
Then there is the issue of checking for updates. I think this depends on the data, but I don't think any data package is being updated in real time, so we don't have to check for updates in real time. It might be enough to check once a day or even once a week/month (I haven't looked at the existing data packages to infer what time period is necessary). This value could be saved in memory as some sort of caching mechanism, but looking at GitHub's caching mechanism I wouldn't rely on it as a caching strategy.
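A minimal sketch of keeping such a value in memory and refreshing it at most once per period might look like the following. The class name `PeriodicallyRefreshed`, the `loader` callable, and the one-day default are assumptions for illustration; the right period would depend on how often the packages actually change.

```python
import time


class PeriodicallyRefreshed:
    """Hold a value in memory and reload it only after max_age_seconds."""

    def __init__(self, loader, max_age_seconds=24 * 60 * 60, clock=time.monotonic):
        self._loader = loader          # callable that recomputes the value
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so it can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._max_age
        if stale:
            self._value = self._loader()
            self._fetched_at = now
        return self._value
```

Here `loader` would be whatever computes the current ETag for a package, so the expensive lookup runs once per period instead of once per request.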
Now these kinds of checks could of course be done by those who use the data but I don't think that it's the role of data package users. They just want to use the data, not create a mechanism to check for updates. They will most likely use some intermediate software that's unaware of the data context so that software package needs a generic caching strategy.
From @trickvi on June 30, 2013 21:10
I prefer the ETag header to Last-Modified because you can do more with strings (e.g. hashes) than with dates. The ETag value can even be a date if you want to fall back on datapackage.json's last_modified value.
One thing I've been thinking about is to suggest an optional checksum value for resources in data packages (since they are retrieved separately). This value could also be used as the ETag value in case it gets added to the standard (a discussion outside this issue). So I vote for ETag instead of Last-Modified.