Create dump of Lagotto JSON API responses #343
If Zenodo is a practical solution for the monthly CSV file (#339), then we want to do the same for the monthly JSON dump.
+1 for this if it's easy enough to do. (Cameron Neylon, cn@cameronneylon.net, http://cameronneylon.net, @cameronneylon, http://orcid.org/0000-0002-0068-716X)
We should generate three JSON files, one each from the works, references, and events API endpoints. All three endpoints support pagination with a default page size of 1,000. As of today, alm.plos.org has 165,222 works (166 pages), 7,433,520 references (7,434 pages), and 4,363,396 events (4,364 pages), for a total of about 12,000 API calls. In theory we could parallelize the API calls, but the API output could then change because of updates to the data while we crawl. The default sort order for
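A minimal sketch of what a sequential crawl of one paginated endpoint could look like, writing one page of results per line of the dump file. The endpoint path and the page query parameter are assumptions for illustration, not confirmed Lagotto API details.

```ruby
require "net/http"
require "uri"

# Fetch each page of one resource in order and append the raw JSON body
# (up to 1,000 records per page) to a single dump file.
def crawl(base_url, resource, total_pages, out_path)
  File.open(out_path, "w") do |file|
    1.upto(total_pages) do |page|
      uri = URI("#{base_url}/api/#{resource}?page=#{page}")
      response = Net::HTTP.get_response(uri)
      raise "HTTP #{response.code} on page #{page}" unless response.is_a?(Net::HTTPSuccess)
      file.puts(response.body)
    end
  end
end

# e.g. roughly 166 pages of works at 1,000 records per page
crawl("http://alm.plos.org", "works", 166, "works.json")
```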
Part of the work could be some simple benchmarking to understand how long it takes to generate the JSON files.
@mfenner, after looking through the code I have a better understanding of the issues you mentioned with respect to data being updated. Here's my understanding of the issues...

Potential issues for works and references

For
If de-duping were required, I think Lagotto should handle that before providing the final export file (a rough sketch of what that could look like follows this comment). It would be a rather strange requirement to push onto consumers to have to care about de-duping.

Potential issues for events

The default sort order for the
Will post more about possible solutions and benchmarks soon...
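Picking up the de-duping point above: a minimal sketch, assuming each record carries a stable "id" key (an assumption about the response shape), of collapsing duplicates before writing the final export so consumers never have to.

```ruby
# Keep the first occurrence of each record, keyed on its identifier.
def dedupe(records)
  seen = {}
  records.each { |record| seen[record["id"]] ||= record }
  seen.values
end

# Usage: merged pages collected during the crawl, de-duplicated once at the end.
# export = dedupe(all_pages.flatten)
```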
Thanks for the update. Happy to change the sort order in the API calls (either as default or as query parameter), e.g.
What do you think about changing the sort order to be by

This has the following benefits:
These benefits would apply to

The downside I see is that if a work we already dumped got updated after we dumped it (but before we finished dumping everything else), then we wouldn't get the latest version. That update would have to come the next time we dumped the API. I don't find this problematic, since we're dumping the API at a point in time and we've accurately captured that work at the point in time it got dumped. This can be solved with the

Ultimately I think it depends on whether we want the absolute latest snapshot of everything by the end of the dumping process, or whether it's okay for a work to be snapshotted at the time it was dumped (since we know the next time we dump the API we'll pick up any updates that came after we dumped a particular item). With either of these solutions we don't need to implement a max_id/since_id, which I think is good: Twitter doesn't worry about updated tweets, only new tweets being added to the front of the stack, and that lets their clients avoid having to de-dupe tweets. WDYT?
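For context only, since the comment above argues this can be avoided: a rough sketch of what a since_id-style crawl could look like. The since_id and sort query parameters are hypothetical, not existing Lagotto API options, and the response shape is also an assumption.

```ruby
require "net/http"
require "json"
require "uri"

# Cursor-style crawl: keep asking for records with an id greater than the
# last one seen, so rows updated mid-crawl never shuffle between pages.
def crawl_by_id(base_url, resource)
  since_id = 0
  results = []
  loop do
    uri = URI("#{base_url}/api/#{resource}?since_id=#{since_id}&sort=id")
    page = JSON.parse(Net::HTTP.get(uri))
    records = page.fetch(resource, []) # assumed response shape
    break if records.empty?
    results.concat(records)
    since_id = records.last["id"]
  end
  results
end
```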
On a benchmarking note: I pushed up a simple

Here are some of the benchmarks from crawling the first few pages of works:
The events and works APIs are taking much, much longer to respond. I'm going to let this run for a bit over lunch and see what the results are when I return. I will then take a peek at the queries behind these endpoints and see if there are any low-hanging optimizations we can make to improve the response time.
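A small timing sketch, assuming the /api/&lt;resource&gt; paths used in the crawl sketch above, for measuring how long the first page of each endpoint takes to come back before digging into the underlying queries.

```ruby
require "benchmark"
require "net/http"
require "uri"

# Time a single first-page request per endpoint to spot the slow ones.
%w[works references events].each do |resource|
  elapsed = Benchmark.realtime do
    Net::HTTP.get_response(URI("http://alm.plos.org/api/#{resource}?page=1"))
  end
  puts format("%-10s page 1: %.2fs", resource, elapsed)
end
```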
I like the approach of using the

If we go with
But if we introduce a

With respect to making it the default sort order: I don't know how many clients you have, but would making a new default cause unexpected changes for someone currently integrating with the data? Any reason to change the default as opposed to adding a new sorting option?
You are of course right about
Okay, it looks like datetime may be slightly slower; here's a test I ran locally with about a million records: https://gist.github.com/zdennis/ee277a81c4327927f7e0

As it relates to the size of the index, this seems like a really useful resource for inspecting index sizes: http://aadant.com/blog/2014/02/04/how-to-calculate-a-specific-innodb-index-size/
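Along the same lines as the gist, a sketch of timing an ORDER BY on the datetime column against the integer primary key from a Rails console in the Lagotto app; the Work model name and the updated_at column are assumptions about the local schema.

```ruby
require "benchmark"

# Compare paging deep into the table ordered by a datetime column vs the
# integer primary key; run inside a Rails console so ActiveRecord is loaded.
%i[updated_at id].each do |column|
  elapsed = Benchmark.realtime do
    Work.order(column).limit(1_000).offset(500_000).to_a
  end
  puts format("order by %-10s %.3fs", column, elapsed)
end
```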
@mfenner, right now I'm planning on dumping each of the three API endpoints you listed above to its own file, e.g.:
I'm planning on including a README in that ZIP file as well. Zenodo is currently the plan for storing this information. If you have any thoughts on the Zenodo deposition attributes, let me know; otherwise they'll inherit most of what we have on the current deposition, minus a few changes to reflect a different title, description, etc. I can share an initial list on Monday. The crawling aspect of this is done; I'm working on the rake tasks to create the Zenodo export, e.g.
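A hedged sketch of what such a rake task might look like; the task name, file locations, and the use of the rubyzip gem are assumptions here, not the actual Lagotto implementation.

```ruby
require "zip"

namespace :zenodo do
  desc "Bundle the JSON dumps and a README into a single ZIP for upload"
  task :package do
    # Collect the per-endpoint dump files plus the README and zip them up.
    files = Dir.glob("tmp/*.json") + ["tmp/README.md"]
    Zip::File.open("tmp/lagotto-snapshot.zip", Zip::File::CREATE) do |zipfile|
      files.each { |path| zipfile.add(File.basename(path), path) }
    end
  end
end
```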
The risks of this approach right now are:
If anything else comes to mind or if I'm missing anything, let me know! Thanks,
Sounds good. Total database size is 18.7 GB.
…for-api-snapshots Issues/343 add user docs for api snapshots
…apshot-filenames Issues/343 add date to snapshot filenames
We want to create a zipped dump of all Lagotto API responses, ideally automated to run, for example, every month. A similar approach is used by
To limit the file size, we can generate the dump in batches of 1,000 or 10,000 articles, sorted by publication date.
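A rough sketch of that batching idea: write every 10,000 records, sorted by publication date, to its own gzipped file so no single dump file grows too large. The "published" key is an assumption about the record shape, and the record source is left abstract.

```ruby
require "zlib"
require "json"

# Sort records by publication date and write them out in gzipped batches.
def write_batches(records, batch_size: 10_000, prefix: "works")
  sorted = records.sort_by { |record| record["published"].to_s }
  sorted.each_slice(batch_size).with_index(1) do |batch, index|
    Zlib::GzipWriter.open(format("%s-%04d.json.gz", prefix, index)) do |gz|
      gz.write(JSON.generate(batch))
    end
  end
end
```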