Create dump of Lagotto JSON API responses #343

Closed · mfenner opened this issue Jun 23, 2015 · 16 comments

mfenner (Member) commented Jun 23, 2015

We want to create a zipped dump of all Lagotto API responses, ideally automated to run on a schedule, for example every month. A similar approach is used by

To limit the file size, we can generate the dump in batches of 1,000 or 10,000 articles, sorted by publication date.
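For illustration, a minimal sketch of the batching idea in Ruby (the per_page, page and sort parameter names and the response envelope are assumptions, not necessarily what the Lagotto API exposes):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Sketch: pull works in batches of 1,000 sorted by publication date and
# write each batch to its own JSON file to keep individual files small.
BASE_URL   = 'http://alm.plos.org/api/works' # example instance
BATCH_SIZE = 1_000

page = 1
loop do
  uri   = URI("#{BASE_URL}?per_page=#{BATCH_SIZE}&page=#{page}&sort=published_on")
  body  = JSON.parse(Net::HTTP.get(uri))
  works = body.fetch('works', [])            # response envelope key is a guess
  break if works.empty?

  File.write(format('works_%04d.json', page), JSON.pretty_generate(works))
  page += 1
end
```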

@mfenner mfenner changed the title Create dump of JSON API responses Create dump of Lagotto JSON API responses Jun 25, 2015
mfenner (Member, Author) commented Jun 25, 2015

If Zenodo is a practical solution for the monthly CSV file (#339), then we want to do the same for the monthly JSON dump.

cameronneylon commented (via email)

+1 for this if it's easy enough to do

mfenner (Member, Author) commented Jul 6, 2015

We should generate three JSON files from the following three API calls:

  • /api/works
  • /api/references
  • /api/events

All three API endpoints support pagination with a default page size of 1,000. alm.plos.org as of today has 165,222 works (166 pages), 7,433,520 references (7,434 pages), and 4,363,396 events (4,364 pages), for a total of about 12,000 API calls.

In theory we could parallelize the API calls, but the API output could change underneath us because of updates to the data. The default sort order for works and references is published_on desc, and updated_at desc for events. If updates while fetching the data are a concern, we can consider a max_id/since_id approach similar to what Twitter is doing: https://dev.twitter.com/rest/public/timelines
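
For a sense of scale, a rough sketch of a sequential crawl over the three endpoints (the meta/total_pages envelope fields and the top-level result keys here are guesses, not the documented response format):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Walk every page of one endpoint in order and collect the results.
def crawl(endpoint)
  page, pages, results = 1, nil, []
  loop do
    uri   = URI("http://alm.plos.org/api/#{endpoint}?page=#{page}")
    body  = JSON.parse(Net::HTTP.get(uri))
    pages ||= (body['meta'] || {})['total_pages'] # assumed metadata field
    results.concat(body.fetch(endpoint, []))
    break if pages.nil? || page >= pages
    page += 1
  end
  results
end

# 166 + 7,434 + 4,364 pages ≈ 12,000 requests when run back to back
%w[works references events].each { |endpoint| crawl(endpoint) }
```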

mfenner (Member, Author) commented Jul 6, 2015

Part of the work could be some simple benchmarking to understand how long it takes to generate the JSON files.

zdennis (Contributor) commented Jul 9, 2015

@mfenner, after looking through the code I have a better understanding of the issues you mentioned with respect to data being updated.

Here's my understanding of the issues...

Potential issues for works and references

For /api/works and /api/references updates during the import could cause the following issues:

  • a work could show up more than once and get dumped twice
  • a work could be missed if it got updated and moved to a page of the API we had already processed

If de-duping were required, I think Lagotto should handle that before producing the final export file. It would be a rather strange requirement to push de-duping onto potential consumers of the dump.
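
A sketch of what de-duping on the Lagotto side might look like before the export is written (assuming every record carries a stable 'id'; the intermediate file name is hypothetical):

```ruby
require 'json'

# Collapse duplicates that appear when pages shift during the crawl, keeping
# the last occurrence of each record. Note: this only handles duplicates;
# records that moved onto an already-processed page are still missed.
def dedupe(records, key = 'id')
  records.each_with_object({}) { |rec, seen| seen[rec[key]] = rec }.values
end

works = JSON.parse(File.read('works_raw.json')) # hypothetical intermediate file
File.write('works_deduped.json', JSON.generate(dedupe(works)))
```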

Potential issues for events

The default sort order for the /api/events endpoint is updated_at.

  • We could miss events that got updated during the JSON dump and were added to an earlier page that we had already processed
  • We could also receive duplicates of events if other events were updated or added that caused an event we already processed to show up on a later page

Will post more about possible solutions and benchmarks soon...

mfenner (Member, Author) commented Jul 9, 2015

Thanks for the update. Happy to change the sort order in the API calls (either as the default or as a query parameter), e.g. updated_at asc for all three API calls.

zdennis (Contributor) commented Jul 9, 2015

What do you think about changing the sort order to be by id asc as a sort parameter or the default sort?

This has the following benefits:

  • it will never be updated in a way that changes its place in the API results, e.g. you can update a work and it won't suddenly shift to an earlier or later page
  • it ensures that we will get the newest works as we reach the end of dumping the API

These benefits would apply to /api/events and /api/references as well if they also provided this sort option.

The downside I see is that if a work we already dumped got updated after we dumped it (but before we finished dumping everything else) then we wouldn't get the latest version. That update would have to come the next time we dumped the API. I don't find this problematic since we're dumping the API at a point in time, and we've accurately captured that work at the point in time it got dumped.

This can be solved with the updated_at asc sort order that you mentioned above, but that requires building de-duping logic at the end (which is something we can do, just a little more work).

Ultimately I think it depends on whether we want the absolute latest snapshot of everything by the end of the dumping process, or whether it's okay for a work to be snapshotted at the time it was dumped (since we know the next time we dump the API we'll get any updates that came after a particular item was dumped).

With either of these solutions we don't need to implement max_id/since_id, which I think is good: Twitter only has to worry about new tweets being added to the front of the stack, not about updated tweets, and that approach exists to spare clients from de-duping tweets.

WDYT?
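
For illustration, a rough cursor-style sketch of the created_at asc idea (this assumes the API grew a created_at_gt filter and a sort parameter, neither of which exists today):

```ruby
require 'net/http'
require 'json'
require 'uri'

cursor  = nil
records = []
loop do
  params = { 'per_page' => 1000, 'sort' => 'created_at' }
  params['created_at_gt'] = cursor if cursor # hypothetical filter parameter
  uri = URI('http://alm.plos.org/api/works')
  uri.query = URI.encode_www_form(params)
  page = JSON.parse(Net::HTTP.get(uri)).fetch('works', [])
  break if page.empty?

  records.concat(page)
  # new records sort after the cursor, so inserts and updates never shift
  # pages we have already fetched
  cursor = page.last['created_at']
end
```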

zdennis (Contributor) commented Jul 9, 2015

On a benchmarking note: I pushed up a simple ApiCrawler on the issues/343_dump-api-to-json branch.

Here are some of the benchmarks from crawling the first few pages of works:

http://alm.plos.org/api/works took 22.147557 seconds
http://alm.plos.org/api/works?page=2 took 20.649916 seconds
http://alm.plos.org/api/works?page=3 took 20.978108 seconds
http://alm.plos.org/api/works?page=4 took 21.538611 seconds
http://alm.plos.org/api/works?page=5 took 19.903059 seconds
http://alm.plos.org/api/works?page=6 took 20.552962 seconds

The events and references APIs are taking much, much longer to respond. I'm going to let this run for a bit over lunch and see what the results are when I return. I will then take a peek at the queries behind these endpoints and see if there are any low-hanging optimizations we can make to improve the response time.
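
For reference, timings like the ones above can be reproduced with something as simple as the following (the actual ApiCrawler internals may differ):

```ruby
require 'benchmark'
require 'net/http'
require 'uri'

# Time the first few pages of the works endpoint, one request at a time.
(1..6).each do |page|
  url = "http://alm.plos.org/api/works?page=#{page}"
  seconds = Benchmark.realtime { Net::HTTP.get(URI(url)) }
  puts "#{url} took #{seconds} seconds"
end
```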

mfenner (Member, Author) commented Jul 9, 2015

The /events and /references API endpoints are fairly new (May 2015), and I haven't tested them for this kind of usage (usually you want to get the events or references for a single work). I am sure they can be optimized. BTW, there is also /api_requests in the admin dashboard, but we only look at the last 1,000 requests.

mfenner (Member, Author) commented Jul 9, 2015

I like the approach of using the id, because it provides a snapshot of everything in the system and there is no need to deduplicate. Following this logic, should we maybe do id desc, so that we don't have to worry about works or references being added while calling the API? An alternative to id would be created_at; I think it carries basically the same information. I have not exposed the id column in the API, so I would have a small preference for created_at.

If we go with id desc or created_at desc, then I think we make this the default sort order. For id asc this should probably be a query parameter, as in most places the application sorts by descending date/datetime.

zdennis (Contributor) commented Jul 9, 2015

created_at sounds good. The only thing I don't like about desc order is that it introduces the problem caused by pushing things onto the front of the stack.

But if we introduce a since parameter or similar, we could ensure that pushing things onto the front wouldn't affect the results as we're paging through them. Thoughts?

With respect to making it the default sort order, I don't know how many clients you have, but would a new default cause unexpected changes for someone currently integrating with the data? Any reason to change the default as opposed to adding a new sorting option?

mfenner (Member, Author) commented Jul 9, 2015

You are of course right about desc putting things at the front of the stack. So asc is better, and there isn't really a need to make this the default sort order. What I don't know is whether an index on a datetime column is slower than one on an int column, plus we would need an additional index. But we can go with created_at.

zdennis (Contributor) commented Jul 9, 2015

Okay, created_at it is.

It looks like datetime may be slightly slower; here's a test I ran locally with about a million records: https://gist.github.com/zdennis/ee277a81c4327927f7e0

As for the size of the index, this seems like a really useful resource for inspecting index sizes: http://aadant.com/blog/2014/02/04/how-to-calculate-a-specific-innodb-index-size/
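
If we go this route, the migration would presumably be something like the sketch below (table names are assumptions and would need checking against the actual schema):

```ruby
# Sketch of a migration adding created_at indexes for the crawl sort order.
class AddCreatedAtIndexes < ActiveRecord::Migration
  def change
    add_index :works, :created_at
    add_index :events, :created_at      # table name is an assumption
    add_index :references, :created_at  # table name is an assumption
  end
end
```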

zdennis (Contributor) commented Jul 9, 2015

@mfenner, right now I'm planning on dumping each of the three API endpoints you listed above to its own file, e.g.:

  • /api/works => api_works.json
  • /api/events => api_events.json
  • /api/references => api_references.json

and then bundling the three files into api_dump.zip.

I'm planning on including a README in that ZIP file as well. Zenodo is currently the plan for storing this information. If you have any thoughts on the Zenodo deposition attributes, let me know; otherwise they'll inherit most of what we have on the current deposition, minus a few changes to reflect a different title, description, etc. I can share an initial list on Monday.

The crawling aspect of this is done; I'm working on the rake tasks to create the Zenodo export, e.g.:

  • rake api:dump
  • rake api:export_to_zenodo

The risks of this approach right now are:

  • we will need to do some performance improvements on /api/events and /api/references.
  • an assumption that we can store this information on Zenodo (if not, we'll have to implement storage for S3 or similar)
  • an assumption that we won't need to split up the downloaded files or zip files by hitting Zenodo's file size limit

If anything else comes to mind or if I'm missing anything, let me know! Thanks,
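
For a sense of the shape of it, a rough outline of the api:dump task (the ApiCrawler#dump_to interface here is a guess, and the zipping assumes the rubyzip gem; the real implementation will live in the PR):

```ruby
# lib/tasks/api.rake -- rough outline only
require 'zip' # rubyzip gem (assumed)

namespace :api do
  desc 'Dump the works, events and references endpoints to api_dump.zip'
  task dump: :environment do
    files = %w[api_works.json api_events.json api_references.json]
    files.each do |file|
      endpoint = file[/api_(\w+)\.json/, 1] # "works", "events", "references"
      ApiCrawler.new(endpoint: endpoint).dump_to(file)
    end

    File.write('README.md', "Lagotto API snapshot generated #{Time.now.utc}")
    Zip::File.open('api_dump.zip', Zip::File::CREATE) do |zipfile|
      (files + ['README.md']).each { |f| zipfile.add(f, f) }
    end
  end
end
```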

mfenner (Member, Author) commented Jul 9, 2015

Sounds good. Total database size is 18.7 GB.

zdennis (Contributor) commented Jul 15, 2015

@mfenner, PR #383 has been updated with a working implementation for this.

mfenner referenced this issue Jul 16, 2015: Issues/343 add user docs for api snapshots (…for-api-snapshots)
mfenner referenced this issue Jul 16, 2015: Issues/343 add date to snapshot filenames (…apshot-filenames)
mfenner closed this as completed Jul 17, 2015
mfenner added this to the Lagotto 4.3 milestone Jul 18, 2015
mfenner pushed a commit that referenced this issue Jul 27, 2015