[Post] Review of PDF data wrangling tools #155
Comments
This was covered here @tlevine, no? http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html
@danfowler @tlevine's article was a great specific walkthrough - like a tutorial. This was more of a "complete" review of the tools out there so I think a bit different. We should finish the pad and then post ... :-)
Indeed mine references only like half of the things you mention. ScraperWiki have a proprietary PDF reader that they say is quite good. On 19 Aug 02:29, Rufus Pollock wrote:
Good point – I’ve added that to the proprietary list.
That's a 404 for me. Plus:
@tfmorris it won't go live until tomorrow as per the date ;-) Check the commit if you want to review in advance ...
Thanks. Online now. Had to give it a nudge to rebuild. |
Write a post reviewing data wrangling tools for PDFs
Text in progress at:
http://pad.okfn.org/p/labs-post-pdf-tools
inlined below
Based on research material in rufuspollock/ideas#52
Questions:
Libraries for Extracting Data and Text from PDFs: A Review
- Authors: Rufus Pollock [add your name here if you contribute and want to be credited]
- Who is this for? Data wranglers who would be looking to extract information from PDFs
- We should try and offer opinions on tools where possible
  [should actually review the tools, i.e. which is best? pros/cons? level of capability required? reliability?]
  [paragraph on crowd-scraping as well? "alternative approaches when your geeks can't do it"]
  [perhaps find a sample PDF and see how each tool does? show the differences in output?] - This is a GREAT idea.
Extracting data from PDFs unfortunately remains a common data wrangling task. This post reviews various tools and services for doing this, with a focus on free (and preferably open source) options.
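For the "sample PDF" comparison suggested in the notes above, one possible way to show the differences in output is to diff the text each tool produces for the same file. This is only a rough sketch using Python's standard `difflib`; the function names (`normalize`, `compare_outputs`) and the sample strings are made up for illustration, and it assumes each tool's output has already been saved or captured as plain text.

```python
import difflib


def normalize(text):
    """Collapse runs of whitespace and drop blank lines.

    Extraction tools differ a lot in spacing, so this keeps layout
    noise from dominating the comparison.
    """
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def compare_outputs(name_a, text_a, name_b, text_b):
    """Return (similarity ratio, unified diff lines) for two tool outputs."""
    a = normalize(text_a)
    b = normalize(text_b)
    # Line-level similarity between 0.0 and 1.0.
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    diff = list(
        difflib.unified_diff(a, b, fromfile=name_a, tofile=name_b, lineterm="")
    )
    return ratio, diff


if __name__ == "__main__":
    # Hypothetical outputs from two extraction tools run on the same PDF.
    out_a = "Year   Total\n2012   10\n2013   12\n"
    out_b = "Year Total\n2012 10\n2013 l2\n"
    ratio, diff = compare_outputs("tool-a", out_a, "tool-b", out_b)
    print("similarity: %.2f" % ratio)
    print("\n".join(diff))
```

A line-level diff like this says nothing about table structure, so it would only be a first pass before looking at the tools that target tables specifically.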
3 categories:
The last case is really a situation for OCR (optical character recognition) so we're going to ignore it here.
[should include a short para on OCR too, just to provide an indication of the limits of automated extraction without much pre-processing]
[[TODO: some nice PDF screenshots - perhaps we can reference]]
Generic (PDF -> text)
Tables from PDF
Existing open services
Existing proprietary free or paid-for services
Google App Engine used to do this: http://developers.google.com/appengine/docs/python/conversion/overview
By Language
@maxogden has this list of Node libraries and tools:
https://gist.github.com/maxogden/5842859
Here's a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247
Other good intros