[Post] Review of PDF data wrangling tools #155

Closed
2 tasks
rufuspollock opened this issue Dec 15, 2013 · 8 comments

@rufuspollock
Member

rufuspollock commented Dec 15, 2013

Write a post reviewing data wrangling tools for PDFs

Text in progress at http://pad.okfn.org/p/labs-post-pdf-tools (inlined below)

Based on research material in rufuspollock/ideas#52

Questions:

  • should we split into a series e.g. [ans: no]
    • libraries and tools
    • web services
    • specific examples - e.g. scraping in python, scraping in node etc
    • using scraperwiki
  • should this go on schoolofdata? [ans: no]

Libraries for Extracting Data and Text from PDFs: A Review

- Authors: Rufus Pollock [add your name here if you contribute and want to be credited]
- Who is this for? Data wranglers looking to extract information from PDFs
- We should try to offer opinions on tools where possible [should actually review the tools, i.e. which is best? pros/cons? level of capability required? reliability?] [paragraph on crowd-scraping as well? "alternative approaches when your geeks can't do it"] [perhaps find a sample PDF and see how each tool does? show the differences in output?] - This is a GREAT idea.

Extracting data from PDFs unfortunately remains a common data wrangling task. This post reviews various tools and services for doing this, with a focus on free (and preferably open source) options.

3 categories:

  • Extracting text from PDF
  • Extracting tables from PDF
  • Extracting data (text or otherwise) from PDFs with scans

The last case is really a situation for OCR (optical character recognition) so we're going to ignore it here.
[should include a short para on OCR too, just to provide an indication of the limits of automated extraction without much pre-processing]

[[TODO: some nice PDF screenshots - perhaps we can reference]]

Generic (PDF -> text)

  • PDFMiner - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
    • Pure python
  • pdftohtml - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf.
    • Command-line Linux
  • pdftoxml - command line utility to convert PDF to XML built on poppler.
  • docsplit - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
  • pypdf2xml - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
  • pdf2htmlEX - Convert PDF to HTML without losing text or format. C++. Fast. Primarily focused on producing HTML that exactly resembles the original PDF. Limited use for straightforward text extraction.
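Several of the tools above (pdftohtml in XML mode, pypdf2xml, PDFMiner) give you text fragments together with their position on the page, and that position is what lets you rebuild rows and columns. Here is a rough sketch of the idea in Python, using only the standard library. Assumptions: the poppler build of `pdftohtml` is on the PATH; the sample XML is hand-written to mimic the `<page>`/`<text top=… left=…>` layout that `pdftohtml -xml` produces; the function names are made up for this illustration; and grouping by an exact `top` value is a simplification, since real PDFs usually need some tolerance.

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import defaultdict


def pdf_to_xml(pdf_path):
    """Run pdftohtml in XML mode and return the raw XML bytes.

    Assumes poppler's pdftohtml is installed; -xml, -i (ignore images)
    and -stdout are flags of that tool.
    """
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        check=True,
        capture_output=True,
    )
    return result.stdout


def parse_fragments(xml_data):
    """Return (page, top, left, text) tuples for every non-empty <text> element."""
    root = ET.fromstring(xml_data)
    fragments = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for el in page.iter("text"):
            # itertext() also picks up text nested inside <b>, <i> or <a>
            text = "".join(el.itertext()).strip()
            if text:
                fragments.append(
                    (page_no, int(el.get("top")), int(el.get("left")), text)
                )
    return fragments


def group_rows(fragments):
    """Group fragments that share a page and top coordinate, ordered left to right."""
    rows = defaultdict(list)
    for page_no, top, left, text in fragments:
        rows[(page_no, top)].append((left, text))
    return [[text for _, text in sorted(cells)] for _, cells in sorted(rows.items())]


# Hand-written sample in the style of pdftohtml -xml output (not from a real PDF)
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="300" width="60" height="16" font="0">Population</text>
<text top="100" left="80" width="60" height="16" font="0"><b>Country</b></text>
<text top="130" left="80" width="50" height="16" font="0">Iceland</text>
<text top="130" left="300" width="50" height="16" font="0">320000</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for row in group_rows(parse_fragments(SAMPLE_XML)):
        print(row)
```

For a real file you would pass the output of `pdf_to_xml("some.pdf")` to `parse_fragments` instead of the sample string. Note that the fragments come out of the XML in document order, not reading order, which is why the sort on `left` matters.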

Tables from PDF

Existing open services

Existing proprietary free or paid-for services

Google App Engine used to do this: http://developers.google.com/appengine/docs/python/conversion/overview

By Language

@maxogden has this list of Node libraries and tools:

https://gist.github.com/maxogden/5842859

Here's a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247

Other good intros

@danfowler
Contributor

This was covered here @tlevine, no? http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html

@rufuspollock
Member Author

@danfowler @tlevine's article was a great specific walkthrough - like a tutorial. This was more of a "complete" review of the tools out there so I think a bit different. We should finish the pad and then post ... :-)

@tlevine
Contributor

tlevine commented Aug 19, 2015

Indeed mine references only like half of the things you mention.

ScraperWiki have a proprietary PDF reader that they say is quite good.


@andylolz
Collaborator

Good point – I’ve added that to the proprietary list.

@rufuspollock
Member Author

@tfmorris
Contributor

That's a 404 for me.

Plus:

  • The Count von Count image on the 404 page is slow to load (I'm on a train, but still, PNG?) and looks like it may be a copyrighted Sesame Street character.
  • The 404 page has no search box, which would probably be the quickest route to finding the missing page.

@rufuspollock
Member Author

@tfmorris it won't go live until tomorrow as per the date ;-) Check the commit if you want to review in advance ...

@danfowler
Contributor

Thanks. Online now. Had to give it a nudge to rebuild.
