[Post] Review of PDF data wrangling tools #155

rufuspollock · 2013-12-15T20:07:46Z

Write a post reviewing data wrangling tools for PDFs

Text in progress at: ~~http://pad.okfn.org/p/labs-post-pdf-tools~~ inlined below

Based on research material in rufuspollock/ideas#52

Questions:

- Should we split this into a series? [ans: no] For example:
  - libraries and tools
  - web services
  - specific examples - e.g. scraping in Python, scraping in Node etc
  - using ScraperWiki
- Should this go on schoolofdata? [ans: no]

Libraries for Extracting Data and Text from PDFs: A Review

- Authors: Rufus Pollock [add your name here if you contribute and want to be credited]
- Who is this for? Data wranglers looking to extract information from PDFs
- We should try to offer opinions on tools where possible
  - [should actually review the tools, i.e. which is best? pros/cons? level of capability required? reliability?]
  - [paragraph on crowd-scraping as well? "alternative approaches when your geeks can't do it"]
  - [perhaps find a sample PDF and see how each tool does? show the differences in output?] - This is a GREAT idea.

Extracting data from PDFs unfortunately remains a common data wrangling task. This post reviews various tools and services for doing this, with a focus on free (and preferably open source) options.

3 categories:

Extracting text from PDF
Extracting tables from PDF
Extracting data (text or otherwise) from PDFs with scans

The last case is really a situation for OCR (optical character recognition) so we're going to ignore it here.
[should include a short para on OCR too, just to provide an indication of the limits of automated extraction without much pre-processing]

[[TODO: some nice PDF screenshots - perhaps we can reference]]

Generic (PDF -> text)

PDFMiner - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
- Pure Python
pdftohtml - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf.
- Command-line, Linux
pdftoxml - command line utility to convert PDF to XML built on poppler.
docsplit - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pypdf2xml - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
pdf2htmlEX - Convert PDF to HTML without losing text or format. C++. Fast. Primarily focused on producing HTML that exactly resembles the original PDF. Limited use for straightforward text extraction.
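
As a rough illustration of driving one of these command-line tools from a script, here is a minimal Python sketch that shells out to pdftohtml and captures its XML output. The helper names `build_command` and `pdf_to_xml` are hypothetical, and the sketch assumes a poppler-style pdftohtml on the PATH that accepts the `-xml`, `-i` and `-stdout` options; check `pdftohtml -h` on your own system before relying on it.

```python
import shutil
import subprocess


def build_command(pdf_path):
    """Build the pdftohtml invocation (hypothetical helper for this sketch)."""
    # -xml: emit XML, -i: ignore images, -stdout: write to standard output.
    # These flags are assumed from poppler's pdftohtml; other builds may differ.
    return ["pdftohtml", "-xml", "-i", "-stdout", pdf_path]


def pdf_to_xml(pdf_path):
    """Return the XML text for pdf_path, or None if pdftohtml is not installed."""
    if shutil.which("pdftohtml") is None:
        return None
    result = subprocess.run(build_command(pdf_path), capture_output=True, check=True)
    # Decode leniently: PDFs with odd fonts can yield bytes that are not valid UTF-8.
    return result.stdout.decode("utf-8", errors="replace")
```

Returning `None` when the tool is missing keeps the sketch usable on machines without poppler; a real script might prefer to raise an error instead.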

Tables from PDF

http://tabula.nerdpower.org/ - open-source, designed specifically for tabular data. Now easy to install. Ruby-based.
https://github.com/okfn/pdftables - open-source. Created by ScraperWiki, but the original no longer seems to be available, so here is a fork.
http://pdftoxml.sourceforge.net/ - one of the better options for tables, but I have not used it for a while
http://pdftohtml.sourceforge.net/ - Linux only, as far as I can tell
https://github.com/liberit/scraptils/blob/master/scraptils/tools/pdf2csv.py - AGPLv3+, Python. scraptils has other useful tools as well. pdf2csv needs pdfminer==20110515
pdf.js - you probably want a fork like pdf2json or node-pdfreader that integrates this better with node. I have not tried this on tables though ...
Using ScraperWiki + pdftoxml - see the recent tutorial "Get Started With Scraping – Extracting Simple Tables from PDF Documents"
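
To give a feel for how the XML-based approach to tables can work, here is a minimal Python sketch that groups positioned text fragments into rows. It assumes XML in the general shape that `pdftohtml -xml` produces (`<text>` elements carrying `top` and `left` attributes). The sample document, the `rows_from_xml` name and the `tolerance` parameter are invented for illustration, and real output also contains a DOCTYPE, `<fontspec>` elements and much messier layouts.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed pdftohtml -xml shape: one <text> element per
# fragment, positioned by top/left coordinates.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0"><b>Item</b></text>
<text top="100" left="200" width="40" height="12" font="0"><b>Count</b></text>
<text top="121" left="50" width="40" height="12" font="0">apples</text>
<text top="120" left="200" width="40" height="12" font="0">3</text>
</page>
</pdf2xml>"""


def rows_from_xml(xml_text, tolerance=3):
    """Group text fragments into rows by 'top', ordering cells by 'left'."""
    root = ET.fromstring(xml_text)
    table = []
    for page in root.iter("page"):
        fragments = []
        for el in page.iter("text"):
            # itertext() also picks up text nested in <b>/<i> children.
            text = "".join(el.itertext()).strip()
            if text:
                fragments.append((int(el.get("top")), int(el.get("left")), text))
        fragments.sort()
        rows = []  # each entry: (top of the row's first fragment, [(left, text), ...])
        for top, left, text in fragments:
            if rows and abs(top - rows[-1][0]) <= tolerance:
                rows[-1][1].append((left, text))
            else:
                rows.append((top, [(left, text)]))
        for _, cells in rows:
            table.append([text for _, text in sorted(cells)])
    return table


print(rows_from_xml(SAMPLE_XML))
```

The tolerance absorbs the small vertical offsets that often separate cells on the same visual row. Column detection (deciding which `left` values belong to the same column) is left out and is usually the harder part.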

Existing open services

http://pdfx.cs.man.ac.uk/ - has a nice command line interface
- Is this open? Says at bottom of usage that it is powered by http://www.utopiadocs.com/
Scraperwiki - https://views.scraperwiki.com/run/pdf-to-html-preview-1/ and this tutorial

Existing proprietary free or paid-for services

http://www.newocr.com/ - free, no API
http://www.free-ocr.com/ - free, no API, captcha
http://www.onlineocr.net/ - free
http://captricity.com/
https://pdftables.com/ - pay-per-page service

Google App Engine used to offer this: http://developers.google.com/appengine/docs/python/conversion/overview

By Language

@maxogden has this list of Node libraries and tools:

https://gist.github.com/maxogden/5842859

Here's a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247

Other good intros

danfowler · 2015-08-19T09:23:37Z

This was covered here @tlevine, no? http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html

rufuspollock · 2015-08-19T09:29:12Z

@danfowler @tlevine's article was a great specific walkthrough - like a tutorial. This was more of a "complete" review of the tools out there so I think a bit different. We should finish the pad and then post ... :-)

tlevine · 2015-08-19T13:23:35Z

Indeed mine references only like half of the things you mention.

ScraperWiki have a proprietary PDF reader that they say is quite good.


andylolz · 2015-08-20T07:58:48Z

Good point – I’ve added that to the proprietary list.

rufuspollock · 2016-04-17T19:03:47Z

FIXED. http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html

tfmorris · 2016-04-17T22:38:41Z

That's a 404 for me.

Plus:

- The Count von Count image on the 404 page is slow to load (I'm on a train, but still, PNG?) and looks to potentially be a copyrighted Sesame Street character.
- The 404 page has no search box, which would probably be the quickest route to finding the page that is missing.

rufuspollock · 2016-04-18T18:55:12Z

@tfmorris it won't go live until tomorrow as per the date ;-) Check the commit if you want to review in advance ...

danfowler · 2016-04-20T09:01:32Z

Thanks. Online now. Had to give it a nudge to rebuild.

rufuspollock closed this as completed in 66f467f Apr 17, 2016
