Simple (open) PDF to text service #52

Open
rgrp opened this Issue · 15 comments

8 participants

@rgrp
Open Knowledge member

Looking for a simple online service to convert PDF to plain text. Features wanted:

  • Preferably open (i.e. underlying code is open)
  • Service has API
  • OCR not required (but a bonus) - it should work on "normal" PDFs (not scanned PDFs)

If it does not exist, I would consider building it.

Please add suggestions, ideas and comments in the comments

Research

Tools and Libraries (only open-source)

Taken from this gist list of pdf 2 xxx tools

cf also http://okfnlabs.org/dataconverters/ PDF section

Generic (PDF -> text)

  • PDFMiner - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
    • Pure python
  • pdftohtml - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf
  • pdftoxml - command line utility to convert PDF to XML built on poppler.
  • docsplit - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
  • pypdf2xml - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
  • pdf2htmlEX - Convert PDF to HTML without losing text or format. C++. Fast.

Tables from PDF

Existing open services

Existing proprietary free or paid-for services

Google App Engine used to do this: http://developers.google.com/appengine/docs/python/conversion/overview

By Language

@maxogden has this list of Node libraries and tools:

https://gist.github.com/maxogden/5842859

Here's a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247

Colophon

Originally: http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service/

@rgrp
Open Knowledge member

I've been trying out pdf2json. Gist is here https://gist.github.com/rgrp/5944247

  • Seems quite slow (though pdf being used is image heavy)
  • Have not yet worked out how to get any text ;-)
@maxogden
Open Knowledge member

left a comment on how to get basic text out of pdf2json https://gist.github.com/rgrp/5944247#comment-863903

@zejn

While you may already know, I still don't see ScraperWiki's PDF conversion support mentioned -- https://views.scraperwiki.com/run/pdf-to-html-preview-1/

Also check the original Scraperwiki blog post on this topic -- http://blog.scraperwiki.com/2010/12/17/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/

As an alternative to poppler's implementation (used by ScraperWiki), which sometimes breaks on custom fonts, I've written a Python script using pdfminer to achieve the same thing. It currently only supports text, but mostly works. See https://github.com/zejn/pypdf2xml

@maxogden
Open Knowledge member
@knowtheory

At DocumentCloud we have built and use http://documentcloud.github.com/docsplit/ which is both an API wrapper and a command-line tool to normalize and process most of the document types that LibreOffice/OpenOffice can handle.

@rgrp
Open Knowledge member

@zejn that's really useful, and I have updated the issue with ScraperWiki (can you use the ScraperWiki stuff via the API?). I've also

@maxogden good point - tabula is in the gist list (perhaps I should inline that in the issue). Aside: I'd love it if tabula got a nice hosted version with an API!

@knowtheory thanks - I'd thought of including documentcloud but I hadn't done enough digging into the toolstack yet to know whether there were subparts you could use or you need the whole thing!

/cc @mihi-tr

@jazzido

Hey everyone,

Tabula author here.

@rgrp wrote: "@maxogden good point - tabula is in the gist list (perhaps I should inline that in the issue). Aside: I'd love it if tabula got a nice hosted version with an API!"

Tabula extraction code now lives in its own module (https://github.com/jazzido/tabula-extractor), so an API should be pretty easy to implement. Such a feature isn't in our short-term roadmap, but contributions are welcome :)

@gregelin
@jazzido
@gregelin
@rgrp
Open Knowledge member

Added pdftables to the list and inlined the gist list here ...

@rgrp
Open Knowledge member

@markbrough could you report your experience with pdftables?

@RouxRC

Regarding table scraping, I tried the various generic tools proposed so far, but I always end up with specific issues that send me back to simple scraping along the lines of the ideas behind pdftables.

The method is quite straightforward but hard to package: it basically means taking the output of "pdftohtml -xml", mapping the text positions onto a graph, and then defining rules from these positions and from the font ids given by pdftohtml.

Examples here and there

The problem is that the formatting of the pages always requires some dirty hacks to skip useless info and to fix cases where rows span multiple lines, the same info is not repeated, columns are too close together, etc.

I guess a package could take the form of a framework with pluggable middleware to handle the specifics of each document.
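As a rough illustration of the "pdftohtml -xml" approach described in this comment, here is a minimal Python sketch. It assumes the XML contains `<page>` elements holding `<text>` elements with integer `top`, `left` and `font` attributes; the function name `rows_from_pdftohtml_xml` and the `row_tolerance` parameter are made up for this example and are not part of pdftables or pdftohtml. It only groups text into rows; the per-document rules would still have to be written on top of it.

```python
import xml.etree.ElementTree as ET


def rows_from_pdftohtml_xml(xml_text, row_tolerance=2):
    """Group the <text> elements of each page into rows by vertical position.

    Returns one list per page. Each row is a list of (left, font_id, text)
    tuples ordered left to right. row_tolerance is a hypothetical knob: the
    maximum distance (in pdftohtml units) between the "top" of a row's first
    cell and any other cell still counted as part of that row.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        cells = []
        for node in page.iter("text"):
            # itertext() also picks up text wrapped in <b>/<i> children.
            content = "".join(node.itertext()).strip()
            if content:
                cells.append((int(node.get("top")), int(node.get("left")),
                              node.get("font"), content))
        # Read the page top to bottom, then left to right.
        cells.sort(key=lambda cell: (cell[0], cell[1]))
        rows, row_top = [], None
        for top, left, font, content in cells:
            if row_top is None or top - row_top > row_tolerance:
                rows.append([])
                row_top = top
            rows[-1].append((left, font, content))
        pages.append([sorted(row, key=lambda cell: cell[0]) for row in rows])
    return pages
```

The `left` values and font ids are kept in the output so that column boundaries and header/body rules can be derived from them per document.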

@mattfullerton

I have a working version of Tika dev (1.8) with tesseract here: http://beta.offenedaten.de:9998/tika

This will give simple text back from a PDF. Nothing fancy.

Test by doing things like:

curl -T mypdf.pdf http://beta.offenedaten.de:9998/tika
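For anyone who wants to call the service from code rather than curl, here is a minimal Python sketch of the same upload. `curl -T` sends an HTTP PUT, so this does too. The function names are made up for this example, and the `Accept: text/plain` and `Content-Type: application/pdf` headers are my assumption about what a Tika server expects; the curl command above does not set them.

```python
import urllib.request


def build_tika_request(pdf_bytes, tika_url):
    """Build a PUT request roughly equivalent to `curl -T mypdf.pdf <tika_url>`."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        # Assumed headers: ask for plain text back and declare the upload type.
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def pdf_to_text(pdf_path, tika_url, timeout=60):
    """Upload a local PDF to a running Tika server and return the response body."""
    with open(pdf_path, "rb") as handle:
        request = build_tika_request(handle.read(), tika_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be something like `pdf_to_text("mypdf.pdf", "http://localhost:9998/tika")` against your own instance.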

You can run your own using Docker by doing:

sudo docker build -t tika github.com/mattfullerton/tika-tesseract-docker
sudo docker run -d -p 9998:9998 tika

I'm very open to improvements to the Docker build files, I am no expert there.

OCR is supported (for images), but what is lacking now (AFAIK) is detection that standard text extraction from a PDF 'failed' with a fallback to tesseract. We should look into that.
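One possible way to approach that fallback, as a sketch only: treat the normal extraction as "failed" when it yields very few letters or digits per page, and only then run OCR. Everything here is hypothetical (the function names, the `min_chars_per_page` threshold, and the `extract_text` / `run_ocr` callables you would have to supply); it is not something Tika does today as far as I know.

```python
def looks_like_failed_extraction(text, page_count, min_chars_per_page=25):
    """Heuristic: very few letters/digits per page suggests a scanned PDF.

    min_chars_per_page is an arbitrary threshold picked for illustration.
    """
    useful = sum(1 for ch in text if ch.isalnum())
    return useful < min_chars_per_page * max(page_count, 1)


def extract_with_fallback(pdf_bytes, page_count, extract_text, run_ocr):
    """Try plain text extraction first and fall back to OCR if it looks empty.

    extract_text and run_ocr are caller-supplied callables taking the PDF
    bytes and returning a string (for example thin wrappers around Tika and
    tesseract). Returns (text, method) where method is "text" or "ocr".
    """
    text = extract_text(pdf_bytes)
    if looks_like_failed_extraction(text, page_count):
        return run_ocr(pdf_bytes), "ocr"
    return text, "text"
```

A character count is a crude signal; PDFs with broken font encodings can return plenty of garbage characters and would need a different check.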
