
Simple (open) PDF to text service #52
I've been trying out pdf2json. Gist is here https://gist.github.com/rgrp/5944247
- Seems quite slow (though pdf being used is image heavy)
- Have not yet worked out how to get any text ;-)
I left a comment on how to get basic text out of pdf2json: https://gist.github.com/rgrp/5944247#comment-863903
While you may already know, I still don't see ScraperWiki's PDF conversion support mentioned -- https://views.scraperwiki.com/run/pdf-to-html-preview-1/
Also check the original Scraperwiki blog post on this topic -- http://blog.scraperwiki.com/2010/12/17/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/
As an alternative to poppler's implementation (used by ScraperWiki), which sometimes breaks on custom fonts, I've written a Python script using pdfminer to achieve the same thing. It currently only supports text, but mostly works. See https://github.com/zejn/pypdf2xml
see also http://tabula.nerdpower.org/ by @jazzido
At DocumentCloud we have built and use http://documentcloud.github.com/docsplit/ which is both an API wrapper and a commandline tool to normalize and process most of the document types that Libre/Open Office can handle.
@zejn that's really useful, and I have updated the issue with ScraperWiki (can you use the ScraperWiki stuff via the API?). I've also
@maxogden good point - tabula is in the gist list (perhaps I should inline that in the issue). Aside: I'd love it if tabula got a nice hosted version with an API!
@knowtheory thanks - I'd thought of including documentcloud but I hadn't done enough digging into the toolstack yet to know whether there were subparts you could use or you need the whole thing!
/cc @mihi-tr
Hey everyone,
Tabula author here.
> @maxogden good point - tabula is in the gist list (perhaps I should inline that in the issue). Aside: I'd love it if tabula got a nice hosted version with an API!
Tabula extraction code now lives in its own module (https://github.com/jazzido/tabula-extractor), so an API should be pretty easy to implement. Such a feature isn't in our short-term roadmap, but contributions are welcome :)
Added pdftables to the list and inlined the gist list here ...
@markbrough could you report your experience with pdftables?
Regarding table scraping, I tried the various generic tools proposed so far, but I always run into specific issues that bring me back to simple scraping along the lines of pdftables.
The method is quite straightforward but hard to package: take the output of "pdftohtml -xml", map the positions on a graph, and then define rules from those positions and from the font ids given by pdftohtml.
The problem is that the page formatting always requires some dirty hacks to skip non-useful info and to fix cases where rows span multiple lines, the same info is not repeated, columns are too close together, etc.
I guess a package could be a framework with pluggable middleware for the document-specific quirks.
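The position-mapping step described above can be sketched roughly as follows. This is only an illustration, not the pdftables code: the sample XML is hand-written to imitate the `<text top=… left=…>` elements that "pdftohtml -xml" emits, and the function name `rows_from_pdftohtml_xml` and the `row_tolerance` parameter are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating "pdftohtml -xml" output (not from a real PDF).
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1000" width="800">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="50" width="40" height="12" font="0"><b>Name</b></text>
<text top="101" left="200" width="40" height="12" font="0"><b>Amount</b></text>
<text top="120" left="50" width="40" height="12" font="0">Alice</text>
<text top="120" left="200" width="20" height="12" font="0">10</text>
<text top="140" left="200" width="20" height="12" font="0">25</text>
<text top="139" left="50" width="40" height="12" font="0">Bob</text>
</page>
</pdf2xml>"""


def rows_from_pdftohtml_xml(xml_text, row_tolerance=3):
    """Group text fragments into rows by their vertical position.

    Returns one list of rows per page; each row is a list of strings
    ordered left to right. row_tolerance (hypothetical parameter) is the
    maximum vertical drift, in pixels, allowed within one row.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        cells = []
        for el in page.iter("text"):
            # itertext() also picks up text nested in <b>/<i> children.
            text = "".join(el.itertext()).strip()
            if text:
                cells.append((int(el.get("top")), int(el.get("left")), text))
        cells.sort()  # top to bottom, then left to right

        rows, row_top = [], None
        for top, left, text in cells:
            # Start a new row when the fragment sits clearly below the current one.
            if row_top is None or top - row_top > row_tolerance:
                rows.append([])
                row_top = top
            rows[-1].append((left, text))
        pages.append([[text for _, text in sorted(row)] for row in rows])
    return pages


if __name__ == "__main__":
    for row in rows_from_pdftohtml_xml(SAMPLE_XML)[0]:
        print(row)
```

The "dirty hacks" mentioned above (multi-line rows, columns that sit too close together, font-based rules) would be layered on top of a grouping like this one, which is where a per-document middleware hook could plug in.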
I have a working version of Tika dev (1.8) with tesseract here: http://beta.offenedaten.de:9998/tika
This will give simple text back from a PDF. Nothing fancy.
Test by doing things like:
curl -T mypdf.pdf http://beta.offenedaten.de:9998/tika
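For anyone who wants to call the service from a script rather than from curl, here is a minimal Python sketch of the same request. `curl -T` sends the file as an HTTP PUT, so the sketch does the same. The default URL assumes a Tika server listening locally on port 9998, and the `Accept: text/plain` header, the function names and the timeout are choices made for this example.

```python
import urllib.request

# Assumption: a Tika server is reachable locally on port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL):
    """Build (but do not send) an HTTP PUT carrying the PDF, like `curl -T`."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain text back
    )


def pdf_to_text(path, url=TIKA_URL, timeout=60):
    """Upload the PDF at `path` and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(pdf_to_text("mypdf.pdf"))
```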
You can run your own using Docker by doing:
sudo docker build -t tika github.com/mattfullerton/tika-tesseract-docker
sudo docker run -d -p 9998:9998 tika
I'm very open to improvements to the Docker build files; I am no expert there.
OCR is supported (for images), but what is lacking now (AFAIK) is detection that standard text extraction from a PDF 'failed' with a fallback to tesseract. We should look into that.
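One possible shape for that fallback, as a rough sketch only: treat the standard extraction as 'failed' when it yields almost no visible characters per page, and only then run OCR. Nothing here is Tika's API; `extract_text` and `run_ocr` are hypothetical callables you would supply, and the threshold of 25 characters per page is an arbitrary guess that would need tuning.

```python
def looks_like_failed_extraction(text, page_count, min_chars_per_page=25):
    """Heuristic: too few non-whitespace characters for the number of pages."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars_per_page * max(page_count, 1)


def extract_with_fallback(pdf_bytes, page_count, extract_text, run_ocr):
    """Try normal text extraction first; fall back to OCR if it looks empty.

    extract_text and run_ocr are hypothetical callables taking the PDF bytes
    and returning a string (for example, wrappers around Tika and tesseract).
    """
    text = extract_text(pdf_bytes)
    if looks_like_failed_extraction(text, page_count):
        return run_ocr(pdf_bytes)
    return text
```

A character-count threshold is crude: it will miss PDFs whose text layer exists but is garbage, so a real check might also look at the share of recognisable words.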
Looking for a simple online service to convert PDF to plain text. Features wanted:
If it does not exist, I would consider building it.
Please add suggestions, ideas and feedback in the comments.
Research
Tools and Libraries (only open-source)
Taken from this gist list of pdf 2 xxx tools
See also the PDF section of http://okfnlabs.org/dataconverters/
Generic (PDF -> text)
Tables from PDF
Existing open services
Existing proprietary free or paid-for services
Google App Engine used to do this: http://developers.google.com/appengine/docs/python/conversion/overview
By Language
@maxogden has this list of Node libraries and tools:
https://gist.github.com/maxogden/5842859
Here's a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247
Colophon
Originally: http://ideas.okfn.org/ideas/106/pdf-tiff-scan-to-text-conversion-service/