How to extract data from tables

Niels Lohmann edited this page Apr 18, 2017 · 15 revisions

In this wiki I will document how to detect tables and extract information from them. I’m writing this on the go, as I try to figure it out. By the way, my case study is a scanned PDF with OCR.

First we need to understand the structure of pdfminer. The module is split among several classes: PDFParser, PDFDocument, PDFInterpreter, PDFResourceManager and PDFDevice.

By using the function pdfminer.pdfinterp.process_pdf, we can forget about PDFParser, PDFDocument and PDFInterpreter, and care only about PDFResourceManager and PDFDevice (rant: why can’t this function also wrap the resource manager?!). That leaves two key modules: PDFDevice and Layout.

  • PDFDevice: Contains the classes that will DO STUFF with the PDF after it has been interpreted. For instance, you could implement your own version of Adobe Reader (draw text and figures to the screen), or just dump all the text into a TXT file, etc. We want to extend this module to be able to manage tables.
  • Layout: It tries to make sense of the PDF by inferring a structure, grouping letters that are close to each other. We would like to extend this module so it can properly detect tables (and rows, columns, headers and titles). However, the code is quite intricate, so we may be better off staying away from it.
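
To make "grouping things that are close to each other" concrete, here is a minimal sketch of how positioned words could be clustered into table rows by their y coordinate. This is not pdfminer's layout algorithm; the `Word` tuple, the `group_into_rows` helper and the `y_tolerance` value are all hypothetical names chosen for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a positioned piece of text; NOT a pdfminer class.
Word = namedtuple("Word", ["text", "x", "y"])


def group_into_rows(words, y_tolerance=2.0):
    """Cluster words into rows: a word joins the current row when its y is
    within y_tolerance of the first word of that row (assumed heuristic)."""
    rows = []
    # PDF coordinates grow upwards, so sort by descending y to go top to bottom.
    for word in sorted(words, key=lambda w: -w.y):
        if rows and abs(rows[-1][0].y - word.y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Inside each row, order the cells from left to right.
    return [sorted(row, key=lambda w: w.x) for row in rows]


words = [
    Word("Price", 200, 700.5),
    Word("Item", 50, 700),
    Word("Apple", 50, 680),
    Word("1.20", 200, 679.2),
]
rows = group_into_rows(words)
table = [[w.text for w in row] for row in rows]
print(table)
```

With OCR output the baselines of one row rarely match exactly, which is why the sketch uses a tolerance rather than an exact comparison; the right value would have to be tuned per document.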

Warmup: Extract all text from the first two pages

We can do this with very little code:

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter

fp = open('Example.pdf', 'rb')
outfp = open('Example.txt', 'wb')
rsrc = PDFResourceManager()
device = TextConverter(rsrc, outfp)

process_pdf(rsrc, device, fp, maxpages=2)

fp.close()
outfp.close()

What went wrong?

There is a bug in TextConverter (actually in pdfminer.converter.PDFPageAggregator.__init__): it forgets to set the default layout parameters when none are provided. To work around it, let’s tweak our script and pass the default layout parameters to TextConverter ourselves.

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams                            # ++

fp = open('Example.pdf', 'rb')
outfp = open('Example.txt', 'wb')
rsrc = PDFResourceManager()
laparams = LAParams()                                           # ++
device = TextConverter(rsrc, outfp, laparams=laparams)          # ++

process_pdf(rsrc, device, fp, maxpages=2)

fp.close()
outfp.close()

Voilà! It works…

Miscellanea

We see CTM everywhere: it’s the “current transformation matrix”, a 2×3 matrix whose leftmost 2×2 block is a linear transformation (scaling, rotation, skew) and whose rightmost 2×1 column is a translation in the xy plane.
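
As a small worked example of what the CTM does to a point, the sketch below assumes the matrix is stored as a flat 6-tuple (a, b, c, d, e, f), where (a, b, c, d) is the 2×2 part and (e, f) is the translation. The helper name `apply_ctm` is made up for this page; it only illustrates the arithmetic.

```python
def apply_ctm(ctm, point):
    """Map a point through a transformation matrix given as (a, b, c, d, e, f).

    Assumed convention: x' = a*x + c*y + e and y' = b*x + d*y + f.
    """
    a, b, c, d, e, f = ctm
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


identity = (1, 0, 0, 1, 0, 0)            # leaves every point unchanged
scale_and_shift = (2, 0, 0, 2, 10, 20)   # scale by 2, then move by (10, 20)

print(apply_ctm(identity, (3, 4)))         # (3, 4)
print(apply_ctm(scale_and_shift, (3, 4)))  # (16, 28)
```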

For text, we care about the render_string(self, textstate, seq) method of device. Textstate is an object with details about the text and seq is the text itself. We can use this to:

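  • Get the true fontsize: textstate.fontsize * textstate.scaling (as far as I can tell, scaling is the horizontal scaling expressed as a percentage, 100 by default, so divide it by 100 first)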
  • Get the location of the text (we also need to adjust for text written before it): textstate.matrix[-2:] gives an (x, y) coordinate
  • Space between characters: textstate.charspace
  • Space between words: textstate.wordspace
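
The attributes listed above can be collected in one place. The sketch below does not import pdfminer: `FakeTextState` is a hypothetical stand-in that only mimics the attribute names mentioned above, and `describe_textstate` is an invented helper. Inside a real device, the same function could presumably be called from render_string with the textstate it receives.

```python
from collections import namedtuple

# Hypothetical stand-in exposing only the attributes discussed above;
# the real textstate object comes from pdfminer and has more fields.
FakeTextState = namedtuple(
    "FakeTextState", ["fontsize", "scaling", "matrix", "charspace", "wordspace"]
)


def describe_textstate(textstate):
    """Collect the text attributes we care about into a plain dict."""
    return {
        "fontsize": textstate.fontsize,
        "scaling": textstate.scaling,
        # The last two matrix entries are the translation, i.e. the text origin.
        "origin": tuple(textstate.matrix[-2:]),
        "charspace": textstate.charspace,
        "wordspace": textstate.wordspace,
    }


state = FakeTextState(
    fontsize=12, scaling=100, matrix=(1, 0, 0, 1, 72, 700),
    charspace=0.5, wordspace=1.0,
)
info = describe_textstate(state)
print(info["origin"])
```

Keeping the extraction in a separate function makes it easy to try out on fake data before wiring it into a device subclass.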