pdftables - a library for extracting tables from PDF files

This Readme, and more, is available on ReadTheDocs.

This post on the ScraperWiki blog describes the algorithms used in pdftables, and something of its genesis. This README gives more technical information.

pdftables uses pdfminer to get information on the locations of text elements in a PDF document. pdfminer was chosen as a base because it provides information on the full range of page elements in PDF files, including graphical elements such as lines. The algorithms currently in use ignore these graphical elements, but using them is planned for future work. As a pure-Python library, pdfminer is very portable. Its downside is speed: it is perhaps an order of magnitude slower than alternative C-based libraries.


You need poppler and Cairo. On Ubuntu and related distributions you can install them with:

sudo apt-get -y install python-poppler python-cairo

Then we can install the pip-able requirements from the requirements.txt file:

pip install -r requirements.txt


First we get a file object to a PDF:

filepath = 'example.pdf'
fileobj = open(filepath, 'rb')

Then we create a PDF element from the file object:

from pdftables.pdf_document import PDFDocument
doc = PDFDocument.from_fileobj(fileobj)

Then we use the get_page() method to select a single page from the document:

from pdftables.pdftables import page_to_tables
page = doc.get_page(pagenumber)
tables = page_to_tables(page)

You can also loop over all pages in the PDF using get_pages():

from pdftables.pdftables import page_to_tables
for page_number, page in enumerate(doc.get_pages()):
  tables = page_to_tables(page)
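The loop above rebinds tables on every pass, so only the last page's result survives. A minimal sketch of one way to keep everything is shown below. The helper name collect_tables is hypothetical and not part of pdftables; it assumes only what the examples above show, namely that the document has a get_pages() method and that the result of page_to_tables can be iterated over.

```python
def collect_tables(doc, page_to_tables):
    """Gather every table in a document, tagged with its page index.

    Hypothetical helper, not part of pdftables. `doc` is assumed to expose
    get_pages(), and `page_to_tables` is assumed to return an iterable of
    tables for a single page.
    """
    found = []
    for page_index, page in enumerate(doc.get_pages()):
        for table in page_to_tables(page):
            # Remember which page (counted from 0) each table came from.
            found.append((page_index, table))
    return found
```

With the imports from the examples above, this would be called as collect_tables(doc, page_to_tables). Passing the function in as an argument keeps the helper independent of how the pages are parsed.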

Now that you have a TableContainer object, you can convert it to ASCII for quick previewing:

from pdftables.display import to_string
for table in tables:
  print to_string(...)

Each table that has been found is in the form of a list of lists of strings (i.e. a list of rows, each containing the same number of cells).
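Because a table is described here as a list of rows of strings, it can be handed to standard tools directly. The following is a sketch only, written for Python 3 and using nothing from pdftables itself: the helper rows_to_csv and the sample data are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise a list of rows (each a list of strings) as CSV text.

    Hypothetical helper, not part of pdftables.
    """
    # The README says every row has the same number of cells; check that
    # assumption rather than silently writing a ragged file.
    widths = set(len(row) for row in rows)
    if len(widths) > 1:
        raise ValueError("rows have differing numbers of cells: %r" % sorted(widths))
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up sample data standing in for an extracted table.
sample = [["Name", "Count"], ["apples", "3"], ["pears", "12"]]
print(rows_to_csv(sample))
```

The csv module takes care of quoting, so cells containing commas or quotation marks come out as valid CSV.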

Command line tool

pdftables includes a command line tool for diagnostic rendering of pages and tables, called pdftables-render. This is installed if you pip install pdftables, or you manually run python

$ pdftables-render example.pdf

This creates separate PNG and SVG files for each page of the specified PDF, in png/ and svg/, with three diagnostic displays per page.

Developing pdftables

Files and folders:

| |-sample_data

fixtures contains test fixtures. In particular, the sample_data directory contains PDF files, which are installed from a different repository by running the script.

The fixtures are currently unavailable as they are held in a private repository.

We're also using data from which is also installed by the download script.

pdftables contains the core code files

test contains tests

- this is the core of the pdftables library
- implements collections.Counter for the benefit of Python 2.6
- prettily prints a table by implementing the to_string function
- partially implements numpy.diff, numpy.arange and numpy.average to avoid a large dependency on numpy
- implements PDFDocument to abstract away the underlying PDF class, and ease any conversion to a different underlying PDF library to replace PDFminer
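To give a feel for what a partial, dependency-free stand-in for numpy.diff, numpy.arange and numpy.average can look like, here is a rough sketch for plain Python lists. It is not the code shipped in pdftables, and it covers only the simplest cases: one-dimensional input, a positive step, and an unweighted average.

```python
import math


def diff(values):
    """Differences between consecutive items, like numpy.diff on 1-D input."""
    return [b - a for a, b in zip(values, values[1:])]


def arange(start, stop, step=1):
    """Evenly spaced values from start up to, but excluding, stop."""
    if step <= 0:
        raise ValueError("this sketch only handles a positive step")
    # Compute the number of items first, then each item as start + i * step,
    # which avoids accumulating floating-point error in a running sum.
    count = max(0, int(math.ceil((stop - start) / float(step))))
    return [start + i * step for i in range(count)]


def average(values):
    """Arithmetic mean, like numpy.average without weights."""
    return sum(values) / float(len(values))
```

For example, diff([1, 4, 9]) gives [3, 5], and average([1, 2, 3, 4]) gives 2.5.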

