Convert London Oyster rail+tube+bus PDF journey histories into a Pandas DataFrame as an HDF5 file

Given PDFs from the Oyster website showing London bus and train journeys, this script parses the PDFs, extracts the line items and outputs a full Pandas DataFrame. Write-up:

The data is downloaded via:

  2. "View Journey History"
  3. Select a month to view
  4. Download the PDF

Given a PDF file from the Oyster website (or a folder of them), the script generates an HDF5 file from a DataFrame that looks like:

                             from is_train                to
2016-01-30  Bus Journey, Route 46    False                  
2016-01-28           Kentish Town     True  Leicester Square
2016-01-28             Old Street     True      Kentish Town
2016-01-28       Leicester Square     True        Old Street
2016-01-27                  Angel     True      Kentish Town
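
As a rough illustration of working with a frame of this shape, the sketch below rebuilds the five sample rows by hand and runs a few simple queries. The datetime index and the boolean dtype of is_train are assumptions read off the printed output, not confirmed details of the script.

```python
import pandas as pd

# Sample rows copied from the output above; the real frame comes from the script.
rows = [
    ("2016-01-30", "Bus Journey, Route 46", False, ""),
    ("2016-01-28", "Kentish Town", True, "Leicester Square"),
    ("2016-01-28", "Old Street", True, "Kentish Town"),
    ("2016-01-28", "Leicester Square", True, "Old Street"),
    ("2016-01-27", "Angel", True, "Kentish Town"),
]

# Assumption: the index holds journey dates as datetimes.
df = pd.DataFrame(
    {
        "from": [r[1] for r in rows],
        "is_train": [r[2] for r in rows],
        "to": [r[3] for r in rows],
    },
    index=pd.to_datetime([r[0] for r in rows]),
)

# Keep only rail/tube journeys.
train_journeys = df[df["is_train"]]

# Count journeys per calendar day.
journeys_per_day = df.groupby(level=0).size()

# Most frequent train destination in the sample.
top_destination = train_journeys["to"].value_counts().idxmax()
```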


$ python --filename="pdfs/Amex_1001_201511.pdf"  # convert a single PDF

$ python --directory="pdfs"  # search pdfs folder for PDFs

The data can be loaded back into Pandas with df = pandas.read_hdf('journeys.hdf5').
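
A minimal sketch of what could follow the load, assuming the column names shown in the sample output and a boolean is_train column. The helper names load_journeys and bus_routes are hypothetical and not part of the script; pandas.read_hdf also needs the PyTables package installed.

```python
import pandas as pd


def load_journeys(path="journeys.hdf5"):
    """Read the journeys DataFrame written by the script (needs PyTables)."""
    return pd.read_hdf(path)


def bus_routes(df):
    """Return the route numbers of bus journeys, parsed from the 'from' column."""
    buses = df.loc[~df["is_train"], "from"]
    # Assumes bus rows read like "Bus Journey, Route 46", as in the sample output.
    return buses.str.extract(r"Route (\w+)", expand=False).dropna()


# Example usage, once journeys.hdf5 exists:
# df = load_journeys()
# print(bus_routes(df).value_counts())
```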


$ py.test


This is a "quick hack" written in an evening to process the PDFs, using:

$ pdftotext --help
pdftotext version 0.24.5
Copyright 2005-2013 The Poppler Developers -
Copyright 1996-2011 Glyph & Cog, LLC
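
For readers unfamiliar with pdftotext, here is one way it could be called from Python to get a PDF's text. This is only a sketch: the function names are hypothetical, and whether the script itself passes -layout (or calls pdftotext this way at all) is an assumption.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build a pdftotext command that writes the extracted text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        # Keep the physical layout of the page (assumed helpful for tabular PDFs).
        cmd.append("-layout")
    # Passing "-" as the output file sends the text to stdout.
    return cmd + [pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text (requires Poppler on PATH)."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```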