Tabular Document Wrangler



Tabular Document Wrangler parses data on meetings between ministers and lobbyists from several kinds of poorly formatted tabular sources, including PDF and CSV files. It forms a central part of 'Ear-time with the Cabinet: Ministerial meetings as vehicles for lobbying', a collaboration between the Department of Sociology, University of Oxford and Transparency International UK.


Python 3.6 or greater is required, along with an installation of pdfminer3k. Setting up a virtual environment for this is recommended but optional. pdfminer3k can be installed with the command pip install pdfminer3k. A full tutorial can be found here.
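Before running the code, it can help to confirm that the interpreter and the PDF library are in place. The following is a minimal sketch, not part of this repository: the function name check_environment is hypothetical, and it assumes that pdfminer3k is importable under the module name pdfminer.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 6), module_name="pdfminer"):
    """Return a list of human-readable problems; an empty list means OK.

    Assumption: pdfminer3k is importable as `pdfminer`.
    """
    problems = []
    # Compare the running interpreter against the minimum required version.
    if sys.version_info < min_version:
        problems.append("Python %d.%d or greater is required" % min_version)
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        problems.append(
            "module %r not found; try: pip install pdfminer3k" % module_name
        )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Run inside the virtual environment, the script prints nothing when both requirements are met.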

Running the Code

Download a zip of this repository or clone it with git, then run python from the src folder at the command line. For a step-by-step setup guide aimed at beginners see

Input/output file folders

The script looks for input files in the data folder, with subfolders expected to match the government department code. The SQLite database and csv exports will be placed in the output folder.
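The folder layout described above can be illustrated with a short sketch that groups input files by their department subfolder. This is an illustration under stated assumptions, not the repository's own code: the function name find_input_files is hypothetical, and the choice of .pdf and .csv as the accepted extensions is assumed from the file types mentioned earlier.

```python
from collections import defaultdict
from pathlib import Path


def find_input_files(data_dir, suffixes=(".pdf", ".csv")):
    """Group input files by the name of their department subfolder.

    Hypothetical helper: only files sitting directly inside a subfolder
    of `data_dir` are considered, and only those with a matching suffix.
    """
    files_by_department = defaultdict(list)
    # "*/*" matches entries exactly one level below data_dir.
    for path in sorted(Path(data_dir).glob("*/*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            # The subfolder name stands in for the department code.
            files_by_department[path.parent.name].append(path)
    return dict(files_by_department)


if __name__ == "__main__":
    for department, paths in find_input_files("data").items():
        print(department, [p.name for p in paths])
```

Files placed directly in the data folder, outside any department subfolder, are ignored by this sketch, as are files with other extensions.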


This work is free. You can redistribute it and/or modify it under the terms of the MIT license. This license does not apply to any input or output data processed.


The project was funded by an ESRC IAA Kick-Start (1609-KICK-244) grant.