# Short example on how to parse a whole file

In my main blog post I walked through the steps I took to extract tabular data from a PDF. I have wrapped the whole thing in a few functions so that an entire file can be processed in one go.

First we import the relevant function:

In [1]:
from PDFFixup.fixer import get_tables

Next we run it over the whole file:

In [2]:
file_path = "data/DH_Ministerial_gifts_hospitality_travel_and_external_meetings_Jan_to_Mar_2015.pdf"
extracted_table = get_tables(file_path)

In [3]:
len(extracted_table)

12

The returned object is a list of pages, each page containing the tabular data:

In [4]:
extracted_table[2]

[[u'Earl Howe, Parliamentary-under-Secretary of State for Quality  '],
 [u'Date ', u'Name of Organisation  ', u'Type of Hospitality Received   '],
 [u'4 February 2015 ', u'College of Emergency Medicine ', u'Dinner '],
 [u' Jane Ellison MP, Parliamentary Under Secretary of State for Public Health'],
 [u'Date ', u'Name of Organisation  ', u'Type of Hospitality Received   '],
 [u'Nil ', u' ', u' '],
 [u'The Rt Hon Jeremy Hunt, Secretary of State for Health  '],
 [u'Date(s) of trip ',
  u'Destination   ',
  u'Purpose of trip ',
  u'\u2018Scheduled\u2019 \u2018No 32 (The Royal) Squadron\u2019 or \u2018other RAF\u2019 or \u2018Chartered\u2019 or \u2018Eurostar\u2019  ',
  u'Number of officials accompanying Minister, where non-scheduled travel is used    ',
  u'Total cost including travel, and accommodation of Minister only '],
 [u'16 \u2013 17 March 2015 ',
  u'Geneva, Switzerland ',
  u'To attend a World Health Organisation summit ',
  u'Scheduled ',
  u' ',
  u' ',
  u'\xa3265 '],
 [u'Dr. 
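
Since the result is a list of pages, each page a list of rows, and each row a list of cell strings, counting the row lengths is a quick way to see how uneven the rows are. This is a minimal sketch, not part of `PDFFixup`: `sample_pages` is a small hand-made stand-in for the output of `get_tables`, and `row_length_counts` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical stand-in for the output of get_tables: a list of pages,
# each page a list of rows, each row a list of cell strings.
sample_pages = [
    [
        [u"Earl Howe"],
        [u"Date ", u"Name of Organisation ", u"Type of Hospitality Received "],
    ],
    [
        [u"4 February 2015 ", u"College of Emergency Medicine ", u"Dinner "],
        [u"Nil ", u" ", u" "],
    ],
]


def row_length_counts(pages):
    """Count how many rows there are of each length, across all pages."""
    return Counter(len(row) for page in pages for row in page)


length_counts = row_length_counts(sample_pages)
```

If the counter has more than one key, the rows have different lengths and need padding before they can form a rectangular table.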

To get things into a format that can be dumped to CSV, we need to do a bit more work. The lists returned for each row can have different lengths, which reflects the differing column layouts of the original tables. To get around this we simply pad each row to the same length. The code below does this, concatenates the pages and saves the whole thing as a CSV file:

In [5]:
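def table_to_csv(extracted_table):
    # concatenate the pages into a single list of rows
    concatenated_table = [row for page in extracted_table for row in page]

    # find the maximum row length (0 if there are no rows at all)
    max_length = max([len(row) for row in concatenated_table] or [0])

    # convert to string
    out = ""
    for row in concatenated_table:
        # pad a copy of the row, so the caller's data is not modified
        padded = row + [""] * (max_length - len(row))

        # quote every cell and double any embedded quotes: cells such as
        # "Geneva, Switzerland" contain commas that would otherwise split
        quoted = ['"' + cell.replace('"', '""') + '"' for cell in padded]
        out += ",".join(quoted) + "\n"

    return out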

In [6]:
csved = table_to_csv(extracted_table)

# Note: you might want to change the encoding, depending on the format of your document
# Using "with" makes sure the file is closed (and flushed) once the data is written
with open("data/example_out.csv", "wb") as f:
    f.write(csved.encode("utf-8"))
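
One way to sanity-check CSV output like this is to parse it back with Python's standard `csv` module and confirm that every row has the same number of columns. The sketch below is written for Python 3 and runs on a small made-up string instead of the real file: `sample_csv` and `column_counts` are hypothetical names, not part of `PDFFixup`.

```python
import csv
import io

# Hypothetical CSV text standing in for the contents of the saved file.
# The second row has a comma inside a quoted cell.
sample_csv = (
    '"Date ","Destination ","Cost "\n'
    '"16 - 17 March 2015 ","Geneva, Switzerland ","265 "\n'
)


def column_counts(csv_text):
    """Return the set of distinct column counts found in csv_text."""
    reader = csv.reader(io.StringIO(csv_text))
    return set(len(row) for row in reader)


counts = column_counts(sample_csv)
```

A single value in `counts` means the table is rectangular. More than one value usually points to missing padding or to an unquoted comma inside a cell.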