Find file
Fetching contributors…
Cannot retrieve contributors at this time
45 lines (30 sloc) 4.81 KB

PDF Liberation Hackathon – Local Government Audited Financial Report Challenge

Most large US cities publish Comprehensive Annual Financial Reports (CAFRs) in the form of PDFs. These documents contain a large set of audited financial statements with footnotes. This challenge requires you to extract a single statement – a ten-year history of revenues and expenditures – from the latest CAFR for the four hackathon cities (Chicago, New York, San Francisco and Washington DC). By doing this we can compare revenue sources and spending priorities across cities over a number of years.

We have downloaded the CAFRs and placed them in a shared Dropbox folder. For reference, links to web based copies of these reports are provided at the end of this document. Automatically downloading these files would be nice, but is not essential to this already difficult challenge.

The schedules required from each city’s CAFR are listed below:

City Fiscal Year File Name on Dropbox Schedule Name PDF Page(s)
New York 2013 NYC_CAFR2013.PDF Changes in Fund Balances – Governmental Funds – Ten Year Trend 330- 331 of 397
San Francisco 2013 SF_CAFR2013.PDF CHANGES IN FUND BALANCES OF GOVERNMENTAL FUNDS - Last Ten Fiscal Years 241 of 257
Washington DC 2012 DC_CAFR2012.PDF Changes in Fund Balances, Governmental Funds, Last Ten Fiscal Years 184 of 207

We are only seeking to process the Revenue and Expenditure data themselves. Fund Balances and other entries toward the bottom of each schedule can be omitted.

This challenge is difficult because the formats vary across cities and because the Washington DC CAFR was produced by scanning and thus requires OCR processing. The other three CAFR PDFs have embedded text and should thus be easier to process. Analyzing the Chicago report is complicated by the fact that row captions are missing from pages 131 and 133; you need to use the captions from pages 130 and 132, respectively.

In addition to extracting the data, the challenge requires transforming it into a standard database-ready format. Rather than actually loading the data into a database, it will be sufficient to produce a tab- delimited file. The Dropbox folder contains a tab delimited file (named pdflb_cafr_outputs.txt) with the expected output.

Following is a list of fields that should appear in the output file:

Fields Description
CITY Based on the source document (values are Chicago, New York, San Francisco, Washington DC)
Fiscal Year (FY) Column Headers from Each Table (values are 4-digit numbers ranging from 2003- 2012 in Chicago and DC, and 2004-2013 in New York and San Francisco)
Revenue or Expenditure (REV_EXP) Indicator or whether the category is a Revenue or Expenditure item. Valid values are "Revenues" or "Expenditures".
CATEGORY These are the row headers or captions (examples are "Property Tax, "Utility Tax", "Public Safety", "General Government".
AMOUNT This is the dollar amount in each cell of the PDF tables

The challenge is to produce this expected output with the highest degree of automation possible. The ideal solution would be one that could easily be applied to previous reports issued by each of the four cities and to comparable reports issued by other cities.

You are welcome to produce visualizations of the data, include additional data or try adding more cities if you have time. You can find more CAFRs at We recommend working only with larger cities, since many smaller governments do not provide ten-year histories.

CAFR Locations on the Web: