This project will liberate data from PDF files found on http://www.cityofjerseycity.com/pub-info.aspx?id=2430 and will create .csv and .json files to be uploaded to https://data.openjerseycity.org/dataset/jersey-city-2013-budget-adopted-spending

About the project

In Jersey City, we wanted the public to be more educated about the budget and city finances. So we started a project to convert the 37 scanned PDF documents, a total of 3,871 pages, available on the city's website into interactive visualizations. Inspired by the OpenSpending and Open Budget Oakland projects, we quickly built a page with a place for public comment, but then realized that getting the data out of the scanned PDFs is no simple task. The PDF Liberation hackathon was our missing link.

At the PDF Liberation Hackathon, the goal was to automate the process by creating a framework. Our first step was to convert non-searchable PDFs to searchable PDFs with the ABBYY Cloud OCR SDK API. The second step was to convert the searchable PDF files to CSV files using the non-interactive version of the Tabula table parser. The results of the table parser are not completely accurate but can be cleaned up by programming some higher-level heuristics. To complete the project, we would like to convert the CSV into a hierarchical data model and leverage existing solutions to publish budget visualizations.

The PDF files with historical data can be found on the Jersey City Website.

You can find the corresponding Gist here.

The extracted data will be uploaded to the Open JC Open Data Portal.

An OpenSpending-like visualization will be made available to the public on the Jersey City Budget site.

This project is built by Open JC, a Code for America brigade.


Instructions to run the project

Open the IPython notebook

On the command line, type:

ipython notebook

This project uses the ABBYY Cloud OCR SDK API to convert non-searchable PDF files to searchable PDFs, and the non-interactive Tabula to convert the searchable files to CSV files.

ABBYY Cloud OCR SDK:

  • The ABBYY Cloud OCR SDK API was used to convert non-searchable PDF files into searchable PDFs.
  • ABBYY is a commercial PDF solution vendor. For the PDF Liberation Hackathon, we were allowed to perform Optical Character Recognition on up to 5,000 pages for free with ABBYY's cloud-based (no installation) solution. Thank you ABBYY!
  • To run the OCR portion, you will need to get an ABBYY account. Once you set up your account, you will receive an email with an ApplicationId and Password, which you will need to set as the ABBYY_APPID and ABBYY_PASS environment variables. For example, if you are using the bash shell:

    export ABBYY_APPID=YourApplicationId
    export ABBYY_PASS=YourPassword
    
  • Python code sample that was utilized: https://github.com/abbyysdk/ocrsdk.com/tree/master/Python

  • You can manually run it on a command line to test:

    process.py <input file> <output file> -pdfSearchable
    

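The OCR step above can be batched with a short Python wrapper. This is a minimal sketch, not part of the repo: it assumes process.py sits next to the script and that the two environment variables are set as described. The function and directory names are hypothetical.

```python
import os
import subprocess

def check_abbyy_credentials():
    """Return the names of any missing ABBYY environment variables
    (process.py reads ABBYY_APPID and ABBYY_PASS the same way)."""
    return [name for name in ("ABBYY_APPID", "ABBYY_PASS")
            if not os.environ.get(name)]

def ocr_directory(src_dir, dst_dir):
    """Sketch: run process.py on every PDF in src_dir, writing searchable
    PDFs with the same names to dst_dir. Assumes process.py is in the
    current directory, as in this repo's layout."""
    missing = check_abbyy_credentials()
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    for name in sorted(os.listdir(src_dir)):
        if not name.lower().endswith(".pdf"):
            continue
        subprocess.check_call(["python", "process.py",
                               os.path.join(src_dir, name),
                               os.path.join(dst_dir, name),
                               "-pdfSearchable"])
```

Checking the credentials up front avoids burning an API call just to find out the environment is misconfigured.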
Tabula:

  • Non-interactive version of Tabula table parser was used to convert searchable PDF files to CSV files.
  • We run Tabula with the -p and -n options, converting the PDF one page at a time and saving each page into a separate CSV file. The -r option did not produce results because the tables in the PDFs use "-" and "|" characters instead of solid cell border lines.

    tabula-extractor/bin/tabula -p 11 -f CSV -n -o destination_file.csv file_to_convert.pdf
    
          --pages, -p <s>:  Comma separated list of ranges. Examples: --pages
                            1-3,5-7 or --pages 3. Default is --pages 1 (default:
                            1) 
    
    
        --spreadsheet, -r:  Force PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating each
                            cell, as in a PDF of an Excel spreadsheet)
    
     --no-spreadsheet, -n:  Force PDF not to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating each
                            cell, as in a PDF of an Excel spreadsheet)
    
  • Browser version: http://tabula.nerdpower.org/

  • Command-line non-interactive version: https://github.com/jazzido/tabula-extractor
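The page-at-a-time Tabula run above can be sketched as a loop in Python. This is an illustration, not code from the repo: the tabula path matches the repo layout, but the function names and the num_pages parameter are hypothetical (the page count must come from elsewhere).

```python
import subprocess

# Path to the non-interactive Tabula binary, per the repo layout above.
TABULA = "tabula-extractor/bin/tabula"

def page_csv_name(out_prefix, page):
    """Name the per-page CSV so the files sort in page order."""
    return "%s_p%03d.csv" % (out_prefix, page)

def pdf_to_csvs(pdf_path, num_pages, out_prefix):
    """Sketch: convert each page of a searchable PDF into its own CSV
    using -n (no spreadsheet-style extraction), since these tables lack
    ruling lines. Returns the list of CSV filenames written."""
    written = []
    for page in range(1, num_pages + 1):
        out_csv = page_csv_name(out_prefix, page)
        subprocess.check_call([TABULA, "-p", str(page), "-f", "CSV",
                               "-n", "-o", out_csv, pdf_path])
        written.append(out_csv)
    return written
```

Zero-padding the page number in the filename keeps the per-page CSVs in order when they are later scanned and recombined.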

Next steps:

This project is under construction.

To be added:

  • Function to scan CSV files (the last step produced one CSV file per PDF page), detect tables that run across multiple pages and combine them into a single CSV file.
  • Data scraping. Extract only the line items, ignoring the rollups, and link all spending to accounts, programs, divisions, departments, etc.
  • Upload the scraped data as .csv and hierarchical .json into data.openjerseycity.org so it can be used for visualization projects, like http://openjerseycity.org/JerseyCityBudget/.
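The first item above, combining per-page CSVs, could start from something like the sketch below. The column-count heuristic is an assumption for illustration, not the project's final logic: it treats a page whose table has the same width as the previous page as a continuation of the same table.

```python
import csv

def combine_page_csvs(paths):
    """Sketch: merge per-page CSV files (in page order) into groups,
    starting a new group whenever the column count changes. Same width
    as the previous page is taken to mean the table continues across
    the page break. Returns a list of groups, each a list of rows."""
    groups = []
    last_width = None
    for path in paths:
        with open(path) as f:
            # Drop rows that are entirely blank (common in OCR output).
            rows = [row for row in csv.reader(f)
                    if any(cell.strip() for cell in row)]
        if not rows:
            continue
        width = len(rows[0])
        if width == last_width:
            groups[-1].extend(rows)   # same shape: continuation
        else:
            groups.append(rows)       # new table starts here
        last_width = width
    return groups
```

A real implementation would need more signals than column count (repeated header rows, section titles), but this shows the shape of the step.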

Issues to be checked out:

  • ABBYY is timing out on files with more than 100 pages. Check the documentation and, if needed, split the file before calling the ABBYY API.
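If splitting turns out to be necessary, the page-range arithmetic is simple; the actual split could then be done with a PDF library such as PyPDF2 (an assumption, not a project dependency). A sketch:

```python
def chunk_ranges(total_pages, max_pages=100):
    """Split a page count into (start, end) ranges, 1-based and inclusive,
    of at most max_pages each, so each chunk can be sent to ABBYY
    separately and the OCR results stitched back together."""
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + max_pages - 1, total_pages)
        ranges.append((start, end))
        start = end + 1
    return ranges
```

For a 250-page budget document this yields three chunks, each safely under the observed 100-page timeout threshold.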