4500 pages of documents concerning "escalation of force" incidents involving contractors working for the U.S. Dept. of State in Iraq, 2005-2007
Ruby Shell
.gitignore
README.md
extract-text-all.sh
january-june-2005-manual-coverpages
january-june-2005.pdf
january-to-june-2006.pdf
january-to-june-2007.pdf
june-december-2005.pdf
june-to-december-2006.pdf
june-to-december-2007.pdf
repaginate-all.sh
repaginate.rb

README.md

Iraq contractor incidents 2005-2007, original PDFs and splitting scripts

These are the files used to produce the iraq-contractor-incidents source file for analysis in the Overview prototype. The original PDFs are on DocumentCloud and are also stored here for convenience. The resulting .csv is available as part of the Overview sample files.

The most interesting thing here is the repaginate.rb script, which does fuzzy coverpage detection to find where one document ends and the next begins. It can also take a list of manual coverpages where a PDF should be split, as in the january-june-2005-manual-coverpages file. The scripts in this project also demonstrate how to extract text from (pre-OCR'd) PDFs using DocSplit.
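To give a feel for what fuzzy coverpage detection can look like, here is a minimal sketch. It is not the code in repaginate.rb: the method names, the "INCIDENT REPORT" template phrase, the 0.6 threshold and the bigram Dice similarity are all assumptions chosen for illustration. The idea is that OCR errors make exact string matching unreliable, so each of a page's first few lines is scored against a template and the page counts as a coverpage when any line is similar enough. A manual list of coverpage numbers can be passed to the same splitting step.

```ruby
# Hypothetical sketch; none of these names are taken from repaginate.rb.

# Character bigrams of a string, ignoring case and anything non-alphanumeric.
def bigrams(text)
  chars = text.downcase.gsub(/[^a-z0-9]/, "")
  (0...[chars.length - 1, 0].max).map { |i| chars[i, 2] }
end

# Dice coefficient over bigram multisets: 1.0 for identical strings,
# 0.0 when the strings share no bigrams.
def similarity(a, b)
  x = bigrams(a)
  y = bigrams(b)
  return 0.0 if x.empty? || y.empty?

  remaining = y.each_with_object(Hash.new(0)) { |bg, counts| counts[bg] += 1 }
  overlap = x.count do |bg|
    if remaining[bg] > 0
      remaining[bg] -= 1
      true
    end
  end
  2.0 * overlap / (x.length + y.length)
end

# A page is treated as a coverpage when one of its first few lines
# resembles the template closely enough (threshold is an assumed value).
def coverpage?(page_text, template, threshold: 0.6, head_lines: 5)
  page_text.lines.first(head_lines).any? do |line|
    similarity(line, template) >= threshold
  end
end

# 1-based page numbers of every detected coverpage.
def detect_coverpages(pages, template)
  pages.each_index.select { |i| coverpage?(pages[i], template) }.map { |i| i + 1 }
end

# Group pages into documents; each coverpage number starts a new document.
# cover_numbers may come from detect_coverpages or from a manual list.
def split_at(pages, cover_numbers)
  docs = []
  pages.each_with_index do |page, i|
    docs << [] if docs.empty? || cover_numbers.include?(i + 1)
    docs.last << page
  end
  docs
end

pages = [
  "INC1DENT REP0RT\nDate: 2005-01-03",  # OCR-damaged coverpage
  "continued narrative",
  "Incident Report\nDate: 2005-01-09",
  "more text"
]
covers = detect_coverpages(pages, "INCIDENT REPORT")
p covers                          # => [1, 3]
p split_at(pages, covers).length  # => 2
```

The threshold trades missed coverpages against false splits, which is presumably why a manual override list is useful for PDFs where the OCR is especially poor.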

To recreate the final .csv, run these two scripts in order:

./extract-text-all.sh
./repaginate-all.sh

copyleft

GPLv3

contact

need help? ask!

https://twitter.com/overviewproject