GitHub - owid/orae-major-conflicts: OCR of historical conflicts report

Prerequisites

To generate the files yourself, you need ImageMagick, Tesseract 4.0, Node.js and ScanTailor.

You can use brew to install all of them except ScanTailor:

brew install imagemagick tesseract node
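Before running the pipeline, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper name, and ScanTailor is left out because it is used interactively here.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of this repository.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# convert ships with ImageMagick; tesseract and node are the other CLI tools.
missing=$(missing_tools convert tesseract node)
if [ -n "$missing" ]; then
  echo "Missing tools: $missing" >&2
else
  echo "All command-line prerequisites found."
fi
```

This only checks that the commands exist, not their versions, so `tesseract --version` is still worth running by hand to confirm you have 4.0.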

Convert the PDF into one 300 DPI PNG image per page:

convert -density 300 1-pdf/orae_merged.pdf 2-images/p%03d.png

(After going through this, TIFF seems like it would've been the better option here, but it's a bit of work to change now.)
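The `%03d` in the output name of the convert command is a printf-style placeholder that ImageMagick fills with a zero-padded page index, which starts at 0 by default. As a rough illustration of the resulting names (`page_image` is a hypothetical helper used only for this example):

```shell
#!/bin/sh
# Hypothetical helper: the file name that the p%03d.png pattern yields
# for a given zero-based page index.
page_image() {
  printf '2-images/p%03d.png\n' "$1"
}

page_image 0    # first page of the PDF
page_image 11   # twelfth page of the PDF
```

The zero padding keeps the files in page order when they are listed or globbed alphabetically.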

Open all the images in ScanTailor to crop, deskew and dewarp them, then export them to 3-clean-images/. This improves the OCR results.

Use Tesseract to extract the text from the images:

(cd 3-clean-images; for file in *.tif; do echo "$file"; tesseract --psm 4 -l eng "$file" "../4-ocr/$file"; done)
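Tesseract appends .txt to the output base it is given, so the loop above produces files such as p001.tif.txt. Re-running the loop repeats OCR for every page. The sketch below is one possible variant that is run from the repository root and skips pages that already have output; `ocr_output_for` is a hypothetical helper, not something the repository provides.

```shell
#!/bin/sh
# Hypothetical helper: the .txt file Tesseract writes for a cleaned image
# when the image's own file name is used as the output base.
ocr_output_for() {
  echo "4-ocr/$(basename "$1").txt"
}

# Run from the repository root. Pages that already have OCR output are skipped.
for image in 3-clean-images/*.tif; do
  [ -e "$image" ] || continue   # the glob matched nothing
  out=$(ocr_output_for "$image")
  if [ -e "$out" ]; then
    echo "skip $image"
  else
    # Tesseract adds the .txt suffix itself, so strip it from the output base.
    tesseract --psm 4 -l eng "$image" "${out%.txt}"
  fi
done
```

Delete a page's file in 4-ocr/ to force that page to be processed again, for example after re-exporting it from ScanTailor.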

Use the 5-split.js script to extract the individual events from each page. (Note: this script is very specific to Tesseract 4.0 with the default training data; you will need to tweak it if you're working with something else.)

(cd 4-ocr; for file in *.txt; do echo "$file"; ../5-split.js "$file" "../5-ocr-split/$file"; done)
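A quick sanity check after this step is to compare the number of files in 4-ocr/ and 5-ocr-split/. This sketch assumes that 5-split.js writes exactly one output file per input page, as the command above suggests; `count_files` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: number of regular files directly inside directory $1
# whose names match the glob $2. A missing directory counts as 0.
count_files() {
  find "$1" -maxdepth 1 -type f -name "$2" 2>/dev/null | wc -l | tr -d ' '
}

# Run from the repository root.
ocr_pages=$(count_files 4-ocr '*.txt')
split_pages=$(count_files 5-ocr-split '*.txt')
if [ "$ocr_pages" -ne "$split_pages" ]; then
  echo "Page count mismatch: $ocr_pages OCR files, $split_pages split files" >&2
fi
```

A mismatch usually means the split script failed on some page, which is worth checking before moving on to the next step.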

Finally, run the 6-structure.js script:

./6-structure.js
