To generate the files yourself, you need ImageMagick, Tesseract 4.0, Node.js and ScanTailor.
You can use Homebrew to install everything except ScanTailor:
brew install imagemagick tesseract node
Then download ScanTailor.
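Before starting, it can help to confirm that the command-line tools are actually on your PATH. The snippet below is a minimal sketch, not part of the original scripts: missing_tools is a hypothetical helper name, and it assumes ImageMagick is exposed through the convert command used in the next step. ScanTailor is a GUI application and is not checked.

```shell
#!/bin/sh
# Hypothetical helper: print every given command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# convert ships with ImageMagick; tesseract and node are the other two CLI tools.
missing=$(missing_tools convert tesseract node)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names are joined onto one line.
  echo "Missing tools:" $missing
fi
```

If nothing is printed, all three commands were found. This only checks that they exist, not that Tesseract is version 4.0.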
- Convert the PDF into images:
convert -density 300 1-pdf/orae_merged.pdf 2-images/p%03d.png
(After going through this, TIFF seems like it would've been the better option here, but it's a bit of work to change now.)
- Open all the images in ScanTailor to crop, deskew and dewarp them, then export them to 3-clean-images/. Doing this improves the OCR results.
- Use tesseract to extract the text from the images:
(cd 3-clean-images && for file in *.tif; do echo "$file"; tesseract --psm 4 -l eng "$file" "../4-ocr/$file"; done)
- Use the 5-split.js script to extract the individual events from each page. (Note: this script is very specific to Tesseract 4.0 with the default training data; you will need to tweak it if you're working with something else.)
for path in 4-ocr/*.txt; do file=$(basename "$path"); echo "$file"; ./5-split.js "4-ocr/$file" "5-ocr-split/$file"; done
- Extract structured data from each event file.
./6-structure.js
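
OCR is the slow step, so it is useful to see which cleaned pages have no text output yet before running the later scripts. The following is a rough sketch under a few assumptions: pending is a hypothetical helper that is not part of this repository, and it relies on tesseract appending .txt to the output base, so that p001.tif ends up as 4-ocr/p001.tif.txt.

```shell
#!/bin/sh
# Hypothetical helper: list files in src_dir matching *.ext that have no
# counterpart named "<file><suffix>" in dst_dir.
pending() {
  src_dir=$1 ext=$2 suffix=$3 dst_dir=$4
  for path in "$src_dir"/*."$ext"; do
    [ -e "$path" ] || continue   # the glob matched nothing
    name=$(basename "$path")
    [ -e "$dst_dir/$name$suffix" ] || printf '%s\n' "$name"
  done
}

# Cleaned pages that do not have OCR text in 4-ocr/ yet.
pending 3-clean-images tif .txt 4-ocr
```

Run it from the top-level directory. An empty result means every cleaned page has a matching text file; any names printed are pages that still need to go through tesseract.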