Turn a batch of OCR files from Chronicling America into a CSV that can be imported into a database
.gitignore
.travis.yml
LICENSE.md
Makefile (latest commit: "Testing out goreleaser", Jan 24, 2019)
README.md
chronam-ocr-debatcher.go
go.mod
go.sum
parallel.go
process-batch.go
utilities.go (latest commit: "Process batches concurrently", Jan 24, 2019)


Chronicling America OCR debatcher

This program takes paths to .tar.bz2 batches of OCR files from the Chronicling America bulk data downloads. It converts each batch into a CSV file, which you can load into a database or process however you like. Batches are processed concurrently.

Usage:

./chronam-ocr-debatcher [--processes=8] <path/to/a/batch.tar.bz2 ...>

You can download binaries from the releases page.