Files and code related to the Early Modern OCR Project (eMOP) at the IDHMC
GameraTraining
font-history-DB
LICENSE.md
README.md
TCPwLigtrs.sh
regex.cpp
tess-script.sh
xml_to_box.xsl


The Early Modern OCR Project (eMOP) is a project, funded by a Mellon Foundation grant, that aims to improve the state of OCR output for the entire corpus of digitally available Early Modern English-language texts. These texts currently reside in the Early English Books Online (EEBO) database, owned by ProQuest, and the Eighteenth Century Collections Online (ECCO) database, owned by Gale-Cengage Learning. So far, only the ECCO collection has been OCR'd, with mixed results.

eMOP is attempting to use open-source technologies to improve the OCR output for the ECCO collection and, for the first time, to create usable OCR text for the EEBO collection. To do this, we are using Tesseract and Gamera as our open-source OCR engines. As our ground truth, we are using double-keyed transcriptions of approximately 44,000 EEBO and 2,000 ECCO documents created by the Text Creation Partnership (TCP).

This repository is a place for IDHMC personnel and partners to share, store, and collaborate on files and code created during the life of this project. Because eMOP is a Mellon-funded project, all of its products are open source.