The repository initially focuses on compiling data relevant to OCR work in the natural history collections and digital humanities communities. Both communities need to extract high-quality text from documents and images that contain a variety of typefaces. The goal of this repository is to compile a corpus of typeface samples in standardized formats, so that these communities can improve the quality of text generated by OCR engines such as Tesseract and OCRopus.
For details about the types of files and formatting, see the Submission Procedures document.