This list contains links to software tools, libraries, and literature related to Optical Character Recognition (OCR).
Contributions are welcome, as is feedback.
- tesseract - The definitive Open Source OCR engine
- ocropus - OCR engine based on LSTM
- ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
- kraken - Ocropus fork with sane defaults
- Ocrad - The GNU OCR.
- digit - OCR for numbers in meter displays, such as a power meter, using Caffe
- ocular - Machine-learning OCR for historic documents
- SwiftOCR - fast and simple OCR library written in Swift
- attention-ocr - OCR engine using visual attention mechanisms
- RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
- simple-ocr-opencv and its fork - A simple pythonic OCR engine using OpenCV and NumPy
- Calamari - OCR Engine based on OCRopy and Kraken
Older and possibly abandoned OCR engines
- Clara OCR - Open source OCR in C
- Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
- Eye - an experimental Java OCR (image-to-text) application
- kognition - An omnifont OCR software for KDE
- OCRchie - Modular Optical Character Recognition Software
- ocre - o.c.r. easy
- xplab - A GTK 2 tool for pattern matching
- hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article)
OCR file formats
- hocr-tools - Tools for doing various useful things with hOCR files
- hocr-spec - hOCR 1.1 specification
- ocr-transform - CLI tool to convert between hOCR and ALTO
- hocr-parser - hOCR Specification Python Parser
- hOCRTools - hOCR to ALTO conversion XSLT
- ALTO XML Schema - XML Schema and development of the ALTO XML format
- ALTO XML Documentation - Documentation and use cases for ALTO
- alto-tools - Various tools to work with ALTO files, Python
- AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
- TEI-OCR - TEI customization for OCR generated layout and content information
- TEI SIG on Libraries - Best Practices for TEI in Libraries
- GDZ - METS/TEI-based GDZ document format
- OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
- Ocrocis - Project manager interface for Ocropy, see also external project homepage
- moz-hocr-editor - Firefox add-on for editing hOCR files (discontinued)
- qt-box-editor - Qt4 editor of tesseract-ocr box files.
- ocr-gt-tools - Client-Server application for editing OCR ground truth.
- Paperwork - Using scanners and OCR to grep paper documents the easy way.
- Paperless - Scan, index, and archive all of your paper documents.
- gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
- VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including jTessBoxEditor, a graphical Tesseract box data editor
- PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
- OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
- PRImA PAGE Viewer - Java-based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and hOCR.
- LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
- archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.
- NoiseRemove.java in MathOCR - Java implementation of
- binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
- typeface-corpus - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities.
- binarizewolfjolion - Comparison of binarization algorithms. Blog post
- crop_morphology.py in oldnyc - Cropping a page to just the text block
- Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
- Fred's ImageMagick script textcleaner - Processes a scanned document of text to clean the text background
- localcontrast - Fast O(1) local contrast optimization
OCR as a Service
- Open OCR - Run Tesseract in Docker containers
- tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
- docker-ocropy - A Docker container for running the ocropy OCR system.
- ABBYY Cloud OCR SDK Code samples - Code samples for using the proprietary commercial ABBYY OCR API.
- nidaba - An expandable and scalable OCR pipeline
- gamera - A meta-framework for building document processing applications, e.g. OCR
- ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
- ocrad-docker - Run the ocrad OCR engine in a Docker container
- kraken-docker - Run the kraken OCR engine in a Docker container
- ocr.space - Free Online OCR and OCR API by @a9t9 based on Tesseract (code is not open)
- ISRI OCR Evaluation Tools with a User Guide from 1996
- ocrevalUAtion - Cross-format evaluation, CLI and GUI
- ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
- quack - Quality-Assurance-tool for scans with corresponding ALTO-files
OCR libraries by programming language
- gosseract - Golang OCR library, wrapping Tesseract-ocr.
- Tess4J - Java Native Access bindings to Tesseract.
- tess-two - Tools for compiling Tesseract on Android and Java API.
- tesseract for .net - A .NET wrapper for tesseract-ocr.
- TTesseractOCR4 - Object Pascal binding for tesseract-ocr 4.x.
- Tesseract OCR for PHP - Tesseract PHP bindings.
- pytesseract - A Python wrapper for Google Tesseract.
- pyocr - A Python wrapper for Tesseract and Cuneiform.
- ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
- tesserocr - A Python wrapper for the tesseract-ocr API
- node-tesseract - A simple wrapper for the Tesseract OCR package.
- node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.
- rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
- ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
- ocr_space - API wrapper for the free OCR service ocr.space. Includes a CLI.
- tesseract.rs - Rust bindings for tesseract OCR.
- tesseract - R bindings for tesseract OCR.
OCR training tools
- glyph-miner - A system for extracting glyphs from early typeset prints
OCR-related publication and link lists
- IMPACT: Tools for text digitisation - List of tools and software projects, some related to OCR
- OCR-D - List of OCR-related academic articles in the context of the OCR-D project.
- Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
- eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
- Wikipedia: Comparison of optical character recognition software
- OCR [and Deep Learning] by @handong1587
- Ocropus Wiki: Publications
Blog Posts and Tutorials
- Tesseract Blends Old and New OCR Technology (2016) @theraysmith
  - Tutorial@DAS2016, updated "What You Always Wanted to Know" slides
- What You Always Wanted To Know About Tesseract (2014) @theraysmith
  - Tutorial@DAS2014, includes demos
- Extracting text from an image using Ocropus (2015)
- Training an Ocropus OCR model (2015) @danvk
- Ocropus Wiki: Compute errors and confusions (2016) @zuphilip
- Ocropus Wiki: Working with Ground Truth (2016) @zuphilip
- OCRopus (2016) @jze
  - Mostly on column separation in ocropus
- 10 Tips for making your OCR project succeed (2013) @cneud
  - General things to consider for OCR projects
- Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
  - Feature list for a commercial image pre-processing library; has nice before-after samples for pre-processing steps related to OCR
- Extracting Text from PDFs; Doing OCR; all within R @shawngraham
  - How to work with OCR from PDFs in the R programming environment
- Tutorial: Command-line OCR on a Mac @bmschmidt
  - Tutorial on how to run tesseract on Mac OS X
- Practical Experience with OCRopus Model Training (2016) @jze
- Homemade Manuscript OCR (1): OCRopy (2017) @Jean-Baptiste-Camps
  - Tutorial on applying OCR to medieval manuscripts with OCRopy
- Optimizing Binarization for OCRopus (2017) @jze
- Prototype demo for OCR postfix in Danish Newspapers (2016) @thomasegense
- How Can I OCR My Dictionary? (2016) @JessedeDoes
- "Needlessly complex" blog (2016) @mzucker. Several Python-based image processing how-tos.
- (Open-Source-)OCR-Workflows (2017) @wrznr
  - 🇩🇪 Overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), with many example images and information on the @OCR-D project.
- abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
- cvOCR - An OCR system for recognizing résumé or CV text, implemented in Python and C and based on tesseract
- MathOCR - A printed scientific document recognition system, pre-alpha
2011 and before
- High performance document layout analysis (2003) Breuel
- Adaptive degraded document image binarization (2006) Gatos, Pratikakis, Perantonis
- [Internship Report] (2007) Gupta
- OCRopus Addons (Internship Report) (2007) Dantrey
- Local Logistic Classifiers for Large Scale Learning (2012) Yousefi, Breuel
- High Performance OCR for Printed English and Fraktur using LSTM Networks (2013) Breuel, Ul-Hasan, Al Azawi, Shafait
- Can we build language-independent OCR using LSTM networks? (2013) Ul-Hasan, Breuel
- Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks (2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel
- OCR of historical printings of Latin texts: Problems, Prospects, Progress. (2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink
- Correcting Noisy OCR: Context beats Confusion (2014) Evershed, Fitch
- TypeWright: An Experiment in Participatory Curation (2015) Bilansky
  - On crowd-sourcing OCR post-correction
- Benchmarking of LSTM Networks (2015) Breuel
- Recognition of Historical Greek Polytonic Scripts Using LSTM (2015) Simistira, Ul-Hassan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki
- A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015) Karayil, Ul-Hasan, Breuel
- A Sequence Learning Approach for Multiple Script Identification (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel
- Important New Developments in Arabographic Optical Character Recognition (OCR) (2016) Romanov, Miller, Savant, Kiessling
- OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus (2016) Springmann, Lüdeling
- Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents (2016) Springmann, Fink, Schulz
- Generic Text Recognition using Long Short-Term Memory Networks (2016) Ul-Hasan (Ph.D. thesis)
- OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters (2016) Dengel, Ul-Hasan, Bukhari
- Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016) Lee, Osindero
- Telugu OCR Framework using Deep Learning (2015/2017) Achanta, Hastie
- A Two-Stage Method for Text Line Detection in Historical Documents (2018) Grüning, Leifert, Strauß, Labahn. Code available at https://github.com/TobiasGruening/ARU-Net