Tesseract is an open-source OCR (optical character recognition) engine.
Home page
Any image readable by Leptonica is supported in Tesseract, including BMP, PNM, PNG, JFIF, JPEG, TIFF and GIF.
- Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".
- Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
- This wrapper currently supports two of these output formats: plain text and hOCR (HTML).
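Because hOCR output is ordinary HTML, the recognized words and their positions can be pulled out with any HTML parser. The sketch below is a minimal illustration, assuming the common hOCR convention that each word is a `span` with class `ocrx_word` whose `title` attribute contains `bbox x0 y0 x1 y1`. The names `HocrWordParser`, `parse_bbox`, `extract_words` and the sample markup are hypothetical and are not part of this wrapper.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from spans with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # Words without a parsable bbox are skipped in this sketch.
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample markup (an assumption, not real recognizer output).
SAMPLE_HOCR = (
    "<div class='ocr_page' title='bbox 0 0 200 50'>"
    "<span class='ocr_line' title='bbox 10 10 190 40'>"
    "<span class='ocrx_word' title='bbox 10 10 60 40; x_wconf 93'>Hello</span> "
    "<span class='ocrx_word' title='bbox 70 10 130 40; x_wconf 90'>world</span>"
    "</span></div>"
)

for text, bbox in extract_words(SAMPLE_HOCR):
    print(text, bbox)
```

The bounding boxes are in image pixel coordinates, so they can be used to highlight recognized words on the original image.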
The demo supports two languages, English and Russian (other languages on demand), and can output plain text or HTML.
- Open an image.
- Select the language of the document.
- Select the appropriate page segmentation mode.
- Select the desired output format (text or HTML).
- Press the "Recognize from image" button to extract text from the whole image.
-- or --
- Press "Add rectangle" and move/resize it as needed to mark one block of text. Scroll the image if necessary.
- Repeat the previous step until all needed blocks are marked.
- Press the "Recognize rectangles" button to extract text from all rectangles.
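The choices made in the steps above (image, language, page segmentation mode, output format) correspond to arguments of the stock Tesseract command-line tool. The following is a minimal sketch of that mapping, assuming the 3.x `tesseract` CLI rather than this wrapper's own API, which is not documented here. The function name `build_tesseract_command` and the file names are hypothetical. Rectangle selection is not covered.

```python
def build_tesseract_command(image_path, output_base, lang="eng", psm=3, hocr=False):
    """Build an argument list for the Tesseract 3.x command-line tool.

    lang is a Tesseract language code such as "eng" or "rus".
    psm  is the page segmentation mode number (3 is fully automatic).
    hocr selects hOCR (HTML) output instead of plain text.
    """
    # Tesseract 3.x spells the option "-psm"; 4.x and later use "--psm".
    cmd = ["tesseract", image_path, output_base, "-l", lang, "-psm", str(psm)]
    if hocr:
        # "hocr" is the name of a config file that switches the output format.
        cmd.append("hocr")
    return cmd


# Russian document, treated as a single uniform block of text, hOCR output.
print(build_tesseract_command("scan.png", "result", lang="rus", psm=6, hocr=True))
```

The returned list can be passed to `subprocess.run` on a machine where Tesseract is installed. The function only builds the command, so it can be inspected without running OCR.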
https://yadi.sk/d/0uMA7pUxUQ3zYw
- C6 or higher (C5 on demand)
A .NET wrapper for tesseract-ocr 3.04.
Free