pdf-ocr

Recognize the content of a PDF page as text using Tesseract and Ghostscript.

Prerequisites

Clone or download this repository.
Open the solution in Visual Studio and run `Install-Package Tesseract -Version 3.0.2` from the Package Manager Console.
Download the language data files for Tesseract 3.04 from the tessdata repository and add them to the tessdata folder of your project. Set Copy to output directory to Always for every copied file. You only need the files for the languages you plan to use (e.g. all the files that start with eng for English).

Setting	Variable name	Default	Description
Input PDF file	`inputPdfFile`	`test.pdf`, included in the repository	The PDF file whose selected page's content will be recognized as text.
Page number	`pageNumber`	`1`	The number of the page whose content will be recognized as text.
Recognition language	`ocrLanguage`	`"eng"`	The language Tesseract uses to recognize text. When you change this value, make sure you add the matching language data files to the tessdata folder. See the Prerequisites section.
DPI for converting the PDF page to an image	`pdfToImageDPI`	`150`	Tesseract cannot read PDF pages directly, so the page is first converted to an image. This value sets the DPI used for that conversion.
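
The PDF-to-image step controlled by `pageNumber` and `pdfToImageDPI` can be illustrated with the Ghostscript command line. This is only a sketch of the equivalent command, not the code this project runs: the function name `render_pdf_page`, the `GS` override variable, and the output file name are assumptions made for illustration. On Windows the Ghostscript console executable is usually named `gswin64c` or `gswin32c` rather than `gs`.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): render one page of a
# PDF to a PNG image with Ghostscript at the requested resolution.
render_pdf_page() {
    input_pdf=$1     # plays the role of inputPdfFile
    page_number=$2   # plays the role of pageNumber
    dpi=$3           # plays the role of pdfToImageDPI
    output_png=$4    # assumed output path for the rendered page

    # GS lets the caller choose the executable (e.g. gswin64c on Windows).
    # -dFirstPage/-dLastPage restrict rendering to a single page,
    # -r sets the resolution in DPI, png16m writes a 24-bit colour PNG.
    "${GS:-gs}" -dBATCH -dNOPAUSE -dSAFER -sDEVICE=png16m \
        "-r$dpi" "-dFirstPage=$page_number" "-dLastPage=$page_number" \
        "-sOutputFile=$output_png" "$input_pdf"
}
```

With the defaults from the table, the call would be `render_pdf_page test.pdf 1 150 page.png`. A higher DPI generally gives Tesseract more detail to work with, at the cost of a larger image and slower recognition.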

For more information on using Tesseract, see the Tesseract repository.