pdf-ocr

Recognize the page content of a PDF as text using Tesseract and Ghostscript.

Prerequisites

Installation

  • Clone or download this repository.
  • Open the solution in Visual Studio and run Install-Package Tesseract -Version 3.0.2 from the Package Manager Console.
  • Download the language data files for Tesseract 3.04 from the tessdata repository and add them to the tessdata folder of your project. Set Copy to output directory to Always for all the copied files. You can copy only the files for the languages you are interested in (e.g. all the files that start with eng for English).
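
A quick way to confirm that the data files for a language were actually copied is to list them. The following is a minimal sketch in Python, not part of this project's code. The helper name `language_data_files` is hypothetical, and it assumes the data files are named with the language code followed by a dot (for example `eng.traineddata`).

```python
from pathlib import Path


def language_data_files(tessdata_dir, language):
    """Return the sorted names of the files in tessdata_dir for one language.

    Assumption: data files are named "<language>.<something>",
    e.g. "eng.traineddata". Returns an empty list if the folder
    does not exist or holds no file for that language.
    """
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob(f"{language}.*") if p.is_file())


if __name__ == "__main__":
    found = language_data_files("tessdata", "eng")
    if found:
        print("Found language data:", ", ".join(found))
    else:
        print("No data files for 'eng' in the tessdata folder")
```

An empty result for the language you configured usually means the files were not copied into the tessdata folder.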

Configuration

| Setting | Variable name | Default | Description |
| --- | --- | --- | --- |
| Input PDF file | inputPdfFile | test.pdf, included in the repository | The PDF file whose selected page's content will be recognized as text. |
| Page number | pageNumber | 1 | The number of the page whose content will be recognized as text. |
| Recognition language | ocrLanguage | "eng" | The language Tesseract uses to recognize text. When you change this value, make sure you add the matching language data files to the tessdata folder. See the Installation section. |
| DPI for converting the PDF page to an image | pdfToImageDPI | 150 | Tesseract can't recognize text directly from PDF pages, which is why the PDF page has to be converted to an image first. This property sets the DPI used for that conversion. |
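
To illustrate how these four settings fit together, here is a rough sketch of the same two-step pipeline (render one PDF page to an image, then run OCR on the image) using the Ghostscript and Tesseract command-line tools from Python. It is not this project's code. The function names are hypothetical, and it assumes the `gs` and `tesseract` executables are installed and on the PATH (on Windows the Ghostscript console executable is typically named `gswin64c`).

```python
import os
import subprocess
import tempfile


def ghostscript_args(input_pdf, page_number, dpi, output_png, gs_executable="gs"):
    """Build the Ghostscript command that renders a single PDF page to a PNG."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    if dpi <= 0:
        raise ValueError("DPI must be positive")
    return [
        gs_executable,
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",              # 24-bit colour PNG output
        f"-r{dpi}",                     # rendering resolution (pdfToImageDPI)
        f"-dFirstPage={page_number}",   # render only the selected page
        f"-dLastPage={page_number}",
        f"-sOutputFile={output_png}",
        input_pdf,
    ]


def tesseract_args(image_path, language, tesseract_executable="tesseract"):
    """Build the Tesseract command that prints the recognized text to stdout."""
    return [tesseract_executable, image_path, "stdout", "-l", language]


def recognize_page(input_pdf="test.pdf", page_number=1, language="eng", dpi=150):
    """Render one page with Ghostscript, then recognize it with Tesseract."""
    with tempfile.TemporaryDirectory() as work_dir:
        png_path = os.path.join(work_dir, "page.png")
        subprocess.run(
            ghostscript_args(input_pdf, page_number, dpi, png_path), check=True
        )
        result = subprocess.run(
            tesseract_args(png_path, language),
            check=True, capture_output=True, text=True,
        )
    return result.stdout


if __name__ == "__main__":
    # Defaults mirror the table above: test.pdf, page 1, "eng", 150 DPI.
    print(recognize_page())
```

A higher DPI produces a larger image and usually gives the recognizer more detail to work with, at the cost of slower rendering and recognition.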

Tesseract usage

If you need more information on Tesseract usage, please visit its own repository.