michal-h21/hocrtex

Hocrtex

Hocr is an HTML microformat for information produced by OCR packages. You can find more information about hocr in this document. It can be generated by tesseract (version 3.0 or later), ocropus, or cuneiform. With hocrtex, it is possible to use this information from LaTeX and convert the hocr file to PDF.

Hocrtex is based on xmltex, an XML processor written in TeX.

Install

Unpack the contents of the file hocr.tar.gz into your local texmf directory. You can find its location with the following command:

kpsewhich -var-value TEXMFHOME
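The install step can be sketched as a small shell function. This is only an illustration: the name install_hocrtex is hypothetical, and it assumes the archive already contains a directory layout that makes sense inside a texmf tree.

```sh
# Hypothetical helper: unpack the hocrtex archive into a texmf tree.
# $1 = path to hocr.tar.gz, $2 = target texmf directory
install_hocrtex() {
  archive=$1
  texmf_dir=$2
  # The local texmf directory may not exist yet, so create it first.
  mkdir -p "$texmf_dir" || return 1
  tar -xzf "$archive" -C "$texmf_dir"
}

# Usage (assumes kpsewhich is on PATH):
# install_hocrtex hocr.tar.gz "$(kpsewhich -var-value TEXMFHOME)"
```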

Usage

First, you need an hocr file. To get one, process the images of your scanned book with one of the OCR packages listed above.

In tesseract, you can generate hocr output as follows:

  1. Create a file named "hocr", put it somewhere, and copy this line into it:

    tessedit_create_hocr 1

  2. Call tesseract:

    tesseract imagename outputname -l lang_name +path_to_hocr/hocr
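The two steps above can be scripted roughly like this. The function name write_hocr_config and the directory used in the usage lines are placeholders, not part of hocrtex or tesseract.

```sh
# Hypothetical helper: write the one-line tesseract config file named "hocr".
# $1 = directory to put the file in
write_hocr_config() {
  mkdir -p "$1" || return 1
  printf 'tessedit_create_hocr 1\n' > "$1/hocr"
}

# Usage, following the steps above (imagename, outputname and lang_name
# are the same placeholders as in the command shown earlier):
# write_hocr_config "$HOME/tessconfigs"
# tesseract imagename outputname -l lang_name +"$HOME/tessconfigs/hocr"
```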

Now we have an HTML file with hocr information.

To process it with hocrtex, we need to generate a config file using the hocrconfig package.

Create a file sample.tex:

\documentclass{article}
\usepackage[
   FileName=example   % name of the hocr file without the .html suffix
  ,ResizeRatio=5.5    % divisor that converts bbox coordinates to points
  ,ImageName=normal-  % in hocr, each page includes the name of its
                      % source image, but if the source image is a multipage
                      % TIFF, this name is the same on all pages. It is best
                      % to convert the TIFF into a series of PNG images
                      % named normal-0.png, ..., normal-n.png.
                      % ImageName is the prefix before the image number.
  ,Driver=underimage  % the driver defines actions on hocr classes
]{hocrconfig}
\begin{document}
\end{document}
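A note on choosing ResizeRatio: hocr bbox coordinates are pixel positions in the scanned image, so the divisor is roughly the number of pixels per point. The sketch below assumes TeX points (72.27 per inch) and a known scan resolution. The helper name resize_ratio is hypothetical, and whether hocrtex itself uses TeX points or PostScript points (72 per inch) is an assumption here.

```sh
# Hypothetical helper: estimate ResizeRatio from the scan resolution.
# $1 = resolution of the scanned image in dpi
resize_ratio() {
  # pixels per TeX point = (pixels per inch) / (72.27 points per inch)
  awk -v dpi="$1" 'BEGIN { printf "%.2f\n", dpi / 72.27 }'
}

# Usage:
# resize_ratio 400
```

Under these assumptions, a 400 dpi scan gives about 5.53, close to the value 5.5 used in sample.tex.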

After compilation with LaTeX, the config file is created (normal.cfg, assuming the hocr file is normal.html). Now you can call xmltex:

pdfxmltex normal.html    

The file normal.pdf will be created.
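Both compilation steps can be wrapped in one shell function, sketched below. The name build_hocr_pdf is hypothetical. The function assumes that sample.tex sits in the current directory and that the config file written by LaTeX has the same base name as the hocr file.

```sh
# Hypothetical helper: generate the config file, then run xmltex.
# $1 = hocr file name without the .html suffix, e.g. normal
build_hocr_pdf() {
  base=$1
  # Fail early if the hocr file is not in the current directory.
  [ -f "$base.html" ] || { echo "missing $base.html" >&2; return 1; }
  # Compiling sample.tex writes the config file.
  latex sample.tex || return 1
  [ -f "$base.cfg" ] || { echo "missing $base.cfg" >&2; return 1; }
  # xmltex processes the hocr file and writes the PDF.
  pdfxmltex "$base.html"
}

# Usage:
# build_hocr_pdf normal
```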
