tesseract 4.00.00alpha image ocr correct, final PDF not #178

ghost · 2017-07-28T14:43:25Z

I'm testing a simple single page in which the text is mostly Hebrew. I can see that the hocr working files contain OCR'd text (with some mistakes, but mostly OK), but the PDF produced does not seem to contain any text (just default Unicode glyphs). What could be the problem?

See attached screenshots (the highlighted word is shown and copied as a dummy Unicode string).
The 2nd image is the hocr file created for the 1st image.

Invocation:

ocrmypdf -l heb+eng --image-dpi 300 page.png out.pdf

tesseract -v
tesseract 4.00.00alpha
 leptonica-1.74.1
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2 : libopenjp2 2.1.2

 Found AVX
 Found SSE


jbarlow83 · 2017-08-01T00:45:13Z

Please make sure you have ocrmypdf 5.3 and try the following

ocrmypdf --pdf-renderer sandwich  -l heb+eng --image-dpi 300 page.png out.pdf

ocrmypdf's hocr renderer is incapable of rendering some scripts depending on the host system's supply of fonts. This is the default on older versions of ocrmypdf, since it is more compatible. The default on newer versions is sandwich, when available.
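A quick way to confirm the version requirement before retrying is sketched below. It assumes `sort -V` (GNU coreutils) is available and that `ocrmypdf --version` prints a bare version string; `version_ge` is a hypothetical helper name, not part of ocrmypdf.

```shell
# version_ge A B: succeed if version A >= version B (hypothetical helper).
# Relies on sort -V for version-aware ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only attempt the check when ocrmypdf is actually installed.
if command -v ocrmypdf >/dev/null 2>&1; then
    # Assumption: --version prints just the version number, e.g. "5.3".
    installed="$(ocrmypdf --version 2>&1 | head -n 1)"
    if version_ge "$installed" "5.3"; then
        echo "ocrmypdf $installed: new enough to try --pdf-renderer sandwich"
    else
        echo "ocrmypdf $installed is older than 5.3; upgrade first"
    fi
fi
```

Whether the sandwich renderer is actually usable also depends on the installed Tesseract, so a passing check here is only a first step.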

ghost · 2017-08-01T08:30:13Z

Hi there, sorry for the delay.
Yes, tried it with the docker image, on a full pdf as well:

ocrmypdf --pdf-renderer sandwich  -l heb+eng in.pdf out.pdf

The sandwich renderer as well as pdfsandwich produce mirrored text, though there are some differences as to what happens when you highlight/select text in a PDF reader (Evince/Document Viewer in my case).
Text might highlight correctly, but the OCR content itself is garbage (it will not be found in search, and when pasted, it comes out as garbage).

P.S. How can I easily add the tesseract-heb package to your docker.tess4 image, i.e. without rebuilding? I would typically attach to a running container, install some additional packages and save/clone. Your image of course exits immediately...

jbarlow83 · 2017-08-01T18:48:44Z

How does this one look to you?
ocrmypdf_53_sandwich_heb_eng.pdf

It's possible that the PDF readers cannot display mirrored text or Unicode correctly.

It could also be that Ghostscript messes up the right-to-left text. Try disabling it:

ocrmypdf --output-type pdf --pdf-renderer sandwich  -l heb+eng in.pdf out.pdf

pdftotext seems to display the text correctly in my terminal:
‫ברוב השפות המערביות משתמשים נמלה הצר
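To check this without relying on a PDF viewer, the extracted text can be scanned for Hebrew letters. The sketch below assumes `pdftotext` (from Poppler) is installed and that its output is UTF-8; `has_hebrew` is a hypothetical helper that matches the UTF-8 byte pattern for U+05D0 to U+05EA.

```shell
# has_hebrew: succeed if stdin contains at least one Hebrew letter
# (U+05D0..U+05EA, encoded in UTF-8 as 0xD7 followed by 0x90..0xAA).
# The pattern is built from octal escapes: \327 = 0xD7, \220 = 0x90, \252 = 0xAA.
# LC_ALL=C makes grep compare raw bytes, independent of the user's locale.
has_hebrew() {
    LC_ALL=C grep -q "$(printf '\327[\220-\252]')"
}

# Run against out.pdf only if the tool and the file are present.
if command -v pdftotext >/dev/null 2>&1 && [ -f out.pdf ]; then
    # "-" sends the extracted text to stdout instead of a file.
    if pdftotext -enc UTF-8 out.pdf - | has_hebrew; then
        echo "text layer contains Hebrew letters"
    else
        echo "no Hebrew letters found in the text layer"
    fi
fi
```

If this reports Hebrew letters while the viewer still shows garbage on copy, the problem is more likely in the viewer than in the PDF's text layer.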

To add a new language package you could do the following:

host$ docker run --rm  -v /host/machine/tessdata:/home/docker -it --entrypoint /bin/bash ocrmypdf-tess4

That gets you a terminal inside the Docker container. Now

docker$ cp -r /usr/share/tesseract-ocr/tessdata/. /home/docker/

to make a copy of all tessdata on your local machine at /host/machine/tessdata

Then add any compatible Hebrew language files. Run docker by adding the modified volume to the list

host$ docker run --rm  -v /host/machine/tessdata:/usr/share/tesseract-ocr/tessdata -v /host/pwd:/home/docker ocrmypdf-tess4

so it will replace the Docker image's tessdata with files from a local directory. You cannot mix tess3/4 files. You have to download the appropriate files for tess4.
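Before mounting the directory it can help to confirm that the language files are actually there. A minimal sketch, assuming the usual `<lang>.traineddata` file naming; `have_lang` is a hypothetical helper, and `/host/machine/tessdata` is the placeholder path used above.

```shell
# have_lang DIR LANG: succeed if DIR holds LANG.traineddata (hypothetical helper).
have_lang() {
    [ -f "$1/$2.traineddata" ]
}

tessdata=/host/machine/tessdata   # placeholder path from the steps above

# Report which of the languages passed to -l heb+eng are present.
for lang in heb eng; do
    if have_lang "$tessdata" "$lang"; then
        echo "$lang: found"
    else
        echo "$lang: missing from $tessdata"
    fi
done
```

This only checks that the files exist. It does not tell you whether they are tess4-compatible.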

jbarlow83 · 2017-11-10T22:25:04Z

Closing due to lack of response/inability to reproduce

jbarlow83 closed this as completed Nov 10, 2017

robinrosenstock mentioned this issue Apr 5, 2018

White glyphs when selecting ocr-text in Evince #249

Closed
