tesseract 4.00.00alpha image ocr correct, final PDF not #178

Closed
ghost opened this issue Jul 28, 2017 · 4 comments

Comments

ghost commented Jul 28, 2017

I'm testing a simple single page in which the text is mostly Hebrew. I can see that the hOCR working files contain OCR'd text (with some mistakes, but mostly OK), but the PDF produced does not seem to contain any text (just default Unicode glyphs). What could be the problem?

See the attached screenshots (the highlighted word is shown and copied as a dummy Unicode string).
The 2nd image is the hOCR file created for the 1st image.

Invocation:

ocrmypdf -l heb+eng --image-dpi 300 page.png out.pdf
tesseract -v
tesseract 4.00.00alpha
 leptonica-1.74.1
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2 : libopenjp2 2.1.2

 Found AVX
 Found SSE

screenshot from 2017-07-28 16-33-19
screenshot from 2017-07-28 16-33-56

Collaborator

jbarlow83 commented Aug 1, 2017

Please make sure you have ocrmypdf 5.3 and try the following:

ocrmypdf --pdf-renderer sandwich  -l heb+eng --image-dpi 300 page.png out.pdf

ocrmypdf's hocr renderer cannot render some scripts, depending on which fonts are available on the host system. The hocr renderer is the default on older versions of ocrmypdf, since it is more compatible. The default on newer versions is sandwich, when available.

Author

ghost commented Aug 1, 2017

Hi there, sorry for the delay.
Yes, I tried it with the Docker image, on a full PDF as well:

ocrmypdf --pdf-renderer sandwich  -l heb+eng in.pdf out.pdf

The sandwich renderer, as well as pdfsandwich, produces mirrored text, though there are some differences in what happens when you highlight/select text in a PDF reader (Evince/Document Viewer in my case).
The text might highlight correctly, but the OCR content itself is garbage (it will not be found in search and, when pasted, appears as garbage).

P.S. How can I easily add the tesseract-heb package to your docker.tess4 image, i.e. without rebuilding? I would typically attach to a running container, install some additional packages, and save/clone it. Your image, of course, exits immediately...

@jbarlow83
Collaborator

How does this one look to you?
ocrmypdf_53_sandwich_heb_eng.pdf

It's possible that the PDF readers cannot display mirrored text or Unicode correctly.

It could also be that Ghostscript messes up the right-to-left text. Try disabling it:

ocrmypdf --output-type pdf --pdf-renderer sandwich  -l heb+eng in.pdf out.pdf

pdftotext seems to display the text correctly in my terminal:
‫ברוב השפות המערביות משתמשים נמלה הצר
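
The pdftotext check above can be scripted to tell whether the text layer really holds Hebrew rather than placeholder glyphs. This is only a rough sketch, not part of ocrmypdf: the function name `has_hebrew` is hypothetical, and it merely looks for the UTF-8 lead bytes 0xD6/0xD7, which cover U+0580 to U+05FF (the Hebrew block plus a few neighbouring code points). It assumes a POSIX shell and a grep that matches raw bytes under `LC_ALL=C`.

```shell
# has_hebrew: hypothetical helper; succeeds if stdin contains a UTF-8
# lead byte 0xD6 or 0xD7 (code points U+0580..U+05FF, mostly Hebrew).
has_hebrew() {
    # Build the bracket expression from octal escapes so the script stays ASCII.
    pattern=$(printf '[\326\327]')
    # LC_ALL=C makes grep compare raw bytes instead of decoded characters.
    LC_ALL=C grep -q "$pattern"
}

# Usage (assumes pdftotext from poppler-utils is installed):
#   pdftotext out.pdf - | has_hebrew && echo "Hebrew text found"
```

A passing check only shows that Hebrew code points are present; it says nothing about whether the word order or direction is right.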


To add a new language package you could do the following:

host$ docker run --rm  -v /host/machine/tessdata:/home/docker -it --entrypoint /bin/bash ocrmypdf-tess4

That gets you a terminal inside the Docker container. Now

docker$ cp -r /usr/share/tesseract-ocr/tessdata/. /home/docker/

to make a copy of all tessdata on your local machine at /host/machine/tessdata

Then add any compatible Hebrew language files, and run Docker with the modified directory added to the list of volumes:

host$ docker run --rm  -v /host/machine/tessdata:/usr/share/tesseract-ocr/tessdata -v /host/pwd:/home/docker ocrmypdf-tess4

so that the Docker image's tessdata is replaced with the files from your local directory. You cannot mix Tesseract 3 and 4 files; you have to download the appropriate files for Tesseract 4.
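
Before mounting the local directory, it may help to confirm that it contains a `<lang>.traineddata` file for every language passed to `-l`. The following is a minimal sketch under that assumption; `check_tessdata` is a hypothetical helper name, and it only checks that the files exist, not whether they are Tesseract 3 or Tesseract 4 files.

```shell
# check_tessdata: hypothetical helper; verifies that DIR holds a
# <lang>.traineddata file for every language in a "heb+eng"-style list.
check_tessdata() {
    dir=$1
    langs=$2
    status=0
    # Split the language list on "+", the separator used by the -l option.
    old_ifs=$IFS
    IFS='+'
    for lang in $langs; do
        if [ ! -f "$dir/${lang}.traineddata" ]; then
            echo "missing: $dir/${lang}.traineddata" >&2
            status=1
        fi
    done
    IFS=$old_ifs
    return $status
}
```

For example, `check_tessdata /host/machine/tessdata heb+eng` returns a non-zero status and names each missing file on stderr.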

@jbarlow83
Collaborator

Closing due to lack of response and inability to reproduce.
