White glyphs when selecting ocr-text in Evince #249

Closed
robinrosenstock opened this issue Apr 5, 2018 · 20 comments

@robinrosenstock

robinrosenstock commented Apr 5, 2018

Problem in evince pdf reader:

[Screenshot: screenshot_20180405_101240]

It only happens when selecting text. Is this a display failure, or are fonts missing? Otherwise the OCR text is correct.
Similar to #178?

@robinrosenstock
Author

Found this: https://unix.stackexchange.com/questions/306051/tesseract-is-it-possible-to-change-font-output-in-ocred-pdf

Do I really need to rebuild Tesseract for this? Is there no way around it with OCRmyPDF?

@robinrosenstock
Author

Does this only happen with scanned pages?

@jbarlow83
Collaborator

At a glance it looks like that version of evince can't display the Tesseract glyphless font correctly, although that would be surprising since it's been checked.

Are you using Tesseract 4? What command line?

Try changing the PDF renderer
https://ocrmypdf.readthedocs.io/en/latest/advanced.html#changing-the-pdf-renderer

Try building a regular PDF instead of PDF/A
--output-type pdf

Can you use Tesseract on an image to create a PDF and view that PDF?
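
The three checks above could be written out as command lines. This is only a rough sketch: the file names (scan.pdf, page.png, and the out_* outputs) are made-up placeholders, and the helper `diagnostic_commands` is hypothetical. The `--pdf-renderer` and `--output-type` flags are the OCRmyPDF options named in this thread, and `tesseract <image> <outputbase> pdf` is Tesseract's usual way to write a PDF.

```python
import shlex


def diagnostic_commands(pdf_in="scan.pdf", image_in="page.png"):
    """Return the three checks as argv lists. All file names are placeholders."""
    return [
        # 1. Same input, but a different renderer for the text layer.
        ["ocrmypdf", "--pdf-renderer", "hocr", pdf_in, "out_hocr.pdf"],
        # 2. Build a regular PDF instead of PDF/A.
        ["ocrmypdf", "--output-type", "pdf", pdf_in, "out_plain.pdf"],
        # 3. Tesseract on its own, no OCRmyPDF involved; writes out_tesseract.pdf.
        ["tesseract", image_in, "out_tesseract", "pdf"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be copied into a shell.
    for argv in diagnostic_commands():
        print(shlex.join(argv))
```

Opening each output in Evince and selecting text should show which step introduces the blank glyphs.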

@robinrosenstock
Author

Thanks for your help:
tesseract --version
tesseract 3.05.01
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1

--output-type pdf does not work.
I haven't tried running Tesseract on an image directly, because I only use PDFs as input, not images.
But it works when using the hocr renderer.
I will wait until the Arch repos upgrade Tesseract to version 4, in the hope that the other renderers will work too.
Thanks for your help. I will close this, but if anyone has other ideas, please go ahead.

@jbarlow83
Collaborator

What Linux and version of evince? This might be something that the maintainers of Tesseract or evince need to take up.

@robinrosenstock
Author

Evince Version: 3.26.0+14+g2a499547-1
Linux Kernel version: 4.14.31-1

@amitdo

amitdo commented Jul 21, 2018

@bitwave

bitwave commented Jul 31, 2018

I have the same problem.
evince version: 3.28.2
Linux Kernel version: 4.17.11

It looks like the problem lies in Ghostscript: tesseract-ocr/tesseract#712

@jbarlow83
Collaborator

jbarlow83 commented Aug 1, 2018

@bitwave The Ghostscript issue is fixed now. An older version of Tesseract/Ghostscript will still be problematic, but I managed to replicate the problem with gs 9.23.

I believe the issue is with Evince itself, so I reported it there.
https://gitlab.gnome.org/GNOME/evince/issues/953

@amitdo

amitdo commented Sep 7, 2018

https://gitlab.freedesktop.org/poppler/poppler/issues/157

@ghost

ghost commented Sep 17, 2018

[Screenshot: screenshot_27]
Opening the PDF in a web browser lets you see the text while selecting it.

@titaniumbones

I'm still seeing this issue in poppler-based viewers. Is there any workaround available?

@grossherr

@titaniumbones as mentioned above, change the renderer to hocr, so use --pdf-renderer hocr. Works for me in evince.

@titaniumbones

Where do I add this switch (--pdf-renderer hocr)? In tesseract? evince? Just in poppler somehow? I'm not seeing that switch in any docs...

Thanks for the help!

@titaniumbones

Ah shoot, of course you meant in ocrmypdf. That is really helpful.
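
For anyone else wondering where the switch goes: it is an OCRmyPDF option, placed before the input and output file names. Here is a minimal sketch of running it from a script, assuming `ocrmypdf` is on the PATH. The helper name `ocr_with_hocr_renderer` and the file names are hypothetical.

```python
import shutil
import subprocess


def ocr_with_hocr_renderer(src, dst, run=True):
    """Build (and optionally run) ocrmypdf with the hocr renderer workaround."""
    argv = ["ocrmypdf", "--pdf-renderer", "hocr", src, dst]
    # Only execute when asked to and when the ocrmypdf executable is installed.
    if run and shutil.which("ocrmypdf"):
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    # Placeholder file names; prints the command without running it.
    print(" ".join(ocr_with_hocr_renderer("input.pdf", "output.pdf", run=False)))
```

The equivalent shell command is simply ocrmypdf --pdf-renderer hocr input.pdf output.pdf.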

@shinygnu

shinygnu commented Aug 2, 2019

Thank you all, this helped me fix what I thought was a deficiency in OCRmyPDF. For me the boxes were black, not white, and they appeared in all viewers, whether Evince, Okular, or PDF-Tools in Emacs.

Adding --pdf-renderer hocr makes highlighted text visible again. Beautiful!

@amitdo

amitdo commented Oct 24, 2019

https://gitlab.freedesktop.org/poppler/poppler/merge_requests/280

@julian-klode, can you please do what the maintainer asked for in your PR?

@julian-klode

I'd love to, but I broke my wrist and the elbow on the other side, so I'm a bit incapacitated.

@amitdo

amitdo commented Oct 24, 2019

Hope you will feel better soon :-)

@amitdo

amitdo commented Oct 3, 2020
