greek letters with diacritics appear as rectangles #176

Closed
hubitor opened this issue Jul 25, 2017 · 3 comments

hubitor commented Jul 25, 2017

I'm testing OCRmyPDF on some Greek documents. When I copy and paste the text from the OCRed file into a text editor, all letters with diacritics appear as rectangles.

$ ocrmypdf -l ell -v 1 Ellada.pdf Ellada.ocr.pdf
DEBUG - ocrmypdf 5.2
DEBUG - tesseract 3.04.01
DEBUG - os.symlink(Ellada.pdf, /tmp/com.github.ocrmypdf.0xor195l/origin)


Tasks which will be run:

Task enters queue = 'ocrmypdf.pipeline.triage'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/origin, /tmp/com.github.ocrmypdf.0xor195l/origin.pdf)
Completed Task = 'ocrmypdf.pipeline.triage'
Task enters queue = 'ocrmypdf.pipeline.repair_pdf'
DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf.pipeline.repair_pdf'
Task enters queue = 'ocrmypdf.pipeline.split_pages'
Task enters queue = 'ocrmypdf.pipeline.generate_postscript_stub'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.page.pdf, /tmp/com.github.ocrmypdf.0xor195l/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.pipeline.split_pages'
Task enters queue = 'ocrmypdf.pipeline.orient_page'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.0xor195l/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.orient_page'
Task enters queue = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.skip_page'
Uptodate Task = 'ocrmypdf.pipeline.skip_page'

WARNING:
In Task 'ocrmypdf.pipeline.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.

DEBUG - Rasterize 000001.ocr.oriented.pdf with png16m
DEBUG -
Completed Task = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.preprocess_remove_background'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.page.png, /tmp/com.github.ocrmypdf.0xor195l/000001.pp-background.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_remove_background'
Task enters queue = 'ocrmypdf.pipeline.preprocess_deskew'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.pp-background.png, /tmp/com.github.ocrmypdf.0xor195l/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_deskew'
Task enters queue = 'ocrmypdf.pipeline.preprocess_clean'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.0xor195l/000001.pp-clean.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_clean'
Task enters queue = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.select_visible_page_image'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.pp-clean.png, /tmp/com.github.ocrmypdf.0xor195l/000001.ocr.png)
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.page.png, /tmp/com.github.ocrmypdf.0xor195l/000001.image)
Completed Task = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.ocr_tesseract_hocr'
DEBUG - ['tesseract', '-l', 'ell', '/tmp/com.github.ocrmypdf.0xor195l/000001.ocr.png', '/tmp/com.github.ocrmypdf.0xor195l/000001', 'hocr', 'txt']
Completed Task = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_image_layer'
DEBUG - 1: page eligible for lossless reconstruction
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.ocr.oriented.pdf, /tmp/com.github.ocrmypdf.0xor195l/000001.image-layer.pdf)
Completed Task = 'ocrmypdf.pipeline.select_image_layer'
Completed Task = 'ocrmypdf.pipeline.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.pipeline.render_hocr_page'
Completed Task = 'ocrmypdf.pipeline.render_hocr_page'
Task enters queue = 'ocrmypdf.pipeline.combine_layers'
Completed Task = 'ocrmypdf.pipeline.combine_layers'
Task enters queue = 'ocrmypdf.pipeline.merge_pages_ghostscript'
DEBUG - Final pages: /tmp/com.github.ocrmypdf.0xor195l/000001.rendered.pdf
/tmp/com.github.ocrmypdf.0xor195l/pdfa.ps
DEBUG -
Completed Task = 'ocrmypdf.pipeline.merge_pages_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.copy_final'
Completed Task = 'ocrmypdf.pipeline.copy_final'
INFO - Output file is a PDF/A-2B (as expected)
DEBUG - <PdfInfo('...'), page count=1>

Test file (excerpt from Wikipedia):
Ellada.ocr.pdf
Ellada.pdf

@jbarlow83
Collaborator

I recommend upgrading to Tesseract 3.05 which should resolve this issue. (Tested by me.)

If you prefer not to upgrade, you could try the following (not tested):

ocrmypdf --pdf-renderer tesseract --output-type pdf -l ell <input> <output>

This selects Tesseract's own PDF renderer, which tends to preserve less of the PDF formatting but does better on non-Latin scripts.
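
For anyone scripting this, the advice above can be sketched in Python as follows. This is only an illustration, not part of OCRmyPDF: the helper names `parse_tesseract_version` and `build_ocrmypdf_args` are hypothetical, and the 3.05 cutoff is taken from the recommendation in this comment. The sketch only builds the argument list and does not run anything.

```python
import re


def parse_tesseract_version(version_text):
    """Extract (major, minor) from text such as 'tesseract 3.04.01'.

    Hypothetical helper for illustration; not an OCRmyPDF function.
    """
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def build_ocrmypdf_args(version_text, language, input_pdf, output_pdf):
    """Build an ocrmypdf command line, adding the workaround flags
    from this thread when Tesseract is older than 3.05 (assumed cutoff)."""
    args = ["ocrmypdf", "-l", language]
    if parse_tesseract_version(version_text) < (3, 5):
        # Workaround suggested above; reported as untested in this thread.
        args += ["--pdf-renderer", "tesseract", "--output-type", "pdf"]
    args += [input_pdf, output_pdf]
    return args
```

The resulting list could be passed to `subprocess.run` if desired; with Tesseract 3.05 or later the extra flags are simply left out.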

@jbarlow83
Collaborator

OCRmyPDF v5.3 now gives a warning in this case.

@hubitor
Author

hubitor commented Jul 27, 2017

Thanks for the quick response! Just upgraded to Tesseract 4 and tested on the same file. Not only are there no rectangles, but the result is almost perfect!
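
A rough way to check copied text for this problem without eyeballing it is to count Greek letters that carry a diacritic. The snippet below is a minimal sketch using only Python's standard `unicodedata` module; the function name `count_accented_greek` is made up for illustration, and it assumes the text has already been extracted from the PDF by some other means.

```python
import unicodedata


def count_accented_greek(text):
    """Count Greek letters that carry at least one combining mark.

    Hypothetical helper: a count of zero on text that should contain
    accented Greek suggests the diacritics were lost.
    """
    count = 0
    # Compose first so a base letter and its marks form one character.
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        has_mark = any(unicodedata.combining(c) for c in decomposed[1:])
        if has_mark and "GREEK" in unicodedata.name(base, ""):
            count += 1
    return count
```

If the count is zero for a page that visibly contains accented letters, the text layer is probably still affected.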
