I'm testing OCRmyPDF on some Greek documents. When I copy and paste the text from the OCRed file into a text editor, all letters with diacritics appear as rectangles.
WARNING:
In Task 'ocrmypdf.pipeline.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.
Thanks for the quick response! I just upgraded to Tesseract 4 and tested on the same file. Not only are there no rectangles, but the result is almost perfect!
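For anyone who wants to check this kind of result without relying on how an editor renders the text, here is a minimal sketch. It assumes the text copied from the OCRed PDF is already available as a Python string, and the helper name `classify_chars` is hypothetical, not part of OCRmyPDF or Tesseract. It counts Greek letters, Greek letters carrying a diacritic, and code points that commonly display as boxes (U+FFFD, private-use, unassigned). A nonzero "suspicious" count suggests the text layer holds wrong code points; a zero count with rectangles still on screen points more toward a font issue in the viewer or editor, though this is a heuristic rather than a definitive diagnosis.

```python
import unicodedata


def classify_chars(text):
    """Count Greek letters, accented Greek letters, and placeholder-like characters."""
    counts = {"greek": 0, "greek_accented": 0, "suspicious": 0}
    # Compose base letter + combining mark pairs so each letter is one code point.
    for ch in unicodedata.normalize("NFC", text):
        category = unicodedata.category(ch)
        # U+FFFD, private-use (Co) and unassigned (Cn) code points usually render as boxes.
        if ch == "\ufffd" or category in ("Co", "Cn"):
            counts["suspicious"] += 1
            continue
        # Only count letters whose Unicode name marks them as Greek.
        if category.startswith("L") and unicodedata.name(ch, "").startswith("GREEK"):
            counts["greek"] += 1
            # A letter with a diacritic decomposes into a base letter plus combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            if any(unicodedata.combining(c) for c in decomposed):
                counts["greek_accented"] += 1
    return counts


if __name__ == "__main__":
    # Sample string chosen for illustration; paste the copied OCR text here instead.
    print(classify_chars("Ελλάδα"))
```

Running the same check on text copied from the Tesseract 3 output and from the Tesseract 4 output would give a rough before/after comparison.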
$ ocrmypdf -l ell -v 1 Ellada.pdf Ellada.ocr.pdf
DEBUG - ocrmypdf 5.2
DEBUG - tesseract 3.04.01
DEBUG - os.symlink(Ellada.pdf, /tmp/com.github.ocrmypdf.0xor195l/origin)
Tasks which will be run:
Task enters queue = 'ocrmypdf.pipeline.triage'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/origin, /tmp/com.github.ocrmypdf.0xor195l/origin.pdf)
Completed Task = 'ocrmypdf.pipeline.triage'
Task enters queue = 'ocrmypdf.pipeline.repair_pdf'
DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf.pipeline.repair_pdf'
Task enters queue = 'ocrmypdf.pipeline.split_pages'
Task enters queue = 'ocrmypdf.pipeline.generate_postscript_stub'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.page.pdf, /tmp/com.github.ocrmypdf.0xor195l/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.pipeline.split_pages'
Task enters queue = 'ocrmypdf.pipeline.orient_page'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.0xor195l/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.orient_page'
Task enters queue = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.skip_page'
Uptodate Task = 'ocrmypdf.pipeline.skip_page'
WARNING:
In Task 'ocrmypdf.pipeline.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.
DEBUG - Rasterize 000001.ocr.oriented.pdf with png16m
DEBUG -
Completed Task = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.preprocess_remove_background'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.page.png, /tmp/com.github.ocrmypdf.0xor195l/000001.pp-background.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_remove_background'
Task enters queue = 'ocrmypdf.pipeline.preprocess_deskew'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.pp-background.png, /tmp/com.github.ocrmypdf.0xor195l/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_deskew'
Task enters queue = 'ocrmypdf.pipeline.preprocess_clean'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.0xor195l/000001.pp-clean.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_clean'
Task enters queue = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.select_visible_page_image'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.pp-clean.png, /tmp/com.github.ocrmypdf.0xor195l/000001.ocr.png)
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.page.png, /tmp/com.github.ocrmypdf.0xor195l/000001.image)
Completed Task = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.ocr_tesseract_hocr'
DEBUG - ['tesseract', '-l', 'ell', '/tmp/com.github.ocrmypdf.0xor195l/000001.ocr.png', '/tmp/com.github.ocrmypdf.0xor195l/000001', 'hocr', 'txt']
Completed Task = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_image_layer'
DEBUG - 1: page eligible for lossless reconstruction
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.0xor195l/000001.ocr.oriented.pdf, /tmp/com.github.ocrmypdf.0xor195l/000001.image-layer.pdf)
Completed Task = 'ocrmypdf.pipeline.select_image_layer'
Completed Task = 'ocrmypdf.pipeline.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.pipeline.render_hocr_page'
Completed Task = 'ocrmypdf.pipeline.render_hocr_page'
Task enters queue = 'ocrmypdf.pipeline.combine_layers'
Completed Task = 'ocrmypdf.pipeline.combine_layers'
Task enters queue = 'ocrmypdf.pipeline.merge_pages_ghostscript'
DEBUG - Final pages: /tmp/com.github.ocrmypdf.0xor195l/000001.rendered.pdf
/tmp/com.github.ocrmypdf.0xor195l/pdfa.ps
DEBUG -
Completed Task = 'ocrmypdf.pipeline.merge_pages_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.copy_final'
Completed Task = 'ocrmypdf.pipeline.copy_final'
INFO - Output file is a PDF/A-2B (as expected)
DEBUG - <PdfInfo('...'), page count=1>
Test file (excerpt from Wikipedia):
Ellada.ocr.pdf
Ellada.pdf