[Bug]: doesn't always parse Latin with diacritics #1344

Closed
arsinclair opened this issue Jun 28, 2024 · 3 comments

arsinclair commented Jun 28, 2024

Describe the bug

When OCR'ing English (Latin-script) text that contains diacritics, OCRmyPDF doesn't always recognise them. The diacritics in my document are part of surnames originating from Hungary and Belgium.

I've tried with just the English dictionary, with the English + Hungarian dictionaries, and with the Latin script model (which has an extended character map), all to no avail.

The words: poéme, pathétique, animé are recognised.

The words: Ysaÿe, Jenő, Petőfi, etc. are not recognised.

The words csárdás, Telmányi, Dvořák are recognised only with Latin script.
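
One way to make the three groups above reproducible is to check the extracted text programmatically. The following is a minimal sketch, not part of OCRmyPDF: `strip_diacritics` and `classify_words` are hypothetical helpers, and the code assumes the recognised text is already available as a string. Both sides are normalised to NFC so that precomposed and combining forms compare equal, and a second comparison with combining marks removed distinguishes "diacritics lost" from "word missing entirely".

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'Jenő' becomes 'Jeno'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_words(expected: list[str], ocr_text: str) -> dict[str, str]:
    """Label each expected word as 'exact', 'diacritics lost', or 'missing'."""
    text = unicodedata.normalize("NFC", ocr_text)
    bare_text = strip_diacritics(text)
    result = {}
    for word in expected:
        word_nfc = unicodedata.normalize("NFC", word)
        if word_nfc in text:
            # The word appears with its diacritics intact.
            result[word] = "exact"
        elif strip_diacritics(word_nfc) in bare_text:
            # The base letters are present but the marks differ or are gone.
            result[word] = "diacritics lost"
        else:
            result[word] = "missing"
    return result
```

Calling `classify_words(["Ysaÿe", "Jenő", "Petőfi"], text)` on the text from each run gives a per-word comparison between language settings. The check is a plain substring match, so short words may match inside longer ones.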

Steps to reproduce

Combinations tried:

1. `ocrmypdf -l eng Booklet.pdf Booklet_ocr.pdf`
2. `ocrmypdf -l eng+hun Booklet.pdf Booklet_ocr.pdf`
3. `ocrmypdf -l script/Latin Booklet.pdf Booklet_ocr.pdf`
4. `ocrmypdf -l Latin Booklet.pdf Booklet_ocr.pdf`
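
The four runs above can be scripted so that each language setting produces its own output PDF and text file for comparison. This is a rough sketch under stated assumptions: `build_commands` is a hypothetical helper, the output file naming is invented for illustration, and `--sidecar` is added so that OCRmyPDF writes the recognised text to a separate file. The function only builds the argument lists; running them requires OCRmyPDF and the matching Tesseract language data to be installed.

```python
from pathlib import Path

# Language settings taken from the four runs listed above.
LANGUAGES = ["eng", "eng+hun", "script/Latin", "Latin"]


def build_commands(source: str, languages: list[str] = LANGUAGES) -> list[list[str]]:
    """Return one ocrmypdf argument list per language setting."""
    stem = Path(source).stem
    commands = []
    for lang in languages:
        # Make the language usable in a file name (illustrative naming only).
        tag = lang.replace("/", "_").replace("+", "_")
        commands.append([
            "ocrmypdf",
            "-l", lang,
            "--sidecar", f"{stem}_{tag}.txt",
            source,
            f"{stem}_{tag}_ocr.pdf",
        ])
    return commands
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, and the resulting text files compared side by side.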

Files

Source file: Booklet.pdf

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

16.3.1+dfsg1

Relevant log output

No response

@arsinclair arsinclair added the triage Issue needs triage label Jun 28, 2024
@jbarlow83 jbarlow83 added third party issue Problem with a third party dependency and removed triage Issue needs triage labels Jun 30, 2024
@jbarlow83
Collaborator

OCRmyPDF mainly transcribes the OCR output from Tesseract to a PDF. It does not handle OCR itself so it cannot usually improve issues like missed diacritics.

If you run `ocrmypdf --keep-temporary-files ...`, a folder will be produced containing the page images that were sent to OCR. Your best bet is to take these images and report them to https://github.com/tesseract-ocr/tesseract/ as missed recognition. The font in the sample is a little unusual, and the spacing between letters and their diacritics seems to be greater than is typical. It's possible Tesseract would need training to recognize this unusual font.
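
To check whether the problem reproduces in Tesseract alone before reporting it, the retained page images can be fed to the `tesseract` command directly. The sketch below is only an illustration: `tesseract_commands` is a hypothetical helper, and the `*.png` pattern is a guess at how the page images are named inside the temporary folder, so adjust it to whatever the folder actually contains. It only builds the command lines; `stdout` as the output base asks Tesseract to print the recognised text rather than write a file.

```python
from pathlib import Path


def tesseract_commands(folder: str, lang: str = "eng+hun",
                       pattern: str = "*.png") -> list[list[str]]:
    """Build one `tesseract IMAGE stdout -l LANG` command per page image."""
    # Search the folder recursively; the file pattern is an assumption.
    images = sorted(Path(folder).rglob(pattern))
    return [["tesseract", str(image), "stdout", "-l", lang] for image in images]
```

If the diacritics are still missing when a command is run this way, the image and the exact command make a self-contained report for the Tesseract project.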

Just possibly, using `ocrmypdf --oversample 600` may improve results. This causes OCRmyPDF to generate higher-resolution images, which can help with diacritic detection.

@arsinclair
Author

Thank you, will try to raise it there.

@arsinclair
Author

> `ocrmypdf --oversample 600`

This somehow makes it worse: many diacritics that were recognised before are no longer recognised.

This issue was closed.