[Bug]: doesn't always parse Latin with diacritics #1344

arsinclair · 2024-06-28T10:15:39Z

Describe the bug

When OCR'ing English (Latin-script) text that contains diacritics, OCRmyPDF doesn't always recognise them. The diacritics in my document occur in surnames of Hungarian and Belgian origin.

I've tried English alone, the English + Hungarian dictionaries, and the Latin script model (which has an extended character map), all to no avail.

The words: poéme, pathétique, animé are recognised.

The words: Ysaÿe, Jenő, Petőfi, etc. are not recognised.

The words csárdás, Telmányi, Dvořák are recognised only with Latin script.

Steps to reproduce

Combinations tried:

1. `ocrmypdf -l eng Booklet.pdf Booklet_ocr.pdf`
2. `ocrmypdf -l eng+hun Booklet.pdf Booklet_ocr.pdf`
3. `ocrmypdf -l script/Latin Booklet.pdf Booklet_ocr.pdf`
4. `ocrmypdf -l Latin Booklet.pdf Booklet_ocr.pdf`
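
The four combinations above can be compared more systematically by writing the recognised text to a sidecar file and searching it for the problem words. The following is a minimal sketch, not part of the original report: the function names `check_words` and `try_languages` and the `ocr_*` output file names are hypothetical, and it assumes `ocrmypdf` is on the PATH with the relevant language packs installed.

```shell
# Sketch only: report which expected words appear in an OCR text file.
# Usage: check_words FILE WORD...
check_words() {
    file=$1
    shift
    for word in "$@"; do
        # -F: fixed string, -q: quiet. The match is byte-exact, so the text
        # and the word must use the same Unicode normalization form.
        if grep -qF -- "$word" "$file"; then
            echo "found: $word"
        else
            echo "missing: $word"
        fi
    done
}

# Run each language combination and check the sidecar text.
# Usage: try_languages INPUT.pdf WORD...
try_languages() {
    input=$1
    shift
    for lang in eng eng+hun script/Latin Latin; do
        # Build a file-name-safe tag, e.g. "eng+hun" becomes "eng_hun".
        tag=$(printf '%s' "$lang" | tr '/+' '__')
        # --sidecar writes the recognised text to a plain-text file.
        ocrmypdf -l "$lang" --sidecar "ocr_$tag.txt" "$input" "ocr_$tag.pdf" || continue
        echo "== $lang =="
        check_words "ocr_$tag.txt" "$@"
    done
}
```

For example, `try_languages Booklet.pdf Ysaÿe Jenő Petőfi csárdás` would print a found/missing line per word for each language setting that ran successfully.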

Files

Source file: Booklet.pdf

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

16.3.1+dfsg1

Relevant log output

No response

jbarlow83 · 2024-06-30T06:39:42Z

OCRmyPDF mainly transcribes the OCR output from Tesseract to a PDF. It does not handle OCR itself so it cannot usually improve issues like missed diacritics.

If you run `ocrmypdf --keep-temporary-files ...`, a folder will be produced containing the page images that were sent to OCR. Your best bet is to take these images and report them to https://github.com/tesseract-ocr/tesseract/ as missed recognition. The font in the sample is a little unusual, and the spacing between letters and their diacritics seems to be greater than is typical. It's possible Tesseract would need training to recognize this unusual font.
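
To confirm that the misses come from Tesseract rather than OCRmyPDF before reporting, the kept page images can be fed to the `tesseract` command directly. This is a rough sketch under assumptions: the function name `ocr_images` and the `TESSERACT` override variable are hypothetical, and the names and formats of the files in the temporary folder may differ from the `*.png` and `*.jpg` patterns used here.

```shell
# Sketch only: run Tesseract directly on page images kept by
# `ocrmypdf --keep-temporary-files`, bypassing OCRmyPDF.
# Usage: ocr_images DIR LANG
# TESSERACT may be set to override the binary (hypothetical convenience).
ocr_images() {
    dir=$1
    lang=$2
    for img in "$dir"/*.png "$dir"/*.jpg; do
        [ -e "$img" ] || continue   # skip glob patterns that matched nothing
        echo "== $img =="
        # "stdout" as the output base makes tesseract print the text.
        "${TESSERACT:-tesseract}" "$img" stdout -l "$lang"
    done
}
```

Calling `ocr_images /path/to/kept/folder eng+hun` prints the recognised text for each image, which can then be attached to a Tesseract report together with the image itself.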

Just possibly, using `ocrmypdf --oversample 600` may improve results. This causes OCRmyPDF to generate higher resolution images, which can help with diacritic detection.

arsinclair · 2024-06-30T19:06:05Z

Thank you, will try to raise it there.

arsinclair · 2024-06-30T19:06:37Z

`ocrmypdf --oversample 600`

Somehow this makes it worse: many diacritics that were recognised before are no longer recognised.
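
A regression like this can be made concrete by listing the words that were present in the baseline text but are gone from the oversampled one. The sketch below is illustrative only: `regressed_words` is a hypothetical helper, and the two text files are assumed to have been produced with the `--sidecar` option, as shown in the comments.

```shell
# Sketch only: print words found in BASELINE but missing from CANDIDATE.
# The two text files could be produced with, for example:
#   ocrmypdf -l eng --sidecar base.txt Booklet.pdf base.pdf
#   ocrmypdf -l eng --oversample 600 --sidecar over.txt Booklet.pdf over.pdf
# Usage: regressed_words BASELINE.txt CANDIDATE.txt WORD...
regressed_words() {
    base=$1
    cand=$2
    shift 2
    for word in "$@"; do
        # Byte-exact fixed-string match in both files.
        if grep -qF -- "$word" "$base" && ! grep -qF -- "$word" "$cand"; then
            echo "$word"
        fi
    done
}
```

For instance, `regressed_words base.txt over.txt poéme pathétique animé` would print only the words that the higher resolution run lost.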

arsinclair added the triage label (Issue needs triage) Jun 28, 2024

arsinclair assigned jbarlow83 Jun 28, 2024

jbarlow83 added the third party issue label (Problem with a third party dependency) and removed the triage label (Issue needs triage) Jun 30, 2024

jbarlow83 closed this as completed Jun 30, 2024

arsinclair mentioned this issue Jun 30, 2024

Tesseract doesn't always recognise diacritics tesseract-ocr/tesseract#4276

Open
