OCR-Generated Text Layers Not Readable by PDF Readers for RTL Languages Like Persian #1157

PSEUDO-SAPPHO · 2023-09-24T12:21:23Z

What were you trying to do?

I used ocrmypdf to run OCR on a PDF document, but I'm encountering an issue with RTL (right-to-left) languages like Persian. Although OCR completes successfully, the text in the resulting PDF is not selectable or searchable in Foxit Reader or other popular PDF viewers.

In Foxit Reader, the OCR-generated text was not RTL. In Zotero's PDF reader, the words are separated. However, when I tested the same PDF in Chrome and Edge I didn't encounter these issues: OCR works and the text output from "ocrmypdf" is available.

Where are you installing from?

Windows package manager (chocolatey, etc.)

What operating system are you working on?

Windows

Relevant log output

No response

jbarlow83 · 2023-09-24T20:05:26Z

Please provide an example file, the command you're using, and the versions you're using.

medmedin2014 · 2023-10-14T11:49:11Z

@jbarlow83 @PSEUDO-SAPPHO

ocrmypdf: 15.1.0
Operating System: Manjaro Linux 
KDE Plasma Version: 5.27.8
KDE Frameworks Version: 5.110.0
Qt Version: 5.15.11
Kernel Version: 6.5.7-2-MANJARO (64-bit)
Graphics Platform: Wayland

I can confirm the bug with Arabic: the output PDF contains reversed text.

Source file:
تقديم.pdf

Command:
ocrmypdf -l ara -f تقديم.pdf out-تقديم.pdf

Output:
out-تقديم.pdf

If you try to copy some text from the output pdf you will get Arabic letters copied in reverse order:

If you copy:

You get:
يساردلا لشفلا ةلأسم تتاب

Instead of:
باتت مسألة الفشل الدراسي
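
For anyone trying to reproduce this, a minimal Python sketch of how one might check whether copied text is simply the expected text with its characters in reverse order, which is what the sample above appears to show. The function name `is_fully_reversed` is hypothetical and is not part of ocrmypdf or Tesseract; the strings are the ones quoted in this comment.

```python
import unicodedata


def is_fully_reversed(extracted: str, reference: str) -> bool:
    """Return True if `extracted` is `reference` with its characters reversed.

    Hypothetical helper for illustration; not part of ocrmypdf or Tesseract.
    """
    # Normalize so that precomposed and decomposed forms compare equal.
    extracted = unicodedata.normalize("NFC", extracted.strip())
    reference = unicodedata.normalize("NFC", reference.strip())
    # Require a real difference so identical strings are not reported.
    return extracted != reference and extracted[::-1] == reference


# Text as it should read (logical order) and as it was copied from the PDF.
expected = "باتت مسألة الفشل الدراسي"
copied = "يساردلا لشفلا ةلأسم تتاب"

print(is_fully_reversed(copied, expected))
```

If the check reports a full reversal, the text layer likely stores the characters in visual (left-to-right) order rather than logical order.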

jbarlow83 · 2023-10-20T09:22:17Z

Unfortunately, this is an open issue in Tesseract PDF generation.
tesseract-ocr/tesseract#238
Other RTL languages might be affected too (Hebrew).
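
To judge whether a given document is likely to be affected, one option is to check the recognized text for characters with a strong right-to-left bidirectional class. The sketch below uses only the Python standard library; `contains_rtl` is a hypothetical helper, not an ocrmypdf function.

```python
import unicodedata

# Strong right-to-left bidirectional classes in the Unicode Character Database:
# "R" covers scripts such as Hebrew, "AL" covers Arabic-script letters
# (which includes Persian).
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """Return True if any character in `text` has a strong RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


print(contains_rtl("שלום"))
print(contains_rtl("plain ASCII 123"))
```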

Fixes #1009, #1191, #1157

jbarlow83 · 2023-12-10T23:47:17Z

Fixed in v16

AhmadHakami · 2024-01-06T18:23:41Z

@jbarlow83: Fixed in v16

This problem has not been solved yet, even with the updated version.

tesseract v5.3.1
ocrmypdf 16.0.3

Reference:
وبعد الاطلاع علی الترتیبات التنظیمیة للمؤسسة
Searchable pdf:
دعبو عالطالا یلع تابیترتلا ةیمیظنتلا ةسسؤملل
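
Note that this sample differs from the earlier one in this thread: the words appear in the correct order, but the letters inside each word are reversed. A minimal Python sketch of how to detect that pattern, and undo it for comparison purposes, is below. Both function names are hypothetical, and the repair is only a diagnostic aid, not a fix for the PDF itself.

```python
def is_reversed_per_word(extracted: str, reference: str) -> bool:
    """Return True if word order matches but each word's letters are reversed."""
    ext_words = extracted.split()
    ref_words = reference.split()
    return (
        len(ext_words) == len(ref_words)
        and ext_words != ref_words
        and all(e[::-1] == r for e, r in zip(ext_words, ref_words))
    )


def repair_per_word(extracted: str) -> str:
    """Reverse the letters of each whitespace-separated word, keeping word order."""
    return " ".join(word[::-1] for word in extracted.split())


# Strings quoted in this comment: the reference text and the text layer output.
reference = "وبعد الاطلاع علی الترتیبات التنظیمیة للمؤسسة"
searchable = "دعبو عالطالا یلع تابیترتلا ةیمیظنتلا ةسسؤملل"

print(is_reversed_per_word(searchable, reference))
print(repair_per_word(searchable))
```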

jbarlow83 · 2024-01-07T01:44:07Z

To confirm I'm not insane, the English translation of the first line should be something like
"The issue of academic failure has become a matter of concern to parents, teachers, and public opinion alike over the decades..."

I did some experiments. It's difficult because many programs handle RTL poorly, so it's hard to tell what is working in the first place.

AhmadHakami · 2024-02-12T01:25:56Z

Hi @jbarlow83
any updates?

jbarlow83 · 2024-02-12T02:19:46Z

Both Tesseract and OCRmyPDF use the Glyphless font approach to RTL. Glyphless is a font where every glyph is mapped to a non-printing character. I've come to believe that this approach won't work for RTL languages across all PDF viewers: it barely works for Tesseract, and techniques that improve rendering for LTR languages over the Tesseract baseline don't work for RTL.

There are at least three ways to create RTL text and some viewers don't support some methods well.

At the very least, I believe I need to add a new character to the Glyphless font, which would be the blank RTL character. That would allow RTL fonts to be inserted in an approach that is closer to how RTL fonts are typically rendered, as far as I know anyway.

It would probably also help to have a blank double-width character for CJK characters, and maybe something for vertical CJK.

Alternatively, it looks like Noto Sans has become a universal open source font and I could look into embedding it everywhere.

PSEUDO-SAPPHO added the bug label Sep 24, 2023

PSEUDO-SAPPHO assigned jbarlow83 Sep 24, 2023

PSEUDO-SAPPHO changed the title ~~OCR-Generated Text Layers Not Readable by PDF Readers~~ OCR-Generated Text Layers Not Readable by PDF Readers for RTL Languages Like Persian Sep 24, 2023

jbarlow83 added need test file and removed bug labels Sep 24, 2023

jbarlow83 added third party issue Problem with a third party dependency and removed need test file labels Oct 20, 2023

jbarlow83 added a commit that referenced this issue Dec 3, 2023

v16.0.0rc1 release notes

39eee05

Fixes #1009, #1191, #1157

jbarlow83 closed this as completed Dec 10, 2023

jbarlow83 reopened this Jan 7, 2024
