[Bug]: OCR not complete. Parts of all pages are ignored #1323

0lm · 2024-06-01T14:04:34Z

Describe the bug

Hello.

First of all, I apologize if this is not the right place to write about this. To be honest, I am not sure where the issue lies: Tesseract itself or ocrmypdf.

I am using the newest versions of both Tesseract and ocrmypdf (tesseract-ocr-w64-setup-5.4.0.20240519-1-ga5ff320e.exe by UB Mannheim and ocrmypdf 16.3.1) on Python 3.9.19 on Windows. To avoid interference, I created a new virtual environment for ocrmypdf.

I will describe the issue in as much detail as possible. The PDF is a book scan. Every page is a 12 MP JPG image, and everything on each page is perfectly readable. The only flaws are that the pages are not flat but wavy (the usual shape book pages take when the book is opened), and that the upper part of each page is brighter because of how the light was shining on the book, while the lower part of each page is a bit darker.

I have read that Tesseract has issues with OCR on tables, but I can assure you that the ignored parts were not tables, just normal plain text. What confuses me most is that the parts that were OCR-ed were OCR-ed perfectly. All diacritics were perfectly searchable. Even words that were split at the end of a line and continued on the next (indicated with a "-" hyphen) were perfectly searchable.

Basically, the OCR was very good, which makes it even more confusing that certain parts were simply ignored. Has anyone had similar issues? How could I solve this? Unfortunately, I can't send an example, because the PDF is private.

Steps to reproduce

1. Run ocrmypdf -v1 -l deu input.pdf output.pdf
2. Open output.pdf
3. Not everything got ocr-ed

Files

No response

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.3.1

Relevant log output

No response

jbarlow83 · 2024-06-01T21:07:25Z

I am really tempted to make including a file mandatory. Without one I'm just speculating.

For an inconsistent background, try Sauvola thresholding, which is an option listed in --help. Wavy text baselines are not something thresholding can necessarily solve, though. There is book scanning software like ScanTailor that you can use to flatten the text so it OCRs more accurately.
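As a rough illustration of the thresholding suggestion above, here is a minimal Python sketch that assembles the command from the report with Sauvola thresholding switched on. It assumes the option is spelled `--tesseract-thresholding sauvola`, which is worth confirming against `ocrmypdf --help` for the installed version. The helper names `build_ocr_command` and `run_ocr` are made up for this example and are not part of OCRmyPDF.

```python
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf, language="deu", thresholding="sauvola"):
    """Return the argument list for one ocrmypdf run (hypothetical helper)."""
    return [
        "ocrmypdf",
        "-v", "1",                # same verbosity level as in the report
        "-l", language,           # OCR language, "deu" in the report
        # Assumption: this is the thresholding option shown in --help.
        "--tesseract-thresholding", thresholding,
        input_pdf,
        output_pdf,
    ]


def run_ocr(input_pdf, output_pdf, **options):
    """Run ocrmypdf with the assembled arguments, if it is on PATH."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    cmd = build_ocr_command(input_pdf, output_pdf, **options)
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(cmd, check=True)
```

Calling `run_ocr("input.pdf", "output.pdf")` would then repeat the original run with the different thresholding method. This only targets the uneven lighting; it does nothing about the wavy page surface.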

0lm added the bug label Jun 1, 2024

0lm assigned jbarlow83 Jun 1, 2024

0lm changed the title ~~[Bug]: OCR not complete. PArts of all pages are ignored~~ [Bug]: OCR not complete. Parts of all pages are ignored Jun 1, 2024

jbarlow83 closed this as not planned Jun 1, 2024
