[Bug]: OCR not complete. Parts of all pages are ignored #1323

Closed
0lm opened this issue Jun 1, 2024 · 1 comment

0lm commented Jun 1, 2024

Describe the bug

Hello.

First of all, I apologize if this is not the right place to write about this. To be honest, I am not sure what's the issue: Tesseract itself or ocrmypdf.

I am using the newest versions of both Tesseract and ocrmypdf (tesseract-ocr-w64-setup-5.4.0.20240519-1-ga5ff320e.exe by UB Mannheim and ocrmypdf 16.3.1) on Python 3.9.19 on Windows. To avoid interference, I created a new virtual environment for ocrmypdf.

I will now describe the issue in as much detail as possible: The PDF is a book scan. All pages are JPG images, each 12 MP. Everything on each page is perfectly readable. The only flaws are: the pages are not flat but wavy (the usual wavy shape book pages get when the book is opened), and the upper part of each page is brighter because of how the light was shining on the book, while the lower part of each page is a bit darker.

I already read that Tesseract has issues doing OCR on tables. But I can assure you, those were not tables, just normal plain text. What confuses me most is: the parts that were properly OCR-ed were OCR-ed perfectly. All diacritics were perfectly searchable. Even words that were split at the end of a line and continued on the next (indicated with a "-" hyphen) were perfectly searchable.

Basically, the OCR was so good, which makes it even more confusing why certain parts were just ignored. Has anyone ever had similar issues? How could I solve this? Unfortunately, I can't send an example, because the PDF is private.

Steps to reproduce

1. Run ocrmypdf -v1 -l deu input.pdf output.pdf
2. Open output.pdf
3. Not everything got ocr-ed

Files

No response

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.3.1

Relevant log output

No response

0lm added the bug label Jun 1, 2024
0lm changed the title from "[Bug]: OCR not complete. PArts of all pages are ignored" to "[Bug]: OCR not complete. Parts of all pages are ignored" Jun 1, 2024
jbarlow83 (Collaborator) commented

I am really tempted to make including a file mandatory. Without one I'm just speculating.

For an inconsistent background, try Sauvola thresholding, which is an option listed in --help. Wavy text baselines are not something thresholding can necessarily solve. There is book-scanning software such as ScanTailor that you can use to flatten the text so it OCRs more accurately.
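
A minimal sketch of how that suggestion could be tried with the command from the report. It assumes the installed OCRmyPDF exposes the thresholding choice as `--tesseract-thresholding` with `sauvola` as an accepted value (please confirm against `ocrmypdf --help`). The helper name `build_ocrmypdf_command` is hypothetical and only assembles the argument list; it does not run anything.

```python
from typing import List


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    language: str = "deu",
    thresholding: str = "sauvola",
) -> List[str]:
    """Assemble an ocrmypdf argument list (hypothetical helper).

    Assumption: the --tesseract-thresholding option exists in the
    installed OCRmyPDF version and accepts the given method name.
    """
    return [
        "ocrmypdf",
        "-v", "1",                                # same verbosity as in the report
        "-l", language,                           # OCR language, "deu" in the report
        "--tesseract-thresholding", thresholding, # assumed option name, see --help
        input_pdf,
        output_pdf,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it.
    print(" ".join(build_ocrmypdf_command("input.pdf", "output.pdf")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the option has been confirmed. This only addresses the uneven lighting; it is not expected to help with the wavy baselines.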

jbarlow83 closed this as not planned Jun 1, 2024