First of all, I apologize if this is not the right place to write about this. To be honest, I am not sure where the issue lies: in Tesseract itself or in ocrmypdf.
I am using the newest versions of both Tesseract and ocrmypdf (tesseract-ocr-w64-setup-5.4.0.20240519-1-ga5ff320e.exe by UB Mannheim and ocrmypdf 16.3.1) on Python 3.9.19 on Windows. To avoid interference, I created a new virtual environment for ocrmypdf.
I will now describe the issue in as much detail as possible: The PDF is a book scan. Every page is a 12 MP JPG image. Everything on each page is perfectly readable. The only flaws are: the pages are not flat but wavy (the usual shape book pages take when the book is opened), and the upper part of each page is brighter because of how the light fell on the book, while the lower part of each page is a bit darker.
I have already read that Tesseract has issues doing OCR on tables. But I can assure you, the ignored parts were not tables, just plain text. What confuses me most is: the parts that were properly OCR-ed were OCR-ed perfectly. All diacritics were perfectly searchable. Even words that were split at the end of a line and continued on the next (indicated with a "-" hyphen) were perfectly searchable.
Basically, the OCR was so good, which makes it even more confusing why certain parts were just ignored. Has anyone had similar issues? How could I solve this? Unfortunately, I can't send an example because the PDF is private.
Steps to reproduce
1. Run ocrmypdf -v1 -l deu input.pdf output.pdf
2. Open output.pdf
3. Not everything got ocr-ed
Files
No response
How did you download and install the software?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
16.3.1
Relevant log output
No response
I am really tempted to make including a file mandatory. Without one I'm just speculating.
For an inconsistent background, try using Sauvola thresholding, which is an option listed in --help. Thresholding will not necessarily fix wavy text baselines, though. There is book scanning software like ScanTailor that you can use to flatten the text so it OCRs more accurately.