OCR problem: "cannot write mode P as JPEG" exception #151
Comments
I couldn't reproduce it but added a likely fix anyway for 4.5.4. The error came up because page 4 is blank (possibly due to file corruption) and the logic for a blank PDF given `--force` was incomplete. To improve the OCR I suggest trying Tesseract 4 (alpha version) and consulting the documentation on recommended arguments with ocrmypdf for using Tess4 (
Thanks for the quick reply. I can confirm that the error messages no longer appear with version 4.5.4. As you said, the fix didn't change the OCR results, so I will have to experiment with Tesseract 4 and see whether it gives better results. Thanks again for your help!
You noticed a change in file size. Because I regularly run ocrmypdf on batches of >10k files, I watch any such reports closely. With This file looks like it would work without
Yes, sorry about that - I initially reported the increase in file size, but then realised that the old file had been OCRed without the
Another PDF file, which proves a bit difficult to OCR:
https://app.box.com/s/ffraogy4ayco5gc87t8kj406ww3o731v
Using
ocrmypdf -l por --force myfile.pdf myfile_ocr.pdf
it was possible to OCR the file. However, many pages are recognized very poorly (most sentences are missing, and the text that is recognized is completely wrong). The following exceptions are thrown:
I'm wondering whether the poor OCR quality for that file is due to the image quality of the document itself or whether it is related to the above exceptions.
I'm using the current ocrmypdf version 4.5.3.
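For anyone running the same command over many files, here is a minimal sketch of how it could be scripted from Python. It only wraps the invocation shown above: `-l por` and `--force` come from that command, while the helper names `build_command` and `ocr_folder` and the `_ocr` output suffix are assumptions made for illustration, not part of ocrmypdf.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, lang="por", force=True):
    """Return the argv list for one ocrmypdf run (hypothetical helper)."""
    cmd = ["ocrmypdf", "-l", lang]
    if force:
        cmd.append("--force")
    cmd += [str(src), str(dst)]
    return cmd


def ocr_folder(folder, lang="por", force=True):
    """Run ocrmypdf on every PDF in `folder`; return the inputs that failed."""
    failed = []
    for src in sorted(Path(folder).glob("*.pdf")):
        if src.stem.endswith("_ocr"):
            continue  # skip outputs written by an earlier run
        dst = src.with_name(src.stem + "_ocr.pdf")
        # A non-zero exit code is treated as a failure for this file.
        result = subprocess.run(build_command(src, dst, lang, force))
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Calling `ocr_folder("scans")` would process each PDF in a `scans` directory and return the list of inputs for which ocrmypdf exited with an error, which makes it easier to see which files trigger exceptions.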