A set of failing PDFs #325
Comments
I will take a look. Do you know which version of ocrmypdf you used? The stack traces appear to be from an older version.
Whatever Ubuntu 18.04.1 ships, which appears to be '6.1.2-1ubuntu1.1' or '6.1.2' from
Please try the latest released version. There is an installation procedure in the documentation specifically for Ubuntu 18.04. I suspect that will fix many of these errors.
The problem is quite definitely how these files are formatted. In any case, the next release should be more tolerant of PDFs with these types of errors: it will issue warnings instead. I went by the logs and concluded that the errors are the same for the most part.
That's good to hear. I hope they'll be good test cases for the next release, then.
I found another error. Unfortunately, I cannot upload the PDF file because it contains personal data, and I do not know how to reproduce the error with a handcrafted PDF file. It seems to be a problem with the internal structure of the PDF file. This is the stack trace of the error:
The wiki has instructions for encrypting a file so that only I can read it, if you are comfortable with that.
I am afraid I cannot do that, sorry. The document itself pertains to a third-party organization, and the personal info is not mine. I can check why it fails with pdb if it helps. Thanks
This is probably fixed, or at least the immediate cause of the stack trace is suppressed, in the next release.
I recently used ocrmypdf to mass-OCR my PDFs and a bunch of DjVu files I converted to PDF (which strips the original Tesseract OCR so I needed some way to restore it). Worked very nicely, and I like the better compression over the default ddjvu output.
Some files failed. I noticed the mention of a test corpus, so I thought you might like a list of failing files (these failed multiple times, so should be reliable test cases) and the errors.
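For anyone who wants to build a similar list of failing files, here is a minimal sketch of one way to do it. It is not the script used for this report. It assumes the ocrmypdf command is on the PATH and is called in its basic form (input file, then output file), and it treats any non-zero exit code as a failure. The helper names ocr_one and collect_failures are made up for illustration.

```python
import subprocess
from pathlib import Path


def ocr_one(src: Path, dst: Path) -> subprocess.CompletedProcess:
    """Run the ocrmypdf CLI on one file (assumes `ocrmypdf` is on PATH)."""
    return subprocess.run(
        ["ocrmypdf", str(src), str(dst)],
        capture_output=True,  # keep stderr so the error text can be reported
        text=True,
    )


def collect_failures(pdfs, out_dir, run=ocr_one):
    """Return (path, exit code, stderr) for every file whose run failed.

    `run` is injectable so the loop can be exercised without ocrmypdf installed.
    """
    failures = []
    for src in pdfs:
        src = Path(src)
        result = run(src, Path(out_dir) / src.name)
        if result.returncode != 0:  # assumption: non-zero exit means failure
            failures.append((src, result.returncode, result.stderr.strip()))
    return failures
```

Calling collect_failures(sorted(Path("pdfs").glob("*.pdf")), "ocr-out") would process every PDF in a hypothetical pdfs directory (the ocr-out directory has to exist first), and running it more than once helps confirm that a failure is reproducible.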
The errors:
myocr-gwernnet-errors.txt
The files: