[BUG] 'DecompressionBombError' on an ACM PDF - need resolution limit on high DPI #1104
Comments
@stumpylog Yes, v11 is behind, but are you saying that the latest version handled this PDF successfully? Otherwise I don't understand why this bug has been closed as 'not planned' (what does 'not planned' mean?). It seems like a valid PDF that ocrmypdf ought to be able to handle, OCR, and return as a new PDF of around the same size.
There's an argument that allows you to configure the limit you're running into, so there's an existing solution.
Sure, but the existence of a workaround doesn't mean ocrmypdf (or Ghostscript) is correct. If I lift the pixel limit to handle the 471,055,716 pixels the error message reports, perhaps ocrmypdf can at least finish without erroring; but isn't that a really weird and unusual number of pixels (which is why the limit is there in the first place)? It means there is somehow an image of, presumably, about 21,703x21,703px (the square root of 471,055,716) inside the PDF, right? That's an absurdly huge image when most papers would be 10x smaller on each dimension, at most. There shouldn't be any such thing in a scan of a simple, short, old math paper. So something is wrong somewhere, and 'just raise the limit' doesn't really resolve the bug. It only treats a symptom, without diagnosing whether there is even a disease to be concerned about. Is ocrmypdf wrong? Is Ghostscript wrong? Is the ACM wrong and generating screwed-up PDFs? If it's ocrmypdf, then maybe it needs to fix something; if it's Ghostscript, then I need to upgrade (the newer Ghostscripts include a complete rewrite which obviates a number of earlier bugs, including one I reported a few weeks ago) and see whether to report upstream; if it's the ACM, then I need to send them a complaint (just so I can say I tried) and add the workaround to my system (because it is unlikely they will fix their entire corpus, much less all the copies floating around).
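As a quick check of the arithmetic in that comment, here is a minimal sketch. The only input is the pixel count quoted from the error message; treating the image as square is the commenter's simplifying assumption, not something the error reports.

```python
import math

# Pixel count quoted from the error message in the comment above.
pixels = 471_055_716

# Side length of a square image with (at most) that many pixels.
# math.isqrt returns the floor of the exact square root.
side = math.isqrt(pixels)

print(f"{pixels:,} px is roughly a {side:,} x {side:,} px square")
```

The count is not a perfect square, so the real raster is not exactly square; the figure only conveys the scale involved.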
On page 1 of the supplied PDF, there is a small image with dimensions ~1003x1004 that renders at 2770x2770 pixels per inch (ppi/dpi). (It's that "check for updates" icon.) When OCRmyPDF needs to rasterize a PDF page to an image, it does so at the highest ppi it finds on that page, as this ensures nothing is lost. So the entire first page is promoted to a huge image because of this little icon. While typical scanned documents range from 72 to 600 dpi, there are practical imaging scenarios with resolutions well above this, in particular when scanning film or micrographs. Similarly, when scanning large maps or blueprints at high resolution, it's easy to produce images more than 10k pixels on a side. (I work with both at times.) I will see about setting an upper limit on DPI and giving users the opportunity to opt in to high resolution if they know they need it -- something along those lines.
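The effect described above can be sketched in a few lines. This is an illustration only, not OCRmyPDF's actual code: the function names, the `max_dpi` cap, the 600 dpi value, and the US Letter page size are all assumptions made for the example (the real page dimensions of this PDF are not stated in the thread).

```python
def page_raster_dpi(image_ppis, max_dpi=None, fallback=300.0):
    """Pick the rasterization dpi for a page (hypothetical helper).

    Uses the highest ppi among the page's images, as described above,
    optionally clamped to a hypothetical upper limit `max_dpi`.
    """
    dpi = max(image_ppis, default=fallback)
    if max_dpi is not None:
        dpi = min(dpi, max_dpi)
    return dpi


def raster_size(page_w_in, page_h_in, dpi):
    """Pixel dimensions of a page rasterized at `dpi`."""
    return round(page_w_in * dpi), round(page_h_in * dpi)


# A page with an ordinary 300 ppi scan plus one tiny 2770 ppi icon.
ppis = [300.0, 2770.0]

# Assumed page size for illustration: US Letter, 8.5 x 11 inches.
w, h = raster_size(8.5, 11, page_raster_dpi(ppis))
print(w, h, w * h)  # 23545 30470 717416150

# Same page with a hypothetical 600 dpi cap applied.
w_cap, h_cap = raster_size(8.5, 11, page_raster_dpi(ppis, max_dpi=600.0))
print(w_cap, h_cap, w_cap * h_cap)  # 5100 6600 33660000
```

A roughly 1003-pixel-wide icon at 2770 ppi occupies only about 0.36 inches on the page, yet in this sketch it alone pushes the whole-page raster from tens of millions of pixels to hundreds of millions.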
So the upper limit on the rasterized image of page 1 would apply only to the temporary rasterized image fed to the OCR utility and wouldn't change the final PDF (unless one was forcing rasterization)?
Should be fixed in v15.
The ACM's PDF of Hamming 1959 fails on any ocrmypdf v11.7.3 command I try with a message about pixels and decompression bomb problems.
320954.320958.pdf