[BUG] Pathological output: PDF expands to 50x size after half an hour of processing #1079
Comments
…e quite pathological (ocrmypdf/OCRmyPDF#1079)
This is probably due to an image having high DPI trigger high DPI rendering for the whole page.
Likely same as #1004
That looks similar, but there's at least one difference here from his bug report: I'm not using
Should be fixed in v15
While checking my automatically ocrmypdf-compressed mirrors of PDFs, I discovered a pathological instance: a 2MB PDF which had been run through `--skip-text --optimize 3 --jbig2-lossy` and came out 50× larger (~100MB); ocrmypdf also takes over half an hour to run. Disabling PDF/A output per the warning fixes the issue in the sense of running quickly & yielding a smaller PDF like normal.

The culprit is an Arxiv paper. The responsible parts seem to be 2 graphs, figure 6/8 on page 7, which I notice visibly render dot by dot in pdfjs in my browser. When ocrmypdf begins the PDF/A conversion, it almost freezes when it hits the middle around page 7 and pegs the CPU at 100% as Ghostscript runs (`ps`: `gs -dBATCH -dNOPAUSE -dSAFER -dCompatibilityLevel=1.6 -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dJPEGQ=95 -dPDFA=2 -dPDFACompatibilityPolicy=1 -o - -sstdout=%stderr /tmp/ocrmypdf.io.0x2nifu9/fix_docinfo.pdf /tmp/ocrmypdf.io.0x2nifu9/pdfa.ps`), and the gs intermediates in /tmp/ are very large, like 84MB partway through. So I guess there is some issue with complex graphs made of lots of little datapoints in a scatterplot?

In case the 106MB PDF is useful, I've uploaded it at https://mega.nz/file/Ka4x0QIL#sbOo6Z9Vn04Cx6OhMfOAD6aBfhTomD1sqhqYobUL954.
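For anyone else auditing a batch of processed PDFs, the kind of size comparison that surfaced this instance might be sketched as below. This is only an illustration: the helper names and the 2× threshold are my own assumptions and are not part of ocrmypdf.

```python
from pathlib import Path


def size_ratio(original_bytes: int, output_bytes: int) -> float:
    """Return the output size as a multiple of the original size."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return output_bytes / original_bytes


def is_pathological(original_bytes: int, output_bytes: int,
                    max_ratio: float = 2.0) -> bool:
    """Flag outputs that grew by more than max_ratio (hypothetical threshold)."""
    return size_ratio(original_bytes, output_bytes) > max_ratio


def check_files(original: Path, output: Path, max_ratio: float = 2.0) -> bool:
    """Compare two files on disk using their byte sizes."""
    return is_pathological(original.stat().st_size,
                           output.stat().st_size,
                           max_ratio)
```

With the rough numbers from this report, `is_pathological(2_000_000, 100_000_000)` flags the file, since the output is about 50 times the original.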
Given the warning and expected small size when disabling PDF/A output, is this a known, possibly unfixable, issue? If it isn't fixable, it might be worthwhile to make the warning a little more informative so people know that this is a known issue and that's the recommended fix and there's no need to investigate further or file a bug report.
(Something like "NOTE: PDF used feature XYZ, which is known to result in unavoidable huge filesize blowups when encoded into a valid PDF/A file! Either accept the increased filesize, edit the original PDF to remove all XYZ, or avoid the default PDF/A encoding by using the option `--output-type pdf`. For more details, see <ocrmypdf.com/foobar>.")
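Until something like that exists, a wrapper script could apply the workaround automatically. The following is a rough sketch, not anything ocrmypdf provides: the function names, the 10-minute timeout, and the 2× size threshold are all assumptions, and only the flags quoted in this report are used.

```python
import subprocess
from pathlib import Path
from typing import List, Optional

# Flags taken from this report.
BASE_FLAGS = ["--skip-text", "--optimize", "3", "--jbig2-lossy"]


def build_command(src, dst, output_type: Optional[str] = None) -> List[str]:
    """Build the ocrmypdf command line, optionally forcing --output-type."""
    cmd = ["ocrmypdf", *BASE_FLAGS]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    cmd += [str(src), str(dst)]
    return cmd


def run_with_fallback(src: Path, dst: Path,
                      timeout_s: float = 600.0,
                      max_ratio: float = 2.0) -> None:
    """Try the default (PDF/A) run; retry with plain PDF output if the run
    times out or the result is much larger than the input.
    timeout_s and max_ratio are hypothetical defaults."""
    try:
        subprocess.run(build_command(src, dst), check=True, timeout=timeout_s)
        if dst.stat().st_size <= max_ratio * src.stat().st_size:
            return  # default output looks reasonable
    except subprocess.TimeoutExpired:
        pass  # fall through to the non-PDF/A retry
    subprocess.run(build_command(src, dst, "pdf"), check=True, timeout=timeout_s)
```

Calling `run_with_fallback(Path("paper.pdf"), Path("paper-ocr.pdf"))` would then only pay the cost of the slow path once per problematic file; the file names here are placeholders.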