[BUG] Pathological output: PDF expands to 50x size after half an hour of processing #1079

gwern · 2023-03-10T17:38:30Z

While checking my automatically-ocrmypdf-compressed mirrors of PDFs, I discovered a pathological instance: a 2MB PDF which, after being run through --skip-text --optimize 3 --jbig2-lossy, came out ~50× larger (~100MB); ocrmypdf also takes over half an hour to run. Disabling PDF/A output, as the warning suggests, fixes the issue in the sense that ocrmypdf runs quickly & yields a smaller PDF like normal.

The culprit is an Arxiv paper. The responsible parts seem to be 2 graphs, Figures 6 and 8 on page 7, which I notice visibly render dot by dot in pdfjs in my browser. When ocrmypdf begins the PDF/A conversion, it almost freezes when it hits the middle around page 7 and pegs the CPU at 100% as Ghostscript runs (ps: gs -dBATCH -dNOPAUSE -dSAFER -dCompatibilityLevel=1.6 -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dJPEGQ=95 -dPDFA=2 -dPDFACompatibilityPolicy=1 -o - -sstdout=%stderr /tmp/ocrmypdf.io.0x2nifu9/fix_docinfo.pdf /tmp/ocrmypdf.io.0x2nifu9/pdfa.ps), and the gs intermediates in /tmp/ are very large, like 84MB partway through. So I guess there is some issue with complex graphs made of lots of little datapoints in a scatterplot?

In case the 106MB PDF is useful, I've uploaded it at https://mega.nz/file/Ka4x0QIL#sbOo6Z9Vn04Cx6OhMfOAD6aBfhTomD1sqhqYobUL954.

$ ocrmypdf --version
11.7.3
$ ghostscript --version
9.50
$ wget https://arxiv.org/pdf/2202.05798.pdf
$ ocrmypdf --skip-text --optimize 3 --jbig2-lossy 2202.05798.pdf  2202.05798.pdf-large
Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:19<00:00,  1.06page/s]
Start processing 21 pages concurrently
    4 skipping all processing on this page
...
    8 skipping all processing on this page
OCR: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21.0/21.0 [00:00<00:00, 134.78page/s]
Postprocessing...
PDF/A conversion:  43%|███████████████████████████████████████████████▏                                                              | 9/21 [01:37<02:54, 14.54s/page]alert
PDF/A conversion:  62%|███████████████████████████████████████████████████████████████████▍                                         | 13/21 [04:31<05:50, 43.76s/page]
PDF/A conversion:  86%|████████████████████████████████████████████████████████████████████████████████████████████▌               | 18/21 [17:07<08:03, 161.05s/page]
PDF/A conversion:  95%|██████████████████████████████████████████████████████████████████████████████████████████████████████▊     | 20/21 [33:48<05:37, 337.92s/page]
PNGs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:01<00:00,  1.54image/s]
...
The output file size is 43.24× larger than the input file.
Possible reasons for this include:
PDF/A conversion was enabled. (Try `--output-type pdf`.)
$ ocrmypdf --skip-text --optimize 3 --jbig2-lossy --output-type pdf 2202.05798.pdf  2202.05798.pdf-pdfonly ; alert; duh *
Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:19<00:00,  1.06page/s]
Start processing 21 pages concurrently
...
OCR: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21.0/21.0 [00:00<00:00, 129.36page/s]
Postprocessing...
JPEGs: 0image [00:00, ?image/s]
PNGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 0.99 savings: -1.0%
Image optimization did not improve the file - discarded
$ du -h 2202.05798.pdf*
2.5M	2202.05798.pdf
106M	2202.05798.pdf-large
1.9M	2202.05798.pdf-pdfonly

Given the warning and expected small size when disabling PDF/A output, is this a known, possibly unfixable, issue? If it isn't fixable, it might be worthwhile to make the warning a little more informative, so people know that this is a known issue, that disabling PDF/A output is the recommended fix, and that there's no need to investigate further or file a bug report.
(Something like "NOTE: PDF used feature XYZ, which is known to result in unavoidable huge filesize blowups when encoded into a valid PDF/A file! Either accept the increased filesize, edit the original PDF to remove all XYZ, or avoid the default PDF/A encoding by using the option --output-type pdf. For more details, see <ocrmypdf.com/foobar>.")

jbarlow83 · 2023-03-11T20:22:13Z

This is probably due to a high-DPI image triggering high-DPI rendering for the whole page.
Unfortunately I'm too busy with other work to dig into that issue, which looks easy to fix at a glance but has significant complications and implications for users who actually need high DPI for things (e.g. microscope images in PDFs).
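
To illustrate why whole-page high-DPI rendering would be costly, here is a rough arithmetic sketch. It assumes a US Letter page (8.5 × 11 in) and a handful of example DPI values; neither is taken from this PDF or from ocrmypdf's code, and `raster_megapixels` is a hypothetical helper name.

```python
def raster_megapixels(width_in, height_in, dpi):
    """Millions of pixels needed to rasterize a page of the given size at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px / 1e6


# Assumed page size: US Letter. The DPI values are illustrative only.
for dpi in (300, 600, 1200, 2400):
    print(f"{dpi:>5} dpi: {raster_megapixels(8.5, 11, dpi):8.2f} MP")
```

The pixel count grows with the square of the DPI, so if one image forces the whole page to a much higher resolution, the work per page rises steeply even though most of the page does not need it.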

jbarlow83 · 2023-03-18T19:39:51Z

Likely same as #1004

gwern · 2023-03-18T23:53:52Z

That looks similar, but there's at least one difference here from his bug report: I'm not using --force-ocr (and I wouldn't need to with Arxiv PDFs because they are all born-digital as TeX-generated papers). Further, if as suggested I try adding --oversample 300 to the testcase above, it does not seem fixed; ocrmypdf still takes a good 45 minutes to process and yields a 2202.05798.pdf-large which is 106MB, same as before.

jbarlow83 · 2023-09-26T19:29:51Z

Should be fixed in v15

gwern added a commit to gwern/gwern.net that referenced this issue Mar 11, 2023

lA: call compressPdf to guard against filesize increases, which can be quite pathological (ocrmypdf/OCRmyPDF#1079)

a50f0e0
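
A guard of the kind the commit message describes could look roughly like the following Python sketch. This is not the gwern.net `compressPdf` code, which is not shown in this thread; `compress_pdf`, `ocrmypdf_cmd`, `build_cmd`, and `timeout_s` are hypothetical names, and the 600-second default is an arbitrary assumption. The ocrmypdf flags are the ones from the report above plus the `--output-type pdf` workaround.

```python
import os
import shutil
import subprocess
import tempfile


def ocrmypdf_cmd(src, dst):
    # Flags copied from the report above, with PDF/A output disabled.
    return ["ocrmypdf", "--skip-text", "--optimize", "3", "--jbig2-lossy",
            "--output-type", "pdf", src, dst]


def compress_pdf(path, build_cmd=ocrmypdf_cmd, timeout_s=600):
    """Try to compress `path` in place; keep the original unless the result is smaller.

    Returns True if the file was replaced, False otherwise.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        candidate = os.path.join(tmpdir, "candidate.pdf")
        try:
            # Bound the runtime so a pathological input cannot stall a batch job.
            subprocess.run(build_cmd(path, candidate), check=True, timeout=timeout_s)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
            return False  # tool failed, timed out, or is not installed
        if not os.path.exists(candidate):
            return False
        # Only accept a non-empty output that is strictly smaller than the input.
        if 0 < os.path.getsize(candidate) < os.path.getsize(path):
            shutil.move(candidate, path)
            return True
        return False
```

Passing the command builder as a parameter keeps the size check independent of ocrmypdf itself, so the same guard can wrap any other compressor or be exercised with a stand-in command.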

jbarlow83 closed this as completed Sep 26, 2023
