
[BUG] Pathological output: PDF expands to 50x size after half an hour of processing #1079

Closed
gwern opened this issue Mar 10, 2023 · 4 comments

Comments

@gwern

gwern commented Mar 10, 2023

While checking my mirrors of PDFs, which I compress automatically with ocrmypdf, I discovered a pathological case: a 2MB PDF that ocrmypdf --skip-text --optimize 3 --jbig2-lossy had expanded to ~50× its size (~100MB). ocrmypdf also takes over half an hour to run on it. Disabling PDF/A output, as the warning suggests, fixes the issue in the sense that ocrmypdf then runs quickly and yields a smaller PDF, as normal.

The culprit is an Arxiv paper. The responsible parts seem to be 2 graphs, figures 6/8 on page 7, which visibly render dot by dot in pdfjs in my browser. When ocrmypdf begins the PDF/A conversion, it almost freezes when it reaches the middle of the document, around page 7, and pegs the CPU at 100% while Ghostscript runs. (ps shows: gs -dBATCH -dNOPAUSE -dSAFER -dCompatibilityLevel=1.6 -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=RGB -dAutoFilterColorImages=true -dAutoFilterGrayImages=true -dJPEGQ=95 -dPDFA=2 -dPDFACompatibilityPolicy=1 -o - -sstdout=%stderr /tmp/ocrmypdf.io.0x2nifu9/fix_docinfo.pdf /tmp/ocrmypdf.io.0x2nifu9/pdfa.ps; the gs intermediates in /tmp/ are very large, like 84MB partway through.) So I guess there is some issue with complex graphs made of lots of little datapoints in a scatterplot?
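One way to confirm which page stalls Ghostscript is to time the conversion page by page. The following is a minimal sketch, not how ocrmypdf itself drives Ghostscript: it reuses a subset of the flags from the ps output above and adds Ghostscript's -dFirstPage/-dLastPage to isolate one page at a time. The function names, the output path, and the choice to omit the pdfa.ps definition file are all assumptions made for illustration, so the timings may not match a real ocrmypdf run.

```python
import subprocess
import time
from typing import Callable, List, Tuple


def gs_page_command(src: str, out: str, page: int) -> List[str]:
    """Build a Ghostscript pdfwrite command restricted to a single page.

    Flags are a subset of those in the report; the pdfa.ps definition
    file that ocrmypdf passes is deliberately left out (assumption).
    """
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",
        "-dPDFA=2", "-dPDFACompatibilityPolicy=1",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        "-o", out, src,
    ]


def time_pages(
    src: str,
    out: str,
    n_pages: int,
    run: Callable = subprocess.run,
    clock: Callable[[], float] = time.perf_counter,
) -> List[Tuple[int, float]]:
    """Convert each page separately and return (page, seconds), slowest first."""
    timings = []
    for page in range(1, n_pages + 1):
        start = clock()
        run(gs_page_command(src, out, page), check=True)
        timings.append((page, clock() - start))
    return sorted(timings, key=lambda t: t[1], reverse=True)
```

For the file in this report, a call such as time_pages("2202.05798.pdf", "/tmp/page.pdf", 21) would list the slowest pages first. The run and clock parameters exist only so the helper can be exercised without Ghostscript installed.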

In case the 106MB PDF is useful, I've uploaded it at https://mega.nz/file/Ka4x0QIL#sbOo6Z9Vn04Cx6OhMfOAD6aBfhTomD1sqhqYobUL954.

$ ocrmypdf --version
11.7.3
$ ghostscript --version
9.50
$ wget https://arxiv.org/pdf/2202.05798.pdf
$ ocrmypdf --skip-text --optimize 3 --jbig2-lossy 2202.05798.pdf  2202.05798.pdf-large
Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:19<00:00,  1.06page/s]
Start processing 21 pages concurrently
    4 skipping all processing on this page
...
    8 skipping all processing on this page
OCR: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21.0/21.0 [00:00<00:00, 134.78page/s]
Postprocessing...
PDF/A conversion:  43%|███████████████████████████████████████████████▏                                                              | 9/21 [01:37<02:54, 14.54s/page]alert
PDF/A conversion:  62%|███████████████████████████████████████████████████████████████████▍                                         | 13/21 [04:31<05:50, 43.76s/page]
PDF/A conversion:  86%|████████████████████████████████████████████████████████████████████████████████████████████▌               | 18/21 [17:07<08:03, 161.05s/page]
PDF/A conversion:  95%|██████████████████████████████████████████████████████████████████████████████████████████████████████▊     | 20/21 [33:48<05:37, 337.92s/page]
PNGs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:01<00:00,  1.54image/s]
...
The output file size is 43.24× larger than the input file.
Possible reasons for this include:
PDF/A conversion was enabled. (Try `--output-type pdf`.)
$ ocrmypdf --skip-text --optimize 3 --jbig2-lossy --output-type pdf 2202.05798.pdf  2202.05798.pdf-pdfonly ; alert; duh *
Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:19<00:00,  1.06page/s]
Start processing 21 pages concurrently
...
OCR: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21.0/21.0 [00:00<00:00, 129.36page/s]
Postprocessing...
JPEGs: 0image [00:00, ?image/s]
PNGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 0.99 savings: -1.0%
Image optimization did not improve the file - discarded
$ du -h *.pdf
2.5M	2202.05798.pdf
106M	2202.05798.pdf-large
1.9M	2202.05798.pdf-pdfonly

Given the warning, and the expected small size when PDF/A output is disabled, is this a known, possibly unfixable, issue? If it isn't fixable, it might be worthwhile to make the warning more informative, so people know that this is a known issue, that disabling PDF/A output is the recommended fix, and that there's no need to investigate further or file a bug report.
(Something like "NOTE: PDF used feature XYZ, which is known to result in unavoidable huge filesize blowups when encoded into a valid PDF/A file! Either accept the increased filesize, edit the original PDF to remove all XYZ, or avoid the default PDF/A encoding by using the option --output-type pdf. For more details, see <ocrmypdf.com/foobar>.")
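For automated mirroring like the setup described above, the workaround can be scripted. The following is a minimal sketch, assuming the ocrmypdf CLI is on PATH: it runs the same flags as in the report, and if the output exceeds a size ratio or the run exceeds a time limit, it retries with --output-type pdf. The function names, the 2.0 ratio threshold, and the optional timeout are hypothetical choices, not ocrmypdf behavior.

```python
import os
import subprocess
from typing import Callable, Optional

# Flags copied from the report; everything else here is an assumption.
BASE_FLAGS = ["--skip-text", "--optimize", "3", "--jbig2-lossy"]


def size_ratio(input_size: int, output_size: int) -> float:
    """Return output size divided by input size."""
    if input_size <= 0:
        raise ValueError("input_size must be positive")
    return output_size / input_size


def compress_with_fallback(
    src: str,
    dst: str,
    max_ratio: float = 2.0,
    timeout_s: Optional[float] = None,
    run: Callable = subprocess.run,
) -> str:
    """Try the default (PDF/A) output; fall back to plain PDF on blowup.

    Returns "pdfa" if the first attempt was kept, "pdf" if the fallback ran.
    """
    try:
        run(["ocrmypdf", *BASE_FLAGS, src, dst], check=True, timeout=timeout_s)
        ratio = size_ratio(os.path.getsize(src), os.path.getsize(dst))
        if ratio <= max_ratio:
            return "pdfa"
    except subprocess.TimeoutExpired:
        pass  # treat a very slow PDF/A conversion like a size blowup
    if os.path.exists(dst):
        os.remove(dst)  # discard the oversized or partial first attempt
    run(["ocrmypdf", *BASE_FLAGS, "--output-type", "pdf", src, dst],
        check=True, timeout=None)
    return "pdf"
```

With the sizes reported by du above (2.5M in, 106M out), the ratio is far above any sensible threshold, so the second run would be triggered. The run parameter is there so the logic can be tested without invoking ocrmypdf.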

gwern added a commit to gwern/gwern.net that referenced this issue Mar 11, 2023
@jbarlow83
Collaborator

This is probably due to a high-DPI image triggering high-DPI rendering for the whole page.
Unfortunately I'm too busy with other work to dig into that issue, which looks easy to fix at a glance but has significant complications and implications for users who actually need high DPI (e.g. microscope images in PDFs).
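To make the hypothesis concrete, here is the arithmetic behind "one high-DPI image forces high-DPI rendering of the whole page". This is only an illustration with made-up numbers, not ocrmypdf's actual code: the function names, the 100-pixel image drawn 3 points wide, and the US Letter page size are all assumptions.

```python
def image_dpi(pixels_wide: int, displayed_width_pt: float) -> float:
    """Effective DPI of an image drawn displayed_width_pt points wide.

    PDF user space uses 72 points per inch.
    """
    return pixels_wide * 72.0 / displayed_width_pt


def page_raster_pixels(page_w_pt: float, page_h_pt: float, dpi: float) -> int:
    """Total pixel count if the whole page is rasterized at the given DPI."""
    width_px = round(page_w_pt / 72.0 * dpi)
    height_px = round(page_h_pt / 72.0 * dpi)
    return width_px * height_px


# Hypothetical example: a small 100-pixel-wide bitmap drawn only 3 pt wide.
marker_dpi = image_dpi(100, 3)                          # 2400.0
normal = page_raster_pixels(612, 792, 300)              # US Letter at 300 DPI
inflated = page_raster_pixels(612, 792, marker_dpi)     # same page at 2400 DPI
print(marker_dpi, normal, inflated, inflated / normal)
```

In this made-up case, a single tiny image raises the page raster from about 8.4 million pixels to about 539 million, a 64× increase, which is the kind of blowup the comment above describes.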

@jbarlow83
Collaborator

Likely same as #1004

@gwern
Author

gwern commented Mar 18, 2023

That looks similar, but there's at least one difference here from his bug report: I'm not using --force-ocr (and I wouldn't need to with Arxiv PDFs, because they are all born-digital as TeX-generated papers). Further, if I add --oversample 300 to the testcase above, as suggested there, the problem does not seem fixed: ocrmypdf still takes a good 45 minutes to process and yields a 2202.05798.pdf-large which is 106MB, same as before.

@jbarlow83
Collaborator

Should be fixed in v15


2 participants