[BUG] 'DecompressionBombError' on a ACM PDF - need resolution limit on high DPI #1104

Closed
gwern opened this issue May 18, 2023 · 7 comments

@gwern

gwern commented May 18, 2023

The ACM's PDF of Hamming 1959 fails with every ocrmypdf v11.7.3 command I try, raising a DecompressionBombError about the image's pixel count.

320954.320958.pdf

$ ocrmypdf --version
11.7.3
$ wget https://dl.acm.org/doi/pdf/10.1145/320954.320958
$ ocrmypdf 320954.320958.pdf 320954.320958.pdf
Scanning contents: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 102.66page/s]
Start processing 11 pages concurrently
OCR:  77%|███████████████████████████████████████████████████████████████████████████████████████████▉                           | 8.5/11.0 [00:02<00:02,  1.19page/s]
OCR:  91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎          | 10.0/11.0 [00:14<00:01,  1.48s/page]
An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_sync.py", line 191, in exec_page_sync
    page_context, orientation_correction
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_sync.py", line 120, in make_intermediate_images
    remove_vectors=False,
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_pipeline.py", line 458, in rasterize
    filter_vector=remove_vectors,
  File "/home/gwern/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/home/gwern/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/home/gwern/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/home/gwern/.local/lib/python3.6/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/home/gwern/.local/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/home/gwern/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 76, in rasterize_pdf_page
    filter_vector=filter_vector,
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_exec/ghostscript.py", line 124, in rasterize_pdf
    with Image.open(BytesIO(p.stdout)) as im:
  File "/home/gwern/.local/lib/python3.6/site-packages/PIL/Image.py", line 2953, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/home/gwern/.local/lib/python3.6/site-packages/PIL/Image.py", line 2940, in _open_core
    _decompression_bomb_check(im.size)
  File "/home/gwern/.local/lib/python3.6/site-packages/PIL/Image.py", line 2850, in _decompression_bomb_check
    f"Image size ({pixels} pixels) exceeds limit of {2 * MAX_IMAGE_PIXELS} "
PIL.Image.DecompressionBombError: Image size (471055716 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_sync.py", line 373, in run_pipeline
    exec_concurrent(context)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_sync.py", line 285, in exec_concurrent
    task_finished=update_page,
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_concurrent.py", line 112, in exec_progress_pool
    for result in results:
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
PIL.Image.DecompressionBombError: Image size (471055716 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
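
For context, the exception is raised by Pillow's size check rather than by OCRmyPDF's own logic. The traceback shows the limit being formatted as `2 * MAX_IMAGE_PIXELS`, and Pillow also emits a warning above the base limit. The check can be sketched roughly as below. This is a simplified stand-in, not Pillow's source: the class and function names are illustrative, and the 128,000,000 base value is an assumption inferred from the reported limit of 256,000,000.

```python
import warnings

# Assumption: base limit inferred from the log (256000000 == 2 * 128000000).
MAX_IMAGE_PIXELS = 128_000_000


class DecompressionBombError(Exception):
    """Illustrative stand-in for PIL.Image.DecompressionBombError."""


def check_pixels(width, height, max_pixels=MAX_IMAGE_PIXELS):
    """Return the pixel count; warn above the limit, raise above twice the limit."""
    pixels = width * height
    if pixels > 2 * max_pixels:
        raise DecompressionBombError(
            f"Image size ({pixels} pixels) exceeds limit of {2 * max_pixels} pixels"
        )
    if pixels > max_pixels:
        warnings.warn(f"Image size ({pixels} pixels) exceeds limit of {max_pixels} pixels")
    return pixels


# A page with the pixel count from the log trips the hard limit.
try:
    check_pixels(471_055_716, 1)
except DecompressionBombError as exc:
    print(exc)
```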
@stumpylog
Contributor

  • Version 11 is well behind the current release; I'd consider upgrading.
  • You can set --max-image-mpixels to raise the default limit in cases where you trust the input file.
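
As a rough guide to picking a value, the sketch below works backwards from the pixel count in the error message. It assumes, as the log suggests but does not prove, that the flag's megapixel value becomes Pillow's base limit and that the hard error fires at twice that value. The output filename is hypothetical, and the command is only assembled, not executed.

```python
import math

reported_pixels = 471_055_716  # from the DecompressionBombError message

# Smallest megapixel setting whose doubled value covers the page (avoids the error).
min_to_avoid_error = math.ceil(reported_pixels / 2 / 1_000_000)
# Smallest setting that covers the page outright (would also avoid a warning).
min_to_avoid_warning = math.ceil(reported_pixels / 1_000_000)

# Hypothetical output name; only raise the limit for input you trust.
command = [
    "ocrmypdf",
    "--max-image-mpixels", str(min_to_avoid_warning),
    "320954.320958.pdf",
    "320954.320958.ocr.pdf",
]

print(min_to_avoid_error, min_to_avoid_warning)
print(" ".join(command))
```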

@jbarlow83 closed this as not planned (won't fix, can't repro, duplicate, stale) on May 22, 2023
@gwern
Author

gwern commented May 24, 2023

@stumpylog Yes, v11 is behind, but are you saying that the latest version handles this PDF successfully? Otherwise I don't understand why this bug has been closed as 'not planned' (what does 'not planned' mean?). It seems like a valid PDF that ocrmypdf ought to be able to handle, OCR, and return as a new PDF of around the same size.

@stumpylog
Contributor

The --max-image-mpixels argument lets you configure the limit you're running into, so there is an existing solution.

@gwern
Author

gwern commented May 26, 2023

Sure, but there being a workaround doesn't mean ocrmypdf (or Ghostscript) is correct. Like, if I lift the pixel limit to handle the 471,055,716 pixels the error message reports, perhaps ocrmypdf can at least finish without erroring; but... isn't that a really weird and unusual number of pixels (which is why the limit is there in the first place)? That means there is somehow an image of, presumably, sqrt(471,055,716) ≈ 21,703x21,703px inside the PDF, right? That's an absurdly huge image when most papers would be at least 10x smaller in each dimension. There shouldn't be any such thing in a scan of a simple short old math paper.

So something is wrong somewhere, and 'just raise the limit' doesn't really resolve the bug. It only treats a symptom, without having diagnosed whether there is even a disease to be concerned about. Is OCRmyPDF wrong? Is Ghostscript wrong? Is the ACM wrong & generating screwed-up PDFs? If it's OCRmyPDF, then obviously maybe it needs to fix something; if it's Ghostscript, then I need to upgrade (the newer Ghostscripts include a complete rewrite which obviates a number of earlier bugs, including one I reported a few weeks ago) and see whether to report upstream; if it's the ACM, then I need to send them a complaint (just so I can say I tried) and add the workaround to my system (because it is unlikely they will fix their entire corpus, much less all the copies floating around).
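
The square-root estimate above can be checked directly. The short sketch below uses only the two numbers printed in the error message and treats the page as square purely for illustration, as the comment does.

```python
import math

pixels = 471_055_716       # image size from the error message
error_limit = 256_000_000  # limit from the error message

side = math.isqrt(pixels)             # floor of the square root -> 21703
limit_side = math.isqrt(error_limit)  # side of a square image at the limit -> 16000
ratio = pixels / error_limit          # how far over the limit the page is

print(side, limit_side, round(ratio, 2))
```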

@jbarlow83
Collaborator

On page 1 of the supplied PDF, there is a small image with dimensions ~1003x1004 that renders at 2770x2770 pixels per inch (ppi/dpi). (It's that "check for updates" icon.)

When OCRmyPDF needs to rasterize a PDF page to an image, it does so at the highest ppi it finds on that page, as this ensures nothing is lost. So the entire first page is promoted to a huge image because of this little icon.
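
The arithmetic behind this can be sketched with the figures quoted in the thread. The numbers are approximate (the icon is "~1003" pixels wide and the resolution is quoted as 2770), the helper names are illustrative, and the 600 dpi comparison simply takes the upper end of the typical scan range as a reference point.

```python
def displayed_inches(pixel_length, dpi):
    """Physical length on the page of an image side, given its pixels and dpi."""
    return pixel_length / dpi


def implied_page_area_sq_in(total_pixels, dpi):
    """Page area implied by a raster of total_pixels rendered at a uniform dpi."""
    return total_pixels / (dpi * dpi)


# The ~1003 px icon at ~2770 dpi occupies only a fraction of an inch on the page.
icon_width_in = displayed_inches(1003, 2770)

# Rasterizing the whole page at the icon's dpi gives the pixel count in the error.
page_area = implied_page_area_sq_in(471_055_716, 2770)

# The same page area rasterized at 600 dpi is far smaller.
pixels_at_600 = page_area * 600 * 600

print(round(icon_width_in, 2), round(page_area, 1), round(pixels_at_600))
```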

While typical scanned documents range from 72 to 600 dpi, there are practical imaging scenarios with resolutions well above this, in particular when scanning film or micrographs. Similarly, when scanning large maps or blueprints at high resolution, it's easy to produce images more than 10k pixels on a side. (I work with both at times.)

I will see about setting an upper limit on DPI and giving users the opportunity to opt in to high resolution if they know they need it -- something along those lines.
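
One way such a limit could work is sketched below. This is a hypothetical illustration of the idea, not OCRmyPDF's implementation: the function name, the 600 dpi default cap, the opt-in parameter, and the fallback value are all assumptions.

```python
def choose_raster_dpi(image_dpis, cap=600.0, allow_high_dpi=False, fallback=300.0):
    """Pick a page rasterization dpi from the dpi of each image on the page.

    Uses the highest image dpi so nothing is lost, but clamps it to `cap`
    unless the caller explicitly opts in to high resolution.
    """
    highest = max(image_dpis, default=fallback)
    if allow_high_dpi:
        return highest
    return min(highest, cap)


# A page with an ordinary scan plus one tiny, very dense icon.
print(choose_raster_dpi([300.0, 2770.0]))                       # clamped to the cap
print(choose_raster_dpi([300.0, 2770.0], allow_high_dpi=True))  # opt-in keeps 2770
```

With a cap in place, a single dense icon no longer dictates the resolution of the whole page, while users scanning film or maps can still opt in.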

@jbarlow83 reopened this on Jun 11, 2023
@jbarlow83 changed the title from "[BUG] 'DecompressionBombError' on a ACM PDF" to "[BUG] 'DecompressionBombError' on a ACM PDF - need resolution limit on high DPI" on Jun 11, 2023
@gwern
Author

gwern commented Jun 11, 2023

So the upper limit on the rasterized image of page 1 would apply only to the temporary rasterized image fed to the OCR utility, and wouldn't change the final PDF (unless one was forcing rasterization)?

@jbarlow83
Collaborator

Should be fixed in v15
