ValueError decoding images - "not enough image data" #1814

ryanfox · 2023-04-26T05:20:08Z

A ValueError is raised (through Pillow) when decoding certain images: when Pillow tries to decode the image, it fails with "not enough image data". I have found a couple of PDFs on Project Gutenberg that cause the error. One is Grimm's Fairy Tales: https://gutenberg.org/ebooks/2591

Environment

Which environment were you using when you encountered the problem?

Both Windows and WSL (Linux)

$ python -c "import pypdf;print(pypdf.__version__)"
3.8.1

I am using Pillow 9.5.0 (the latest as of this report).

Code + PDF

Using the .images iterator causes the error:

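from pypdf import PdfReader

pdf_file = "grimm10.pdf"  # local copy of the PDF linked below
reader = PdfReader(pdf_file)
for page in reader.pages:
    # Iterating over page.images is enough to trigger the ValueError
    for img in page.images:
        pass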

Tested with https://gutenberg.org/files/2591/old/grimm10.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/myuser/venv/lib/python3.10/site-packages/pypdf/_page.py", line 463, in images
    extension, byte_stream = _xobj_to_image(x_object[obj])
  File "/home/myuser/venv/lib/python3.10/site-packages/pypdf/filters.py", line 731, in _xobj_to_image
    img = Image.frombytes(mode, size, data)
  File "/home/myuser/venv/lib/python3.10/site-packages/PIL/Image.py", line 2970, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/home/myuser/venv/lib/python3.10/site-packages/PIL/Image.py", line 826, in frombytes
    raise ValueError(msg)
ValueError: not enough image data

ryanfox · 2023-04-26T05:31:34Z

I also just noticed calling page.extract_text() on any page in that PDF returns random-looking unicode.

pubpub-zz · 2023-04-26T20:25:39Z

Adding the file for testing:
grimm10.pdf

The issue comes from the fact that the number of bits per component is not checked. With that check added, it works.

fixes py-pdf#1814
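
To illustrate the bits-per-component point above, here is a minimal sketch of why a 1-bit image can lead to "not enough image data". It is not pypdf's actual code: `expected_stream_size` is a hypothetical helper, and the 100 x 50 dimensions are made up. It assumes uncompressed samples where each row is padded to a whole byte.

```python
def expected_stream_size(width: int, height: int,
                         bits_per_component: int, components: int = 1) -> int:
    """Byte length of uncompressed image samples, each row padded to a whole byte."""
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8  # round up to the next byte boundary
    return bytes_per_row * height


# Hypothetical 100 x 50 grayscale image, chosen only for illustration.
one_bit = expected_stream_size(100, 50, bits_per_component=1)
eight_bit = expected_stream_size(100, 50, bits_per_component=8)
print(one_bit, eight_bit)  # prints: 650 5000
```

If the decoder assumes 8 bits per component (Pillow mode "L") while the stream only holds the 1-bit data, it is handed far fewer bytes than it expects, which is consistent with the ValueError in the traceback.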

pubpub-zz · 2023-04-26T20:31:56Z

> I also just noticed calling page.extract_text() on any page in that PDF returns random-looking unicode.

I've checked the data copied from Acrobat Reader and it is also unreadable. The PDF seems to have been processed to prevent data extraction.

Fixes #1814

MartinThoma · 2023-05-01T06:45:01Z

The image issue was addressed in #1815, which was just merged to main. It will be part of pypdf > 3.8.1. I will release it on the 7th of May.

ericgonzadev · 2023-05-24T00:56:20Z

This issue still persists in the latest released version, pypdf 3.9.0.

pubpub-zz · 2023-05-24T05:01:12Z

@ericgonzadev
Other causes can produce this error too. Can you please provide simple test code, the PDF, and the stack trace?

ericgonzadev · 2023-05-24T23:32:35Z

@pubpub-zz

Environment

Mac OS 12.6.6

$ python3 -c "import pypdf;print(pypdf.__version__)"
3.9.0

Code + PDF

from pypdf import PdfReader
reader = PdfReader(pdf_file)
for index, page in enumerate(reader.pages):
    # Keep track of which page is being scanned
    print(f"Scanning page #{index + 1}")

    # Only here to confirm that .extract_text() returns actual text, which it appears to do
    print(page.extract_text())

    # The error occurs in the following loop
    for img in page.images:
        pass

Tested with the following pdf: https://ufile.io/o1whh9b3
The error seems to occur on page 34.
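
To narrow down which pages fail and to capture a traceback for each, a loop like the one in the report could be wrapped as in the sketch below. `find_failing_pages` is a hypothetical helper, not part of pypdf. It only assumes that each page object exposes an `images` attribute that may raise ValueError when accessed or iterated.

```python
import traceback


def find_failing_pages(pages):
    """Return (page_number, traceback_text) for each page whose images fail to decode."""
    failures = []
    for number, page in enumerate(pages, start=1):
        try:
            for _ in page.images:
                pass
        except ValueError:
            # Record the failure and keep scanning the remaining pages
            failures.append((number, traceback.format_exc()))
    return failures
```

With pypdf this would be called as `find_failing_pages(PdfReader(pdf_file).pages)`. Only ValueError is caught, so any other exception type still propagates.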

Traceback

Traceback (most recent call last):
  File "/Users/eric/Downloads/pdf_extract_images.py", line 60, in scan_directory
    for img in page.images:
                      ^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_page.py", line 463, in images
    extension, byte_stream = _xobj_to_image(x_object[obj])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/filters.py", line 707, in _xobj_to_image
    alpha = Image.frombytes("L", size, x_object_obj[G.S_MASK].get_data())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PIL/Image.py", line 2970, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PIL/Image.py", line 826, in frombytes
    raise ValueError(msg)
ValueError: not enough image data

pubpub-zz · 2023-05-25T19:01:21Z

Thanks for sharing.
PR #1834 has been opened and solves this issue. However, other errors are reported further on; your document is a good test sample.

Can you raise another issue (copy/paste) so that you can follow progress?

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Apr 26, 2023

BUG : Cope with 1 bit images

4c8994d

fixes py-pdf#1814

pubpub-zz mentioned this issue Apr 26, 2023

BUG : Cope with 1 bit images #1815

Merged

MartinThoma closed this as completed in #1815 May 1, 2023

MartinThoma pushed a commit that referenced this issue May 1, 2023

BUG: Cope with 1 Bit images (#1815)

8e343c1

Fixes #1814

pubpub-zz mentioned this issue May 26, 2023

bug : Issues during image extraction #1863

Closed
