ValueError decoding images - "not enough image data" #1814

Closed
ryanfox opened this issue Apr 26, 2023 · 8 comments · Fixed by #1815

Comments

@ryanfox

ryanfox commented Apr 26, 2023

A ValueError is raised (through Pillow) when decoding certain images: Pillow fails with "not enough image data". I have found a couple of PDFs on Project Gutenberg that cause the error. One is Grimm's Fairy Tales: https://gutenberg.org/ebooks/2591

Environment

Which environment were you using when you encountered the problem?

Both Windows and WSL (Linux)

$ python -c "import pypdf;print(pypdf.__version__)"
3.8.1

I am using Pillow 9.5.0 (the latest as of this report).

Code + PDF

Using the .images iterator causes the error:

from pypdf import PdfReader

pdf_file = "grimm10.pdf"  # local copy of the PDF linked below
reader = PdfReader(pdf_file)
for page in reader.pages:
    for img in page.images:
        pass

Tested with https://gutenberg.org/files/2591/old/grimm10.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/myuser/venv/lib/python3.10/site-packages/pypdf/_page.py", line 463, in images
    extension, byte_stream = _xobj_to_image(x_object[obj])
  File "/home/myuser/venv/lib/python3.10/site-packages/pypdf/filters.py", line 731, in _xobj_to_image
    img = Image.frombytes(mode, size, data)
  File "/home/myuser/venv/lib/python3.10/site-packages/PIL/Image.py", line 2970, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/home/myuser/venv/lib/python3.10/site-packages/PIL/Image.py", line 826, in frombytes
    raise ValueError(msg)
ValueError: not enough image data
@ryanfox
Author

ryanfox commented Apr 26, 2023

I also just noticed calling page.extract_text() on any page in that PDF returns random-looking unicode.

@pubpub-zz
Collaborator

Adding the file for testing:
grimm10.pdf

The issue comes from the number of bits per component not being checked. With that check in place, it works.
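
To illustrate why an unchecked bits-per-component value leads to "not enough image data": in a raw PDF image stream each row starts on a byte boundary, so a 1-bit image carries far fewer bytes than an 8-bit decode of the same size expects. The following is only a sketch of that arithmetic, not the code from the fix; `expected_stream_length` and its parameters are hypothetical names chosen for this example.

```python
def expected_stream_length(width, height, bits_per_component, components=1):
    """Bytes in a decoded raw image stream whose rows are padded to whole bytes."""
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8  # round each row up to a byte boundary
    return bytes_per_row * height


# A 100 x 50 grayscale image stored at 1 bit per component...
actual = expected_stream_length(100, 50, bits_per_component=1)
# ...versus what an 8-bit decode of the same dimensions would need.
assumed = expected_stream_length(100, 50, bits_per_component=8)
print(actual, assumed)
```

If the decoder assumes 8 bits per component while the stream was written with 1, it asks for several times more bytes than are present, which is consistent with the error in the traceback.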

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Apr 26, 2023
@pubpub-zz
Collaborator

I also just noticed calling page.extract_text() on any page in that PDF returns random-looking unicode.

I've checked the data copied from Acrobat Reader and it is also unreadable. The PDF seems to have been processed to prevent data extraction.

MartinThoma pushed a commit that referenced this issue May 1, 2023
@MartinThoma
Member

The image issue was addressed in #1815, which was just merged into main. It will be part of pypdf > 3.8.1. I will release it on the 7th of May.

@ericgonzadev

This issue still persists in the latest released version, pypdf 3.9.0.

@pubpub-zz
Collaborator

@ericgonzadev
Other causes can produce this error as well. Could you please provide simple test code, the PDF, and the stack trace?

@ericgonzadev

@pubpub-zz

Environment

Mac OS 12.6.6

$ python3 -c "import pypdf;print(pypdf.__version__)"
3.9.0

Code + PDF

from pypdf import PdfReader
reader = PdfReader(pdf_file)
for index, page in enumerate(reader.pages):
    # Keep track of which page is being scanned
    print("Scanning page #" + str(index + 1))

    # This line is only here to confirm that .extract_text() returns actual text, which it appears to.
    print(page.extract_text())

    # The error occurs in the following loop
    for img in page.images:
        pass

Tested with the following pdf: https://ufile.io/o1whh9b3
The error seems to occur on page 34.
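
To narrow down which pages trigger the error without stopping at the first one, the loop above can be wrapped in a try/except. This is a minimal sketch, not part of the original report; `find_failing_pages` is a hypothetical helper that accepts any sequence of page objects exposing `.images`, such as `reader.pages`.

```python
def find_failing_pages(pages):
    """Return (page_number, message) for each page whose images fail to decode."""
    failures = []
    for page_number, page in enumerate(pages, start=1):
        try:
            for _ in page.images:  # decoding is triggered by accessing the images
                pass
        except ValueError as exc:
            failures.append((page_number, str(exc)))
    return failures


# Intended use (assumes pypdf is installed and `pdf_file` points to the PDF):
#   from pypdf import PdfReader
#   print(find_failing_pages(PdfReader(pdf_file).pages))
```

Only ValueError is caught here because that is the exception shown in the traceback; other decoding problems would still propagate.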

Traceback

Traceback (most recent call last):
  File "/Users/eric/Downloads/pdf_extract_images.py", line 60, in scan_directory
    for img in page.images:
                      ^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/_page.py", line 463, in images
    extension, byte_stream = _xobj_to_image(x_object[obj])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pypdf/filters.py", line 707, in _xobj_to_image
    alpha = Image.frombytes("L", size, x_object_obj[G.S_MASK].get_data())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PIL/Image.py", line 2970, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PIL/Image.py", line 826, in frombytes
    raise ValueError(msg)
ValueError: not enough image data

@pubpub-zz
Collaborator

Thanks for sharing.
PR #1834 has been opened and solves the issue. However, other errors are reported further on: your document is a good test sample.

Could you open another issue (copy/paste this report) so that you can follow the progress?
