Regression extracting data from PDF: Unexpected end of stream #1541

Closed
brunoseivam opened this issue Jan 8, 2023 · 6 comments · Fixed by #1552
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@brunoseivam

I am getting pypdf.errors.PdfReadError: Unexpected end of stream when trying to extract text from a PDF. This used to work before (around July 2022).

I tested a few versions and it seems the regression was introduced between versions 2.10.5 and 2.10.6 (PyPDF2). It looks like it was introduced with this commit: 5049c1e

Environment

Which environment were you using when you encountered the problem?

$ python3 -m platform
Linux-5.15.0-57-generic-x86_64-with-glibc2.35

$ python3 -c "import pypdf;print(pypdf.__version__)"
3.2.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader


def pdf_to_text(filename: str) -> str:
    # Extract the text of every page and join the pages with newlines.
    with open(filename, 'rb') as f:
        reader = PdfReader(f)
        return '\n'.join([
            page.extract_text()
            for page in reader.pages
        ])

I can't really share the PDF that's causing the issue here. It is small but contains financial data. I am willing to share it privately if needed.

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/beancount/ingest/identify.py", line 63, in find_imports
    matched = importer.identify(file)
  File "/home/bmartins/docs/Beancount/src/importers/chase/chase_pdf.py", line 60, in identify
    and 'www.chase.com' in file.convert(pdf_to_text)
  File "/usr/lib/python3/dist-packages/beancount/ingest/cache.py", line 54, in convert
    result = self._cache[converter_func] = converter_func(self.name)
  File "/home/bmartins/docs/Beancount/src/importers/chase/chase_pdf.py", line 27, in pdf_to_text
    return '\n'.join([
  File "/home/bmartins/docs/Beancount/src/importers/chase/chase_pdf.py", line 28, in <listcomp>
    page.extract_text()
  File "/usr/local/lib/python3.10/dist-packages/pypdf/_page.py", line 1852, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.10/dist-packages/pypdf/_page.py", line 1357, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/usr/local/lib/python3.10/dist-packages/pypdf/generic/_data_structures.py", line 907, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/usr/local/lib/python3.10/dist-packages/pypdf/generic/_data_structures.py", line 977, in __parse_content_stream
    ii = self._read_inline_image(stream)
  File "/usr/local/lib/python3.10/dist-packages/pypdf/generic/_data_structures.py", line 1018, in _read_inline_image
    raise PdfReadError("Unexpected end of stream")
pypdf.errors.PdfReadError: Unexpected end of stream
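
To narrow down which page triggers the error, text can be extracted page by page while collecting failures instead of aborting on the first one. The following is only a sketch: `extract_text_per_page` and its parameters are hypothetical names, not part of pypdf. It assumes each page object exposes `extract_text()`, as in the example above.

```python
from typing import Iterable, List, Tuple, Type


def extract_text_per_page(
    pages: Iterable,
    error_types: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Extract text from each page, recording failures instead of raising.

    Hypothetical helper: returns (texts, failures), where failures holds
    (zero-based page index, error message) for every page that raised.
    """
    texts: List[str] = []
    failures: List[Tuple[int, str]] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except error_types as exc:
            # Keep a placeholder so list positions still match page numbers.
            texts.append("")
            failures.append((index, str(exc)))
    return texts, failures
```

With pypdf, one would presumably pass `reader.pages` as `pages` and `(PdfReadError,)` from `pypdf.errors` as `error_types`, so that only read errors like the one in the traceback are caught.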
@pubpub-zz
Collaborator

pubpub-zz commented Jan 9, 2023

@brunoseivam
Can you share the file privately with @MartinThoma (info@martin-thoma.de)?

@pubpub-zz
Collaborator

Thanks for your help. The issue is actually caused by the QR image on the last page: a separator is missing before the EI delimiter. Being tolerant of that, however, makes other PDFs fail, namely those that respect the standard and in which the EI sequence appears within the image data... 😣
Still thinking about a good criterion.
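
To illustrate the dilemma, here is a minimal sketch of two ways to locate the end of inline image data. It is not pypdf's actual implementation; `find_ei_strict`, `find_ei_tolerant`, and the sample byte strings are hypothetical and only serve to show why neither rule works for every file.

```python
from typing import Optional

# PDF whitespace characters: NUL, tab, line feed, form feed, carriage return, space.
WHITESPACE = b"\x00\t\n\x0c\r "


def find_ei_strict(data: bytes, start: int = 0) -> Optional[int]:
    """Return the index of an 'EI' that has whitespace before it and
    whitespace (or the end of the data) after it, or None if there is none."""
    pos = data.find(b"EI", start)
    while pos != -1:
        before_ok = pos > start and data[pos - 1] in WHITESPACE
        after_ok = pos + 2 == len(data) or data[pos + 2] in WHITESPACE
        if before_ok and after_ok:
            return pos
        pos = data.find(b"EI", pos + 1)
    return None


def find_ei_tolerant(data: bytes, start: int = 0) -> Optional[int]:
    """Return the index of the first 'EI' followed by whitespace (or the end
    of the data), even when no separator precedes it."""
    pos = data.find(b"EI", start)
    while pos != -1:
        after_ok = pos + 2 == len(data) or data[pos + 2] in WHITESPACE
        if after_ok:
            return pos
        pos = data.find(b"EI", pos + 1)
    return None


# Image data with no separator before EI, as described for this issue.
missing_separator = b"\x01\x02\x03EI\nQ\n"
# Compliant data in which the bytes 'EI' also occur inside the image.
ei_inside_image = b"\x01EI \x02\x03\nEI\nQ\n"

print(find_ei_strict(missing_separator))    # None: the stream appears to end unexpectedly
print(find_ei_tolerant(missing_separator))  # 3: the real delimiter is found
print(find_ei_strict(ei_inside_image))      # 7: the real delimiter is found
print(find_ei_tolerant(ei_inside_image))    # 1: the image data is cut short
```

The strict rule rejects the file from this issue, while the tolerant rule truncates the compliant one, which is why a more careful criterion is needed.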

@pubpub-zz
Collaborator

pubpub-zz commented Jan 14, 2023

@brunoseivam
Are you OK with me generating a public test file that contains only the last page?

@brunoseivam
Author

@pubpub-zz That should be fine. Thanks for looking into it!

@pubpub-zz
Collaborator

This is the test file:
tst_iss1541.pdf

@MartinThoma
Member

Thank you for investigating the issue and fixing it @pubpub-zz 🙏
Thank you @brunoseivam for sharing the data 🤗

A pypdf version newer than 3.2.1 will be released this weekend (22.01.2023 at the latest). That version will contain the fix.

@MartinThoma added the is-bug label on Mar 25, 2023