Regression extracting data from PDF: Unexpected end of stream #1541

brunoseivam · 2023-01-08T20:18:41Z

I am getting pypdf.errors.PdfReadError: Unexpected end of stream when trying to extract text from a PDF. This used to work before (around July 2022).

I tested a few versions and it seems the regression was introduced between versions 2.10.5 and 2.10.6 (PyPDF2). It looks like it was introduced with this commit: 5049c1e

Environment

Which environment were you using when you encountered the problem?

$ python3 -m platform
Linux-5.15.0-57-generic-x86_64-with-glibc2.35

$ python3 -c "import pypdf;print(pypdf.__version__)"
3.2.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader


def pdf_to_text(filename: str) -> str:
    # Extract the text of every page and join the pages with newlines.
    with open(filename, 'rb') as f:
        reader = PdfReader(f)
        return '\n'.join([
            page.extract_text()
            for page in reader.pages
        ])

I can't really share the PDF that's causing the issue here. It is small but contains financial data. I am willing to share it privately if needed.

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/beancount/ingest/identify.py", line 63, in find_imports
    matched = importer.identify(file)
  File "/home/bmartins/docs/Beancount/src/importers/chase/chase_pdf.py", line 60, in identify
    and 'www.chase.com' in file.convert(pdf_to_text)
  File "/usr/lib/python3/dist-packages/beancount/ingest/cache.py", line 54, in convert
    result = self._cache[converter_func] = converter_func(self.name)
  File "/home/bmartins/docs/Beancount/src/importers/chase/chase_pdf.py", line 27, in pdf_to_text
    return '\n'.join([
  File "/home/bmartins/docs/Beancount/src/importers/chase/chase_pdf.py", line 28, in <listcomp>
    page.extract_text()
  File "/usr/local/lib/python3.10/dist-packages/pypdf/_page.py", line 1852, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.10/dist-packages/pypdf/_page.py", line 1357, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/usr/local/lib/python3.10/dist-packages/pypdf/generic/_data_structures.py", line 907, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/usr/local/lib/python3.10/dist-packages/pypdf/generic/_data_structures.py", line 977, in __parse_content_stream
    ii = self._read_inline_image(stream)
  File "/usr/local/lib/python3.10/dist-packages/pypdf/generic/_data_structures.py", line 1018, in _read_inline_image
    raise PdfReadError("Unexpected end of stream")
pypdf.errors.PdfReadError: Unexpected end of stream
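
Until a fix is available, one possible stopgap is to extract page by page and skip only the pages that raise, so that a single malformed content stream does not abort the whole document. The sketch below is an illustration only, not part of pypdf: `extract_pages_tolerant` and its parameters are hypothetical names, and it only assumes that each page object exposes an `extract_text()` method as in the example above.

```python
from typing import Any, Iterable, List, Tuple, Type


def extract_pages_tolerant(
    pages: Iterable[Any],
    error_types: Tuple[Type[BaseException], ...],
) -> Tuple[List[str], List[int]]:
    """Extract text page by page, skipping pages that raise one of error_types.

    Hypothetical helper: returns the per-page texts (empty string for a
    failed page) and the zero-based indices of the pages that failed.
    """
    texts: List[str] = []
    failed: List[int] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except error_types:
            # Keep the page slot so indices still line up with the document.
            texts.append("")
            failed.append(index)
    return texts, failed
```

With pypdf, this would presumably be called as `extract_pages_tolerant(reader.pages, (PdfReadError,))` while the file is still open, with `PdfReadError` imported from `pypdf.errors`. In this report the failing page would then come back as an empty string instead of aborting the whole extraction.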


pubpub-zz · 2023-01-09T19:25:33Z

@brunoseivam
Can you share the file privately with @MartinThoma (info@martin-thoma.de)?

pubpub-zz · 2023-01-11T17:57:44Z

Thanks for your help. The issue is caused by the QR image on the last page: a separator is missing before the EI delimiter. Being tolerant about this makes other PDFs fail, namely those that respect the standard but contain the EI sequence within the image data... 😣
Still thinking about a good criterion.
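
To illustrate the trade-off, here is a minimal sketch of the two criteria for locating the EI that ends inline image data. It is not pypdf's actual parser: `find_ei`, its `strict` flag, and the sample byte strings are made up for this example. The strict rule requires whitespace on both sides of EI; the tolerant rule only requires whitespace (or end of data) after it.

```python
# PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def find_ei(data: bytes, strict: bool = True) -> int:
    """Return the index of the 'EI' that ends the image data, or -1.

    Hypothetical helper. strict=True requires whitespace before and after
    'EI'; strict=False only requires whitespace (or end of data) after it.
    """
    pos = data.find(b"EI")
    while pos != -1:
        before_ok = pos > 0 and data[pos - 1] in PDF_WHITESPACE
        after_ok = pos + 2 == len(data) or data[pos + 2] in PDF_WHITESPACE
        if after_ok and (before_ok or not strict):
            return pos
        pos = data.find(b"EI", pos + 1)
    return -1


# Compliant stream whose image bytes happen to contain "EI " early on.
compliant = b"\x01EI \x02\x03 EI Q"
# Stream like the one in this issue: no separator before the final EI.
broken = b"\x01\x02\x03EI Q"

print(find_ei(compliant, strict=True))   # 7: the real delimiter
print(find_ei(compliant, strict=False))  # 1: image data cut short
print(find_ei(broken, strict=True))      # -1: delimiter never found
print(find_ei(broken, strict=False))     # 3: delimiter found
```

Neither rule handles both inputs: the strict one never finds the end of the broken stream, and the tolerant one truncates the compliant image at a false match.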

pubpub-zz · 2023-01-14T11:22:59Z

@brunoseivam
Are you OK with me generating a public test file that contains only the last page?

brunoseivam · 2023-01-14T16:11:55Z

@pubpub-zz this should be fine. Thanks for looking into it!

pubpub-zz · 2023-01-14T16:16:56Z

This is the test file:
tst_iss1541.pdf

Closes #1541

MartinThoma · 2023-01-16T19:58:17Z

Thank you for investigating the issue and fixing it @pubpub-zz 🙏
Thank you @brunoseivam for sharing the data 🤗

pypdf>3.2.1 will be released this weekend (22.01.2023 at the latest). That version will contain the fix.

pubpub-zz mentioned this issue Jan 14, 2023

ENH: Accept inline images with space before EI #1552

Merged

MartinThoma closed this as completed in #1552 Jan 16, 2023

MartinThoma pushed a commit that referenced this issue Jan 16, 2023

ENH: Accept inline images with space before EI (#1552)

df90053

Closes #1541

MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Mar 25, 2023
