Regression extracting data from PDF: Unexpected end of stream #1541
Comments
@brunoseivam |
Thanks for your help. The issue is actually caused by the QR image on the last page: a separator is missing before the EI delimiter. Being tolerant of that makes other PDFs fail, namely those that respect the standard but happen to contain the EI sequence within the image data... 😣 |
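To illustrate the trade-off described above, here is a minimal sketch of scanning inline image data for its EI terminator. It is not pypdf's actual implementation: the function name `find_inline_image_end` and the `strict` flag are hypothetical, and the sketch only assumes that a well-formed EI operator is preceded and followed by PDF whitespace.

```python
# PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def find_inline_image_end(data: bytes, strict: bool = True) -> int:
    """Return the index of the EI operator that ends the image data, or -1.

    Hypothetical helper for illustration only.
    strict=True  requires whitespace before EI (standard-conforming files).
    strict=False also accepts EI with no separator before it (tolerant mode).
    """
    pos = data.find(b"EI")
    while pos != -1:
        after = data[pos + 2 : pos + 3]
        # EI must be followed by whitespace or by the end of the data.
        followed_ok = after == b"" or after in PDF_WHITESPACE
        # In strict mode EI must also be preceded by whitespace.
        preceded_ok = pos > 0 and data[pos - 1 : pos] in PDF_WHITESPACE
        if followed_ok and (preceded_ok or not strict):
            return pos
        pos = data.find(b"EI", pos + 1)
    return -1


# Case 1: separator missing before EI (like the QR image in this issue).
broken = b"\x01\x02\x03EI Q"
# Case 2: conforming file whose image bytes happen to contain "EI ".
conforming = b"\x01EI \x02\x03\nEI Q"

print(find_inline_image_end(broken, strict=True))       # strict scan finds no terminator
print(find_inline_image_end(broken, strict=False))      # tolerant scan finds it
print(find_inline_image_end(conforming, strict=True))   # strict scan finds the real EI
print(find_inline_image_end(conforming, strict=False))  # tolerant scan stops too early
```

In this sketch the strict scan fails on the first input, which corresponds to the "Unexpected end of stream" situation, while the tolerant scan cuts the second image short at the bytes that merely look like a terminator.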
@brunoseivam |
@pubpub-zz this should be fine. Thanks for looking into it! |
this is the test file |
Thank you for investigating the issue and fixing it @pubpub-zz 🙏
I am getting
pypdf.errors.PdfReadError: Unexpected end of stream
when trying to extract text from a PDF. This used to work before (around July 2022). I tested a few versions, and the regression seems to have been introduced between versions 2.10.5 and 2.10.6 (PyPDF2), apparently with this commit: 5049c1e
Environment
Which environment were you using when you encountered the problem?
$ python3 -m platform
Linux-5.15.0-57-generic-x86_64-with-glibc2.35
$ python3 -c "import pypdf;print(pypdf.__version__)"
3.2.1
Code + PDF
This is a minimal, complete example that shows the issue:
I can't share the PDF that causes the issue here: it is small, but it contains financial data. I am willing to share it privately if needed.
Traceback
This is the complete Traceback I see: