PdfReadError: EOF marker not found error when opening pdf files generated from selenium snapshot #177

Closed
lovesh opened this issue Feb 8, 2015 · 13 comments
Labels
  • Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
  • is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
  • PdfReader: The PdfReader component is affected

Comments

@lovesh

lovesh commented Feb 8, 2015

I am using Selenium and Ghost to capture screenshots as PDF. The code for saving a screenshot is:

driver.get('http://localhost/report/10?page=1')
driver.save_screenshot('page1.pdf')

I can open these files in a PDF viewer (I am using Okular) and they look fine. But when I try to open them with this code:

from PyPDF2 import PdfFileReader
input1 = PdfFileReader(open("page1.pdf", "rb"))

It gives the error PdfReadError: EOF marker not found. I am trying to open the file with PdfFileReader because I need to merge several PDFs into one, and for that I need to open them. I found GitHub issue #34, which says the problem was resolved, but I still face it. My PyPDF2 version is 1.24.
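Before digging into the reader, it can help to check what the file actually contains. The sketch below uses only the standard library and reports whether the bytes start with a PDF header, whether they start with the PNG signature, and where the last %%EOF marker sits. The function name `describe_pdf_bytes` is hypothetical, and the PNG check is only an assumption worth testing for screenshot output; nothing in this thread confirms what the screenshot call wrote.

```python
def describe_pdf_bytes(data: bytes) -> dict:
    """Report basic structural facts about bytes that claim to be a PDF."""
    marker = b"%%EOF"
    last_eof = data.rfind(marker)
    return {
        # A PDF normally begins with the "%PDF-" header.
        "has_pdf_header": data.startswith(b"%PDF-"),
        # PNG files begin with this fixed 8-byte signature.
        "looks_like_png": data.startswith(b"\x89PNG\r\n\x1a\n"),
        # Offset of the last "%%EOF" marker, or -1 if there is none.
        "last_eof_offset": last_eof,
        # Number of bytes after the last marker (None when no marker exists).
        "bytes_after_eof": None if last_eof < 0 else len(data) - (last_eof + len(marker)),
    }


if __name__ == "__main__":
    # "page1.pdf" is the file name used in the report above.
    with open("page1.pdf", "rb") as f:
        print(describe_pdf_bytes(f.read()))
```

If `last_eof_offset` is -1, no amount of widening the reader's search window will help, because the marker is not in the file at all.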

@abixadamj

abixadamj commented Jul 31, 2015

I want to add that if I read a PDF with PdfFileReader, write it with PdfFileWriter, and then read the result with PdfFileReader again, I get:

input_file = PdfFileReader(open("/tmp/zakodowany.pdf", "rb"))
PdfReadError                              Traceback (most recent call last)
/home/adasiek/<ipython-input-12-cb4f4869d7a1> in <module>()
----> 1 input_file = PdfFileReader(open("/tmp/zakodowany.pdf", "rb"))

/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.pyc in __init__(self, stream, strict, warndest, overwriteWarnings)
   1063             stream = BytesIO(b_(fileobj.read()))
   1064             fileobj.close()
-> 1065         self.read(stream)
   1066         self.stream = stream
   1067 

/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.pyc in read(self, stream)
   1665         while line[:5] != b_("%%EOF"):
   1666             if stream.tell() < last1K:
-> 1667                 raise utils.PdfReadError("EOF marker not found")
   1668             line = self.readNextEndLine(stream)
   1669             if debug: print("  line:",line)

PdfReadError: EOF marker not found

@rafaelcanovas

Hi there,

How did this issue end up?

I'm facing the exact same problem right now.

Thank you :)

@vivekpd15

+1

akolpakov added a commit to akolpakov/PyPDF2 that referenced this issue Feb 6, 2017
Try to find “%%EOF” in last 1Mb of file.
akolpakov added a commit to akolpakov/PyPDF2 that referenced this issue Feb 6, 2017
@fractos

fractos commented Apr 4, 2017

I just hit this problem too. Could this please be fixed in the pip-installable version soon?

vstoykov added a commit to IndustriaTech/PyPDF2 that referenced this issue Jul 21, 2017
* akolpakov/issue_177:
  Fix py-pdf#177 Try to find “%%EOF” in last 1Mb of file.
@beruic

beruic commented Feb 14, 2018

Again, this issue would be REALLY nice to have fixed in a pip release.

I'm getting a bunch of auto-generated PDFs from customers where the %%EOF marker is not within the last 1 kB, so the fix in PR #321 should be applied.
It is not elegant, but neither is the current code, and it would let us move forward.
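The effect of the search window can be illustrated with a small standalone sketch. This is not PyPDF2's code: the traceback above shows the reader stepping backwards line by line and giving up once it leaves the last 1K, whereas the hypothetical helper `find_eof_marker` below simply searches the final `window` bytes of a stream for the marker. It is only meant to show why a 1 kB window misses a marker that a 1 MB window finds.

```python
import io


def find_eof_marker(stream, window=1024):
    """Return the offset of the last b"%%EOF" in the final `window` bytes, or -1."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    # Never seek before the start of the stream.
    start = max(0, size - window)
    stream.seek(start)
    tail = stream.read(size - start)
    pos = tail.rfind(b"%%EOF")
    return -1 if pos < 0 else start + pos


if __name__ == "__main__":
    # A toy "PDF" followed by 20 KiB of trailing zero bytes.
    data = b"%PDF-1.4\nbody\n%%EOF\n" + b"\x00" * 20480
    stream = io.BytesIO(data)
    print(find_eof_marker(stream, window=1024))         # marker outside the window
    print(find_eof_marker(stream, window=1024 * 1024))  # marker inside the window
```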

@kut

kut commented Dec 5, 2018

Same issue here; it would be fixed by PR #321.

@myleshk

myleshk commented Jul 23, 2019

Same issue, please fix.

@joseprieto

I face the same issue! Has anyone found a way to solve it?

@myleshk

myleshk commented Feb 26, 2020

I just use pikepdf to preprocess the PDF file.

from os import path, rename

from pikepdf import Pdf

def fix_file(filename, input_base_dir):
    # Assumes filename ends with a 4-character extension such as ".pdf".
    file_basename = filename[:-4]
    original_input_file_path = path.join(input_base_dir, filename)
    tmp_output_file_path = path.join(
        input_base_dir, file_basename+".pdf.tmp"
    )
    # Backup location for the untouched original file.
    final_input_file_path = path.join(
        input_base_dir, file_basename+".pdf.old"
    )

    # Copy every page into a fresh PDF so pikepdf writes a clean file.
    pdf = Pdf.open(original_input_file_path)
    new_pdf = Pdf.new()
    for page_obj in pdf.pages:
        new_pdf.pages.append(page_obj)
    new_pdf.save(tmp_output_file_path)

    # Keep the original as "*.pdf.old" and put the rewritten file in its place.
    rename(original_input_file_path, final_input_file_path)
    rename(tmp_output_file_path, original_input_file_path)
    print(f"Fixed {filename}")
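When the only problem is junk after the marker, a lighter alternative to rewriting the file may be to cut everything after the last %%EOF and hand the result to the reader as an in-memory stream. This is a different technique from the pikepdf approach above, shown as a minimal standard-library sketch; the helper name `trim_after_last_eof` is hypothetical, and it does not repair files that lack the marker or are not PDFs at all.

```python
import io


def trim_after_last_eof(data: bytes) -> io.BytesIO:
    """Return a stream holding `data` cut just after its last %%EOF marker."""
    marker = b"%%EOF"
    pos = data.rfind(marker)
    if pos < 0:
        # Nothing to trim against; the caller has to handle this case.
        raise ValueError("no %%EOF marker found")
    # Keep the marker itself and end the stream with a single newline.
    return io.BytesIO(data[: pos + len(marker)] + b"\n")


if __name__ == "__main__":
    # "page1.pdf" is a placeholder file name.
    with open("page1.pdf", "rb") as f:
        cleaned = trim_after_last_eof(f.read())
    # `cleaned` can then be passed to PdfFileReader in place of an open file.
```

The original file is left untouched, which avoids the rename steps, but the trimmed copy lives only in memory.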

@guillaume-uH57J9

Hi,
Same issue here with a PDF where %%EOF is not within the last 1 kB; there is actually about 9.5 kB of data after %%EOF.
The document is an invoice provided by a 3rd party, according to metadata it was generated by "dompdf 0.8.6 + CPDF".

It looks like a fix was submitted and a PR has been pending for some years. Is this project no longer maintained?

@MartinThoma MartinThoma added the is-bug and PdfReader labels Apr 7, 2022
@py-pdf py-pdf deleted a comment from claird Apr 7, 2022
@py-pdf py-pdf deleted a comment from preetu098 Apr 7, 2022
@py-pdf py-pdf deleted a comment from preetu098 Apr 7, 2022
@MartinThoma
Member

Can somebody share a PDF that has this issue?

@guillaume-uH57J9

guillaume-uH57J9 commented Apr 14, 2022

Certainly @MartinThoma, you can download it from here:
https://drop.infini.fr/r/_wVhZCtDBy#npqf9V4POgFy1bzo1FGs3zDdXF/c0IZW9Fti7R0jvEo=

GitHub throws the error "Something went really wrong, and we can't process that file." when processing the file, so I had to upload it somewhere else.

Here's how I created the file, in case you want to recreate it locally:

  • Type "Hello world" in a text editor
  • Print to PDF
  • Add some bytes at the end of the file, using the command dd if=/dev/zero bs=1024 count=20 >> helloworld.pdf

Obviously this will produce a dumb PDF, but it should be sufficient to reproduce the error.
I've encountered PDFs in the wild (invoices, etc.) that trigger this same error, but I will not share those because they contain personal information.
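For anyone without dd at hand, the padding step above can be approximated in Python. This is a minimal sketch that assumes helloworld.pdf already exists from the print-to-PDF step; the function name `append_zero_padding` is hypothetical.

```python
from pathlib import Path


def append_zero_padding(pdf_path, kib=20):
    """Append `kib` KiB of zero bytes to the file and return its new size."""
    padding = b"\x00" * (1024 * kib)
    # Open in append mode so the existing PDF content is preserved.
    with open(pdf_path, "ab") as f:
        f.write(padding)
    return Path(pdf_path).stat().st_size


if __name__ == "__main__":
    print(append_zero_padding("helloworld.pdf"))
```

With the default of 20 KiB, the %%EOF marker ends up well outside the last 1 kB of the file, which is the condition described in this thread.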

@MartinThoma MartinThoma added the Has MCVE label Apr 16, 2022
MartinThoma added a commit that referenced this issue Apr 21, 2022
MartinThoma added a commit that referenced this issue Apr 21, 2022
@guillaume-uH57J9

Thanks for the merge!

VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this issue Apr 29, 2022
Try to find “%%EOF” in last 1Mb of file.

This fixes the issue with reading Selenium-generated PDF files.

Closes py-pdf#177
Closes py-pdf#442
Closes py-pdf#480
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this issue Apr 29, 2022