Rarely crash on some PDF #18

Closed
ghaddarAbs opened this issue Dec 28, 2018 · 3 comments

Comments

@ghaddarAbs

Hi,

Great library. I just want to report some rare crashes (more than 20 out of 700k PDFs), not a big deal. I don't know whether this is a bug or whether exceptions are expected for extremely damaged files.

    pdf = pikepdf.open(input_file)
  File "C:\Users.......\lib\site-packages\pikepdf\__init__.py", line 41, in open
    return Pdf.open(*args, **kwargs)
pikepdf._qpdf.PdfError: C:/.......\my_pdf.pdf: unable to find trailer dictionary while recovering damaged file

If it helps, I can send the PDFs by email. We are talking about corrupted PDFs that were generated before 2000 :)

@jbarlow83
Member

The error indicates that:

  1. the file was damaged
  2. libqpdf tried to recover the damaged file, but gave up/exhausted its recovery tools

In my experience this usually happens when a file is truncated. Sometimes you can do manual forensic recovery and extract some content, but it all depends on how the original was structured.
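For readers who want to pre-screen a large batch, here is a minimal sketch of a truncation heuristic. It relies only on the fact that a complete PDF normally ends with a `startxref` keyword, a byte offset, and a `%%EOF` marker. The function name `looks_truncated` and the `tail_size` default are illustrative assumptions, not part of pikepdf or libqpdf, and the check is only a rough signal: a file can pass it and still be damaged elsewhere.

```python
import os


def looks_truncated(path, tail_size=1024):
    """Heuristic: return True if the end of the file lacks the PDF end markers.

    Hypothetical helper for illustration; not a pikepdf or qpdf API.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Read only the last tail_size bytes (or the whole file if it is smaller).
        f.seek(max(0, size - tail_size))
        tail = f.read()
    # A complete PDF normally ends with "startxref", a byte offset, and "%%EOF".
    return b"%%EOF" not in tail or b"startxref" not in tail
```

Files flagged by a check like this are likely candidates for the "unable to find trailer dictionary" error, since the trailer sits at the end of the file and is the first thing lost when a file is cut short.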

You should get an exception and that's expected behavior. If you got a crash, meaning the Python interpreter aborted with a segfault or some other error, I'd like to look at the files.

@ghaddarAbs
Author

ghaddarAbs commented Dec 29, 2018

Yeah, this is why I closed the issue: the documents were extremely damaged. However, I used a try/except to skip those docs.

The attachments contain 4 samples.
documents.zip
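For anyone landing here with the same error, the skip-on-failure approach mentioned above could look roughly like the sketch below. The function `process_all` and its parameter names are hypothetical. The opener and the exception type are passed in as arguments so the sketch does not hard-code a library; with pikepdf they would presumably be `pikepdf.open` and `pikepdf.PdfError`, the exception shown in the traceback.

```python
def process_all(paths, open_pdf, handle, damaged_error):
    """Open each path, pass the result to handle(), and collect skipped files.

    Hypothetical helper for illustration. With pikepdf, open_pdf would be
    pikepdf.open and damaged_error would be pikepdf.PdfError (assumed usage).
    """
    skipped = []
    for path in paths:
        try:
            pdf = open_pdf(path)
        except damaged_error as exc:
            # Expected for unrecoverable files: record the reason and move on.
            skipped.append((path, str(exc)))
            continue
        handle(pdf)
    return skipped
```

Catching only the library's own error type, rather than a bare `except`, keeps unrelated failures (a missing file, a bug in `handle`) visible, and the returned list makes it easy to report how many of the 700k files were skipped and why.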

@ghaddarAbs ghaddarAbs reopened this Dec 29, 2018
@jbarlow83
Member

All 4 of these files appear to be truncated. At a glance the first few pages of text/images might be recoverable from the first two, but that's definitely in the realm of forensic data recovery, not what we're trying to do here.

Thanks for your submission.
