Parse PDFs with broken xref section #101

Open
mythrnr opened this issue Aug 11, 2021 · 11 comments
Labels
enhancement xref XRef related stuff

Comments

mythrnr commented Aug 11, 2021

Hi,

I got a panic when loading some PDFs. They display fine in the viewer. I have attached an example PDF that reproduces the problem. Is there anything I can do about this kind of PDF?

Thanks.

invalid.pdf

Contributor

s3bk commented Aug 13, 2021

Hello, you already did the next best thing to fixing it yourself. Seems something is fishy with the xrefs, but I have not looked at the details yet.

Author

mythrnr commented Aug 13, 2021

Thanks for your message.

Yes, I can fix the xrefs manually.
But this runs as part of an automated process, and in rare cases users upload this kind of PDF.

While debugging, I found code that adds first_id to the array index and returns the result:
https://github.com/pdf-rs/pdf/blob/master/pdf/src/xref.rs#L202

That code is called from the following lines, where the shifted index causes the panic:
https://github.com/pdf-rs/pdf/blob/master/pdf/src/xref.rs#L108-L110

I hope this is of some help to you.
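
The index shift described above can be illustrated with a small sketch. This is not the crate's code: `Table`, `Entry`, and `add_section` are hypothetical names, and the layout (a table sized up front, sections whose entries start at `first_id`) is an assumption made for illustration. The point is only that computing `first_id + i` and then looking the slot up with `get_mut` yields an error for an out-of-range id, where a plain `entries[id]` would panic.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Free,
    InUse { offset: u64 },
}

/// Hypothetical xref table with one slot per object id.
struct Table {
    entries: Vec<Entry>,
}

impl Table {
    fn new(size: usize) -> Self {
        Table { entries: vec![Entry::Free; size] }
    }

    /// Copy a section into the table. Entry `i` of the section belongs to
    /// object id `first_id + i`. Returns an error instead of panicking when
    /// a broken `first_id` pushes an id past the end of the table.
    fn add_section(&mut self, first_id: usize, section: &[Entry]) -> Result<(), String> {
        let size = self.entries.len();
        for (i, entry) in section.iter().enumerate() {
            let id = first_id
                .checked_add(i)
                .ok_or_else(|| "object id overflow".to_string())?;
            match self.entries.get_mut(id) {
                Some(slot) => *slot = entry.clone(),
                None => return Err(format!("object id {} out of range (table size {})", id, size)),
            }
        }
        Ok(())
    }
}

fn main() {
    let mut table = Table::new(4);

    // Well-formed section: ids 1 and 2 fit into a table of size 4.
    let ok = table.add_section(1, &[Entry::InUse { offset: 17 }, Entry::InUse { offset: 90 }]);
    println!("{:?}", ok);

    // Broken section: the second entry maps to id 4, which is outside the table.
    let bad = table.add_section(3, &[Entry::InUse { offset: 5 }, Entry::InUse { offset: 6 }]);
    println!("{:?}", bad);
}
```

With this shape the caller decides what to do with a malformed section (skip it, abort loading, or try a fallback), and the library never panics on its own.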

Contributor

s3bk commented Aug 13, 2021

Not sure how to handle the file. It is quite broken.

Contributor

s3bk commented Aug 13, 2021

The xref section is broken.
The alternative is to ignore it and parse the entire file and build up a new xref section.

Author

mythrnr commented Aug 17, 2021

Please excuse the delay in replying.

Since the files open fine in the viewer, I'm guessing that the viewer either ignores the xref sections or does not handle them as strictly as it should.
I would like to hear your opinion: will the crate keep parsing strictly as it does now, or is there a chance it will handle such files the way the viewer does?

Just as importantly, I would like to at least avoid the panic.
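
Until a parser stops panicking on such input, one caller-side stopgap is to wrap the call in `std::panic::catch_unwind`. The sketch below is only an illustration: `parse_strict` is a hypothetical stand-in for any parsing call that may panic, and `parse_guarded` is a made-up wrapper name. It assumes the default `panic = "unwind"` setting; with `panic = "abort"` the process still terminates.

```rust
use std::panic;

/// Hypothetical stand-in for a parser that indexes without bounds checks.
fn parse_strict(data: &[u8]) -> usize {
    let table = [0usize; 4];
    let id = data.len(); // pretend this id came from a broken xref section
    table[id] + id // panics when id >= 4
}

/// Run the parser and convert a panic into an ordinary error value.
fn parse_guarded(data: &[u8]) -> Result<usize, String> {
    panic::catch_unwind(|| parse_strict(data))
        .map_err(|_| "parser panicked on this input".to_string())
}

fn main() {
    println!("{:?}", parse_guarded(&[0, 0]));
    // The default panic hook still prints the panic message to stderr here.
    println!("{:?}", parse_guarded(&[0; 9]));
}
```

This keeps an upload pipeline alive when a rare broken file arrives, but it is a workaround and no substitute for the parser returning an error itself.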

Contributor

s3bk commented Aug 17, 2021

The xref section is broken, so it can't be used to access data in the file efficiently.
Another slower approach is to parse the entire file into primitives, but that is not implemented yet.

It does not panic anymore as of baa9235.

Author

mythrnr commented Aug 17, 2021

Thank you so much for fixing that!

PDFs with broken xrefs like this are probably really rare, so I can't insist that you spend the effort to parse the entire file. I would be very happy to see that implemented as a fallback, though.

s3bk added the enhancement and xref (XRef related stuff) labels on Sep 27, 2021
s3bk changed the title from "I got a panic when loading some PDFs." to "Parse PDFs with broken xref section" on Oct 20, 2021
@Dushistov

I tried this crate on my PDF collection. It cannot parse ~100 of the 11866 files
because of various xref bugs. C/C++ libraries (mupdf, poppler) work just fine for all of the files.

Contributor

s3bk commented Aug 29, 2022

That is to be expected.
This crate is not a translation of an existing C/C++ library, but an implementation from scratch.

@Dushistov

@s3bk

> This crate is not a translation of an existing C/C++ library

I know. In my comment I was trying to point out that all sufficiently "mature" libraries have a fallback for when there are problems with the xref,
and can handle such files. So it would be great if a pure Rust library had this feature too.

Contributor

s3bk commented Aug 29, 2022

The simple solution is to build a new xref section from the file data.
But it is not implemented yet.
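
The idea of building a new xref section from the file data can be sketched roughly as follows. This is a minimal illustration and not the crate's implementation (which, per the comment above, does not exist yet). The names `rebuild_xref` and `parse_header` are hypothetical, and the sketch assumes every object header has the form `<number> <generation> obj` on a line of its own. A real fallback would also need to handle headers that share a line with other tokens, object streams, and byte sequences inside stream data that merely look like headers.

```rust
use std::collections::BTreeMap;

/// Parse a line of the form `<number> <generation> obj`.
fn parse_header(line: &[u8]) -> Option<(u32, u16)> {
    let text = std::str::from_utf8(line).ok()?;
    let mut parts = text.split_ascii_whitespace();
    let number: u32 = parts.next()?.parse().ok()?;
    let generation: u16 = parts.next()?.parse().ok()?;
    // Require exactly the keyword `obj` and nothing after it.
    if parts.next()? == "obj" && parts.next().is_none() {
        Some((number, generation))
    } else {
        None
    }
}

/// Scan the whole file line by line and map each object number to
/// (generation, byte offset of the line holding its header).
/// Later definitions overwrite earlier ones, so objects redefined by an
/// incremental update appended to the file take precedence.
fn rebuild_xref(data: &[u8]) -> BTreeMap<u32, (u16, usize)> {
    let mut table = BTreeMap::new();
    let mut pos = 0;
    while pos < data.len() {
        // A line ends at the next CR or LF, or at the end of the data.
        let line_end = data[pos..]
            .iter()
            .position(|&b| b == b'\n' || b == b'\r')
            .map_or(data.len(), |i| pos + i);
        if let Some((number, generation)) = parse_header(&data[pos..line_end]) {
            table.insert(number, (generation, pos));
        }
        pos = line_end + 1;
    }
    table
}

fn main() {
    let data = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n";
    for (number, (generation, offset)) in rebuild_xref(data) {
        println!("object {} gen {} at byte {}", number, generation, offset);
    }
}
```

A loader could try the strict xref path first and call something like this only when that path fails, so well-formed files keep the fast lookup and only broken ones pay for the full scan.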
