Patch for dealing with badly formed pdfs made on an ipad#73

Closed
comqdhb wants to merge 1 commit into
itext:develop from
comqdhb:develop


@comqdhb comqdhb commented Jun 22, 2021

While reading a pdf annotated on an iPad, the dictionary was created with a String "Name" and not a token.Name. If we accept a String as a valid name, the reader can continue.

This makes me think there should be a way to continue when an error occurs.

@yulian-gaponenko
Member

Thank you for the PR @comqdhb !

I'm afraid we think this change makes parsing too lenient and could have many unexpected and undesirable side effects. It allows essentially any key to be a string (while still parsing it as if it were a name), and it alters the behavior of parsing every dictionary in every PDF file. There are a lot of sharp edges here:

  • names and strings have different encoding rules, so it's not clear how a string key should be handled properly;
  • what would happen if there is a (Name) key and a /Name key at the same time? There are many different ways such a conflict could be handled, and none of them is right;
  • another possible scenario is that some software forgot to add an element, or added one element too many, between << and >>. We have encountered such PDFs as well, and one element more or less means that all following keys become values and all following values become key candidates. In string-valued dictionaries, such leniency can produce more unexpected results than a clear exception would.
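
To make the last two points concrete, here is a minimal, self-contained sketch. It is a toy model and not iText code: tokens are plain strings, "/X" stands for a name and "(X)" for a string, and the class name `DictionaryKeySketch`, its `parse` method, and the `lenient` flag are all hypothetical. It only illustrates how accepting string keys can hide a key collision or an out-of-step key/value pair that a strict check would report.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only (assumption): "/X" is a name token, "(X)" is a string token.
// This does not use or reproduce any iText API.
class DictionaryKeySketch {

    static boolean isName(String token) {
        return token.startsWith("/");
    }

    static boolean isString(String token) {
        return token.length() >= 2 && token.startsWith("(") && token.endsWith(")");
    }

    // Parses alternating key/value tokens.
    // Strict mode: every key must be a name.
    // Lenient mode: a string key such as "(Name)" is accepted and treated like "/Name".
    static Map<String, String> parse(List<String> tokens, boolean lenient) {
        if (tokens.size() % 2 != 0) {
            throw new IllegalArgumentException("Dangling key without a value");
        }
        Map<String, String> dict = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i += 2) {
            String key = tokens.get(i);
            String value = tokens.get(i + 1);
            if (isName(key)) {
                key = key.substring(1);
            } else if (lenient && isString(key)) {
                key = key.substring(1, key.length() - 1);
            } else {
                throw new IllegalArgumentException(
                        "Dictionary key " + key + " at token " + i + " is not a name");
            }
            // A later duplicate silently replaces an earlier entry.
            dict.put(key, value);
        }
        return dict;
    }
}
```

In this sketch, the lenient parse of `(Name) (a) /Name (b)` keeps a single `Name` entry and silently drops one of the two values, and the lenient parse of `/Title (Report) (Bob) /Author` stores `Bob` as a key whose value is `/Author`. The strict parse of either input fails at the first non-name key instead.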

We've also checked how other PDF processors handle such invalid PDF files, and we see that a strict approach is commonly applied.

If this scenario is critical, we suggest customizing the implementation in client code instead. The dictionary structure check is done in PdfReader#readDictionary, which is protected, so one can quite easily introduce a custom PdfReader subclass that overrides that method.
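
The shape of that client-side customization can be sketched as follows. This is only an illustration of the override pattern on a toy stand-in: `StrictReaderSketch`, `StringKeyTolerantReaderSketch`, and the `List<String>` token model are hypothetical, and the method signature here is not the real signature of PdfReader#readDictionary. A real subclass would have to work with iText's own tokenizer and object types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a library reader; this is NOT iText's PdfReader.
// Assumed token model: "/X" is a name, "(X)" is a string.
class StrictReaderSketch {

    // Protected so that client code can override it, like the extension point described above.
    protected Map<String, String> readDictionary(List<String> tokens) {
        if (tokens.size() % 2 != 0) {
            throw new IllegalStateException("Dangling key without a value");
        }
        Map<String, String> dict = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i += 2) {
            String key = tokens.get(i);
            if (!key.startsWith("/")) {
                throw new IllegalStateException("Dictionary key " + key + " is not a name");
            }
            dict.put(key.substring(1), tokens.get(i + 1));
        }
        return dict;
    }
}

// Client-side subclass: leniency is opted into by the application, not by the library.
class StringKeyTolerantReaderSketch extends StrictReaderSketch {

    @Override
    protected Map<String, String> readDictionary(List<String> tokens) {
        List<String> normalized = new ArrayList<>(tokens);
        // Rewrite string tokens in key position as names, then reuse the strict logic.
        for (int i = 0; i + 1 < normalized.size(); i += 2) {
            String key = normalized.get(i);
            if (key.length() >= 2 && key.startsWith("(") && key.endsWith(")")) {
                normalized.set(i, "/" + key.substring(1, key.length() - 1));
            }
        }
        return super.readDictionary(normalized);
    }
}
```

Keeping the relaxed behavior in a subclass means only the application that knows its documents come from the problematic producer pays for the ambiguity. Every other caller keeps the strict check and the clear exception.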

I'm also attaching some examples of such invalid documents that we generated, just to keep them on record here.
Font-Entry_String-Key.pdf
ModDate_String-Key.pdf
Random_String-Key.pdf
