Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only inform about unknown PDF Name prefixes #807

Merged

Conversation

prettybits
Copy link
Contributor

Also related to #668, unknown prefixes are currently treated as validation errors, but I'm not sure why. The example files in the linked issue report the following error: ErrorMessage: Unknown Developer Prefix:: ESIC, but I haven't been able to find anything specific about the prefix. It is not currently listed in the official prefix name registry (which I have also updated and sorted alphabetically on this occasion).

I don't currently have access to the formal specification (the link in the comments is being redirected by now), so I'm not sure what specifically it says about prefixes and how to treat them re validity. Can you help me out with some quotes, @carlwilson?

While searching around the codebase for the validation message I also found that it is currently wrongly used for version parsing errors in the PDF header, which I changed to the intended one. If you prefer I can split these commits up to separate PRs, I just combined them for now since they seemed thematically related enough.

@prettybits prettybits changed the title update PDF Name prefix list and sort alphabetically Only inform about unknown PDF Name prefixes Dec 19, 2022
Copy link
Collaborator

@samalloing samalloing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as #778 (I mean commit "correct message for PDF header version parsing error")

@samalloing
Copy link
Collaborator

Hi @prettybits

I used error because in the PDF specification it is stated in section 7.12 Extensions Dictionary: "The keys in the extensions
dictionary shall be names consisting only of the registered prefixes, described in Annex E, of the developers
whose extensions are being used". Annex E is about the Name registry. That's why I used an ErrorMessage instead of an InfoMessage. So the Prefix needs to be registered.

But thanks for updating the registry!

Sam

@prettybits
Copy link
Contributor Author

Hi @samalloing, thanks for reviewing! I didn't see the preexisting PR for the PDF-HUL155 issue, I will rebase the branch without the commit.

I did in the meantime also find the specification document, so I can follow the official text as well. The rules about developer extensions key names do seem somewhat unclear to me I have to say.

Chapter E.1 "General" says

Developer prefixes shall be used to identify extensions to PDF that use First Class names (see below) and that are
intended for public use. (See 7.12.2, “Developer Extensions Dictionary.)

and Chapter I.3 "Feature Compatibility" mentions

See 7.12.2, “Developer Extensions Dictionary” for a discussion of how to designate the use of
public extensions in PDF file.

which makes it seem like this is more about "public" extensions, although it also sounds like per E.2 private extensions should then still use a XX prefix. I'm not sure that treating unregistered prefixes as a validity error is really helping in practice (I assume it is quite likely that there exists proprietary? reading software that knows how to treat the improperly named private extension), but I suppose strict conformance supports the current handling.

Shall I rebase the PR with just the registry update commit intact?

@prettybits
Copy link
Contributor Author

I just noticed as well that 7.12.4 uses "GLGR" as an extension key in its examples, which is not even in the registered name list, don't know what to make of that.

I also wonder if @tballison may have thoughts about this?

@carlwilson
Copy link
Member

OK, I've been working my way through the earlier PDF module PRs but these are on my radar now. I'm going to raise one or two of these questions with our resident PDF expert and will add any answers here. I'd like to get this merged and working for the upcoming 1.28 release.

@prettybits
Copy link
Contributor Author

Thanks, @carlwilson!

In the meantime I have rebased my branch, removed the now redundant fix and re-added the "ADBE" prefix which I noticed I mistakenly left out previously.

@tballison
Copy link

I also wonder if @tballison may have thoughts about this?

I defer to digipres on use cases for digipres and actual pdf experts on the spec.

Personal opinion only: I'd want to be alerted to public extensions, private extensions and non-standard extensions. For some use cases, I imagine that you'd not want to allow any extensions because of the risk of data leakage and or vendor-specific behavior that can't be replicated.

@carlwilson
Copy link
Member

I've read the specification document and I'm in agreement with your reading @prettybits in that unregistered prefixes should begin xx but if they do then the document isn't invalid. I think I know the issue with GLGR as well. I think that it's this company https://whattheythink.com/news/18038-global-graphics-launches-jaws-pdf-library-com-edition/, but that link is 2003. Reading on the Adobe names list GitHub site here: https://github.com/adobe/pdf-names-list#additional-details it says

NOTE: Due to legal reasons, Adobe maintains the list of all registrations that occured prior to that 2008.

I suspect this is one of those, which raises the possibility of registered names that aren't on the list. I'm starting to favour a warning for unknown prefixes and xx ones but not marking the document as invalid.

I've asked someone who knows more about PDF than me to give some advice. I also noticed we're one registration light for "DLLB" added on Dec 12 2022. By coincidence, it's the person I've asked to help.

@codecov
Copy link

codecov bot commented Jan 20, 2023

Codecov Report

Base: 29.18% // Head: 46.22% // Increases project coverage by +17.04% 🎉

Coverage data is based on head (291f7e6) compared to base (0dba54e).
Patch has no changes to coverable lines.

Additional details and impacted files
@@                Coverage Diff                 @@
##             integration     #807       +/-   ##
==================================================
+ Coverage          29.18%   46.22%   +17.04%     
- Complexity           597     1056      +459     
==================================================
  Files                 57       57               
  Lines               9057     9057               
  Branches            1607     1607               
==================================================
+ Hits                2643     4187     +1544     
+ Misses              6033     4330     -1703     
- Partials             381      540      +159     
Impacted Files Coverage Δ
.../java/edu/harvard/hul/ois/jhove/ConfigHandler.java 61.41% <0.00%> (+0.78%) ⬆️
...ava/org/openpreservation/jhove/ReleaseDetails.java 82.75% <0.00%> (+1.72%) ⬆️
...n/java/edu/harvard/hul/ois/jhove/PropertyType.java 100.00% <0.00%> (+4.76%) ⬆️
...rvard/hul/ois/jhove/messages/JhoveMessageImpl.java 21.42% <0.00%> (+4.76%) ⬆️
...a/edu/harvard/hul/ois/jhove/NisoImageMetadata.java 76.19% <0.00%> (+9.32%) ⬆️
.../java/edu/harvard/hul/ois/jhove/PropertyArity.java 100.00% <0.00%> (+10.00%) ⬆️
...in/java/edu/harvard/hul/ois/jhove/HandlerBase.java 71.42% <0.00%> (+11.56%) ⬆️
...src/main/java/edu/harvard/hul/ois/jhove/Agent.java 46.82% <0.00%> (+13.49%) ⬆️
...main/java/edu/harvard/hul/ois/jhove/JhoveBase.java 41.29% <0.00%> (+18.02%) ⬆️
...c/main/java/edu/harvard/hul/ois/jhove/RepInfo.java 54.61% <0.00%> (+21.53%) ⬆️
... and 24 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@prettybits
Copy link
Contributor Author

Good hunting, @carlwilson, Global Graphics also looks like the right candidate to me for the GLGR prefix. In the names list there is one entry that has a listed creation date before 2008: Plib for PDFlib GmbH, and I'm also not quite sure how to parse that note by Adobe. It sounds like (almost all) those names are known to Adobe only and cannot be publically listed in the repository? It's quite certain to not be a complete list in any case it would seem.

I also added the DLLB prefix now, that was commited a few hours after I opened this PR. ;)

As to validity, I think that even if the prefix names list were complete, unless we knew how to validate what the extension actually means to extend we wouldn't really win anything by being strict. I agree with @tballison that being notified about any extension no matter how official can be very valuable, but that's not a question of validity but more of policy. But that is personal opinion territory, let's see what Peter says. :)

@carlwilson
Copy link
Member

I've now had confirmation that not all of the names on the pre-2008 list are public, so there are legitimate prefixes that aren't on the list. A better explanation of the implications of the message on the Wiki and an informational message is the way to go here. I will merge this and add some better background to the Wiki. Thanks for the PR @prettybits

@carlwilson carlwilson merged commit 47826c0 into openpreserve:integration Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants