String type and encoding in name trees #214
Comments
Interesting... Can't we side-step the whole encoding issue and mandate that the keys shall be sorted as byte strings in lexicographic order? That doesn't require any encoding. In pseudo-code-ish:
This is completely unambiguous and avoids the whole quagmire of having to even think about encodings for name trees that are defined as having binary entries (the
I suspect (hope?) that the scheme I cooked up above agrees with what most implementations would do given >99.9% of real-world inputs for name tree keys anyhow.
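As a minimal illustration of byte-wise lexicographic ordering (an independent sketch, not the commenter's own pseudo-code), the following Python snippet sorts raw key bytes and looks one up by binary search without decoding anything. The sample keys, the `lookup` helper, and the flat-list layout are assumptions made for the example; a real name tree is split across Kids/Limits nodes.

```python
from bisect import bisect_left

def lookup(keys, values, key):
    """Binary-search a sorted list of raw key bytes; no decoding is involved."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

# Hypothetical raw key bytes, as they might look after PDF string parsing.
raw = {
    b"\xfe\xff\x00A": "utf16-A",   # UTF-16BE text string with BOM
    b"A": "ascii-A",               # ASCII / PDFDoc-style string
    b"\x00\x01": "binary",         # arbitrary byte string
    b"AB": "ascii-AB",
}

# Python compares bytes objects element by element as unsigned integers,
# and a strict prefix sorts before the longer string.
keys = sorted(raw)
values = [raw[k] for k in keys]
```

Because the comparison never interprets the bytes as text, the same code handles text-keyed and binary-keyed trees alike.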
I can't get used to the (pdftron) part of the name @tmerzpdftron ... First off, I would hope that we can all agree that we do not want to change existing behavior or definition here, because that will break existing processing of files. That said, I am fine trying to find improved text that makes things less ambiguous for both creators and processors (reading & modification).
Sure, but isn't the current text too ambiguous to allow for any workable assumptions regarding the order anyhow? I also don't really see a good way to reconcile the current text with the notion of binary-keyed name trees, unless the cited portion of the spec excludes those. Also, "lexical order" is not even straightforward to define for text strings; the complexity of the Unicode collation algorithm comes to mind, along with many different language-specific conventions related to the way text should be sorted. To be completely honest, I'm not sure whether it's even possible to make this text less ambiguous without somehow abandoning the notion of comparing name tree keys as text strings... In fact, from a processor's point of view, there's not much difference between the current language and "the keys should be ordered in a reasonable manner".
OK @MatthiasValvekens - perhaps let's start with a "well known implementation". In that implementation, when comparing two values to determine equality or ordering, it doesn't care about text vs. hex strings. Both need to be objects of type String (vs. Number, for a number tree or something else, which is an error). It then retrieves the bytes of each of the strings (
Actually, I think that's pretty much the "algorithm" I wrote in my first comment, no? :) (well, the outcome should be the same)
Is there any chance that "lexical" was originally used as a synonym for "lexicographic", and that the language about encodings was tacked on later in an attempt to clarify? |
Same here
Fully agreed. Another good reason for avoiding the term lexical in order to not further confuse the issue.
I think there's not much to discuss regarding the consumer side since the above is the obvious approach. It doesn't make sense to force the consumer to apply any encoding-related heuristics (such as expanding PDFDoc-encoded text before comparing it with UTF-16 text). Also, it's impossible to distinguish byte strings vs. PDFDoc strings by looking at it (at least in the general case). The main clarification is for the producer side. The common optimization for emitting text strings:
must not be applied for name tree keys since it might spoil sort order if mixed encodings are used. Unlike in all other situations, the keys for name trees must be normalized to the same encoding to ensure reliable sorting. Accidentally emitting mixed PDFDoc vs. UTF-16BE keys triggered the current issue. I tried to do a quick test with "a well-known implementation", but apparently it doesn't like JavaScript called "ā" (U+0101) and converts it to ASCII "a".
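To make the producer-side hazard concrete, here is a hedged sketch. It assumes the optimization meant here is "emit PDFDocEncoding when the text fits, otherwise fall back to UTF-16BE with a BOM", and it uses Python's `latin-1` codec as a rough stand-in for PDFDocEncoding (the two are not identical). The function names `emit_optimized` and `emit_normalized` are hypothetical.

```python
BOM_UTF16BE = b"\xfe\xff"

def emit_optimized(text):
    """Per-string optimization: single-byte encoding if possible, else UTF-16BE."""
    try:
        return text.encode("latin-1")  # stand-in for PDFDocEncoding (assumption)
    except UnicodeEncodeError:
        return BOM_UTF16BE + text.encode("utf-16-be")

def emit_normalized(text):
    """Always emit UTF-16BE with BOM, so all keys share one encoding."""
    return BOM_UTF16BE + text.encode("utf-16-be")

# Producer sorts the names as text (code point order): U+00FF before U+0101.
names = sorted(["\u00ff", "\u0101"])

mixed = [emit_optimized(n) for n in names]
normalized = [emit_normalized(n) for n in names]

# A consumer comparing raw bytes expects the emitted keys to be in byte order.
mixed_is_sorted = mixed == sorted(mixed)
normalized_is_sorted = normalized == sorted(normalized)
```

With mixed encodings the key for U+00FF is the single byte `0xFF`, which sorts after the BOM byte `0xFE` that starts the UTF-16BE key, so the emitted order no longer matches byte order. Normalizing every key to one encoding keeps the two orders in agreement for this input.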
Right, when you put it like that, it makes perfect sense. Thanks!
Another finding: Table 32 "Entries in the name dictionary" talks about "name strings" which are mapped to destinations, JavaScript etc. But what is a name string? Name trees are defined in terms of string objects and we have clearly defined subtypes of string - not including "name string". I think "name string" should be replaced with string in Table 32 to avoid the use of an undefined type designator. There are "name" and "string" objects, but not "name strings". The term is used in several other places as well.
Marching further into interesting territory... Table 32, row "Renditions" specifies the only name tree with a specific encoding requirement: "A name tree mapping name strings (which shall have Unicode encoding) to rendition objects." (*) This data structure requirement corresponds to the usage requirement in Table 277 "Entries common to all rendition dictionaries" where the /N entry holds a rendition name: "A Unicode-encoded text string specifying the name of the rendition for use in a user interface and for name tree lookup by ECMAScript actions." The goal is apparently to ensure that a particular rendition name can be found in the name tree regardless of encoding issues. This breaks in PDF 2.0 since "Unicode encoding" may mean UTF-16BE or UTF-8. Mixing both encodings would spoil retrieval of renditions by name. Both instances of "Unicode" should be qualified as either UTF-16BE or UTF-8, unless we come up with a general requirement for all name trees (which is somehow intended with the current "self-consistent" phrase).
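A small sketch of why mixing the two Unicode encodings defeats lookup by name, assuming the consumer compares raw bytes. The rendition name `Intro` and the dictionary standing in for the name tree are made up for the example.

```python
name = "Intro"  # hypothetical rendition name

# The same text as a PDF text string in the two Unicode encodings of PDF 2.0.
as_utf16be = b"\xfe\xff" + name.encode("utf-16-be")   # BOM FE FF
as_utf8 = b"\xef\xbb\xbf" + name.encode("utf-8")      # BOM EF BB BF

# Stand-in for a Renditions name tree keyed by raw string bytes.
tree = {as_utf16be: "rendition object"}

found_same = tree.get(as_utf16be)   # key and query share an encoding
found_other = tree.get(as_utf8)     # same text, different bytes

# Decoded, the two keys are the same text.
same_text = as_utf16be[2:].decode("utf-16-be") == as_utf8[3:].decode("utf-8")
```

The UTF-8 query misses even though both byte sequences decode to the same name, which is the retrieval failure described above.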
Ouch... That's a nasty one. I suppose the backwards-compatible thing to do would be to mandate UTF-16BE, then... :/ EDIT: For the Renditions tree, that is.
Agreed. (and given the limited use of that feature, that's OK).
So trying to summarize the various outcomes from this discussion as proposed solutions:
I could not see anywhere in 13.2.3 Renditions that needed rewording. Did I miss anything?
See my comment above:
Here the same change Unicode ==> UTF-16BE is appropriate.
The PDF TWG agrees with the proposals in the last two comments.
Regarding the strings used as key in a name tree, ISO 32000-2:2020, Table 36 "Entries in a name tree node dictionary" mandates that
The referenced descriptive text below the table reads as follows:
The term lexical(ly), which is used twice, is meaningful only in combination with a fixed encoding. However, the requirement self-consistent is not very clear, and once an encoding is fixed the term lexical doesn't add any requirement or explanatory value. It may actually be wrong if two byte strings are compared numerically where one string accidentally starts with a BOM and might be interpreted as text (one might argue that "lexical" is undefined anyway for byte strings).
Proposed clarification
In the paragraph below Table 36 change
"Any encoding of the keys may be used as long as it is self-consistent;"
to
"All keys shall use the same string type and encoding (see Figure 7 "Relationship between string types");"
For text strings this clarifies the intended existing requirement of having the same encoding. For ASCII strings and byte strings the "same type" requirement is necessary to avoid incorrect comparison.
For example, the string "A" may be represented as an ASCII string or as a UTF-8 string (with BOM) as follows:
ASCII string: 0x41
UTF-8 string: 0xEF 0xBB 0xBF 0x41
Lexically both strings are identical, but numerically they are not.
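The two representations can be checked directly. This sketch uses Python's `utf-8-sig` codec, which drops a leading UTF-8 BOM when decoding; the extra key "B" is added only to show where the two "A" keys land in byte order.

```python
ascii_a = bytes([0x41])                    # ASCII string "A"
utf8_a = bytes([0xEF, 0xBB, 0xBF, 0x41])   # UTF-8 string "A" with BOM
ascii_b = bytes([0x42])                    # ASCII string "B" (added for contrast)

# Interpreted as text, the two "A" keys are equal.
same_text = ascii_a.decode("ascii") == utf8_a.decode("utf-8-sig")

# Compared as bytes, they differ.
same_bytes = ascii_a == utf8_a

# In byte order the two "A" keys end up on opposite sides of "B".
order = sorted([utf8_a, ascii_b, ascii_a])
```

A tree that mixes both forms can therefore hold what looks like the same key twice, in positions that a text-based comparison would not expect.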
Proposed editorial changes
Table 36, row "Names": change
"The keys shall be sorted in lexical order, as described below."
to
"The keys shall be sorted as described below."
In the paragraph below Table 36 change
"...shall be sorted lexically in ascending order by key."
to
"...shall be sorted in ascending order by key."