cmap handling: failure when a beginbfchar pair has a zero-length second element
#1111
Comments
After #1110 is applied, the code sample above doesn't crash, but it yields:
I'd have expected it to yield something like:
fwiw, this is the section of
I can't tell what the goal is here -- it looks like the first stanza is trying to take every angle-bracketed object, remove its internal whitespace, and then recombine it with any trailing data, separated by whitespace. But then the second stanza rejoins the whole list with whitespace and tries to adjust square-bracketed regions. What if a square bracket appears within an angle bracket? It seems like it would be better to model this data structure explicitly rather than transforming it back and forth to a modified binary string. I don't know enough about what PDF permits within a
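To illustrate what "modeling the data structure explicitly" could look like, here is a minimal sketch that tokenizes hex strings directly instead of rewriting the byte string. The names `hex_token_to_bytes` and `parse_bfchar_body` are hypothetical and are not part of pypdf. The sketch assumes the input is only the body between `beginbfchar` and `endbfchar`, and that destination strings are UTF-16BE, as is usual for ToUnicode CMaps.

```python
import re
from typing import Dict, List

# Hypothetical helpers for illustration only; not pypdf's API.
# Matches one angle-bracketed hex string, including the empty one "<>".
HEX_STRING = re.compile(rb"<([0-9A-Fa-f\s]*)>")


def hex_token_to_bytes(digits: bytes) -> bytes:
    """Decode the inside of a <...> hex string; an empty string yields b""."""
    cleaned = re.sub(rb"\s+", b"", digits).decode("ascii")
    if len(cleaned) % 2:
        # PDF treats a missing final hex digit as 0.
        cleaned += "0"
    return bytes.fromhex(cleaned)


def parse_bfchar_body(body: bytes) -> Dict[bytes, str]:
    """Map each source code to its Unicode string for one bfchar body."""
    tokens: List[bytes] = [
        hex_token_to_bytes(m.group(1)) for m in HEX_STRING.finditer(body)
    ]
    if len(tokens) % 2:
        raise ValueError("bfchar body must contain src/dst pairs")
    mapping: Dict[bytes, str] = {}
    for src, dst in zip(tokens[0::2], tokens[1::2]):
        # Assumption: destination strings are UTF-16BE; empty stays empty.
        mapping[src] = dst.decode("utf-16-be")
    return mapping
```

Because each `<...>` token is matched on its own, an empty destination string is just another token and cannot shift the pairing of the entries that follow it.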
The code within the if block assumes that `lst` has index 0 and index 1. Fixes py-pdf#1091 Related to py-pdf#1111
…char This is not a principled fix, but it is a hack to avoid a crash when encountering an empty dstString in a `beginbfchar` table in a ToUnicode CMap. The right way to fix this would be to replace all the string manipulation with a formal grammar, but i don't have the skill or capacity to do that right now. Instead, we take narrow aim at the issue of zero-length (empty) hex string representations. We take advantage of the fact that no angle-bracket-delimited hex string contains a `.` character. When we encounter an empty hex string, rather than replacing it with the empty string, we replace it with a literal `.`. Then, when we encounter a `.`, we remember that it was supposed to be an empty string. One consequence of this fix is that the exported cmap can now return an empty string, so we also have to clean up `PageObject::process_operation` so that it doesn't try to read the final character from an empty string. This is a hackish workaround for py-pdf#1111.
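The placeholder idea in this commit message can be sketched roughly as follows. This is a simplified illustration, not the actual patch: `split_tokens`, `decode_dst`, and `last_char_or_empty` are hypothetical names, and the real code in `_cmap.py` is structured differently.

```python
import re
from typing import List

# Hypothetical names for illustration; the real patch differs in detail.
EMPTY_HEX = re.compile(rb"<\s*>")
PLACEHOLDER = b"."  # safe marker: "." is never a valid hex digit


def split_tokens(line: bytes) -> List[bytes]:
    """Strip angle brackets and split on whitespace, keeping empty strings."""
    # Without this substitution, "<0002> <>" would split into a single
    # token, and any code that reads index 1 would fail.
    marked = EMPTY_HEX.sub(PLACEHOLDER, line)
    return marked.replace(b"<", b" ").replace(b">", b" ").split()


def decode_dst(token: bytes) -> str:
    """Turn a destination token back into text; the marker means empty."""
    if token == PLACEHOLDER:
        return ""
    # Assumption: destination strings are UTF-16BE.
    return bytes.fromhex(token.decode("ascii")).decode("utf-16-be")


def last_char_or_empty(text: str) -> str:
    """Read the final character without failing on an empty string."""
    return text[-1] if text else ""
```

The last helper mirrors the second half of the workaround: once the cmap can return an empty string, any caller that reads the final character needs a guard like this.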
The more i look at this, and the more i read of the PDF specification (and Adobe Technical Note 5014), the more i think this needs a proper grammar instead of string manipulation. That said, i'm not prepared to write or offer such a fix, so i'm instead offering a hackish workaround for now at #1118. I'm afraid it increases the technical debt of the project, but it also keeps it from crashing when reading a PDF that's relatively easy to create.
I think this bug should be tagged with MCVE, since
Add habibi.pdf from py-pdf/pypdf#1111 Add sample pdf with alternate CMap structure
I'm building a PDF file with weasyprint that has both English and Arabic characters in it. The contents of habibi.html are:
Environment
I built this with weasyprint 54.1-3 (on Debian unstable).
This results in habibi.pdf.
Code + PDF
This is a minimal, complete example that shows the issue:
This appears to be because `habibi.pdf` contains this stream:
The handling in `parse_to_unicode` in `_cmap.py` appears buggy because it can't handle these lines that have an empty angle-bracket pair as the second element.
(Feel free to include `habibi.pdf` in your test suite, of course!)
(This is related to #1110, which i offered on the way to finding this simpler test case.)