Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmap handling: failure when a beginbfchar pair has a zero-length second element #1111

Closed
dkg opened this issue Jul 14, 2022 · 4 comments
Closed
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@dkg
Copy link
Contributor

dkg commented Jul 14, 2022

I'm building a pdf file with weasyprint that has both english and arabic characters in it. The contents of habibi.html are:

<!DOCTYPE html>
<head>
<meta charset="utf-8">
<title>habibi</title>
</head>
<body>
<div>حَبيبي habibi</div>
</body>
</html>

Environment

I build this with weasyprint 54.1-3 (on debian unstable)

$ weasyprint habibi.html habibi.pdf

This results in habibi.pdf.

Code + PDF

This is a minimal, complete example that shows the issue:

# Python 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0] on linux
# Type "help", "copyright", "credits" or "license" for more information.
>>> from PyPDF2 import PdfReader
>>> r = PdfReader('habibi.pdf')
>>> r.pages[0].extract_text()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/PyPDF2/_page.py", line 1316, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/usr/lib/python3/dist-packages/PyPDF2/_page.py", line 1129, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/lib/python3/dist-packages/PyPDF2/_cmap.py", line 21, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/lib/python3/dist-packages/PyPDF2/_cmap.py", line 246, in parse_to_unicode
    ] = unhexlify(lst[1]).decode(
IndexError: list index out of range
>>> 

This appears to be because habibi.pdf contains this stream:

stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
6 beginbfchar
<0003> <>
<03f2> <>
<0392> <>
<03f4> <>
<02f4> <>
<03a3> <062d064e0628064a0628064a0020>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream

The handling in parse_to_unicode in _cmap.py appears buggy because it can't handle these lines that have an empty anglebracket as the second stage.

(feel free to include habibi.pdf in your test suite of course!)

(this is related to #1110, which i offered on the way to finding this simpler test case)

MartinThoma pushed a commit that referenced this issue Jul 14, 2022
The code within the if block assumes that `lst` has index 0 and index 1.

Fixes #1091
Related to #1111
@dkg
Copy link
Contributor Author

dkg commented Jul 14, 2022

After #1110 is applied, the code sample above doesn't crash, but it yields:

'حَبيبي habibi\x03ϲΒϴΒ˴ حَبيبي '

I'd have expected it to yield something like:

'حَبيبي habibi'

@dkg
Copy link
Contributor Author

dkg commented Jul 15, 2022

fwiw, this is the section of parse_to_unicode() in _cmap.py that i think is problematic for dealing with empty angle brackets:

    ll = cm.split(b"<")
    for i in range(len(ll)):
        j = ll[i].find(b">")
        if j >= 0:
            ll[i] = ll[i][:j].replace(b" ", b"") + b" " + ll[i][j + 1 :]
    cm = (
        (b" ".join(ll))
        .replace(b"[", b" [ ")
        .replace(b"]", b" ]\n ")
        .replace(b"\r", b"\n")
    )

I can't tell what the goal is here -- it looks like the first stanza is trying to take every angle-bracketed object, removing internal whitespace, and then recombine it with any trailing data, separated by whitespace. but then the second stanza rejoins the whole list with whitespace and tries to adjust square-bracketed regions. what if a square bracket appears within an angle bracket?

Seems like it would be better to model this data structure explicitly rather than transforming it back and forth to a modified binary string.

I don't know enough about what PDF permits within a bfchar block to know what the explicit model should be, though, without introducing new potential errors. I'm reading §9.10.3 ("ToUnicode CMaps") of PDF 1.7 to try to understand what kinds of things can be present in these tables.

mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022
The code within the if block assumes that `lst` has index 0 and index 1.

Fixes py-pdf#1091
Related to py-pdf#1111
dkg added a commit to dkg/pypdf that referenced this issue Jul 15, 2022
…char

This is not a principled fix, but it is a hack to avoid a crash when
encountering an empty dstString in a `beginbfchar` table in a
ToUnicode CMap.

The right way to fix this would be to replace all the string
manipulation with a formal grammar, but i don't have the skill or
capacity to do that right now.

Instead, we take narrow aim at the issue of zero-length (empty) hex
string representations.

We take advantage of the fact that no angle-bracket-delimited hex
string contains a . character.  when we encounter an empty hex string,
rather than replacing it with the empty string, we replace it with a
literal ".".  Then, when we encounter a ".", we remember that it was
supposed to be an empty string.

One consequence of this fix is that the exported cmap can now return
an empty string, so we also have to clean up
`PageObject::process_operation` so that it doesn't try to read the
final character from an empty string.

This is a hackish workaround for py-pdf#1111.
@dkg
Copy link
Contributor Author

dkg commented Jul 15, 2022

The more i look at this, and the more i read of the PDF specification (and adobe technical note 5014), the more i think this needs a proper grammar instead of string manipulation.

That said, i'm not prepared to write or offer such a fix, so i'm instead offering a hackish workaround for now at #1118. I'm afraid it increases the technical debt of the project, but it also makes it not crash when reading a pdf that's relatively easy to create.

@dkg
Copy link
Contributor Author

dkg commented Jul 15, 2022

I think this bug should be tagged with MCVE, since habibi.pdf is a minimal reproducer.

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Jul 16, 2022
dkg added a commit to dkg/pypdf2-sample-files that referenced this issue Jul 17, 2022
MartinThoma pushed a commit to py-pdf/sample-files that referenced this issue Jul 17, 2022
Add habibi.pdf from py-pdf/pypdf#1111
Add sample pdf with alternate CMap structure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
Projects
None yet
Development

No branches or pull requests

2 participants