cmap handling: failure when a `beginbfchar` pair has a zero-length second element #1111

dkg · 2022-07-14T18:56:22Z

I'm building a pdf file with weasyprint that has both english and arabic characters in it. The contents of habibi.html are:

<!DOCTYPE html>
<head>
<meta charset="utf-8">
<title>habibi</title>
</head>
<body>
<div>حَبيبي habibi</div>
</body>
</html>

Environment

I build this with weasyprint 54.1-3 (on debian unstable)

$ weasyprint habibi.html habibi.pdf

This results in habibi.pdf.

Code + PDF

This is a minimal, complete example that shows the issue:

# Python 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0] on linux
# Type "help", "copyright", "credits" or "license" for more information.
>>> from PyPDF2 import PdfReader
>>> r = PdfReader('habibi.pdf')
>>> r.pages[0].extract_text()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/PyPDF2/_page.py", line 1316, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/usr/lib/python3/dist-packages/PyPDF2/_page.py", line 1129, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/lib/python3/dist-packages/PyPDF2/_cmap.py", line 21, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/lib/python3/dist-packages/PyPDF2/_cmap.py", line 246, in parse_to_unicode
    ] = unhexlify(lst[1]).decode(
IndexError: list index out of range
>>>

This appears to be because habibi.pdf contains this stream:

stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
6 beginbfchar
<0003> <>
<03f2> <>
<0392> <>
<03f4> <>
<02f4> <>
<03a3> <062d064e0628064a0628064a0020>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream

The handling in parse_to_unicode in _cmap.py appears buggy because it can't handle these lines that have an empty anglebracket as the second stage.

(feel free to include habibi.pdf in your test suite of course!)

(this is related to #1110, which i offered on the way to finding this simpler test case)

The text was updated successfully, but these errors were encountered:

The code within the if block assumes that `lst` has index 0 and index 1. Fixes #1091 Related to #1111

dkg · 2022-07-14T20:11:59Z

After #1110 is applied, the code sample above doesn't crash, but it yields:

'حَبيبي habibi\x03ϲΒϴΒ˴ حَبيبي '

I'd have expected it to yield something like:

'حَبيبي habibi'

dkg · 2022-07-15T16:20:06Z

fwiw, this is the section of parse_to_unicode() in _cmap.py that i think is problematic for dealing with empty angle brackets:

    ll = cm.split(b"<")
    for i in range(len(ll)):
        j = ll[i].find(b">")
        if j >= 0:
            ll[i] = ll[i][:j].replace(b" ", b"") + b" " + ll[i][j + 1 :]
    cm = (
        (b" ".join(ll))
        .replace(b"[", b" [ ")
        .replace(b"]", b" ]\n ")
        .replace(b"\r", b"\n")
    )

I can't tell what the goal is here -- it looks like the first stanza is trying to take every angle-bracketed object, removing internal whitespace, and then recombine it with any trailing data, separated by whitespace. but then the second stanza rejoins the whole list with whitespace and tries to adjust square-bracketed regions. what if a square bracket appears within an angle bracket?

Seems like it would be better to model this data structure explicitly rather than transforming it back and forth to a modified binary string.

I don't know enough about what PDF permits within a bfchar block to know what the explicit model should be, though, without introducing new potential errors. I'm reading §9.10.3 ("ToUnicode CMaps") of PDF 1.7 to try to understand what kinds of things can be present in these tables.

The code within the if block assumes that `lst` has index 0 and index 1. Fixes py-pdf#1091 Related to py-pdf#1111

…char This is not a principled fix, but it is a hack to avoid a crash when encountering an empty dstString in a `beginbfchar` table in a ToUnicode CMap. The right way to fix this would be to replace all the string manipulation with a formal grammar, but i don't have the skill or capacity to do that right now. Instead, we take narrow aim at the issue of zero-length (empty) hex string representations. We take advantage of the fact that no angle-bracket-delimited hex string contains a . character. when we encounter an empty hex string, rather than replacing it with the empty string, we replace it with a literal ".". Then, when we encounter a ".", we remember that it was supposed to be an empty string. One consequence of this fix is that the exported cmap can now return an empty string, so we also have to clean up `PageObject::process_operation` so that it doesn't try to read the final character from an empty string. This is a hackish workaround for py-pdf#1111.

dkg · 2022-07-15T23:11:54Z

The more i look at this, and the more i read of the PDF specification (and adobe technical note 5014), the more i think this needs a proper grammar instead of string manipulation.

That said, i'm not prepared to write or offer such a fix, so i'm instead offering a hackish workaround for now at #1118. I'm afraid it increases the technical debt of the project, but it also makes it not crash when reading a pdf that's relatively easy to create.

dkg · 2022-07-15T23:16:49Z

I think this bug should be tagged with MCVE, since habibi.pdf is a minimal reproducer.

Add habibi.pdf from py-pdf/pypdf#1111 Add sample pdf with alternate CMap structure

MartinThoma pushed a commit that referenced this issue Jul 14, 2022

BUG: Avoid IndexError in _cmap.parse_to_unicode (#1110)

bb2d1db

The code within the if block assumes that `lst` has index 0 and index 1. Fixes #1091 Related to #1111

dkg mentioned this issue Jul 14, 2022

_cmap.py: avoid IndexError in parse_to_unicode #1110

Merged

dkg mentioned this issue Jul 14, 2022

'IndexError: list index out of range' when extracting text #1091

Closed

mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022

BUG: Avoid IndexError in _cmap.parse_to_unicode (py-pdf#1110)

2ba30d6

The code within the if block assumes that `lst` has index 0 and index 1. Fixes py-pdf#1091 Related to py-pdf#1111

dkg mentioned this issue Jul 15, 2022

Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar #1118

Merged

MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Jul 16, 2022

MartinThoma closed this as completed in ae0ff49 Jul 17, 2022

dkg added a commit to dkg/pypdf2-sample-files that referenced this issue Jul 17, 2022

ENH: Add habibi.pdf from py-pdf/pypdf#1111

922a247

MartinThoma pushed a commit to py-pdf/sample-files that referenced this issue Jul 17, 2022

ENH: Add PDF with Arab text (#13)

200644f

Add habibi.pdf from py-pdf/pypdf#1111 Add sample pdf with alternate CMap structure

srogmann mentioned this issue Oct 11, 2022

Coords in extract text #1389

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmap handling: failure when a `beginbfchar` pair has a zero-length second element #1111

cmap handling: failure when a `beginbfchar` pair has a zero-length second element #1111

dkg commented Jul 14, 2022

dkg commented Jul 14, 2022 •

edited

Loading

dkg commented Jul 15, 2022 •

edited

Loading

dkg commented Jul 15, 2022

dkg commented Jul 15, 2022

cmap handling: failure when a beginbfchar pair has a zero-length second element #1111

cmap handling: failure when a beginbfchar pair has a zero-length second element #1111

Comments

dkg commented Jul 14, 2022

Environment

Code + PDF

dkg commented Jul 14, 2022 • edited Loading

dkg commented Jul 15, 2022 • edited Loading

dkg commented Jul 15, 2022

dkg commented Jul 15, 2022

cmap handling: failure when a `beginbfchar` pair has a zero-length second element #1111

cmap handling: failure when a `beginbfchar` pair has a zero-length second element #1111

dkg commented Jul 14, 2022 •

edited

Loading

dkg commented Jul 15, 2022 •

edited

Loading