Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar #1118

Merged
merged 1 commit into from
Jul 17, 2022

Conversation

dkg
Copy link
Contributor

@dkg dkg commented Jul 15, 2022

This is not a principled fix, but it is a hack to avoid a crash when
encountering an empty dstString in a beginbfchar table in a
ToUnicode CMap.

The right way to fix this would be to replace all the string
manipulation with a formal grammar, but i don't have the skill or
capacity to do that right now.

Instead, we take narrow aim at the issue of zero-length (empty) hex
string representations.

We take advantage of the fact that no angle-bracket-delimited hex
string contains a . character. when we encounter an empty hex string,
rather than replacing it with the empty string, we replace it with a
literal ".". Then, when we encounter a ".", we remember that it was
supposed to be an empty string.

One consequence of this fix is that the exported cmap can now return
an empty string, so we also have to clean up
PageObject::process_operation so that it doesn't try to read the
final character from an empty string.

This is a hackish workaround for #1111.

…char

This is not a principled fix, but it is a hack to avoid a crash when
encountering an empty dstString in a `beginbfchar` table in a
ToUnicode CMap.

The right way to fix this would be to replace all the string
manipulation with a formal grammar, but i don't have the skill or
capacity to do that right now.

Instead, we take narrow aim at the issue of zero-length (empty) hex
string representations.

We take advantage of the fact that no angle-bracket-delimited hex
string contains a . character.  when we encounter an empty hex string,
rather than replacing it with the empty string, we replace it with a
literal ".".  Then, when we encounter a ".", we remember that it was
supposed to be an empty string.

One consequence of this fix is that the exported cmap can now return
an empty string, so we also have to clean up
`PageObject::process_operation` so that it doesn't try to read the
final character from an empty string.

This is a hackish workaround for py-pdf#1111.
@dkg dkg changed the title Avoid a crash when a ToUnicode CMap has an empty dstString in beginbf… Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar Jul 15, 2022
@codecov
Copy link

codecov bot commented Jul 15, 2022

Codecov Report

Merging #1118 (cba5b04) into main (9bbe827) will decrease coverage by 0.05%.
The diff coverage is 57.14%.

@@            Coverage Diff             @@
##             main    #1118      +/-   ##
==========================================
- Coverage   91.96%   91.91%   -0.06%     
==========================================
  Files          24       24              
  Lines        4657     4663       +6     
  Branches      962      964       +2     
==========================================
+ Hits         4283     4286       +3     
- Misses        229      230       +1     
- Partials      145      147       +2     
Impacted Files Coverage Δ
PyPDF2/_page.py 92.13% <ø> (ø)
PyPDF2/_cmap.py 92.18% <57.14%> (-1.37%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9bbe827...cba5b04. Read the comment docs.

@@ -1377,6 +1377,7 @@ def process_operation(operator: bytes, operands: List) -> None:
if (
(abs(float(op)) >= _space_width)
and (abs(float(op)) <= 8 * _space_width)
and (len(text) > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That one is pretty clear and easy to approve / merge. I have a harder time with the other changes. I'm currently running the benchmark to see if it changed.

@MartinThoma
Copy link
Member

Looking at my benchmark results: https://github.com/py-pdf/benchmarks

            "1601.03642": 98.8 -> 98.9 : +0.1
            "1602.06541": 98.0 -> 97.8 : -0.2
            "1707.09725": 94.0 -> 94.1 : +0.1
            "2201.00021": 96.7 -> 96.5 : -0.2
            "2201.00022": 97.2 -> 97.3 : +0.1
            "2201.00029": 97.6 -> 96.7 : -0.9
            "2201.00037": 93.8 -> 94.0 : +0.2
            "2201.00069": 96.3 -> 96.4 : +0.1
            "2201.00151": 93.3 -> 93.8 : +0.5
            "2201.00178": 92.9 -> 93.1 : +0.2
            "2201.00200": 97.4 -> 97.4 : ----
            "2201.00201": 98.2 -> 98.2 : ----
            "2201.00214": 97.2 -> 97.4 : +0.2
          "GeoTopo-book": 86.5 -> 87.1 : +0.6

@MartinThoma
Copy link
Member

@pubpub-zz What do you think about this PR?

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Jul 16, 2022

@pubpub-zz What do you think about this PR?

I of course agree with and (text != ""). Just wandering if the and (abs(float(op)) <= 8 * _space_width) should not be removed : can you compare the benchmark with and without?
about change in _cmap.py I would more likely propose this mod:

        elif process_char:
            lst = [x for x in l.split(b" ") if x]
            map_dict[-1] = len(lst[0]) // 2
            if len(lst) == 1:       # some case where the 2nd param is empty (seems not IAW pdfspec)
                map_dict[
                    unhexlify(lst[0]).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                    )
                ] = ""
            else:**
                while len(lst) > 0:
                    map_dict[
                        unhexlify(lst[0]).decode(
                            "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                        )
                    ] = unhexlify(lst[1]).decode(
                        "utf-16-be", "surrogatepass"
                    )  # join is here as some cases where the code was split
                    int_entry.append(int(lst[0], 16))
                    lst = lst[2:]

Tested on habibi.pdf

If you want I can push it as a suggestion

note : the arabic test showed before and after the roman text is also reported (but with some unknown car after) : not an issue with PyPDF2

@MartinThoma
Copy link
Member

@pubpub-zz Very good hint!

                PyPDF2==2.5.0  This PR        removing the line 'and (abs(float(op)) <= 8 * _space_width)'
  "1601.03642": 98.8%          98.9% (+0.1)   99.0% (+0.2 / +0.1) 
  "1602.06541": 98.0%          97.8% (-0.2)   98.0% (---  / +0.2)
  "1707.09725": 94.0%          94.1% (+0.1)   94.3% (+0.3 / +0.2)
  "2201.00021": 96.7%          96.5% (-0.2)   96.7% (---- / +0.2)
  "2201.00022": 97.2%          97.3% (+0.1)   97.3% (+0.1 / ----)
  "2201.00029": 97.6%          96.7% (-0.9)   96.7% (-0.9 / ----)
  "2201.00037": 93.8%          94.0% (+0.2)   94.0% (+0.2 / ----)
  "2201.00069": 96.3%          96.4% (+0.1)   96.4% (+0.1 / ----)
  "2201.00151": 93.3%          93.8% (+0.5)   93.8% (+0.5 / ----)
  "2201.00178": 92.9%          93.1% (+0.2)   93.2% (+0.3 / +0.1)
  "2201.00200": 97.4%          97.4% (----)   97.4% (---  / ----)
  "2201.00201": 98.2%          98.2% (----)   98.3% (+0.1 / +0.1)
  "2201.00214": 97.2%          97.4% (+0.2)   97.6% (+0.4 / +0.2)
"GeoTopo-book": 86.5%          87.1% (+0.6)   87.1% (+0.6 / ----)

Removing that line improves almost all examples. The one exception is 2201.00029 - sadly by quite a bit.

@MartinThoma
Copy link
Member

@dkg Would you mind adding the two changes suggested by @pubpub-zz ? (I could also add them afterwards)

@MartinThoma MartinThoma merged commit ae0ff49 into py-pdf:main Jul 17, 2022
MartinThoma added a commit that referenced this pull request Jul 17, 2022
Credits to pubpub-zz, see
#1118 (comment)

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
@MartinThoma
Copy link
Member

Thank you for your contributon @dkg ! It will be part of the release today :-)

MartinThoma added a commit that referenced this pull request Jul 17, 2022
Credits to pubpub-zz, see
#1118 (comment)

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
@dkg
Copy link
Contributor Author

dkg commented Jul 17, 2022

As i understand it, there are three requested changes:

  • use (text != "") instead of (len(text) > 0) as an "if" predicate. I'm fine with either, i don't know enough about the various implementations of python to know whether there's any significant performance benefit across all of them. fwiw, _page.py has at least one other instance of a len(text) == 0 test internally.
  • remove and (abs(float(op)) <= 8 * _space_width) -- this seems orthogonal to the change i'm proposing to make here, and would recommend it being done separately.
  • changes to the elif process_char stanza: the proposed change only works if the last entry on each line is the empty entry. It is not at all clear to me from either the PDF standard or the CMap technical note 5014 requires each grouping to be on a single line.

In fact, i'm pretty sure that the PDF format doesn't care about linebreaks here at all. That is, a CMap stream's beginbfchar range that contains two mappings which most folks would emit as:

<0012> <0056>
<0013> <0059>

could just as easily be written as:

<0012> <0056> <0013> <0059>

now, if the dstString of the first mapping is the empty string, you get this:

<0012> <>
<0013> <0059>

this will work with either proposed code change.

But if the CMap is written the second way:

<0012> <> <0013> <0059>

Then @pubpub-zz 's proposed variant (in 7740a6e), which only deals with the final odd pairing, will effectively transform such a CMap stream to:

<0012> <0013>
<0059> <>

That's an entirely different character mapping! So if i had to choose one, i'd stick with my original proposal.

Again, the right way to fix this is to use a formal grammar for CMap, but i'm grateful for the consideration of any improvement that would work with habibi.pdf :)

@dkg
Copy link
Contributor Author

dkg commented Jul 17, 2022

@pubpub-zz wrote:

note : the arabic test showed before and after the roman text is also reported (but with some unknown car after) : not an issue with PyPDF2

Agreed, and thanks for pointing this out. The fact that the arabic text shows up twice is a distinct issue with the generator, WeasyPrint, not something for PyPDF2 to worry about.

MartinThoma added a commit that referenced this pull request Jul 17, 2022
Credits to pubpub-zz, see
#1118 (comment)

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
@pubpub-zz
Copy link
Collaborator

now, if the dstString of the first mapping is the empty string, you get this:

<0012> <>
<0013> <0059>

@dkg,
When I read the pdf Spec, I have not been able to identify an example about empty string. Despite WeasyPrint can produce this and we have a way to cope with it, can you point me a ref where this case is described. Thx

MartinThoma added a commit that referenced this pull request Jul 17, 2022
New Features (ENH):
-  Add color and font_format to PdfReader.outlines[i] (#1104)
-  Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG):
-  Use `build_destination` for named destination outlines (#1128)
-  Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
-  Prevent deduplication of PageObject (#1105)
-  None-check in DictionaryObject.read_from_stream (#1113)
-  Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC):
-  Explanation for git submodule
-  Watermark and stamp (#1095)

Maintenance (MAINT):
-  Text extraction improvements (#1126)
-  Destination.color returns ArrayObject instead of tuple as fallback (#1119)
-  Use add_bookmark_destination in add_bookmark (#1100)
-  Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST):
-  Remove xfail from test_outline_title_issue_1121
-  Add test for arab text (#1127)
-  Add xfail for decryption fail (#1125)
-  Add xfail test for IndexError when extracting text (#1124)
-  Add MCVE showing outline title issue (#1123)

Code Style (STY):
-  Apply black and isort
-  Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
-  Simplify code (#1101)

Full Changelog: 2.5.0...2.6.0
@dkg
Copy link
Contributor Author

dkg commented Jul 19, 2022

@pubpub-zz that's an entirely reasonable question. I think you're asking whether a string can have zero characters in it or not. I refer you to §7.3.4 ("String Objects") from PDF32000_2008.pdf, which says:

A string object shall consist of a series of zero or more bytes.

§7.3.4.3 ("Hexadecimal Strings"), which is the relevant bit for what we're parsing here, says:

If the final digit of a hexadecimal string is missing—that is, if there is an odd number of digits—the final digit
shall be assumed to be 0.

All of these suggest that a string generally can be the empty string.

So maybe you're asking whether it's acceptable to have an empty string in the beginbfchar/endbfchar table. Adobe Techical note #5014 itself predates unicode, but page 71 describes the structure of beginbfchar and endbfchar, which says:

dstCode must be specified as hexadecimal strings.

back in the PDF spec, §9.10.3 ("ToUnicode CMaps") says that these maps:

define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.

Neither the PDF spec nor the tech note suggests that the strings have any required length other than an integer number of bytes (as a baseline for strings) or valid UTF-16BE encoding (for ToUnicode CMaps). and of course the empty string has an integer number of bytes and is also a valid UTF-16BE string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants