Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pikepdf will have failed test with qpdf 10.6 but can be fixed without breaking compatibility #303

Closed
jberkenbilt opened this issue Feb 8, 2022 · 40 comments

Comments

@jberkenbilt
Copy link
Contributor

When running pikepdf's tests against qpdf 10.6, the following failures occur:

b = b'\x7f'

    @given(binary())
    def test_codec_involution(b):
        # For all binary strings, there is a pdfdoc decoding. The encoding of that
        # decoding recovers the initial string. (However, not all str have a pdfdoc
        # encoding.)
>       assert b.decode('pdfdoc').encode('pdfdoc') == b
E       AssertionError: assert b'\x9f' == b'\x7f'
E         At index 0 diff: b'\x9f' != b'\x7f'
E         Use -v to get the full diff

and

s = '\x1f'

    @given(text())
    def test_break_encode(s):
        try:
            encoded_bytes = s.encode('pdfdoc')
        except ValueError as e:
            allowed_errors = [
                "'pdfdoc' codec can't encode character",
                "'pdfdoc' codec can't process Unicode surrogates",
                "'pdfdoc' codec can't encode some characters",
            ]
            if any((allowed in str(e)) for allowed in allowed_errors):
                return
            raise
        else:
>           assert encoded_bytes.decode('pdfdoc') == s
E           AssertionError: assert '˜' == '\x1f'
E             Strings contain only whitespace, escaping them using repr()
E             - '\x1f'
E             + '˜'

tests/test_codec.py:52: AssertionError

This is most likely because of qpdf/qpdf#606 which added previously omitted Unicode conversions for PDF Doc Encoding code points 0x18 through 0x1f and 0x7f. If you want to test mapping to an invalid code point, you can pick something lower than 0x18. That should map to the invalid character. Anyway, I'm not sure what correct fix is for your test.

I plan to release qpdf 10.6 most likely tomorrow, February 8. I plan on preparing everything today. Other than version numbers and final release mechanics, qpdf's main is what 10.6 will look like. At this moment, I haven't yet updated configure.ac and libtool versions, but I will be doing that shortly.

@jbarlow83
Copy link
Member

jbarlow83 commented Feb 9, 2022

Thanks for pointing this out. I did notice the 0x18 through 0x1f issue before but... I don't know, I suppose I thought maybe you had a reason for doing it that way, and it seemed fairly minor so I didn't raise it.

I found another little surprise when I looked over your changes and studied the reference manual: 0xAD should not decode to U+00AD; it should also be undefined officially or U+FFFD. Surprisingly Table D.2 has this. 0x7F, 0x9F and 0xAD are all defined to have no Unicode equivalent. I'll open an issue. Obviously not a showstopper or anything.

@mgorny
Copy link
Contributor

mgorny commented Feb 13, 2022

I'm also getting a failure from test_new_item, it seems relevant:

____________________________________________________________ test_new_item ____________________________________________________________

resources = PosixPath('/tmp/portage/dev-python/pikepdf-4.5.0/work/pikepdf-4.5.0/tests/resources')

    @settings(deadline=timedelta(milliseconds=750))
>   @given(
        title=st.text(),
        page_num=st.integers(0, 1),
        page_loc=st.sampled_from(PageLocation),
    )

f          = <function given.<locals>.run_test_as_given.<locals>.wrapped_test at 0x7fe2a286d550>
resources  = PosixPath('/tmp/portage/dev-python/pikepdf-4.5.0/work/pikepdf-4.5.0/tests/resources')

tests/test_outlines.py:439: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

resources = PosixPath('/tmp/portage/dev-python/pikepdf-4.5.0/work/pikepdf-4.5.0/tests/resources'), title = '\xad', page_num = 0
page_loc = <PageLocation.XYZ: 1>

    @settings(deadline=timedelta(milliseconds=750))
    @given(
        title=st.text(),
        page_num=st.integers(0, 1),
        page_loc=st.sampled_from(PageLocation),
    )
    @example(
        title='',
        page_num=0,
        page_loc='FitR',
    )
    def test_new_item(resources, title, page_num, page_loc):
        # @given precludes use of outlines_doc fixture - causes hypothesis health check to
        # fail
        with Pdf.open(resources / 'outlines.pdf') as doc:
            kwargs = dict.fromkeys(ALL_PAGE_LOCATION_KWARGS, 100)
            page_ref = doc.pages[page_num]
    
            new_item = OutlineItem(title, page_num, page_loc, **kwargs)
            with doc.open_outline() as outline:
                outline.root.append(new_item)
            if isinstance(page_loc, PageLocation):
                loc_str = page_loc.name
            else:
                loc_str = page_loc
            if loc_str == 'FitR':
                kwarg_len = 4
            elif loc_str == 'XYZ':
                kwarg_len = 3
            elif loc_str in ('FitH', 'FitV', 'FitBH', 'FitBV'):
                kwarg_len = 1
            else:
                kwarg_len = 0
            expected_dest = [page_ref.obj, Name(f'/{loc_str}')]
            expected_dest.extend(repeat(100, kwarg_len))
            assert new_item.destination == expected_dest
            new_obj = new_item.obj
>           assert new_obj.Title == title
E           assert pikepdf.String("�") == '\xad'
E             +pikepdf.String("�")
E             -'\xad'

[...]

Full log:
dev-python:pikepdf-4.5.0:20220213-095103.log

@jbarlow83
Copy link
Member

Thanks. Yes, the fixes needed for 10.6 are not exactly complicated, but not obvious either, and it's all codec stuff.

Also tricky is mainly compatibility with qpdf 10.5. I might have to drop 10.5 support and just move on to 10.6 (implying pikepdf 5.x 11 days before the Ubuntu 22.04 freeze, not ideal), because pikepdf now needs to change its behavior whether dealing with qpdf 10.5/10.6 to ensure it produces the right Unicode errors at the right time.

@jberkenbilt
Copy link
Contributor Author

Is it a compile-time issue or a runtime issue? qpdf 10.6 introduced QPDF_MAJOR_VERSION et al (see release notes) for compile-time checks, and QPDF::QPDFVersion() has been available for a long time for runtime checks. You could also also QUtil::pdf_doc_to_utf8 to get a runtime determination of whether the specific code points are handled in one way or another.

@jbarlow83
Copy link
Member

It's a runtime issue. Yes I'll probably do some runtime version checks. The issue is not making the checks but making the behavior consistent for both versions.

Not complaining, it needed to be done.

@mgorny
Copy link
Contributor

mgorny commented Feb 13, 2022

Just to be clear, is this a problem that affects real-life use of the library or mostly tests?

@jberkenbilt
Copy link
Contributor Author

Mostly tests.

The problem was that qpdf (the C++ library pikepdf uses internally) was not handling certain characters correctly in PDF strings that were encoded in a certain way. The bug had been in qpdf for a very long time and was only discovered recently. While it could in principle affect real-life users of the library, it would only do so for certain relatively unusual operations, not including any of the transformations that qpdf/pikepdf are so often used for, and then only with very unusual files that encoded these characters in this somewhat unusual way. The characters that were not encoded properly were mostly stand-alone accents, which are quite rare on their own, and usually files that would need them would use a different encoding.

So, bottom line: while this could affect real-life use of the library, you could probably go the rest of your life and never encounter the combination of PDF file and use case where this bug would actually matter.

As for the failing tests, they do not indicate anything wrong with pikepdf itself. The tests were, in a sense, incorrect because they were relying on incorrect behavior of qpdf.

@jbarlow83 can add to this if needed for any details specific to pikepdf.

@mgorny
Copy link
Contributor

mgorny commented Feb 13, 2022

Thanks. More context: I was mostly wondering if I should be forcing older qpdf in Gentoo but it sounds like it's fine to just ignore the test failures for now.

@jberkenbilt
Copy link
Contributor Author

The issue is not making the checks but making the behavior consistent for both versions.

Please let me know if there's anything I can do to make this easier in the future. I added the QPDF_*_VERSION preprocessor symbols mainly to make it easier for a future pikepdf (and other software too) to work with versions of qpdf before and after the smart pointer type change coming up in 11.

@jberkenbilt
Copy link
Contributor Author

I was mostly wondering if I should be forcing older qpdf in Gentoo but it sounds like it's fine to just ignore the test failures for now.

Speaking as qpdf upstream and as the debian maintainer for qpdf, I would say it is okay to ignore the test failures for now. That said, I imagine a fix will be coming pretty soon, so any workaround would be short-lived. I don't know the gentoo packaging ecosystem, so I don't know which way is easier for you.

At this moment, qpdf 10.6.1 is blocked from transitioning in debian (where I am the maintainer in addition to being qpdf upstream) because of pikepdf, so I'm hoping we'll get all this resolved fast so that qpdf 10.6.1 can make it into Ubuntu 22.04 and everywhere else where it needs to go. I made a big push to get qpdf 10.6 out a couple of weeks before Ubuntu 22.04 freeze in hopes that any surprises would have time to get ironed out. I updated the ubuntu qpdf ppa to 10.6.1 already.

@mara004
Copy link
Contributor

mara004 commented Feb 14, 2022

The problem was that qpdf (the C++ library pikepdf uses internally) was not handling certain characters correctly in PDF strings that were encoded in a certain way.

Is it possible that issue #288 might have been caused by said incorrect handling of certain characters in qpdf?

@jberkenbilt
Copy link
Contributor Author

No, I don't think this could be related. The prefix in #288 is the UTF-16-BE prefix character. Some code somewhere is looking at a binary string and failing to notice UTF-16 encoding. The issue here is related to qpdf failing to map characters 0x18 through 01xf, 0x7f, and 0xad properly from PDF doc encoding to Unicode.

@jberkenbilt
Copy link
Contributor Author

In fact, @mara004, looking at #288, I think the most likely explanation is that there was something wrong with the PDF file.

@mara004
Copy link
Contributor

mara004 commented Feb 14, 2022

Thanks for taking a look. However, I'm not sure whether this is really due to an issue with the PDF, as other libraries seem to be able to decode the strings without problems (tried poppler and mupdf).

@jberkenbilt
Copy link
Contributor Author

@mara004 You piqued my curiosity, so I downloaded the file. There is something wrong with it -- I commented on #288.

@jbarlow83
Copy link
Member

jbarlow83 commented Feb 15, 2022

Sorry, very busy with day job, but I believe I can get this wrapped in time for Ubu 22.04 freeze.

Though this seemed appropriate:

image

@jberkenbilt
After some as yet uncommitted fixes to pikepdf, I believe qpdf 10.6.1 still has a few lingering errors:

QUtil::utf8_to_pdf_doc("\x18", &output) ought to return false, but returns \x18. Likewise for \x18 through \x1f. That is, U+0018 through U+001F, which are \x18 through \x1f when encoded in UTF-8, have no corresponding character in pdfdoc.

QUtil::utf8_to_pdf_doc("\xcb\x98", &output) outputs \x18 and returns true, as expected.

pdf_doc_to_utf8 does not seem to have any issues.

@jberkenbilt
Copy link
Contributor Author

Yes, you're right; I was passing non-printable characters in UTF-8 through as they were. I can fix that...not sure it's worth 10.6.2. Will you be able to create the tests to that my fixing this will not break tests in pikepdf? Or do you need me to release 10.6.2 with that fix?

Re Ubuntu freeze, I know from experience that you can get bug fixes in for a little while after the freeze. qpdf is in main, and a small fix to pikepdf that unblocks qpdf will probably be granted a freeze exception in the first phase. Of course it would be easier to avoid that. qpdf 10.6.1 has still not transitioned from debian to Ubuntu, which is still supposed to be automatic for a few more days, so I'm going to reach out and find out what the problem is.

Let me know if there's something I can do to help on this side...

@jberkenbilt
Copy link
Contributor Author

@jbarlow83 I am pushing a fix to main probably in an hour or so to the issue you just raised, but I'm not going to do a qpdf release for the fix unless you tell me that it would help you with your tests. In that case, pikepdf would fail tests with 10.6.0 and 10.6.1, but those releases would only have lived a few days, so I don't think it would matter. Alternatively, you can check with this fix to make sure the tests pass with 10.6.1 and qpdf main so we don't have to do a release just to fix this edge case. I'll comment here after I merge the changes.

@jberkenbilt
Copy link
Contributor Author

qpdf fix is in main now but still not planning a release unless I hear otherwise.

@jbarlow83
Copy link
Member

I think because behavior is different in <10.6, 10.6.1, 10.6.2, the easiest thing for us to do is you release 10.6.2 and I will release pikepdf 5. I'd rather have a clean slate than try to support all those cases.

First I will confirm your commit works. Let me check before the release.

@jberkenbilt
Copy link
Contributor Author

Okay, turns out I have to release 10.6.2 anyway because of unintentionally breaking CLI backward compatibility. I'll wait for your input before doing the release.

@jberkenbilt
Copy link
Contributor Author

I also added recognition of UTF-16LE. Hopefully that won't change any tests. I will be pushing to main after it passes CI.

@jberkenbilt
Copy link
Contributor Author

@jbarlow83 What is on qpdf's main right now is what will go out as 10.6.2 (other than bumping the version number). I am not planning on adding any new non-critical fixes unless you need me to make a change for getting a stable pikepdf release. Then I will not introduce any qpdf changes that break pikepdf tests until at least after Ubuntu 22.04 is out. I have no intention to make any qpdf releases until after then anyway and will only do so if needed for critical fixes or fixes related to this issue.

@jbarlow83
Copy link
Member

jbarlow83 commented Feb 15, 2022

I believe this fixes all utf8-pdfdoc issues.

There is a new potential issue unfortunately, as of qpdf commit 894940d17.

auto h = QPDFObjectHandle::newUnicodeString("\x1f"); // ASCII string, not pdfdoc
assert h.getUTF8Value() == "\x1f"; // fails, returns U+02DC

That is, newUnicodeString(U+001F) gets converted to U+02DC when round-tripped instead of returning U+001F. So it looks like \x18-\x1f create a QPDF_String in UTF-16 mode so they're preserved when called through newUnicodeString because that's only the way to preserve UTF-8 input. newString I believe works fine.

I pushed to a branch, qpdf-10.6-fixes. I'll rebase because it's a mess but wanted to push it so you could test against it.

@jbarlow83
Copy link
Member

(It's okay that qpdf-10.6-fixes fails CI.)

@jberkenbilt
Copy link
Contributor Author

A one-line change and a very old logic error. newUnicodeString stored the string as is if it could be converted to either ascii or pdfdoc, but getUTF8Value() treated the input as pdfdoc if it wasn't utf16. This is asymmetrical. I will now always encode as either PDFDoc or UTF-16 since this range of unprintable characters is no longer encodable as utf16.

Really solid testing. Thanks. This is literally a one-line change. I will test qpdf and then I will test pikepdf against the latest qpdf and let you know.

@jberkenbilt
Copy link
Contributor Author

I just pushed the fix to qpdf main and verified that pikepdf's test pass on the qpdf-10.6-fixes branch when run against qpdf main. I'll await your confirmation and then release that as 10.6.2. Realistically it will be tomorrow mid morning at the earliest before I can do the release unless I wake up really early.

@jberkenbilt
Copy link
Contributor Author

qpdf main is 10.6.2. Tomorrow morning (Eastern US) I will tag and prepare the release.

@jbarlow83
Copy link
Member

hypothesis/property based testing is great. Although sometimes frustrating because of how much it uncovers. Case in point.

The error above is fixed, but now I get

FAILED tests/test_outlines.py::test_new_item - assert pikepdf.String("-") == '\xad'

Which means QPDFObjectHandle::newUnicodeString("\xad"),getUTF8Value() -> U+002D (standard hyphen).

@jbarlow83
Copy link
Member

I re-pushed my branch, and was testing against qpdf commit 38d8362c.

@jberkenbilt
Copy link
Contributor Author

That was a conscious decision which I have reversed because of the asymmetry it creates. I have to update my own tests, then will push again.

@jberkenbilt
Copy link
Contributor Author

Your branch passes with my changed (not yet pushed)

@jberkenbilt
Copy link
Contributor Author

Your branch passes against qpdf main, 3e2109ab, which will be 10.6.2. At this point, I think we know that. I am dealing with morning routine in the house so here and there. I won't be able to get back to my computer to make the release for a few hours.

@jberkenbilt
Copy link
Contributor Author

I'd like to make a statement in my release announcement like, "Distributions: this version of qpdf resolves the test failures with pikepdf. A new version of pikepdf is also about to be released whose tests pass against qpdf 10.6.2." Any objection?

Just now pushed...good thing I checked because I hadn't actually pushed and would have been surprised not to see my release build finished when I get back...

@jberkenbilt
Copy link
Contributor Author

Note for later: the PDF 2.0 spec allows UTF-8 encoded strings. I just learned this. From ISO+3200-2-2020, 7.9.2.2.1:

The text string type shall be used for character strings that shall be encoded in PDFDocEncoding, the
UTF-16BE Unicode character encoding scheme, or (PDF 2.0) the UTF-8 Unicode character encoding
scheme.

For text strings encoded in UTF-8, the first three bytes shall be 239 followed by 187, followed by 191.
These three bytes represent the Unicode byte order marker indicating that the string is encoded in the
UTF-8 encoding scheme specified in Unicode.

qpdf 10.6.2 will not support this, but I will be adding support. I'll make sure to test with pikepdf's test suite as soon as I add that feature to make sure we don't have more problems. I think that will be an easier change though.

@jberkenbilt
Copy link
Contributor Author

qpdf 10.6.2 is ready to go at 3e2109ab. All I have to do is push the tag. I will do that sometime in the next 45 minutes.

@jberkenbilt
Copy link
Contributor Author

got pulled into something...10.6.2 is out. Announcement coming shortly.

@jbarlow83
Copy link
Member

pikepdf 5.0.0 is out now too.

@spwhitton pikepdf 5.0.0 requires qpdf 10.6.2. jberkenbilt maintains qpdf for Debian as well and has pushed qpdf 10.6.2. pikepdf 5.0.0 should fix the failing tests. Scope of changes from most recent Debian should otherwise be fairly small.

@QuLogic fyi for Fedora

@jberkenbilt
Copy link
Contributor Author

@jbarlow83 Thanks!

@mgorny
Copy link
Contributor

mgorny commented Feb 18, 2022

Thanks! I've bumped it yesterday and can also confirm all tests pass for us now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants