New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pikepdf will have failed test with qpdf 10.6 but can be fixed without breaking compatibility #303
Comments
Thanks for pointing this out. I did notice the 0x18 through 0x1f issue before but... I don't know, I suppose I thought maybe you had a reason for doing it that way, and it seemed fairly minor so I didn't raise it. I found another little surprise when I looked over your changes and studied the reference manual: 0xAD should not decode to U+00AD; it should also be undefined officially or U+FFFD. Surprisingly Table D.2 has this. 0x7F, 0x9F and 0xAD are all defined to have no Unicode equivalent. I'll open an issue. Obviously not a showstopper or anything. |
I'm also getting a failure from ____________________________________________________________ test_new_item ____________________________________________________________
resources = PosixPath('/tmp/portage/dev-python/pikepdf-4.5.0/work/pikepdf-4.5.0/tests/resources')
@settings(deadline=timedelta(milliseconds=750))
> @given(
title=st.text(),
page_num=st.integers(0, 1),
page_loc=st.sampled_from(PageLocation),
)
f = <function given.<locals>.run_test_as_given.<locals>.wrapped_test at 0x7fe2a286d550>
resources = PosixPath('/tmp/portage/dev-python/pikepdf-4.5.0/work/pikepdf-4.5.0/tests/resources')
tests/test_outlines.py:439:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resources = PosixPath('/tmp/portage/dev-python/pikepdf-4.5.0/work/pikepdf-4.5.0/tests/resources'), title = '\xad', page_num = 0
page_loc = <PageLocation.XYZ: 1>
@settings(deadline=timedelta(milliseconds=750))
@given(
title=st.text(),
page_num=st.integers(0, 1),
page_loc=st.sampled_from(PageLocation),
)
@example(
title='',
page_num=0,
page_loc='FitR',
)
def test_new_item(resources, title, page_num, page_loc):
# @given precludes use of outlines_doc fixture - causes hypothesis health check to
# fail
with Pdf.open(resources / 'outlines.pdf') as doc:
kwargs = dict.fromkeys(ALL_PAGE_LOCATION_KWARGS, 100)
page_ref = doc.pages[page_num]
new_item = OutlineItem(title, page_num, page_loc, **kwargs)
with doc.open_outline() as outline:
outline.root.append(new_item)
if isinstance(page_loc, PageLocation):
loc_str = page_loc.name
else:
loc_str = page_loc
if loc_str == 'FitR':
kwarg_len = 4
elif loc_str == 'XYZ':
kwarg_len = 3
elif loc_str in ('FitH', 'FitV', 'FitBH', 'FitBV'):
kwarg_len = 1
else:
kwarg_len = 0
expected_dest = [page_ref.obj, Name(f'/{loc_str}')]
expected_dest.extend(repeat(100, kwarg_len))
assert new_item.destination == expected_dest
new_obj = new_item.obj
> assert new_obj.Title == title
E assert pikepdf.String("�") == '\xad'
E +pikepdf.String("�")
E -'\xad'
[...] |
Thanks. Yes, the fixes needed for 10.6 are not exactly complicated, but not obvious either, and it's all codec stuff. Also tricky is mainly compatibility with qpdf 10.5. I might have to drop 10.5 support and just move on to 10.6 (implying pikepdf 5.x 11 days before the Ubuntu 22.04 freeze, not ideal), because pikepdf now needs to change its behavior whether dealing with qpdf 10.5/10.6 to ensure it produces the right Unicode errors at the right time. |
Is it a compile-time issue or a runtime issue? qpdf 10.6 introduced QPDF_MAJOR_VERSION et al (see release notes) for compile-time checks, and QPDF::QPDFVersion() has been available for a long time for runtime checks. You could also also QUtil::pdf_doc_to_utf8 to get a runtime determination of whether the specific code points are handled in one way or another. |
It's a runtime issue. Yes I'll probably do some runtime version checks. The issue is not making the checks but making the behavior consistent for both versions. Not complaining, it needed to be done. |
Just to be clear, is this a problem that affects real-life use of the library or mostly tests? |
Mostly tests. The problem was that qpdf (the C++ library pikepdf uses internally) was not handling certain characters correctly in PDF strings that were encoded in a certain way. The bug had been in qpdf for a very long time and was only discovered recently. While it could in principle affect real-life users of the library, it would only do so for certain relatively unusual operations, not including any of the transformations that qpdf/pikepdf are so often used for, and then only with very unusual files that encoded these characters in this somewhat unusual way. The characters that were not encoded properly were mostly stand-alone accents, which are quite rare on their own, and usually files that would need them would use a different encoding. So, bottom line: while this could affect real-life use of the library, you could probably go the rest of your life and never encounter the combination of PDF file and use case where this bug would actually matter. As for the failing tests, they do not indicate anything wrong with pikepdf itself. The tests were, in a sense, incorrect because they were relying on incorrect behavior of qpdf. @jbarlow83 can add to this if needed for any details specific to pikepdf. |
Thanks. More context: I was mostly wondering if I should be forcing older qpdf in Gentoo but it sounds like it's fine to just ignore the test failures for now. |
Please let me know if there's anything I can do to make this easier in the future. I added the QPDF_*_VERSION preprocessor symbols mainly to make it easier for a future pikepdf (and other software too) to work with versions of qpdf before and after the smart pointer type change coming up in 11. |
Speaking as qpdf upstream and as the debian maintainer for qpdf, I would say it is okay to ignore the test failures for now. That said, I imagine a fix will be coming pretty soon, so any workaround would be short-lived. I don't know the gentoo packaging ecosystem, so I don't know which way is easier for you. At this moment, qpdf 10.6.1 is blocked from transitioning in debian (where I am the maintainer in addition to being qpdf upstream) because of pikepdf, so I'm hoping we'll get all this resolved fast so that qpdf 10.6.1 can make it into Ubuntu 22.04 and everywhere else where it needs to go. I made a big push to get qpdf 10.6 out a couple of weeks before Ubuntu 22.04 freeze in hopes that any surprises would have time to get ironed out. I updated the ubuntu qpdf ppa to 10.6.1 already. |
Is it possible that issue #288 might have been caused by said incorrect handling of certain characters in qpdf? |
No, I don't think this could be related. The prefix in #288 is the UTF-16-BE prefix character. Some code somewhere is looking at a binary string and failing to notice UTF-16 encoding. The issue here is related to qpdf failing to map characters 0x18 through 01xf, 0x7f, and 0xad properly from PDF doc encoding to Unicode. |
Thanks for taking a look. However, I'm not sure whether this is really due to an issue with the PDF, as other libraries seem to be able to decode the strings without problems (tried poppler and mupdf). |
Sorry, very busy with day job, but I believe I can get this wrapped in time for Ubu 22.04 freeze. Though this seemed appropriate: @jberkenbilt
|
Yes, you're right; I was passing non-printable characters in UTF-8 through as they were. I can fix that...not sure it's worth 10.6.2. Will you be able to create the tests to that my fixing this will not break tests in pikepdf? Or do you need me to release 10.6.2 with that fix? Re Ubuntu freeze, I know from experience that you can get bug fixes in for a little while after the freeze. qpdf is in main, and a small fix to pikepdf that unblocks qpdf will probably be granted a freeze exception in the first phase. Of course it would be easier to avoid that. qpdf 10.6.1 has still not transitioned from debian to Ubuntu, which is still supposed to be automatic for a few more days, so I'm going to reach out and find out what the problem is. Let me know if there's something I can do to help on this side... |
@jbarlow83 I am pushing a fix to main probably in an hour or so to the issue you just raised, but I'm not going to do a qpdf release for the fix unless you tell me that it would help you with your tests. In that case, pikepdf would fail tests with 10.6.0 and 10.6.1, but those releases would only have lived a few days, so I don't think it would matter. Alternatively, you can check with this fix to make sure the tests pass with 10.6.1 and qpdf main so we don't have to do a release just to fix this edge case. I'll comment here after I merge the changes. |
qpdf fix is in main now but still not planning a release unless I hear otherwise. |
I think because behavior is different in <10.6, 10.6.1, 10.6.2, the easiest thing for us to do is you release 10.6.2 and I will release pikepdf 5. I'd rather have a clean slate than try to support all those cases. First I will confirm your commit works. Let me check before the release. |
Okay, turns out I have to release 10.6.2 anyway because of unintentionally breaking CLI backward compatibility. I'll wait for your input before doing the release. |
I also added recognition of UTF-16LE. Hopefully that won't change any tests. I will be pushing to main after it passes CI. |
@jbarlow83 What is on qpdf's main right now is what will go out as 10.6.2 (other than bumping the version number). I am not planning on adding any new non-critical fixes unless you need me to make a change for getting a stable pikepdf release. Then I will not introduce any qpdf changes that break pikepdf tests until at least after Ubuntu 22.04 is out. I have no intention to make any qpdf releases until after then anyway and will only do so if needed for critical fixes or fixes related to this issue. |
I believe this fixes all utf8-pdfdoc issues. There is a new potential issue unfortunately, as of qpdf commit 894940d17. auto h = QPDFObjectHandle::newUnicodeString("\x1f"); // ASCII string, not pdfdoc
assert h.getUTF8Value() == "\x1f"; // fails, returns U+02DC That is, newUnicodeString(U+001F) gets converted to U+02DC when round-tripped instead of returning U+001F. So it looks like \x18-\x1f create a QPDF_String in UTF-16 mode so they're preserved when called through newUnicodeString because that's only the way to preserve UTF-8 input. newString I believe works fine. I pushed to a branch, qpdf-10.6-fixes. I'll rebase because it's a mess but wanted to push it so you could test against it. |
(It's okay that qpdf-10.6-fixes fails CI.) |
A one-line change and a very old logic error. newUnicodeString stored the string as is if it could be converted to either ascii or pdfdoc, but getUTF8Value() treated the input as pdfdoc if it wasn't utf16. This is asymmetrical. I will now always encode as either PDFDoc or UTF-16 since this range of unprintable characters is no longer encodable as utf16. Really solid testing. Thanks. This is literally a one-line change. I will test qpdf and then I will test pikepdf against the latest qpdf and let you know. |
I just pushed the fix to qpdf main and verified that pikepdf's test pass on the qpdf-10.6-fixes branch when run against qpdf main. I'll await your confirmation and then release that as 10.6.2. Realistically it will be tomorrow mid morning at the earliest before I can do the release unless I wake up really early. |
qpdf main is 10.6.2. Tomorrow morning (Eastern US) I will tag and prepare the release. |
hypothesis/property based testing is great. Although sometimes frustrating because of how much it uncovers. Case in point. The error above is fixed, but now I get
Which means |
I re-pushed my branch, and was testing against qpdf commit 38d8362c. |
That was a conscious decision which I have reversed because of the asymmetry it creates. I have to update my own tests, then will push again. |
Your branch passes with my changed (not yet pushed) |
Your branch passes against qpdf main, 3e2109ab, which will be 10.6.2. At this point, I think we know that. I am dealing with morning routine in the house so here and there. I won't be able to get back to my computer to make the release for a few hours. |
I'd like to make a statement in my release announcement like, "Distributions: this version of qpdf resolves the test failures with pikepdf. A new version of pikepdf is also about to be released whose tests pass against qpdf 10.6.2." Any objection? Just now pushed...good thing I checked because I hadn't actually pushed and would have been surprised not to see my release build finished when I get back... |
Note for later: the PDF 2.0 spec allows UTF-8 encoded strings. I just learned this. From ISO+3200-2-2020, 7.9.2.2.1:
qpdf 10.6.2 will not support this, but I will be adding support. I'll make sure to test with pikepdf's test suite as soon as I add that feature to make sure we don't have more problems. I think that will be an easier change though. |
qpdf 10.6.2 is ready to go at 3e2109ab. All I have to do is push the tag. I will do that sometime in the next 45 minutes. |
got pulled into something...10.6.2 is out. Announcement coming shortly. |
pikepdf 5.0.0 is out now too. @spwhitton pikepdf 5.0.0 requires qpdf 10.6.2. jberkenbilt maintains qpdf for Debian as well and has pushed qpdf 10.6.2. pikepdf 5.0.0 should fix the failing tests. Scope of changes from most recent Debian should otherwise be fairly small. @QuLogic fyi for Fedora |
@jbarlow83 Thanks! |
Thanks! I've bumped it yesterday and can also confirm all tests pass for us now. |
When running pikepdf's tests against qpdf 10.6, the following failures occur:
and
This is most likely because of qpdf/qpdf#606 which added previously omitted Unicode conversions for PDF Doc Encoding code points 0x18 through 0x1f and 0x7f. If you want to test mapping to an invalid code point, you can pick something lower than 0x18. That should map to the invalid character. Anyway, I'm not sure what correct fix is for your test.
I plan to release qpdf 10.6 most likely tomorrow, February 8. I plan on preparing everything today. Other than version numbers and final release mechanics, qpdf's main is what 10.6 will look like. At this moment, I haven't yet updated configure.ac and libtool versions, but I will be doing that shortly.
The text was updated successfully, but these errors were encountered: