Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly encode JVM option strings #414

Merged
merged 1 commit into from
Feb 7, 2023

Conversation

argv-minus-one
Copy link
Contributor

@argv-minus-one argv-minus-one commented Feb 3, 2023

Overview

This draft PR fixes #410. It adds code to transcode JVM option strings, as passed to InitArgsBuilder::option, into the character encoding that HotSpot expects. It also adds InitArgsBuilder::option_encoded to skip transcoding and pass a raw CStr to the JVM.

This is a draft because it's a work in progress. Only Windows is handled so far. Next I will see about doing the same for POSIX platforms.

Definition of Done

  • There are no TODOs left in the code
  • Change is covered by automated tests
  • The coding guidelines are followed
  • Public API has documentation
  • User-visible changes are mentioned in the Changelog
  • The continuous integration build passes

@argv-minus-one argv-minus-one added bug 🐛 breaking A patch that introduces breaking changes (or an issue requiring such patch) work in progress 👷‍♂️ labels Feb 3, 2023
@argv-minus-one argv-minus-one self-assigned this Feb 3, 2023
@rib
Copy link
Contributor

rib commented Feb 3, 2023

Thanks for looking at this.

Please see my comment here: #410 (comment) where I try to clarify my concerns around something similar for POSIX.

Although I was roughly expecting that the Windows support was going to be a lot simpler than this I can still see that there's real-world value in adding some character mapping support on Windows.

I really can't say the same for POSIX - I can't see how we can really justify adding code, complexity and dependence on legacy character maps etc when there's no longer any reason for POSIX systems to use those legacy non-utf8 locales - we'd be adding code and complexity that we'll have to maintain but no one is going to use - which isn't the same situation as with Windows.

@argv-minus-one
Copy link
Contributor Author

Although I was roughly expecting that the Windows support was going to be a lot simpler than this

I was too, but Windows' unfortunate choice to use int instead of SIZE_T as the string length type created the possibility of overflow, which added a bunch of complexity that I wasn't expecting. I decided against using local-encoding-ng because it doesn't check for or handle overflow.

Right now, this code splits the input string into chunks small enough that they shouldn't overflow, with the maximum chunk length being just small enough to avoid overflow if the code page is UTF-7 (which I believe is the least-byte-efficient character encoding Windows supports).

I could simplify this code by only checking for overflow and reporting an error if there is one. This would only save about 15 lines of code, though.

I really can't say the same for POSIX - I can't see how we can really justify adding code, complexity and dependence on legacy character maps etc when there's no longer any reason for POSIX systems to use those legacy non-utf8 locales - we'd be adding code and complexity that we'll have to maintain but no one is going to use - which isn't the same situation as with Windows.

OK, I'll leave POSIX assuming UTF-8 like it does now.

@rib
Copy link
Contributor

rib commented Feb 3, 2023

Although I was roughly expecting that the Windows support was going to be a lot simpler than this

I was too, but Windows' unfortunate choice to use int instead of SIZE_T as the string length type created the possibility of overflow, which added a bunch of complexity that I wasn't expecting. I decided against using local-encoding-ng because it doesn't check for or handle overflow.

I had roughly thought it was going to be possible to just go from Rust uft8 -> utf16 and then there was a Windows API that took a complete utf16 string and could return an 8-bit encoded string.

Having to faff around manually checking for utf16 surrogate pairs was very surprising to see here.

Why can't we either:

  1. pass cchWideChar as -1 after converting the utf16 string to a CString (that will ensure it's NUL terminated)
  2. query the utf16 length in Rust and throw our own error if it's greater than i32::MAX before passing the length to WideCharToMultiByte

I don't think efficiency should be concern at all if it means we can have a drastically simpler solution.

OK, I'll leave POSIX assuming UTF-8 like it does now.

ah, okey, phew

@rib
Copy link
Contributor

rib commented Feb 3, 2023

It looks like it could be helpful to utilize this widestring crate (licensing is compatible and it looks like a well considered, well maintained crate with some utilities that could be handy here)

In particular this utf16 equivalent of CString looks useful: https://docs.rs/widestring/1.0.2/widestring/ucstring/struct.U16CString.html

@argv-minus-one
Copy link
Contributor Author

OK, I'll leave POSIX assuming UTF-8 like it does now.

ah, okey, phew

Sorry, didn't mean to worry you. 😅

I had roughly thought it was going to be possible to just go from Rust uft8 -> utf16 and then there was a Windows API that took a complete utf16 string and could return an 8-bit encoded string.

Well, yes, there is, but it has overflow problems.

Having to faff around manually checking for utf16 surrogate pairs was very surprising to see here.

That's a consequence of splitting the string in order to avoid overflow. It's safe to split a UTF-16 string to transcode it in chunks, if and only if it isn't split in the middle of a surrogate pair.

The widestring crate doesn't check for this, by the way. According to an example in the documentation, it considers unpaired surrogates valid.

Why can't we either:

  1. pass cchWideChar as -1 after converting the utf16 string to a CString (that will ensure it's NUL terminated)

Here it is (I left in some debug statements for now), but this breaks on overflow. Turns out that Win32 WideCharToMultiByte doesn't check for overflow if the input string is more than i32::MAX UTF-16 units long, and behaves similarly to local-encoding-ng.

I've added a test that checks what happens when the string length is around i32::MAX, and it fails here. WideCharToMultiByte returns zero, which normally indicates error, but in this case it's the result of its string length counter wrapping around, resulting in an amusing error message of “The operation completed successfully.”

Conclusion: feeding WideCharToMultiByte a null-terminated string is not overflow-safe.

  1. query the utf16 length in Rust and throw our own error if it's greater than i32::MAX before passing the length to WideCharToMultiByte

Here it is. This sort of works, but WideCharToMultiByte raises ERROR_INVALID_PARAMETER (instead of ERROR_INSUFFICIENT_BUFFER) if the input UTF-16 string isn't too large (≤ i32::MAX UTF-16 units) but the output string is (> i32::MAX bytes).

Maybe that's okay. The only Windows code page I know of where this could happen is UTF-7, and I seriously doubt anyone will ever set that as their default code page. (UTF-8 can also exceed the limit this way, as demonstrated by the test, but transcoding is skipped if the code page is UTF-8, so it doesn't matter.)

That test code sure is ugly, though, and it takes 5 minutes and around 12GB of memory to run. WideCharToMultiByte apparently allocates a lot of memory when transcoding a huge UTF-16 string to Windows-1252. It's much more efficient transcoding UTF-16 to UTF-8, interestingly.

So, what do you think? Here are the options I can think of:

  1. Check for overflow and raise an error if it happens, like in c501e88.
  2. Avoid overflow by splitting the string, like in 8809b61.
  3. Don't even bother checking for or handling overflow at all.

Option 3 seems to be the most popular approach. That's what local-encoding-ng and WideCharToMultiByte do. I usually don't like ignoring potential errors like that, but if that's what you think is best, then I'll do it.

I'm fond of option 2 because it makes testing simple: just change the chunk size to something small and make sure the string gets split and reassembled correctly.

I don't like option 1. The implementation is a little simpler, but the test is horrible. My computer almost melted running it! 😅 Maybe it can be simplified somehow, but I'm not sure how.

@rib
Copy link
Contributor

rib commented Feb 4, 2023

I've added a test that checks what happens when the string length is around i32::MAX

Hehe, kudos. I was wondering if you were going to end up doing that. To be fair I was assuming the implementation was going to detect the overflow fine, so interesting to see that it doesn't.

I think the right solution here is to just to set our own option size limit and just check for that. E.g. 8k would be a ludicrous size surely for an option value. If we adapt your test for that it will also be able to run in a much more reasonable time too.

@argv-minus-one
Copy link
Contributor Author

Okay, I've pushed ee109d3 where it limits the string's length to what should very definitely never overflow, even with UTF-7. That's still 858,993,458 bytes of UTF-8, so hopefully no one will complain. 😅

I've kept the big overflow test. I figured we'd better keep it so we know for sure that it really doesn't overflow. It's not so computer-melting any more; it only requires about 4GB of memory and runs in one minute instead of five. Still needs to be run in a --release build, though; it relies very heavily on the optimizer to fill those giant buffers efficiently.

Or we could just forget about this and limit to some arbitrary length like 8k. Your choice, but I thought I'd better present this option too.

@rib
Copy link
Contributor

rib commented Feb 6, 2023

I suppose I tend to assume that a test that takes a minute to run and requires a release build is quite unlikely to be run again after this lands.

I can see the sense in testing the limits of codepage_to_string_win32 which is written as a general-purpose wrapper around MultiByteToWideChar - and in that context an arbitrary 8k limit doesn't necessarily make sense.

The fact that you've gone to the extent of testing the overflow case though is enough for me though - even if that test doesn't get run regularly here.

To have an overflow test that would run regularly I'd perhaps be inclined to add an additional length check for 8k or 8Mb etc to try_option (i.e. outside of the general-purpose wrapper), since I still think it's perfectly reasonable to limit the length of these JNI options (especially if it simplifies testing) but I'm also fine with landing this as is (with the CHANGELOG update)

@rib rib added this to the 0.21 milestone Feb 6, 2023
@rib rib mentioned this pull request Feb 6, 2023
@argv-minus-one argv-minus-one marked this pull request as ready for review February 6, 2023 23:25
@argv-minus-one
Copy link
Contributor Author

Okay, all set. I've set the limit to 1MB so the overflow test runs quickly even without --release. I've also added changelog entries and added to the documentation. Should be ready to merge.

Fixes jni-rs#410.

UTF-8 is still assumed on other platforms.
@rib rib force-pushed the invocation-character-encoding branch from d38a450 to 61d1e2c Compare February 7, 2023 04:07
@rib rib merged commit a0c4c48 into jni-rs:master Feb 7, 2023
@rib
Copy link
Contributor

rib commented Feb 7, 2023

looks good, thanks!

@argv-minus-one argv-minus-one deleted the invocation-character-encoding branch February 7, 2023 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking A patch that introduces breaking changes (or an issue requiring such patch) bug 🐛
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invocation API uses wrong encoding for options
2 participants