Correctly encode JVM option strings #414
Thanks for looking at this. Please see my comment here: #410 (comment), where I try to clarify my concerns around something similar for POSIX. Although I was roughly expecting the Windows support to be a lot simpler than this, I can still see that there's real-world value in adding some character mapping support on Windows. I really can't say the same for POSIX. I can't see how we can justify adding code, complexity and a dependence on legacy character maps when there's no longer any reason for POSIX systems to use those legacy non-UTF-8 locales. We'd be adding code that we'd have to maintain but that no one is going to use, which isn't the same situation as with Windows.
I was too, but Windows' unfortunate choice to use Right now, this code splits the input string into chunks small enough that they shouldn't overflow, with the maximum chunk length being just small enough to avoid overflow if the code page is UTF-7 (which I believe is the least-byte-efficient character encoding Windows supports). I could simplify this code by only checking for overflow and reporting an error if there is one. This would only save about 15 lines of code, though.
OK, I'll leave POSIX assuming UTF-8 like it does now.
I had roughly thought it would be possible to go from Rust UTF-8 -> UTF-16 and then call a Windows API that takes a complete UTF-16 string and returns an 8-bit encoded string. Having to faff around manually checking for UTF-16 surrogate pairs was very surprising to see here. Why can't we either:
I don't think efficiency should be a concern at all if it means we can have a drastically simpler solution.
Ah, okay, phew.
It looks like it could be helpful to utilize this widestring crate (licensing is compatible and it looks like a well considered, well maintained crate with some utilities that could be handy here) In particular this utf16 equivalent of |
Sorry, didn't mean to worry you. 😅
Well, yes, there is, but it has overflow problems.
That's a consequence of splitting the string in order to avoid overflow. It's safe to split a UTF-16 string to transcode it in chunks, if and only if it isn't split in the middle of a surrogate pair. The widestring crate doesn't check for this, by the way. According to an example in the documentation, it considers unpaired surrogates valid.
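To illustrate the splitting rule described above, here is a minimal sketch of chunking UTF-16 code units without cutting a surrogate pair in half. It is not the code from this PR: the names `is_high_surrogate` and `split_utf16_chunks` are hypothetical, and the sketch only handles the split itself, not the transcoding of each chunk.

```rust
/// Returns true if `unit` is a high (leading) surrogate, 0xD800..=0xDBFF.
/// Hypothetical helper, not part of the PR.
fn is_high_surrogate(unit: u16) -> bool {
    (0xD800..=0xDBFF).contains(&unit)
}

/// Splits `units` into chunks of at most `max_len` code units without
/// separating a high surrogate from the code unit that follows it.
/// `max_len` must be at least 2 so that a whole pair always fits.
/// Hypothetical helper, not part of the PR.
fn split_utf16_chunks(units: &[u16], max_len: usize) -> Vec<&[u16]> {
    assert!(max_len >= 2, "a chunk must be able to hold a surrogate pair");
    let mut chunks = Vec::new();
    let mut rest = units;
    while !rest.is_empty() {
        let mut end = rest.len().min(max_len);
        // If the cut would land right after a high surrogate while more
        // input follows, back up one unit so the pair stays together.
        if end < rest.len() && is_high_surrogate(rest[end - 1]) {
            end -= 1;
        }
        let (head, tail) = rest.split_at(end);
        chunks.push(head);
        rest = tail;
    }
    chunks
}
```

For example, `"a😅b"` encodes to four UTF-16 code units; with `max_len` set to 2 the sketch yields three chunks, and the emoji's surrogate pair stays in one of them. A small chunk size like this is also what makes the split-and-reassemble behaviour easy to test.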
Here it is (I left in some debug statements for now), but this breaks on overflow. Turns out that Win32 I've added a test that checks what happens when the string length is around Conclusion: feeding
Here it is. This sort of works, but Maybe that's okay. The only Windows code page I know of where this could happen is UTF-7, and I seriously doubt anyone will ever set that as their default code page. (UTF-8 can also exceed the limit this way, as demonstrated by the test, but transcoding is skipped if the code page is UTF-8, so it doesn't matter.) That test code sure is ugly, though, and it takes 5 minutes and around 12GB of memory to run. So, what do you think? Here are the options I can think of:
Option 3 seems to be the most popular approach. That's what local-encoding-ng and I'm fond of option 2 because it makes testing simple: just change the chunk size to something small and make sure the string gets split and reassembled correctly. I don't like option 1. The implementation is a little simpler, but the test is horrible. My computer almost melted running it! 😅 Maybe it can be simplified somehow, but I'm not sure how. |
Hehe, kudos. I was wondering if you were going to end up doing that. To be fair, I was assuming the implementation would detect the overflow fine, so it's interesting to see that it doesn't. I think the right solution here is just to set our own option size limit and check for that. E.g. 8k would surely be a ludicrous size for an option value. If we adapt your test for that, it will also be able to run in a much more reasonable time.
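A fixed size limit of this kind could look roughly like the following sketch. The constant `MAX_OPTION_LEN`, the error type `OptionTooLong`, and the function `check_option_len` are hypothetical names chosen for illustration, and the 8 KiB value is just the figure floated in this comment, not what the crate uses.

```rust
use std::fmt;

/// Hypothetical limit for illustration only (8 KiB of UTF-8).
const MAX_OPTION_LEN: usize = 8 * 1024;

/// Hypothetical error returned when an option string exceeds the limit.
#[derive(Debug, PartialEq, Eq)]
struct OptionTooLong {
    len: usize,
    max: usize,
}

impl fmt::Display for OptionTooLong {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "JVM option is {} bytes; the limit is {}", self.len, self.max)
    }
}

impl std::error::Error for OptionTooLong {}

/// Rejects option strings longer than `MAX_OPTION_LEN` bytes of UTF-8,
/// so later transcoding never sees an input large enough to overflow.
fn check_option_len(option: &str) -> Result<(), OptionTooLong> {
    let len = option.len(); // length in bytes, not characters
    if len > MAX_OPTION_LEN {
        Err(OptionTooLong { len, max: MAX_OPTION_LEN })
    } else {
        Ok(())
    }
}
```

Checking the byte length up front keeps the test cheap: a string one byte over the limit is enough to exercise the error path, with no need for multi-gigabyte inputs.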
f3b7d6b to ee109d3
Okay, I've pushed ee109d3 where it limits the string's length to what should very definitely never overflow, even with UTF-7. That's still 858,993,458 bytes of UTF-8, so hopefully no one will complain. 😅 I've kept the big overflow test. I figured we'd better keep it so we know for sure that it really doesn't overflow. It's not so computer-melting any more; it only requires about 4GB of memory and runs in one minute instead of five. Still needs to be run in a Or we could just forget about this and limit to some arbitrary length like 8k. Your choice, but I thought I'd better present this option too. |
I suppose I tend to assume that a test that takes a minute to run and requires a release build is quite unlikely to be run again after this lands. I can see the sense in testing the limits of The fact that you've gone to the extent of testing the overflow case is enough for me, though, even if that test doesn't get run regularly here. To have an overflow test that would run regularly I'd perhaps be inclined to add an additional length check for 8k or 8MB etc. to
ee109d3 to d38a450
Okay, all set. I've set the limit to 1MB so the overflow test runs quickly even without |
Fixes jni-rs#410. UTF-8 is still assumed on other platforms.
d38a450 to 61d1e2c
Looks good, thanks!
Overview
This draft PR fixes #410. It adds code to transcode JVM option strings, as passed to InitArgsBuilder::option, into the character encoding that HotSpot expects. It also adds InitArgsBuilder::option_encoded to skip transcoding and pass a raw CStr to the JVM.

This is a draft because it's a work in progress. Only Windows is handled so far. Next I will see about doing the same for POSIX platforms.
Definition of Done