8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken #554
Conversation
👋 Welcome back jvernee! A progress list of the required criteria for merging this PR into foreign-memaccess+abi will be added to the body of your pull request.
Good catch - I think the fix can be made slightly more efficient
```java
return toCString(addNullTerminator(str).getBytes(), allocator);
}

private static String addNullTerminator(String str) {
```
This is gonna allocate another string in already allocation-heavy code. Wouldn't it be better to just add the correct termination sequence in the segment? I suggest running the StrLen benchmark before/after to make sure string conversion performance isn't negatively impacted.
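For illustration, here is a minimal plain-Java sketch of the suggested approach: encode once, then size the destination to leave zero-filled room for the terminator instead of concatenating `'\0'` onto the Java string. A `ByteBuffer` stands in for a native segment here, and the class name is made up for the sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SegmentTerminator {
    // Sketch: encode the string once, then size the buffer with extra room for
    // the terminator bytes, instead of appending '\0' to the Java string first.
    static ByteBuffer toCString(String str, Charset cs) {
        byte[] encoded = str.getBytes(cs);
        int termLen = "\0".getBytes(cs).length; // terminator width in this charset
        // allocate() zero-fills, like a freshly allocated native segment would be
        ByteBuffer buf = ByteBuffer.allocate(encoded.length + termLen);
        buf.put(encoded); // the trailing termLen bytes stay 0 -> the terminator
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer b = toCString("hi", StandardCharsets.UTF_16LE);
        System.out.println(b.capacity()); // 4 string bytes + 2 terminator bytes = 6
    }
}
```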
I'll check the benchmark.
I looked at the Charset API, and AFAICS the only way to get the number of bytes for the null terminator would be to create a String with the value "\0", then encode that and check the resulting byte[]. I thought going the string concat route would have better chances of being optimized.
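The trick described above can be sketched in plain Java as follows (class and method names are hypothetical, for illustration only):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TerminatorWidth {
    // Encode "\0" with the given charset and measure the resulting bytes;
    // this yields the exact width of the null terminator for that charset.
    static int terminatorBytes(Charset cs) {
        return "\0".getBytes(cs).length;
    }

    public static void main(String[] args) {
        System.out.println(terminatorBytes(StandardCharsets.UTF_8));    // 1
        System.out.println(terminatorBytes(StandardCharsets.UTF_16LE)); // 2
    }
}
```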
The benchmark results are as follows:
Before:
```
Benchmark                        (size)  Mode  Cnt    Score    Error  Units
StrLenTest.panama_strlen_prefix       5  avgt   30  124.874 ± 15.751  ns/op
StrLenTest.panama_strlen_prefix      20  avgt   30  131.683 ±  6.011  ns/op
StrLenTest.panama_strlen_prefix     100  avgt   30  161.046 ± 10.580  ns/op
```
After:
```
Benchmark                        (size)  Mode  Cnt    Score    Error  Units
StrLenTest.panama_strlen_prefix       5  avgt   30  130.758 ±  5.691  ns/op
StrLenTest.panama_strlen_prefix      20  avgt   30  145.012 ±  6.804  ns/op
StrLenTest.panama_strlen_prefix     100  avgt   30  179.992 ±  6.457  ns/op
```
I think C2 should be able to eliminate the intermediate string, but there's still a slight regression.
> Wouldn't it be better to just add the correct termination sequence in the segment?
The problem is finding out how many bytes to add. Looking at Charset again, there really doesn't seem to be a way to get the number of bytes per character. The closest seems to be charset.newEncoder().averageBytesPerChar(), which I'm not sure is what we want. I'll ask around as well.
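A quick probe of the encoder properties mentioned above shows why they are a poor fit (hypothetical class name, for illustration):

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class EncoderProbe {
    public static void main(String[] args) {
        // averageBytesPerChar is a sizing heuristic for output buffers, not an
        // exact per-character width, so it can't reliably give a terminator size.
        CharsetEncoder utf16le = StandardCharsets.UTF_16LE.newEncoder();
        CharsetEncoder utf8 = StandardCharsets.UTF_8.newEncoder();
        System.out.println(utf16le.averageBytesPerChar()); // 2.0
        System.out.println(utf8.averageBytesPerChar());    // an estimate > 1.0, not exact
        System.out.println(utf16le.maxBytesPerChar());     // 2.0
    }
}
```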
Mailing list message from Duncan Gittins on panama-dev: I've had problems with Windows String conversions to/from wide string. I don't think the reverse conversion works using CLinker.toJavaString. It ??? for (String testString : new String[] {"", "x", "testing"} ) { ... and also checked the reverse operation returns the original: String outString = CLinker.toJavaString(text, charset); Kind regards, Duncan. On 10/06/2021 13:12, Jorn Vernee wrote:
Mailing list message from Duncan Gittins on panama-dev: A better range of test strings for Windows <=> wide char conversions and private static final String [] STRINGS = { Kind regards Duncan On Fri, 11 Jun 2021 at 13:54, Duncan Gittins <duncan.gittins at gmail.com>
Good catch. Though we use Charset to do the encoding, it's probably not a bad idea to test a couple of different strings as well.
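A minimal round-trip check along those lines, kept at the charset level in plain Java (no native allocation; the test-string set is an assumption, not the one from the mailing list):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        // Hypothetical test strings; a real test would cover more edge cases
        String[] strings = { "", "x", "testing", "h\u00e9llo" };
        for (Charset cs : new Charset[] { StandardCharsets.UTF_8, StandardCharsets.UTF_16LE }) {
            int term = "\0".getBytes(cs).length; // terminator width for this charset
            for (String s : strings) {
                byte[] bytes = (s + '\0').getBytes(cs); // encode with terminator
                // strip the terminator, decode back, and compare with the original
                String back = new String(bytes, 0, bytes.length - term, cs);
                if (!back.equals(s)) throw new AssertionError(s + " via " + cs);
            }
        }
        System.out.println("all round-trips ok");
    }
}
```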
Force-pushed the Windows_String_Encoding branch from 1983faa to bd6cbe1.
@JornVernee this pull request can not be integrated into foreign-memaccess+abi due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, run the following commands in your personal fork:

```
git checkout Windows_String_Encoding
git fetch https://git.openjdk.java.net/panama-foreign foreign-memaccess+abi
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge foreign-memaccess+abi"
git push
```
When looking into toJavaString, the problems turned out to be much greater. The problem essentially boils down to not having a general way to determine the length of a native string for an arbitrary character set.

We discussed this at length among the team members, and arrived at the intermediate conclusion to only support the 'platform native' Charset. Though, this turns out to be tricky as well, since this charset, which is also called the 'execution character set' in C lingo, is determined by a compiler setting at build time of the native code. With GCC and Clang the default is UTF-8, while on Windows it depends on the current code page. While there is a way to get the current code page of the runtime system and determine the character set from that, we would not be able to avoid issues with code page mismatches between the build environment and runtime environment on Windows. It would still technically be possible to support different character sets, as long as they work with a compatible null termination scheme.

As a result of all this, for now we have arrived at the decision to only support the UTF-8 Charset for the toCString and toJavaString methods, and to leave encoding and decoding using other character sets (including determining the length of a native string) to be implemented manually. I've updated this PR to remove the overloads that accept a Charset.

Notice that the prime focus for this patch is stabilization (for JDK 17 as well). Perhaps in the future these APIs could be expanded to support more character sets again.
Looks sensible!
@JornVernee This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 142 new commits pushed to the foreign-memaccess+abi branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the foreign-memaccess+abi branch, type /integrate in a new comment.
/integrate
Going to push as commit 782aeb4.
Your commit was automatically rebased without conflicts.
@JornVernee Pushed as commit 782aeb4. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
Can you give us a hint how such manual re-encoding could look, if a C lib wants to consume a different charset? If I call …
Note that the char set that is passed to the encoding and decoding calls determines the native representation.

For encoding to a native representation, here is an example that assumes the native representation of the string should be terminated with 2 null bytes (the exact format the native string should have depends ultimately on the domain, and what the library expects):

```java
String input = foo(); // get a Java string from somewhere
// Using a standard null terminated format here, but the library could also expect a different format
byte[] bytes = (input + '\0').getBytes(UTF_16BE);
MemorySegment text = MemorySegment.allocateNative(bytes.length);
text.copyFrom(MemorySegment.ofArray(bytes));
// use 'text'
```

For converting from a native string to a Java string, a way to determine the length of the string is needed; examples are helpers like the `utf16be_string_length` used below. Then something like this could be used:

```java
MemoryAddress input = foo(); // get a native string from somewhere
int length = utf16be_string_length(input);
byte[] bytes = new byte[length];
MemorySegment inputSegment = input.asSegment(length, ResourceScope.newImplicitScope());
MemorySegment.ofArray(bytes).copyFrom(inputSegment);
String text = new String(bytes, UTF_16BE);
// use 'text'
```
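As a plain-Java illustration of what a `utf16be_string_length`-style helper might do (names are made up for the sketch; with the foreign API the same scan would read 2 bytes at a time from a MemorySegment instead of a byte[]):

```java
import java.nio.charset.StandardCharsets;

public class WideStrLen {
    // Length in bytes of a UTF-16BE string up to (not including) its 2-byte
    // null terminator: scan 16-bit units until a 0x0000 unit is found.
    static int utf16beStringLength(byte[] data) {
        for (int i = 0; i + 1 < data.length; i += 2) {
            if (data[i] == 0 && data[i + 1] == 0) {
                return i; // the string occupies bytes [0, i)
            }
        }
        throw new IllegalArgumentException("unterminated string");
    }

    public static void main(String[] args) {
        byte[] bytes = ("testing" + '\0').getBytes(StandardCharsets.UTF_16BE);
        int len = utf16beStringLength(bytes); // 7 chars * 2 bytes = 14
        String text = new String(bytes, 0, len, StandardCharsets.UTF_16BE);
        System.out.println(text); // prints "testing"
    }
}
```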
The problem is that we only add a single 0 byte as a null terminator, regardless of the charset used. For wider char sets, more 0 bytes need to be added. For instance, for UTF_16LE two 0 bytes need to be added.
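A minimal plain-Java illustration of why a single 0 byte is not enough for UTF-16 (the class name and the byte-array stand-in for a native segment are assumptions for the sketch):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SingleByteTerminatorBug {
    public static void main(String[] args) {
        byte[] encoded = "x".getBytes(StandardCharsets.UTF_16LE); // {0x78, 0x00}
        // Old behavior (sketch): append a single 0 byte regardless of charset
        byte[] wrong = Arrays.copyOf(encoded, encoded.length + 1);
        // A UTF-16 code unit is 2 bytes, so a lone 0x00 is only half a terminator;
        // native code scanning for a 0x0000 unit would read past the end.
        System.out.println(wrong.length % 2); // 1 -> not a whole number of UTF-16 units
        // Fixed behavior: terminate the Java string before encoding
        byte[] right = ("x" + '\0').getBytes(StandardCharsets.UTF_16LE);
        System.out.println(right.length % 2); // 0 -> ends on a full 0x0000 unit
    }
}
```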
This patch fixes the issue by adding the null terminator to the Java string, and only then encoding it as a byte[].
Reviewing
Using
git
Check out this PR locally:
$ git fetch https://git.openjdk.java.net/panama-foreign pull/554/head:pull/554
$ git checkout pull/554
Update a local copy of the PR:
$ git checkout pull/554
$ git pull https://git.openjdk.java.net/panama-foreign pull/554/head
Using Skara CLI tools
Check out this PR locally:
$ git pr checkout 554
View PR using the GUI difftool:
$ git pr show -t 554
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/panama-foreign/pull/554.diff