Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken #554

Conversation

JornVernee
Copy link
Member

@JornVernee JornVernee commented Jun 10, 2021

The problem is that we only add a single 0 byte as a null terminator, regardless of the charset used. For wider char sets, more 0 bytes need to be added. For instance, for UTF_16LE two 0 bytes need to be added.

This patch fixes the issue by adding the null terminator to the Java string, and only then encoding it as a byte[].


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Issue

  • JDK-8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-foreign pull/554/head:pull/554
$ git checkout pull/554

Update a local copy of the PR:
$ git checkout pull/554
$ git pull https://git.openjdk.java.net/panama-foreign pull/554/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 554

View PR using the GUI difftool:
$ git pr show -t 554

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-foreign/pull/554.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 10, 2021

👋 Welcome back jvernee! A progress list of the required criteria for merging this PR into foreign-memaccess+abi will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Ready for review label Jun 10, 2021
@mlbridge
Copy link

mlbridge bot commented Jun 10, 2021

Webrevs

Copy link
Collaborator

@mcimadamore mcimadamore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - I think the fix can be made slightly more efficient

return toCString(addNullTerminator(str).getBytes(), allocator);
}

private static String addNullTerminator(String str) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is gonna allocate another string in an already allocation-heavy code. Wouldn't it be better to just add the correct termination sequence in the segment? I suggest running StrLen benchmark before/after to make sure string conversion performance isn't negatively impacted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check the benchmark.

I looked at the CharSet API, and AFAICS the only way to get the number of bytes for the null terminator would be to create a String with the value "\0" then encode that and check the resulting byte[]. I thought going the string concat route would have better chances of being optimized.

Copy link
Member Author

@JornVernee JornVernee Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark results are as follows:

Before:

Benchmark                        (size)  Mode  Cnt    Score    Error  Units
StrLenTest.panama_strlen_prefix       5  avgt   30  124.874 � 15.751  ns/op
StrLenTest.panama_strlen_prefix      20  avgt   30  131.683 �  6.011  ns/op
StrLenTest.panama_strlen_prefix     100  avgt   30  161.046 � 10.580  ns/op

After:

Benchmark                        (size)  Mode  Cnt    Score   Error  Units
StrLenTest.panama_strlen_prefix       5  avgt   30  130.758 � 5.691  ns/op
StrLenTest.panama_strlen_prefix      20  avgt   30  145.012 � 6.804  ns/op
StrLenTest.panama_strlen_prefix     100  avgt   30  179.992 � 6.457  ns/op

I think C2 should be able to eliminate the intermediate string, but there's still a slight regression.

Wouldn't it be better to just add the correct termination sequence in the segment?

The problem is finding out how many bytes to add. Looking at Charset again, there really doesn't seem to be a way to get the number of bytes per character. The closest seems to be charset.newEncoder().averageBytesPerChar(), which I'm not sure is what we want. I'll ask around as well.

@mlbridge
Copy link

mlbridge bot commented Jun 11, 2021

Mailing list message from Duncan Gittins on panama-dev:

I've had problems with Windows String conversions to/from wide string
using Clinker toCString / toJavaString so switched to using kernel32.dll
MultiByteToWideChar / WideCharToMultiByte. Hopefully your fix will
address the issue with toCString(s, UTF_16LE).

I don't think reverse conversion works using Clinker.toJavaString. It
may help to verify by changing
"test/jdk/java/foreign/TestToCStringWide.java.testStrings()" to handled
input array of strings:

??? for (String testString : new String[] {"", "x", "testing"} ) { ...

.. and also checked the reverse operation returns the original:

?????? String outString = CLinker.toJavaString(text, charset);
?????? assertEquals(testString, outString);

Kind regards

Duncan

On 10/06/2021 13:12, Jorn Vernee wrote:

1 similar comment
@mlbridge
Copy link

mlbridge bot commented Jun 11, 2021

Mailing list message from Duncan Gittins on panama-dev:

I've had problems with Windows String conversions to/from wide string
using Clinker toCString / toJavaString so switched to using kernel32.dll
MultiByteToWideChar / WideCharToMultiByte. Hopefully your fix will
address the issue with toCString(s, UTF_16LE).

I don't think reverse conversion works using Clinker.toJavaString. It
may help to verify by changing
"test/jdk/java/foreign/TestToCStringWide.java.testStrings()" to handled
input array of strings:

??? for (String testString : new String[] {"", "x", "testing"} ) { ...

.. and also checked the reverse operation returns the original:

?????? String outString = CLinker.toJavaString(text, charset);
?????? assertEquals(testString, outString);

Kind regards

Duncan

On 10/06/2021 13:12, Jorn Vernee wrote:

@mlbridge
Copy link

mlbridge bot commented Jun 11, 2021

Mailing list message from Duncan Gittins on panama-dev:

A better range of test strings for Windows <=> wide char conversions and
back using CLinker.toCString / CLinker.toJavaString might be:

private static final String [] STRINGS = {
"","X","12345","testing".repeat(5)
,"euro \u20AC"
,"yen \u00a5"
,"Small-Omega \u03C9"
,"umlaut \u00FC".repeat(2000)
};

Kind regards

Duncan

On Fri, 11 Jun 2021 at 13:54, Duncan Gittins <duncan.gittins at gmail.com>
wrote:

I've had problems with Windows String conversions to/from wide string
using Clinker toCString / toJavaString so switched to using kernel32.dll
MultiByteToWideChar / WideCharToMultiByte. Hopefully your fix will
address the issue with toCString(s, UTF_16LE).

I don't think reverse conversion works using Clinker.toJavaString. It
may help to verify by changing
"test/jdk/java/foreign/TestToCStringWide.java.testStrings()" to handled
input array of strings:

 for \(String testString \: new String\[\] \{\"\"\, \"x\"\, \"testing\"\} \) \{ \.\.\.

.. and also checked the reverse operation returns the original:

    String outString \= CLinker\.toJavaString\(text\, charset\)\;
    assertEquals\(testString\, outString\)\;

Kind regards

Duncan

On 10/06/2021 13:12, Jorn Vernee wrote:

The problem is that we only add a single 0 byte as a null terminator,
regardless of the charset used. For wider char sets, more 0 bytes need to
be added. For instance, for UTF_16LE two 0 bytes need to be added.

This patch fixes the issue by adding the null terminator to the Java
string, and only then encoding it as a `byte[]`.
https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=554&range=00
pull/554/head:pull/554

1 similar comment
@mlbridge
Copy link

mlbridge bot commented Jun 11, 2021

Mailing list message from Duncan Gittins on panama-dev:

A better range of test strings for Windows <=> wide char conversions and
back using CLinker.toCString / CLinker.toJavaString might be:

private static final String [] STRINGS = {
"","X","12345","testing".repeat(5)
,"euro \u20AC"
,"yen \u00a5"
,"Small-Omega \u03C9"
,"umlaut \u00FC".repeat(2000)
};

Kind regards

Duncan

On Fri, 11 Jun 2021 at 13:54, Duncan Gittins <duncan.gittins at gmail.com>
wrote:

I've had problems with Windows String conversions to/from wide string
using Clinker toCString / toJavaString so switched to using kernel32.dll
MultiByteToWideChar / WideCharToMultiByte. Hopefully your fix will
address the issue with toCString(s, UTF_16LE).

I don't think reverse conversion works using Clinker.toJavaString. It
may help to verify by changing
"test/jdk/java/foreign/TestToCStringWide.java.testStrings()" to handled
input array of strings:

 for \(String testString \: new String\[\] \{\"\"\, \"x\"\, \"testing\"\} \) \{ \.\.\.

.. and also checked the reverse operation returns the original:

    String outString \= CLinker\.toJavaString\(text\, charset\)\;
    assertEquals\(testString\, outString\)\;

Kind regards

Duncan

On 10/06/2021 13:12, Jorn Vernee wrote:

The problem is that we only add a single 0 byte as a null terminator,
regardless of the charset used. For wider char sets, more 0 bytes need to
be added. For instance, for UTF_16LE two 0 bytes need to be added.

This patch fixes the issue by adding the null terminator to the Java
string, and only then encoding it as a `byte[]`.
https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=554&range=00
pull/554/head:pull/554

@JornVernee
Copy link
Member Author

JornVernee commented Jun 11, 2021

I don't think reverse conversion works using Clinker.toJavaString.

Good catch, the toJavaString version also looks broken, since it looks for the first 0 byte, but that might just be the high-order byte of a character, which just happens to be 0, for wider char sets (for some reason I had assumed this was okay when thinking about it before).

Though we use Charset to do the encoding, it's probably not a bad idea to test a couple different strings as well.

@openjdk
Copy link

openjdk bot commented Jun 15, 2021

@JornVernee this pull request can not be integrated into foreign-memaccess+abi due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout Windows_String_Encoding
git fetch https://git.openjdk.java.net/panama-foreign foreign-memaccess+abi
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge foreign-memaccess+abi"
git push

@openjdk openjdk bot removed the merge-conflict label Jun 15, 2021
@JornVernee JornVernee marked this pull request as draft June 15, 2021 13:28
@openjdk openjdk bot removed the rfr Ready for review label Jun 15, 2021
@JornVernee
Copy link
Member Author

JornVernee commented Jun 15, 2021

When looking into toJavaString, the problems turned out to be much greater. The problem essentially boils down to not having a strlen function for any arbitrary Charset, so we can only support a certain subset of Charsets for which we know our current way of determining the native string's length works.

We discussed this at length among the team members, and arrived at the intermediate conclusion to only support the 'platform native' Charset. Though, this turns out to be tricky as well, as this charset, which is also called the 'execution character set' in C lingo, is determined based on a compiler setting at build time of the native code. With GCC and Clang the default is UTF-8, while on Windows it depends on the current code page. While there is a way to get the current code page of the runtime system and determine the character set from that, we would not be able to avoid issues with code page mismatches between the build environment and runtime environment on Windows,

While it would still technically be possible to support different character sets as long as they work with strlen, at present there is no way to detect this for an arbitrary character set. So, if we kept the Charset parameter, we would not be able to sanity check it, which doesn't seem great either.

As a result of all this, for now we have arrived at the decision to only support the UTF-8 Charset for the toCString and toJavaString methods, and to leave encoding and decoding using other character sets (including determining the length of a native string) to be implemented manually.

I've updated this PR to remove the overloads that accept a Charset, and updated the implementation to always use UTF-8. I've added several test cases as well that test Unicode characters that get encoded with different amounts of bytes in UTF-8.

Notice that the prime focus for this patch is stabilization (for JDK 17 as well). Perhaps in the future these APIs could be expanded to support more character sets again.

@JornVernee JornVernee marked this pull request as ready for review June 15, 2021 15:27
@openjdk openjdk bot added the rfr Ready for review label Jun 15, 2021
Copy link
Collaborator

@mcimadamore mcimadamore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible!

@openjdk
Copy link

openjdk bot commented Jun 15, 2021

@JornVernee This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken

Reviewed-by: mcimadamore

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 142 new commits pushed to the foreign-memaccess+abi branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the foreign-memaccess+abi branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Ready to be integrated label Jun 15, 2021
@JornVernee
Copy link
Member Author

/integrate

@openjdk
Copy link

openjdk bot commented Jun 15, 2021

Going to push as commit 782aeb4.
Since your change was applied there have been 142 commits pushed to the foreign-memaccess+abi branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Jun 15, 2021
@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 15, 2021
@openjdk openjdk bot removed ready Ready to be integrated rfr Ready for review labels Jun 15, 2021
@openjdk
Copy link

openjdk bot commented Jun 15, 2021

@JornVernee Pushed as commit 782aeb4.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@overheadhunter
Copy link
Contributor

leave encoding and decoding using other character sets (including determining the length of a native string) to be implemented manually.

Can you give us a hint, how such manual re-encoding could look like, if an C lib wants to consume a different charset?

If I call CLinker.toCString(new String(bytes, UTF-16BE)), it'll still be converted to UTF-8 internally.

@JornVernee
Copy link
Member Author

If I call CLinker.toCString(new String(bytes, UTF-16BE)), it'll still be converted to UTF-8 internally.

Note that the char set that is passed to the String constructor only indicates the encoding of the bytes that are being passed. These bytes are then converted to one of the internal encodings used by Java strings to produce the new String, and then toCString converts from that String back to a UTF-8 encoded memory region.


For encoding to a native representation, here is an example that assumes the native representation of the string should be terminated with 2 null bytes (the exact format the native string should have depends ultimately on the domain, and what the library expects):

String input = foo(); // get a Java string from somewhere
// Using a standard null terminated format here, but the library could also expect a different format
byte[] bytes = (input + '\0').getBytes(UTF_16BE);
MemorySegment text = MemorySegment.allocateNative(bytes.length);
text.copyFrom(MemorySegment.ofArray(bytes));
// use 'text'

For converting from a native string to a Java string a way to determine the length of a string is needed. Examples are strlen and wcslen, but it depends on the encoding the string is in, and again the domain (e.g. the library might expose a function to determine the length of the strings it returns).

Then something like this could be used:

MemoryAddress input = foo(); // get a native string from somewhere
int length = utf16be_string_length(input);
byte[] bytes = new byte[length];
MemorySegment inputSegment = input.asSegment(length, ResourceScope.newImplicitScope());
MemorySegment.ofArray(bytes).copyFrom(inputSegment);
String text = new String(bytes, UTF_16BE);
// use 'text'

@JornVernee JornVernee deleted the Windows_String_Encoding branch November 2, 2021 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

3 participants