8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken #554

JornVernee · 2021-06-10T12:03:59Z

The problem is that we only add a single 0 byte as a null terminator, regardless of the charset used. For wider char sets, more 0 bytes need to be added. For instance, for UTF_16LE two 0 bytes need to be added.

This patch fixes the issue by adding the null terminator to the Java string, and only then encoding it as a byte[].
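
To make the failure mode concrete, here is a small sketch that uses only `java.nio.charset` and makes no Foreign Linker API calls. The class and method names are invented for illustration and are not the code in the patch. It compares appending a single 0 byte after encoding with encoding the string together with its terminator.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not the CLinker implementation.
class TerminatorDemo {
    // Behaviour described as broken: encode, then append a single 0 byte.
    static byte[] encodeThenAppendZero(String s, Charset cs) {
        byte[] encoded = s.getBytes(cs);
        return Arrays.copyOf(encoded, encoded.length + 1); // extra slot is 0
    }

    // Approach described in this PR: append '\0' to the string, then encode.
    static byte[] appendThenEncode(String s, Charset cs) {
        return (s + '\0').getBytes(cs);
    }

    public static void main(String[] args) {
        Charset cs = StandardCharsets.UTF_16LE;
        System.out.println(Arrays.toString(encodeThenAppendZero("x", cs)));
        System.out.println(Arrays.toString(appendThenEncode("x", cs)));
    }
}
```

With UTF_16LE the first variant yields an odd number of bytes, so a reader scanning 16-bit units would have to look one byte past the buffer to find a terminator. With UTF-8 both variants produce the same bytes.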

Progress

Change must not contain extraneous whitespace
Change must be properly reviewed

Issue

JDK-8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken

Reviewers

Maurizio Cimadamore (@mcimadamore - Committer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-foreign pull/554/head:pull/554
$ git checkout pull/554

Update a local copy of the PR:
$ git checkout pull/554
$ git pull https://git.openjdk.java.net/panama-foreign pull/554/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 554

View PR using the GUI difftool:
$ git pr show -t 554

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-foreign/pull/554.diff

bridgekeeper · 2021-06-10T12:05:22Z

👋 Welcome back jvernee! A progress list of the required criteria for merging this PR into foreign-memaccess+abi will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

mlbridge · 2021-06-10T12:12:44Z

Webrevs

mcimadamore

Good catch - I think the fix can be made slightly more efficient

mcimadamore · 2021-06-10T12:33:23Z

src/jdk.incubator.foreign/share/classes/jdk/incubator/foreign/CLinker.java

+        return toCString(addNullTerminator(str).getBytes(), allocator);
+    }
+
+    private static String addNullTerminator(String str) {


This is gonna allocate another string in an already allocation-heavy code. Wouldn't it be better to just add the correct termination sequence in the segment? I suggest running StrLen benchmark before/after to make sure string conversion performance isn't negatively impacted.

I'll check the benchmark.

I looked at the Charset API, and AFAICS the only way to get the number of bytes for the null terminator would be to create a String with the value "\0", encode that, and check the length of the resulting byte[]. I thought going the string concat route would have better chances of being optimized.

The benchmark results are as follows:

Before:

Benchmark                        (size)  Mode  Cnt    Score    Error  Units
StrLenTest.panama_strlen_prefix       5  avgt   30  124.874 ± 15.751  ns/op
StrLenTest.panama_strlen_prefix      20  avgt   30  131.683 ±  6.011  ns/op
StrLenTest.panama_strlen_prefix     100  avgt   30  161.046 ± 10.580  ns/op

After:

Benchmark                        (size)  Mode  Cnt    Score    Error  Units
StrLenTest.panama_strlen_prefix       5  avgt   30  130.758 ±  5.691  ns/op
StrLenTest.panama_strlen_prefix      20  avgt   30  145.012 ±  6.804  ns/op
StrLenTest.panama_strlen_prefix     100  avgt   30  179.992 ±  6.457  ns/op

I think C2 should be able to eliminate the intermediate string, but there's still a slight regression.

Wouldn't it be better to just add the correct termination sequence in the segment?

The problem is finding out how many bytes to add. Looking at Charset again, there really doesn't seem to be a way to get the number of bytes per character. The closest seems to be charset.newEncoder().averageBytesPerChar(), which I'm not sure is what we want. I'll ask around as well.
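
For reference, the "encode `"\0"` and check the resulting `byte[]`" idea mentioned earlier could look roughly like the sketch below. The helper name `terminatorLength` is an assumption made for illustration. It also prints `averageBytesPerChar()` next to it, which is an estimate reported by the encoder rather than a fixed per-character width.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of the patch.
class TerminatorLength {
    // Number of bytes the charset produces when encoding U+0000 on its own.
    static int terminatorLength(Charset cs) {
        return "\0".getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": terminator bytes = " + terminatorLength(cs)
                    + ", averageBytesPerChar = " + cs.newEncoder().averageBytesPerChar());
        }
    }
}
```

One caveat with this approach: a charset whose encoder emits a byte-order mark, such as `UTF_16`, would count the mark as part of the result, so the helper is not fully general.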

mlbridge · 2021-06-11T12:55:36Z

Mailing list message from Duncan Gittins on panama-dev:

I've had problems with Windows String conversions to/from wide strings
using CLinker toCString / toJavaString, so I switched to using kernel32.dll
MultiByteToWideChar / WideCharToMultiByte. Hopefully your fix will
address the issue with toCString(s, UTF_16LE).

I don't think the reverse conversion works using CLinker.toJavaString. It
may help to verify by changing
"test/jdk/java/foreign/TestToCStringWide.java.testStrings()" to handle
an input array of strings:

    for (String testString : new String[] {"", "x", "testing"} ) { ...

.. and also check that the reverse operation returns the original:

    String outString = CLinker.toJavaString(text, charset);
    assertEquals(testString, outString);

Kind regards

Duncan

On 10/06/2021 13:12, Jorn Vernee wrote:

mlbridge · 2021-06-11T16:43:34Z

Mailing list message from Duncan Gittins on panama-dev:

A better range of test strings for Windows <=> wide char conversions and
back using CLinker.toCString / CLinker.toJavaString might be:

private static final String [] STRINGS = {
"","X","12345","testing".repeat(5)
,"euro \u20AC"
,"yen \u00a5"
,"Small-Omega \u03C9"
,"umlaut \u00FC".repeat(2000)
};

Kind regards

Duncan

On Fri, 11 Jun 2021 at 13:54, Duncan Gittins <duncan.gittins at gmail.com>
wrote:

I've had problems with Windows String conversions to/from wide strings
using CLinker toCString / toJavaString, so I switched to using kernel32.dll
MultiByteToWideChar / WideCharToMultiByte. Hopefully your fix will
address the issue with toCString(s, UTF_16LE).

I don't think the reverse conversion works using CLinker.toJavaString. It
may help to verify by changing
"test/jdk/java/foreign/TestToCStringWide.java.testStrings()" to handle
an input array of strings:

    for (String testString : new String[] {"", "x", "testing"} ) { ...

.. and also check that the reverse operation returns the original:

    String outString = CLinker.toJavaString(text, charset);
    assertEquals(testString, outString);

Kind regards

Duncan

On 10/06/2021 13:12, Jorn Vernee wrote:

The problem is that we only add a single 0 byte as a null terminator,
regardless of the charset used. For wider char sets, more 0 bytes need to
be added. For instance, for UTF_16LE two 0 bytes need to be added.

This patch fixes the issue by adding the null terminator to the Java
string, and only then encoding it as a `byte[]`.
https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=554&range=00
pull/554/head:pull/554

JornVernee · 2021-06-11T19:49:52Z

I don't think reverse conversion works using Clinker.toJavaString.

Good catch, the toJavaString version also looks broken, since it looks for the first 0 byte, but that might just be the high-order byte of a character, which just happens to be 0, for wider char sets (for some reason I had assumed this was okay when thinking about it before).

Though we use Charset to do the encoding, it's probably not a bad idea to test a couple different strings as well.
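
The false-terminator problem can be reproduced on a plain `byte[]` without any native code. The sketch below, with helper names invented for illustration, compares a byte-wise scan (the way C `strlen` works) with a scan in 2-byte units on a UTF_16LE-encoded string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the toJavaString implementation.
class FirstZeroByte {
    // Byte-wise scan, like C strlen: index of the first 0 byte.
    static int byteWiseLength(byte[] data) {
        int i = 0;
        while (i < data.length && data[i] != 0) {
            i++;
        }
        return i;
    }

    // Scan in 2-byte units: byte offset of the first all-zero unit.
    static int twoByteUnitLength(byte[] data) {
        int i = 0;
        while (i + 1 < data.length && (data[i] != 0 || data[i + 1] != 0)) {
            i += 2;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] wide = "testing\0".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(byteWiseLength(wide));    // stops inside the first character
        System.out.println(twoByteUnitLength(wide)); // byte length of "testing"
    }
}
```

For ASCII-range text in UTF_16LE every second byte is 0, so the byte-wise scan stops after the first byte. In UTF_16BE it would stop immediately.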

openjdk · 2021-06-15T13:25:26Z

@JornVernee this pull request can not be integrated into foreign-memaccess+abi due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout Windows_String_Encoding
git fetch https://git.openjdk.java.net/panama-foreign foreign-memaccess+abi
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge foreign-memaccess+abi"
git push

JornVernee · 2021-06-15T15:25:29Z

When looking into toJavaString, the problems turned out to be much greater. The problem essentially boils down to not having a strlen function for any arbitrary Charset, so we can only support a certain subset of Charsets for which we know our current way of determining the native string's length works.

We discussed this at length among the team members, and arrived at the intermediate conclusion to only support the 'platform native' Charset. This turns out to be tricky as well, though, as this charset, which is also called the 'execution character set' in C lingo, is determined based on a compiler setting at build time of the native code. With GCC and Clang the default is UTF-8, while on Windows it depends on the current code page. While there is a way to get the current code page of the runtime system and determine the character set from that, we would not be able to avoid issues with code page mismatches between the build environment and runtime environment on Windows.

While it would still technically be possible to support different character sets as long as they work with strlen, at present there is no way to detect this for an arbitrary character set. So, if we kept the Charset parameter, we would not be able to sanity check it, which doesn't seem great either.

As a result of all this, for now we have arrived at the decision to only support the UTF-8 Charset for the toCString and toJavaString methods, and to leave encoding and decoding using other character sets (including determining the length of a native string) to be implemented manually.

I've updated this PR to remove the overloads that accept a Charset, and updated the implementation to always use UTF-8. I've added several test cases as well that test Unicode characters that get encoded with different amounts of bytes in UTF-8.

Notice that the prime focus for this patch is stabilization (for JDK 17 as well). Perhaps in the future these APIs could be expanded to support more character sets again.
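
As a quick illustration of why UTF-8 works with a byte-wise `strlen`, the sketch below encodes characters that take 1, 2, 3 and 4 bytes in UTF-8 and checks that none of the encodings contains a 0 byte. The sample strings and helper names are chosen here for illustration and are not necessarily the ones used in the test added by this PR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the test added by this PR.
class Utf8Widths {
    // Number of bytes UTF-8 uses for the given string.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the UTF-8 encoding of s contains a 0 byte.
    static boolean containsZeroByte(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Latin X, u with umlaut, euro sign, and an emoji outside the BMP.
        String[] samples = { "X", "\u00FC", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(utf8Length(s) + " byte(s), zero byte inside: "
                    + containsZeroByte(s));
        }
    }
}
```

In standard UTF-8 only U+0000 itself encodes to a 0 byte, so a single 0 byte is an unambiguous terminator regardless of how many bytes the other characters need.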

test/jdk/java/foreign/TestStringEncoding.java

mcimadamore

Looks sensible!

openjdk · 2021-06-15T18:11:48Z

@JornVernee This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken

Reviewed-by: mcimadamore

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 142 new commits pushed to the foreign-memaccess+abi branch:

599b146: Merge master
37791a8: Automatic merge of jdk:master into master
94d0b0f: 8268565: runtime/records/RedefineRecord.java should be run in driver mode
df65237: 8267930: Refine code for loading hsdis library
2e900da: 8268574: ProblemList tests failing due to UseBiasedLocking going away
4fd2a14: 8267556: Enhance class paths check during runtime
8c8422e: 8267893: Improve jtreg test failure handler do get native/mixed stack traces for cores and live processes
1e1039a: 8268223: Problemlist vmTestbase/nsk/jdi/HiddenClass/events/events001.java
78cb677: 8268539: several serviceability/sa tests should be run in driver mode
7267227: 8268361: Fix the infinite loop in next_line
... and 132 more: https://git.openjdk.java.net/panama-foreign/compare/3cc47de0a6e3e2b29aa663478b2072f6ed40b054...foreign-memaccess+abi

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the foreign-memaccess+abi branch, type /integrate in a new comment.

JornVernee · 2021-06-15T18:38:26Z

/integrate

openjdk · 2021-06-15T18:40:09Z

Going to push as commit 782aeb4.
Since your change was applied there have been 142 commits pushed to the foreign-memaccess+abi branch:

599b146: Merge master
37791a8: Automatic merge of jdk:master into master
94d0b0f: 8268565: runtime/records/RedefineRecord.java should be run in driver mode
df65237: 8267930: Refine code for loading hsdis library
2e900da: 8268574: ProblemList tests failing due to UseBiasedLocking going away
4fd2a14: 8267556: Enhance class paths check during runtime
8c8422e: 8267893: Improve jtreg test failure handler do get native/mixed stack traces for cores and live processes
1e1039a: 8268223: Problemlist vmTestbase/nsk/jdi/HiddenClass/events/events001.java
78cb677: 8268539: several serviceability/sa tests should be run in driver mode
7267227: 8268361: Fix the infinite loop in next_line
... and 132 more: https://git.openjdk.java.net/panama-foreign/compare/3cc47de0a6e3e2b29aa663478b2072f6ed40b054...foreign-memaccess+abi

Your commit was automatically rebased without conflicts.

openjdk · 2021-06-15T18:40:20Z

@JornVernee Pushed as commit 782aeb4.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

overheadhunter · 2021-07-09T07:42:14Z

leave encoding and decoding using other character sets (including determining the length of a native string) to be implemented manually.

Can you give us a hint as to what such manual re-encoding could look like if a C lib wants to consume a different charset?

If I call CLinker.toCString(new String(bytes, UTF-16BE)), it'll still be converted to UTF-8 internally.

JornVernee · 2021-07-09T11:08:24Z

If I call CLinker.toCString(new String(bytes, UTF-16BE)), it'll still be converted to UTF-8 internally.

Note that the char set that is passed to the String constructor only indicates the encoding of the bytes that are being passed. These bytes are then converted to one of the internal encodings used by Java strings to produce the new String, and then toCString converts from that String back to a UTF-8 encoded memory region.
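
A small round trip in plain Java shows this. The class and method names below are invented for illustration: the charset given to the constructor describes the input bytes, and asking the resulting String for UTF-8 produces a different byte sequence.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of decode-then-encode.
class ReencodeDemo {
    // Decode UTF-16BE bytes to a String, then encode that String as UTF-8.
    static byte[] utf16beToUtf8(byte[] utf16be) {
        String decoded = new String(utf16be, StandardCharsets.UTF_16BE);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf16be = { 0x20, (byte) 0xAC }; // U+20AC (euro sign) in UTF-16BE
        System.out.println(Arrays.toString(utf16beToUtf8(utf16be)));
    }
}
```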

For encoding to a native representation, here is an example that assumes the native representation of the string should be terminated with 2 null bytes (the exact format the native string should have depends ultimately on the domain, and what the library expects):

String input = foo(); // get a Java string from somewhere
// Using a standard null terminated format here, but the library could also expect a different format
byte[] bytes = (input + '\0').getBytes(UTF_16BE);
MemorySegment text = MemorySegment.allocateNative(bytes.length);
text.copyFrom(MemorySegment.ofArray(bytes));
// use 'text'

For converting from a native string to a Java string a way to determine the length of a string is needed. Examples are strlen and wcslen, but it depends on the encoding the string is in, and again the domain (e.g. the library might expose a function to determine the length of the strings it returns).

Then something like this could be used:

MemoryAddress input = foo(); // get a native string from somewhere
int length = utf16be_string_length(input);
byte[] bytes = new byte[length];
MemorySegment inputSegment = input.asSegment(length, ResourceScope.newImplicitScope());
MemorySegment.ofArray(bytes).copyFrom(inputSegment);
String text = new String(bytes, UTF_16BE);
// use 'text'
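
`utf16be_string_length` above is a placeholder. One way it might be written is sketched below, under the assumption that the string ends with a 16-bit zero code unit located at an even byte offset. To keep the sketch runnable without the incubator module, a `ByteBuffer` stands in for the native memory; with the foreign memory API the same loop would read from the segment instead. The name `utf16StringLength` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; a ByteBuffer stands in for native memory.
class Utf16Length {
    // Byte length of a UTF-16 string up to, but not including, the first
    // 16-bit zero code unit. Throws if no terminator is found.
    static int utf16StringLength(ByteBuffer buf) {
        for (int offset = 0; offset + 1 < buf.limit(); offset += 2) {
            if (buf.getShort(offset) == 0) { // zero is zero in either byte order
                return offset;
            }
        }
        throw new IllegalArgumentException("no 16-bit terminator found");
    }

    public static void main(String[] args) {
        byte[] encoded = ("euro \u20AC" + '\0').getBytes(StandardCharsets.UTF_16BE);
        int length = utf16StringLength(ByteBuffer.wrap(encoded));
        String text = new String(encoded, 0, length, StandardCharsets.UTF_16BE);
        System.out.println(length + " bytes: " + text);
    }
}
```

Throwing when no terminator is found is a design choice for the sketch. A real native pointer has no known limit, so a library-provided length function (as mentioned above) is preferable when one exists.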

openjdk bot added the rfr Ready for review label Jun 10, 2021

mcimadamore reviewed Jun 10, 2021

Alternative fix: limit char set used to UTF-8 only.

bd6cbe1

JornVernee force-pushed the Windows_String_Encoding branch from 1983faa to bd6cbe1 Compare June 15, 2021 13:24

openjdk bot added the merge-conflict label Jun 15, 2021

Merge branch 'foreign-memaccess+abi' into Windows_String_Encoding

9df822a

openjdk bot removed the merge-conflict label Jun 15, 2021

JornVernee marked this pull request as draft June 15, 2021 13:28

openjdk bot removed the rfr Ready for review label Jun 15, 2021

Fix tests after merge

aba2b1a

JornVernee commented Jun 15, 2021

Typo in comment

273acec

JornVernee marked this pull request as ready for review June 15, 2021 15:27

openjdk bot added the rfr Ready for review label Jun 15, 2021

mcimadamore approved these changes Jun 15, 2021

openjdk bot added the ready Ready to be integrated label Jun 15, 2021

openjdk bot closed this Jun 15, 2021

openjdk bot added the integrated Pull request has been integrated label Jun 15, 2021

openjdk bot removed ready Ready to be integrated rfr Ready for review labels Jun 15, 2021

JornVernee mentioned this pull request Jun 16, 2021

8268888: Upstream 8268230: Foreign Linker API & Windows user32/kernel32: String conversion seems broken openjdk/jdk17#77

Closed

3 tasks

JornVernee deleted the Windows_String_Encoding branch November 2, 2021 15:45
