
8308858: FFM API and strings #836

Closed

Conversation


@mcimadamore mcimadamore commented Jun 2, 2023

This patch is an attempt to generalize string support in FFM API. Currently we only support UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.

As a result, this patch adds Charset-accepting variants of MemorySegment::getString, MemorySegment::setString and MemorySegment::allocateString (the methods have also been renamed to drop their Utf8 suffix).

However, not all charsets can be supported. In fact, we only support the charsets defined in the StandardCharsets class. While we tried to use charset encoders/decoders to support arbitrary charsets, the Charset interface is too general: there is no guarantee that its output will be interoperable with a null-terminated C string.

In C (and C++), strings are expressed as either char* or wchar_t*. In the former case, regardless of encoding, the terminator char is expected to be 0x00 (i.e. the size of a char value), whereas in the latter case the terminator is expected to be 0x0000 or 0x00000000 (the size of a wchar_t value, which depends on the platform).

Moreover, C more or less requires that a string, whether expressed as char* or wchar_t*, cannot contain any element in the array that equals the terminator char. That is, in the case of extended representations, surrogates must not use the reserved terminator values.

There is really no way to map these restrictions onto Java charsets. A Java charset can do pretty much anything it wants, even adding extra characters at the end of a string (past the terminator) via the encoder's flush method. As a result, we can only really work with charsets we know about (and the list in StandardCharsets seems a good starting set).

The subject of strings is quite tricky, and we noticed that other frameworks tend to get it wrong, often assuming that a wide string has only a single-byte terminator.

An alternative would be to do what LWJGL does and provide multiple methods for different encodings, e.g.:

https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)

Anyway, I thought it would be a good idea to propose this PR and see what kind of feedback it generates :-)


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/panama-foreign.git pull/836/head:pull/836
$ git checkout pull/836

Update a local copy of the PR:
$ git checkout pull/836
$ git pull https://git.openjdk.org/panama-foreign.git pull/836/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 836

View PR using the GUI difftool:
$ git pr show -t 836

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/panama-foreign/pull/836.diff

Webrev

Link to Webrev Comment


@ForceInline
private static int strlen_byte(MemorySegment segment, long start) {
// iterate until overflow (String can only hold a byte[], whose length can be expressed as an int)
@mcimadamore mcimadamore Jun 2, 2023

Perhaps we could see if functions such as strnlen, wcsnlen are supported, and, if so, create a trivial downcall method handle, and call that (as the underlying C function is likely to be a lot smarter than the simple loop here, and use vectorized instructions).
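For reference, the simple loop in question can be sketched as follows. This is an illustrative rewrite over a plain byte[] (the PR's actual code reads a MemorySegment, and the class/method names here are made up); the overflow-based bound reflects the comment above, since a Java String can only hold a byte[] whose length fits in an int.

```java
import java.nio.charset.StandardCharsets;

public class StrlenSketch {
    // Byte-wise scan for the 0x00 terminator; iterates until int overflow
    // would make the offset negative, mirroring the bound described above.
    public static int strlenByte(byte[] memory, int start) {
        for (int offset = start; offset >= 0; offset++) {
            if (memory[offset] == 0) {
                return offset - start;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        byte[] mem = "hello\0world".getBytes(StandardCharsets.UTF_8);
        System.out.println(strlenByte(mem, 0)); // prints 5
    }
}
```

A strnlen/wcsnlen downcall, as suggested above, would replace this scalar loop with a (likely vectorized) C library routine.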


bridgekeeper bot commented Jun 2, 2023

👋 Welcome back mcimadamore! A progress list of the required criteria for merging this PR into foreign-memaccess+abi will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Ready for review label Jun 2, 2023

mlbridge bot commented Jun 2, 2023

Webrevs

@JornVernee JornVernee left a comment

The use of @ForceInline should be removed, I think. Since it messes with the JIT's own heuristics, it should be avoided in general, unless we know that the contents of a method are likely to fold up to almost nothing (such as for MS::get*/set*). However, that is not the case here. Better not to mess with it and let the JIT, which has more information, make the decision, IMO. (At the end of the day, we bottom out into an out-of-line call to Unsafe::copyMemory anyway.)

@mcimadamore (Collaborator, Author)

The use of @ForceInline should be removed, I think. Since it messes with the JIT's own heuristics, it should be avoided in general, unless we know that the contents of a method are likely to fold up to almost nothing (such as for MS::get*/set*). However, that is not the case here. Better not to mess with it and let the JIT, which has more information, make the decision, IMO. (At the end of the day, we bottom out into an out-of-line call to Unsafe::copyMemory anyway.)

I tend to agree. But I believe copyMemory is an intrinsic:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/classfile/vmIntrinsics.hpp#L612


JornVernee commented Jun 2, 2023

I tend to agree. But I believe copyMemory is an intrinsic: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/classfile/vmIntrinsics.hpp#L612

Yes, sorry. It is an intrinsic, but it is intrinsified as an out-of-line call: https://github.com/openjdk/jdk/blob/aff9cea05468daf60fa80c7d9993b3aa8497b0c5/src/hotspot/share/opto/library_call.cpp#L4728-L4734

/**
* Miscellaneous functions to read and write strings, in various charsets.
*/
public class StringSupport {
Collaborator

Pedantic comment: The class may be final and declare a private constructor.

this.terminatorCharSize = terminatorCharSize;
}

public int getTerminatorCharSize() {
Collaborator

suggesting terminatorCharSize


minborg commented Jun 5, 2023

Maybe the default charset will be the one most likely to be used and, if so, we could provide some kind of optimized version (e.g. using a composed MethodHandle?) in the future. Perhaps it is good enough, but it would be interesting to see some benchmarks.


laeubi commented Jun 5, 2023

Moreover, C more or less requires that a string, whether expressed as char* or wchar_t* cannot contain any element in the array that is a terminator char.

Why not then use two methods:

  • get/setCString
  • get/setWString

that both accept a generic charset, and let the user choose a suitable one? The name of the charset does not really matter much (as long as both sides use the same byte -> char mapping).

@mcimadamore (Collaborator, Author)

Maybe the default charset will be the most likely to use and if so, we could provide some kind of optimized version (e.g. using a composed MethodHandle?) in the future. Perhaps it is good enough but it would be interesting to see some benchmarks.

See my comment on special-casing strlen in some cases (as I think that's the big gain to be had).

@mcimadamore (Collaborator, Author)

Moreover, C more or less requires that a string, whether expressed as char* or wchar_t*, cannot contain any element in the array that equals the terminator char.

Why not then use two methods:

* get/set `CString`

* get/set `WString`

that both accept a generic charset, and let the user choose a suitable one? The name of the charset does not really matter much (as long as both sides use the same byte -> char mapping).

Not sure about this. Yes, we could have a getSingleByteString and getWideString. But if we also support charsets, then we have to worry about the compatibility of the provided charset with the given string method. E.g. what if I pass a UTF-8 charset to getWideString? Or a UTF-16 charset to getSingleByteString? The approach described here has a single parameter, which controls everything else.

@mcimadamore (Collaborator, Author)

Moreover, C more or less requires that a string, whether expressed as char* or wchar_t*, cannot contain any element in the array that equals the terminator char.

Why not then use two methods:

* get/set `CString`

* get/set `WString`

that both accept a generic charset, and let the user choose a suitable one? The name of the charset does not really matter much (as long as both sides use the same byte -> char mapping).

Not sure about this. Yes, we could have a getSingleByteString and getWideString. But if we also support charsets, then we have to worry about the compatibility of the provided charset with the given string method. E.g. what if I pass a UTF-8 charset to getWideString? Or a UTF-16 charset to getSingleByteString? The approach described here has a single parameter, which controls everything else.

I guess an alternate design would be to ask the user how many terminator chars there should be (where that has to be a power of two). Then we could allow any charsets, including non-sensical ones. Something like:

getString(long offset) // shortcut for getString(offset, defaultCharset(), 1);
getString(long offset, Charset charset, int numTermChars)

setString(long offset, String string) // shortcut for setString(offset, defaultCharset(), 1, string);
setString(long offset, Charset charset, int numTermChars, String string)


JornVernee commented Jun 7, 2023

I guess an alternate design would be to ask the user how many terminator chars there should be (where that has to be a power of two). Then we could allow any charsets, including non-sensical ones. Something like:

getString(long offset) // shortcut for getString(offset, defaultCharset(), 1);
getString(long offset, Charset charset, int numTermChars)

setString(long offset, String string) // shortcut for setString(offset, defaultCharset(), 1, string);
setString(long offset, Charset charset, int numTermChars, String string)

For getString we need to determine the length of the string. And I think for setString we need to determine how many 0 bytes to put at the end? Really, those are the crucial elements, since the rest is just a bulk copy.

Maybe for getString we could just explicitly ask the user for the size of the string (excluding the terminator)? That way we don't need to determine the string length manually. For users there would still be a convenient wrapper that does the decoding and bulk copy. That would also work for loading substrings, where there is no null terminator directly after the native string, which might be nice.

I suppose for setString we could also just use "\0".getBytes(charset) and append that to the end of the native string, but as you said before, a charset encoder might add arbitrary trailing bytes at the end of the string, so I'm not sure this would work. And there might be issues with meta bytes, such as the BOM in UTF-16. So, I think just knowing the number of terminator chars isn't enough to make this work... And again, I feel like Charset is just not suited for handling native strings out of the box (though we can special-case some of them).
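The BOM concern above is easy to reproduce: encoding a lone terminator with Java's BOM-writing UTF-16 charset yields more than just zero bytes. (A quick standalone check, not code from the PR.)

```java
import java.nio.charset.StandardCharsets;

public class BomCheck {
    public static void main(String[] args) {
        // StandardCharsets.UTF_16 writes a big-endian byte-order mark when
        // encoding, so the "terminator" comes with two extra leading bytes.
        byte[] term = "\0".getBytes(StandardCharsets.UTF_16);
        StringBuilder hex = new StringBuilder();
        for (byte b : term) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(term.length + " bytes: " + hex.toString().trim());
        // prints: 4 bytes: FE FF 00 00
    }
}
```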


laeubi commented Jun 7, 2023

that both accept a generic charset, and let the user choose a suitable one? The name of the charset does not really matter much (as long as both sides use the same byte -> char mapping).

Not sure about this. Yes, we could have a getSingleByteString and getWideString. But if we also support charsets, then we have to worry about the compatibility of the provided charset with the given string method. E.g. what if I pass a UTF-8 charset to getWideString? Or a UTF-16 charset to getSingleByteString? The approach described here has a single parameter, which controls everything else.

Well, one can always do "bad" things (especially on the C side of the chain), but one usually tries to do "good" things, so if the native code uses a wchar_t* and expects it to contain UTF-8 chars with two zero bytes at the end, that is actually possible (in C).

So this would be more syntactic sugar (and probably a way to optimize access to the internal array if no transcoding is required) for (Java) String -> char* / wchar_t*. This came to mind because in JNI I often declare String types as byte[] on the JNI interface and have a helper method like this (sorry in advance for this hack):

private static byte[] cstring(String str) {
    if (str == null || str.isEmpty()) {
        return new byte[] { 0 };
    }
    byte[] bytes = str.getBytes(CHARSET);
    // Arrays.copyOf zero-pads: the extra element is the 0 terminator.
    return Arrays.copyOf(bytes, bytes.length + 1);
}

and on the C(++) side:

jbyte* j_str = env->GetByteArrayElements(jniStrBytes, nullptr);
char* c_str = reinterpret_cast<char*>(j_str);

so for FFM something like

setCString(long offset, String str, Charset c)
setWString(long offset, String str, Charset c)

should be able to handle any charset (as long as both sides agree on it, of course). That was just an idea that came to mind, as in C the "standard string" is just a bunch of bytes with one (or two) zeros at the end... and since the string length (or byte length after encoding) and the terminating bytes (C = 1, W = 2) are known, the user does not need to pass anything more than the above.
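The scheme described above — encode with an arbitrary charset, then append a terminator whose width is fixed by the C element type (1 for char*, 2 or 4 for wchar_t*) rather than by the charset — could be sketched with a hypothetical helper like this (names are illustrative, not from the PR):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TerminatedStrings {
    // Hypothetical helper: Arrays.copyOf zero-pads, so the extra termBytes
    // trailing elements form the terminator.
    public static byte[] terminated(String str, Charset charset, int termBytes) {
        byte[] body = str.getBytes(charset);
        return Arrays.copyOf(body, body.length + termBytes);
    }

    public static void main(String[] args) {
        System.out.println(terminated("hi", StandardCharsets.UTF_8, 1).length);    // prints 3
        System.out.println(terminated("hi", StandardCharsets.UTF_16LE, 2).length); // prints 6
    }
}
```

Note that this still inherits the problems discussed earlier in the thread: a BOM-writing charset, or an encoder that emits trailing bytes on flush, would break the scheme.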

@mcimadamore (Collaborator, Author)

For getString we need to determine the length of the string.

That was the purpose of numTermChars. E.g. it determines the size of the term char, as well as the "stride" by which the string is scanned.

@JornVernee (Member)

For getString we need to determine the length of the string.

That was the purpose of numTermChars. E.g. it determines the size of the term char, as well as the "stride" by which the string is scanned.

Sorry, that's what I meant. The numTermChars is used to determine the length of the string. I wanted to point that out/recap, before going into the alternative.


JornVernee commented Jun 7, 2023

terminating bytes (C = 1, W = 2)

One issue with this is that W = 2 is only true on Windows. With a more general approach we get back essentially to what Maurizio suggested.


mcimadamore commented Jun 7, 2023

To summarize, my feeling is that there are several things we could do:

  • have specific getXYZString methods, where XYZ is the name of the supported encoding. Doable; in practice what can go in XYZ is limited (as LWJGL suggests).
  • have a getString method that accepts a Charset and, based on the Charset, does the right thing (as in this PR).
  • have a getString method that accepts a terminator size/stride and a Charset. Based on the terminator size (which will probably be 1, 2 or 4 in practical cases), the routine will scan the string bytes using byte, short or int accesses, depending on the case, trying to find the terminator. In this case the choice of Charset is less important. An API note could explain that if a charset happens to use the terminator char for other purposes, or adds characters at the end of an encoding, that could be problematic.
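The terminator scan in the third option could look roughly like this. This is an illustrative sketch over a plain byte[] (the real accessor would read a MemorySegment with byte, short or int accesses depending on the terminator size), and the names are made up:

```java
public class TerminatorScan {
    // Scans with the given stride (1, 2 or 4) for a run of termSize zero
    // bytes, returning the string length in bytes (excluding the terminator).
    public static int findTerminator(byte[] memory, int termSize) {
        outer:
        for (int i = 0; i + termSize <= memory.length; i += termSize) {
            for (int j = 0; j < termSize; j++) {
                if (memory[i + j] != 0) {
                    continue outer;
                }
            }
            return i;
        }
        throw new IllegalArgumentException("No terminator found");
    }

    public static void main(String[] args) {
        byte[] utf16le = {0x68, 0x00, 0x69, 0x00, 0x00, 0x00}; // "hi" + terminator, UTF-16LE
        System.out.println(findTerminator(utf16le, 2)); // prints 4
        byte[] utf8 = {0x68, 0x69, 0x00}; // "hi" + terminator, UTF-8
        System.out.println(findTerminator(utf8, 1)); // prints 2
    }
}
```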


JornVernee commented Jun 7, 2023

I think the second option (this PR) is the most attractive user model, since we do all the work, though it limits the number of Charsets we can support.

I think the third option is the most flexible if we want to support any Charset, but I'm a little disappointed we end up with potential rough edges... (who knows what kind of bug reports we might end up with a few years down the line when somebody uses their custom charset). I also feel like we essentially kick the problem back to the user, since they now have to figure out the number of null terminator bytes. At that point, I think we're better off just letting the user implement the whole thing, since it's not that complicated (which has been the story so far for Charsets other than UTF-8). If we do that, we can keep the 'nice' API (this PR) for the Charsets that we know how to do all the work for.

Maybe a fourth option is:

  • avoid dealing with the terminator altogether. That would mean that, for getString, the user needs to give us an explicit size, in bytes, of the string. For setString, the user would have to insert the null terminator (whatever that may be) manually, using another call.

This is also kicking the problem back to the user.

I think out of all of these options, I still prefer the approach taken by this PR.


mcimadamore commented Jun 7, 2023

I also feel like we essentially kick the problem back to the user, since they now have to figure out the number of null terminator bytes.

On this I'm not 100% sure. I mean, there's a way in which what you say is correct, e.g. if you take it that the number of terminator bytes must be somewhat related to the charset selected by the user. But there's another way to look at this, which says that the number of terminator bytes is unrelated to the charset, and is only related to the notional C type used to model the character in the C string (e.g. char vs. wchar_t and, in the latter case, whether we're on a platform where wchar_t is 16 or 32 bits). Which is still kicking something back to the user, but in a slightly different shape.

That said, using random charsets can still be problematic if the charset ends up encoding multiple different chars into the terminator sequence (think of a custom character whose replacement string is the terminator sequence :-) ). The Charset API still seems more general than what a null-terminated string API can stomach.


laeubi commented Jun 8, 2023

But there's another way to look at this, which says that the number of terminator bytes is unrelated to the charset, and is only related to the notional C type used to model the character in the C string

That's exactly what I mean: while theoretically I can choose any charset and do strange things in Java by defining a custom one, practically I have no choice (most of the time), as I can't use a random charset but must use the one chosen by the C code I interact with, and that side will most probably choose one that fits the data type and C conventions.

So if this String-to-C-using-Charset support can make it easier for me (e.g. maybe handle the wchar_t 16 vs. 32 bit difference), I think that would be enough; for really special cases one can still write custom code. But most of the time, the charset used in C means "system encoding" (if not ASCII).

@mcimadamore (Collaborator, Author)

Thinking some more about this (and after reading some more documents), this paragraph stuck with me:

The size of a wide character type does not dictate what kind of text encodings a system can process, as conversions are available. (Old conversion code commonly overlook surrogates, however.) The historical circumstances of their adoption does also decide what types of encoding they prefer. A system influenced by Unicode 1.0, such as Windows, tends to mainly use "wide strings" made out of wide character units. Other systems such as the Unix-likes, however, tend to retain the 8-bit "narrow string" convention, using a multibyte encoding (almost universally UTF-8) to handle "wide" characters.[5]

It seems that the status quo is:

  • *nix systems stick with UTF-8 as the preferred form (and multi-byte char* strings). On these systems, wchar_t is typically 32 bits (verified on my system), that is, big enough to contain any extended char without the use of surrogate pairs;
  • Windows sticks with UTF-16 (and wide wchar_t* strings). This is mostly down to historical reasons (e.g. back when we thought UTF-16 was a big-enough fixed-size encoding).

So, if we adopt a charset-driven approach, it would seem that the most important charsets to support are (in order):

  • UTF-8 (this is by far the most common choice)
  • UTF-16 (for compatibility with Windows wide strings)
  • UTF-32 (for compatibility with *nix wide strings)

Note how the last bullet is not supported by the current PR (because UTF-32 is not part of StandardCharsets). This might be an issue when working with wide strings on e.g. Linux.

Alternatively, if we adopt a wide-driven approach, this boils down to having two flavors of strings accessors:

  • getMultiByteString
  • getWideString

Where the "stride" of getWideString is platform-dependent (2 bytes on Windows, 4 bytes on *nix systems). Now, as the quoted text above notes, there's formally no constraint on using a given encoding in either case. That said, I think there are obvious-enough defaults here, namely UTF-8 for multi-byte strings and either UTF-16 or UTF-32 for wide strings.

While these methods could also still accept a Charset parameter, there seems to be relatively little value in doing so. UTF-8 is virtually ubiquitous in multi-byte settings, whereas if you want to work with wide strings, you typically do so to avoid variable-size processing and/or to adhere more strictly to Unicode formats (so anything other than UTF-16 or UTF-32 here seems an odd choice).

As pointed out, even if we didn't provide a Charset-accepting variant, users would still be able to mix and match multi-byte and wide strings with any available charset - but they would have to do it manually which, given the corner^4 case nature of the problem, is not something I'm too concerned with.
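Selecting the wide "stride" per platform could be done along these lines. This is a hedged sketch, not the PR's code: the os.name check is a simplification, and UTF-32 must be looked up by name since it is not in StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WideCharsetSketch {
    // Hypothetical: wchar_t is 16 bits on Windows (UTF-16) and typically
    // 32 bits on *nix systems (UTF-32); little-endian on mainstream hardware.
    public static Charset wideCharset() {
        String os = System.getProperty("os.name").toLowerCase();
        return os.contains("win") ? StandardCharsets.UTF_16LE
                                  : Charset.forName("UTF-32LE");
    }

    public static void main(String[] args) {
        System.out.println(wideCharset()); // platform-dependent
    }
}
```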

@mcimadamore (Collaborator, Author)

getMultiByteString

For the record, I wanted to spell things out very clearly here: I'm not proposing to have a method called getMultiByteString, so much as I wanted to draw a closer parallel with what's in the C standards. As names, we can just use getString/getWideString.


laeubi commented Jun 8, 2023

At least ASCII and Windows-1252 are quite common in C, so maybe UTF-8 is superior but not always desired? Besides that, there is also "Pascal" encoding, with 8-bit and 16-bit variants :-D


mcimadamore commented Jun 8, 2023

At least ASCII and Windows-1252 are quite common in C, so maybe UTF-8 is superior but not always desired? Besides that, there is also "Pascal" encoding, with 8-bit and 16-bit variants :-D

I'm not too worried about ASCII, since in most cases it's a subset of other encodings - e.g. using UTF-8 to write an ASCII string should work out just fine. The problem with Windows-1252 is (and will be) more common, though.

It's almost as if, for the basic getMultiByteString method, we wanted to use whatever "native.encoding" is set to (which is OS-specific).


openjdk bot commented Jun 13, 2023

@mcimadamore this pull request cannot be integrated into foreign-memaccess+abi due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in the local repository for your personal fork:

git checkout strings
git fetch https://git.openjdk.org/panama-foreign.git foreign-memaccess+abi
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge foreign-memaccess+abi"
git push

Improve javadoc (esp. bound-checking conditions for get/set string)
@openjdk openjdk bot removed the merge-conflict label Jun 13, 2023

mcimadamore commented Jun 13, 2023

Following some offline discussions, I have decided to:

  • keep the general shape of the PR, as that gives us the most flexibility (while getString/getWideString is a possibility, we did not like that getString would need platform-dependent semantics, based on the selected native encoding, which is something the JDK as a whole is trying to get away from);
  • use UTF-8 instead of Charset.defaultCharset() for getString/setString/allocateString;
  • tweak the API so that the Charset parameter comes last.

We will consider whether to add support for UTF32 in the StandardCharsets class as a separate change (which will allow wide string on *nix platforms).


openjdk bot commented Jun 13, 2023

@mcimadamore This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8308858: FFM API and strings

Reviewed-by: jvernee

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the foreign-memaccess+abi branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the foreign-memaccess+abi branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Ready to be integrated label Jun 13, 2023
@mcimadamore (Collaborator, Author)

/integrate


openjdk bot commented Jun 14, 2023

Going to push as commit c5efc43.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 14, 2023
@openjdk openjdk bot closed this Jun 14, 2023
@openjdk openjdk bot removed ready Ready to be integrated rfr Ready for review labels Jun 14, 2023

openjdk bot commented Jun 14, 2023

@mcimadamore Pushed as commit c5efc43.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@mcimadamore (Collaborator, Author)

We will consider whether to add support for UTF32 in the StandardCharsets class as a separate change (which will allow wide string on *nix platforms).

For the record, here's an RFE for this:
https://bugs.openjdk.org/browse/JDK-8310047
