8308858: FFM API and strings #836
Conversation
Cleanup code
@ForceInline
private static int strlen_byte(MemorySegment segment, long start) {
    // iterate until overflow (String can only hold a byte[], whose length can be expressed as an int)
Perhaps we could see if functions such as strnlen / wcsnlen are supported and, if so, create a trivial downcall method handle and call that (the underlying C function is likely to be a lot smarter than the simple loop here, and use vectorized instructions).
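As a hedged illustration of that idea, here is a minimal sketch of binding a downcall handle to the C `strlen` function. It uses the finalized JDK 22 FFM API names (the PR itself predates the finalized API, so names may differ), and assumes a 64-bit platform where `size_t` maps to `JAVA_LONG`:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class StrlenDemo {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // strlen is part of the C standard library, so the default lookup finds it
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // allocate a null-terminated UTF-8 string in native memory
            MemorySegment str = arena.allocateFrom("hello");
            long len = (long) strlen.invokeExact(str);
            System.out.println(len); // 5
        }
    }
}
```

A real patch would still need a fallback to the Java loop when the function is not available, and a `strnlen`-style bound to stay within the segment's size.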
👋 Welcome back mcimadamore! A progress list of the required criteria for merging this PR into the target branch will be added to the body of your pull request.
The use of @ForceInline should be removed, I think. Since it messes with the JIT's own heuristics, it should be avoided in general, unless we know that the contents of a method are likely to fold up to almost nothing (such as for MS::get*/set*). However, that is not the case here. Better to not mess with it and let the JIT, which has more information, make the decision IMO. (At the end of the day, we bottom out into an out-of-line call to Unsafe::copyMemory anyway.)
I tend to agree. But I believe copyMemory is an intrinsic:
Yes, sorry. It is an intrinsic, but it is intrinsified as an out-of-line call: https://github.com/openjdk/jdk/blob/aff9cea05468daf60fa80c7d9993b3aa8497b0c5/src/hotspot/share/opto/library_call.cpp#L4728-L4734
/**
 * Miscellaneous functions to read and write strings, in various charsets.
 */
public class StringSupport {
Pedantic comment: the class may be final and declare a private constructor.
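Applied to the class above, the suggestion amounts to this (a sketch, not the actual patch):

```java
/**
 * Miscellaneous functions to read and write strings, in various charsets.
 */
public final class StringSupport {  // final: a utility class, not meant to be extended
    private StringSupport() {}      // private constructor: not meant to be instantiated
    // ... static helper methods ...
}
```

This is the conventional shape for a static-only utility holder; the only cost is one extra line for the constructor.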
    this.terminatorCharSize = terminatorCharSize;
}

public int getTerminatorCharSize() {
suggesting terminatorCharSize
Maybe the default charset will be the most likely one to use; if so, we could provide some kind of optimized version (e.g. using a composed …)
Why not then use two methods:
…
that both accept a generic charset, and let the user choose a suitable one, as the name of the charset does not really matter much (as long as both sides use the same byte -> char mapping).
See my comment on special-casing strlen in some cases (as I think that's the big gain to be had).
Not sure about this. Yes, we could have a getSingleByteString and getWideString. But if we also support charsets, then we have to worry about the compatibility of the provided charset with the given string method. E.g. what if I pass a UTF-8 charset to getWideString? Or a UTF-16 charset to getSingleByteString? The approach described here has a single parameter, which controls everything else.
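A small, self-contained illustration of why the charset has to drive the terminator handling: the number of zero bytes that terminate a C string depends on the charset's code-unit size. This uses only plain JDK APIs, nothing from this PR:

```java
import java.nio.charset.StandardCharsets;

public class TerminatorSizes {
    public static void main(String[] args) {
        // UTF-8 uses 1-byte code units: a single 0x00 byte terminates the string
        System.out.println("hi".getBytes(StandardCharsets.UTF_8).length);    // 2
        // UTF-16LE uses 2-byte code units: the terminator must be 0x00 0x00
        System.out.println("hi".getBytes(StandardCharsets.UTF_16LE).length); // 4
        // So passing a UTF-16 charset to a hypothetical getSingleByteString
        // (or UTF-8 to getWideString) would scan for the wrong terminator.
    }
}
```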
I guess an alternate design would be to ask the user how many terminator chars there should be (where that has to be a power of two). Then we could allow any charsets, including nonsensical ones. Something like:
…
Maybe for getString we could just explicitly ask the user for the size of the string (excluding the terminator)? That way we don't need to manually determine the string length. For users that means there is still a convenient wrapper that does the decoding and bulk copy. That would also work for loading substrings, where there is no null terminator directly after the native string, which might be nice. I suppose to …
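The explicit-length idea could look something like the sketch below. The helper name and signature are illustrative, not part of the PR; JDK 22 FFM API names are assumed:

```java
import java.lang.foreign.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class KnownLengthString {
    // Illustrative helper (not the PR's API): decode a native string whose
    // byte length is already known, skipping the strlen-style terminator scan.
    static String getString(MemorySegment segment, long offset, long byteLen, Charset cs) {
        byte[] bytes = segment.asSlice(offset, byteLen).toArray(ValueLayout.JAVA_BYTE);
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment s = arena.allocateFrom("hello world"); // UTF-8, null-terminated
            // decode only the first 5 bytes: a substring, no terminator needed
            System.out.println(getString(s, 0, 5, StandardCharsets.UTF_8)); // hello
        }
    }
}
```

Note this also sidesteps the out-of-bounds risk of scanning past the segment when the string is not actually terminated.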
Well, one can always do "bad" things (especially on the C side of the chain), but one usually tries to do "good" things, so if the native code uses a … So this would be more syntactic sugar (and probably a way to optimize access to the internal array if no transcoding is required). On the Java side:
…
and on the C(++) side:
…
so for FFM, something like:
…
should be able to handle any charset (as long as both sides agreed on it, of course). That was just an idea that came to my mind, as in C the "standard string" is just a bunch of bytes with one (or two) zeros at the end... and as the string length (or byte length after encoding) and the terminating bytes (C = 1, W = 2) are known, the user does not need to pass anything more than the above.
That was the purpose for …
Sorry, that's what I meant. The …
One issue with this is that W = 2 is only true on Windows. With a more general approach we get back essentially to what Maurizio suggested.
To summarize, my feeling is that there are several things we could do:
…
I think the second option (this PR) is the most attractive user model, since we do all the work, though we limit the number of Charsets if we do that. I think the third option is the most flexible if we want to support any Charset, but I'm a little disappointed we end up with potential rough edges... (who knows what kind of bug report we might end up with a few years down the line when somebody uses their custom charset). I also feel like we essentially kick the problem back to the user, since they now have to figure out the number of null terminator bytes. At that point, I think we're better off just letting the user implement the whole thing, since it's not that complicated (which has been the story so far for Charsets other than UTF-8). If we do that, we can keep the 'nice' API (this PR) for the Charsets that we know how to do all the work for. Maybe a fourth option is:
…
This is also kicking the problem back to the user. I think out of all of these options, I still prefer the approach taken by this PR.
On this I'm not 100% sure. I mean, there's a way in which what you say is correct, e.g. if you take it that the number of terminator bytes must be somewhat related to the charset selected by the user. But there's another way to look at this which says that the number of terminator bytes is unrelated to the charset, and only related to the notional C type that is used to model the character in the C string (e.g. …). That said, using random charsets can still be problematic if the charset ends up encoding multiple different chars into the terminator sequence (think of a custom character whose replacement string is the terminator sequence :-) ). The Charset API still seems more general than what a null-terminated string API can stomach.
That's exactly what I mean: while theoretically I can choose any charset and do strange things on the Java side by defining a custom one, practically I have no choice (most of the time), as I can't use a random charset but must use the one chosen by the C code I interact with, and that side will most probably choose one that fits the data type and C conventions. So if this String-To-C-Using-Charset can make it easier for me (e.g. maybe handle the …
Thinking some more about this (and after reading some more documents), this para stuck with me:
…
It seems that the status quo is:
…
So, if we adopt a charset-driven approach, it would seem that the most important charsets to support are (in order):
…
Note how the last bullet is not supported by the current PR (because UTF-32 is not part of StandardCharsets). This might be an issue when working with wide strings on e.g. Linux. Alternatively, if we adopt a wide-driven approach, this boils down to having two flavors of string accessors:
…
where the "stride" of getWideString is platform-dependent (2 bytes on Windows, 4 bytes on *nix systems). Now, as written in the above text, there's formally no constraint on using a given encoding in either case. That said, I think there are obvious-enough defaults here, which are UTF-8 for multi-byte strings and either UTF-16 or UTF-32 for wide strings. While these methods could also still accept a Charset parameter, there seems to be relatively little value in doing so. UTF-8 is virtually ubiquitous in multi-byte settings, whereas if you want to work with wide strings, typically you want to do so to avoid variable-size processing and/or to adhere more strictly to Unicode formats (so anything other than UTF-16 or UTF-32 here seems an odd choice). As pointed out, even if we didn't provide a Charset-accepting variant, users would still be able to mix and match multi-byte and wide with any available charset, but they will have to do it manually, which, given the corner^4-case nature of the problem, is not something I'm too concerned with.
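A hedged sketch of what that platform-dependent wide-string "stride" selection might look like. This is illustrative only (the helper is not part of any API); it also shows that UTF-32 is reachable via Charset.forName on standard JDKs even though it is absent from StandardCharsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WideCharset {
    // Pick a charset matching the platform's wchar_t width:
    // 2 bytes on Windows (UTF-16), 4 bytes on most *nix systems (UTF-32).
    static Charset wideCharset() {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("windows");
        return windows ? StandardCharsets.UTF_16LE : Charset.forName("UTF-32LE");
    }

    public static void main(String[] args) {
        Charset wide = wideCharset();
        // the terminator size equals the charset's code-unit size
        int terminatorBytes = "A".getBytes(wide).length;
        System.out.println(wide + " -> " + terminatorBytes + "-byte terminator");
    }
}
```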
For the record, I wanted to spell things out very clearly here: I'm not proposing to have a method called …
At least ASCII and Windows-1252 are quite common in C, so maybe UTF-8 is superior but not always desired? Besides that, there is also the "Pascal" string encoding, with 8-bit and 16-bit variants :-D
I'm not too worried about ASCII, since in most cases it's a subset of other encodings; e.g. using UTF-8 to write an ASCII string should work out just fine. The problem with Windows-1252 is/will be more common though. It's almost as if, for the basic …
* make Charset parameter be the last
@mcimadamore this pull request can not be integrated due to a merge conflict. To resolve it, run:
git checkout strings
git fetch https://git.openjdk.org/panama-foreign.git foreign-memaccess+abi
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge foreign-memaccess+abi"
git push
Improve javadoc (esp. bound-checking conditions for get/set string)
Following some offline discussions, I have decided to:
…
We will consider whether to add support for UTF-32 in the StandardCharsets class as a separate change (which will allow wide strings on *nix platforms).
@mcimadamore This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
…
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the branch. ➡️ To integrate this PR with the above commit message, type /integrate in a new comment.
/integrate
Going to push as commit c5efc43.
@mcimadamore Pushed as commit c5efc43. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
For the record, here's an RFE for this: …
This patch is an attempt to generalize string support in FFM API. Currently we only support UTF-8 encoding for strings. While this is a sensible default, on Windows strings are often encoded as wide strings, using UTF-16, so it would be nice to support more charsets.
As a result, this patch adds a Charset-accepting variant to MemorySegment::getString, MemorySegment::setString and MemorySegment::allocateString (the methods have also been renamed to drop their Utf8 suffix). However, not all charsets can be supported. In fact, we only support the charsets defined in the StandardCharsets class. While we tried to use charset encoders/decoders to support any charset, it seems like the charset interface is too general, and there is no guarantee that what comes out would be interoperable with a null-terminated C string.
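For reference, here is roughly what the Charset-accepting variants look like in use. I'm showing the JDK 22 final names (SegmentAllocator::allocateFrom, MemorySegment::getString); the exact method names in this PR's intermediate state may differ:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.charset.StandardCharsets;

public class CharsetStrings {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // allocate a wide (UTF-16LE) null-terminated string in native memory...
            MemorySegment seg = arena.allocateFrom("hello", StandardCharsets.UTF_16LE);
            // ...and read it back; the terminator here is two zero bytes
            String s = seg.getString(0, StandardCharsets.UTF_16LE);
            System.out.println(s + " (" + seg.byteSize() + " bytes)"); // hello (12 bytes)
        }
    }
}
```

Note the segment is 12 bytes: 5 characters at 2 bytes each, plus the 2-byte terminator.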
In C (and C++), strings are expressed either as char* or wchar_t*. In the former case, no matter the encoding, the terminator char is expected to be 0x00 (i.e. the size of a char value), whereas in the latter case the terminator is expected to be 0x0000 or 0x00000000 (the size of a wchar_t value, depending on the platform). Moreover, C more or less requires that a string, whether expressed as char* or wchar_t*, cannot contain any element in the array that is a terminator char. That is, in the case of extended representations, surrogates must not use the reserved terminator values. There is no way really to map these restrictions to Java charsets. A Java charset can do pretty much anything it wants, even adding extra characters at the end of a string (past the terminator) using the encoder's flush method. As a result, we can only really work with charsets we know of (and the list in StandardCharsets seems a good starting set). The subject of strings is quite tricky, and we noticed that other frameworks tend to get it wrong, often assuming that a wide string only has a single-byte terminator.
An alternative would be to do what LWJGL does, and provide multiple methods, for different encodings - e.g.
https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memASCII(long)
https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF16(long)
https://javadoc.lwjgl.org/org/lwjgl/system/MemoryUtil.html#memUTF8(long)
Anyway, I thought it would be a good idea to propose this PR and see what kind of feedback it generates :-)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/panama-foreign.git pull/836/head:pull/836
$ git checkout pull/836
Update a local copy of the PR:
$ git checkout pull/836
$ git pull https://git.openjdk.org/panama-foreign.git pull/836/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 836
View PR using the GUI difftool:
$ git pr show -t 836
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/panama-foreign/pull/836.diff