-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8369564: Provide a MemorySegment API to read strings with known lengths #28043
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
/contributor add @minborg |
|
👋 Welcome back cushon! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
|
@cushon |
Webrevs
|
JornVernee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some preliminary comments. I didn't look at the tests yet.
| case SINGLE_BYTE -> readByte(segment, offset, len, charset); | ||
| case DOUBLE_BYTE -> readShort(segment, offset, len, charset); | ||
| case QUAD_BYTE -> readInt(segment, offset, len, charset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 3 methods appear to be identical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I refactored to do something more similar to the original PR to avoid the duplication here and with the existing read methods.
| String getString(long offset, int length, Charset charset); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest putting the length parameter at the end, so that this becomes a telescoping overload of the length-less variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * <li>{@code N} is the size (in bytes) of the terminator char according | ||
| * to the provided charset. For instance, this is 1 for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the terminator char important? The segment doesn't necessarily need to have a terminator char, right? I don't see this invariant being checked in the code either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it is not, I think this was left over from javadoc adapted from another overload
| * <li>{@code B} is the size, in bytes, of the string encoded using the | ||
| * provided charset (e.g. {@code str.getBytes(charset).length});</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't B equal to the length argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yes, I reworked this part
| * @param length byte length to be used for string conversion (not including any | ||
| * null termination) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 'to be used for string conversion' is a bit too vague (used how?). I think a more descriptive text could be something like 'length in bytes of the string to read' (matching also the pattern of the existing 'offset in bytes').
Also, what happens if:
- The length does include a null terminator
- The length is not a multiple of the byte size of a character in the given charset.
On that last note, I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset. That will make it impossible to pass a length that is not a multiple of the character size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look, I wanted to respond briefly to this part and will review the rest of the comments later:
I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset
Part of the motivation here is to support efficiently reading binary formats where I think it's more common to record the length of string data in bytes, than in 16-bit code units in the UTF-16 encoding of the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this with the team as well. For cases of native interop it seems more likely that you'd have e.g. an array of wchar_t on the native side, and you are tracking the length of that array, not the byte size.
A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar, but if the length is specified in bytes, the API has an extra error case we need to check and specify.
Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar
That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.
Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.
Sounds good, thanks, I can start a thread discussing the use-case here at a higher level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar
That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.
Sorry, I don't mean 'character' but 'code unit'. For instance, when reading a UTF-8 string, the unit would be one byte, for UTF-16 it would be two, for UTF-32 four. So a user would just need to divide by the unit size, at least that's the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can start a thread discussing the use-case here at a higher level.
Done: https://mail.openjdk.org/pipermail/panama-dev/2025-November/021182.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was some discussion in this panama-dev@ thread, and mcimadamore wrote a document: Pulling the (foreign) string
|
@cushon |
|
|
||
| void copyToSegmentRaw(MemorySegment segment, long offset) { | ||
| MemorySegment.copy(value, 0, segment, ValueLayout.JAVA_BYTE, offset, value.length); | ||
| void copyToSegmentRaw(MemorySegment segment, long offset, int srcIndex, int numChars) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method takes an index, expressed in chars, and uses that as a byte offset in a bulk copy operation. I don't think this is correct. E.g. if the string is UTF16 (and not LATIN1), there is a scaling factor to be applied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, it seems to me that here we have hardwired the knowledge that we can only get here is the string is latin1. I don't think this was the original intent of this method -- however, if that's the case, we should also add an assertion to avoid misuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. For copyToSegmentRaw, I have updated the parameter names to not refer to chars.
I have also tentatively added an assertion to copyToSegmentRaw to only support latin1 strings, which could be relaxed if bytesCompatible is updated to handle UTF-16
| } | ||
|
|
||
| boolean bytesCompatible(Charset charset) { | ||
| boolean bytesCompatible(Charset charset, int srcIndex, int numChars) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprisingly here we don't do anything for the case where the string is UTF16 and the target charset is also UTF16?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UTF‑16 Charsets disallow unpaired surrogates, which Java Strings allow.
So this can only return true for UTF‑16 when the platform and charset endianness match and the String doesn’t have any unpaired surrogates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, tricky stuff -- we'll need to think more before changing this
| Objects.requireNonNull(str); | ||
| MemorySegment segment; | ||
| if (StringSupport.bytesCompatible(str, charset, srcIndex, numChars)) { | ||
| segment = allocateNoInit(numChars); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems to rely on the fact that we end up here only for latin1 strings. Again, I don't think this is correct, but if it's deliberate, we should add an assertion check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think we make a similar assumption in the existing allocateFrom(String, Charset), it does length + termCharSize and that should perhaps be (length + 1) * codeUnitSize.
| * will appear truncated when read again. | ||
| * | ||
| * @param src the Java string to be written into this segment | ||
| * @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we have a dependency on the charset being standard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not, thanks, fixed.
Although I think the existing allocateFrom(String, Charset) method does have an undocumented dependency, because it uses CharsetKind to get the terminator char length, which only supports standard Charsets. If we add a fast path for UTF-16 that may need a dependency on a standard Charset (or a standard way to get the code unit size of a charset, if it has one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note sure I follow -- the method you mention says this:
* @throws IllegalArgumentException if {@code charset} is not a
* {@linkplain StandardCharsets standard charset}
What do you mean by "undocumented dependency"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, you're right, it is documented. It's documented differently than e.g. MemorySegment#getString, which mentions it in both the @param and @throws doc.
mcimadamore
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good, but I think we should put some more care when going to char-based indices to byte array indices, esp. if we will optimize the UTF16 case in the future
|
/label remove security |
|
@wangweij |
|
Thanks for the review!
Thanks for catching that. I have made some initial updates and added an assertion. To confirm do you think it's OK to leave optimizing the UTF16 case as future work, as long as the current assumptions are clearly documented and guarded by assertions? Also, I'm planning to spend more time on test coverage for this and going over the javadoc again. |
|
/csr needed |
|
@mcimadamore has indicated that a compatibility and specification (CSR) request is needed for this pull request. @cushon please create a CSR request for issue JDK-8369564 with the correct fix version. This pull request cannot be integrated until the CSR request is approved. |
I think this is ok, yes -- maybe add a link (comment) between the assertion being thrown and String::bytesCompatible. |
|
/reviewers 2 |
|
@mcimadamore |
mcimadamore
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall approach looks solid, thanks for working on this improvement!
JornVernee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! I've left some inline comments.
|
|
||
| @ForceInline | ||
| public static String readBytes(AbstractMemorySegmentImpl segment, long offset, Charset charset, long length) { | ||
| final int lengthBytes = (int) length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do something more here than just ignore the upper bits. We probably need to throw an exception when the value is > Integer.MAX_VALUE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are already specified to throw IAE if 'the size of the string is greater than the largest string supported by the platform'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and I added a test to cover the IAE
| } | ||
|
|
||
| /** | ||
| * Converts a Java string into a null-terminated C string using the provided charset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not null-terminated, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not, thanks, fixed
| * @throws IllegalArgumentException if {@code charset} is not a | ||
| * {@linkplain StandardCharsets standard charset} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, I don't think we have this limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * @throws IndexOutOfBoundsException if the {@code endIndex} is larger than the length of | ||
| * this {@code String} object, or {@code beginIndex} is larger than {@code endIndex}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no 'endIndext'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| assertThrows(IllegalArgumentException.class, () -> segment.getString(0, StandardCharsets.UTF_8, -1)); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need some more tests for the other new methods as well. Also, it would be nice to test non-standard charsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more tests to cover regular and exception cases for the three new methods. I'm happy to take suggestions on additional test coverage, or if there's a better location for any of the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, these look great!
I think another test that tests the case where srcIndex + numChars overflows for copy and allocateFrom, with different char sets (one that takes the internal bytesCompatible == true, and one that takes the bytesCompatible == false route) would be good to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I added coverage of srcIndex + numChars overflow for both bytesCompatible cases.
I think that in practice some of the cases were being caught later. I added Objects.checkFromIndexSize assertions to MemorySegment#copy and allocateFrom to catch them immediately with a useful exception, instead of e.g. catching it in String.substring.
| * access operation will occur | ||
| * @param charset the charset used to {@linkplain Charset#newDecoder() decode} the | ||
| * string bytes | ||
| * @param length length to be used for string conversion, in bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep the same format
| * @param length length to be used for string conversion, in bytes | |
| * @param length length in bytes of the string to read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * string bytes | ||
| * @param length length to be used for string conversion, in bytes | ||
| * @return a Java string constructed from the bytes read from the given starting | ||
| * address reading the given length of characters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * address reading the given length of characters | |
| * address reading the given length of bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * @throws IndexOutOfBoundsException if the {@code endIndex} is larger than the length of | ||
| * this {@code String} object, or {@code beginIndex} is larger than {@code endIndex}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no beginIndex and endIndex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* document assertion to link to bytesCompatible * throw IAE for length > Integer.MAX_VALUE * javadoc fixes
cushon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, I think I've responded to all of the comments.
I also added more test coverage, and made a couple more fixes:
- There was a mistake with
srcIndex/numCharshandling,numCharswas being used as a an index instead of a length, I have fixed the implementation and also the javadoc - I removed a couple more obsolete javadoc references to StandardCharsets
| * access operation will occur | ||
| * @param charset the charset used to {@linkplain Charset#newDecoder() decode} the | ||
| * string bytes | ||
| * @param length length to be used for string conversion, in bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * string bytes | ||
| * @param length length to be used for string conversion, in bytes | ||
| * @return a Java string constructed from the bytes read from the given starting | ||
| * address reading the given length of characters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * @throws IndexOutOfBoundsException if the {@code endIndex} is larger than the length of | ||
| * this {@code String} object, or {@code beginIndex} is larger than {@code endIndex}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| } | ||
|
|
||
| /** | ||
| * Converts a Java string into a null-terminated C string using the provided charset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not, thanks, fixed
| * @throws IllegalArgumentException if {@code charset} is not a | ||
| * {@linkplain StandardCharsets standard charset} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * @throws IndexOutOfBoundsException if the {@code endIndex} is larger than the length of | ||
| * this {@code String} object, or {@code beginIndex} is larger than {@code endIndex}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
| @ForceInline | ||
| public static String readBytes(AbstractMemorySegmentImpl segment, long offset, Charset charset, long length) { | ||
| final int lengthBytes = (int) length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and I added a test to cover the IAE
| assertThrows(IllegalArgumentException.class, () -> segment.getString(0, StandardCharsets.UTF_8, -1)); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more tests to cover regular and exception cases for the three new methods. I'm happy to take suggestions on additional test coverage, or if there's a better location for any of the tests.
| public void testStringsLength(String testString) { | ||
| Set<String> excluded = Set.of("yen", "snowman", "rainbow"); | ||
| // This test only works for certain strings where the last character is not special | ||
| Set<String> excluded = Set.of("yen"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the yen character is translated to / in some encodings when doing a round trip. Maybe we should just avoid this issue by switching it out for e.g. "section \u00A7", which is § and doesn't have the same problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, sounds good! Done
| MemorySegment.copy(testString, StandardCharsets.UTF_8, 1, text, 0, testString.length())); | ||
| assertThrows(IndexOutOfBoundsException.class, () -> | ||
| MemorySegment.copy(testString, StandardCharsets.UTF_8, 0, text, 0, testString.length() + 1)); | ||
| // dstOffset > byteSize() + B |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // dstOffset > byteSize() + B | |
| // dstOffset > byteSize() - B |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch, fixed
| * @throws IndexOutOfBoundsException if the {@code numChars + srcIndex} is larger than the length of | ||
| * this {@code String} object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this leaves the question: what happens is srcIndex + numChars overflows and becomes negative? It will be less than the length of the string for sure, but not right either.
This is why the other method's javadoc write down e.g. srcIndex > srcArray.length - elementCount. Assuming all positive numbers, it avoids the overflow issue.
| assertThrows(IllegalArgumentException.class, () -> segment.getString(0, StandardCharsets.UTF_8, -1)); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, these look great!
I think another test that tests the case where srcIndex + numChars overflows for copy and allocateFrom, with different char sets (one that takes the internal bytesCompatible == true, and one that takes the bytesCompatible == false route) would be good to have.
* handle numChars + srcIndex overflow, and add tests * replace yen with a character that round trips
cushon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again!
I have responded to the comments, and I have also started drafting the CSR: https://bugs.openjdk.org/browse/JDK-8372338
| public void testStringsLength(String testString) { | ||
| Set<String> excluded = Set.of("yen", "snowman", "rainbow"); | ||
| // This test only works for certain strings where the last character is not special | ||
| Set<String> excluded = Set.of("yen"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, sounds good! Done
| MemorySegment.copy(testString, StandardCharsets.UTF_8, 1, text, 0, testString.length())); | ||
| assertThrows(IndexOutOfBoundsException.class, () -> | ||
| MemorySegment.copy(testString, StandardCharsets.UTF_8, 0, text, 0, testString.length() + 1)); | ||
| // dstOffset > byteSize() + B |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch, fixed
| assertThrows(IllegalArgumentException.class, () -> segment.getString(0, StandardCharsets.UTF_8, -1)); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I added coverage of srcIndex + numChars overflow for both bytesCompatible cases.
I think that in practice some of the cases were being caught later. I added Objects.checkFromIndexSize assertions to MemorySegment#copy and allocateFrom to catch them immediately with a useful exception, instead of e.g. catching it in String.substring.
JornVernee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest version looks good to me. Left one last suggestion inline.
| @Override | ||
| public String getString(long offset, Charset charset, long length) { | ||
| if (length < 0) { | ||
| throw new IllegalArgumentException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have an exception message here please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I switched to Utils.checkNonNegativeArgument
| * @param src the Java string to be written into this segment | ||
| * @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode} | ||
| * the string bytes. | ||
| * @param srcIndex the starting index of the source string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we feel like saying words like "index" is good enough to say what we mean? E.g. by index we mean something that is compatible with length and charAt. I wonder if using a qualifier e.g. character index might help? (open to suggestions here, I'm not 100% sure)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think 'character index' is probably better than just index, since there are both character and byte positions in this API. The docs for numChars also mentions characters. I don't have a better idea for avoiding confusion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this would be clear enough since it's talking about an index of a string, where 'index' has a pretty well understood meaning.
| * This method always replaces malformed-input and unmappable-character | ||
| * sequences with this charset's default replacement string. The {@link | ||
| * java.nio.charset.CharsetDecoder} class should be used when more control | ||
| * over the decoding process is required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say here, as you did for copy that this method ignores \0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added:
If the string contains any {@code '\0'} characters, they will be read as well.
I suppose it might also make sense to update those warnings in setString and allocateFrom to mention that if you want to avoid truncating null-terminated strings, getString(long, Charset, long) could be used instead of getString(long). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be a good idea, thanks!
| * the string, such as {@link MemorySegment#getString(long)}, the string | ||
| * will appear truncated when read again. | ||
| * | ||
| * @param src the Java string to be written into this segment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * @param src the Java string to be written into this segment | |
| * @param src the Java string to be written into the destination segment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * string bytes | ||
| * @param length length in bytes of the string to read | ||
| * @return a Java string constructed from the bytes read from the given starting | ||
| * address reading the given length of bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
a Java string constructed from the bytes read from the given starting address up to the given length
(that seems to match the existing getString a bit more, and avoids the reading/read repetition)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| } | ||
|
|
||
| /** | ||
| * Converts a Java string into a C string using the provided charset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using the term C string here is misleading.
I suggest:
Encodes a Java string using the provided charset and stores the resulting byte array into a memory segment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done
| * the string, such as {@link MemorySegment#getString(long)}, the string | ||
| * will appear truncated when read again. | ||
| * | ||
| * @param str the Java string to be converted into a C string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * @param str the Java string to be converted into a C string | |
| * @param str the Java string to be encoded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * string bytes | ||
| * @param srcIndex the starting index of the source string | ||
| * @param numChars the number of characters to be copied | ||
| * @return a new native segment containing the converted C string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Watch out for C string again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
mcimadamore
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good -- left some more javadoc comments
cushon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again, I updated the javadoc
| * This method always replaces malformed-input and unmappable-character | ||
| * sequences with this charset's default replacement string. The {@link | ||
| * java.nio.charset.CharsetDecoder} class should be used when more control | ||
| * over the decoding process is required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added:
If the string contains any {@code '\0'} characters, they will be read as well.
I suppose it might also make sense to update those warnings in setString and allocateFrom to mention that if you want to avoid truncating null-terminated strings, getString(long, Charset, long) could be used instead of getString(long). What do you think?
| * the string, such as {@link MemorySegment#getString(long)}, the string | ||
| * will appear truncated when read again. | ||
| * | ||
| * @param src the Java string to be written into this segment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * @param src the Java string to be written into this segment | ||
| * @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode} | ||
| * the string bytes. | ||
| * @param srcIndex the starting index of the source string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think 'character index' is probably better than just index, since there are both character and byte positions in this API. The docs for numChars also mentions characters. I don't have a better idea for avoiding confusion here.
| } | ||
|
|
||
| /** | ||
| * Converts a Java string into a C string using the provided charset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done
| * string bytes | ||
| * @param length length in bytes of the string to read | ||
| * @return a Java string constructed from the bytes read from the given starting | ||
| * address reading the given length of bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * the string, such as {@link MemorySegment#getString(long)}, the string | ||
| * will appear truncated when read again. | ||
| * | ||
| * @param str the Java string to be converted into a C string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * string bytes | ||
| * @param srcIndex the starting index of the source string | ||
| * @param numChars the number of characters to be copied | ||
| * @return a new native segment containing the converted C string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
mcimadamore
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- thanks!
This PR proposes adding a new overload to
MemorySegment::getStringthat takes a known byte length of the content.This was previously proposed in #20725, but the outcome of JDK-8333843 was to update
MemorySegment#getStringto suggestHowever this is less efficient than what the implementation of getString does after JDK-8362893, it now uses
JavaLangAccess::uncheckedNewStringNoReplto avoid the copy.See also discussion in this panama-dev@ thread, and mcimadamore's document Pulling the (foreign) string
Benchmark results:
Progress
Issue
Reviewers
Contributors
<pminborg@openjdk.org>Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28043/head:pull/28043$ git checkout pull/28043Update a local copy of the PR:
$ git checkout pull/28043$ git pull https://git.openjdk.org/jdk.git pull/28043/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28043View PR using the GUI difftool:
$ git pr show -t 28043Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28043.diff
Using Webrev
Link to Webrev Comment