@cushon
Contributor

@cushon cushon commented Oct 29, 2025

This PR proposes adding a new overload to MemorySegment::getString that takes a known byte length of the content.

This was previously proposed in #20725, but the outcome of JDK-8333843 was to update MemorySegment#getString to suggest

    byte[] bytes = new byte[length];
    MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length);
    return new String(bytes, charset);

However, this is less efficient than what the implementation of getString does after JDK-8362893: it now uses JavaLangAccess::uncheckedNewStringNoRepl to avoid the copy.

See also discussion in this panama-dev@ thread, and mcimadamore's document Pulling the (foreign) string
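
For reference, the copy-based workaround quoted above can be exercised end to end on a length-prefixed record. This is only a sketch, assuming JDK 22 or later (where java.lang.foreign is final). The class name KnownLengthRead, the helper names, and the 4-byte length prefix layout are illustrative assumptions and not part of this PR.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class KnownLengthRead {
    // Copy-based read: copies `length` bytes out of the segment, then decodes them.
    // No terminator is involved; the caller supplies the byte length.
    static String readWithCopy(MemorySegment segment, long offset, int length, Charset charset) {
        byte[] bytes = new byte[length];
        MemorySegment.copy(segment, ValueLayout.JAVA_BYTE, offset, bytes, 0, length);
        return new String(bytes, charset);
    }

    // Hypothetical record layout: a 4-byte length prefix followed by the UTF-8 payload.
    static String roundTrip(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(4 + encoded.length);
            segment.set(ValueLayout.JAVA_INT_UNALIGNED, 0, encoded.length);
            MemorySegment.copy(encoded, 0, segment, ValueLayout.JAVA_BYTE, 4, encoded.length);

            // Reader side: the length comes from the record, not from a null terminator.
            int length = segment.get(ValueLayout.JAVA_INT_UNALIGNED, 0);
            return readWithCopy(segment, 4, length, StandardCharsets.UTF_8);
        }
    }
}
```

The intermediate byte[] in readWithCopy is the copy that the proposed overload is meant to avoid.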

Benchmark results:

Benchmark                                 (size)  Mode  Cnt    Score   Error  Units
ToJavaStringTest.jni_readString                5  avgt   30   55.339 ± 0.401  ns/op
ToJavaStringTest.jni_readString               20  avgt   30   59.887 ± 0.295  ns/op
ToJavaStringTest.jni_readString              100  avgt   30   84.288 ± 0.419  ns/op
ToJavaStringTest.jni_readString              200  avgt   30  119.275 ± 0.496  ns/op
ToJavaStringTest.jni_readString              451  avgt   30  193.106 ± 1.528  ns/op
ToJavaStringTest.panama_copyLength             5  avgt   30    7.348 ± 0.048  ns/op
ToJavaStringTest.panama_copyLength            20  avgt   30    7.440 ± 0.125  ns/op
ToJavaStringTest.panama_copyLength           100  avgt   30   11.766 ± 0.058  ns/op
ToJavaStringTest.panama_copyLength           200  avgt   30   16.096 ± 0.089  ns/op
ToJavaStringTest.panama_copyLength           451  avgt   30   25.844 ± 0.054  ns/op
ToJavaStringTest.panama_readString             5  avgt   30    5.857 ± 0.046  ns/op
ToJavaStringTest.panama_readString            20  avgt   30    7.750 ± 0.046  ns/op
ToJavaStringTest.panama_readString           100  avgt   30   14.109 ± 0.187  ns/op
ToJavaStringTest.panama_readString           200  avgt   30   18.035 ± 0.130  ns/op
ToJavaStringTest.panama_readString           451  avgt   30   35.896 ± 0.227  ns/op
ToJavaStringTest.panama_readStringLength       5  avgt   30    4.565 ± 0.038  ns/op
ToJavaStringTest.panama_readStringLength      20  avgt   30    4.654 ± 0.040  ns/op
ToJavaStringTest.panama_readStringLength     100  avgt   30    8.502 ± 0.207  ns/op
ToJavaStringTest.panama_readStringLength     200  avgt   30   10.950 ± 0.124  ns/op
ToJavaStringTest.panama_readStringLength     451  avgt   30   16.244 ± 0.135  ns/op

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change requires a CSR request matching fixVersion 26 to be approved (needs to be created)
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8369564: Provide a MemorySegment API to read strings with known lengths (Enhancement - P4)(⚠️ The fixVersion in this issue is [27] but the fixVersion in .jcheck/conf is 26, a new backport will be created when this pr is integrated.)

Reviewers

Contributors

  • Per Minborg <pminborg@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28043/head:pull/28043
$ git checkout pull/28043

Update a local copy of the PR:
$ git checkout pull/28043
$ git pull https://git.openjdk.org/jdk.git pull/28043/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28043

View PR using the GUI difftool:
$ git pr show -t 28043

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28043.diff

Using Webrev

Link to Webrev Comment

@cushon
Contributor Author

cushon commented Oct 29, 2025

/contributor add @minborg

@bridgekeeper

bridgekeeper bot commented Oct 29, 2025

👋 Welcome back cushon! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Oct 29, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk

openjdk bot commented Oct 29, 2025

@cushon
Contributor Per Minborg <pminborg@openjdk.org> successfully added.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Oct 29, 2025
@openjdk

openjdk bot commented Oct 29, 2025

@cushon The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 29, 2025
@mlbridge

mlbridge bot commented Oct 29, 2025

Member

@JornVernee JornVernee left a comment

Some preliminary comments. I didn't look at the tests yet.

Comment on lines 64 to 66
case SINGLE_BYTE -> readByte(segment, offset, len, charset);
case DOUBLE_BYTE -> readShort(segment, offset, len, charset);
case QUAD_BYTE -> readInt(segment, offset, len, charset);
Member

These 3 methods appear to be identical

Contributor Author

Thanks, I refactored to do something more similar to the original PR to avoid the duplication here and with the existing read methods.

Comment on lines 1362 to 1363
String getString(long offset, int length, Charset charset);

Member

I'd suggest putting the length parameter at the end, so that this becomes a telescoping overload of the length-less variant.

Contributor Author

Done

Comment on lines 1350 to 1351
* <li>{@code N} is the size (in bytes) of the terminator char according
* to the provided charset. For instance, this is 1 for
Member

Why is the terminator char important? The segment doesn't necessarily need to have a terminator char, right? I don't see this invariant being checked in the code either.

Contributor Author

Thanks, it is not, I think this was left over from javadoc adapted from another overload

Comment on lines 1348 to 1349
* <li>{@code B} is the size, in bytes, of the string encoded using the
* provided charset (e.g. {@code str.getBytes(charset).length});</li>
Member

Isn't B equal to the length argument?

Contributor Author

Thanks, yes, I reworked this part

Comment on lines 1337 to 1338
* @param length byte length to be used for string conversion (not including any
* null termination)
Member

I think 'to be used for string conversion' is a bit too vague (used how?). I think a more descriptive text could be something like 'length in bytes of the string to read' (matching also the pattern of the existing 'offset in bytes').

Also, what happens if:

  • The length does include a null terminator
  • The length is not a multiple of the byte size of a character in the given charset.

On that last note, I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset. That will make it impossible to pass a length that is not a multiple of the character size.
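
To make the second question concrete, here is a small sketch of what the plain String constructor does when handed a byte count that is not a multiple of the UTF-16 code unit size. It only illustrates the behaviour of new String(byte[], int, int, Charset); it does not describe what the proposed getString overload is specified to do. The class name OddLengthDecode is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class OddLengthDecode {
    // Decodes the first `byteLength` bytes of `data` as UTF-16LE.
    // Malformed input (such as a trailing half code unit) is replaced, not rejected.
    static String decodeUtf16le(byte[] data, int byteLength) {
        return new String(data, 0, byteLength, StandardCharsets.UTF_16LE);
    }
}
```

With the four bytes of "ab" in UTF-16LE, a length of 4 decodes to "ab", while a length of 3 decodes 'a' followed by the replacement character U+FFFD rather than throwing.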

Contributor Author

@cushon cushon Oct 29, 2025

Thanks for taking a look, I wanted to respond briefly to this part and will review the rest of the comments later:

I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset

Part of the motivation here is to support efficiently reading binary formats where I think it's more common to record the length of string data in bytes, than in 16-bit code units in the UTF-16 encoding of the string.

Member

Discussed this with the team as well. For cases of native interop it seems more likely that you'd have e.g. an array of wchar_t on the native side, and you are tracking the length of that array, not the byte size.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar, but if the length is specified in bytes, the API has an extra error case we need to check and specify.

Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.

Contributor Author

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar

That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.

Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.

Sounds good, thanks, I can start a thread discussing the use-case here at a higher level.

Member

@JornVernee JornVernee Oct 29, 2025

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar

That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.

Sorry, I don't mean 'character' but 'code unit'. For instance, when reading a UTF-8 string, the unit would be one byte, for UTF-16 it would be two, for UTF-32 four. So a user would just need to divide by the unit size, at least that's the idea.
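
The three units being discussed (Java chars, encoded bytes, and charset code units) can be compared with a short sketch. The helper codeUnitSize is hypothetical and covers only the charsets used here; it is not an existing JDK method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LengthUnits {
    // Hypothetical helper: code unit size in bytes, for the charsets used in this sketch only.
    static int codeUnitSize(Charset charset) {
        if (charset.equals(StandardCharsets.UTF_8)) {
            return 1;
        }
        if (charset.equals(StandardCharsets.UTF_16LE) || charset.equals(StandardCharsets.UTF_16BE)) {
            return 2;
        }
        throw new IllegalArgumentException("not covered by this sketch: " + charset);
    }

    // Length in bytes of the encoded string.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // Length in code units: the byte length divided by the unit size.
    static int codeUnits(String s, Charset charset) {
        return byteLength(s, charset) / codeUnitSize(charset);
    }
}
```

For "na\u00efve" (5 Java chars), UTF-8 gives 6 bytes and 6 code units, while UTF-16LE gives 10 bytes and 5 code units. Converting between bytes and code units is a division by a constant, whereas the number of characters in UTF-8 cannot be derived from the byte length without scanning the data.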

Contributor Author

I can start a thread discussing the use-case here at a higher level.

Done: https://mail.openjdk.org/pipermail/panama-dev/2025-November/021182.html

Contributor Author

There was some discussion in this panama-dev@ thread, and mcimadamore wrote a document: Pulling the (foreign) string

@openjdk openjdk bot added the security security-dev@openjdk.org label Nov 19, 2025
@openjdk

openjdk bot commented Nov 19, 2025

@cushon security has been added to this pull request based on files touched in new commit(s).


void copyToSegmentRaw(MemorySegment segment, long offset) {
MemorySegment.copy(value, 0, segment, ValueLayout.JAVA_BYTE, offset, value.length);
void copyToSegmentRaw(MemorySegment segment, long offset, int srcIndex, int numChars) {
Contributor

This method takes an index, expressed in chars, and uses that as a byte offset in a bulk copy operation. I don't think this is correct. E.g. if the string is UTF16 (and not LATIN1), there is a scaling factor to be applied?

Contributor

In other words, it seems to me that here we have hardwired the knowledge that we can only get here if the string is latin1. I don't think this was the original intent of this method -- however, if that's the case, we should also add an assertion to avoid misuse.

Contributor Author

Thanks for catching this. For copyToSegmentRaw, I have updated the parameter names to not refer to chars.

I have also tentatively added an assertion to copyToSegmentRaw to only support latin1 strings, which could be relaxed if bytesCompatible is updated to handle UTF-16

}

boolean bytesCompatible(Charset charset) {
boolean bytesCompatible(Charset charset, int srcIndex, int numChars) {
Contributor

Surprisingly here we don't do anything for the case where the string is UTF16 and the target charset is also UTF16?

@ExE-Boss ExE-Boss Nov 19, 2025

The UTF‑16 Charsets disallow unpaired surrogates, which Java Strings allow.

So this can only return true for UTF‑16 when the platform and charset endianness match and the String doesn’t have any unpaired surrogates.
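
The surrogate point can be checked with public APIs. The sketch below uses CharsetEncoder.canEncode and a plain getBytes round trip; the class and method names are hypothetical, and it says nothing about how String::bytesCompatible is or will be implemented.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateCheck {
    // True if the UTF-16LE encoder accepts the whole sequence without reporting an error.
    static boolean encodableAsUtf16(String s) {
        return StandardCharsets.UTF_16LE.newEncoder().canEncode(s);
    }

    // True if encoding and decoding again yields the original string.
    // getBytes substitutes unencodable input, so unpaired surrogates do not survive.
    static boolean roundTrips(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16LE);
        return new String(bytes, StandardCharsets.UTF_16LE).equals(s);
    }
}
```

A properly paired surrogate such as "\uD83D\uDE00" passes both checks, while "a\uD83Db" (a high surrogate with no low surrogate after it) fails both. This is why a raw copy of the String's internal bytes is not equivalent to encoding with the charset.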

Contributor

Yeah, tricky stuff -- we'll need to think more before changing this

Objects.requireNonNull(str);
MemorySegment segment;
if (StringSupport.bytesCompatible(str, charset, srcIndex, numChars)) {
segment = allocateNoInit(numChars);
Contributor

This also seems to rely on the fact that we end up here only for latin1 strings. Again, I don't think this is correct, but if it's deliberate, we should add an assertion check.

Contributor Author

Good point. I think we make a similar assumption in the existing allocateFrom(String, Charset), it does length + termCharSize and that should perhaps be (length + 1) * codeUnitSize.
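
The difference between the two size formulas is plain arithmetic and can be sketched as follows. This assumes length is a count of chars (code units) and is not a statement about the JDK's internals; the class and method names are hypothetical.

```java
public class TerminatedSize {
    // Mixes units: a char count plus a terminator size in bytes.
    static long mixedUnits(int charCount, int terminatorBytes) {
        return (long) charCount + terminatorBytes;
    }

    // Consistent units: (chars + one terminator unit) scaled to bytes.
    static long scaled(int charCount, int codeUnitBytes) {
        return ((long) charCount + 1) * codeUnitBytes;
    }
}
```

For a one-byte code unit the two agree (5 chars give 6 bytes either way). For a two-byte code unit, "hello" needs 10 payload bytes plus a 2-byte terminator, so scaled(5, 2) yields 12 while mixedUnits(5, 2) yields 7. The mixed form is only safe while the path is restricted to one-byte code units.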

* will appear truncated when read again.
*
* @param src the Java string to be written into this segment
* @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode}
Contributor

I'm not sure we have a dependency on the charset being standard?

Contributor Author

We do not, thanks, fixed.

Although I think the existing allocateFrom(String, Charset) method does have an undocumented dependency, because it uses CharsetKind to get the terminator char length, which only supports standard Charsets. If we add a fast path for UTF-16 that may need a dependency on a standard Charset (or a standard way to get the code unit size of a charset, if it has one).

Contributor

Not sure I follow -- the method you mention says this:

    * @throws IllegalArgumentException if {@code charset} is not a
     *         {@linkplain StandardCharsets standard charset}

What do you mean by "undocumented dependency"?

Contributor Author

Sorry, you're right, it is documented. It's documented differently than e.g. MemorySegment#getString, which mentions it in both the @param and @throws doc.

Contributor

@mcimadamore mcimadamore left a comment

Overall, looks good, but I think we should take some more care when going from char-based indices to byte array indices, esp. if we will optimize the UTF16 case in the future

@wangweij
Contributor

/label remove security

@openjdk openjdk bot removed the security security-dev@openjdk.org label Nov 19, 2025
@openjdk

openjdk bot commented Nov 19, 2025

@wangweij
The security label was successfully removed.

@cushon
Contributor Author

cushon commented Nov 20, 2025

Thanks for the review!

I think we should take some more care when going from char-based indices to byte array indices, esp. if we will optimize the UTF16 case in the future

Thanks for catching that. I have made some initial updates and added an assertion. To confirm, do you think it's OK to leave optimizing the UTF16 case as future work, as long as the current assumptions are clearly documented and guarded by assertions?

Also, I'm planning to spend more time on test coverage for this and going over the javadoc again.

@mcimadamore
Contributor

/csr needed

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Nov 20, 2025
@openjdk

openjdk bot commented Nov 20, 2025

@mcimadamore has indicated that a compatibility and specification (CSR) request is needed for this pull request.

@cushon please create a CSR request for issue JDK-8369564 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.

@mcimadamore
Contributor

To confirm, do you think it's OK to leave optimizing the UTF16 case as future work, as long as the current assumptions are clearly documented and guarded by assertions?

I think this is ok, yes -- maybe add a link (comment) between the assertion being thrown and String::bytesCompatible.

@mcimadamore
Contributor

/reviewers 2

@openjdk

openjdk bot commented Nov 20, 2025

@mcimadamore
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

Contributor

@mcimadamore mcimadamore left a comment

Overall approach looks solid, thanks for working on this improvement!

Member

@JornVernee JornVernee left a comment

Thanks for working on this! I've left some inline comments.


@ForceInline
public static String readBytes(AbstractMemorySegmentImpl segment, long offset, Charset charset, long length) {
final int lengthBytes = (int) length;
Member

I think we should do something more here than just ignore the upper bits. We probably need to throw an exception when the value is > Integer.MAX_VALUE

Member

Looks like we are already specified to throw IAE if 'the size of the string is greater than the largest string supported by the platform'
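
The concern about dropping the upper bits can be illustrated in isolation. The sketch below contrasts a bare cast with a range check that throws IllegalArgumentException. The class and method names are hypothetical, and the exact limit the PR enforces may differ from Integer.MAX_VALUE.

```java
public class LengthNarrowing {
    // Bare cast: silently keeps only the low 32 bits.
    static int truncating(long length) {
        return (int) length;
    }

    // Checked narrowing: rejects values that do not fit in a non-negative int.
    static int checked(long length) {
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        return (int) length;
    }
}
```

For example, truncating((1L << 32) + 5) returns 5, so an oversized request would quietly read five bytes, whereas checked rejects the same value.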

Contributor Author

Done, and I added a test to cover the IAE

}

/**
* Converts a Java string into a null-terminated C string using the provided charset,
Member

Not null-terminated, right?

Contributor Author

It is not, thanks, fixed

Comment on lines 176 to 177
* @throws IllegalArgumentException if {@code charset} is not a
* {@linkplain StandardCharsets standard charset}
Member

Same here, I don't think we have this limitation.

Contributor Author

Done

Comment on lines 179 to 180
* @throws IndexOutOfBoundsException if the {@code endIndex} is larger than the length of
* this {@code String} object, or {@code beginIndex} is larger than {@code endIndex}.
Member

There is no 'endIndex'?

Contributor Author

Done

assertThrows(IllegalArgumentException.class, () -> segment.getString(0, StandardCharsets.UTF_8, -1));
}
}

Member

We need some more tests for the other new methods as well. Also, it would be nice to test non-standard charsets.

Contributor Author

I added more tests to cover regular and exception cases for the three new methods. I'm happy to take suggestions on additional test coverage, or if there's a better location for any of the tests.

Member

Thanks, these look great!

I think another test that tests the case where srcIndex + numChars overflows for copy and allocateFrom, with different char sets (one that takes the internal bytesCompatible == true, and one that takes the bytesCompatible == false route) would be good to have.

Contributor Author

Thanks, I added coverage of srcIndex + numChars overflow for both bytesCompatible cases.

I think that in practice some of the cases were being caught later. I added Objects.checkFromIndexSize assertions to MemorySegment#copy and allocateFrom to catch them immediately with a useful exception, instead of e.g. catching it in String.substring.

* access operation will occur
* @param charset the charset used to {@linkplain Charset#newDecoder() decode} the
* string bytes
* @param length length to be used for string conversion, in bytes
Member

Please keep the same format

Suggested change
* @param length length to be used for string conversion, in bytes
* @param length length in bytes of the string to read

Contributor Author

Done

* string bytes
* @param length length to be used for string conversion, in bytes
* @return a Java string constructed from the bytes read from the given starting
* address reading the given length of characters
Member

Suggested change
* address reading the given length of characters
* address reading the given length of bytes

Contributor Author

Done

Comment on lines 2661 to 2662
* @throws IndexOutOfBoundsException if the {@code endIndex} is larger than the length of
* this {@code String} object, or {@code beginIndex} is larger than {@code endIndex}.
Member

There's no beginIndex and endIndex?

Contributor Author

Done

* document assertion to link to bytesCompatible
* throw IAE for length > Integer.MAX_VALUE
* javadoc fixes
Contributor Author

@cushon cushon left a comment

Thanks for the review, I think I've responded to all of the comments.

I also added more test coverage, and made a couple more fixes:

  • There was a mistake with srcIndex/numChars handling: numChars was being used as an index instead of a length. I have fixed the implementation and also the javadoc
  • I removed a couple more obsolete javadoc references to StandardCharsets

public void testStringsLength(String testString) {
Set<String> excluded = Set.of("yen", "snowman", "rainbow");
// This test only works for certain strings where the last character is not special
Set<String> excluded = Set.of("yen");
Member

I know the yen character is translated to / in some encodings when doing a round trip. Maybe we should just avoid this issue by switching it out for e.g. "section \u00A7", which is § and doesn't have the same problem.

Contributor Author

Thanks, sounds good! Done

MemorySegment.copy(testString, StandardCharsets.UTF_8, 1, text, 0, testString.length()));
assertThrows(IndexOutOfBoundsException.class, () ->
MemorySegment.copy(testString, StandardCharsets.UTF_8, 0, text, 0, testString.length() + 1));
// dstOffset > byteSize() + B
Member

Suggested change
// dstOffset > byteSize() + B
// dstOffset > byteSize() - B

Contributor Author

Thanks for the catch, fixed

Comment on lines 2659 to 2660
* @throws IndexOutOfBoundsException if the {@code numChars + srcIndex} is larger than the length of
* this {@code String} object.
Member

I think this leaves the question: what happens if srcIndex + numChars overflows and becomes negative? It will be less than the length of the string for sure, but not right either.

This is why the other methods' javadoc writes down e.g. srcIndex > srcArray.length - elementCount. Assuming all positive numbers, it avoids the overflow issue.
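
The overflow scenario can be reproduced with a short sketch comparing the addition form of the check with the subtraction form, alongside Objects.checkFromIndexSize (a real JDK method since Java 9). The class and method names here are hypothetical.

```java
import java.util.Objects;

public class RangeCheck {
    // Addition form: srcIndex + numChars can wrap around to a negative int.
    static boolean naiveInBounds(int srcIndex, int numChars, int length) {
        return srcIndex + numChars <= length;
    }

    // Subtraction form: with non-negative inputs, length - numChars cannot overflow.
    static boolean safeInBounds(int srcIndex, int numChars, int length) {
        return srcIndex >= 0 && numChars >= 0 && srcIndex <= length - numChars;
    }

    // The JDK utility performs an overflow-safe check and throws on failure.
    static boolean jdkInBounds(int srcIndex, int numChars, int length) {
        try {
            Objects.checkFromIndexSize(srcIndex, numChars, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }
}
```

With srcIndex = 1, numChars = Integer.MAX_VALUE and a string of length 10, the addition form wraps to a negative sum and wrongly reports the range as valid, while the other two reject it.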

* handle numChars + srcIndex overflow, and add tests
* replace yen with a character that round trips
Contributor Author

@cushon cushon left a comment

Thanks again!

I have responded to the comments, and I have also started drafting the CSR: https://bugs.openjdk.org/browse/JDK-8372338

public void testStringsLength(String testString) {
Set<String> excluded = Set.of("yen", "snowman", "rainbow");
// This test only works for certain strings where the last character is not special
Set<String> excluded = Set.of("yen");
Contributor Author

Thanks, sounds good! Done

assertThrows(IllegalArgumentException.class, () -> segment.getString(0, StandardCharsets.UTF_8, -1));
}
}

Contributor Author

Thanks, I added coverage of srcIndex + numChars overflow for both bytesCompatible cases.

I think that in practice some of the cases were being caught later. I added Objects.checkFromIndexSize assertions to MemorySegment#copy and allocateFrom to catch them immediately with a useful exception, instead of e.g. catching it in String.substring.
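
As a rough illustration of why Objects.checkFromIndexSize helps here, the sketch below applies it to a string's length. The wrapper class and the `rejects` helper are hypothetical; only `Objects.checkFromIndexSize(int, int, int)` is a real JDK method, and how the PR wires it into MemorySegment#copy and allocateFrom is not shown.

```java
import java.util.Objects;

final class CheckFromIndexSizeDemo {
    // Returns true when the (srcIndex, numChars) range is rejected for the given string.
    static boolean rejects(String s, int srcIndex, int numChars) {
        try {
            // Throws IndexOutOfBoundsException for negative arguments and for ranges
            // that extend past s.length(), including sums that would overflow int.
            Objects.checkFromIndexSize(srcIndex, numChars, s.length());
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects("hello", 0, 5));
        System.out.println(rejects("hello", 0, 6));
        System.out.println(rejects("hello", 1, Integer.MAX_VALUE));
    }
}
```

Checking up front means the caller sees an IndexOutOfBoundsException that describes the offending range, rather than a failure raised deeper in the call chain.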

Member

@JornVernee left a comment

Latest version looks good to me. Left one last suggestion inline.

@Override
public String getString(long offset, Charset charset, long length) {
if (length < 0) {
throw new IllegalArgumentException();
Member

Could we have an exception message here please?

Contributor Author

Thanks, I switched to Utils.checkNonNegativeArgument
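
For readers without the JDK sources at hand, the kind of check being requested might look like the sketch below. This is a hypothetical stand-in, not the JDK's internal Utils.checkNonNegativeArgument; the class name, method name, and message wording are all assumptions made for illustration.

```java
final class ArgumentChecks {
    // Hypothetical helper: rejects negative values with a message that names
    // the offending argument and its value.
    static void requireNonNegative(long value, String name) {
        if (value < 0) {
            throw new IllegalArgumentException("The provided " + name + " is negative: " + value);
        }
    }

    public static void main(String[] args) {
        requireNonNegative(16, "length"); // passes silently
        try {
            requireNonNegative(-1, "length");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Naming the argument in the message is what makes the exception actionable, compared with a bare `new IllegalArgumentException()`.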

* @param src the Java string to be written into this segment
* @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode}
* the string bytes.
* @param srcIndex the starting index of the source string
Contributor

Do we feel like saying words like "index" is good enough to say what we mean? E.g. by index we mean something that is compatible with length and charAt. I wonder if using a qualifier e.g. character index might help? (open to suggestions here, I'm not 100% sure)

Contributor Author

Thanks, I think 'character index' is probably better than just 'index', since there are both character and byte positions in this API. The docs for numChars also mention characters. I don't have a better idea for avoiding confusion here.

Member

I thought this would be clear enough since it's talking about an index of a string, where 'index' has a pretty well understood meaning.
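
The distinction between character positions and byte positions discussed in this thread can be seen in a small sketch. It uses only String and StandardCharsets; the class and method names are hypothetical, and it does not call the API under review.

```java
import java.nio.charset.StandardCharsets;

final class CharIndexVsByteOffset {
    // Number of UTF-8 bytes produced by encoding numChars characters
    // starting at character index srcIndex (same meaning as String.charAt).
    static int encodedLength(String src, int srcIndex, int numChars) {
        return src.substring(srcIndex, srcIndex + numChars)
                  .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "na\u00EFve"; // the third character needs two bytes in UTF-8
        System.out.println(s.length());             // counted in chars
        System.out.println(encodedLength(s, 0, 5)); // counted in bytes
        System.out.println(encodedLength(s, 2, 1)); // one char, two bytes
    }
}
```

Because a single character can encode to several bytes, srcIndex and numChars (characters) cannot be used interchangeably with the destination offset (bytes).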

* This method always replaces malformed-input and unmappable-character
* sequences with this charset's default replacement string. The {@link
* java.nio.charset.CharsetDecoder} class should be used when more control
* over the decoding process is required.
Contributor

Should we say here, as you did for copy, that this method ignores \0?

Contributor Author

I added:

If the string contains any {@code '\0'} characters, they will be read as well.

I suppose it might also make sense to update those warnings in setString and allocateFrom to mention that if you want to avoid truncating null-terminated strings, getString(long, Charset, long) could be used instead of getString(long). What do you think?

Contributor

Could be a good idea, thanks!
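
The truncation behaviour discussed in this thread can be modelled with a plain byte array. This is only a sketch under that assumption: it does not call MemorySegment, and the two helper methods are hypothetical stand-ins for a terminator-scanning read and a length-based read.

```java
import java.nio.charset.StandardCharsets;

final class EmbeddedNulDemo {
    // Models a terminator-scanning read: stops at the first 0 byte.
    static String readUntilNul(byte[] bytes, int offset) {
        int end = offset;
        while (end < bytes.length && bytes[end] != 0) {
            end++;
        }
        return new String(bytes, offset, end - offset, StandardCharsets.UTF_8);
    }

    // Models a length-based read: decodes exactly `length` bytes, 0 bytes included.
    static String readWithLength(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = "ab\0cd".getBytes(StandardCharsets.UTF_8);
        System.out.println(readUntilNul(bytes, 0).length());                 // truncated at '\0'
        System.out.println(readWithLength(bytes, 0, bytes.length).length()); // keeps '\0'
    }
}
```

With an explicit length, an embedded '\0' is just another character in the result instead of an end marker.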

* the string, such as {@link MemorySegment#getString(long)}, the string
* will appear truncated when read again.
*
* @param src the Java string to be written into this segment
Contributor

Suggested change
- * @param src the Java string to be written into this segment
+ * @param src the Java string to be written into the destination segment

Contributor Author

Done

* string bytes
* @param length length in bytes of the string to read
* @return a Java string constructed from the bytes read from the given starting
* address reading the given length of bytes
Contributor

@mcimadamore Nov 21, 2025

Maybe:

a Java string constructed from the bytes read from the given starting address up to the given length

(that seems to match the existing getString a bit more, and avoids the reading/read repetition)

Contributor Author

Done

}

/**
* Converts a Java string into a C string using the provided charset,
Contributor

I think using the term C string here is misleading.

I suggest:

Encodes a Java string using the provided charset and stores the resulting byte array into a memory segment.

Contributor Author

Thanks, done

* the string, such as {@link MemorySegment#getString(long)}, the string
* will appear truncated when read again.
*
* @param str the Java string to be converted into a C string
Contributor

Suggested change
- * @param str the Java string to be converted into a C string
+ * @param str the Java string to be encoded

Contributor Author

Done

* string bytes
* @param srcIndex the starting index of the source string
* @param numChars the number of characters to be copied
* @return a new native segment containing the converted C string
Contributor

Watch out for the term "C string" again here.

Contributor Author

Done

Contributor

@mcimadamore left a comment

Looks good. I left some more javadoc comments.

Contributor Author

@cushon left a comment

Thanks again, I updated the javadoc

Contributor

@mcimadamore left a comment

Looks great -- thanks!

Labels

core-libs (core-libs-dev@openjdk.org)
csr (Pull request needs approved CSR before integration)
rfr (Pull request is ready for review)

5 participants