8369564: Provide a MemorySegment API to read strings with known lengths #28043

cushon · 2025-10-29T10:50:53Z

This PR proposes adding a new overload to MemorySegment::getString that takes a known byte length of the content.

This was previously proposed in #20725, but the outcome of JDK-8333843 was to update MemorySegment#getString to suggest

    byte[] bytes = new byte[length];
    MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length);
    return new String(bytes, charset);

However, this is less efficient than what the implementation of getString does after JDK-8362893: it now uses JavaLangAccess::uncheckedNewStringNoRepl to avoid the copy.
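For reference, the copy-based workaround quoted above can be written out as a self-contained sketch. It assumes a JDK where java.lang.foreign is final (22 or later); the class and method names are made up for illustration and are not part of this PR.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class CopyThenDecode {
    // Reads `length` bytes starting at `offset` and decodes them with `charset`.
    // An intermediate byte[] is allocated and filled before the String is built,
    // and the String constructor then produces its own internal representation.
    static String readString(MemorySegment segment, long offset, int length, Charset charset) {
        byte[] bytes = new byte[length];
        MemorySegment.copy(segment, ValueLayout.JAVA_BYTE, offset, bytes, 0, length);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        MemorySegment segment =
                MemorySegment.ofArray("hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(readString(segment, 6, 5, StandardCharsets.UTF_8)); // prints "world"
    }
}
```

The benchmark rows named panama_copyLength presumably measure something close to this shape, though the benchmark source is not shown here.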

See also the discussion in this panama-dev@ thread, and mcimadamore's document Pulling the (foreign) string.

Benchmark results:

Benchmark                                 (size)  Mode  Cnt    Score   Error  Units
ToJavaStringTest.jni_readString                5  avgt   30   55.339 ± 0.401  ns/op
ToJavaStringTest.jni_readString               20  avgt   30   59.887 ± 0.295  ns/op
ToJavaStringTest.jni_readString              100  avgt   30   84.288 ± 0.419  ns/op
ToJavaStringTest.jni_readString              200  avgt   30  119.275 ± 0.496  ns/op
ToJavaStringTest.jni_readString              451  avgt   30  193.106 ± 1.528  ns/op
ToJavaStringTest.panama_copyLength             5  avgt   30    7.348 ± 0.048  ns/op
ToJavaStringTest.panama_copyLength            20  avgt   30    7.440 ± 0.125  ns/op
ToJavaStringTest.panama_copyLength           100  avgt   30   11.766 ± 0.058  ns/op
ToJavaStringTest.panama_copyLength           200  avgt   30   16.096 ± 0.089  ns/op
ToJavaStringTest.panama_copyLength           451  avgt   30   25.844 ± 0.054  ns/op
ToJavaStringTest.panama_readString             5  avgt   30    5.857 ± 0.046  ns/op
ToJavaStringTest.panama_readString            20  avgt   30    7.750 ± 0.046  ns/op
ToJavaStringTest.panama_readString           100  avgt   30   14.109 ± 0.187  ns/op
ToJavaStringTest.panama_readString           200  avgt   30   18.035 ± 0.130  ns/op
ToJavaStringTest.panama_readString           451  avgt   30   35.896 ± 0.227  ns/op
ToJavaStringTest.panama_readStringLength       5  avgt   30    4.565 ± 0.038  ns/op
ToJavaStringTest.panama_readStringLength      20  avgt   30    4.654 ± 0.040  ns/op
ToJavaStringTest.panama_readStringLength     100  avgt   30    8.502 ± 0.207  ns/op
ToJavaStringTest.panama_readStringLength     200  avgt   30   10.950 ± 0.124  ns/op
ToJavaStringTest.panama_readStringLength     451  avgt   30   16.244 ± 0.135  ns/op

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change requires a CSR request matching fixVersion 26 to be approved (needs to be created)
Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

JDK-8369564: Provide a MemorySegment API to read strings with known lengths (Enhancement - P4)(⚠️ The fixVersion in this issue is [27] but the fixVersion in .jcheck/conf is 26, a new backport will be created when this pr is integrated.)

Reviewers

Jorn Vernee (@JornVernee - Reviewer) 🔄 Re-review required (review applies to 489bf150)
Maurizio Cimadamore (@mcimadamore - Reviewer)

Contributors

Per Minborg <pminborg@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28043/head:pull/28043
$ git checkout pull/28043

Update a local copy of the PR:
$ git checkout pull/28043
$ git pull https://git.openjdk.org/jdk.git pull/28043/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28043

View PR using the GUI difftool:
$ git pr show -t 28043

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28043.diff

Using Webrev

Link to Webrev Comment

cushon · 2025-10-29T10:51:37Z

/contributor add @minborg

bridgekeeper · 2025-10-29T10:52:25Z

👋 Welcome back cushon! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-10-29T10:53:01Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-10-29T10:53:35Z

@cushon
Contributor Per Minborg <pminborg@openjdk.org> successfully added.

openjdk · 2025-10-29T10:54:11Z

@cushon The following label will be automatically applied to this pull request:

core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-10-29T10:58:01Z

Webrevs

JornVernee

Some preliminary comments. I didn't look at the tests yet.

JornVernee · 2025-10-29T15:29:34Z

src/java.base/share/classes/jdk/internal/foreign/StringSupport.java

+            case SINGLE_BYTE -> readByte(segment, offset, len, charset);
+            case DOUBLE_BYTE -> readShort(segment, offset, len, charset);
+            case QUAD_BYTE -> readInt(segment, offset, len, charset);


These 3 methods appear to be identical

Thanks, I refactored to do something more similar to the original PR to avoid the duplication here and with the existing read methods.

JornVernee · 2025-10-29T15:30:15Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+    String getString(long offset, int length, Charset charset);
+


I'd suggest putting the length parameter at the end, so that this becomes a telescoping overload of the length-less variant.

JornVernee · 2025-10-29T15:31:45Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     *             <li>{@code N} is the size (in bytes) of the terminator char according
+     *             to the provided charset. For instance, this is 1 for


Why is the terminator char important? The segment doesn't necessarily need to have a terminator char, right? I don't see this invariant being checked in the code either.

Thanks, it is not, I think this was left over from javadoc adapted from another overload

JornVernee · 2025-10-29T15:32:40Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     *             <li>{@code B} is the size, in bytes, of the string encoded using the
+     *             provided charset (e.g. {@code str.getBytes(charset).length});</li>


Isn't B equal to the length argument?

Thanks, yes, I reworked this part

JornVernee · 2025-10-29T15:35:41Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * @param length  byte length to be used for string conversion (not including any
+     *                null termination)


I think 'to be used for string conversion' is a bit too vague (used how?). I think a more descriptive text could be something like 'length in bytes of the string to read' (matching also the pattern of the existing 'offset in bytes').

Also, what happens if:

The length does include a null terminator

The length is not a multiple of the byte size of a character in the given charset.

On that last note, I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset. That will make it impossible to pass a length that is not a multiple of the character size.

Thanks for taking a look, I wanted to respond briefly to this part and will review the rest of the comments later:

I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset

Part of the motivation here is to support efficiently reading binary formats where I think it's more common to record the length of string data in bytes, than in 16-bit code units in the UTF-16 encoding of the string.

Discussed this with the team as well. For cases of native interop it seems more likely that you'd have e.g. an array of wchar_t on the native side, and you are tracking the length of that array, not the byte size.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar, but if the length is specified in bytes, the API has an extra error case we need to check and specify.

Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar

That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.

Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.

Sounds good, thanks, I can start a thread discussing the use-case here at a higher level.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar

That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.

Sorry, I don't mean 'character' but 'code unit'. For instance, when reading a UTF-8 string, the unit would be one byte, for UTF-16 it would be two, for UTF-32 four. So a user would just need to divide by the unit size, at least that's the idea.
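The code-unit idea can be sketched roughly as follows. The helper names and the small table of unit sizes are assumptions made for illustration; nothing like codeUnitSize is claimed to exist in the JDK. The sketch also shows the extra error case mentioned earlier: a byte length that is not a multiple of the unit size.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class CodeUnitLengths {
    // Assumed mapping from a few well-known charsets to their code unit size in bytes.
    static int codeUnitSize(Charset charset) {
        String name = charset.name();
        if (name.equals("UTF-8") || name.equals("US-ASCII") || name.equals("ISO-8859-1")) {
            return 1;
        }
        if (name.startsWith("UTF-16")) {
            return 2;
        }
        if (name.startsWith("UTF-32")) {
            return 4;
        }
        throw new IllegalArgumentException("Unknown code unit size for " + name);
    }

    // Code units to bytes: a plain multiplication.
    static long toByteLength(long codeUnits, Charset charset) {
        return Math.multiplyExact(codeUnits, (long) codeUnitSize(charset));
    }

    // Bytes to code units: a division, plus the misalignment error case.
    static long toCodeUnits(long byteLength, Charset charset) {
        int unit = codeUnitSize(charset);
        if (byteLength % unit != 0) {
            throw new IllegalArgumentException(
                    byteLength + " bytes is not a multiple of the code unit size " + unit);
        }
        return byteLength / unit;
    }

    public static void main(String[] args) {
        System.out.println(toByteLength(5, StandardCharsets.UTF_16LE)); // prints 10
        // In UTF-8 the code unit is one byte, so the number of code units is the
        // byte length, which can still differ from String.length().
        System.out.println("h\u00E9llo".getBytes(StandardCharsets.UTF_8).length); // prints 6
    }
}
```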

I can start a thread discussing the use-case here at a higher level.

Done: https://mail.openjdk.org/pipermail/panama-dev/2025-November/021182.html

There was some discussion in this panama-dev@ thread, and mcimadamore wrote a document: Pulling the (foreign) string

openjdk · 2025-11-19T08:33:17Z

@cushon security has been added to this pull request based on files touched in new commit(s).

…::encode

mcimadamore · 2025-11-19T14:43:09Z

src/java.base/share/classes/java/lang/String.java


-    void copyToSegmentRaw(MemorySegment segment, long offset) {
-        MemorySegment.copy(value, 0, segment, ValueLayout.JAVA_BYTE, offset, value.length);
+    void copyToSegmentRaw(MemorySegment segment, long offset, int srcIndex, int numChars) {


This method takes an index, expressed in chars, and uses that as a byte offset in a bulk copy operation. I don't think this is correct. E.g. if the string is UTF16 (and not LATIN1), there is a scaling factor to be applied?

In other words, it seems to me that here we have hardwired the knowledge that we can only get here if the string is latin1. I don't think this was the original intent of this method -- however, if that's the case, we should also add an assertion to avoid misuse.

Thanks for catching this. For copyToSegmentRaw, I have updated the parameter names to not refer to chars.

I have also tentatively added an assertion to copyToSegmentRaw to only support latin1 strings, which could be relaxed if bytesCompatible is updated to handle UTF-16
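The scaling concern can be illustrated without touching String internals. The sketch below models a two-bytes-per-char layout using the UTF-16LE encoding of the string rather than the real String.value array, so it is only an analogy; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class CharIndexScaling {
    // Returns the UTF-16LE bytes for the chars [srcIndex, srcIndex + numChars).
    // The char index and char count are both scaled by the code unit size (2 bytes);
    // using them directly as byte positions would select the wrong range.
    static byte[] sliceUtf16(String s, int srcIndex, int numChars) {
        byte[] all = s.getBytes(StandardCharsets.UTF_16LE);
        int from = srcIndex * Character.BYTES;
        int to = (srcIndex + numChars) * Character.BYTES;
        return Arrays.copyOfRange(all, from, to);
    }

    public static void main(String[] args) {
        byte[] slice = sliceUtf16("a\u03A9bc", 1, 2);
        System.out.println(new String(slice, StandardCharsets.UTF_16LE).equals("\u03A9b")); // prints true
    }
}
```

For a latin1 string the scaling factor is 1, which is why the unscaled copy happens to work on that path only.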

mcimadamore · 2025-11-19T14:44:39Z

src/java.base/share/classes/java/lang/String.java

    }

-    boolean bytesCompatible(Charset charset) {
+    boolean bytesCompatible(Charset charset, int srcIndex, int numChars) {


Surprisingly here we don't do anything for the case where the string is UTF16 and the target charset is also UTF16?

The UTF‑16 Charsets disallow unpaired surrogates, which Java Strings allow.

So this can only return true for UTF‑16 when the platform and charset endianness match and the String doesn’t have any unpaired surrogates.
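The surrogate restriction can be observed with a strict encoder. This is a small sketch, with a hypothetical helper name, that reports whether a string encodes without malformed input under the encoder's default (reporting) error actions.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class SurrogateCheck {
    // A fresh CharsetEncoder reports malformed input by default, so
    // encode(CharBuffer) throws instead of substituting a replacement.
    static boolean encodesStrictly(String s, Charset charset) {
        try {
            charset.newEncoder().encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A well-formed surrogate pair is fine.
        System.out.println(encodesStrictly("\uD83D\uDE00", StandardCharsets.UTF_16LE)); // prints true
        // An unpaired high surrogate is a legal Java String but is rejected by the encoder.
        System.out.println(encodesStrictly("a\uD800b", StandardCharsets.UTF_16LE)); // prints false
    }
}
```

A raw copy of the String's bytes would therefore let through content that the charset's encoder would not produce.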

Yeah, tricky stuff -- we'll need to think more before changing this

mcimadamore · 2025-11-19T14:48:52Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

+        Objects.requireNonNull(str);
+        MemorySegment segment;
+        if (StringSupport.bytesCompatible(str, charset, srcIndex, numChars)) {
+            segment = allocateNoInit(numChars);


This also seems to rely on the fact that we end up here only for latin1 strings. Again, I don't think this is correct, but if it's deliberate, we should add an assertion check.

Good point. I think we make a similar assumption in the existing allocateFrom(String, Charset): it does length + termCharSize, and that should perhaps be (length + 1) * codeUnitSize.
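The difference between the two size formulas is plain arithmetic. The sketch below only compares the expressions; the method names are hypothetical and it does not reproduce the actual allocateFrom code.

```java
// Hypothetical helper class, for illustration only.
final class AllocationSizes {
    // Size computed as "length + termCharSize": correct when each char takes one byte.
    static long lengthPlusTerminator(int length, int termCharSize) {
        return (long) length + termCharSize;
    }

    // Size computed as "(length + 1) * codeUnitSize": scales the chars as well as the terminator.
    static long scaledByCodeUnit(int length, int codeUnitSize) {
        return ((long) length + 1) * codeUnitSize;
    }

    public static void main(String[] args) {
        // One-byte code units: both formulas agree.
        System.out.println(lengthPlusTerminator(5, 1) + " vs " + scaledByCodeUnit(5, 1)); // prints "6 vs 6"
        // Two-byte code units: the first formula would under-allocate.
        System.out.println(lengthPlusTerminator(5, 2) + " vs " + scaledByCodeUnit(5, 2)); // prints "7 vs 12"
    }
}
```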

mcimadamore · 2025-11-19T14:52:04Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * will appear truncated when read again.
+     *
+     * @param src      the Java string to be written into this segment
+     * @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode}


I'm not sure we have a dependency on the charset being standard?

We do not, thanks, fixed.

Although I think the existing allocateFrom(String, Charset) method does have an undocumented dependency, because it uses CharsetKind to get the terminator char length, which only supports standard Charsets. If we add a fast path for UTF-16 that may need a dependency on a standard Charset (or a standard way to get the code unit size of a charset, if it has one).

Not sure I follow -- the method you mention says this:

     * @throws IllegalArgumentException if {@code charset} is not a
     *         {@linkplain StandardCharsets standard charset}

What do you mean by "undocumented dependency"?

Sorry, you're right, it is documented. It's documented differently than e.g. MemorySegment#getString, which mentions it in both the @param and @throws doc.

mcimadamore

Overall, looks good, but I think we should put some more care when going to char-based indices to byte array indices, esp. if we will optimize the UTF16 case in the future

wangweij · 2025-11-19T15:17:06Z

/label remove security

openjdk · 2025-11-19T15:19:01Z

@wangweij
The security label was successfully removed.

cushon · 2025-11-20T08:52:36Z

Thanks for the review!

I think we should put some more care when going to char-based indices to byte array indices, esp. if we will optimize the UTF16 case in the future

Thanks for catching that. I have made some initial updates and added an assertion. To confirm do you think it's OK to leave optimizing the UTF16 case as future work, as long as the current assumptions are clearly documented and guarded by assertions?

Also, I'm planning to spend more time on test coverage for this and going over the javadoc again.

mcimadamore · 2025-11-20T11:40:52Z

/csr needed

openjdk · 2025-11-20T11:41:27Z

@mcimadamore has indicated that a compatibility and specification (CSR) request is needed for this pull request.

@cushon please create a CSR request for issue JDK-8369564 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.

mcimadamore · 2025-11-20T11:41:42Z

To confirm do you think it's OK to leave optimizing the UTF16 case as future work, as long as the current assumptions are clearly documented and guarded by assertions?

I think this is ok, yes -- maybe add a link (comment) between the assertion being thrown and String::bytesCompatible.

mcimadamore · 2025-11-20T11:42:44Z

/reviewers 2

openjdk · 2025-11-20T11:43:24Z

@mcimadamore
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

mcimadamore

Overall approach looks solid, thanks for working on this improvement!

JornVernee

Thanks for working on this! I've left some inline comments.

JornVernee · 2025-11-20T15:47:18Z

src/java.base/share/classes/jdk/internal/foreign/StringSupport.java

+
+    @ForceInline
+    public static String readBytes(AbstractMemorySegmentImpl segment, long offset, Charset charset, long length) {
+        final int lengthBytes = (int) length;


I think we should do something more here than just ignore the upper bits. We probably need to throw an exception when the value is > Integer.MAX_VALUE

Looks like we are already specified to throw IAE if 'the size of the string is greater than the largest string supported by the platform'

Done, and I added a test to cover the IAE
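The hazard with a bare (int) cast is that the upper bits are dropped silently. A minimal sketch of a checked narrowing follows; the helper name and the exception messages are made up, and the PR's actual check may look different.

```java
// Hypothetical helper class, for illustration only.
final class LengthNarrowing {
    // Narrows a long length to int, rejecting values a bare cast would mangle.
    static int checkedLength(long length) {
        if (length < 0) {
            throw new IllegalArgumentException("length is negative: " + length);
        }
        if (length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("length exceeds Integer.MAX_VALUE: " + length);
        }
        return (int) length;
    }

    public static void main(String[] args) {
        long tooBig = (1L << 32) + 5;
        // The unchecked cast keeps only the low 32 bits.
        System.out.println((int) tooBig); // prints 5
        System.out.println(checkedLength(42L)); // prints 42
    }
}
```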

JornVernee · 2025-11-20T15:47:59Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

    }

+    /**
+     * Converts a Java string into a null-terminated C string using the provided charset,


Not null-terminated, right?

It is not, thanks, fixed

JornVernee · 2025-11-20T15:48:57Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

+     * @throws IllegalArgumentException if {@code charset} is not a
+     *         {@linkplain StandardCharsets standard charset}


Same here, I don't think we have this limitation.

JornVernee · 2025-11-20T15:49:30Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

+     * @throws IndexOutOfBoundsException  if the {@code endIndex} is larger than the length of
+     *         this {@code String} object, or {@code beginIndex} is larger than {@code endIndex}.


There is no 'endIndex'?

JornVernee · 2025-11-20T15:52:16Z

test/jdk/java/foreign/TestStringEncoding.java

+            assertThrows(IllegalArgumentException.class, () -> segment.getString(0, StandardCharsets.UTF_8, -1));
+        }
+    }
+


We need some more tests for the other new methods as well. Also, it would be nice to test non-standard charsets.

I added more tests to cover regular and exception cases for the three new methods. I'm happy to take suggestions on additional test coverage, or if there's a better location for any of the tests.

Thanks, these look great!

I think another test that tests the case where srcIndex + numChars overflows for copy and allocateFrom, with different char sets (one that takes the internal bytesCompatible == true, and one that takes the bytesCompatible == false route) would be good to have.

Thanks, I added coverage of srcIndex + numChars overflow for both bytesCompatible cases.

I think that in practice some of the cases were being caught later. I added Objects.checkFromIndexSize assertions to MemorySegment#copy and allocateFrom to catch them immediately with a useful exception, instead of e.g. catching it in String.substring.
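As a rough illustration of the up-front check, Objects.checkFromIndexSize rejects a range whose index plus size would overflow before any substring call is reached. The wrapper below is a hypothetical stand-in, not the code added to MemorySegment#copy or allocateFrom.

```java
import java.util.Objects;

// Hypothetical helper class, for illustration only.
final class RangeCheck {
    // Validates [srcIndex, srcIndex + numChars) against the string length first,
    // so a bad range fails here with IndexOutOfBoundsException.
    static String slice(String s, int srcIndex, int numChars) {
        Objects.checkFromIndexSize(srcIndex, numChars, s.length());
        return s.substring(srcIndex, srcIndex + numChars);
    }

    public static void main(String[] args) {
        System.out.println(slice("hello", 1, 3)); // prints "ell"
        try {
            slice("hello", 2, Integer.MAX_VALUE); // 2 + MAX_VALUE overflows int
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```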

JornVernee · 2025-11-20T15:53:38Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     *                access operation will occur
+     * @param charset the charset used to {@linkplain Charset#newDecoder() decode} the
+     *                string bytes
+     * @param length  length to be used for string conversion, in bytes


Please keep the same format

Suggested change

* @param length length to be used for string conversion, in bytes

* @param length length in bytes of the string to read

JornVernee · 2025-11-20T15:54:17Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     *                string bytes
+     * @param length  length to be used for string conversion, in bytes
+     * @return a Java string constructed from the bytes read from the given starting
+     *         address reading the given length of characters


Suggested change

* address reading the given length of characters

* address reading the given length of bytes

JornVernee · 2025-11-20T15:58:32Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * @throws IndexOutOfBoundsException if the {@code endIndex} is larger than the length of
+     *         this {@code String} object, or {@code beginIndex} is larger than {@code endIndex}.


There's no beginIndex and endIndex?

* document assertion to link to bytesCompatible
* throw IAE for length > Integer.MAX_VALUE
* javadoc fixes

cushon

Thanks for the review, I think I've responded to all of the comments.

I also added more test coverage, and made a couple more fixes:

There was a mistake with srcIndex/numChars handling: numChars was being used as an index instead of a length. I have fixed the implementation and also the javadoc.
I removed a couple more obsolete javadoc references to StandardCharsets

JornVernee · 2025-11-21T12:35:01Z

test/jdk/java/foreign/TestStringEncoding.java

    public void testStringsLength(String testString) {
-        Set<String> excluded = Set.of("yen", "snowman", "rainbow");
-        // This test only works for certain strings where the last character is not special
+        Set<String> excluded = Set.of("yen");


I know the yen character is translated to / in some encodings when doing a round trip. Maybe we should just avoid this issue by switching it out for e.g. "section \u00A7", which is § and doesn't have the same problem.

Thanks, sounds good! Done
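Whether a test string survives a given charset can be checked with a simple encode/decode round trip. This sketch uses only standard charsets and a hypothetical helper name; it does not reproduce the yen-specific behaviour, which depends on charsets not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class RoundTrip {
    // True if encoding and then decoding with the same charset gives back the input.
    // String.getBytes substitutes a replacement for unmappable characters,
    // so a lossy charset yields a different string.
    static boolean roundTrips(String s, Charset charset) {
        return s.equals(new String(s.getBytes(charset), charset));
    }

    public static void main(String[] args) {
        String section = "section \u00A7";
        System.out.println(roundTrips(section, StandardCharsets.UTF_8));      // prints true
        System.out.println(roundTrips(section, StandardCharsets.ISO_8859_1)); // prints true
        System.out.println(roundTrips(section, StandardCharsets.US_ASCII));   // prints false
    }
}
```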

JornVernee · 2025-11-21T12:38:12Z

test/jdk/java/foreign/TestStringEncoding.java

+                    MemorySegment.copy(testString, StandardCharsets.UTF_8, 1, text, 0, testString.length()));
+            assertThrows(IndexOutOfBoundsException.class, () ->
+                    MemorySegment.copy(testString, StandardCharsets.UTF_8, 0, text, 0, testString.length() + 1));
+            // dstOffset > byteSize() + B


Suggested change

// dstOffset > byteSize() + B

// dstOffset > byteSize() - B

Thanks for the catch, fixed

JornVernee · 2025-11-21T12:45:36Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * @throws IndexOutOfBoundsException if the {@code numChars + srcIndex} is larger than the length of
+     *         this {@code String} object.


I think this leaves the question: what happens if srcIndex + numChars overflows and becomes negative? It will be less than the length of the string for sure, but not right either.

This is why the other methods' javadoc writes the condition as e.g. srcIndex > srcArray.length - elementCount. Assuming all positive numbers, it avoids the overflow issue.
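The two ways of writing the bound can be compared directly. In this sketch (hypothetical names, all arguments assumed non-negative) the additive form misses the overflowing case while the subtractive form catches it.

```java
// Hypothetical helper class, for illustration only.
final class OverflowSafeBounds {
    // Additive form: srcIndex + numChars may wrap around to a negative int.
    static boolean naiveOutOfBounds(int srcIndex, int numChars, int length) {
        return srcIndex + numChars > length;
    }

    // Subtractive form: with non-negative inputs, length - numChars cannot overflow.
    static boolean safeOutOfBounds(int srcIndex, int numChars, int length) {
        return srcIndex > length - numChars;
    }

    public static void main(String[] args) {
        int srcIndex = 2, numChars = Integer.MAX_VALUE, length = 5;
        System.out.println(naiveOutOfBounds(srcIndex, numChars, length)); // prints false (overflow hides the error)
        System.out.println(safeOutOfBounds(srcIndex, numChars, length));  // prints true
    }
}
```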

* handle numChars + srcIndex overflow, and add tests
* replace yen with a character that round trips

cushon

Thanks again!

I have responded to the comments, and I have also started drafting the CSR: https://bugs.openjdk.org/browse/JDK-8372338

JornVernee

Latest version looks good to me. Left one last suggestion inline.

JornVernee · 2025-11-21T14:13:52Z

src/java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java

+    @Override
+    public String getString(long offset, Charset charset, long length) {
+        if (length < 0) {
+            throw new IllegalArgumentException();


Could we have an exception message here please?

Thanks, I switched to Utils.checkNonNegativeArgument

mcimadamore · 2025-11-21T14:29:56Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * @param src      the Java string to be written into this segment
+     * @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode}
+     *                 the string bytes.
+     * @param srcIndex the starting index of the source string


Do we feel like saying words like "index" is good enough to say what we mean? E.g. by index we mean something that is compatible with length and charAt. I wonder if using a qualifier e.g. character index might help? (open to suggestions here, I'm not 100% sure)

Thanks, I think 'character index' is probably better than just index, since there are both character and byte positions in this API. The docs for numChars also mentions characters. I don't have a better idea for avoiding confusion here.

I thought this would be clear enough since it's talking about an index of a string, where 'index' has a pretty well understood meaning.

mcimadamore · 2025-11-21T14:30:53Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * This method always replaces malformed-input and unmappable-character
+     * sequences with this charset's default replacement string. The {@link
+     * java.nio.charset.CharsetDecoder} class should be used when more control
+     * over the decoding process is required.


Should we say here, as you did for copy that this method ignores \0 ?

I added:

If the string contains any {@code '\0'} characters, they will be read as well.

I suppose it might also make sense to update those warnings in setString and allocateFrom to mention that if you want to avoid truncating null-terminated strings, getString(long, Charset, long) could be used instead of getString(long). What do you think?

Could be a good idea, thanks!
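The truncation being discussed can be sketched on a plain byte array instead of a MemorySegment. Both helpers below are hypothetical and assume a one-byte terminator; they only contrast "stop at the first NUL" with "read exactly this many bytes".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class NulHandling {
    // Terminator-based read: stops at the first zero byte at or after offset.
    static String readTerminated(byte[] bytes, int offset, Charset charset) {
        int end = offset;
        while (end < bytes.length && bytes[end] != 0) {
            end++;
        }
        if (end == bytes.length) {
            throw new IllegalArgumentException("no terminator found");
        }
        return new String(bytes, offset, end - offset, charset);
    }

    // Length-based read: decodes exactly `length` bytes, embedded zeros included.
    static String readWithLength(byte[] bytes, int offset, int length, Charset charset) {
        return new String(bytes, offset, length, charset);
    }

    public static void main(String[] args) {
        byte[] data = {'a', 'b', 0, 'c', 'd'};
        System.out.println(readTerminated(data, 0, StandardCharsets.UTF_8));              // prints "ab"
        System.out.println(readWithLength(data, 0, 5, StandardCharsets.UTF_8).length()); // prints 5
    }
}
```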

mcimadamore · 2025-11-21T14:31:29Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * the string, such as {@link MemorySegment#getString(long)}, the string
+     * will appear truncated when read again.
+     *
+     * @param src      the Java string to be written into this segment


Suggested change

* @param src the Java string to be written into this segment

* @param src the Java string to be written into the destination segment

mcimadamore · 2025-11-21T14:37:54Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     *                string bytes
+     * @param length  length in bytes of the string to read
+     * @return a Java string constructed from the bytes read from the given starting
+     *         address reading the given length of bytes


Maybe:

a Java string constructed from the bytes read from the given starting address up to the given length

(that seems to match the existing getString a bit more, and avoids the reading/read repetition)

mcimadamore · 2025-11-21T14:40:11Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

    }

+    /**
+     * Converts a Java string into a C string using the provided charset,


I think using the term C string here is misleading.

I suggest:

Encodes a Java string using the provided charset and stores the resulting byte array into a memory segment.

Thanks, done

mcimadamore · 2025-11-21T14:41:02Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

+     * the string, such as {@link MemorySegment#getString(long)}, the string
+     * will appear truncated when read again.
+     *
+     * @param str      the Java string to be converted into a C string


Suggested change

* @param str the Java string to be converted into a C string

* @param str the Java string to be encoded

mcimadamore · 2025-11-21T14:42:43Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

+     *                 string bytes
+     * @param srcIndex the starting index of the source string
+     * @param numChars the number of characters to be copied
+     * @return a new native segment containing the converted C string


Watch out for C string again

mcimadamore

Looks good -- left some more javadoc comments

cushon

Thanks again, I updated the javadoc

cushon · 2025-11-21T14:36:44Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * This method always replaces malformed-input and unmappable-character
+     * sequences with this charset's default replacement string. The {@link
+     * java.nio.charset.CharsetDecoder} class should be used when more control
+     * over the decoding process is required.


I added:

If the string contains any {@code '\0'} characters, they will be read as well.

I suppose it might also make sense to update those warnings in setString and allocateFrom to mention that if you want to avoid truncating null-terminated strings, getString(long, Charset, long) could be used instead of getString(long). What do you think?

cushon · 2025-11-21T14:37:06Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * the string, such as {@link MemorySegment#getString(long)}, the string
+     * will appear truncated when read again.
+     *
+     * @param src      the Java string to be written into this segment


cushon · 2025-11-21T14:39:59Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     * @param src      the Java string to be written into this segment
+     * @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode}
+     *                 the string bytes.
+     * @param srcIndex the starting index of the source string


Thanks, I think 'character index' is probably better than just 'index', since there are both character and byte positions in this API. The docs for numChars also mention characters. I don't have a better idea for avoiding confusion here.
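A small illustration of why the distinction matters, using only `java.lang.String` and hypothetical names (`CharVsByteIndex`, `encodedByteLength`): the same character range, selected by `srcIndex` and `numChars`, occupies a different number of bytes depending on the charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharVsByteIndex {
    // srcIndex and numChars count Java chars (UTF-16 code units) in src;
    // the result counts bytes after encoding that range with the charset.
    static int encodedByteLength(String src, int srcIndex, int numChars, Charset charset) {
        return src.substring(srcIndex, srcIndex + numChars).getBytes(charset).length;
    }

    public static void main(String[] args) {
        String src = "h\u00e9llo"; // 'é' is one char but two bytes in UTF-8
        System.out.println(encodedByteLength(src, 1, 2, StandardCharsets.UTF_8));      // 3
        System.out.println(encodedByteLength(src, 1, 2, StandardCharsets.UTF_16LE));   // 4
        System.out.println(encodedByteLength(src, 1, 2, StandardCharsets.ISO_8859_1)); // 2
    }
}
```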

cushon · 2025-11-21T14:41:46Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

    }

+    /**
+     * Converts a Java string into a C string using the provided charset,


Thanks, done

cushon · 2025-11-21T14:44:11Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+     *                string bytes
+     * @param length  length in bytes of the string to read
+     * @return a Java string constructed from the bytes read from the given starting
+     *         address reading the given length of bytes


cushon · 2025-11-21T14:51:03Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

+     * the string, such as {@link MemorySegment#getString(long)}, the string
+     * will appear truncated when read again.
+     *
+     * @param str      the Java string to be converted into a C string


cushon · 2025-11-21T14:51:33Z

src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java

+     *                 string bytes
+     * @param srcIndex the starting index of the source string
+     * @param numChars the number of characters to be copied
+     * @return a new native segment containing the converted C string


mcimadamore

Looks great -- thanks!

8369564: Provide a MemorySegment API to read strings with known lengths

937a868

openjdk bot added the core-libs core-libs-dev@openjdk.org label Oct 29, 2025

openjdk bot added the rfr Pull request is ready for review label Oct 29, 2025

JornVernee reviewed Oct 29, 2025


cushon added 7 commits October 29, 2025 17:33

Consolidate duplicate code in read methods

cd6db90

Update length to code units instead of bytes

43a719e

Add benchmarks, and demo setStringWithoutNullTerminator

53b064f

Merge branch 'master' into JDK-8369564

9f05a8b

Remove setStringWithoutNullTerminator demo

b729b55

.

58525ac

Updates from panama-dev thread

3f6ee81

openjdk bot added the security security-dev@openjdk.org label Nov 19, 2025

Add a dstOffset parameter, stop using StringCharBuffer/CharsetEncoder::encode

0593827

mcimadamore reviewed Nov 19, 2025


openjdk bot removed the security security-dev@openjdk.org label Nov 19, 2025

Review feedback

faa4c5b

openjdk bot added the csr Pull request needs approved CSR before integration label Nov 20, 2025

mcimadamore reviewed Nov 20, 2025


JornVernee reviewed Nov 20, 2025


cushon added 2 commits November 20, 2025 20:43

Review feedback

214418f

* document assertion to link to bytesCompatible
* throw IAE for length > Integer.MAX_VALUE
* javadoc fixes

Improve test coverage, and more fixes

31df3a2

cushon commented Nov 21, 2025


JornVernee reviewed Nov 21, 2025


Review feedback

489bf15

* handle numChars + srcIndex overflow, and add tests
* replace yen with a character that round trips

cushon commented Nov 21, 2025


JornVernee approved these changes Nov 21, 2025


Use Utils.checkNonNegativeArgument

903696b

mcimadamore reviewed Nov 21, 2025


More javadoc updates

3b206ec

cushon commented Nov 21, 2025


mcimadamore approved these changes Nov 21, 2025


