-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8369564: Provide a MemorySegment API to read strings with known lengths #28043
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
937a868
cd6db90
43a719e
53b064f
9f05a8b
b729b55
58525ac
3f6ee81
0593827
faa4c5b
214418f
31df3a2
489bf15
903696b
3b206ec
a84b9ea
396f942
6065acd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1296,12 +1296,7 @@ MemorySegment reinterpret(long newSize, | |
| * over the decoding process is required. | ||
| * <p> | ||
| * Getting a string from a segment with a known byte offset and | ||
| * known byte length can be done like so: | ||
| * {@snippet lang=java : | ||
| * byte[] bytes = new byte[length]; | ||
| * MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length); | ||
| * return new String(bytes, charset); | ||
| * } | ||
| * known byte length can be done using {@link #getString(long, Charset, long)}. | ||
| * | ||
| * @param offset offset in bytes (relative to this segment address) at which this | ||
| * access operation will occur | ||
|
|
@@ -1328,6 +1323,40 @@ MemorySegment reinterpret(long newSize, | |
| */ | ||
| String getString(long offset, Charset charset); | ||
|
|
||
| /** | ||
| * Reads a string from this segment at the given offset, using the provided length | ||
| * and charset. | ||
| * <p> | ||
| * This method always replaces malformed-input and unmappable-character | ||
| * sequences with this charset's default replacement string. The {@link | ||
| * java.nio.charset.CharsetDecoder} class should be used when more control | ||
| * over the decoding process is required. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we say here, as you did for
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added:
I suppose it might also make sense to update those warnings in
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could be a good idea, thanks! |
||
| * <p> | ||
| * If the string contains any {@code '\0'} characters, they will be read as well. | ||
| * This differs from {@link #getString(long, Charset)}, which will only read up | ||
| * to the first {@code '\0'}, resulting in truncation for string data that contains | ||
| * the {@code '\0'} character. | ||
| * | ||
| * @param offset offset in bytes (relative to this segment address) at which this | ||
| * access operation will occur | ||
| * @param charset the charset used to {@linkplain Charset#newDecoder() decode} the | ||
| * string bytes | ||
| * @param length length, in bytes, of the region of memory to read and decode into | ||
| * a string | ||
| * @return a Java string constructed from the bytes read from the given starting | ||
| * address up to the given length | ||
| * @throws IllegalArgumentException if the size of the string is greater than the | ||
| * largest string supported by the platform | ||
| * @throws IndexOutOfBoundsException if {@code offset < 0} | ||
| * @throws IndexOutOfBoundsException if {@code offset > byteSize() - length} | ||
| * @throws IllegalStateException if the {@linkplain #scope() scope} associated with | ||
| * this segment is not {@linkplain Scope#isAlive() alive} | ||
| * @throws WrongThreadException if this method is called from a thread {@code T}, | ||
| * such that {@code isAccessibleBy(T) == false} | ||
| * @throws IllegalArgumentException if {@code length < 0} | ||
| */ | ||
| String getString(long offset, Charset charset, long length); | ||
|
|
||
| /** | ||
| * Writes the given string into this segment at the given offset, converting it to | ||
| * a null-terminated byte sequence using the {@linkplain StandardCharsets#UTF_8 UTF-8} | ||
|
|
@@ -1366,7 +1395,8 @@ MemorySegment reinterpret(long newSize, | |
| * If the given string contains any {@code '\0'} characters, they will be | ||
| * copied as well. This means that, depending on the method used to read | ||
| * the string, such as {@link MemorySegment#getString(long)}, the string | ||
| * will appear truncated when read again. | ||
| * will appear truncated when read again. The string can be read without | ||
| * truncation using {@link #getString(long, Charset, long)}. | ||
| * | ||
| * @param offset offset in bytes (relative to this segment address) at which this | ||
| * access operation will occur, the final address of this write | ||
|
|
@@ -2606,6 +2636,50 @@ static void copy(Object srcArray, int srcIndex, | |
| elementCount); | ||
| } | ||
|
|
||
| /** | ||
| * Copies the byte sequence of the given string encoded using the provided charset | ||
| * to the destination segment. | ||
| * <p> | ||
| * This method always replaces malformed-input and unmappable-character | ||
| * sequences with this charset's default replacement string. The {@link | ||
| * java.nio.charset.CharsetDecoder} class should be used when more control | ||
| * over the decoding process is required. | ||
| * <p> | ||
| * If the given string contains any {@code '\0'} characters, they will be | ||
| * copied as well. This means that, depending on the method used to read | ||
| * the string, such as {@link MemorySegment#getString(long)}, the string | ||
| * will appear truncated when read again. The string can be read without | ||
| * truncation using {@link #getString(long, Charset, long)}. | ||
| * | ||
| * @param src the Java string to be written into the destination segment | ||
| * @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode} | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure we have a dependency on the charset being standard?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We do not, thanks, fixed. Although I think the existing
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note sure I follow -- the method you mention says this: What do you mean by "undocumented dependency"?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, you're right, it is documented. It's documented differently than e.g. |
||
| * the string bytes. | ||
| * @param srcIndex the starting character index of the source string | ||
| * @param dst the destination segment | ||
| * @param dstOffset the starting offset, in bytes, of the destination segment | ||
| * @param numChars the number of characters to be copied | ||
| * @throws IllegalStateException if the {@linkplain #scope() scope} associated with | ||
| * {@code dst} is not {@linkplain Scope#isAlive() alive} | ||
| * @throws WrongThreadException if this method is called from a thread {@code T}, | ||
| * such that {@code dst.isAccessibleBy(T) == false} | ||
| * @throws IndexOutOfBoundsException if either {@code srcIndex}, {@code numChars}, or {@code dstOffset} | ||
| * are {@code < 0} | ||
| * @throws IndexOutOfBoundsException if {@code srcIndex > src.length() - numChars} | ||
| * @throws IllegalArgumentException if {@code dst} is {@linkplain #isReadOnly() read-only} | ||
| * @throws IndexOutOfBoundsException if {@code dstOffset > dstSegment.byteSize() - B} where {@code B} is the size, | ||
| * in bytes, of the substring of {@code src} encoded using the given charset | ||
| * @return the number of copied bytes. | ||
| */ | ||
| @ForceInline | ||
| static long copy(String src, Charset dstEncoding, int srcIndex, MemorySegment dst, long dstOffset, int numChars) { | ||
| Objects.requireNonNull(src); | ||
| Objects.requireNonNull(dstEncoding); | ||
| Objects.requireNonNull(dst); | ||
| Objects.checkFromIndexSize(srcIndex, numChars, src.length()); | ||
|
|
||
| return AbstractMemorySegmentImpl.copy(src, dstEncoding, srcIndex, dst, dstOffset, numChars); | ||
| } | ||
|
|
||
| /** | ||
| * Finds and returns the relative offset, in bytes, of the first mismatch between the | ||
| * source and the destination segments. More specifically, the bytes at offset | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -111,7 +111,8 @@ default MemorySegment allocateFrom(String str) { | |
| * If the given string contains any {@code '\0'} characters, they will be | ||
| * copied as well. This means that, depending on the method used to read | ||
| * the string, such as {@link MemorySegment#getString(long)}, the string | ||
| * will appear truncated when read again. | ||
| * will appear truncated when read again. The string can be read without | ||
| * truncation using {@link MemorySegment#getString(long, Charset, long)}. | ||
| * | ||
| * @param str the Java string to be converted into a C string | ||
| * @param charset the charset used to {@linkplain Charset#newEncoder() encode} the | ||
|
|
@@ -137,10 +138,10 @@ default MemorySegment allocateFrom(String str, Charset charset) { | |
| int termCharSize = StringSupport.CharsetKind.of(charset).terminatorCharSize(); | ||
| MemorySegment segment; | ||
| int length; | ||
| if (StringSupport.bytesCompatible(str, charset)) { | ||
| if (StringSupport.bytesCompatible(str, charset, 0, str.length())) { | ||
| length = str.length(); | ||
| segment = allocateNoInit((long) length + termCharSize); | ||
| StringSupport.copyToSegmentRaw(str, segment, 0); | ||
| StringSupport.copyToSegmentRaw(str, segment, 0, 0, str.length()); | ||
| } else { | ||
| byte[] bytes = str.getBytes(charset); | ||
| length = bytes.length; | ||
|
|
@@ -153,6 +154,53 @@ default MemorySegment allocateFrom(String str, Charset charset) { | |
| return segment; | ||
| } | ||
|
|
||
| /** | ||
| * Encodes a Java string using the provided charset and stores the resulting | ||
| * byte array into a memory segment. | ||
| * <p> | ||
| * This method always replaces malformed-input and unmappable-character | ||
| * sequences with this charset's default replacement byte array. The | ||
| * {@link java.nio.charset.CharsetEncoder} class should be used when more | ||
| * control over the encoding process is required. | ||
| * <p> | ||
| * If the given string contains any {@code '\0'} characters, they will be | ||
| * copied as well. This means that, depending on the method used to read | ||
| * the string, such as {@link MemorySegment#getString(long)}, the string | ||
| * will appear truncated when read again. The string can be read without | ||
| * truncation using {@link MemorySegment#getString(long, Charset, long)}. | ||
| * | ||
| * @param str the Java string to be encoded | ||
| * @param charset the charset used to {@linkplain Charset#newEncoder() encode} the | ||
| * string bytes | ||
| * @param srcIndex the starting index of the source string | ||
| * @param numChars the number of characters to be copied | ||
| * @return a new native segment containing the encoded string | ||
| * @throws IndexOutOfBoundsException if either {@code srcIndex} or {@code numChars} are {@code < 0} | ||
| * @throws IndexOutOfBoundsException if {@code srcIndex > str.length() - numChars} | ||
| * | ||
| * @implSpec The default implementation for this method copies the contents of the | ||
| * provided Java string into a new memory segment obtained by calling | ||
| * {@code this.allocate(B)}, where {@code B} is the size, in bytes, of | ||
| * the string encoded using the provided charset | ||
| * (e.g. {@code str.getBytes(charset).length}); | ||
| */ | ||
| @ForceInline | ||
| default MemorySegment allocateFrom(String str, Charset charset, int srcIndex, int numChars) { | ||
| Objects.requireNonNull(charset); | ||
| Objects.requireNonNull(str); | ||
| Objects.checkFromIndexSize(srcIndex, numChars, str.length()); | ||
| MemorySegment segment; | ||
| if (StringSupport.bytesCompatible(str, charset, srcIndex, numChars)) { | ||
| segment = allocateNoInit(numChars); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This also seems to rely on the fact that we end up here only for latin1 strings. Again, I don't think this is correct, but if it's deliberate, we should add an assertion check.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. I think we make a similar assumption in the existing |
||
| StringSupport.copyToSegmentRaw(str, segment, 0, srcIndex, numChars); | ||
| } else { | ||
| byte[] bytes = str.substring(srcIndex, srcIndex + numChars).getBytes(charset); | ||
| segment = allocateNoInit(bytes.length); | ||
| MemorySegment.copy(bytes, 0, segment, ValueLayout.JAVA_BYTE, 0, bytes.length); | ||
| } | ||
| return segment; | ||
| } | ||
|
|
||
| /** | ||
| * {@return a new memory segment initialized with the provided byte value} | ||
| * <p> | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprisingly here we don't do anything for the case where the string is UTF16 and the target charset is also UTF16?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UTF‑16
Charsets disallow unpaired surrogates, which JavaStrings allow.So this can only return
truefor UTF‑16 when the platform and charset endianness match and theStringdoesn’t have any unpaired surrogates.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, tricky stuff -- we'll need to think more before changing this