Closed
15 changes: 8 additions & 7 deletions src/java.base/share/classes/java/lang/StringUTF16.java
@@ -1,5 +1,6 @@
/*
* Copyright (c) 2015, 2025, Oracle and/or its affiliates. All rights reserved.
+* Copyright (c) 2025, Alibaba Group Holding Limited. All Rights Reserved.
* DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
*
* This code is free software; you can redistribute it and/or modify it
@@ -1312,12 +1313,6 @@ static Stream<String> lines(byte[] value) {
return StreamSupport.stream(LinesSpliterator.spliterator(value), false);
}

-private static void putChars(byte[] val, int index, char[] str, int off, int end) {
-while (off < end) {
-putChar(val, index++, str[off++]);
-}
-}
-
public static String newString(byte[] val, int index, int len) {
if (len == 0) {
return "";
@@ -1486,7 +1481,13 @@ public static void putCharSB(byte[] val, int index, int c) {

public static void putCharsSB(byte[] val, int index, char[] ca, int off, int end) {
checkBoundsBeginEnd(index, index + end - off, val);
-putChars(val, index, ca, off, end);
+String.checkBoundsBeginEnd(off, end, ca.length);
+Unsafe.getUnsafe().copyMemory(
+ca,
+Unsafe.ARRAY_CHAR_BASE_OFFSET + ((long) off << 1),
Contributor

Suggested change
-Unsafe.ARRAY_CHAR_BASE_OFFSET + ((long) off << 1),
+Unsafe.ARRAY_CHAR_BASE_OFFSET + (long) off * Unsafe.ARRAY_CHAR_INDEX_SCALE,

+val,
+Unsafe.ARRAY_BYTE_BASE_OFFSET + ((long) index << 1),
Contributor

Suggested change
-Unsafe.ARRAY_BYTE_BASE_OFFSET + ((long) index << 1),
+Unsafe.ARRAY_BYTE_BASE_OFFSET + ((long) index << 1) * Unsafe.ARRAY_BYTE_INDEX_SCALE,

Contributor Author

@wenshao Jul 27, 2025

If we want to use ARRAY_CHAR_INDEX_SCALE, it should be used as follows:

        Unsafe.getUnsafe().copyMemory(
                ca,
                Unsafe.ARRAY_CHAR_BASE_OFFSET + (long) off * Unsafe.ARRAY_CHAR_INDEX_SCALE,
                val,
                Unsafe.ARRAY_CHAR_BASE_OFFSET + (long) index * Unsafe.ARRAY_CHAR_INDEX_SCALE,
                (long) (end - off) * Unsafe.ARRAY_CHAR_INDEX_SCALE);

Contributor Author

@wenshao Jul 28, 2025

I would prefer to compute an ARRAY_CHAR_SHIFT constant, as ShortVector does:

static final int ARRAY_CHAR_SHIFT 
                   = 31 - Integer.numberOfLeadingZeros(Unsafe.ARRAY_CHAR_INDEX_SCALE);

        Unsafe.getUnsafe().copyMemory(
                ca,
                Unsafe.ARRAY_CHAR_BASE_OFFSET + ((long) off << ARRAY_CHAR_SHIFT),
                val,
                Unsafe.ARRAY_CHAR_BASE_OFFSET + ((long) index << ARRAY_CHAR_SHIFT),
                (long) (end - off) << ARRAY_CHAR_SHIFT);
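
Editorial sketch, not part of the PR: the shift-based variants rely on two things, that the index scale is a power of two and that the shifted index is parenthesized, because in Java `+` binds tighter than `<<`. The `BASE` and `SCALE` values below are hypothetical stand-ins for `Unsafe.ARRAY_CHAR_BASE_OFFSET` and `Unsafe.ARRAY_CHAR_INDEX_SCALE`; the real values are JVM-specific.

```java
public class ShiftFromScale {
    // Hypothetical stand-ins for the Unsafe constants (assumed values).
    static final long BASE = 16;
    static final int SCALE = 2;

    // log2 of a power-of-two scale, the same derivation as in the comment above.
    static final int SHIFT = 31 - Integer.numberOfLeadingZeros(SCALE);

    // Reference form: base + index * scale.
    static long offsetByMultiply(int index) {
        return BASE + (long) index * SCALE;
    }

    // Shift form with explicit parentheses around the shifted index.
    static long offsetByShift(int index) {
        return BASE + ((long) index << SHIFT);
    }

    // Without parentheses this parses as (BASE + index) << SHIFT.
    static long offsetUnparenthesized(int index) {
        return BASE + (long) index << SHIFT;
    }

    public static void main(String[] args) {
        int index = 5;
        System.out.println(offsetByMultiply(index));      // 26
        System.out.println(offsetByShift(index));         // 26
        System.out.println(offsetUnparenthesized(index)); // 42
    }
}
```

The multiply form needs no extra parentheses because `*` binds tighter than `+`; only the shift form does.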

Contributor Author

@wenshao Jul 28, 2025

String uses << coder in many places, so I think the following form is also fine:

        Unsafe.getUnsafe().copyMemory(
                ca,
                Unsafe.ARRAY_CHAR_BASE_OFFSET + ((long) off << String.UTF16),
                val,
                Unsafe.ARRAY_CHAR_BASE_OFFSET + ((long) index << String.UTF16),
                (long) (end - off) << String.UTF16);

Contributor

I agree that we can expect arrays to be laid out as a contiguous chunk of memory with the intuitively expected element size.
However, AFAIK this is not specified anywhere in the JVMS, although it is tacitly assumed in many low-level parts of the codebase. In that sense, I'm fine with your code.

Contributor Author

There are many places in the String class that use << 1 and >> 1 to handle the length of UTF16 byte[], so is it okay to use << 1 directly in the current version of the code?

Contributor

For consistency, I prefer the explicit constant shift << 1 and >> 1.
Using the String.UTF16 symbol makes the code more verbose.

+(long) (end - off) << 1);
Contributor

The documentation of copyMemory() is not entirely clear about endianness.
But it seems to imply that in this case it behaves as if it were copying shorts, so endianness appears to be preserved.

The invocation of copyMemory() here implicitly assumes that ARRAY_CHAR_INDEX_SCALE and ARRAY_BYTE_INDEX_SCALE are 2 and 1, respectively, which seems quite reasonable but is not set in stone.

Member

I recall that the runtime requires a UTF16 byte array and a char array to have exactly the same layout. It would be nice to keep this in the design notes for the string implementation classes, for example in the class header.

(Useful notes could include that indices are char-based, that UTF16 byte[] and char[] have identical layout, etc.)

Contributor

The StringUTF16.getChar and putChar methods are carefully written to use the platform endianness to compose and decompose char values from and to byte[] in terms of shifts of the lower and upper bytes.
The mapping of that into other APIs that try to optimize between char[] and the compact string byte[] is less well documented.
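
Editorial sketch, not part of the PR: the equivalence being discussed can be checked with supported APIs only. The `putChar` below is modeled on the shift-based approach described above (the shift constants are an assumption about how it selects byte order, not a copy of the JDK source), and the bulk path uses a native-order `CharBuffer` view as a stand-in for `copyMemory`, which is not accessible outside `java.base`. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.util.Arrays;

public class NativeOrderCopy {
    static final boolean BIG_ENDIAN = ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN;
    // Which byte of the char lands at the lower address depends on the platform.
    static final int HI_BYTE_SHIFT = BIG_ENDIAN ? 8 : 0;
    static final int LO_BYTE_SHIFT = BIG_ENDIAN ? 0 : 8;

    // Writes one char at a char-based index using shifts, in platform byte order.
    static void putChar(byte[] val, int index, int c) {
        int i = index << 1; // char index -> byte index
        val[i] = (byte) (c >> HI_BYTE_SHIFT);
        val[i + 1] = (byte) (c >> LO_BYTE_SHIFT);
    }

    // Element-by-element copy, like the removed putChars loop.
    static void copyWithLoop(byte[] val, int index, char[] ca, int off, int end) {
        while (off < end) {
            putChar(val, index++, ca[off++]);
        }
    }

    // Bulk copy through a native-order char view of the byte[].
    static void copyBulk(byte[] val, int index, char[] ca, int off, int end) {
        CharBuffer view = ByteBuffer.wrap(val).order(ByteOrder.nativeOrder()).asCharBuffer();
        view.position(index);
        view.put(ca, off, end - off);
    }

    public static void main(String[] args) {
        char[] ca = "a\uFF11\uFF16z".toCharArray();
        byte[] viaLoop = new byte[16];
        byte[] viaBulk = new byte[16];
        copyWithLoop(viaLoop, 2, ca, 1, 4);
        copyBulk(viaBulk, 2, ca, 1, 4);
        System.out.println(Arrays.equals(viaLoop, viaBulk)); // true
    }
}
```

Both paths agree on either endianness because each one writes chars in native byte order; that is the property a raw memory copy from char[] into a UTF16 byte[] depends on.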

Member

I've found the code that imposes this requirement:

// This intrinsic accesses byte[] array as char[] array. Computing the offsets
// correctly requires matched array shapes.
assert (arrayOopDesc::base_offset_in_bytes(T_CHAR) == arrayOopDesc::base_offset_in_bytes(T_BYTE),
"sanity: byte[] and char[] bases agree");
assert (type2aelembytes(T_CHAR) == type2aelembytes(T_BYTE)*2,
"sanity: byte[] and char[] scales agree");

}

public static void putCharsSB(byte[] val, int index, CharSequence s, int off, int end) {
27 changes: 26 additions & 1 deletion test/micro/org/openjdk/bench/java/lang/StringBuilders.java
@@ -1,5 +1,6 @@
/*
-* Copyright (c) 2014, 2024, Oracle and/or its affiliates. All rights reserved.
+* Copyright (c) 2014, 2025, Oracle and/or its affiliates. All rights reserved.
+* Copyright (c) 2025, Alibaba Group Holding Limited. All Rights Reserved.
* DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
*
* This code is free software; you can redistribute it and/or modify it
@@ -50,6 +51,7 @@ public class StringBuilders {
private String[] str16p8p7;
private String[] str3p9p8;
private String[] str22p40p31;
+private char[][] charArray22p40p31;
private StringBuilder sbLatin1;
private StringBuilder sbLatin2;
private StringBuilder sbUtf16;
@@ -63,10 +65,15 @@ public void setup() {
"advise", "you", "to", "drive", "at", "top", "speed", "it'll",
"be", "a", "god", "damn", "miracle", "if", "we", "can", "get",
"there", "before", "you", "turn", "into", "a", "wild", "animal."};

str3p4p2 = new String[]{"123", "1234", "12"};
str16p8p7 = new String[]{"1234567890123456", "12345678", "1234567"};
str3p9p8 = new String[]{"123", "123456789", "12345678"};
str22p40p31 = new String[]{"1234567890123456789012", "1234567890123456789012345678901234567890", "1234567890123456789012345678901"};
+charArray22p40p31 = new char[str22p40p31.length][];
+for (int i = 0; i < str22p40p31.length; i++) {
+charArray22p40p31[i] = str22p40p31[i].toCharArray();
+}
sbLatin1 = new StringBuilder("Latin1 string");
sbLatin2 = new StringBuilder("Latin1 string");
sbUtf16 = new StringBuilder("UTF-\uFF11\uFF16 string");
@@ -273,6 +280,24 @@ public int appendWithLongUtf16() {
return buf.length();
}

+@Benchmark
+public int appendWithCharArrayLatin1() {
+StringBuilder buf = new StringBuilder();
+for (char[] charArray : charArray22p40p31) {
+buf.append(charArray);
+}
+return buf.length();
+}
+
+@Benchmark
+public int appendWithCharArrayUTF16() {
+StringBuilder buf = new StringBuilder("\uFF11");
+for (char[] charArray : charArray22p40p31) {
+buf.append(charArray);
+}
+return buf.length();
+}
+
@Benchmark
public String toStringCharWithBool8() {
StringBuilder result = new StringBuilder();