Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8311906: Improve robustness of String constructors with mutable array inputs #16425

Closed

Conversation

RogerRiggs
Copy link
Contributor

@RogerRiggs RogerRiggs commented Oct 30, 2023

Strings, after construction, are immutable but may be constructed from mutable arrays of bytes, characters, or integers.
The string constructors should guard against the effects of mutating the arrays during construction that might invalidate internal invariants for the correct behavior of operations on the resulting strings. In particular, a number of operations have optimizations for operations on pairs of latin1 strings and pairs of non-latin1 strings, while operations between latin1 and non-latin1 strings use a more general implementation.

The changes include:

  • Adding a warning to each constructor with an array as an argument to indicate that the results are indeterminate
    if the input array is modified before the constructor returns.
    The resulting string may contain any combination of characters sampled from the input array.

  • Ensure that strings that are represented as non-latin1 contain at least one non-latin1 character.
    For latin1 inputs, whether the arrays contain ASCII, ISO-8859-1, UTF8, or another encoding decoded to latin1 the scanning and compression is unchanged.
    If a non-latin1 character is found, the string is represented as non-latin1 with the added verification that a non-latin1 character is present at the same index.
    If that character is found to be latin1, then the input array has been modified and the result of the scan may be incorrect.
    Though a ConcurrentModificationException could be thrown, the risk to an existing application of an unexpected exception should be avoided.
    Instead, the non-latin1 copy of the input is re-scanned and compressed; that scan determines whether the latin1 or the non-latin1 representation is returned.

  • The methods that scan for non-latin1 characters and their intrinsic implementations are updated to return the index of the non-latin1 character.

  • String construction from StringBuilder and CharSequence must also be guarded as their contents may be modified during construction.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change requires CSR request JDK-8319228 to be approved

Issues

  • JDK-8311906: Improve robustness of String constructors with mutable array inputs (Bug - P4)
  • JDK-8319228: Improve robustness of String constructors with mutable array inputs (CSR)

Reviewers

Contributors

  • Damon Fenacci <dfenacci@openjdk.org>
  • Claes Redestad <redestad@openjdk.org>
  • Amit Kumar <amitkumar@openjdk.org>
  • Martin Doerr <mdoerr@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16425/head:pull/16425
$ git checkout pull/16425

Update a local copy of the PR:
$ git checkout pull/16425
$ git pull https://git.openjdk.org/jdk.git pull/16425/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16425

View PR using the GUI difftool:
$ git pr show -t 16425

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16425.diff

Webrev

Link to Webrev Comment

… arguments

Ensure that mutations of input arguments to constructors can not invalidate the latin1 vs non-latin1 coding.

8313982: Non-conforming CharSequence implementation can break StringBuilder
Increase AbstractStringBuilder/StringBuilder robustness when appending a CharSequence
that may throw exceptions while returning characters from the sequence.
@RogerRiggs
Copy link
Contributor Author

/csr

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 30, 2023

👋 Welcome back rriggs! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Oct 30, 2023
@openjdk
Copy link

openjdk bot commented Oct 30, 2023

@RogerRiggs has indicated that a compatibility and specification (CSR) request is needed for this pull request.

@RogerRiggs this pull request must refer to an issue in JBS to be able to link it to a CSR request. To refer this pull request to an issue in JBS, please update the title of this pull request to just the issue ID.

@openjdk
Copy link

openjdk bot commented Oct 30, 2023

@RogerRiggs The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Oct 30, 2023
…mpress

to return the index of the first non-latin1 character or the length if none.
Not yet built or tested on RiscV.
The test StringRacyConstructor verifies the behavior of the compress intrinsic.

Updated comments on intrinsic implementations of char_array_compress to reflect
the change in the behavior of StringUTF16.compress.

Damon Fenacci (@damon_fenacci) authored the changes to macroAssembler_aarch64.cpp
and macroAssembler_aarch64.cpp.
@RogerRiggs RogerRiggs changed the title Draft: 8311906: Improve robustness of String constructors with mutable array inputs 8311906: Improve robustness of String constructors with mutable array inputs Nov 1, 2023
@RogerRiggs RogerRiggs marked this pull request as ready for review November 1, 2023 14:25
*/
@ForceInline
public static byte[] compress(final char[] val, final int off, final int count) {
byte[] latin1 = new byte[count];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this redundant array allocation be costly if we are working with mostly-utf16 strings, such as CJK strings with no latin characters?

I suggest we can use a heuristic to read the initial char; if it's utf16 then we skip the latin-1 process altogether (and we can assign the utf16 value to the initial index to ensure it's non-latin-1 compressible.

Copy link
Contributor Author

@RogerRiggs RogerRiggs Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can reconsider this design as a separate PR.
Every additional check has a performance impact and in this bug the goal is to avoid any regression.

We'll need to gain some insight into the distribution of strings when used in a non-latin1 application.
How many of the strings are latin1 vs non-latin1, what is the distribution of string lengths and which APIs are in use in the applications. The implementation is already pretty good about working with strings of different coders
but there may be some different choices when converting between char arrays and int arrays and strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, how does benchmark StringConstructor.newStringFromCharsMixedBegin change before and after this patch? If we can see how much of an impact this has on CJK strings it would be appreciated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, how does benchmark StringConstructor.newStringFromCharsMixedBegin change before and after this patch? If we can see how much of an impact this has on CJK strings it would be appreciated.

You may have better insights from doing your own runs on your own systems.
Here's a sample from a recent run. Mostly small improvements and a few small negatives.

Benchmark                       Linux aarch64 | Linux x64 | MacOSX aarch64 | MacOSX x64

newStringFromCharsMixedBegin-size:64 |  1.22% |  0.91%    | -0.81%         |  1.08%
newStringFromCharsMixedBegin-size:7  |  4.40% |  0.26%    | -1.88%         |  4.60%

src/java.base/share/classes/java/lang/StringUTF16.java Outdated Show resolved Hide resolved
@jaikiran
Copy link
Member

jaikiran commented Nov 5, 2023

Hello Roger, it looks like there are some whitespace related issues in the changes which jcheck has caught https://github.com/openjdk/jdk/pull/16425/checks?check_run_id=18357062638 and thus hasn't created a RFR for this yet.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 6, 2023
@mlbridge
Copy link

mlbridge bot commented Nov 6, 2023

@dafedafe
Copy link
Contributor

dafedafe commented Nov 7, 2023

This PR includes changes to the C2_MacroAssembler::char_array_compress_v intrinsic for RISCV that we couldn't test. @RealFYang could you please test and review it? Thanks a lot!

@RogerRiggs
Copy link
Contributor Author

/author set dfenacci

Damon Fenacci authored the updates to the hotspot intrinsics.

@openjdk
Copy link

openjdk bot commented Nov 7, 2023

@RogerRiggs
Setting overriding author to Damon Fenacci <dfenacci@openjdk.org>. When this pull request is integrated, the overriding author will be used in the commit.

@RogerRiggs
Copy link
Contributor Author

/author set rriggs
/contributor add dfenacci

@openjdk
Copy link

openjdk bot commented Nov 7, 2023

@RogerRiggs
Setting overriding author to Roger Riggs <rriggs@openjdk.org>. When this pull request is integrated, the overriding author will be used in the commit.

@openjdk
Copy link

openjdk bot commented Nov 7, 2023

@RogerRiggs
Contributor Damon Fenacci <dfenacci@openjdk.org> successfully added.

@RogerRiggs
Copy link
Contributor Author

/author remove

Remove the override.

@openjdk
Copy link

openjdk bot commented Nov 7, 2023

@RogerRiggs
Overriding author Roger Riggs <rriggs@openjdk.org> was successfully removed. When this pull request is integrated, the pull request author will be used as the author of the commit.

@dafedafe
Copy link
Contributor

dafedafe commented Nov 8, 2023

It would also be good to have the aarch64 and x64 intrinsics properly reviewed. @theRealAph could I ask you to have a look at aarch64 and @TobiHartmann at x64 please? (you seem to be the last ones that made major changes in the intrinsics)

Copy link
Contributor

@rgiulietti rgiulietti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First take ;-)
More will follow.

src/java.base/share/classes/java/lang/StringLatin1.java Outdated Show resolved Hide resolved
src/java.base/share/classes/java/lang/StringUTF16.java Outdated Show resolved Hide resolved
src/java.base/share/classes/java/lang/StringUTF16.java Outdated Show resolved Hide resolved
src/java.base/share/classes/java/lang/StringUTF16.java Outdated Show resolved Hide resolved
src/java.base/share/classes/java/lang/StringUTF16.java Outdated Show resolved Hide resolved
src/java.base/share/classes/java/lang/StringUTF16.java Outdated Show resolved Hide resolved
src/java.base/share/classes/java/lang/String.java Outdated Show resolved Hide resolved
src/java.base/share/classes/java/lang/StringUTF16.java Outdated Show resolved Hide resolved
Extra checking on maximum string lengths when calling toBytes().
@openjdk openjdk bot added ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 16, 2023
@openjdk openjdk bot added csr Pull request needs approved CSR before integration and removed ready Pull request is ready to be integrated labels Nov 17, 2023
…ingBuilder, and

Appendable methods in the case where the input arguments are modified during
construction or a StringBuilder or Appendable method call.
Copy link
Contributor

@rgiulietti rgiulietti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Maybe use StringUTF16.coderFromArrayLen() where suggested in the comments before integrating.

src/java.base/share/classes/java/lang/String.java Outdated Show resolved Hide resolved
src/java.base/share/classes/java/lang/String.java Outdated Show resolved Hide resolved
src/hotspot/cpu/x86/macroAssembler_x86.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/x86/macroAssembler_x86.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/x86/macroAssembler_x86.cpp Outdated Show resolved Hide resolved
Verified by manual tests with "-XX:AVX3Threshold=0"
And test in the PR test/hotspot/jtreg/compiler/intrinsics/string/TestStringConstructionIntrinsics.java
@openjdk openjdk bot added ready Pull request is ready to be integrated and removed csr Pull request needs approved CSR before integration labels Nov 27, 2023
Copy link
Contributor

@rgiulietti rgiulietti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Java code, both src and test, looks good to me.

Copy link
Contributor

@dafedafe dafedafe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intrinsics look OK to me.

@RogerRiggs
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Dec 4, 2023

Going to push as commit 155abc5.
Since your change was applied there have been 456 commits pushed to the master branch:

  • 316b783: 8321276: runtime/cds/appcds/dynamicArchive/DynamicSharedSymbols.java failed with "'17 2: jdk/test/lib/apps ' missing from stdout/stderr"
  • 65be5e0: 8305931: jdk/jfr/jcmd/TestJcmdDumpPathToGCRoots.java failed with "Expected chains but found none"
  • f6be922: 8316193: jdk/jfr/event/oldobject/TestListenerLeak.java java.lang.Exception: Could not find leak
  • a9de5c7: 8315128: jdk/jfr/event/runtime/TestResidentSetSizeEvent.java fails with "The size should be less than or equal to peak"
  • d2c529c: 8319072: JFR: Turn off events for JFR.view
  • d5f59cf: 8321220: JFR: RecordedClass reports incorrect modifiers
  • 9769dfe: 8321214: Parallel: Remove unused SpaceInfo::_min_dense_prefix
  • d23f4f1: 8315559: Delay TempSymbol cleanup to avoid symbol table churn
  • c17b8cf: 8320655: awt screencast robot spin and sync issues with native libpipewire api
  • ed5b8c3: 8225220: When the Tab Policy is checked,the scroll button direction displayed incorrectly.
  • ... and 446 more: https://git.openjdk.org/jdk/compare/4679e9aa00c098cff715fb4deeb4d736e1063571...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Dec 4, 2023
@openjdk openjdk bot closed this Dec 4, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Dec 4, 2023
@openjdk
Copy link

openjdk bot commented Dec 4, 2023

@RogerRiggs Pushed as commit 155abc5.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated