Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8261731: shallow copy the internal buffer of a scalar-replaced java.lang.String object #2570

Open
wants to merge 45 commits into
base: master
Choose a base branch
from

Conversation

@navyxliu
Copy link
Contributor

@navyxliu navyxliu commented Feb 15, 2021

There are 3 nodes involving in the construction of a java.lang.String object.

  1. Allocate of itself, aka. alloc
  2. AllocateArray of a byte array, which is value:byte[], aka. aa
  3. ArrayCopyNode which copys in the contents of value, aka. ac

Lemma
When a String object alloc is scalar replaced, C2 can eliminate aa and ac.

Because alloc is scalar replaced, it must be non-escaped. The field value:byte[] of j.l.String cannot be seen by external world, therefore it must not be global escaped. Because the buffer is marked as stable, it is safe to assume its contents are whatever ac copies in. Because all public java.lang.String constructors clone the incoming array, the source of ac is stable as well.

It is possible to rewire aa to the source of ac with the correct offset. That is to say, we can replace both aa and ac with a “shallow copy” of the source of ac. It’s safe if C2 keeps a reference of the source oop for all safepoints.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8261731: shallow copy the internal buffer of a scalar-replaced java.lang.String object

Download

To checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/2570/head:pull/2570
$ git checkout pull/2570

To update a local copy of the PR:
$ git checkout pull/2570
$ git pull https://git.openjdk.java.net/jdk pull/2570/head

Xin Liu added 30 commits Dec 24, 2020
It's a set of methods which participate escape analysis computation.
The results will be written to LateInlineAfterEACallGenerator. the methods
inline lately after EA but before MacroElimination

Currently, only java.lang.String.substring(int,int) belongs to afterea_late_inlines
This patch also dumps CallStaticJava. OptimizeSubstring is introduced to
control if String.substring is late inlined or not.
we need to check 3 nodes to draw the conclusion, we check them bottom-up.
1. Allocate node for the object of j.l.String.
2. AllocateArray node for the byte buffer of 1
3. ArrayCopy node for 1 and 2

ArrayCopy type must be [B and it's tightly coupled with AllocateArray.
if AllocateArray escapes, we check its Allocate Object see if its type is KlassPtr
and klass is j.l.String.

we determine it is the temporary String pattern if the escapement of the associated
JavaObject(either 1 or 2) is NoEscape.
Add a flag _suppress_cr to outputStream. outstream objects won't emit any CR if it's set.
Correct TypeInstPtr::dump2 to make sure it only emits klass name once.
Remove the comment because Klass::oop_print_on() has emitted the address of oop.

Before:
689  ConP  ===  0  [[ 821 ]]   Oop:java/lang/Stringjava.lang.String
{0x000000010159d7c8} - klass: public final synchronized 'java/lang/String'
 - string: "a"
 :Constant:exact *

After:
689  ConP  ===  0  [[ 821 ]]   Oop:java.lang.String {0x000000010159d7c8} - klass: public final synchronized 'java/lang/String' - string: "a":Constant:exact *
cp.fallthrough_memproj may have more than one use. they must be either
MemNode(Load/Store) or MergeMemNode.
cp.fallthrough_memproj may have an use of PhiNode or SafePointNode.
we rely on igvn after EA to identify Temporary Array pattern.
It will allow code such as s = "hello".substring(1) to trigger TempArray opto.

TODO: I actually haven't finished it. I am hesitating to rewire the pointer to the constant string
I am not sure where the pointer points to and what's the lifecycle of pointee.
Src could be a CheckCastPP.
uses of projmem of ArrayCopy could have duplicates nodes. this should only happen for MergeMem
it's also possible that use is a ReturnNode.
reformat the diagram of transformation
timelines:
1. EA
2. IGVN
3. ME: eliminate_strcpy_node
4. ME: eliminate_allocate_node
…locate and AllocateArray

For java.lang.String, G1GC needs to generates a post-write barrier for the field value: byte[].
I observe 2 allocation nodes.
1. Allocate Node for the j.l.String
2. AllocateArrayNode for value:byte[]

When OptimizeTempArray is on, C2 may eliminate both 2 nodes. ie. PhseMacroExpand may invoke
"eliminate_allocate_node" for 2 nodes respectively.

The post-write barrier generated by G1GC refers the results of allocate nodes.
Currently, it doesn't support to invoke `G1BarrierSetC2::eliminate_gc_barrier` for 2 nodes respectively.

please note that I sort all macro nodes. As a result, I guarantee that Allocate Elimination always comes
after AllocationArrayNode. When AllocateNode invokes `PhaseMacroExpand::eliminate_allocate_node`,
it still sees CastP2X.

This patch takes away the sibling CastP2X to prevent the 2nd eliminate_gc_barrier.
Prohibit from applying this optimization if the object itself isn't
scalar replaceble.

because the oop isnot created if we optimize out the AllocateArray, I
just assign a null for the field value:byte[]. need to handle it batter.
that's need to decouple Allocate and AllocateArray.
it is possible that a few AllocateArray/Arraycopy nodes cascade. we
need to find the AllocateNode which is the instance of AllocateArray.
Only eliminate the last one AllocateArray/ArrayCopy.

we make use instance id of TypeOopPtr. if AllocateNode is scalar
replaceable, EA marks it a unique iid.
the 2nd Addp is an AddP Node which shares the same base with other AddPs.
Only update its Base to new source node because the Address Node will
be handled anyway.
This make sure we only eliminate AllocateArray that we can handle.
afterea_late_inlines bucket is not useful. revert it and its relevant changes
@simonis
Copy link
Member

@simonis simonis commented Feb 17, 2021

What about GC? What happens if the original String isn't reachable any more? Do you put a reference to the byte array into the corresponding oop maps to make sure it can't be garbage collected?

@navyxliu
Copy link
Contributor Author

@navyxliu navyxliu commented Feb 22, 2021

What about GC? What happens if the original String isn't reachable any more? Do you put a reference to the byte array into the corresponding oop maps to make sure it can't be garbage collected?

hi, Volker,

I don't know OopMap very much. IMHO, it should be discovered by GC because I put the byte array oop to jvm->locals
I do that for all safepoints nodes.

          // 1. src oop
          sfpt->add_req(ac->in(ArrayCopyNode::Src));

https://github.com/openjdk/jdk/pull/2570/files#diff-2faebd05d08f9115f8d9ef771644cf05087a6986c2f9013d7163c6aa720169c3R995

Xin Liu added 2 commits Mar 2, 2021
…:value.

before we figure out how to determine the byte array is stable, we only accept j.l.String::value
for the time being because the field is annotated stable.
It is possible that an AddP node which only has second AddP Nodes as uses.
In this case, we cannot eliminate its uses. replace its base and address
with the source of ac.

It is possible that there is a ArrayCopy but res is not its source.
treat it as a CallStaticJavaNode.
@simonis
Copy link
Member

@simonis simonis commented Mar 3, 2021

Hi Xin,
when building a fastdebug build of your latest changes I got a crash because of an assertion in your new code.
I've attached the hs_err and replay file. After restarting the build it succeeded so there must have been a special situation. Hopefully the replay file can help you to reproduce the error.

Best regards,
Volker

hs_err_pid18140.log
replay_pid18140.log

@navyxliu
Copy link
Contributor Author

@navyxliu navyxliu commented Mar 3, 2021

Hi Xin,
when building a fastdebug build of your latest changes I got a crash because of an assertion in your new code.
I've attached the hs_err and replay file. After restarting the build it succeeded so there must have been a special situation. Hopefully the replay file can help you to reproduce the error.

Best regards,
Volker

hs_err_pid18140.log
replay_pid18140.log

hi, Volker,
thank you for using this patch.
Yes, indeed, it's a bug. I introduced yesterday, I got a silly typo for the AddP nodes. I need to use else if instead of else.
in your replay file, optimization met a special node StrEquals. I have handled it above.

diff --git a/src/hotspot/share/opto/macro.cpp b/src/hotspot/share/opto/macro.cpp
index 56dab5c6449..76e97087fd9 100644
--- a/src/hotspot/share/opto/macro.cpp
+++ b/src/hotspot/share/opto/macro.cpp
@@ -1372,7 +1372,7 @@ void PhaseMacroExpand::process_users_of_string_allocation(AllocateArrayNode* all
             if (n->in(3) == use) { // str2
               _igvn.replace_input_of(n, 3, dst_adr);
             }
-          } if (n->is_AddP()) {
+          } else if (n->is_AddP()) {
             // Skip second AddP. This node must be handled by the upper level.
           } else {
             assert(n->Opcode() == Op_LoadUB || n->Opcode() == Op_LoadB, "unknow code shape");

I will fix soon.

Xin Liu
This fixes the crashing case reported on github's PR.
@TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Mar 4, 2021

I didn't look at the code yet but run this through some quick testing. The following tests fail with -ea -esa -XX:CompileThreshold=100:

  • compiler/codecache/stress/RandomAllocationTest.java
  • compiler/codecache/stress/UnexpectedDeoptimizationTest.java
#  Internal Error (open/src/hotspot/share/opto/node.hpp:825), pid=27269, tid=27288
#  assert(is_AllocateArray()) failed: invalid node class: IfTrue

Current CompileTask:
C2:  12403 2987       4       java.lang.Class::descriptorString (160 bytes)

Stack: [0x00007fc080bfc000,0x00007fc080cfd000],  sp=0x00007fc080cf8340,  free space=1008k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x1302c8c]  PhaseMacroExpand::scalar_replacement(AllocateNode*, GrowableArray<SafePointNode*>&)+0x88c
V  [libjvm.so+0x1303e5e]  PhaseMacroExpand::eliminate_allocate_node(AllocateNode*) [clone .part.0]+0x50e
V  [libjvm.so+0x130447c]  PhaseMacroExpand::eliminate_macro_nodes()+0x57c
V  [libjvm.so+0xa0fbb5]  Compile::Optimize()+0x1235
V  [libjvm.so+0xa11c85]  Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, DirectiveSet*)+0x1905
V  [libjvm.so+0x83d69a]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1ea
V  [libjvm.so+0xa21aa1]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xf21
V  [libjvm.so+0xa22748]  CompileBroker::compiler_thread_loop()+0x5a8
V  [libjvm.so+0x18488e1]  JavaThread::thread_main_inner()+0x271
V  [libjvm.so+0x1850b30]  Thread::call_run()+0x100
V  [libjvm.so+0x153b086]  thread_native_entry(Thread*)+0x116
@TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Mar 4, 2021

java/lang/String/Split.java and java/lang/management/MemoryMXBean/LowMemoryTest2.sh fail with:

#  Internal Error (open/src/hotspot/share/opto/macro.cpp:1378), pid=5320, tid=5499
#  assert(n->Opcode() == Op_LoadUB || n->Opcode() == Op_LoadB) failed: unknow code shape

Current CompileTask:
C2:  11623  994   !   4       Split::main (651 bytes)

Stack: [0x00007ff1869fe000,0x00007ff186aff000],  sp=0x00007ff186afa2d0,  free space=1008k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x12f8a5f]  PhaseMacroExpand::process_users_of_string_allocation(AllocateArrayNode*, ArrayCopyNode*)+0x83f
V  [libjvm.so+0x1303e0e]  PhaseMacroExpand::eliminate_allocate_node(AllocateNode*) [clone .part.0]+0x4be
V  [libjvm.so+0x130447c]  PhaseMacroExpand::eliminate_macro_nodes()+0x57c
V  [libjvm.so+0xa0fbb5]  Compile::Optimize()+0x1235
V  [libjvm.so+0xa11c85]  Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, DirectiveSet*)+0x1905
V  [libjvm.so+0x83d69a]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1ea
V  [libjvm.so+0xa21aa1]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xf21
V  [libjvm.so+0xa22748]  CompileBroker::compiler_thread_loop()+0x5a8
V  [libjvm.so+0x18488e1]  JavaThread::thread_main_inner()+0x271
V  [libjvm.so+0x1850b30]  Thread::call_run()+0x100
V  [libjvm.so+0x153b086]  thread_native_entry(Thread*)+0x116

And I see some internal testing failing with SIGSEGV in PhaseMacroExpand::eliminate_allocate_node

@mlbridge
Copy link

@mlbridge mlbridge bot commented Mar 4, 2021

Mailing list message from Volker Simonis on hotspot-compiler-dev:

Hi Tobias,

thanks a lot for looking at this PR. Did you run your tests with Xin's
latest fix which he has posted yesterday [1]?

Just asking because the last assertion ("unknown code shape") was the
same I've also reported yesterday and he claimed that he fixed that
with his last change [1].

Best regards,
Volker

[1] https://github.com//pull/2570/commits/672bf3c1c650211582dc605bb20ad4c9f0ca5479

On Thu, Mar 4, 2021 at 11:06 AM Tobias Hartmann
<thartmann at openjdk.java.net> wrote:

@TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Mar 4, 2021

Yes, I've used the latest version for testing.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Mar 4, 2021

Mailing list message from Volker Simonis on hotspot-compiler-dev:

Thanks for the quick confirmation.

On Thu, Mar 4, 2021 at 11:57 AM Tobias Hartmann
<thartmann at openjdk.java.net> wrote:

Even though field_val is the value field of j.l.String, the code shape has be changed
by StringOpt for the more efficient string concatenantion. The array is generated from
jdk.internal.misc.Unsafe::allocateUninitializedArray instead of AllocateArray node.

This patch checks stricter before genenerating the special SafePointScalarObjectNode node.

This bug is triggered in the following 2 tests with '-ea -esa -XX:CompileThreshold=100':
compiler/codecache/stress/RandomAllocationTest.java
compiler/codecache/stress/UnexpectedDeoptimizationTest.java
@navyxliu
Copy link
Contributor Author

@navyxliu navyxliu commented Mar 8, 2021

Split.java

hi, @TobiHartmann ,
Could you teach me how to trigger crashes for java/lang/String/Split.java and java/lang/management/MemoryMXBean/LowMemoryTest2.sh ?

I tried different options like "'-ea -esa -XX:CompileThreshold=100' with/without -XComp, -XX:-TieredCompilation. still no luck.
or Could you paste a replay file so I can investigate it?

@TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Mar 9, 2021

Sure, here are the replay and hs_err files for the Split.java failure:
replay_pid5320.log
hs_err_pid5320.log

And for the LowMemoryTest2 failure:
replay_pid18829.log
hs_err_pid18829.log

It seems like the failures are very intermittent though :(

There are a few subclasses of StrIntrinsicNode. They may be users of
j.l.String::value, eg StrEqualsNode and StrCompNode etc. This patch
treats them as general nodes.

This patch fixes TestOptimizeSubstring.java failure with -Xcomp.
The test methods were not properly inlined or pruned in presence of -Xcomp.
This patch provides more fine-grained control to make sure those methods are compiled more reliablely.
@navyxliu
Copy link
Contributor Author

@navyxliu navyxliu commented Mar 15, 2021

hi, @TobiHartmann ,
Could you try out the latest version 94626e8? I haven't reproduced the error from the two replay files. My best guess is that both of them crashed because of some special nodes which are subclasses of StrIntrinsicNode. I only handled the case StrEqualsNode before. This patch handles them. It also dumps the unknown node to tty in case it still meets unknown code shape.

@TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Mar 16, 2021

Sure, I'll re-run testing.

After looking through the changes, I'm concerned about all the complexity that is added to handle this special case. Do you have any performance numbers other than microbenchmark results that would justify this?

@TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Mar 16, 2021

The changes to macro.cpp do not build anymore after JDK-8263577 because the signature of Node::replace_edges_in_range changed.

Xin Liu added 2 commits Mar 18, 2021
This patch eliminates AllocateArray in process_users_of_string_allocation(),
so can_eliminate_allocation() and scalar_replacement() don't need any code change.

Because OptimizeTempArray assumes that the string object must be scalar replaced,
we don't need to generate ObjectValue for AllocateArray. all safepoints must be
intercepted at string object's scalar_replacement.
@navyxliu
Copy link
Contributor Author

@navyxliu navyxliu commented Mar 19, 2021

Sure, I'll re-run testing.

After looking through the changes, I'm concerned about all the complexity that is added to handle this special case. Do you have any performance numbers other than microbenchmark results that would justify this?

hi, @TobiHartmann ,
It's not my intention to make a big patch. I would like to get help from reviewers. Am I making thing over complex? Is there an alternative to make thing simpler?

The reason that the patch is big because I meet some constraints in current optimizing compiler.

  1. It seems all oops must be allocated on heap. There's no pseudo oop. or I can’t find a way to create a temp of fake oop.
  2. Scalar_replacement and ArrayCopy idealization don't support variable-length array(VLA). String::value is a VLA.
  3. Lack supports of frozen array in C2's typing system. JDK-8261007
  4. re-allocation doesn't support nested objects. j.l.String has 4 fields. String::value is an array of byte, which I need to reallocate too.

To overcome the constraints, I make the following new features in this patch.

  1. process_users_of_string_allocation intercepts all uses of AllocateArray.
  2. reorder the trinity Allocate-AllocateArray-ArrayCopy(or alloc-aa-ac) and eliminate them in MacroExpansion phase.
  3. ArrayCopyNode::is_string_copy conservatively recognizes string initialization from another string.
  4. invent a scheme to make deoptimization support a nested object, so deopt can reallocate an entire j.l.String object.

TBH, I haven't seen obvious performance improvement in SpecJBB. Only some microbenches can prove my idea. When I profile a string intensive application, I found that the major blocker is JDK-8261531. Current implementation requires the String object must be scalar-replaceable first. A String object is NOT scalar replaceable until its length is a compiler-time constant. It's pretty rare in the real world. Therefore, I would like to put the evaluation of performance impact on hold before I resolve JDK-8261531.

Could you help me verify that the current code can handle all code shape first? Further, I really appreciate that you can give me some advice in terms of high-level design.

thanks,
--lx

@TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Mar 23, 2021

Hi @navyxliu,

yes, these are all well-known limitations and unfortunately there is no simple solution to get rid of them. But instead of adding code to work around the limitations in special cases, I think it would be better to find more general solutions. Especially, if there are no performance numbers that would justify the amount of code complexity for handling that special case.

And as you've said, even with your patch, JDK-8261531 is still a major blocker and I think that alone is very complex to fix.

But that's only my opinion, let's see what other reviewers think.

Best regards,
Tobias

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Mar 24, 2021

@navyxliu I think you want to fix a lot of issues once. You need to split problems you see.

I did few experiments and the one issue which could be fixed first is scalarizing of array created by Arrays.copyOfRange() used by String::substring().

I see that current JDK easy scalarize new array and arraycopy():

     private static int test4(byte[] src) {
         byte[] dst = new byte[6];
         System.arraycopy(src, 1, dst, 0, 6);
         return dst[2] + dst[5];
     }

But it did not scalarize:

     private static int test5(byte[] src) {
         byte[] dst = Arrays.copyOfRange(src, 1, 7);
         return dst[2] + dst[5];
     }

Even so code in Arrays.copyOfRange() is the same as in test4():
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/Arrays.java#L3817

Arrays does not escaped but it is not eliminated because Loads are not redirected into dst array as in test4() case.

I think this should be fixed first.

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Mar 24, 2021

Hmm, actually it was easy solved with -XX:DisableIntrinsic=_min :(

This intrinsic was implemented long ago. the code is used for arraycopy optimization. But for some reason it does not work here.

I think this should be investigated first-first. It would be interesting to see performance effects if we disable this intrinsic at all.

@cl4es
Copy link
Member

@cl4es cl4es commented Mar 24, 2021

You seem to have had a run-in with these intrinsics causing issues before: https://bugs.openjdk.java.net/browse/JDK-8039104 - is this a similar issue?

On a trivial micro adding together a few Math.min values the intrinsic seem to have no effect on my x64 system, down to generating the exact same assembly.

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Mar 24, 2021

Yes, my test5() have cmove:

021     cmovlge RBP, RDX        # signed, int
@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Mar 24, 2021

Thanks @cl4es. I forgot about that bug.

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Apr 21, 2021

@navyxliu This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
6 participants