
8320649: C2: Optimize scoped values #16966

Open · wants to merge 28 commits into master

Conversation

Contributor

@rwestrel rwestrel commented Dec 5, 2023

This change implements C2 optimizations for calls to
ScopedValue.get(). For instance, in:

v1 = scopedValue.get();
...
v2 = scopedValue.get();

v2 can be replaced by v1 and the second call to get() can be
optimized out. That holds whatever code sits between the two calls, unless a
new mapping for scopedValue is created in between (when that happens,
no optimization is performed for the method being compiled). Hoisting
a get() call out of a loop for a loop-invariant scopedValue should
also be legal in most cases.

ScopedValue.get() is implemented in Java code as a two-step process. A
cache is attached to the current thread object. If the ScopedValue
object is in the cache, then the result of get() is read from
there. Otherwise, a slow call is performed that also inserts the
mapping into the cache. The cache itself is lazily allocated. One
ScopedValue can hash to 2 different indexes in the cache. On a
cache probe, both indexes are checked. As a consequence, probing
the cache is a multi-step process (check whether the cache is
present, check the first index, check the second index if the first
check failed). If the cache is populated early on, then when a method
that calls ScopedValue.get() is compiled, profiling reports the slow
path as never taken and only the read from the cache is compiled.
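The probe sequence above can be sketched in plain Java. This is a hypothetical simulation: the class name, SIZE, slowGet and the index derivation are made-up illustrations, not the real java.lang.ScopedValue implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheProbeSketch {
    static final int SIZE = 16;                        // power of two (illustrative)
    static Object[] keys, vals;                        // lazily allocated; per-thread in the real JDK
    static final Map<Object, Object> bindings = new HashMap<>(); // stands in for the current mappings
    static int slowCalls = 0;

    static Object get(Object sv) {
        if (keys != null) {                            // step 1: is the cache present?
            int i1 = sv.hashCode() & (SIZE - 1);       // first candidate slot
            int i2 = (sv.hashCode() >>> 4) & (SIZE - 1); // second candidate slot (made-up derivation)
            if (keys[i1] == sv) return vals[i1];       // step 2: hit at the first index?
            if (keys[i2] == sv) return vals[i2];       // step 3: hit at the second index?
        }
        return slowGet(sv);                            // slow path: look up and populate the cache
    }

    static Object slowGet(Object sv) {
        slowCalls++;
        if (keys == null) { keys = new Object[SIZE]; vals = new Object[SIZE]; }
        Object v = bindings.get(sv);
        int i1 = sv.hashCode() & (SIZE - 1);
        keys[i1] = sv; vals[i1] = v;                   // install the mapping in the cache
        return v;
    }

    public static void main(String[] args) {
        Object sv = new Object();
        bindings.put(sv, "payload");
        get(sv);                                       // first call takes the slow path
        get(sv);                                       // subsequent calls hit the cache
        System.out.println(get(sv) + " " + slowCalls); // prints: payload 1
    }
}
```

Once the cache warms up, every call resolves in the fast path, which is why profiling typically reports the slow path as never taken.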

To perform the optimizations, I added 3 new node types to C2:

  • the pair
    ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for
    the cache probe

  • a CFG node, ScopedValueGetResultNode, to help locate the result of the
    get() call in the IR graph.

In pseudo code, once the nodes are inserted, the code of a get() is:

hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
  res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
  res = ..; // slow call, possibly inlined. Subgraph can be arbitrarily complex
}
res = ScopedValueGetResult(res)

In the snippet:

v1 = scopedValue.get();
...
v2 = scopedValue.get();

Replacing v2 by v1 is then done by starting from the
ScopedValueGetResult node for the second get() and looking for a
dominating ScopedValueGetResult for the same ScopedValue
object. When one is found, it is used as a replacement. Eliminating
the second get() call is achieved by making
ScopedValueGetHitsInCache always successful if there's a dominating
ScopedValueGetResult and replacing its companion
ScopedValueGetLoadFromCache by the dominating
ScopedValueGetResult.

Hoisting a get() out of a loop is achieved by peeling one iteration of
the loop. The optimization above then finds a dominating get() and
removes the get() from the loop body.
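The peeling transformation can be pictured at the source level. This is an illustrative sketch (the class and the get() stand-in are made up, not compiler IR):

```java
public class PeelSketch {
    static int calls = 0;
    static int get() { calls++; return 42; }  // stands in for a loop-invariant ScopedValue.get()

    // Original loop: one get() per iteration.
    static int original(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += get();
        return sum;
    }

    // After peeling one iteration, the get() in the remaining loop body is
    // dominated by the peeled get() and can be replaced by its result.
    static int peeled(int n) {
        int sum = 0;
        if (n > 0) {
            int v = get();                    // peeled first iteration
            sum += v;
            for (int i = 1; i < n; i++) sum += v; // redundant get() commoned away
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(original(5) + " " + calls); // 5 calls
        calls = 0;
        System.out.println(peeled(5) + " " + calls);   // same result, 1 call
    }
}
```

Both forms compute the same sum, but the peeled form performs the invariant operation once.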

An important case, I think, is when profiling predicts the slow path
to be never taken. Then the code of get() is:

hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
  res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
  trap();
}
res = ScopedValueGetResult(res)

The ScopedValueGetResult doesn't help in that case and is removed early on. The
optimization process then looks for a pair of
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache that
dominates the current pair of
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache and can
replace it. In that case, hoisting a ScopedValue.get() can be
done by predication, and I added special logic to predication for that.

Adding the new nodes to the graph when a ScopedValue.get() call is
encountered is done in several steps:

1- inlining of ScopedValue.get() is delayed and the call is enqueued
for late inlining.

2- Once the graph is fully constructed, for each call to
ScopedValue.get(), a ScopedValueGetResult is added between the
result of the call and its uses.

3- the call is then inlined by parsing the ScopedValue.get() method

4- finally the subgraph that results is pattern matched and the pieces
required to perform the cache probe are extracted and attached to new
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache nodes

There are a couple of reasons for steps 3 and 4:

  • As mentioned above, probing the cache is a multi-step process. Having
    only 2 nodes in a simple graph shape to represent it makes it easier
    to write robust optimizations.

  • The subgraph for the method after parsing contains valuable pieces
    of information: profile data that captures which of the 2 locations
    in the cache is the most likely to cause a hit. That profile data is
    attached to the new nodes.

Removal of redundant nodes is done during loop opts. The ScopedValue
nodes are then expanded. That also happens during loop opts because
once expansion is over, there are opportunities for further
optimizations/clean-up that can only happen during loop opts. During
expansion, ScopedValueGetResult nodes are removed and
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache are expanded
to the multi-step process of probing the cache. Profile data attached
to the nodes is used to assign correct frequencies/counts to the If
nodes. Of the 2 cache locations that are tested, the one most
likely to see a hit (from profile data) is checked first.
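The profile-guided ordering of the two checks can be sketched as follows. The names and counters are illustrative; in the patch this decision is made on If nodes during expansion, not in Java:

```java
public class ProbeOrderSketch {
    // Per-call-site profile counts of hits at each of the two cache slots
    // (hypothetical stand-ins for the profile data attached to the nodes).
    static long hitsAtFirst = 3, hitsAtSecond = 97;

    static int probe(int[] cacheKeys, int key, int i1, int i2) {
        // Order the checks so the statistically hotter slot is tested first.
        int first  = hitsAtFirst >= hitsAtSecond ? i1 : i2;
        int second = hitsAtFirst >= hitsAtSecond ? i2 : i1;
        if (cacheKeys[first] == key)  return first;
        if (cacheKeys[second] == key) return second;
        return -1; // miss: fall back to the slow path
    }

    public static void main(String[] args) {
        int[] cacheKeys = {0, 0, 7, 0};
        // Slot 2 is the profile-hot slot here, so it is probed first.
        System.out.println(probe(cacheKeys, 7, 0, 2));
    }
}
```

Testing the hotter slot first makes the common case resolve with a single load and compare.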

/cc hotspot-compiler


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Issue

  • JDK-8320649: C2: Optimize scoped values (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966
$ git checkout pull/16966

Update a local copy of the PR:
$ git checkout pull/16966
$ git pull https://git.openjdk.org/jdk.git pull/16966/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16966

View PR using the GUI difftool:
$ git pr show -t 16966

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16966.diff

Webrev

Link to Webrev Comment


bridgekeeper bot commented Dec 5, 2023

👋 Welcome back roland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Dec 5, 2023

openjdk bot commented Dec 5, 2023

@rwestrel
The hotspot-compiler label was successfully added.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 5, 2023

@TobiHartmann TobiHartmann left a comment

No review yet, I just performed some quick testing.

The optimized build fails:

[2023-12-05T16:26:12,957Z] open/src/hotspot/share/opto/loopnode.cpp:4745: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const'
[2023-12-05T16:26:12,960Z] open/src/hotspot/share/opto/loopnode.cpp:4761: error: undefined reference to 'ScopedValueGetLoadFromCacheNode::verify() const'
[2023-12-05T16:26:12,964Z] open/src/hotspot/share/opto/loopnode.cpp:4908: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const'
[2023-12-05T16:26:12,967Z] open/src/hotspot/share/opto/loopnode.cpp:4911: error: undefined reference to 'ScopedValueGetLoadFromCacheNode::verify() const'
[2023-12-05T16:26:12,976Z] open/src/hotspot/share/opto/loopopts.cpp:3935: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const'
[2023-12-05T16:26:15,455Z] collect2: error: ld returned 1 exit status

compiler/c2/irTests/TestScopedValue.java fails with -Xcomp on Linux x64:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8de687d1b9, pid=270115, tid=270131
#
# JRE version: Java(TM) SE Runtime Environment (22.0) (fastdebug build 22-internal-2023-12-05-1616186.tobias.hartmann.jdk2)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-12-05-1616186.tobias.hartmann.jdk2, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x12931b9]  PhaseIdealLoop::get_early_ctrl(Node*)+0x4c9

Current CompileTask:
C2:30390 8110    b  4       compiler.c2.irTests.TestScopedValue::testFastPath13 (28 bytes)

Stack: [0x00007f8dc4353000,0x00007f8dc4453000],  sp=0x00007f8dc444d6c0,  free space=1001k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x12931b9]  PhaseIdealLoop::get_early_ctrl(Node*)+0x4c9  (loopnode.hpp:1139)
V  [libjvm.so+0x1293d95]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0x75  (loopnode.cpp:251)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x12960e0]  PhaseIdealLoop::test_and_load_from_cache(Node*, Node*, Node*, Node*, float, float, Node*, Node*&, Node*&, Node*&)+0x820  (loopnode.cpp:4900)
V  [libjvm.so+0x1296b6a]  PhaseIdealLoop::expand_get_from_sv_cache(ScopedValueGetHitsInCacheNode*)+0x82a  (loopnode.cpp:4822)
V  [libjvm.so+0x1297473]  PhaseIdealLoop::expand_scoped_value_get_nodes()+0x243  (loopnode.cpp:4737)
V  [libjvm.so+0x12a45ed]  PhaseIdealLoop::build_and_optimize()+0xf0d  (loopnode.cpp:4672)
V  [libjvm.so+0x9f4ea2]  PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x432  (loopnode.hpp:1113)
V  [libjvm.so+0x9ed945]  Compile::optimize_loops(PhaseIterGVN&, LoopOptsMode)+0x75  (compile.cpp:2248)
V  [libjvm.so+0x9f0253]  Compile::Optimize()+0xfd3  (compile.cpp:2500)
V  [libjvm.so+0x9f37e1]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c21  (compile.cpp:860)
V  [libjvm.so+0x83eca7]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e7  (c2compiler.cpp:134)
V  [libjvm.so+0x9ff17c]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x92c  (compileBroker.cpp:2299)
V  [libjvm.so+0x9ffe08]  CompileBroker::compiler_thread_loop()+0x468  (compileBroker.cpp:1958)
V  [libjvm.so+0xeb93bc]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:720)
V  [libjvm.so+0x17992c6]  Thread::call_run()+0xb6  (thread.cpp:220)
V  [libjvm.so+0x14a30f7]  thread_native_entry(Thread*)+0x127  (os_linux.cpp:787)

compiler/c2/irTests/TestScopedValue.java fails with -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation on Linux x64:

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public static void compiler.c2.irTests.TestScopedValue.testFastPath7()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#C#CALL_OF_METHOD#_", "slowGet"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(Call.*Java.*)+(\\s){2}===.*slowGet )"
           - Matched forbidden node:
             * 501  CallStaticJava  === 370 6 7 8 1 (648 1 1 1 1 1 ) [[ 502 503 504 ]] # Static  java.lang.ScopedValue::slowGet 

compiler/c2/irTests/TestScopedValue.java fails with -XX:TypeProfileLevel=222 on AArch64:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S29331/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/e2db05c4-923c-4a63-923b-5f9870681cc5/runs/c74e986d-15f1-46dc-822b-a41d12c079e0/workspace/open/src/hotspot/share/opto/callGenerator.cpp:929), pid=44590, tid=26115
#  Error: assert(in->Opcode() == Op_LoadP || in->Opcode() == Op_LoadN) failed

Current CompileTask:
C2:766  689    b  4       compiler.c2.irTests.TestScopedValue::testFastPath1 (30 bytes)

Stack: [0x00000001719ec000,0x0000000171bef000],  sp=0x0000000171beb0c0,  free space=2044k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x1130268]  VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x564  (callGenerator.cpp:929)
V  [libjvm.dylib+0x1130a88]  VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*)+0x0
V  [libjvm.dylib+0x5618b0]  print_error_for_unit_test(char const*, char const*, char*)+0x0
V  [libjvm.dylib+0x396ea0]  LateInlineScopedValueCallGenerator::process_result(GraphKit&)+0x2534
V  [libjvm.dylib+0x38f8dc]  CallGenerator::do_late_inline_helper()+0x660
V  [libjvm.dylib+0x4cd2bc]  Compile::inline_scoped_value_calls(PhaseIterGVN&)+0x570
V  [libjvm.dylib+0x4c6944]  Compile::Optimize()+0x210
V  [libjvm.dylib+0x4c54bc]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1228
V  [libjvm.dylib+0x38a590]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e0
V  [libjvm.dylib+0x4e2f48]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x854
V  [libjvm.dylib+0x4e238c]  CompileBroker::compiler_thread_loop()+0x348
V  [libjvm.dylib+0x8bb170]  JavaThread::thread_main_inner()+0x1dc
V  [libjvm.dylib+0x1076548]  Thread::call_run()+0xf4
V  [libjvm.dylib+0xe39138]  thread_native_entry(Thread*)+0x138
C  [libsystem_pthread.dylib+0x726c]  _pthread_start+0x94

compiler/c2/irTests/TestScopedValue.java fails with -XX:+UnlockDiagnosticVMOptions -XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion -XX:+StressRangeCheckElimination -XX:+StressLinearScan on AArch64:

compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method private void compiler.c2.irTests.TestScopedValue.testFastPath1Runner() throws java.lang.Exception
	at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162)
	at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:87)
	at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:822)
	at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249)
	at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164)
Caused by: java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159)
	... 4 more
Caused by: java.lang.RuntimeException: should be compiled
	at compiler.c2.irTests.TestScopedValue.testFastPath1Runner(TestScopedValue.java:87)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	... 6 more

compiler/c2/TestUnsignedByteCompare.java and compiler/codegen/TestSignedMultiplyLong.java fail with -Duse.JTREG_TEST_THREAD_FACTORY=Virtual -XX:-VerifyContinuations intermittent on Windows x64:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S29331/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/e2db05c4-923c-4a63-923b-5f9870681cc5/runs/c74e986d-15f1-46dc-822b-a41d12c079e0/workspace/open/src/hotspot/share/opto/compile.cpp:813), pid=29127, tid=26371
#  assert(IncrementalInline || (_late_inlines.length() == 0 && !has_mh_late_inlines())) failed: incremental inlining is off

Current CompileTask:
C2:5797 3627    b        java.lang.System$2::scopedValueCache (4 bytes)

Stack: [0x0000000171694000,0x0000000171897000],  sp=0x0000000171894bc0,  free space=2050k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x1130268]  VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x564  (compile.cpp:813)
V  [libjvm.dylib+0x1130a88]  VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*)+0x0
V  [libjvm.dylib+0x5618b0]  print_error_for_unit_test(char const*, char const*, char*)+0x0
V  [libjvm.dylib+0x4c5794]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1500
V  [libjvm.dylib+0x38a590]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e0
V  [libjvm.dylib+0x4e2f48]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x854
V  [libjvm.dylib+0x4e238c]  CompileBroker::compiler_thread_loop()+0x348
V  [libjvm.dylib+0x8bb170]  JavaThread::thread_main_inner()+0x1dc
V  [libjvm.dylib+0x1076548]  Thread::call_run()+0xf4
V  [libjvm.dylib+0xe39138]  thread_native_entry(Thread*)+0x138
C  [libsystem_pthread.dylib+0x726c]  _pthread_start+0x94

Just let me know if you need any more information.


rwestrel commented Dec 7, 2023

No review yet, I just performed some quick testing.

Thanks for doing that. All issues should be fixed now (I couldn't reproduce the last one so not 100% sure about that one).

@dean-long (Member) commented:

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

  • My first reaction was why does this need to be so complicated? Can't we treat the slow path and cache implementation as a black box, and just common up the high-level get()? In my mind the ScopedValue get should be similar to a read from a "condy" dynamic constant.
  • The reason for breaking up the operations into individual nodes seems to be because of the profiling information. So I'm wondering how much this helps, given the added complexity.
  • I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors. The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.
  • Are we able to optimize a get() on a constant/final ScopedValue into a simple array load at a constant offset?
  • Needing to do things like treating ScopedValueGetHitsInCache as always successful gives me a bad feeling for some reason, and seems unnecessary if we did more at a higher (macro?) level rather than eagerly expanding the high-level operation into individual nodes.

@theRealAph (Contributor) commented:

A couple of answers:

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

* I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors.  The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.

Binding and get() are usually separated by a long way. It's a common pattern to use get() inside a loop when a ScopedValue is used to hold a capability object which is private within a library context.

* Are we able to optimize a get() on a constant/final ScopedValue into a simple array load at a constant offset?

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.


rwestrel commented Jan 5, 2024

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

* My first reaction was why does this need to be so complicated?

That's a fair reaction.

Can't we treat the slow path and cache implementation as a black box, and just common up the high-level get()? In my mind the ScopedValue get should be similar to a read from a "condy" dynamic constant.

Initially, I thought about delaying the inlining of get() methods and simply having a pass that looks for get() calls with the same inputs. I don't think that works well because the current late-inlining framework can't delay inlining very late. We don't run loop opts before we're done with inlining, for instance. If we wanted to hoist a call out of a loop we would need loop opts: it's likely a call to get() depends on a null check that we would need to hoist first.

The other thing about optimizing get() calls is that they are heavyweight nodes (a high-level get() macro node would be very similar to a get() call node whichever way you look at it). We don't know how to hoist a call out of a loop. A call acts as a barrier on the entire memory state and gets in the way of memory optimizations. If profiling reports the slow path as never taken, then the shape of the get() becomes lighter weight. It doesn't disrupt other optimizations. Probing the cache acts as a load plus a test, which we do know how to hoist out of a loop.

It felt to me that it would be fairly common for the slow path not to be needed and, given that the shape without the slow path is much easier to optimize, it was important to expose early on whether the slow path was there or not.

* The reason for breaking up the operations into individual nodes seems to be because of the profiling information.  So I'm wondering how much this helps, given the added complexity.

The thing about get() is that in simple cases, it optimizes well because of profile data. A get() call, once inlined, can essentially be hoisted out of a loop if all goes well. It doesn't take much for these simple optimizations on get() to stop happening. The goal of this patch is to bring consistency and have the optimizations work well in all sorts of scenarios. But it would be a hard sell if the simple cases didn't work as well as they do without the patch. And I believe that requires profile data.

* I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors.  The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.

Eliminating get() calls with the same inputs may not be common in Java code, but that transformation is a building block for other optimizations. Hoisting a get() out of a loop can be achieved by peeling one iteration and letting the get() in the loop body be removed because it's redundant with the one from the peeled iteration. Also, the code that C2 optimizes once inlining has happened and dead paths have been trimmed doesn't necessarily look like the Java code the programmer wrote.

* Needing to do things like treat ScopedValueGetHitsInCache as always successful give be a bad feeling for some reason, and seem unnecessary if we did more at a higher (macro?) level rather than eagerly expanding the high-level operation into individual nodes.

I think my comments above cover that one.

@dean-long (Member) commented:

Thanks @rwestrel, that helps. I have no objections to this change, but I don't understand C2 enough to do a deeper review.

@dean-long (Member) commented:

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.

@theRealAph I guess it boils down to whether the hash value can be treated as a compile-time constant, which seems possible because it's marked final.

@theRealAph (Contributor) commented:

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.

@theRealAph I guess it boils down to whether the hash value can be treated as a compile-time constant, which seems possible because it's marked final.

It always has been in the tests I've done. One of the interesting challenges with this work has been to make sure scoped value performance doesn't regress. A great advantage of this PR is that a dedicated scoped value optimization helps to make such regressions less likely.
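As an aside, when the hash is a compile-time constant, the probe indexes fold to constants too, so the cache lookup reduces to array loads at fixed offsets. A minimal sketch — the constant and the index derivation are assumptions for illustration, not the JDK's actual hashing scheme:

```java
public class ConstantIndexSketch {
    static final int SIZE = 16;               // cache size, power of two (illustrative)
    static final int HASH = 0x9E3779B9;       // a hypothetical final, constant-foldable hash
    // Both indexes are compile-time constants once HASH is constant:
    static final int FIRST  = HASH & (SIZE - 1);
    static final int SECOND = (HASH >>> 4) & (SIZE - 1);

    public static void main(String[] args) {
        System.out.println(FIRST + " " + SECOND); // fixed offsets, known at compile time
    }
}
```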

@TobiHartmann (Member) commented:

Tests all pass now.

@eme64 eme64 left a comment

I just studied the background and hope to look into this in the next days.

Personal wishlist: can you add a case where this optimization enables vectorization? Or do your optimizations happen too late for that?

// }
// MyLong long2 = (MyLong)scopedValue.get();
// return long1.getValue() + long2.getValue();
// }
Contributor:

Are you still working on this?

Contributor Author:

No. I couldn't make the test work unfortunately, so I wasn't sure whether to leave the test commented out (in case someone revisits that later) or not.

Contributor:

Maybe have the body of the test put in, and the IR-rules commented out, with a follow-up RFE for investigation, if you think there is something one can do about it?

@eme64 eme64 left a comment

Nice work @rwestrel
I'm sending out a first batch of comments, more coming later.

@@ -105,7 +105,8 @@ Node* BarrierSetC2::store_at_resolved(C2Access& access, C2AccessValue& val) cons
assert(access.is_opt_access(), "either parse or opt access");
C2OptAccess& opt_access = static_cast<C2OptAccess&>(access);
Node* ctl = opt_access.ctl();
MergeMemNode* mm = opt_access.mem();
assert(opt_access.mem()->is_MergeMem(), "");
MergeMemNode* mm = opt_access.mem()->as_MergeMem();
Contributor:

Does as_MergeMem not assert is_MergeMem?

Contributor:

Thanks for the update ✔️

// }
// continue:
//
// slow_call:
Contributor:

Makes it look like continue falls through to slow_call, which is not what you want, right?

Contributor:

Thanks for the update ✔️

src/hotspot/share/opto/callGenerator.cpp (outdated; resolved)
// the result of the parsing of the Java code where the success path ends with a Halt node. The reason for that is
// that some paths may end with an uncommon trap and if one traps, we want the trap to be recorded for the right bci.
// When the ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache pair is expanded, split-if finds the duplicate
// logic and cleans it up.
Contributor:

I would prefer the comment section at the beginning of the method. Otherwise I may start reading down linearly, reverse-engineer the code, and only discover this afterwards... 😅

Contributor:

Thanks for the updates ✔️

_process_result = v;
}

virtual void process_result(GraphKit& kit) {
Contributor:

It would be really nice if you refactored this huge method (6+ pages of code) into smaller units.
It would for example make separation of pattern-matching and transformation easier to see.

slow_call = c->as_CallStaticJava();
assert(slow_call->method()->intrinsic_id() == vmIntrinsics::_SVslowGet, "");
} else {
assert(c->is_Proj() || c->is_Catch(), "");
Contributor:

Suggested change
assert(c->is_Proj() || c->is_Catch(), "");
assert(c->is_Proj() || c->is_Catch(), "unexpected node in pattern matching");

Unique_Node_List wq;
wq.push(kit.control());
for (uint i = 0; i < wq.size(); ++i) {
Node* c = wq.at(i);
Contributor:

Could we assert that these are all CFG nodes? And give it a more expressive name?

}
}
// get_first_iff/get_second_iff contain the first/second check we ran into during the graph traversal but they may
// not be the first/second one in execution order. Perform another traversal to figure out which is first.
Contributor:

Why can this not be done in the first traversal, and why does this (down) traversal do the right thing?

Can we assert that c is always CFG?
Please mention in a comment that we are only traversing the same CFG nodes from the first traversal.

Contributor Author:

Why can this not be done in the first traversal, and why does this (down) traversal do the right thing?

The first traversal starts from the end of the method and follows control paths until it reaches the Thread.scopedValueCache() call. Given the shape of the method, and given that some paths may have been trimmed and end with an uncommon trap, it could reach either the first or the second if that probes the cache first.

// get_first_iff/get_second_iff contain the first/second check we ran into during the graph traversal but they may
// not be the first/second one in execution order. Perform another traversal to figure out which is first.
if (get_second_iff != nullptr) {
Node_Stack stack(0);
Contributor:

No visited set. Can this trigger an exponential explosion with if/region diamonds?

Contributor Author:

No visited set. Can this trigger an exponential explosion with if/region diamonds?

It only follows the control subgraph for the ScopedValue.get() which is fairly simple.

get_first_iff->in(1)->as_Bool()->_test._test == BoolTest::ne ? 0 : 1);
CallStaticJavaNode* get_first_iff_unc = get_first_iff_failure->is_uncommon_trap_proj(Deoptimization::Reason_none);
if (get_first_iff_unc != nullptr) {
// first cache check never hits, keep only the second.
Contributor:

I'm struggling to understand:
We still have an unc-trap for the first. So we never failed so far, right? So we always found it in the cache, or am I wrong?
We are not removing this unc-trap though, right?

Contributor Author

The ScopedValue.get() code probes 2 cache locations. If, when pattern matching the get() subgraph:

  • we only find a single if that probes the cache, then, according to profile data, there was always a hit at the first cache location.

  • we find 2 ifs, then the first and second locations were probed. If the first if's other branch is to an uncommon trap, then that location never saw a cache hit. In that case, when the ScopedValueGetHitsInCacheNode is expanded, only code to probe the second location is added back to the IR.
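The two-probe lookup being pattern-matched can be sketched in plain Java. This is a hypothetical model for illustration only: the flat key/value array layout, the index parameters, and the slowGet stand-in are assumptions, not the actual java.lang.ScopedValue implementation.

```java
public class TwoProbeCache {
    // Stand-in for the slow path: installs the mapping at the first slot.
    static Object slowGet(Object[] cache, int firstIndex, Object key) {
        cache[firstIndex] = key;
        cache[firstIndex + 1] = "computed";
        return cache[firstIndex + 1];
    }

    // Fast path shape: check the cache exists, probe the first slot, probe
    // the second slot, and only then fall back to the slow path.
    static Object get(Object[] cache, int firstIndex, int secondIndex, Object key) {
        if (cache != null) {
            if (cache[firstIndex] == key)  return cache[firstIndex + 1];  // first probe hits
            if (cache[secondIndex] == key) return cache[secondIndex + 1]; // second probe hits
        }
        return slowGet(cache, firstIndex, key);
    }

    public static void main(String[] args) {
        Object[] cache = new Object[16];
        Object key = new Object();
        Object v1 = get(cache, 4, 10, key); // miss: slow path installs the mapping
        Object v2 = get(cache, 4, 10, key); // hit at the first probe
        System.out.println(v1 == v2);       // true
    }
}
```

Profile data then decides which of these probes (and whether the slow-path fallback) ends up in the compiled code at all.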

Contributor

@eme64 eme64 left a comment

Second batch of comments.

kit.set_i_o(io);

// remove the scopedValueCache() call
CallProjections scoped_value_cache_projs = CallProjections();
Contributor

Suggested change
CallProjections scoped_value_cache_projs = CallProjections();
CallProjections scoped_value_cache_projs;

Is the assignment really necessary, or style-wise preferable? I see you use it without elsewhere.

// Now move right above the scopedValueCache() call
Node* mem = scoped_value_cache->in(TypeFunc::Memory);
Node* c = scoped_value_cache->in(TypeFunc::Control);
Node* io = scoped_value_cache->in(TypeFunc::I_O);
Contributor

I would use input_mem, input_ctrl, input_io. Then the replacements below would read more intuitively.

first_index == nullptr ? C->top() : first_index,
second_index == nullptr ? C->top() : second_index);

// It will later be expanded back to all the checks so record profile data
Contributor

Should we also copy the node info (e.g. line number etc)?

sv_hits_in_cache->set_profile_data(2, get_second_iff->_fcnt, get_second_prob);
} else {
sv_hits_in_cache->set_profile_data(2, 0, 0);
}
Contributor

Another case of code duplication. Why not write a method that extracts cnt, prob for an iff?`Or maybe that already exists?

assert(sv_hits_in_cachex == sv_hits_in_cache, "");

// And compute the probability of a miss in the cache
float prob;
Contributor

Suggested change
float prob;
float cache_miss_prob;

r->init_req(2, in_cache);

// ScopedValueGetLoadFromCache is a single node that represents the result of a hit in the cache
Node* cache_value = kit.gvn().transform(new ScopedValueGetLoadFromCacheNode(C, in_cache, sv_hits_in_cache));
Contributor

Suggested change
Node* cache_value = kit.gvn().transform(new ScopedValueGetLoadFromCacheNode(C, in_cache, sv_hits_in_cache));
Node* sv_load_from_cache = kit.gvn().transform(new ScopedValueGetLoadFromCacheNode(C, in_cache, sv_hits_in_cache));

For consistency with sv_hits_in_cache and the node class name.

do_intrinsic(_SVslowGet, java_lang_ScopedValue, slowGet_name, void_object_signature, F_R) \
do_name( slowGet_name, "slowGet") \
do_intrinsic(_SVCacheInvalidate, java_lang_ScopedValue_Cache, invalidate_name, int_void_signature, F_S) \
do_name( invalidate_name, "invalidate") \
Contributor

What prevents us from writing out _scopedValueGet etc, just like the other names here?

Contributor

Thanks for the update ✔️


ProjNode* result_out() {
return proj_out_or_null(Result);
}
Contributor

Either verify that we have not null, or else rename to result_out_or_null.

Contributor

Thanks for the update ✔️

@@ -461,6 +462,7 @@ void Compile::disconnect_useless_nodes(Unique_Node_List& useful, Unique_Node_Lis
remove_useless_late_inlines( &_string_late_inlines, useful);
remove_useless_late_inlines( &_boxing_late_inlines, useful);
remove_useless_late_inlines(&_vector_reboxing_late_inlines, useful);
remove_useless_late_inlines( &_scoped_value_late_inlines, useful);
Contributor

Suggested change
remove_useless_late_inlines( &_scoped_value_late_inlines, useful);
remove_useless_late_inlines( &_scoped_value_late_inlines, useful);

Contributor

@eme64 eme64 Jan 17, 2024

Make it align with the code above

And:
We keep adding more and more of these...
Soon we will need to refactor this to a list of lists 😅

@@ -2025,6 +2030,73 @@ void Compile::inline_boxing_calls(PhaseIterGVN& igvn) {
}
}

void Compile::inline_scoped_value_calls(PhaseIterGVN& igvn) {
if (_scoped_value_late_inlines.length() > 0) {
Contributor

Rather than indenting everything, I would just check _scoped_value_late_inlines.is_empty() and return.

Contributor

@eme64 eme64 left a comment

Third comment batch.
I still have not looked at some parts, but I think this is enough detail for today ;)

}
C->set_has_scoped_value_get_nodes(true);
CallNode* call = cg->call_node();
CallProjections projs;
Contributor

Suggested change
CallProjections projs;
CallProjections call_projs;

call->extract_projections(&projs, true);
Node* sv = call->in(TypeFunc::Parms);
Node* control_out = projs.fallthrough_catchproj;
Node* res = projs.resproj;
Contributor

can we have longer and more descriptive names for sv and res please 🙏 😃

gvn->record_for_igvn(control_out);
res = res->clone();
gvn->set_type_bottom(res);
gvn->record_for_igvn(res);
Contributor

Why do you clone these? (maybe add a comment)

case Op_ScopedValueGetResult:
case Op_ScopedValueGetHitsInCache:
case Op_ScopedValueGetLoadFromCache: {
ShouldNotReachHere();
Contributor

Why? Add a comment!

Node* ScopedValueGetLoadFromCacheNode::scoped_value() const {
Node* hits_in_cache = in(1);
assert(hits_in_cache->Opcode() == Op_ScopedValueGetHitsInCache, "");
return ((ScopedValueGetHitsInCacheNode*)hits_in_cache)->scoped_value();
Contributor

Why not add the necessary bits to the class so you can use as_ScopedValueGetLoadFromCache()?

wq.push(u);
}
}
} else if (n->Opcode() != Op_Halt) {
Contributor

So a path without a call and only a Halt node is also an uncommon trap?

}
return res;
}

Contributor

Code duplication warning 😉
Not sure what is the best solution though.

Contributor

I think you could remove duplication with a simple "implement" function, that takes a parameter "want_unique". Then if you find the element, and don't want unique, you return. If you are looking for unique, you just continue and check that you don't find it again.

@@ -614,6 +614,16 @@ void Type::Initialize_shared(Compile* current) {
TypeInstKlassPtr::OBJECT = TypeInstKlassPtr::make(TypePtr::NotNull, current->env()->Object_klass(), 0);
TypeInstKlassPtr::OBJECT_OR_NULL = TypeInstKlassPtr::make(TypePtr::BotPTR, current->env()->Object_klass(), 0);

const Type **fgetfromcache =(const Type**)shared_type_arena->AmallocWords(3*sizeof(Type*));
Contributor

Suggested change
const Type **fgetfromcache =(const Type**)shared_type_arena->AmallocWords(3*sizeof(Type*));
const Type** fgetfromcache =(const Type**)shared_type_arena->AmallocWords(3*sizeof(Type*));

fgetfromcache[1] = TypeInstPtr::BOTTOM;
fgetfromcache[2] = TypeAryPtr::OOPS;
TypeTuple::make(3, fgetfromcache);
const Type **fsvgetresult =(const Type**)shared_type_arena->AmallocWords(2*sizeof(Type*));
Contributor

Suggested change
const Type **fsvgetresult =(const Type**)shared_type_arena->AmallocWords(2*sizeof(Type*));
const Type** fsvgetresult =(const Type**)shared_type_arena->AmallocWords(2*sizeof(Type*));

@@ -746,6 +746,7 @@ class TypeTuple : public Type {
static const TypeTuple *LONG_PAIR;
static const TypeTuple *INT_CC_PAIR;
static const TypeTuple *LONG_CC_PAIR;
static const TypeTuple *SV_GET_RESULT;
Contributor

Suggested change
static const TypeTuple *SV_GET_RESULT;
static const TypeTuple* SV_GET_RESULT;

Contributor

@eme64 eme64 left a comment

Batch 1 of today.

Node* first_index = nullptr; // index in the cache for first hash
Node* second_index = nullptr; // index in the cache for second hash
CallStaticJavaNode* slow_call = nullptr; // slowGet() call if any
{
Contributor

This setup really looks like it should be a class, maybe called ScopedValueGetPatternMatcher?
All your variables here could be fields, and the scope below a method, or even split into multiple methods.

}


bool PhaseIdealLoop::loop_predication_for_scoped_value_get(IdealLoopTree* loop, IfProjNode* if_success_proj,
Contributor

Add a short comment above, that we are trying to hoist the If for a ScopedValueGetHitsInCache out of the loop, if possible.
You have one on the first line, but I think generally we place them at the top, right?

Contributor

Well, now looking at the method... maybe a longer comment with a picture or pseudocode would be helpful.
It would greatly help me in reviewing the code - otherwise I basically have to draw the picture on a piece of paper myself before understanding it ;)

Contributor

I like it the most when there is ascii art that uses the variable names in the code below.

Contributor

I'll read the code in a later review.

// A ScopedValueGetHitsInCache check is loop invariant if the scoped value object it is applied to is loop invariant
BoolNode* bol = iff->in(1)->as_Bool();
if (bol->in(1)->Opcode() == Op_ScopedValueGetHitsInCache && invar.is_invariant(((ScopedValueGetHitsInCacheNode*)bol->in(1))->scoped_value()) &&
invar.is_invariant(((ScopedValueGetHitsInCacheNode*)bol->in(1))->index1()) && invar.is_invariant(((ScopedValueGetHitsInCacheNode*)bol->in(1))->index2())) {
Contributor

Please refactor this if:

  • if we don't take it, you return false, so make it a bailout. This allows you to already bailout if bol->in(1)->Opcode() == Op_ScopedValueGetHitsInCache fails.
  • After that check, you can already have this line: ScopedValueGetHitsInCacheNode* hits_in_the_cache = (ScopedValueGetHitsInCacheNode*) bol->in(1);, and simplify the other 3 conditions of the if, make it more readable.
  • But make them bailouts as well, instead of indenting the rest of the function.

Contributor

I don't really know how is_invar works. But why is a use-node not automatically variant, if a def-node is variant. Or stated in other terms: why not just check invar.is_invariant(hits_in_the_cache)?

bool PhaseIdealLoop::loop_predication_for_scoped_value_get(IdealLoopTree* loop, IfProjNode* if_success_proj,
ParsePredicateSuccessProj* parse_predicate_proj,
Invariance &invar, Deoptimization::DeoptReason reason,
IfNode* iff, IfProjNode*&new_predicate_proj) {
Contributor

Suggested change
IfNode* iff, IfProjNode*&new_predicate_proj) {
IfNode* iff, IfProjNode* &new_predicate_proj) {

if (TraceLoopPredicate) {
tty->print("Predicate invariant if: %d ", new_predicate_iff->_idx);
loop->dump_head();
} else if (TraceLoopOpts) {
Contributor

Why not have them as separate ifs? What if someone enables both, will they not miss a line?

Contributor Author

@rwestrel rwestrel Jan 30, 2024

That's the code pattern used elsewhere in PhaseIdealLoop::loop_predication_impl_helper().

return progress;
}

void PhaseIdealLoop::expand_get_from_sv_cache(ScopedValueGetHitsInCacheNode* get_from_cache) {
Contributor

Suggested change
void PhaseIdealLoop::expand_get_from_sv_cache(ScopedValueGetHitsInCacheNode* get_from_cache) {
void PhaseIdealLoop::expand_sv_get_hits_in_cache_and_load_from_cache(ScopedValueGetHitsInCacheNode* get_from_cache) {

Contributor

And again, a picture / pseudocode of the transformation would help immensely.

set_ctrl(zero, C->root());
_igvn.replace_input_of(iff, 1, zero);
_igvn.replace_node(get_from_cache, C->top());

Contributor

Suggested change

return;
}

Node* load_of_cache = get_from_cache->in(1);
Contributor

Suggested change
Node* load_of_cache = get_from_cache->in(1);
Node* cache_adr = get_from_cache->in(1);

}

Node* mem() const {
return in(Memory);
Contributor

Why not verify that this is a MemNode?

Contributor Author

MemNodes are not the only ones carrying memory state (projection for a call, membar etc).

Node* first_index = get_from_cache->index1();
Node* second_index = get_from_cache->index2();

if (first_index == C->top() && second_index == C->top()) {
Contributor

could this not be done during igvn?

Contributor Author

What happens here is that the cache was always observed to be null, so no code to probe the cache was added to the IR. When that happens, the optimizations still apply (i.e. there could be a dominated ScopedValue.get() that can be replaced by this one, or a dominating one that can replace this one). This should also be very uncommon. So the nodes that are put in place to enable optimizations are left in until expansion so as not to miss optimization opportunities.

Co-authored-by: Andrey Turbanov <turbanoff@gmail.com>
@rwestrel
Contributor Author

Personal wishlist: can you add a case where this optimization enables vectorization? Or do your optimizations happen too late for that?

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

@eme64
Contributor

eme64 commented Jan 25, 2024

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation 😞

Co-authored-by: Emanuel Peter <emanuel.peter@oracle.com>
@rwestrel
Contributor Author

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation 😞

Even if that was possible, ScopedValue get loads are from the cache indexed by a hash stored as a field in the ScopedValue object. I'm not sure how you would be able to tell which of the loads from several get() calls are contiguous in memory.

@rwestrel
Contributor Author

rwestrel commented May 2, 2024

I am wondering if it would make sense to have some scoped_value.hpp/cpp, where you can put all your new classes. This would also allow you to put documentation about the general approach at the top of the scoped_value.hpp file. Currently, the code is spread all over, and it would be hard to know where one could find a good summary of the whole optimization.

I moved most of the scoped value specific code to scoped_value.hpp/cpp in the new commit.

@rwestrel
Contributor Author

rwestrel commented May 2, 2024

I also removed Node::find_unique_out_with() and replaced it with Node* find_out_with(int opcode, bool want_unique = false). I'll look into the automatic casting but I'd like to possibly do it as a separate clean up.

@openjdk openjdk bot removed the rfr Pull request is ready for review label May 2, 2024
@openjdk openjdk bot added the rfr Pull request is ready for review label May 2, 2024
@rwestrel
Contributor Author

rwestrel commented May 2, 2024

@eme64 change is ready for another review

@eme64
Contributor

eme64 commented May 6, 2024

@rwestrel I feel like I am heavily stepping on your toes now....
Can you please do refactoring in a separate prior PR? This change is now 3K+ lines, and even reading through it all takes me more than a day; I simply cannot commit this many hours at a time.

I'm thinking in particular about your most recent changes with:

  • class Invariance
  • estimate_if_peeling_possible

Don't get me wrong: I like those refactorings, but they should be done separately.

If you can find anything else that could be done separately, that would help greatly.

I have been painstakingly separating my SuperWord PR's into more reviewable patches, and I do get quicker reviews that way.

My concern: I think the code is now in a state that can be understood (if one spends a day reading it all), but it is hard for me to say that it is correct. If I now approve this patch, then a subsequent reviewer will pay less attention, hence, I feel like I cannot just approve it too quickly now.

If I am too annoying, feel free to ask someone else to review and I will just step back. Maybe @theRealAph wants to review for a while?

@eme64
Contributor

eme64 commented May 6, 2024

@rwestrel one idea to split things here:

  • Early inline of ScopedValue methods
  • Parsing to IR nodes and expansion back.
  • Optimization
  • Tests

This way, I can spend only a few hours on one at a time, and we can get this done.
Of course, you cannot really integrate an individual one, maybe there is a way to use skara for dependent PR's?

@rwestrel
Contributor Author

rwestrel commented May 6, 2024

I'm thinking in particular about your most recent changes with:

* `class Invariance`

* `estimate_if_peeling_possible`

Don't get me wrong: I like those refactorings, but they should be done separately.

The problem I see is that they have little value unless this patch is integrated as it is. What if another reviewer thinks it's better to keep everything related to loop predication together? There's no need to change the class Invariance then.

@rwestrel
Contributor Author

rwestrel commented May 6, 2024

@rwestrel one idea to split things here:

* Early inline of ScopedValue methods

* Parsing to IR nodes and expansion back.

* Optimization

* Tests

This way, I can spend only a few hours on one at a time, and we can get this done. Of course, you cannot really integrate an individual one, maybe there is a way to use skara for dependent PR's?

Would one commit per line above work? Or do you think it needs to be different PRs?

@eme64
Contributor

eme64 commented May 6, 2024

@rwestrel
Just using commits is probably not really helpful. What would you do if there needs to be an update to commit 1, requested by a reviewer?

Honestly, I would like to take a break from this for now.
I leave it up to you how to present it in a way that is easier to review.
Once you get someone to review and accept it, I can see if I find time to review again.

I think the code is significantly better/readable than when we first started. So if someone like @vnkozlov simply scans and approves it, and as such takes the responsibility of "first reviewer", then I'm totally fine with that.

@vnkozlov
Contributor

vnkozlov commented May 7, 2024

I want to see performance numbers on x64 and aarch64 before starting looking on it. It would be nice to have data for all micros test/micro/org/openjdk/bench/java/lang/ScopedValues*.java

Put results into JBS and post short summary here.

You can compare by disable/enable new intrinsics.

@theRealAph
Contributor

I want to see performance numbers on x64 and aarch64 before starting looking on it. It would be nice to have data for all micros test/micro/org/openjdk/bench/java/lang/ScopedValues*.java

Put results into JBS and post short summary here.

You can compare by disable/enable new intrinsics.

I'm on it.

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label May 9, 2024
@theRealAph
Contributor

I've spoken to some senior JDK developers and their feeling is that this patch is too specific to the current scoped value implementation and too complex to go into HotSpot, especially for a feature like scoped values that is not yet out of preview.
This PR is good. It does everything needed to generate better, smaller code for scoped values. It also means that scoped values have a smaller memory footprint, a real win-win. However, I think we're going to have to park this PR for now.
I think we should revisit it when scoped values is out of preview.
Thanks, and I'm sorry that you put so much work into this patch for it not to be committed.

@eme64
Contributor

eme64 commented May 23, 2024

I know this is "parked" now, but there are some internal conversations happening.

One question that @dean-long and @rose00 had:
If there is more than one ScopedValue, what about profile pollution?

@rwestrel
Contributor Author

There are 3 optimizations that this patch performs:

1- replace a ScopedValue.get() by a dominating ScopedValue.get()
2- move a ScopedValue.get() out of loop
3- streamline code emitted for ScopedValue.get()

There are 2 ScopedValue.get() patterns that are handled:

1- when profile reports that the slow path that updates the cache is not taken: ScopedValue.get() always hits in the cache
2- when profile reports that the slow path is taken and the code that updates the cache is included in the compiled code

Obviously, before ScopedValue.get() can use the cache, the cache has to be updated, so the slow path is taken at some point. But because HotSpot doesn't profile the first invocations of a method, profile data can still report the slow path as never taken. That's likely what happens with a simple micro benchmark. There can also be multiple ScopedValues and the profile can still report the slow path as not taken. It's more a matter of when the cache is updated than of how many ScopedValues there are.

The patch performs optimizations 1-, 2- and 3- for patterns 1- and 2- but it does so better for pattern 1- than 2-. If the slow path is included in compiled code, then only a ScopedValue.get call that dominates the backedge of a loop is hoisted out of the loop. That's because hoisting is a 2 step process in the case of pattern 2-: peel one iteration of the loop and replace the ScopedValue.get in the loop with the one from the peeled iteration. When the slow path is compiled in the method, a lot of extra code is also included and that likely disrupts other optimizations that might be needed before ScopedValue.get can be optimized. For instance, the slow path likely comes with a non inlined call that could get in the way of memory subgraph optimizations. That's why I made sure the patch handles both patterns.

I thought about always speculating initially that the slow path is not taken when compiling ScopedValue.get or trying to find some way to work around profile pollution. I also thought about having cleverer ways of optimizing pattern 2-. But the patch felt complicated enough and when @theRealAph experimented with it, he reported that it was doing ok the way it is.
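The first two optimizations operate on code shapes like the following. SV here is a hypothetical immutable stand-in used only to show the shapes C2 looks for; the real subject is java.lang.ScopedValue, whose get() goes through the cache machinery described above.

```java
public class GetOptShapes {
    // Hypothetical stand-in: get() is side-effect free and stable as long as
    // no new binding is created, which is what makes the rewrites legal.
    static final class SV {
        private final String v;
        SV(String v) { this.v = v; }
        String get() { return v; }
    }

    static int lenTwice(SV sv) {
        String v1 = sv.get();
        String v2 = sv.get(); // optimization 1-: C2 can reuse v1 here
        return v1.length() + v2.length();
    }

    static int lenInLoop(SV sv, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += sv.get().length(); // optimization 2-: loop invariant, hoistable
        }
        return sum;
    }

    public static void main(String[] args) {
        SV sv = new SV("hello");
        System.out.println(lenTwice(sv));     // 10
        System.out.println(lenInLoop(sv, 3)); // 15
    }
}
```

Optimization 3- has no source-level equivalent: it is about emitting a tighter instruction sequence for the cache probe itself.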

@dean-long
Member

What's a good benchmark to run to show the benefit of this change, or to show the effect of different cache sizes and/or Java implementation changes?

I tried running micro:ScopedValue benchmarks with -Djava.lang.ScopedValue.cacheSize=2 and didn't see a difference. But the new compiler/scoped_value/TestScopedValue.java test fails in compiler.c2.irTests.TestScopedValue.testFastPath16 with the cache size set to 2.

Given the right benchmark, there are some experiments I'd like to try, related to the ScopedValue Java implemenation:

  1. use only a primary slot probe, no secondary
  2. use a deterministic secondary probe (based on the hash), not random
  3. fix put() so it will reuse an existing slot. Currently it blindly sets both victim and other slots. It seems like it should check the other slot first and reuse it if already set.
  4. separate cache bitmap from slow path bitmaps, which could be 64-bits with only 1 bit per SV, not 2.
  5. Use a per-SV MethodHandle getter using MethodHandles.guardWithTest() to avoid profile pollution

@dean-long
Member

1- replace a ScopedValue.get() by a dominating ScopedValue.get()
2- move a ScopedValue.get() out of loop
3- streamline code emitted for ScopedValue.get()

I think it might make sense to split 1 and 2, which are independent of the details of get() and put(), from 3. Then we can consider if there are other optimizations we can do around opaque() get() and put(). For example, why can't we replace a get() with the value from a dominating put()? Why can't we eliminate both the put() and get() completely, as long as the value can't "escape" or we deoptimize?

@theRealAph
Contributor

What's a good benchmark to run to show the benefit of this change, or to show the effect of different cache sizes and/or Java implementation changes?

I tried running micro:ScopedValue benchmarks with -Djava.lang.ScopedValue.cacheSize=2 and didn't see a difference. But the new compiler/scoped_value/TestScopedValue.java test fails in compiler.c2.irTests.TestScopedValue.testFastPath16 with the cache size set to 2.

With -jvmArgsAppend -Djava.lang.ScopedValue.cacheSize=4 I get

Benchmark Mode Cnt Score Error Units
ScopedValues.sixValues_ScopedValue avgt 10 11.881 ± 0.017 us/op

before this patch and

ScopedValues.sixValues_ScopedValue avgt 10 0.006 ± 0.001 us/op

after.

Given the right benchmark, there are some experiments I'd like to try, related to the ScopedValue Java implemenation:

1. use only a primary slot probe, no secondary

2. use a deterministic secondary probe (based on the hash), not random

Looks pretty deterministic to me. Every value has two hash codes, primary and secondary, and they are different.

3. fix put() so it will reuse an existing slot.  Currently it blindly set both `victim` and `other` slots.  It seems like it should check the `other` slot first and reuse it if already set.

Put another conditional load in the control flow? I'm not sure that would do much, but OK. I guess I don't know how this would work.

4. separate cache bitmap from slow path bitmaps, which could be 64-bits with only 1 bit per SV, not 2.

I guess that might help.

5. Use a per-SV MethodHandle getter using MethodHandles.guardWithTest() to avoid profile pollution

Interesting. I did a version of the code that used bytecode generation to produce a new accessor method for each scoped value a year or two ago, for that same reason. It did work, but was rather heavyweight.

Re benchmarks: the benchmarks are all there, but the current design is based on principles, as well as benchmark results.

  1. As much as possible, and this is hard to do with just random hashing, I (and I believe, we) want the performance to be linear and predictable, rather than mostly luck. That's what this patch really brings to the party!

  2. We have a hard guarantee, not just a probabilistic one, that if you are repeatedly using two different scoped values, we never have to do a fallback linear search.
    This is hard to capture with a benchmark, but I guess setting java.lang.ScopedValue.cacheSize=2 would do it.

  3. The bind method may do a little bit of extra work to help the get(), but not too much: I can see some cases where binding is done fairly frequently, and it should not be too heavyweight. But I don't know what too heavyweight really means, so...

@theRealAph
Contributor

theRealAph commented May 30, 2024

  1. use only a primary slot probe, no secondary

  2. use a deterministic secondary probe (based on the hash), not random

Looks pretty deterministic to me. Every value has two hash codes, primary and secondary,and they are different.

Ah, I just realized what you meant. Without random replacement in the cache, there is a high probability that scoped values will collide, because the cache is small. Even with only two scoped values accessed alternately, each will repeatedly kick out the other, leading to a linear probe every time. The only way to avoid this without random replacement would be to make the cache considerably larger, and even then it would still occasionally happen.

@dean-long
Member

I guess I don't understand how random replacement is supposed to help. Do you have a pointer to where I can read up on the topic?
I would think that for small numbers of scoped values, assigning hash slots sequentially would work well. Only if there is a collision, use secondary slot.

@theRealAph
Contributor

On 5/30/24 21:54, Dean Long wrote:

I guess I don't understand how random replacement is supposed to help. Do you have a pointer to where I can read up on the topic?

https://en.wikipedia.org/wiki/Cache_replacement_policies#Random_replacement_(RR) has links.

The maths is pretty simple: if you have only one slot for each entry, one time in 16 two scoped locals will hit the same slot, and repeatedly kick each other out. If you have two slots with random replacement, then each get() will retry a couple of times until each entry is in a different slot.

Random replacement is good for software implementation because it doesn't require history, unlike (say) LRU. Maintaining access history in order to use LRU would be as expensive as the actual get().

I would think that for small numbers of scoped values, assigning hash slots sequentially would work well. Only if there is a collision, use secondary slot.

In practice, today's processors speculate both primary and secondary loads in parallel, so there's no added latency for doing both. The cost is almost nothing.

Random replacement is a low-cost way to increase the hit ratio of a cache without incurring the added runtime cost of (say) LRU.
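The eviction argument can be checked with a toy simulation (my own sketch, under assumed slot assignments; not the JDK code). Two keys whose primary slots collide miss on every alternating access when there is only a single probe slot, but with a secondary slot plus random replacement they soon settle into distinct slots and the misses stop:

```java
import java.util.Random;

// Toy model: two colliding keys, A (=0) and B (=1), accessed alternately.
public class RandomReplacementDemo {
    // Single-slot cache: both keys map to the same slot, so every
    // alternating access evicts the other key and misses.
    static int singleSlotMisses(int accesses) {
        int slot = -1, misses = 0;
        for (int i = 0; i < accesses; i++) {
            int key = i & 1;                    // A, B, A, B, ...
            if (slot != key) { misses++; slot = key; }
        }
        return misses;
    }

    // Two candidate slots per key, random victim on a miss. Slot 0 is shared
    // (the collision); slots 1 and 2 are the keys' distinct secondary slots.
    static int twoSlotMisses(int accesses, long seed) {
        Random rnd = new Random(seed);
        int[] primary = {0, 0}, secondary = {1, 2};
        int[] slots = {-1, -1, -1};
        int misses = 0;
        for (int i = 0; i < accesses; i++) {
            int key = i & 1;
            int p = primary[key], s = secondary[key];
            if (slots[p] != key && slots[s] != key) {
                misses++;
                slots[rnd.nextBoolean() ? p : s] = key;   // random replacement
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        System.out.println("single-slot misses: " + singleSlotMisses(1000));
        System.out.println("two-slot misses:    " + twoSlotMisses(1000, 42));
    }
}
```

The single-slot variant misses on all 1000 accesses; the two-slot variant misses only a handful of times before the keys settle into different slots.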

@theRealAph
Contributor

One other thing. Let's say you always check the slot before overwriting it, and only then go to the secondary slot. You find the secondary slot is occupied. The best thing to do then is random replacement. Given that the end effect of just doing random replacement is the same, there's nothing to be gained from the added complexity.

@openjdk openjdk bot removed the rfr Pull request is ready for review label Jun 6, 2024
@bridgekeeper

bridgekeeper bot commented Jun 28, 2024

@rwestrel This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
