
8320649: C2: Optimize scoped values #16966

Open · wants to merge 28 commits into master

Conversation

Contributor

@rwestrel rwestrel commented Dec 5, 2023

This change implements C2 optimizations for calls to
ScopedValue.get(). For instance, in:

v1 = scopedValue.get();
...
v2 = scopedValue.get();

v2 can be replaced by v1 and the second call to get() can be
optimized out. That holds whatever code sits between the two calls, unless a
new mapping for scopedValue is created in between (when that happens,
no optimization is performed for the method being compiled). Hoisting
a get() call out of a loop for a loop-invariant scopedValue should
also be legal in most cases.

ScopedValue.get() is implemented in Java code as a two-step process. A
cache is attached to the current thread object. If the ScopedValue
object is in the cache, then the result of get() is read from
there. Otherwise, a slow call is performed that also inserts the
mapping into the cache. The cache itself is lazily allocated. One
ScopedValue can hash to 2 different indexes in the cache. On a
cache probe, both indexes are checked. As a consequence, probing
the cache is a multi-step process (check whether the cache is
present, check the first index, check the second index if the first
check failed). If the cache is populated early on, then when a method
that calls ScopedValue.get() is compiled, profiling reports the slow
path as never taken and only the read from the cache is compiled.
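The probe sequence above can be sketched in plain Java. This is a hypothetical simulation: the class name, SIZE, slowGet and the index derivation are made-up illustrations, not the real java.lang.ScopedValue implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheProbeSketch {
    static final int SIZE = 16;                        // power of two (illustrative)
    static Object[] keys, vals;                        // lazily allocated; per-thread in the real JDK
    static final Map<Object, Object> bindings = new HashMap<>(); // stands in for the current mappings
    static int slowCalls = 0;

    static Object get(Object sv) {
        if (keys != null) {                            // step 1: is the cache present?
            int i1 = sv.hashCode() & (SIZE - 1);       // first candidate slot
            int i2 = (sv.hashCode() >>> 4) & (SIZE - 1); // second candidate slot (made-up derivation)
            if (keys[i1] == sv) return vals[i1];       // step 2: hit at the first index?
            if (keys[i2] == sv) return vals[i2];       // step 3: hit at the second index?
        }
        return slowGet(sv);                            // slow path: look up and populate the cache
    }

    static Object slowGet(Object sv) {
        slowCalls++;
        if (keys == null) { keys = new Object[SIZE]; vals = new Object[SIZE]; }
        Object v = bindings.get(sv);
        int i1 = sv.hashCode() & (SIZE - 1);
        keys[i1] = sv; vals[i1] = v;                   // install the mapping in the cache
        return v;
    }

    public static void main(String[] args) {
        Object sv = new Object();
        bindings.put(sv, "payload");
        get(sv);                                       // first call takes the slow path
        get(sv);                                       // subsequent calls hit the cache
        System.out.println(get(sv) + " " + slowCalls); // prints: payload 1
    }
}
```

Once the cache warms up, every call resolves in the fast path, which is why profiling typically reports the slow path as never taken.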

To perform the optimizations, I added 3 new node types to C2:

  • the pair
    ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for
    the cache probe

  • a CFG node, ScopedValueGetResultNode, to help locate the result of the
    get() call in the IR graph.

In pseudo code, once the nodes are inserted, the code of a get() is:

hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
  res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
  res = ..; // slow call, possibly inlined. Subgraph can be arbitrarily complex
}
res = ScopedValueGetResult(res)

In the snippet:

v1 = scopedValue.get();
...
v2 = scopedValue.get();

Replacing v2 by v1 is then done by starting from the
ScopedValueGetResult node for the second get() and looking for a
dominating ScopedValueGetResult for the same ScopedValue
object. When one is found, it is used as a replacement. Eliminating
the second get() call is achieved by making
ScopedValueGetHitsInCache always successful if there's a dominating
ScopedValueGetResult and replacing its companion
ScopedValueGetLoadFromCache by the dominating
ScopedValueGetResult.

Hoisting a get() out of a loop is achieved by peeling one iteration of
the loop. The optimization above then finds a dominating get() and
removes the get() from the loop body.
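The peeling transformation can be pictured at the source level. This is an illustrative sketch (the class and the get() stand-in are made up, not compiler IR):

```java
public class PeelSketch {
    static int calls = 0;
    static int get() { calls++; return 42; }  // stands in for a loop-invariant ScopedValue.get()

    // Original loop: one get() per iteration.
    static int original(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += get();
        return sum;
    }

    // After peeling one iteration, the get() in the remaining loop body is
    // dominated by the peeled get() and can be replaced by its result.
    static int peeled(int n) {
        int sum = 0;
        if (n > 0) {
            int v = get();                    // peeled first iteration
            sum += v;
            for (int i = 1; i < n; i++) sum += v; // redundant get() commoned away
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(original(5) + " " + calls); // 5 calls
        calls = 0;
        System.out.println(peeled(5) + " " + calls);   // same result, 1 call
    }
}
```

Both forms compute the same sum, but the peeled form performs the invariant operation once.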

An important case, I think, is when profiling predicts the slow path
to be never taken. Then the code of get() is:

hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
  res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
  trap();
}
res = ScopedValueGetResult(res)

The ScopedValueGetResult doesn't help in that case and is removed early on. The
optimization process then looks for a pair of
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache that
dominates the current pair of
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache and can
replace it. In that case, hoisting a ScopedValue.get() can be
done by predication, and I added special logic to predication for that.

Adding the new nodes to the graph when a ScopedValue.get() call is
encountered is done in several steps:

1- inlining of ScopedValue.get() is delayed and the call is enqueued
for late inlining.

2- Once the graph is fully constructed, for each call to
ScopedValue.get(), a ScopedValueGetResult is added between the
result of the call and its uses.

3- the call is then inlined by parsing the ScopedValue.get() method

4- finally the subgraph that results is pattern matched and the pieces
required to perform the cache probe are extracted and attached to new
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache nodes

There are a couple of reasons for steps 3 and 4:

  • As mentioned above, probing the cache is a multi-step process. Having
    only 2 nodes in a simple graph shape to represent it makes it easier
    to write robust optimizations.

  • The subgraph for the method after parsing contains valuable pieces
    of information: profile data that captures which of the 2 locations
    in the cache is the most likely to cause a hit. That profile data is
    attached to the new nodes.

Removal of redundant nodes is done during loop opts. The ScopedValue
nodes are then expanded. That also happens during loop opts because
once expansion is over, there are opportunities for further
optimizations/clean-up that can only happen during loop opts. During
expansion, ScopedValueGetResult nodes are removed and
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache are expanded
to the multi-step process of probing the cache. Profile data attached
to the nodes is used to assign correct frequencies/counts to the If
nodes. Of the 2 cache locations that are tested, the one most
likely to see a hit (from profile data) is checked first.
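The profile-guided ordering of the two checks can be sketched as follows. The names and counters are illustrative; in the patch this decision is made on If nodes during expansion, not in Java:

```java
public class ProbeOrderSketch {
    // Per-call-site profile counts of hits at each of the two cache slots
    // (hypothetical stand-ins for the profile data attached to the nodes).
    static long hitsAtFirst = 3, hitsAtSecond = 97;

    static int probe(int[] cacheKeys, int key, int i1, int i2) {
        // Order the checks so the statistically hotter slot is tested first.
        int first  = hitsAtFirst >= hitsAtSecond ? i1 : i2;
        int second = hitsAtFirst >= hitsAtSecond ? i2 : i1;
        if (cacheKeys[first] == key)  return first;
        if (cacheKeys[second] == key) return second;
        return -1; // miss: fall back to the slow path
    }

    public static void main(String[] args) {
        int[] cacheKeys = {0, 0, 7, 0};
        // Slot 2 is the profile-hot slot here, so it is probed first.
        System.out.println(probe(cacheKeys, 7, 0, 2));
    }
}
```

Testing the hotter slot first makes the common case resolve with a single load and compare.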

/cc hotspot-compiler


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Issue

  • JDK-8320649: C2: Optimize scoped values (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966
$ git checkout pull/16966

Update a local copy of the PR:
$ git checkout pull/16966
$ git pull https://git.openjdk.org/jdk.git pull/16966/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16966

View PR using the GUI difftool:
$ git pr show -t 16966

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16966.diff

Webrev

Link to Webrev Comment


bridgekeeper bot commented Dec 5, 2023

👋 Welcome back roland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Dec 5, 2023

openjdk bot commented Dec 5, 2023

@rwestrel
The hotspot-compiler label was successfully added.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 5, 2023

@TobiHartmann TobiHartmann left a comment

No review yet, I just performed some quick testing.

The optimized build fails:

[2023-12-05T16:26:12,957Z] open/src/hotspot/share/opto/loopnode.cpp:4745: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const'
[2023-12-05T16:26:12,960Z] open/src/hotspot/share/opto/loopnode.cpp:4761: error: undefined reference to 'ScopedValueGetLoadFromCacheNode::verify() const'
[2023-12-05T16:26:12,964Z] open/src/hotspot/share/opto/loopnode.cpp:4908: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const'
[2023-12-05T16:26:12,967Z] open/src/hotspot/share/opto/loopnode.cpp:4911: error: undefined reference to 'ScopedValueGetLoadFromCacheNode::verify() const'
[2023-12-05T16:26:12,976Z] open/src/hotspot/share/opto/loopopts.cpp:3935: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const'
[2023-12-05T16:26:15,455Z] collect2: error: ld returned 1 exit status

compiler/c2/irTests/TestScopedValue.java fails with -Xcomp on Linux x64:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8de687d1b9, pid=270115, tid=270131
#
# JRE version: Java(TM) SE Runtime Environment (22.0) (fastdebug build 22-internal-2023-12-05-1616186.tobias.hartmann.jdk2)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-12-05-1616186.tobias.hartmann.jdk2, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x12931b9]  PhaseIdealLoop::get_early_ctrl(Node*)+0x4c9

Current CompileTask:
C2:30390 8110    b  4       compiler.c2.irTests.TestScopedValue::testFastPath13 (28 bytes)

Stack: [0x00007f8dc4353000,0x00007f8dc4453000],  sp=0x00007f8dc444d6c0,  free space=1001k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x12931b9]  PhaseIdealLoop::get_early_ctrl(Node*)+0x4c9  (loopnode.hpp:1139)
V  [libjvm.so+0x1293d95]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0x75  (loopnode.cpp:251)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x1293df6]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6  (node.hpp:399)
V  [libjvm.so+0x12960e0]  PhaseIdealLoop::test_and_load_from_cache(Node*, Node*, Node*, Node*, float, float, Node*, Node*&, Node*&, Node*&)+0x820  (loopnode.cpp:4900)
V  [libjvm.so+0x1296b6a]  PhaseIdealLoop::expand_get_from_sv_cache(ScopedValueGetHitsInCacheNode*)+0x82a  (loopnode.cpp:4822)
V  [libjvm.so+0x1297473]  PhaseIdealLoop::expand_scoped_value_get_nodes()+0x243  (loopnode.cpp:4737)
V  [libjvm.so+0x12a45ed]  PhaseIdealLoop::build_and_optimize()+0xf0d  (loopnode.cpp:4672)
V  [libjvm.so+0x9f4ea2]  PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x432  (loopnode.hpp:1113)
V  [libjvm.so+0x9ed945]  Compile::optimize_loops(PhaseIterGVN&, LoopOptsMode)+0x75  (compile.cpp:2248)
V  [libjvm.so+0x9f0253]  Compile::Optimize()+0xfd3  (compile.cpp:2500)
V  [libjvm.so+0x9f37e1]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c21  (compile.cpp:860)
V  [libjvm.so+0x83eca7]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e7  (c2compiler.cpp:134)
V  [libjvm.so+0x9ff17c]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x92c  (compileBroker.cpp:2299)
V  [libjvm.so+0x9ffe08]  CompileBroker::compiler_thread_loop()+0x468  (compileBroker.cpp:1958)
V  [libjvm.so+0xeb93bc]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:720)
V  [libjvm.so+0x17992c6]  Thread::call_run()+0xb6  (thread.cpp:220)
V  [libjvm.so+0x14a30f7]  thread_native_entry(Thread*)+0x127  (os_linux.cpp:787)

compiler/c2/irTests/TestScopedValue.java fails with -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation on Linux x64:

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public static void compiler.c2.irTests.TestScopedValue.testFastPath7()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#C#CALL_OF_METHOD#_", "slowGet"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(Call.*Java.*)+(\\s){2}===.*slowGet )"
           - Matched forbidden node:
             * 501  CallStaticJava  === 370 6 7 8 1 (648 1 1 1 1 1 ) [[ 502 503 504 ]] # Static  java.lang.ScopedValue::slowGet 

compiler/c2/irTests/TestScopedValue.java fails with -XX:TypeProfileLevel=222 on AArch64:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S29331/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/e2db05c4-923c-4a63-923b-5f9870681cc5/runs/c74e986d-15f1-46dc-822b-a41d12c079e0/workspace/open/src/hotspot/share/opto/callGenerator.cpp:929), pid=44590, tid=26115
#  Error: assert(in->Opcode() == Op_LoadP || in->Opcode() == Op_LoadN) failed

Current CompileTask:
C2:766  689    b  4       compiler.c2.irTests.TestScopedValue::testFastPath1 (30 bytes)

Stack: [0x00000001719ec000,0x0000000171bef000],  sp=0x0000000171beb0c0,  free space=2044k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x1130268]  VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x564  (callGenerator.cpp:929)
V  [libjvm.dylib+0x1130a88]  VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*)+0x0
V  [libjvm.dylib+0x5618b0]  print_error_for_unit_test(char const*, char const*, char*)+0x0
V  [libjvm.dylib+0x396ea0]  LateInlineScopedValueCallGenerator::process_result(GraphKit&)+0x2534
V  [libjvm.dylib+0x38f8dc]  CallGenerator::do_late_inline_helper()+0x660
V  [libjvm.dylib+0x4cd2bc]  Compile::inline_scoped_value_calls(PhaseIterGVN&)+0x570
V  [libjvm.dylib+0x4c6944]  Compile::Optimize()+0x210
V  [libjvm.dylib+0x4c54bc]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1228
V  [libjvm.dylib+0x38a590]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e0
V  [libjvm.dylib+0x4e2f48]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x854
V  [libjvm.dylib+0x4e238c]  CompileBroker::compiler_thread_loop()+0x348
V  [libjvm.dylib+0x8bb170]  JavaThread::thread_main_inner()+0x1dc
V  [libjvm.dylib+0x1076548]  Thread::call_run()+0xf4
V  [libjvm.dylib+0xe39138]  thread_native_entry(Thread*)+0x138
C  [libsystem_pthread.dylib+0x726c]  _pthread_start+0x94

compiler/c2/irTests/TestScopedValue.java fails with -XX:+UnlockDiagnosticVMOptions -XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion -XX:+StressRangeCheckElimination -XX:+StressLinearScan on AArch64:

compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method private void compiler.c2.irTests.TestScopedValue.testFastPath1Runner() throws java.lang.Exception
	at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162)
	at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:87)
	at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:822)
	at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249)
	at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164)
Caused by: java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159)
	... 4 more
Caused by: java.lang.RuntimeException: should be compiled
	at compiler.c2.irTests.TestScopedValue.testFastPath1Runner(TestScopedValue.java:87)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	... 6 more

compiler/c2/TestUnsignedByteCompare.java and compiler/codegen/TestSignedMultiplyLong.java fail with -Duse.JTREG_TEST_THREAD_FACTORY=Virtual -XX:-VerifyContinuations intermittent on Windows x64:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S29331/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/e2db05c4-923c-4a63-923b-5f9870681cc5/runs/c74e986d-15f1-46dc-822b-a41d12c079e0/workspace/open/src/hotspot/share/opto/compile.cpp:813), pid=29127, tid=26371
#  assert(IncrementalInline || (_late_inlines.length() == 0 && !has_mh_late_inlines())) failed: incremental inlining is off

Current CompileTask:
C2:5797 3627    b        java.lang.System$2::scopedValueCache (4 bytes)

Stack: [0x0000000171694000,0x0000000171897000],  sp=0x0000000171894bc0,  free space=2050k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x1130268]  VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x564  (compile.cpp:813)
V  [libjvm.dylib+0x1130a88]  VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*)+0x0
V  [libjvm.dylib+0x5618b0]  print_error_for_unit_test(char const*, char const*, char*)+0x0
V  [libjvm.dylib+0x4c5794]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1500
V  [libjvm.dylib+0x38a590]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e0
V  [libjvm.dylib+0x4e2f48]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x854
V  [libjvm.dylib+0x4e238c]  CompileBroker::compiler_thread_loop()+0x348
V  [libjvm.dylib+0x8bb170]  JavaThread::thread_main_inner()+0x1dc
V  [libjvm.dylib+0x1076548]  Thread::call_run()+0xf4
V  [libjvm.dylib+0xe39138]  thread_native_entry(Thread*)+0x138
C  [libsystem_pthread.dylib+0x726c]  _pthread_start+0x94

Just let me know if you need any more information.


rwestrel commented Dec 7, 2023

No review yet, I just performed some quick testing.

Thanks for doing that. All issues should be fixed now (I couldn't reproduce the last one so not 100% sure about that one).

@dean-long (Member) commented:

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

  • My first reaction was why does this need to be so complicated? Can't we treat the slow path and cache implementation as a black box, and just common up the high-level get()? In my mind the ScopedValue get should be similar to a read from a "condy" dynamic constant.
  • The reason for breaking up the operations into individual nodes seems to be because of the profiling information. So I'm wondering how much this helps, given the added complexity.
  • I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors. The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.
  • Are we able to optimize a get() on a constant/final ScopedValue into a simple array load at a constant offset?
  • Needing to do things like treating ScopedValueGetHitsInCache as always successful gives me a bad feeling for some reason, and seems unnecessary if we did more at a higher (macro?) level rather than eagerly expanding the high-level operation into individual nodes.

@theRealAph (Contributor) commented:

A couple of answers:

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

* I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors.  The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.

Binding and get() are usually separated by a long way. It's a common pattern to use get() inside a loop when a ScopedValue is used to hold a capability object which is private within a library context.

* Are we able to optimize a get() on a constant/final ScopedValue into a simple array load at a constant offset?

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.


rwestrel commented Jan 5, 2024

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

* My first reaction was why does this need to be so complicated?

That's a fair reaction.

Can't we treat the slow path and cache implementation as a black box, and just common up the high-level get()? In my mind the ScopedValue get should be similar to a read from a "condy" dynamic constant.

Initially, I thought about delaying the inlining of get() methods and simply having a pass that looks for get() calls with the same inputs. I don't think that works well because the current late-inlining framework can't delay inlining very late. We don't run loop opts before we're done with inlining, for instance. If we wanted to hoist a call out of a loop we would need loop opts: it's likely a call to get() depends on a null check that we would need to hoist first.

The other thing about optimizing get() calls is that they are heavyweight nodes (a high-level get() macro node would be very similar to a get() call node whichever way you look at it). We don't know how to hoist a call out of a loop. A call acts as a barrier on the entire memory state and gets in the way of memory optimizations. If profiling reports the slow path as never taken, then the shape of the get() becomes lighter weight. It doesn't disrupt other optimizations. Probing the cache acts as a load plus a test, which we do know how to hoist out of a loop.

It felt to me that it would be fairly common for the slow path not to be needed and, given that the shape without the slow path is much easier to optimize, it was important to expose early on whether the slow path was there or not.

* The reason for breaking up the operations into individual nodes seems to be because of the profiling information.  So I'm wondering how much this helps, given the added complexity.

The thing about get() is that in simple cases, it optimizes well because of profile data. A get() call, once inlined, can essentially be hoisted out of a loop if all goes well. It doesn't take much for these simple optimizations on get() to stop happening. The goal of this patch is to bring consistency and have the optimizations work well in all sorts of scenarios. But it would be a hard sell if the simple cases didn't work as well as they do without the patch. And I believe that requires profile data.

* I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors.  The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.

Eliminating get() calls with the same inputs may not be common in Java code, but that transformation is a building block for other optimizations. Hoisting a get() out of a loop can be achieved by peeling one iteration and letting the get() in the loop body be removed because it's redundant with the one from the peeled iteration. Also, the code that C2 optimizes once inlining has happened and dead paths have been trimmed doesn't necessarily look like the Java code the programmer wrote.

* Needing to do things like treat ScopedValueGetHitsInCache as always successful give be a bad feeling for some reason, and seem unnecessary if we did more at a higher (macro?) level rather than eagerly expanding the high-level operation into individual nodes.

I think my comments above cover that one.

@dean-long (Member) commented:

Thanks @rwestrel, that helps. I have no objections to this change, but I don't understand C2 enough to do a deeper review.

@dean-long (Member) commented:

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.

@theRealAph I guess it boils down to whether the hash value can be treated as a compile-time constant, which seems possible because it's marked final.

@theRealAph (Contributor) commented:

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.

@theRealAph I guess it boils down to whether the hash value can be treated as a compile-time constant, which seems possible because it's marked final.

It always has been in the tests I've done. One of the interesting challenges with this work has been to make sure scoped value performance doesn't regress. A great advantage of this PR is that a dedicated scoped value optimization helps to make such regressions less likely.
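As an aside, when the hash is a compile-time constant, the probe indexes fold to constants too, so the cache lookup reduces to array loads at fixed offsets. A minimal sketch — the constant and the index derivation are assumptions for illustration, not the JDK's actual hashing scheme:

```java
public class ConstantIndexSketch {
    static final int SIZE = 16;               // cache size, power of two (illustrative)
    static final int HASH = 0x9E3779B9;       // a hypothetical final, constant-foldable hash
    // Both indexes are compile-time constants once HASH is constant:
    static final int FIRST  = HASH & (SIZE - 1);
    static final int SECOND = (HASH >>> 4) & (SIZE - 1);

    public static void main(String[] args) {
        System.out.println(FIRST + " " + SECOND); // fixed offsets, known at compile time
    }
}
```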

@TobiHartmann (Member) commented:

Tests all pass now.

@eme64 eme64 left a comment

I just studied the background and hope to look into this in the next days.

Personal wishlist: can you add a case where this optimization enables vectorization? Or do your optimizations happen too late for that?

// }
// MyLong long2 = (MyLong)scopedValue.get();
// return long1.getValue() + long2.getValue();
// }
Contributor:

Are you still working on this?

Contributor Author:

No. I couldn't make the test work unfortunately, so I wasn't sure whether to leave the test commented out (in case someone revisits that later) or not.

Contributor:

Maybe have the body of the test put in, and the IR-rules commented out, with a follow-up RFE for investigation, if you think there is something one can do about it?

@eme64 eme64 left a comment

Nice work @rwestrel
I'm sending out a first batch of comments, more coming later.

@@ -105,7 +105,8 @@ Node* BarrierSetC2::store_at_resolved(C2Access& access, C2AccessValue& val) cons
assert(access.is_opt_access(), "either parse or opt access");
C2OptAccess& opt_access = static_cast<C2OptAccess&>(access);
Node* ctl = opt_access.ctl();
MergeMemNode* mm = opt_access.mem();
assert(opt_access.mem()->is_MergeMem(), "");
MergeMemNode* mm = opt_access.mem()->as_MergeMem();
Contributor:

Does as_MergeMem not assert is_MergeMem?

Contributor:

Thanks for the update ✔️

// }
// continue:
//
// slow_call:
Contributor:

Makes it look like continue falls through to slow_call, which is not what you want, right?

Contributor:

Thanks for the update ✔️

src/hotspot/share/opto/callGenerator.cpp (outdated; resolved)
// the result of the parsing of the Java code where the success path ends with a Halt node. The reason for that is
// that some paths may end with an uncommon trap and if one traps, we want the trap to be recorded for the right bci.
// When the ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache pair is expanded, split-if finds the duplicate
// logic and cleans it up.
Contributor:

I would prefer the comment section at the beginning of the method. Otherwise I may start reading down linearly, reverse-engineer the code, and only discover this afterwards... 😅

Contributor:

Thanks for the updates ✔️

_process_result = v;
}

virtual void process_result(GraphKit& kit) {
Contributor:

It would be really nice if you refactored this huge method (6+ pages of code) into smaller units.
It would for example make separation of pattern-matching and transformation easier to see.

slow_call = c->as_CallStaticJava();
assert(slow_call->method()->intrinsic_id() == vmIntrinsics::_SVslowGet, "");
} else {
assert(c->is_Proj() || c->is_Catch(), "");
Contributor:

Suggested change
assert(c->is_Proj() || c->is_Catch(), "");
assert(c->is_Proj() || c->is_Catch(), "unexpected node in pattern matching");

Unique_Node_List wq;
wq.push(kit.control());
for (uint i = 0; i < wq.size(); ++i) {
Node* c = wq.at(i);
Contributor:

Could we assert that these are all CFG nodes? And give it a more expressive name?

}
}
// get_first_iff/get_second_iff contain the first/second check we ran into during the graph traversal but they may
// not be the first/second one in execution order. Perform another traversal to figure out which is first.
Contributor:

Why can this not be done in the first traversal, and why does this (down) traversal do the right thing?

Can we assert that c is always CFG?
Please mention in a comment that we are only traversing the same CFG nodes from the first traversal.

Contributor Author:

Why can this not be done in the first traversal, and why does this (down) traversal do the right thing?

The first traversal starts from the end of the method and follows control paths until it reaches the Thread.scopedValueCache() call. Given the shape of the method, and given that some paths may have been trimmed and end with an uncommon trap, it could reach either the first or the second if that probes the cache first.

// get_first_iff/get_second_iff contain the first/second check we ran into during the graph traversal but they may
// not be the first/second one in execution order. Perform another traversal to figure out which is first.
if (get_second_iff != nullptr) {
Node_Stack stack(0);
Contributor:

No visited set. Can this trigger an exponential explosion with if/region diamonds?

Contributor Author:

No visited set. Can this trigger an exponential explosion with if/region diamonds?

It only follows the control subgraph for the ScopedValue.get() which is fairly simple.

get_first_iff->in(1)->as_Bool()->_test._test == BoolTest::ne ? 0 : 1);
CallStaticJavaNode* get_first_iff_unc = get_first_iff_failure->is_uncommon_trap_proj(Deoptimization::Reason_none);
if (get_first_iff_unc != nullptr) {
// first cache check never hits, keep only the second.
Contributor:

I'm struggling to understand:
We still have an unc-trap for the first. So we never failed so far, right? So we always found it in the cache, or am I wrong?
We are not removing this unc-trap though, right?

Contributor Author

The ScopedValue.get() code probes 2 cache locations. If, when pattern matching the get() subgraph:

  • we only find a single if that probes the cache, then, according to profile data, there was always a hit at the first cache location.

  • we find 2 ifs, then the first and second locations were probed. If the first if's other branch is to an uncommon trap, then that location never saw a cache hit. In that case, when the ScopedValueGetHitsInCacheNode is expanded, only code to probe the second location is added back to the IR.
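The two-probe lookup being pattern-matched can be sketched in plain Java. This is a hypothetical model for illustration only: the flat key/value array layout, the index parameters, and the slowGet stand-in are assumptions, not the actual java.lang.ScopedValue implementation.

```java
public class TwoProbeCache {
    // Stand-in for the slow path: installs the mapping at the first slot.
    static Object slowGet(Object[] cache, int firstIndex, Object key) {
        cache[firstIndex] = key;
        cache[firstIndex + 1] = "computed";
        return cache[firstIndex + 1];
    }

    // Fast path shape: check the cache exists, probe the first slot, probe
    // the second slot, and only then fall back to the slow path.
    static Object get(Object[] cache, int firstIndex, int secondIndex, Object key) {
        if (cache != null) {
            if (cache[firstIndex] == key)  return cache[firstIndex + 1];  // first probe hits
            if (cache[secondIndex] == key) return cache[secondIndex + 1]; // second probe hits
        }
        return slowGet(cache, firstIndex, key);
    }

    public static void main(String[] args) {
        Object[] cache = new Object[16];
        Object key = new Object();
        Object v1 = get(cache, 4, 10, key); // miss: slow path installs the mapping
        Object v2 = get(cache, 4, 10, key); // hit at the first probe
        System.out.println(v1 == v2);       // true
    }
}
```

Profile data then decides which of these probes (and whether the slow-path fallback) ends up in the compiled code at all.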

Contributor

@eme64 eme64 left a comment

Second batch of comments.

kit.set_i_o(io);

// remove the scopedValueCache() call
CallProjections scoped_value_cache_projs = CallProjections();
Contributor

Suggested change
CallProjections scoped_value_cache_projs = CallProjections();
CallProjections scoped_value_cache_projs;

Is the assignment really necessary, or style-wise preferable? I see you use it without elsewhere.

// Now move right above the scopedValueCache() call
Node* mem = scoped_value_cache->in(TypeFunc::Memory);
Node* c = scoped_value_cache->in(TypeFunc::Control);
Node* io = scoped_value_cache->in(TypeFunc::I_O);
Contributor

I would use input_mem, input_ctrl, input_io. Then the replacements below would read more intuitively.

first_index == nullptr ? C->top() : first_index,
second_index == nullptr ? C->top() : second_index);

// It will later be expanded back to all the checks so record profile data
Contributor

Should we also copy the node info (e.g. line number etc)?

sv_hits_in_cache->set_profile_data(2, get_second_iff->_fcnt, get_second_prob);
} else {
sv_hits_in_cache->set_profile_data(2, 0, 0);
}
Contributor

Another case of code duplication. Why not write a method that extracts cnt, prob for an iff?`Or maybe that already exists?

assert(sv_hits_in_cachex == sv_hits_in_cache, "");

// And compute the probability of a miss in the cache
float prob;
Contributor

Suggested change
float prob;
float cache_miss_prob;

r->init_req(2, in_cache);

// ScopedValueGetLoadFromCache is a single node that represents the result of a hit in the cache
Node* cache_value = kit.gvn().transform(new ScopedValueGetLoadFromCacheNode(C, in_cache, sv_hits_in_cache));
Contributor

Suggested change
Node* cache_value = kit.gvn().transform(new ScopedValueGetLoadFromCacheNode(C, in_cache, sv_hits_in_cache));
Node* sv_load_from_cache = kit.gvn().transform(new ScopedValueGetLoadFromCacheNode(C, in_cache, sv_hits_in_cache));

For consistency with sv_hits_in_cache and the node class name.

do_intrinsic(_SVslowGet, java_lang_ScopedValue, slowGet_name, void_object_signature, F_R) \
do_name( slowGet_name, "slowGet") \
do_intrinsic(_SVCacheInvalidate, java_lang_ScopedValue_Cache, invalidate_name, int_void_signature, F_S) \
do_name( invalidate_name, "invalidate") \
Contributor

What prevents us from writing out _scopedValueGet etc, just like the other names here?

Contributor

Thanks for the update ✔️


ProjNode* result_out() {
return proj_out_or_null(Result);
}
Contributor

Either verify that we have not null, or else rename to result_out_or_null.

Contributor

Thanks for the update ✔️

@@ -461,6 +462,7 @@ void Compile::disconnect_useless_nodes(Unique_Node_List& useful, Unique_Node_Lis
remove_useless_late_inlines( &_string_late_inlines, useful);
remove_useless_late_inlines( &_boxing_late_inlines, useful);
remove_useless_late_inlines(&_vector_reboxing_late_inlines, useful);
remove_useless_late_inlines( &_scoped_value_late_inlines, useful);
Contributor

Suggested change
remove_useless_late_inlines( &_scoped_value_late_inlines, useful);
remove_useless_late_inlines( &_scoped_value_late_inlines, useful);

Contributor

@eme64 eme64 Jan 17, 2024

Make it align with the code above

And:
We keep adding more and more of these...
Soon we will need to refactor this to a list of lists 😅

@@ -2025,6 +2030,73 @@ void Compile::inline_boxing_calls(PhaseIterGVN& igvn) {
}
}

void Compile::inline_scoped_value_calls(PhaseIterGVN& igvn) {
if (_scoped_value_late_inlines.length() > 0) {
Contributor

Rather than indenting everything, I would just check _scoped_value_late_inlines.is_empty() and return.

Contributor

@eme64 eme64 left a comment

Third comment batch.
I still have not looked at some parts, but I think this is enough detail for today ;)

}
C->set_has_scoped_value_get_nodes(true);
CallNode* call = cg->call_node();
CallProjections projs;
Contributor

Suggested change
CallProjections projs;
CallProjections call_projs;

call->extract_projections(&projs, true);
Node* sv = call->in(TypeFunc::Parms);
Node* control_out = projs.fallthrough_catchproj;
Node* res = projs.resproj;
Contributor

can we have longer and more descriptive names for sv and res please 🙏 😃

gvn->record_for_igvn(control_out);
res = res->clone();
gvn->set_type_bottom(res);
gvn->record_for_igvn(res);
Contributor

Why do you clone these? (maybe add a comment)

case Op_ScopedValueGetResult:
case Op_ScopedValueGetHitsInCache:
case Op_ScopedValueGetLoadFromCache: {
ShouldNotReachHere();
Contributor

Why? Add a comment!

Node* ScopedValueGetLoadFromCacheNode::scoped_value() const {
Node* hits_in_cache = in(1);
assert(hits_in_cache->Opcode() == Op_ScopedValueGetHitsInCache, "");
return ((ScopedValueGetHitsInCacheNode*)hits_in_cache)->scoped_value();
Contributor

Why not add the necessary bits to the class so you can use as_ScopedValueGetLoadFromCache()?

wq.push(u);
}
}
} else if (n->Opcode() != Op_Halt) {
Contributor

So a path without a call and only a Halt node is also an uncommon trap?

}
return res;
}

Contributor

Code duplication warning 😉
Not sure what is the best solution though.

Contributor

I think you could remove duplication with a simple "implement" function, that takes a parameter "want_unique". Then if you find the element, and don't want unique, you return. If you are looking for unique, you just continue and check that you don't find it again.

@@ -614,6 +614,16 @@ void Type::Initialize_shared(Compile* current) {
TypeInstKlassPtr::OBJECT = TypeInstKlassPtr::make(TypePtr::NotNull, current->env()->Object_klass(), 0);
TypeInstKlassPtr::OBJECT_OR_NULL = TypeInstKlassPtr::make(TypePtr::BotPTR, current->env()->Object_klass(), 0);

const Type **fgetfromcache =(const Type**)shared_type_arena->AmallocWords(3*sizeof(Type*));
Contributor

Suggested change
const Type **fgetfromcache =(const Type**)shared_type_arena->AmallocWords(3*sizeof(Type*));
const Type** fgetfromcache =(const Type**)shared_type_arena->AmallocWords(3*sizeof(Type*));

fgetfromcache[1] = TypeInstPtr::BOTTOM;
fgetfromcache[2] = TypeAryPtr::OOPS;
TypeTuple::make(3, fgetfromcache);
const Type **fsvgetresult =(const Type**)shared_type_arena->AmallocWords(2*sizeof(Type*));
Contributor

Suggested change
const Type **fsvgetresult =(const Type**)shared_type_arena->AmallocWords(2*sizeof(Type*));
const Type** fsvgetresult =(const Type**)shared_type_arena->AmallocWords(2*sizeof(Type*));

@@ -746,6 +746,7 @@ class TypeTuple : public Type {
static const TypeTuple *LONG_PAIR;
static const TypeTuple *INT_CC_PAIR;
static const TypeTuple *LONG_CC_PAIR;
static const TypeTuple *SV_GET_RESULT;
Contributor

Suggested change
static const TypeTuple *SV_GET_RESULT;
static const TypeTuple* SV_GET_RESULT;

Contributor

@eme64 eme64 left a comment

Batch 1 of today.

Node* first_index = nullptr; // index in the cache for first hash
Node* second_index = nullptr; // index in the cache for second hash
CallStaticJavaNode* slow_call = nullptr; // slowGet() call if any
{
Contributor

This setup really looks like it should be a class, maybe called ScopedValueGetPatternMatcher?
All your variables here could be fields, and the scope below a method, or even split into multiple methods.

}


bool PhaseIdealLoop::loop_predication_for_scoped_value_get(IdealLoopTree* loop, IfProjNode* if_success_proj,
Contributor

Add a short comment above, that we are trying to hoist the If for a ScopedValueGetHitsInCache out of the loop, if possible.
You have one on the first line, but I think generally we place them at the top, right?

Contributor

Well, now looking at the method... maybe a longer comment with a picture or pseudocode would be helpful.
It would greatly help me in reviewing the code - otherwise I basically have to draw the picture on a piece of paper myself before understanding it ;)

Contributor

I like it the most when there is ascii art that uses the variable names in the code below.

Contributor

I'll read the code in a later review.

// A ScopedValueGetHitsInCache check is loop invariant if the scoped value object it is applied to is loop invariant
BoolNode* bol = iff->in(1)->as_Bool();
if (bol->in(1)->Opcode() == Op_ScopedValueGetHitsInCache && invar.is_invariant(((ScopedValueGetHitsInCacheNode*)bol->in(1))->scoped_value()) &&
invar.is_invariant(((ScopedValueGetHitsInCacheNode*)bol->in(1))->index1()) && invar.is_invariant(((ScopedValueGetHitsInCacheNode*)bol->in(1))->index2())) {
Contributor

Please refactor this if:

  • if we don't take it, you return false, so make it a bailout. This allows you to already bailout if bol->in(1)->Opcode() == Op_ScopedValueGetHitsInCache fails.
  • After that check, you can already have this line: ScopedValueGetHitsInCacheNode* hits_in_the_cache = (ScopedValueGetHitsInCacheNode*) bol->in(1);, and simplify the other 3 conditions of the if, make it more readable.
  • But make them bailouts as well, instead of indenting the rest of the function.

Contributor

I don't really know how is_invar works. But why is a use-node not automatically variant, if a def-node is variant. Or stated in other terms: why not just check invar.is_invariant(hits_in_the_cache)?

bool PhaseIdealLoop::loop_predication_for_scoped_value_get(IdealLoopTree* loop, IfProjNode* if_success_proj,
ParsePredicateSuccessProj* parse_predicate_proj,
Invariance &invar, Deoptimization::DeoptReason reason,
IfNode* iff, IfProjNode*&new_predicate_proj) {
Contributor

Suggested change
IfNode* iff, IfProjNode*&new_predicate_proj) {
IfNode* iff, IfProjNode* &new_predicate_proj) {

if (TraceLoopPredicate) {
tty->print("Predicate invariant if: %d ", new_predicate_iff->_idx);
loop->dump_head();
} else if (TraceLoopOpts) {
Contributor

Why not have them as separate ifs? What if someone enables both, will they not miss a line?

Contributor Author

@rwestrel rwestrel Jan 30, 2024

That's the code pattern used elsewhere in PhaseIdealLoop::loop_predication_impl_helper().

return progress;
}

void PhaseIdealLoop::expand_get_from_sv_cache(ScopedValueGetHitsInCacheNode* get_from_cache) {
Contributor

Suggested change
void PhaseIdealLoop::expand_get_from_sv_cache(ScopedValueGetHitsInCacheNode* get_from_cache) {
void PhaseIdealLoop::expand_sv_get_hits_in_cache_and_load_from_cache(ScopedValueGetHitsInCacheNode* get_from_cache) {

Contributor

And again, a picture / pseudocode of the transformation would help immensely.

set_ctrl(zero, C->root());
_igvn.replace_input_of(iff, 1, zero);
_igvn.replace_node(get_from_cache, C->top());

Contributor

Suggested change

return;
}

Node* load_of_cache = get_from_cache->in(1);
Contributor

Suggested change
Node* load_of_cache = get_from_cache->in(1);
Node* cache_adr = get_from_cache->in(1);

}

Node* mem() const {
return in(Memory);
Contributor

Why not verify that this is a MemNode?

Contributor Author

MemNodes are not the only ones carrying memory state (projection for a call, membar etc).

Node* first_index = get_from_cache->index1();
Node* second_index = get_from_cache->index2();

if (first_index == C->top() && second_index == C->top()) {
Contributor

could this not be done during igvn?

Contributor Author

What happens here is that the cache was always observed to be null, so no code to probe the cache was added to the IR. When that happens, the optimizations still apply (i.e. there could be a dominated ScopedValue.get() that can be replaced by this one, or a dominating one that can replace this one). This should also be very uncommon. So the nodes that are put in place to enable optimizations are left in until expansion so as not to miss optimization opportunities.

Co-authored-by: Andrey Turbanov <turbanoff@gmail.com>
@rwestrel
Contributor Author

Personal wishlist: can you add a case where this optimization enables vectorization? Or do your optimizations happen too late for that?

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

@eme64
Contributor

eme64 commented Jan 25, 2024

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation 😞

Co-authored-by: Emanuel Peter <emanuel.peter@oracle.com>
@rwestrel
Contributor Author

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation 😞

Even if that was possible, ScopedValue get loads are from the cache indexed by a hash stored as a field in the ScopedValue object. I'm not sure how you would be able to tell which of the loads from several get() calls are contiguous in memory.

@rwestrel
Contributor Author

rwestrel commented May 2, 2024

I am wondering if it would make sense to have some scoped_value.hpp/cpp, where you can put all your new classes. This would also allow you to put documentation about the general approach at the top of the scoped_value.hpp file. Currently, the code is spread all over, and it would be hard to know where one could find a good summary of the whole optimization.

I moved most of the scoped value specific code to scoped_value.hpp/cpp in the new commit.

@rwestrel
Contributor Author

rwestrel commented May 2, 2024

I also removed Node::find_unique_out_with() and replaced it with Node* find_out_with(int opcode, bool want_unique = false). I'll look into the automatic casting but I'd like to possibly do it as a separate clean up.

@openjdk openjdk bot removed the rfr Pull request is ready for review label May 2, 2024
@openjdk openjdk bot added the rfr Pull request is ready for review label May 2, 2024
@rwestrel
Contributor Author

rwestrel commented May 2, 2024

@eme64 change is ready for another review

@eme64
Contributor

eme64 commented May 6, 2024

@rwestrel I feel like I am heavily stepping on your toes now....
Can you please do refactoring in a separate prior PR? This change is now 3K+ lines, and even reading through it all takes me more than a day; I simply cannot commit this many hours at a time.

I'm thinking in particular about your most recent changes with:

  • class Invariance
  • estimate_if_peeling_possible

Don't get me wrong: I like those refactorings, but they should be done separately.

If you can find anything else that could be done separately, that would help greatly.

I have been painstakingly separating my SuperWord PR's into more reviewable patches, and I do get quicker reviews that way.

My concern: I think the code is now in a state that can be understood (if one spends a day reading it all), but it is hard for me to say that it is correct. If I now approve this patch, then a subsequent reviewer will pay less attention, hence, I feel like I cannot just approve it too quickly now.

If I am too annoying, feel free to ask someone else to review and I will just step back. Maybe @theRealAph wants to review for a while?

@eme64
Contributor

eme64 commented May 6, 2024

@rwestrel one idea to split things here:

  • Early inline of ScopedValue methods
  • Parsing to IR nodes and expansion back.
  • Optimization
  • Tests

This way, I can spend only a few hours on one at a time, and we can get this done.
Of course, you cannot really integrate an individual one, maybe there is a way to use skara for dependent PR's?

@rwestrel
Contributor Author

rwestrel commented May 6, 2024

I'm thinking in particular about your most recent changes with:

* `class Invariance`

* `estimate_if_peeling_possible`

Don't get me wrong: I like those refactorings, but they should be done separately.

The problem I see is that they have little value unless this patch is integrated as it is. What if another reviewer thinks it's better to keep everything related to loop predication together? There's no need to change the class Invariance then.

@rwestrel
Contributor Author

rwestrel commented May 6, 2024

@rwestrel one idea to split things here:

* Early inline of ScopedValue methods

* Parsing to IR nodes and expansion back.

* Optimization

* Tests

This way, I can spend only a few hours on one at a time, and we can get this done. Of course, you cannot really integrate an individual one, maybe there is a way to use skara for dependent PR's?

Would one commit per line above work? Or do you think it needs to be different PRs?

@eme64
Contributor

eme64 commented May 6, 2024

@rwestrel
Just using commits is probably not really helpful. What would you do if there needs to be an update to commit 1, requested by a reviewer?

Honestly, I would like to take a break from this for now.
I leave it up to you how to present it in a way that is easier to review.
Once you get someone to review and accept it, I can see if I find time to review again.

I think the code is significantly better/readable than when we first started. So if someone like @vnkozlov simply scans and approves it, and as such takes the responsibility of "first reviewer", then I'm totally fine with that.

@vnkozlov
Contributor

vnkozlov commented May 7, 2024

I want to see performance numbers on x64 and aarch64 before starting looking on it. It would be nice to have data for all micros test/micro/org/openjdk/bench/java/lang/ScopedValues*.java

Put results into JBS and post short summary here.

You can compare by disable/enable new intrinsics.

@theRealAph
Contributor

I want to see performance numbers on x64 and aarch64 before starting looking on it. It would be nice to have data for all micros test/micro/org/openjdk/bench/java/lang/ScopedValues*.java

Put results into JBS and post short summary here.

You can compare by disable/enable new intrinsics.

I'm on it.

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label May 9, 2024
@theRealAph
Contributor

I've spoken to some senior JDK developers and their feeling is that this patch is too specific to the current scoped value implementation and too complex to go into HotSpot, especially for a feature like scoped values that is not yet out of preview.
This PR is good. It does everything needed to generate better, smaller code for scoped values. It also means that scoped values have a smaller memory footprint, a real win-win. However, I think we're going to have to park this PR for now.
I think we should revisit it when scoped values is out of preview.
Thanks, and I'm sorry that you put so much work into this patch for it not to be committed.

@eme64
Contributor

eme64 commented May 23, 2024

I know this is "parked" now, but there are some internal conversations happening.

One question that @dean-long and @rose00 had:
If there is more than one ScopedValue, what about profile pollution?

@rwestrel
Contributor Author

There are 3 optimizations that this patch performs:

1- replace a ScopedValue.get() by a dominating ScopedValue.get()
2- move a ScopedValue.get() out of loop
3- streamline code emitted for ScopedValue.get()

There are 2 ScopedValue.get() patterns that are handled:

1- when profile reports that the slow path that updates the cache is not taken: ScopedValue.get() always hits in the cache
2- when profile reports that the slow path is taken and the code that updates the cache is included in the compiled code

Obviously, before ScopedValue.get() can use the cache, the cache has to be updated, so the slow path is taken at some point. But because HotSpot doesn't profile the first invocations of a method, profile data can still report the slow path as never taken. That's likely what happens with a simple micro benchmark. There can also be multiple ScopedValues and the profile can still report the slow path as not taken. It's more a matter of when the cache is updated than of how many ScopedValues there are.

The patch performs optimizations 1-, 2- and 3- for patterns 1- and 2- but it does so better for pattern 1- than 2-. If the slow path is included in compiled code, then only a ScopedValue.get call that dominates the backedge of a loop is hoisted out of the loop. That's because hoisting is a 2 step process in the case of pattern 2-: peel one iteration of the loop and replace the ScopedValue.get in the loop with the one from the peeled iteration. When the slow path is compiled in the method, a lot of extra code is also included and that likely disrupts other optimizations that might be needed before ScopedValue.get can be optimized. For instance, the slow path likely comes with a non inlined call that could get in the way of memory subgraph optimizations. That's why I made sure the patch handles both patterns.

I thought about always speculating initially that the slow path is not taken when compiling ScopedValue.get or trying to find some way to work around profile pollution. I also thought about having cleverer ways of optimizing pattern 2-. But the patch felt complicated enough and when @theRealAph experimented with it, he reported that it was doing ok the way it is.
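The first two optimizations operate on code shapes like the following. SV here is a hypothetical immutable stand-in used only to show the shapes C2 looks for; the real subject is java.lang.ScopedValue, whose get() goes through the cache machinery described above.

```java
public class GetOptShapes {
    // Hypothetical stand-in: get() is side-effect free and stable as long as
    // no new binding is created, which is what makes the rewrites legal.
    static final class SV {
        private final String v;
        SV(String v) { this.v = v; }
        String get() { return v; }
    }

    static int lenTwice(SV sv) {
        String v1 = sv.get();
        String v2 = sv.get(); // optimization 1-: C2 can reuse v1 here
        return v1.length() + v2.length();
    }

    static int lenInLoop(SV sv, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += sv.get().length(); // optimization 2-: loop invariant, hoistable
        }
        return sum;
    }

    public static void main(String[] args) {
        SV sv = new SV("hello");
        System.out.println(lenTwice(sv));     // 10
        System.out.println(lenInLoop(sv, 3)); // 15
    }
}
```

Optimization 3- has no source-level equivalent: it is about emitting a tighter instruction sequence for the cache probe itself.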

@dean-long
Member

What's a good benchmark to run to show the benefit of this change, or to show the effect of different cache sizes and/or Java implementation changes?

I tried running micro:ScopedValue benchmarks with -Djava.lang.ScopedValue.cacheSize=2 and didn't see a difference. But the new compiler/scoped_value/TestScopedValue.java test fails in compiler.c2.irTests.TestScopedValue.testFastPath16 with the cache size set to 2.

Given the right benchmark, there are some experiments I'd like to try, related to the ScopedValue Java implemenation:

  1. use only a primary slot probe, no secondary
  2. use a deterministic secondary probe (based on the hash), not random
  3. fix put() so it will reuse an existing slot. Currently it blindly sets both victim and other slots. It seems like it should check the other slot first and reuse it if already set.
  4. separate cache bitmap from slow path bitmaps, which could be 64-bits with only 1 bit per SV, not 2.
  5. Use a per-SV MethodHandle getter using MethodHandles.guardWithTest() to avoid profile pollution

@dean-long
Member

1- replace a ScopedValue.get() by a dominating ScopedValue.get()
2- move a ScopedValue.get() out of loop
3- streamline code emitted for ScopedValue.get()

I think it might make sense to split 1 and 2, which are independent of the details of get() and put(), from 3. Then we can consider if there are other optimizations we can do around opaque() get() and put(). For example, why can't we replace a get() with the value from a dominating put()? Why can't we eliminate both the put() and get() completely, as long as the value can't "escape" or we deoptimize?

@theRealAph
Contributor

What's a good benchmark to run to show the benefit of this change, or to show the effect of different cache sizes and/or Java implementation changes?

I tried running micro:ScopedValue benchmarks with -Djava.lang.ScopedValue.cacheSize=2 and didn't see a difference. But the new compiler/scoped_value/TestScopedValue.java test fails in compiler.c2.irTests.TestScopedValue.testFastPath16 with the cache size set to 2.

With -jvmArgsAppend -Djava.lang.ScopedValue.cacheSize=4 I get

Benchmark Mode Cnt Score Error Units
ScopedValues.sixValues_ScopedValue avgt 10 11.881 ± 0.017 us/op

before this patch and

ScopedValues.sixValues_ScopedValue avgt 10 0.006 ± 0.001 us/op

after.

Given the right benchmark, there are some experiments I'd like to try, related to the ScopedValue Java implemenation:

1. use only a primary slot probe, no secondary

2. use a deterministic secondary probe (based on the hash), not random

Looks pretty deterministic to me. Every value has two hash codes, primary and secondary, and they are different.

3. fix put() so it will reuse an existing slot.  Currently it blindly set both `victim` and `other` slots.  It seems like it should check the `other` slot first and reuse it if already set.

Put another conditional load in the control flow? I'm not sure that would do much, but OK. I guess I don't know how this would work.

4. separate cache bitmap from slow path bitmaps, which could be 64-bits with only 1 bit per SV, not 2.

I guess that might help.

5. Use a per-SV MethodHandle getter using MethodHandles.guardWithTest() to avoid profile pollution

Interesting. I did a version of the code that used bytecode generation to produce a new accessor method for each scoped value a year or two ago, for that same reason. It did work, but was rather heavyweight.

Re benchmarks: the benchmarks are all there, but the current design is based on principles, as well as benchmark results.

  1. As much as possible, and this is hard to do with just random hashing, I (and I believe, we) want the performance to be linear and predictable, rather than mostly luck. That's what this patch really brings to the party!

  2. We have a hard guarantee, not just a probabilistic one, that if you are repeatedly using two different scoped values, we never have to do a fallback linear search.
    This is hard to capture with a benchmark, but I guess setting java.lang.ScopedValue.cacheSize=2 would do it.

  3. The bind method may do a little bit of extra work to help the get(), but not too much: I can see some cases where binding is done fairly frequently, and it should not be too heavyweight. But I don't know what too heavyweight really means, so...

@theRealAph
Contributor

theRealAph commented May 30, 2024

  1. use only a primary slot probe, no secondary

  2. use a deterministic secondary probe (based on the hash), not random

Looks pretty deterministic to me. Every value has two hash codes, primary and secondary,and they are different.

Ah, I just realized what you meant. Without random replacement in the cache, there is a high probability that scoped values will collide, because the cache is small. Even with only two scoped values accessed alternately, each will repeatedly kick out the other, leading to a linear probe every time. The only way to avoid this without random replacement would be to make the cache considerably larger, and even then it would still occasionally happen.

@dean-long
Member

I guess I don't understand how random replacement is supposed to help. Do you have a pointer to where I can read up on the topic?
I would think that for small numbers of scoped values, assigning hash slots sequentially would work well. Only if there is a collision, use secondary slot.

@theRealAph
Contributor

On 5/30/24 21:54, Dean Long wrote:

I guess I don't understand how random replacement is supposed to help. Do you have a pointer to where I can read up on the topic?

https://en.wikipedia.org/wiki/Cache_replacement_policies#Random_replacement_(RR) has links.

The maths is pretty simple: if you have only one slot for each entry, one time in 16 two scoped locals will hit the same slot, and repeatedly kick each other out. If you have two slots with random replacement, then each get() will retry a couple of times until each entry is in a different slot.

Random replacement is good for software implementation because it doesn't require history, unlike (say) LRU. Maintaining access history in order to use LRU would be as expensive as the actual get().

I would think that for small numbers of scoped values, assigning hash slots sequentially would work well. Only if there is a collision, use secondary slot.

In practice, today's processors speculate both primary and secondary loads in parallel, so there's no added latency for doing both. The cost is almost nothing.

Random replacement is a low-cost way to increase the hit ratio of a cache without incurring the added runtime cost of (say) LRU.
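The eviction argument can be checked with a toy simulation (my own sketch, under assumed slot assignments; not the JDK code). Two keys whose primary slots collide miss on every alternating access when there is only a single probe slot, but with a secondary slot plus random replacement they soon settle into distinct slots and the misses stop:

```java
import java.util.Random;

// Toy model: two colliding keys, A (=0) and B (=1), accessed alternately.
public class RandomReplacementDemo {
    // Single-slot cache: both keys map to the same slot, so every
    // alternating access evicts the other key and misses.
    static int singleSlotMisses(int accesses) {
        int slot = -1, misses = 0;
        for (int i = 0; i < accesses; i++) {
            int key = i & 1;                    // A, B, A, B, ...
            if (slot != key) { misses++; slot = key; }
        }
        return misses;
    }

    // Two candidate slots per key, random victim on a miss. Slot 0 is shared
    // (the collision); slots 1 and 2 are the keys' distinct secondary slots.
    static int twoSlotMisses(int accesses, long seed) {
        Random rnd = new Random(seed);
        int[] primary = {0, 0}, secondary = {1, 2};
        int[] slots = {-1, -1, -1};
        int misses = 0;
        for (int i = 0; i < accesses; i++) {
            int key = i & 1;
            int p = primary[key], s = secondary[key];
            if (slots[p] != key && slots[s] != key) {
                misses++;
                slots[rnd.nextBoolean() ? p : s] = key;   // random replacement
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        System.out.println("single-slot misses: " + singleSlotMisses(1000));
        System.out.println("two-slot misses:    " + twoSlotMisses(1000, 42));
    }
}
```

The single-slot variant misses on all 1000 accesses; the two-slot variant misses only a handful of times before the keys settle into different slots.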

@theRealAph
Contributor

One other thing. Let's say you always check the slot before overwriting it, and only then go to the secondary slot. You find the secondary slot is occupied. The best thing to do then is random replacement. Given that the end effect of just doing random replacement is the same, there's nothing to be gained from the added complexity.

@openjdk openjdk bot removed the rfr Pull request is ready for review label Jun 6, 2024
@bridgekeeper

bridgekeeper bot commented Jun 28, 2024

@rwestrel This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
