8253180: ZGC: Implementation of JEP 376: ZGC: Concurrent Thread-Stack Processing #296
Conversation
/label hotspot
@fisk
@fisk The following label will be automatically applied to this pull request: When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label command.
/label hotspot-gc
@fisk
Webrevs
/label remove hotspot-gc
@fisk
I've pre-reviewed most of the changes.
Partially re-reviewed the changes.
@fisk - You are missing testing information for this PR.
Sorry about that. I have performed testing with ZGC enabled from tier1-7 (multiple times), and standard testing from tier1-5. I have also stress tested the change with gc-test-suite, with full verification on (there is a bunch of it) and the GC interval reduced to very small (GC all the time). I have also tested recent versions of this patch from tier1-8 (2 times). I have also perf tested this in dev-submit to check for regressions (none found), as well as stress tested SPECjbb2015 on a 3 TB machine with ~2200 threads. I have also manually stress tested the changes on an AArch64 machine, using gc-test-suite.
@fisk - Thanks for the info. An amazing amount of testing!!
Next set of comments.
@fisk this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts, run the following commands in your local repository:
git checkout 8253180_conc_stack_scanning
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
I've gone over the entire patch, but I'm leaving the compiler parts to others to review.
Thanks for reviewing @stefank, and thanks for all the pre-reviewing as well. I have updated the PR to resolve all your comments.
In my last PR update, I included a fix to an exception handling problem that I encountered after lots of stress testing that I have been running for a while now. I managed to catch the issue, get a reliable reproducer, and fix it. The root problem is that the hook I had placed in SharedRuntime::exception_handler_for_return_address was being ignored. The reason is that the stack is not walkable at this point, so the hook simply ignores the request. This had some unexpected consequences. After looking closer at this code, I found that if we did have a walkable stack when we call SharedRuntime::raw_exception_handler_for_return_address, that would have been the only hook we need at all for exception handling. It is always the common root point where we unwind into a caller frame due to an exception being thrown into the caller, and we need to look up the rethrow handler of the caller. However, we are indeed not walkable here. To deal with this, I have rearranged the exception hooks a bit. First of all, I have deleted all before_unwind hooks for exception handling, because they should not be needed if the after_unwind hook is reliably called on the caller side instead. And those hooks do indeed need to be there, because we do not always have a point where we can call before_unwind (e.g. the C1 unwind exception code, which just unwinds and looks up the rethrow handler via SharedRuntime::exception_handler_for_return_address). I have then traced all paths from SharedRuntime::raw_exception_handler_for_return_address into the runtime rethrow handlers called, for each rethrow exception handler PC exposed in the function. They are:
Each rethrow handler returned has a corresponding comment saying which runtime rethrow handler it will end up in, once the stack has become walkable and we have transferred control into the caller. And all of those runtime hooks now have an after_unwind() hook. The good news is that the responsibility for who calls the unwind hook for exceptions is now clearer: it is never done by the callee, and always done by the caller, in its rethrow handler, at which point the stack is walkable. In order to avoid further issues where an unwind hook is ignored, I have changed them to assert that there is a last_Java_frame present. Previously I did not assert that, because there was shared code between runtime native transitions and the native wrapper that called an unwind hook. This prevented the unwind hooks from asserting this, as compiler threads would still perform native transitions that just ignored the request. I moved the problematic hook for native up one level in the hierarchy, to a path where it is only called by native wrappers (where we always have a last_Java_frame), so that I can finally assert that the unwind hooks are always called at points where we have a last_Java_frame. This makes me feel confident that I do not have another hook that is being accidentally ignored. However, the relationship between the various exception handling code executed in the caller, the callee, and between the two (before we are walkable) is rather complicated. So it would be good to have someone who knows the exception code very well have a look at this, to make sure I have not missed anything. I have rerun all testing and done a load of stress testing to sanity check this. The reproducer I eventually found, which reproduced the issue with a 100% success rate, was run many times with the new patch and no longer reproduces any issue.
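To illustrate the invariant described above, here is a minimal sketch of a caller-side unwind hook (the names process_caller_frame and after_unwind_hook are illustrative, not the actual HotSpot entry points):

// Sketch only: the hook runs in the caller's rethrow handler, where the stack
// is guaranteed to be walkable, so the walkability requirement can be asserted
// instead of silently ignored.
void after_unwind_hook(JavaThread* thread) {
  assert(thread->has_last_Java_frame(), "unwind hook requires a walkable stack");
  // Finish the lazy processing of the caller frame before executing into it.
  process_caller_frame(thread);  // hypothetical helper
}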
Mailing list message from Kim Barrett on hotspot-dev:
That's fine with me.
// The call to start_processing fixes the thread's oops and the first few frames.
//
// The call has been carefully placed here to cater for a few situations:
// 1) After we exit from block after a global pool
Typo: pool -> poll
Fixed.
void StackWatermark::yield_processing() {
  update_watermark();
  MutexUnlocker mul(&_lock, Mutex::_no_safepoint_check_flag);
This seems a little dubious - is it just a heuristic? There is no guarantee that unlocking the Mutex will allow another thread to claim it before this thread re-locks it.
It is indeed just a heuristic. There is no need for a guarantee.
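For context, a commented restatement of what the heuristic amounts to (a sketch based on the snippet above, not new behaviour):

// Publish progress, then briefly drop the per-thread lock so a competing
// thread (mutator or GC) has a chance to claim further processing. There is
// no fairness guarantee; this thread may simply re-acquire the lock first.
void StackWatermark::yield_processing() {
  update_watermark();  // make the progress so far visible to other threads
  MutexUnlocker mul(&_lock, Mutex::_no_safepoint_check_flag);  // unlock, relock on scope exit
}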
StackWatermark* _next;
JavaThread* _jt;
StackWatermarkFramesIterator* _iterator;
Mutex _lock;
How are you guaranteeing that the Mutex is unused at the time the StackWatermark is deleted?
The StackWatermarks are deleted when the thread is deleted (and its destructor runs). Hence, I'm relying on the Threads SMR project here. Anyone that pokes around at the StackWatermark is either the current thread, or a thread that has a ThreadsListHandle containing the thread, making it safe to access that thread without it being concurrently deleted.
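As a rough illustration of that protocol, here is a sketch assuming the standard Thread-SMR ThreadsListHandle usage (examine_stack_watermark is a hypothetical helper, not code from this patch):

// Sketch: a non-current thread must hold a ThreadsListHandle that includes the
// target JavaThread before touching its StackWatermark, so the thread (and the
// watermark it owns) cannot be deleted while we are looking at it.
void inspect_watermark_of(JavaThread* target) {
  ThreadsListHandle tlh;
  if (tlh.includes(target)) {
    // Safe: 'target' is kept alive by the ThreadsListHandle for this scope.
    examine_stack_watermark(target);  // hypothetical helper
  }
}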
if (thread->handshake_state()->should_process()) {
  thread->handshake_state()->process_by_self(); // Recursive
}
}
uintptr_t SafepointMechanism::compute_poll_word(bool armed, uintptr_t stack_watermark) {
  if (armed) {
    log_debug(stackbarrier)("Computed armed at %d", Thread::current()->osthread()->thread_id());
s/at/for/ ?
Fixed.
Hi Erik,
Hi David, Thanks for reviewing this code. There are various polls in the VM. We have runtime transitions, interpreter transitions, transitions at returns, native wrappers, transitions in nmethods... and sometimes they are a bit different. The "poll word" encapsulates enough information to be able to poll for returns (stack watermark barrier), or poll for normal handshakes/safepoints, with a conditional branch. So really, we could use the "poll word" for every single poll. A low-order bit is a boolean saying whether a handshake/safepoint is armed, and the rest of the word denotes the watermark for which frame has armed returns. The "poll page" is for polls that do not use conditional branches, but instead use an indirect load. It is still used in nmethod loop polls, because I experimentally found loop polls to perform worse with conditional branches on one machine, and did not want to risk regressions. It is also used for VM configurations that do not yet support stack watermark barriers, such as Graal, PPC, S390 and 32-bit platforms. They will hopefully eventually support this mechanism, but having the poll page allows a smoother transition. And unless it is crystal clear that the performance of the conditional branch loop poll really is fast enough on sufficiently many machines, we might keep it until that changes. Hope this makes sense. Thanks,
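A minimal sketch of that encoding, with illustrative values and logic (not the actual HotSpot implementation):

// Poll word layout (sketch): low-order bit = safepoint/handshake armed,
// remaining bits = stack watermark. Stacks grow down, so frames at or above
// the watermark are older frames that are not yet safe to return into.
uintptr_t compute_poll_word_sketch(bool armed, uintptr_t stack_watermark) {
  if (armed) {
    return uintptr_t(1);   // below every sp: "sp >= poll word" always takes the slow path
  }
  if (stack_watermark == 0) {
    return ~uintptr_t(0);  // above every sp: fully disarmed, return polls never trigger
  }
  return stack_watermark;  // only returns into frames above the watermark take the slow path
}
// A return poll then reduces to one comparison and a conditional branch:
//   if (sp >= thread's poll word) { call the stack watermark / safepoint slow path }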
Mailing list message from Andrew Haley on hotspot-dev: On 06/10/2020 08:22, Erik Österlund wrote:
One small thing: the couple of uses of lea(InternalAddress) should really be adr;
diff --git a/src/hotspot/cpu/aarch64/c1_CodeStubs_aarch64.cpp b/src/hotspot/cpu/aarch64/c1_CodeStubs_aarch64.cpp
assert(SharedRuntime::polling_page_return_handler_blob() != NULL,
__ bind(entry->_stub_label);
Hi Andrew, Thanks for having a look. I applied your patch. Having said that, this is run on the safepoint slow path, so should be a rather cold path, where threads have to wear coats and gloves. But it does not hurt to optimize the encoding further, I suppose. Thanks,
Mailing list message from David Holmes on serviceability-dev: Hi Erik, On 6/10/2020 5:37 pm, Erik Österlund wrote:
Yes but I am somewhat surprised. The conventional wisdom has always been
Cheers,
When thread-local handshakes were built, both a branch-based and an indirect-load-based prototype were implemented. I had a branch-based solution and Mikael Gerdin built an indirect-load-based solution, so we could compare them. He compared them on many machines and found that sometimes branches are a bit faster and sometimes a bit slower, depending on CPU model. But the results with indirect loads were more stable from machine to machine, while the branch-based solution depended a bit more on what CPU model was being used. That is why the indirect-load solution was chosen: it was not always best, but it was never bad on any machine. Since then, we have gotten loop strip mining in C2, which makes polls in hot loops less frequent. My hypothesis was that with that in place, a new evaluation would show that branching is fine now. However, one machine did not agree with that. To be fair, that machine is not giving very stable results at all right now, so I am not sure if there is a real problem or not. But I thought I would keep the old behaviour for now anyway, as I do not yet have a good reason to change it, and not good enough proof that it is okay... yet. /Erik
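For reference, the two polling styles being compared, sketched in C++-like pseudocode (accessor names are illustrative; the real polls are emitted as compiled code):

// Style 1: conditional-branch poll against the per-thread poll word.
// One load, one compare, one (normally not-taken) branch.
if (current_sp >= thread_poll_word()) {          // illustrative accessor
  safepoint_slow_path();                         // handshake/safepoint pending, or unprocessed caller frame
}

// Style 2: indirect-load poll against the per-thread polling page.
// The load is unconditional; arming the poll points the thread's polling page
// at a protected page, so the access faults and the signal handler routes the
// thread to the slow path.
*(volatile uintptr_t*)thread_polling_page();     // illustrative accessor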
I've been reviewing this and stepping through the debugger. It looks OK to me.
Thanks for the review Stuart.
/integrate
This PR contains the implementation of "JEP 376: ZGC: Concurrent Thread-Stack Processing" (cf. https://openjdk.java.net/jeps/376).
Basically, this patch modifies the epilogue safepoint poll when returning from a frame (supporting interpreter frames, c1, c2, and native wrapper frames) to compare the stack pointer against a thread-local value. This turns return polls into more of a swiss army knife that can poll for safepoints and handshakes, but also for returns into frames that are not yet safe to expose, as denoted by a "stack watermark".
ZGC will leave frames (and other thread oops) unprocessed at the GC checkpoint safepoints, rather than processing all threads and their stacks there. Processing is initialized automagically when threads wake up for a safepoint, or get poked by a handshake or safepoint. Said initialization processes a few (3) frames and other thread oops. The rest - the bulk of the frame processing - is deferred until it is actually needed. It is needed when a frame is exposed to either 1) execution (returns or unwinding due to exception handling), or 2) stack walker APIs. A hook is then run to go and finish the lazy processing of frames.
Mutator and GC threads can compete for processing. The processing is therefore performed under a per-thread lock. Note that disarming the poll word (which the return polls compare against) is only performed by the thread itself. So sliding the watermark up requires one runtime call for a thread to note that nothing needs to be done, and then update the poll word accordingly. Having other threads downgrade the poll word concurrently was simply not worth the complexity it brought (and is only possible on TSO machines), so I left that one out.
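A rough sketch of the resulting flow, under the description above (function names are illustrative, not the exact HotSpot entry points):

// 1) At a GC checkpoint, the GC arms each thread's watermark instead of scanning
//    its whole stack; when a thread wakes up (or is handshaked), it fixes its
//    thread-local oops plus a handful of top frames ("start processing").
// 2) The rest of the stack is processed on demand:
//    - a return/unwind into a frame above the watermark hits the return poll
//      and finishes processing that caller frame first;
//    - a stack walker API touching unprocessed frames triggers the same hook.
// 3) Only the thread itself disarms its own poll word; other threads may help
//    process frames under the per-thread lock, but never lower the poll word.
void on_return_poll_slow_path(JavaThread* thread) {   // hypothetical
  process_caller_frame_if_needed(thread);             // hypothetical helper
  update_poll_word_for(thread);                       // hypothetical: disarm up to the new watermark
}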
Download
$ git fetch https://git.openjdk.java.net/jdk pull/296/head:pull/296
$ git checkout pull/296