Skip to content

8255243: Reinforce escape barrier interactions with ZGC conc stack processing #832

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

fisk
Copy link
Contributor

@fisk fisk commented Oct 23, 2020

The escape barrier reallocates scalarized objects potentially deep into the stack of a remote thread. Each allocation can safepoint, causing referenced frames to be invalid. Some sprinklings were added that deal with that, but I believe it was subsequently broken with the integration of the new vector API, that has its own new deoptimization code that did not know about this. Not surprisingly, the integration of the new vector API had no idea about this subtlety, and allocates an object, and then reads an object deep from the stack of a remote thread (using an escape barrier). I suppose the issue is that all these 3 things were integrated at almost the same time. The problematic code sequence is in VectorSupport::allocate_vector() in vectorSupport.cpp, which is called from Deoptimization::realloc_objects(). It first allocates an oop (possibly safepointing), and then reads a vector oop from the stack. This is usually fine, but not through the escape barrier, with concurrent stack scanning. While I have not seen any crashes yet, I can see from code inspection, that there is no way that this works correctly.

In order to make this less fragile for future changes, we should really have a RAII object that keeps the target thread's stack of the escape barrier, stable and processed, across safepoints. This patch fixes that. Then it becomes much easier to reason about its correctness, compared to hoping the various hooks are applied after each safepoint.

With this new robustness fix, the thread running the escape barrier, keeps the target thread stack processed, straight through safepoints on the requesting thread, making it easy and intuitive to understand why this works correctly. The RAII object basically just has to cover the code block that pokes at the remote stack and goes in and out of safepoints, arbitrarily. Arguably, this escape barrier doesn't need to be blazingly fast, and can afford keeping stacks sane through its operation.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Testing

Linux x32 Linux x64 Windows x64 macOS x64
Build ✔️ (1/1 passed) ✔️ (5/5 passed) ✔️ (2/2 passed) ✔️ (2/2 passed)
Test (tier1) ✔️ (9/9 passed) ❌ (1/9 failed) ✔️ (9/9 passed)

Failed test task

Issue

  • JDK-8255243: Reinforce escape barrier interactions with ZGC conc stack processing

Reviewers

Contributors

  • Richard Reingruber <rrich@openjdk.org>

Download

$ git fetch https://git.openjdk.java.net/jdk pull/832/head:pull/832
$ git checkout pull/832

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 23, 2020

👋 Welcome back eosterlund! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 23, 2020
@openjdk
Copy link

openjdk bot commented Oct 23, 2020

@fisk The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Oct 23, 2020
@fisk
Copy link
Contributor Author

fisk commented Oct 23, 2020

@reinrich Since you wrote the escape barrier, I thought I'd ping you in case you are interested. This makes the mechanism more robust and reliable, w.r.t. concurrent stack processing. No issue observed, but looks wrong with the new vector API deoptimization code that was integrated recently. This makes it more future-proof and easy to reason about.

@mlbridge
Copy link

mlbridge bot commented Oct 23, 2020

Webrevs

@fisk
Copy link
Contributor Author

fisk commented Oct 23, 2020

Forgot to mention, but I tested this from tier1-5, to sanity check that the new solution doesn't introduce any new issues.

@reinrich
Copy link
Member

@fisk thanks for pinging. I remember reading about a new vector API a while ago and noticed that it potentially interferes with escape barriers but then I lost track. I'll look at this.

@dcubed-ojdk
Copy link
Member

Thanks for adding the tier testing info. Sounds like there is no need for new
tests since this is "just" a RAII object to reinforce existing code interactions,
but please confirm that...

@fisk
Copy link
Contributor Author

fisk commented Oct 23, 2020

Thanks for adding the tier testing info. Sounds like there is no need for new

tests since this is "just" a RAII object to reinforce existing code interactions,

but please confirm that...

Exactly; confirmed.

Copy link
Member

@reinrich reinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really glad you caught that one! And I like the abstraction provided by KeepStackGCProcessedMark.

There is one execution path you missed coming from VM_GetOrSetLocal::deoptimize_objects(javaVFrame* jvf). This code should probably be moved into EscapeBarrier. EscapeBarrier::deoptimize_objects(int depth) could be changed to be more generic EscapeBarrier::deoptimize_objects(int from_depth, int to_depth). VM_GetOrSetLocal::doit_prologue() could call eb.deoptimize_objects(_depth, _depth) then. That would be better but maybe not yet really good...

Update: I see now that there is also a stackwalk in VM_GetOrSetLocal::doit_prologue() which needs to be taken care of with regard to concurrent stack processing. I'd like to try to refactor this. Will propose a patch.

Thanks again, Richard.

@@ -89,7 +89,7 @@ void SafepointMechanism::process(JavaThread *thread) {
// 1) After we exit from block after a global poll
// 2) After a thread races with the disarming of the global poll and transitions from native/blocked
// 3) Before the handshake code is run
StackWatermarkSet::start_processing(thread, StackWatermarkKind::gc);
StackWatermarkSet::on_safepoint(thread);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start_processing in the comment above should be renamed too.

return;
}
StackWatermark* their_watermark = StackWatermarkSet::get(jt, StackWatermarkKind::gc);
our_watermark->link_watermark(their_watermark);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert our_watermark->_linked_watermark == NULL to avoid unintentional nesting?

@reinrich
Copy link
Member

Hi Erik, the last commit in https://github.com/reinrich/jdk/commits/pr-832-with-better-encapsulation would be the refactoring I would like to do. It removes the code not compliant with concurrent thread stack processing from VM_GetOrSetLocal::doit_prologue(). Instead EscapeBarrier::deoptimize_objects(int d1, int d2) is called. You added already a KeepStackGCProcessedMark to that method and I changed it to accept a range [d1, d2] of frames do the object deoptimization for.

I'm not sure how to handle this from a process point of view. Can the refactoring be done within this change? Should a new item or subtask be created for it. I'd be glad if you could give an advice on that.

Thanks, Richard.

@fisk
Copy link
Contributor Author

fisk commented Oct 26, 2020

Hi Erik, the last commit in https://github.com/reinrich/jdk/commits/pr-832-with-better-encapsulation would be the refactoring I would like to do. It removes the code not compliant with concurrent thread stack processing from VM_GetOrSetLocal::doit_prologue(). Instead EscapeBarrier::deoptimize_objects(int d1, int d2) is called. You added already a KeepStackGCProcessedMark to that method and I changed it to accept a range [d1, d2] of frames do the object deoptimization for.

I'm not sure how to handle this from a process point of view. Can the refactoring be done within this change? Should a new item or subtask be created for it. I'd be glad if you could give an advice on that.

Thanks, Richard.

If you are okay with it, I can add your refactorings into this change, and add you as a co-author of the change. Sounds good?

Thanks,
/Erik

@openjdk
Copy link

openjdk bot commented Oct 26, 2020

@fisk Unknown command erik - for a list of valid commands use /help.

@reinrich
Copy link
Member

It does sound good indeed to me if you don't mind doing that. Thanks!
I have run the tests dedicated to EscapeBarriers with ZGC enabled and also the DeoptimizeObjectsALot stress testing. I will run some more serviceability tests and my teams CI testing until tomorrow.

fisk and others added 2 commits October 27, 2020 12:50
…ilitate correct interaction with concurrent thread stack processing.

The Stackwalk for object deoptimization in VM_GetOrSetLocal::doit_prologue is not prepared for concurrent thread stack processing.
EscapeBarrier::deoptimize_objects(int depth) is extended to cover a range of frames from depth d1 to depth d2. It is also prepared for concurrent thread stack processing. With this change it is used to deoptimize objects in the prologue of VM_GetOrSetLocal.
@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label Oct 27, 2020
@fisk
Copy link
Contributor Author

fisk commented Oct 27, 2020

Thanks @reinrich. I uploaded your patch to this PR, and will add you as contributor. Also addressed your review comments. Hope your testing went fine.

@reinrich
Copy link
Member

Thanks for importing the patch and for addressing my comments.
I've tested hotspot_serviceability, jdk_svc, jdk_jdi, vmTestbase_nsk_jdi, vmTestbase_nsk_jvmti, vmTestbase_nsk_jdwp. Just the test jdk/jdk/jfr/event/runtime/TestClassLoaderStatsEvent.java fails repeatedly with ZGC. It does so independently of this pr.

Copy link
Member

@reinrich reinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making EscapeBarriers more robust with regard to concurrent thread stack processing.
The change looks good to me.

@fisk
Copy link
Contributor Author

fisk commented Oct 28, 2020

/contributor add @reinrich

@openjdk
Copy link

openjdk bot commented Oct 28, 2020

@fisk
Contributor Richard Reingruber <rrich@openjdk.org> successfully added.

@openjdk
Copy link

openjdk bot commented Oct 28, 2020

@fisk Unknown command erik - for a list of valid commands use /help.

@openjdk
Copy link

openjdk bot commented Oct 28, 2020

@fisk
Contributor Richard Reingruber <rrich@openjdk.org> successfully added.

Copy link
Contributor

@sspitsyn sspitsyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Erik and Richard,
Changes in the serviceability files looks fine.
Thanks,
Serguei

@openjdk
Copy link

openjdk bot commented Oct 28, 2020

@fisk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8255243: Reinforce escape barrier interactions with ZGC conc stack processing

Co-authored-by: Richard Reingruber <rrich@openjdk.org>
Reviewed-by: rrich, sspitsyn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 92 new commits pushed to the master branch:

  • 3f20612: 8255555: Bad copyright headers in SocketChannelCompare.java SocketChannelConnectionSetup.java UnixSocketChannelReadWrite.java
  • 42fc158: 8253939: [TESTBUG] Increase coverage of the cgroups detection code
  • 01eb690: 8255554: Bad copyright header in AbstractFileSystemProvider.java
  • 1215b1a: 8255457: Shenandoah: cleanup ShenandoahMarkTask
  • af33e16: 8255441: Cleanup ciEnv/jvmciEnv::lookup_method-s
  • 8ad7f38: 8255014: Record Classes javax.lang.model changes, follow-up
  • 6bb7e45: 8245194: Unix domain socket channel implementation
  • 8bde2f4: 8255013: implement Record Classes as a standard feature in Java, follow-up
  • 0425889: 8255429: Remove C2-based profiling
  • aaf4f69: 8255233: InterpreterRuntime::at_unwind should be a JRT_LEAF
  • ... and 82 more: https://git.openjdk.java.net/jdk/compare/4634dbef6d554b6f091dd7893e266682b267757f...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 28, 2020
@fisk
Copy link
Contributor Author

fisk commented Oct 28, 2020

Hi Erik and Richard,

Changes in the serviceability files looks fine.

Thanks,

Serguei

Thanks for the review Serguei!

@fisk
Copy link
Contributor Author

fisk commented Oct 29, 2020

/integrate

@openjdk openjdk bot closed this Oct 29, 2020
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 29, 2020
@openjdk
Copy link

openjdk bot commented Oct 29, 2020

@fisk Since your change was applied there have been 107 commits pushed to the master branch:

  • faf23de: 8255534: Shenandoah: Fix CmpP optimization wrt native-LRB
  • 579e50b: 8255564: InterpreterMacroAssembler::remove_activation() needs to restore thread right after VM call on x86_32
  • 4b20e46: 8255579: x86: Use cmpq(Register,Address) in safepoint_poll
  • 72ff8e2: 8254782: Fix benchmark issues in java/lang/StringIndexOfChar.java benchmark
  • ea26ff1: 8247614: java/nio/channels/DatagramChannel/Connect.java timed out
  • 38574d5: 8255298: Remove SurvivorAlignmentInBytes functionality
  • 4031cb4: 8254189: Improve comments for StackOverFlow and fix in_xxx() functions
  • caec8d2: 8233560: [TESTBUG] ToolTipManager/Test6256140.java is failing on macos
  • a5b42ec: 8233570: [TESTBUG] HTMLEditorKit test bug5043626.java is failing on macos
  • 7e305ad: 8255405: sun/net/ftp/imp/FtpClient uses SimpleDateFormat in not thread-safe manner
  • ... and 97 more: https://git.openjdk.java.net/jdk/compare/4634dbef6d554b6f091dd7893e266682b267757f...master

Your commit was automatically rebased without conflicts.

Pushed as commit 5b18558.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

4 participants