8290965: PPC64: Implement post-call NOPs #17171

Closed
wants to merge 6 commits into from

Member

@reinrich reinrich commented Dec 20, 2023

Implementation of post call nops (PCNs) on ppc64.

Depends on #17150

About post call nops:

  • instruction(s) at return addresses of compiled Java calls
  • emitted iff VM continuations are enabled to support virtual threads
  • encode data that can be used to find the corresponding CodeBlob and oop map faster
  • MT-safe patchable to trigger deoptimization

Background:

  • Frames in continuation StackChunks are not visited if their compiled method is made not entrant (in contrast to frames on stack).
    Instead all PCNs of the compiled method are patched to trigger deoptimization when control returns to such frames.
  • With vm continuations, stacks are walked and inspected more frequently. This requires lookup of metadata like frame size and oop maps. As an optimization the offset of the CodeBlob to the PCN and the oop map slot are encoded as data in the PCN.

Post call nops on ppc64

  • 1 instruction, i.e. 4 bytes (either CMPI or CMPLI[1])
    x86_64: 1 instruction, 8 bytes
    aarch64: 3 instructions, 12 bytes
    [1] 3.1.10 Fixed Point Compare Instructions in Power ISA 3.1B
    https://openpowerfoundation.org/specifications/isa/

  • 26 bits data payload
    x86_64: 32 bits; aarch64: 32 bits

  • 9 bits dedicated to oop map slot. With 8 bits there were cases with SPECjvm2008 where the slot could not be encoded (on ppc64 and x86_64).
    x86_64: 8 bits; aarch64: 8 bits

  • 17 bits dedicated to cb offset. Effectively 19 bits due to instruction alignment.
    x86_64: 24 bits; aarch64: 24 bits

  • Also used when reconstructing the back chain after thawing continuation frames (see Thaw<ConfigT>::patch_caller_links)

  • Refactored frame constructors to make use of fast CodeBlob lookup based on PCNs.
    The fast lookup may only be used if the pc is known to be in the code cache because CodeCache::find_blob_fast can yield wrong results if it finds instructions outside the code cache that look just like PCNs. Callers of the frame class constructors need to pass frame::kind::native in that case to avoid errors. Other platforms don't make this explicit which is a problem in my eyes. Picking the wrong constructor can cause errors when porting and in future development.

  • Currently only the PCNs in nmethods are initialized. Therefore we don't even try to make a fast lookup based on PCNs if we know the CodeBlob is, e.g., a RuntimeStub. To achieve this we call the frame constructor passing frame::kind::code_blob.
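To make the payload split from the list above concrete, here is a minimal C++ sketch. It is not the code from this PR. The widths (9 bits for the oop map slot, 17 bits for the cb offset stored in units of 4-byte instructions) come from the description; the names `pack_payload`/`unpack_payload`, the field order inside the 26-bit payload, and the way failures are reported are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Widths taken from the PR description; all names below are hypothetical.
constexpr unsigned kOopMapSlotBits = 9;
constexpr unsigned kCbOffsetBits   = 17;
constexpr unsigned kInstrAlignLog2 = 2;  // ppc64 instructions are 4-byte aligned

constexpr uint32_t kOopMapSlotMax = (1u << kOopMapSlotBits) - 1;  // 511
constexpr uint32_t kCbOffsetMax   = (1u << kCbOffsetBits) - 1;    // in instruction units

struct Payload {
  uint32_t oopmap_slot;
  uint32_t cb_offset_bytes;
};

// Packs both fields into a 26-bit payload. Returns false if a field does not fit.
// Assumed layout: oop map slot in the upper 9 bits, scaled cb offset in the lower 17.
bool pack_payload(uint32_t oopmap_slot, uint32_t cb_offset_bytes, uint32_t* out) {
  assert(out != nullptr);
  if ((cb_offset_bytes & ((1u << kInstrAlignLog2) - 1)) != 0) return false;  // unaligned
  const uint32_t scaled = cb_offset_bytes >> kInstrAlignLog2;
  if (oopmap_slot > kOopMapSlotMax || scaled > kCbOffsetMax) return false;
  *out = (oopmap_slot << kCbOffsetBits) | scaled;
  return true;
}

// Inverse of pack_payload: recovers the slot and the byte offset.
Payload unpack_payload(uint32_t payload) {
  assert(payload < (1u << (kOopMapSlotBits + kCbOffsetBits)));
  return { payload >> kCbOffsetBits, (payload & kCbOffsetMax) << kInstrAlignLog2 };
}
```

Because the offset is stored in instruction units, 17 bits cover byte offsets below 512 KiB, which is the "effectively 19 bits" mentioned above. In this sketch an oop map slot above 511 or a larger offset makes `pack_payload` return false; that presumably corresponds to the "patch ... failure" rows in the statistics.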

Statistics

| SPECjvm2008 compiler.compiler with fixed iterations | ppc64le | x86_64 |
|------------------------------------------------------|-----------|-----------|
| PCN lookup success | 3715494 | 3410337 |
| PCN lookup failure | 220987 | 235436 |
| PCN decode success | 3660675 | 3320496 |
| PCN decode failure (C1) | 53539 | 46816 |
| PCN patch success | 63848 | 42310 |
| PCN patch cb offset failure | 0 | 0 |
| PCN patch oopmap slot failure | 0 | 298 |

| test/jdk/java/lang/Thread/virtual/stress/Skynet.java | ppc64le | x86_64 |
|------------------------------------------------------|-----------|-----------|
| PCN lookup success | 306955525 | 247185016 |
| PCN lookup failure | 500975 | 421098 |
| PCN decode success (C2) | 306951893 | 247181691 |
| PCN decode failure | 3168 | 59 |
| PCN patch success | 2080 | 2662 |
| PCN patch cb offset failure | 0 | 0 |
| PCN patch oopmap slot failure | 0 | 0 |

Comments

C1: We get decode failures even if patching always succeeded because not all PCNs are patched. Only PCNs in nmethods are actually patched. E.g. C2 runtime stubs like _new_array_nozero_Java have PCNs that are not patched.

C2: With Skynet.java there are 100x more PCN lookups. This is because it stresses virtual threads.

C2: With Skynet.java there are more PCN lookups on ppc64le. They originate from Thaw<ConfigT>::patch_caller_links.
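As a quick sanity check of the two C2 remarks, the ratios can be recomputed from the numbers in the statistics. The sketch below is only arithmetic on those values; the constant names are made up, and "lookups" is assumed to mean lookup successes plus lookup failures.

```cpp
#include <cstdint>

// Lookup counts copied from the statistics (success + failure). Names are hypothetical.
constexpr uint64_t kSpecPpc   = 3715494ULL + 220987ULL;
constexpr uint64_t kSpecX86   = 3410337ULL + 235436ULL;
constexpr uint64_t kSkynetPpc = 306955525ULL + 500975ULL;
constexpr uint64_t kSkynetX86 = 247185016ULL + 421098ULL;

constexpr double ratio(uint64_t a, uint64_t b) {
  return static_cast<double>(a) / static_cast<double>(b);
}

// Skynet versus SPECjvm2008 compiler.compiler, per platform.
constexpr double kSkynetVsSpecPpc = ratio(kSkynetPpc, kSpecPpc);
constexpr double kSkynetVsSpecX86 = ratio(kSkynetX86, kSpecX86);

// ppc64le versus x86_64 on Skynet.
constexpr double kPpcVsX86Skynet = ratio(kSkynetPpc, kSkynetX86);
```

With these inputs the Skynet-to-SPECjvm2008 ratio comes out at roughly 78x on ppc64le and 68x on x86_64, so "100x" is best read as an order of magnitude. On Skynet, ppc64le performs roughly 1.24 times as many lookups as x86_64. The two benchmarks run for different lengths of time, so these ratios are indicative only.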

Testing

The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests.
All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8290965: PPC64: Implement post-call NOPs (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17171/head:pull/17171
$ git checkout pull/17171

Update a local copy of the PR:
$ git checkout pull/17171
$ git pull https://git.openjdk.org/jdk.git pull/17171/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17171

View PR using the GUI difftool:
$ git pr show -t 17171

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17171.diff

@bridgekeeper

bridgekeeper bot commented Dec 20, 2023

👋 Welcome back rrich! A progress list of the required criteria for merging this PR into pr/17150 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Dec 20, 2023

@reinrich The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Dec 20, 2023
@reinrich
Member Author

/label remove hotspot
/label add hotspot-compiler

@reinrich reinrich marked this pull request as ready for review December 20, 2023 20:32
@openjdk openjdk bot added rfr Pull request is ready for review and removed hotspot hotspot-dev@openjdk.org labels Dec 20, 2023
@openjdk

openjdk bot commented Dec 20, 2023

@reinrich
The hotspot label was successfully removed.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Dec 20, 2023
@openjdk

openjdk bot commented Dec 20, 2023

@reinrich
The hotspot-compiler label was successfully added.

@mlbridge

mlbridge bot commented Dec 20, 2023

Webrevs

Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

Usage of CMPI/CMPLI looks great. Assuming kind::nmethod by default will likely work, but I wonder if we could avoid that without measurable performance loss (see comments below).

if (_pc == nullptr) {
_pc = (address)own_abi()->lr;
assert(_pc != nullptr, "must have PC");
}

- if (_cb == nullptr) {
-   _cb = CodeCache::find_blob(_pc);
+ if (_cb == nullptr ) {
Contributor

Please remove the whitespace!

- if (_cb == nullptr) {
-   _cb = CodeCache::find_blob(_pc);
+ if (_cb == nullptr ) {
+   _cb = knd == kind::nmethod ? CodeCache::find_blob_fast(_pc) : CodeCache::find_blob(_pc);
Contributor

(knd == kind::nmethod) would look better.

@@ -393,16 +393,26 @@
inline common_abi* own_abi() const { return (common_abi*) _sp; }
inline common_abi* callers_abi() const { return (common_abi*) _fp; }

enum class kind {
native, // The frame's pc is not necessarily in the CodeCache.
// CodeCache::find_blob_fast(void* pc) can yield wrong results in this case and must not be used.
Contributor

I'd probably call it unknown.

inline frame::frame(intptr_t* sp, address pc, intptr_t* unextended_sp, intptr_t* fp, CodeBlob* cb)
: _sp(sp), _pc(pc), _cb(cb), _oop_map(nullptr),
_on_heap(false), DEBUG_ONLY(_frame_index(-1) COMMA) _unextended_sp(unextended_sp), _fp(fp) {
- setup();
+ setup(kind::nmethod);
Contributor

I think kind::nmethod should only be used if cb != nullptr which is not checked, here. Is this one performance critical?

Member Author

> I think kind::nmethod should only be used if cb != nullptr which is not checked, here. Is this one performance critical?

I don't quite understand: the purpose of using kind::nmethod is to allow for a fast lookup of the cb which is only done if cb == nullptr.
See also my other response where kind::nmethod is default.

Member Author

> Is this one performance critical?

This is a good question. Honestly I have difficulties understanding why PCNs should be performance critical at all. AFAIK frames are only iterated on the slow path when freezing/thawing. Maybe the slow path is not that uncommon, e.g. if StackChunks are visited by GC.
I wanted to use kind::nmethod as default whenever possible in order not to miss a place that actually is performance critical.
See also #8955 (comment)

@@ -1187,8 +1187,12 @@ void MacroAssembler::post_call_nop() {
if (!Continuations::enabled()) {
return;
}
// We use CMPI/CMPLI instructions to encode post call nops.
// We set bit 9 to distinguish post call nops from real CMPI/CMPI instructions
Contributor

Should be CMPI/CMPLI. Maybe add that CMPI and CMPLI opcodes only differ in one bit which we use to encode data.

Member Author

Done.

// |0 0 1 0 1|DATA HI| 1|        DATA LO        |
// |         |4 bits |  |        22 bits        |
//
// Bit 9 is alwys 1 for PCNs to distinguish them from CMPI/CMPLI
Contributor

always, maybe distinguish from "regular CMPI/CMPLI".
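
The layout in the code comment above can be sketched in C++ as follows. This is an illustration under assumptions, not the NativePostCallNop code from this PR: the helper names are hypothetical, bit numbers follow the Power ISA convention (bit 0 is the most significant bit of the 32-bit instruction), and DATA HI is assumed to hold the upper 4 bits of the 26-bit payload.

```cpp
#include <cassert>
#include <cstdint>

// CMPLI (primary opcode 10) and CMPI (primary opcode 11) share the bits 0 0 1 0 1
// in instruction bits 0..4; the sixth opcode bit is free to carry data.
constexpr uint32_t kOpcodePrefix     = 0x5u << 27;
constexpr uint32_t kOpcodePrefixMask = 0x1Fu << 27;
constexpr uint32_t kBit9             = 1u << (31 - 9);  // reserved (0) in regular CMPI/CMPLI
constexpr unsigned kDataLoBits       = 22;              // instruction bits 10..31
constexpr uint32_t kDataLoMask       = (1u << kDataLoBits) - 1;
constexpr unsigned kDataHiShift      = 23;              // instruction bits 5..8
constexpr uint32_t kDataHiMask       = 0xFu;

// Builds a post-call nop carrying a 26-bit payload.
uint32_t make_pcn(uint32_t data) {
  assert(data < (1u << 26));
  const uint32_t hi = data >> kDataLoBits;
  const uint32_t lo = data & kDataLoMask;
  return kOpcodePrefix | (hi << kDataHiShift) | kBit9 | lo;
}

// True if the word has the CMPI/CMPLI prefix and bit 9 set.
bool is_pcn(uint32_t insn) {
  return (insn & (kOpcodePrefixMask | kBit9)) == (kOpcodePrefix | kBit9);
}

// Extracts the 26-bit payload from a post-call nop.
uint32_t pcn_data(uint32_t insn) {
  assert(is_pcn(insn));
  return (((insn >> kDataHiShift) & kDataHiMask) << kDataLoBits) | (insn & kDataLoMask);
}
```

In this sketch a regular compare such as the word `0x2C030000` (CMPI with bit 9 clear) is not recognized as a PCN, while any word with the prefix and bit 9 set is, regardless of where it is located in memory.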


public:

// Constructors
inline frame(intptr_t* sp, intptr_t* fp, address pc);
inline frame(intptr_t* sp, address pc, intptr_t* unextended_sp = nullptr, intptr_t* fp = nullptr, CodeBlob* cb = nullptr);
inline frame(intptr_t* sp, address pc, kind knd = kind::nmethod);
Contributor

I think using kind::nmethod by default is potentially dangerous. The pc may be outside of the code cache and calling find_blob_fast would be unreliable. It's used by pns for debugging code. It doesn't look performance critical and we could use a conservative default.
I guess that we don't see issues because native code doesn't set bit 9 in CMPI/CMPLI.

Member Author

pns does not use this constructor. It uses frame::frame(void* sp, void* fp, void* pc) : frame((intptr_t*)sp, (address)pc, kind::code_blob). So there's no problem. pns seems to be the only user of this one. It might be good to use kind::native there.

Using kind::native (or kind::unknown) as default instead of kind::nmethod is potentially problematic since there might be locations in shared code that should set kind::nmethod. I think this requires a clean-up of the shared frame api. Note also that using the wrong kind (wrong constructor on other platforms) hit the assertion in CodeCache::find_blob_and_oopmap (that's how I noticed that the distinction is actually needed :))
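
The rule being discussed, that only kind::nmethod may take the PCN-based fast path, can be modelled in a few lines of self-contained C++. This is a simplified sketch and not HotSpot code: `FrameKind`, `BlobLookup`, `CodeCacheRange`, `lookup_for` and `choose_lookup` are hypothetical stand-ins for frame::kind, the choice between CodeCache::find_blob and CodeCache::find_blob_fast, and the code cache bounds.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for frame::kind as described in the PR.
enum class FrameKind {
  native,     // pc is not necessarily in the code cache
  code_blob,  // pc is in a CodeBlob that is not an nmethod, e.g. a RuntimeStub
  nmethod     // pc is in an nmethod, whose PCNs are initialized
};

enum class BlobLookup { full, fast_pcn };

// Only nmethod frames use the PCN-based lookup; everything else takes the full lookup.
constexpr BlobLookup lookup_for(FrameKind knd) {
  return (knd == FrameKind::nmethod) ? BlobLookup::fast_pcn : BlobLookup::full;
}

// Hypothetical model of the code cache address range [low, high).
struct CodeCacheRange {
  uintptr_t low;
  uintptr_t high;
  bool contains(uintptr_t pc) const { return pc >= low && pc < high; }
};

// The fast path trusts whatever instruction it finds at pc, so it is only valid
// if pc is inside the code cache. The assert models that precondition.
BlobLookup choose_lookup(const CodeCacheRange& cc, uintptr_t pc, FrameKind knd) {
  const BlobLookup lookup = lookup_for(knd);
  assert(lookup != BlobLookup::fast_pcn || cc.contains(pc));
  return lookup;
}
```

Passing `FrameKind::nmethod` for a pc outside the modelled range trips the assert in debug builds, which mirrors the kind of failure described above when the wrong kind is used.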

@@ -89,21 +89,27 @@ inline void frame::setup() {
inline frame::frame() : _sp(nullptr), _pc(nullptr), _cb(nullptr), _oop_map(nullptr), _deopt_state(unknown),
_on_heap(false), DEBUG_ONLY(_frame_index(-1) COMMA) _unextended_sp(nullptr), _fp(nullptr) {}

- inline frame::frame(intptr_t* sp) : frame(sp, nullptr) {}
+ inline frame::frame(intptr_t* sp) : frame(sp, nullptr, kind::nmethod) {}
Contributor

Same here. Potentially dangerous default value. Not performance critical AFAICS.

@mlbridge

mlbridge bot commented Dec 28, 2023

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 12/20/23 20:36, Richard Reingruber wrote:

| test/jdk/java/lang/Thread/virtual/stress/Skynet.java | ppc64le | x86_64 |
|------------------------------------------------------|-----------|-----------|
| PCN lookup success | 306955525 | 247185016 |
| PCN lookup failure | 500975 | 421098 |
| PCN decode success (C2) | 306951893 | 247181691 |
| PCN decode failure | 3168 | 59 |
| PCN patch success | 2080 | 2662 |
| PCN patch cb offset failure | 0 | 0 |
| PCN patch oopmap slot failure | 0 | 0 |

These data are really interesting. How did you gather them? Thanks.

@reinrich
Member Author

reinrich commented Jan 9, 2024

> Mailing list message from Andrew Haley on hotspot-compiler-dev:
>
> On 12/20/23 20:36, Richard Reingruber wrote:
>
> | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | ppc64le | x86_64 |
> |------------------------------------------------------|-----------|-----------|
> | PCN lookup success | 306955525 | 247185016 |
> | PCN lookup failure | 500975 | 421098 |
> | PCN decode success (C2) | 306951893 | 247181691 |
> | PCN decode failure | 3168 | 59 |
> | PCN patch success | 2080 | 2662 |
> | PCN patch cb offset failure | 0 | 0 |
> | PCN patch oopmap slot failure | 0 | 0 |
>
> These data are really interesting. How did you gather them? Thanks.

This is the code for the stats based on master: c376fcc
This is the version for this pr: ae2b6ba
(Actually these are cleaner reimplementations of the original code.)

@openjdk-notifier openjdk-notifier bot changed the base branch from pr/17150 to master January 10, 2024 12:21
@openjdk-notifier

The parent pull request that this pull request depends on has now been integrated and the target branch of this pull request has been updated. This means that changes from the dependent pull request can start to show up as belonging to this pull request, which may be confusing for reviewers. To remedy this situation, simply merge the latest changes from the new target branch into this pull request by running commands similar to these in the local repository for your personal fork:

git checkout ppc_post_call_nop
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# if there are conflicts, follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk

openjdk bot commented Jan 10, 2024

@reinrich this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout ppc_post_call_nop
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Jan 10, 2024
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Jan 10, 2024
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

Thanks for the updates! The constructors should still be used with care, but I think your code is at least as good as other platforms (rather better IMHO).

@openjdk

openjdk bot commented Jan 11, 2024

@reinrich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8290965: PPC64: Implement post-call NOPs

Reviewed-by: mdoerr

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jan 11, 2024
@reinrich
Member Author

I intend to ship this ppc only pr tomorrow if the tests pass after merging master. I don't expect another review.

@reinrich
Member Author

/integrate

@openjdk

openjdk bot commented Jan 17, 2024

Going to push as commit de97c0e.
Since your change was applied there have been 27 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 17, 2024
@openjdk openjdk bot closed this Jan 17, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 17, 2024
@openjdk

openjdk bot commented Jan 17, 2024

@reinrich Pushed as commit de97c0e.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@reinrich reinrich deleted the ppc_post_call_nop branch March 27, 2024 15:52