Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame #15972

Closed
wants to merge 3 commits into from

Conversation

pchilano
Copy link
Contributor

@pchilano pchilano commented Sep 28, 2023

Please review the following patch. As explained in the bug comments the problem is that os::get_sender_for_C_frame() always constructs a frame as if the sender is also a native C/C++ frame. Setting a correct value for _unextended_sp is important to avoid crashes if this value is later used to get that frame's caller, which will happen if we end up calling frame::sender_for_compiled_frame().

The issue exists on aarch64 for both linux and macos but the fix for linux is different. The "Procedure Call Standard for the Arm 64-bit Architecture" doesn't specify a location for the frame record within a stack frame (6.4.6), and gcc happens to choose to save it the top of the frame (lowest address) rather than the bottom. This means that changing fr->link() for fr->sender_sp() won't work. The fix is to use the value of fr->link() but adjusted using the code blob frame size before setting it as the _unextended_sp of the sender frame. While working on this fix I realized the issue is not only when the sender is a native nmethod but with all frames associated with a CodeBlob with a frame size > 0 (runtime stub, safepoint stub, etc) so the check takes that into account. I also made a small fix to next_frame() since these mentioned frames should also use frame::sender().

I created a new test to verify that walking the stack over a native nmethod or runtime stub now works okay. I'll try to add a reliable test case for walking over a safepoint stub too. I tested the fix by running the new test and also running tiers1-4 in mach5. I'll run the upper tiers too.

Thanks,
Patricio


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15972/head:pull/15972
$ git checkout pull/15972

Update a local copy of the PR:
$ git checkout pull/15972
$ git pull https://git.openjdk.org/jdk.git pull/15972/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15972

View PR using the GUI difftool:
$ git pr show -t 15972

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15972.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 28, 2023

👋 Welcome back pchilanomate! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Sep 28, 2023

@pchilano The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 28, 2023
@pchilano pchilano marked this pull request as ready for review September 28, 2023 21:49
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 28, 2023
@mlbridge
Copy link

mlbridge bot commented Sep 28, 2023

Webrevs

* @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64"
* @requires os.family != "windows"
* @library /test/lib
* @run main/othervm StackWalkNativeToJava
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be driver instead of main/othervm here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

commands.addAll(Arrays.asList(extraFlags));
commands.add("StackWalkNativeToJava$TestNativeToJavaNative");

ProcessBuilder pb = ProcessTools.createJavaProcessBuilder(commands);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test ignores any external VM flags. Pleas add
@requires vm.flagless
to the test header to don't run this test with any additional VM flags

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


public void callNativeMethod() throws Exception {
Object obj = new Object();
obj.wait();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, the aim here is to call a native method that will complete by throwing an exception, so you can abort the VM. A comment to that affect would be good. Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

test.callVMMethod();
}

public void callVMMethod() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again a comment outlining how you expect this to abort the VM would be good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -431,7 +431,7 @@ static frame next_frame(frame fr, Thread* t) {
if (!t->is_in_full_stack((address)(fr.real_fp() + 1))) {
return invalid;
}
if (fr.is_java_frame() || fr.is_native_frame() || fr.is_runtime_frame()) {
if (fr.is_interpreted_frame() || (fr.cb() != nullptr && fr.cb()->frame_size() > 0)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the fix is unclear to me. How do the old conditions relate to the new ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second part of the condition includes the previous checks for is_compiled_frame(), is_native_frame(), is_runtime_frame() plus any other frame that would use sender_for_compiled_frame() when calling frame::sender(), like the safepoint stub.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what is left that is not covered by that condition? Just wondering if there is a simpler form that makes it somewhat clearer what this actually tests for as the old condition was easily understandable and the new one is obscure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entry frame and upcall stub frame cases, which do not expect a call to frame::sender() unless there are more frames to walk (see those respective cases in frame::sender_raw()). The native frame case. And then I guess all cases where we crash in some stub routine where the pc belongs to the CodeCache but doesn't match any particular frame (interpreter, compiled, runtime stub, etc), like if we crash in the generate_cont_thaw() stub.

CodeBlob* cb = CodeCache::find_blob(pc);
bool use_codeblob = cb != nullptr && cb->frame_size() > 0;
assert(!use_codeblob || !Interpreter::contains(pc), "should not be an interpreter frame");
intptr_t* sender_sp = use_codeblob ? (fr->link() + frame::metadata_words - cb->frame_size()) : fr->link();
Copy link
Contributor

@theRealAph theRealAph Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this assuming that, if the caller is a native frame, the current FP will point to the lowest word in the caller's stack frame?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a native frame the current FP would point to the current's frame lowest address. The value stored there would be the sender's FP. If the sender is also a native frame, then that value would just point to the lowest address of that frame. If the sender is a frame associated with some CodeBlob and we know its size then the sender's FP would point two words below the highest address of that frame (unless the sender's FP value is wrong but then _unextended_sp would be wrong anyways if we set it to the sender's FP as the old code) and we can calculate the actual _unextended_sp so that when getting the sender of that frame we don't crash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a native frame the current FP would point to the current's frame lowest address. The value stored there would be the sender's FP.

Yes to both of those.

If the sender is also a native frame, then that value would just point to the lowest address of that frame.

Maybe. I think that's the way GCC works, but not the ABI. All the ABI guarantees is that the frame pointers form a chain. And I don't know that GCC always does it this way, e.g. with variable-size arrays. At least this needs a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, is your concern about setting the right _unextended_sp/_sp value when the caller is also a native frame? If that's the case then yes we would need to know where the frame records are stored to know what to do. I guess the assumption was already that they were stored at the lowest address and that's why we pass fr->link() for the sp when constructing the sender frame. But in any case that sp value is not used to get the sender so even if the value is not correct we won't crash. If that was your concern I can add a comment for that.

Copy link
Contributor

@theRealAph theRealAph Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I get it now. The sender_sp may or may not really be the actual SP in a native frame (although with current GCC it is) but we do know that there is a chain of {frame pointer, PC} pairs. If we find a PC that is in a code buffer somewhere in that chain, we can find the size of the corresponding Java stack frame, and by subtraction the "real" SP of that Java frame.

So why is MacOS different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I get it now. The sender_sp may or may not really be the actual SP in a native frame (although with current GCC it is) but we do know that there is a chain of {frame pointer, PC} pairs. If we find a PC that is in a code buffer somewhere in that chain, we can find the size of the corresponding Java stack frame, and by subtraction the "real" SP of that Java frame.

Exactly.

So why is MacOS different?

Clang saves the frame records at the bottom of the frame (highest address), so using fr->sender_sp() works fine there. I can change it to have the same fix as gcc if we don't want to rely on that assumption. The only reason why I went with that simpler fix is that I think knowing that the sender sp is always two words above the current rfp would allow to walk the stack in some cases whereas with the other fix we would crash. Like if a frame passes the os::is_first_C_frame() check but fr->link() is not really the sender's rfp, doing rfp + 2 would still give a valid sender sp, whereas with the other calculation it would set a wrong value. But maybe that's very unlikely. I still added a new test that will fail if the location of the frame records change. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why is MacOS different?

Clang saves the frame records at the bottom of the frame (highest address), so using fr->sender_sp() works fine there.

Huh, so it does. I never knew that.

I can change it to have the same fix as gcc if we don't want to rely on that assumption. The only reason why I went with that simpler fix is that I think knowing that the sender sp is always two words above the current rfp would allow to walk the stack in some cases whereas with the other fix we would crash.

I guess so, but clang might change that tomorrow. But OK, from what you say it makes sense.

Like if a frame passes the os::is_first_C_frame() check but fr->link() is not really the sender's rfp,

We're confident of two things: the frame pointers are a continuous chain through foreign code, and we can unwind frames we create ourselves in the VM. As to where exactly the frame pointer chain is in the stack frame, there are no guarantees: it might be in the middle, and indeed it is in the middle if a function has any local variables that are variable-sized arrays.

doing rfp + 2 would still give a valid sender sp, whereas with the other calculation it would set a wrong value. But maybe that's very unlikely. I still added a new test that will fail if the location of the frame records change. What do you think?

That's a good idea.

Depending on the internal details of some other open source project is pathological coupling. At the best this is a code smell. Having said that, to fix it properly we'd have to use an unwinder library, and that's a much bigger project. So, OK, I guess this patch is an improvement.

One final thing, though. I'm looking at jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/aarch64/AARCH64Frame.java and I see AARCH64Frame::sender

    if (cb != null) {
      return senderForCompiledFrame(map, cb);
    }

    // Must be native-compiled frame, i.e. the marshaling code for native
    // methods that exists in the core system.
    return new AARCH64Frame(getSenderSP(), getLink(), getSenderPC());

We try to keep the agent code and the HotSpot frame code in step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB, I'm not suggesting you should fix AARCH64Frame.java in this, just noting that this looks like the same bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's also the same code we have in frame::sender_raw(). I'm not sure how we can get to that native frame case when we use frame::sender() though. So if we start from the last Java frame we shouldn't find a native frame in the chain. And when we start from os::current_frame() we use VMError::print_native_stack() which uses os::get_sender_for_C_frame() for native frames. The only case I see is the debug utility ps(), which in case there is no last Java frame, creates a frame with os::current_frame() and then uses frame::sender(). So seems either we should change that last line in sender_raw() to be os::get_sender_for_C_frame() or replace it with an assert(false, "") and fix ps().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's also the same code we have in frame::sender_raw(). I'm not sure how we can get to that native frame case when we use frame::sender() though. So if we start from the last Java frame we shouldn't find a native frame in the chain. And when we start from os::current_frame() we use VMError::print_native_stack() which uses os::get_sender_for_C_frame() for native frames. The only case I see is the debug utility ps(), which in case there is no last Java frame, creates a frame with os::current_frame() and then uses frame::sender(). So seems either we should change that last line in sender_raw() to be os::get_sender_for_C_frame() or replace it with an assert(false, "") and fix ps().

Yes, I see. OK for now, thanks.

Copy link
Member

@lmesnik lmesnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. The test looks good for me.

@openjdk
Copy link

openjdk bot commented Sep 29, 2023

@pchilano This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame

Reviewed-by: lmesnik, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 79 new commits pushed to the master branch:

  • 3bcfac1: 8317246: Cleanup java.net.URLEncoder and URLDecoder use of file.encoding property
  • b6a97c0: 8316880: AArch64: "stop: Header is not fast-locked" with -XX:-UseLSE since JDK-8315880
  • 287b243: 8316893: Compile without -fno-delete-null-pointer-checks
  • 26c21f5: 8314294: Unsafe::allocateMemory and Unsafe::freeMemory are slower than malloc/free
  • 6e1aacd: 8296631: NSS tests failing on OL9 linux-aarch64 hosts
  • d2e2c4c: 8309667: TLS handshake fails because of ConcurrentModificationException in PKCS12KeyStore.engineGetEntry
  • e25121d: 8316929: Shenandoah: Shenandoah degenerated GC and full GC need to cleanup old OopMapCache entries
  • 5c8366e: 8268622: Performance issues in javac Name class
  • ad81abd: 8317034: Remove redundant type cast in the java.util.stream package
  • d7d1d42: 8316771: Krb5.java has not defined messages for all error codes
  • ... and 69 more: https://git.openjdk.org/jdk/compare/2f311d59dcbbf7605e52fac0b8ebd35d7d51a48b...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 29, 2023
@pchilano
Copy link
Contributor Author

pchilano commented Oct 9, 2023

Thanks for the reviews @lmesnik and @theRealAph.

@pchilano
Copy link
Contributor Author

pchilano commented Oct 9, 2023

@dholmes-ora are you okay with the last version?

@dholmes-ora
Copy link
Member

@pchilano sorry you were waiting for me. I'm not familiar enough with the Aarch64 code to Review it. My comments were just in passing on the shared code.

@pchilano
Copy link
Contributor Author

@pchilano sorry you were waiting for me. I'm not familiar enough with the Aarch64 code to Review it. My comments were just in passing on the shared code.

Ok, thanks for taking a look anyways.

@pchilano
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Oct 16, 2023

Going to push as commit 2d38495.
Since your change was applied there have been 228 commits pushed to the master branch:

  • 0d09168: 8312527: (ch) Re-examine use of sun.nio.ch.Invoker.myGroupAndInvokeCount
  • 1a7fd5d: 8317687: (fs) FileStore.supportsFileAttributeView("posix") incorrectly returns 'true' for FAT32 volume on macOS
  • eb7d972: 8308659: Use CSS scroll-margin instead of flexbox layout in API documentation
  • 7028fb9: 8317975: [JVMCI] assert(pointee != nullptr) failed: invariant
  • 36993ae: 8316918: Optimize conversions duplicated across phi nodes
  • 668d4b0: 8318154: Improve stability of WheelModifier.java test
  • a36eaf0: 8317112: Add screenshot for Frame/DefaultSizeTest.java
  • a27fc7e: 8317994: Serial: Use SerialHeap in generation
  • 37eb986: 8154846: SwingNode does not resize when content size constraints are changed
  • 37aed6f: 8315362: NMT: summary diff reports threads count incorrectly
  • ... and 218 more: https://git.openjdk.org/jdk/compare/2f311d59dcbbf7605e52fac0b8ebd35d7d51a48b...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 16, 2023
@openjdk openjdk bot closed this Oct 16, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 16, 2023
@openjdk
Copy link

openjdk bot commented Oct 16, 2023

@pchilano Pushed as commit 2d38495.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
4 participants