8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame #15972

pchilano · 2023-09-28T21:07:09Z

Please review the following patch. As explained in the bug comments the problem is that os::get_sender_for_C_frame() always constructs a frame as if the sender is also a native C/C++ frame. Setting a correct value for _unextended_sp is important to avoid crashes if this value is later used to get that frame's caller, which will happen if we end up calling frame::sender_for_compiled_frame().

The issue exists on aarch64 for both linux and macos but the fix for linux is different. The "Procedure Call Standard for the Arm 64-bit Architecture" doesn't specify a location for the frame record within a stack frame (6.4.6), and gcc happens to choose to save it the top of the frame (lowest address) rather than the bottom. This means that changing fr->link() for fr->sender_sp() won't work. The fix is to use the value of fr->link() but adjusted using the code blob frame size before setting it as the _unextended_sp of the sender frame. While working on this fix I realized the issue is not only when the sender is a native nmethod but with all frames associated with a CodeBlob with a frame size > 0 (runtime stub, safepoint stub, etc) so the check takes that into account. I also made a small fix to next_frame() since these mentioned frames should also use frame::sender().

I created a new test to verify that walking the stack over a native nmethod or runtime stub now works okay. I'll try to add a reliable test case for walking over a safepoint stub too. I tested the fix by running the new test and also running tiers1-4 in mach5. I'll run the upper tiers too.

Thanks,
Patricio

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame (Bug - P4)

Reviewers

Leonid Mesnik (@lmesnik - Reviewer)
Andrew Haley (@theRealAph - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15972/head:pull/15972
$ git checkout pull/15972

Update a local copy of the PR:
$ git checkout pull/15972
$ git pull https://git.openjdk.org/jdk.git pull/15972/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15972

View PR using the GUI difftool:
$ git pr show -t 15972

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15972.diff

Webrev

Link to Webrev Comment

bridgekeeper · 2023-09-28T21:08:16Z

👋 Welcome back pchilanomate! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2023-09-28T21:10:24Z

@pchilano The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2023-09-28T21:54:40Z

Webrevs

01: Full - Incremental (8757025e)
00: Full (eceeed15)

lmesnik · 2023-09-29T03:49:50Z

test/hotspot/jtreg/runtime/ErrorHandling/StackWalkNativeToJava.java

+ * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64"
+ * @requires os.family != "windows"
+ * @library /test/lib
+ * @run main/othervm StackWalkNativeToJava


I think it should be driver instead of main/othervm here.

lmesnik · 2023-09-29T03:53:17Z

test/hotspot/jtreg/runtime/ErrorHandling/StackWalkNativeToJava.java

+        commands.addAll(Arrays.asList(extraFlags));
+        commands.add("StackWalkNativeToJava$TestNativeToJavaNative");
+
+        ProcessBuilder pb = ProcessTools.createJavaProcessBuilder(commands);


The test ignores any external VM flags. Pleas add
@requires vm.flagless
to the test header to don't run this test with any additional VM flags

dholmes-ora · 2023-09-29T04:26:49Z

test/hotspot/jtreg/runtime/ErrorHandling/StackWalkNativeToJava.java

+
+        public void callNativeMethod() throws Exception {
+            Object obj = new Object();
+            obj.wait();


Just to be clear, the aim here is to call a native method that will complete by throwing an exception, so you can abort the VM. A comment to that affect would be good. Thanks

dholmes-ora · 2023-09-29T04:27:42Z

test/hotspot/jtreg/runtime/ErrorHandling/StackWalkNativeToJava.java

+            test.callVMMethod();
+        }
+
+        public void callVMMethod() throws Exception {


Again a comment outlining how you expect this to abort the VM would be good.

dholmes-ora · 2023-09-29T04:29:34Z

src/hotspot/share/utilities/vmError.cpp

@@ -431,7 +431,7 @@ static frame next_frame(frame fr, Thread* t) {
    if (!t->is_in_full_stack((address)(fr.real_fp() + 1))) {
      return invalid;
    }
-    if (fr.is_java_frame() || fr.is_native_frame() || fr.is_runtime_frame()) {
+    if (fr.is_interpreted_frame() || (fr.cb() != nullptr && fr.cb()->frame_size() > 0)) {


This part of the fix is unclear to me. How do the old conditions relate to the new ones?

The second part of the condition includes the previous checks for is_compiled_frame(), is_native_frame(), is_runtime_frame() plus any other frame that would use sender_for_compiled_frame() when calling frame::sender(), like the safepoint stub.

So what is left that is not covered by that condition? Just wondering if there is a simpler form that makes it somewhat clearer what this actually tests for as the old condition was easily understandable and the new one is obscure.

The entry frame and upcall stub frame cases, which do not expect a call to frame::sender() unless there are more frames to walk (see those respective cases in frame::sender_raw()). The native frame case. And then I guess all cases where we crash in some stub routine where the pc belongs to the CodeCache but doesn't match any particular frame (interpreter, compiled, runtime stub, etc), like if we crash in the generate_cont_thaw() stub.

theRealAph · 2023-09-29T13:14:28Z

src/hotspot/os_cpu/linux_aarch64/os_linux_aarch64.cpp

+  CodeBlob* cb = CodeCache::find_blob(pc);
+  bool use_codeblob = cb != nullptr && cb->frame_size() > 0;
+  assert(!use_codeblob || !Interpreter::contains(pc), "should not be an interpreter frame");
+  intptr_t* sender_sp = use_codeblob ? (fr->link() + frame::metadata_words - cb->frame_size()) : fr->link();


Is this assuming that, if the caller is a native frame, the current FP will point to the lowest word in the caller's stack frame?

If this is a native frame the current FP would point to the current's frame lowest address. The value stored there would be the sender's FP. If the sender is also a native frame, then that value would just point to the lowest address of that frame. If the sender is a frame associated with some CodeBlob and we know its size then the sender's FP would point two words below the highest address of that frame (unless the sender's FP value is wrong but then _unextended_sp would be wrong anyways if we set it to the sender's FP as the old code) and we can calculate the actual _unextended_sp so that when getting the sender of that frame we don't crash.

If this is a native frame the current FP would point to the current's frame lowest address. The value stored there would be the sender's FP.

Yes to both of those.

If the sender is also a native frame, then that value would just point to the lowest address of that frame.

Maybe. I think that's the way GCC works, but not the ABI. All the ABI guarantees is that the frame pointers form a chain. And I don't know that GCC always does it this way, e.g. with variable-size arrays. At least this needs a comment.

Ah, is your concern about setting the right _unextended_sp/_sp value when the caller is also a native frame? If that's the case then yes we would need to know where the frame records are stored to know what to do. I guess the assumption was already that they were stored at the lowest address and that's why we pass fr->link() for the sp when constructing the sender frame. But in any case that sp value is not used to get the sender so even if the value is not correct we won't crash. If that was your concern I can add a comment for that.

OK, I think I get it now. The sender_sp may or may not really be the actual SP in a native frame (although with current GCC it is) but we do know that there is a chain of {frame pointer, PC} pairs. If we find a PC that is in a code buffer somewhere in that chain, we can find the size of the corresponding Java stack frame, and by subtraction the "real" SP of that Java frame.

So why is MacOS different?

OK, I think I get it now. The sender_sp may or may not really be the actual SP in a native frame (although with current GCC it is) but we do know that there is a chain of {frame pointer, PC} pairs. If we find a PC that is in a code buffer somewhere in that chain, we can find the size of the corresponding Java stack frame, and by subtraction the "real" SP of that Java frame.

Exactly.

So why is MacOS different?

Clang saves the frame records at the bottom of the frame (highest address), so using fr->sender_sp() works fine there. I can change it to have the same fix as gcc if we don't want to rely on that assumption. The only reason why I went with that simpler fix is that I think knowing that the sender sp is always two words above the current rfp would allow to walk the stack in some cases whereas with the other fix we would crash. Like if a frame passes the os::is_first_C_frame() check but fr->link() is not really the sender's rfp, doing rfp + 2 would still give a valid sender sp, whereas with the other calculation it would set a wrong value. But maybe that's very unlikely. I still added a new test that will fail if the location of the frame records change. What do you think?

So why is MacOS different?

Clang saves the frame records at the bottom of the frame (highest address), so using fr->sender_sp() works fine there.

Huh, so it does. I never knew that.

I can change it to have the same fix as gcc if we don't want to rely on that assumption. The only reason why I went with that simpler fix is that I think knowing that the sender sp is always two words above the current rfp would allow to walk the stack in some cases whereas with the other fix we would crash.

I guess so, but clang might change that tomorrow. But OK, from what you say it makes sense.

Like if a frame passes the os::is_first_C_frame() check but fr->link() is not really the sender's rfp,

We're confident of two things: the frame pointers are a continuous chain through foreign code, and we can unwind frames we create ourselves in the VM. As to where exactly the frame pointer chain is in the stack frame, there are no guarantees: it might be in the middle, and indeed it is in the middle if a function has any local variables that are variable-sized arrays.

doing rfp + 2 would still give a valid sender sp, whereas with the other calculation it would set a wrong value. But maybe that's very unlikely. I still added a new test that will fail if the location of the frame records change. What do you think?

That's a good idea.

Depending on the internal details of some other open source project is pathological coupling. At the best this is a code smell. Having said that, to fix it properly we'd have to use an unwinder library, and that's a much bigger project. So, OK, I guess this patch is an improvement.

One final thing, though. I'm looking at jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/aarch64/AARCH64Frame.java and I see AARCH64Frame::sender

if (cb != null) { return senderForCompiledFrame(map, cb); } // Must be native-compiled frame, i.e. the marshaling code for native // methods that exists in the core system. return new AARCH64Frame(getSenderSP(), getLink(), getSenderPC());

We try to keep the agent code and the HotSpot frame code in step.

NB, I'm not suggesting you should fix AARCH64Frame.java in this, just noting that this looks like the same bug.

Yes, it's also the same code we have in frame::sender_raw(). I'm not sure how we can get to that native frame case when we use frame::sender() though. So if we start from the last Java frame we shouldn't find a native frame in the chain. And when we start from os::current_frame() we use VMError::print_native_stack() which uses os::get_sender_for_C_frame() for native frames. The only case I see is the debug utility ps(), which in case there is no last Java frame, creates a frame with os::current_frame() and then uses frame::sender(). So seems either we should change that last line in sender_raw() to be os::get_sender_for_C_frame() or replace it with an assert(false, "") and fix ps().

Yes, it's also the same code we have in frame::sender_raw(). I'm not sure how we can get to that native frame case when we use frame::sender() though. So if we start from the last Java frame we shouldn't find a native frame in the chain. And when we start from os::current_frame() we use VMError::print_native_stack() which uses os::get_sender_for_C_frame() for native frames. The only case I see is the debug utility ps(), which in case there is no last Java frame, creates a frame with os::current_frame() and then uses frame::sender(). So seems either we should change that last line in sender_raw() to be os::get_sender_for_C_frame() or replace it with an assert(false, "") and fix ps().

Yes, I see. OK for now, thanks.

lmesnik

Thanks for the changes. The test looks good for me.

openjdk · 2023-09-29T18:44:19Z

@pchilano This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame

Reviewed-by: lmesnik, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 79 new commits pushed to the master branch:

3bcfac1: 8317246: Cleanup java.net.URLEncoder and URLDecoder use of file.encoding property
b6a97c0: 8316880: AArch64: "stop: Header is not fast-locked" with -XX:-UseLSE since JDK-8315880
287b243: 8316893: Compile without -fno-delete-null-pointer-checks
26c21f5: 8314294: Unsafe::allocateMemory and Unsafe::freeMemory are slower than malloc/free
6e1aacd: 8296631: NSS tests failing on OL9 linux-aarch64 hosts
d2e2c4c: 8309667: TLS handshake fails because of ConcurrentModificationException in PKCS12KeyStore.engineGetEntry
e25121d: 8316929: Shenandoah: Shenandoah degenerated GC and full GC need to cleanup old OopMapCache entries
5c8366e: 8268622: Performance issues in javac Name class
ad81abd: 8317034: Remove redundant type cast in the java.util.stream package
d7d1d42: 8316771: Krb5.java has not defined messages for all error codes
... and 69 more: https://git.openjdk.org/jdk/compare/2f311d59dcbbf7605e52fac0b8ebd35d7d51a48b...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

pchilano · 2023-10-09T18:37:55Z

Thanks for the reviews @lmesnik and @theRealAph.

pchilano · 2023-10-09T18:38:24Z

@dholmes-ora are you okay with the last version?

dholmes-ora · 2023-10-15T23:15:02Z

@pchilano sorry you were waiting for me. I'm not familiar enough with the Aarch64 code to Review it. My comments were just in passing on the shared code.

pchilano · 2023-10-16T16:17:02Z

@pchilano sorry you were waiting for me. I'm not familiar enough with the Aarch64 code to Review it. My comments were just in passing on the shared code.

Ok, thanks for taking a look anyways.

pchilano · 2023-10-16T16:17:12Z

/integrate

openjdk · 2023-10-16T16:23:13Z

Going to push as commit 2d38495.
Since your change was applied there have been 228 commits pushed to the master branch:

0d09168: 8312527: (ch) Re-examine use of sun.nio.ch.Invoker.myGroupAndInvokeCount
1a7fd5d: 8317687: (fs) FileStore.supportsFileAttributeView("posix") incorrectly returns 'true' for FAT32 volume on macOS
eb7d972: 8308659: Use CSS scroll-margin instead of flexbox layout in API documentation
7028fb9: 8317975: [JVMCI] assert(pointee != nullptr) failed: invariant
36993ae: 8316918: Optimize conversions duplicated across phi nodes
668d4b0: 8318154: Improve stability of WheelModifier.java test
a36eaf0: 8317112: Add screenshot for Frame/DefaultSizeTest.java
a27fc7e: 8317994: Serial: Use SerialHeap in generation
37eb986: 8154846: SwingNode does not resize when content size constraints are changed
37aed6f: 8315362: NMT: summary diff reports threads count incorrectly
... and 218 more: https://git.openjdk.org/jdk/compare/2f311d59dcbbf7605e52fac0b8ebd35d7d51a48b...master

Your commit was automatically rebased without conflicts.

openjdk · 2023-10-16T16:23:44Z

@pchilano Pushed as commit 2d38495.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

v1

eceeed1

openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 28, 2023

pchilano marked this pull request as ready for review September 28, 2023 21:49

openjdk bot added the rfr Pull request is ready for review label Sep 28, 2023

lmesnik suggested changes Sep 29, 2023

View reviewed changes

dholmes-ora reviewed Sep 29, 2023

View reviewed changes

theRealAph reviewed Sep 29, 2023

View reviewed changes

pchilano added 2 commits September 29, 2023 10:32

use driver + @requires vm.flagless

a9ac61c

add comment to tests

8757025

lmesnik approved these changes Sep 29, 2023

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Sep 29, 2023

theRealAph approved these changes Oct 3, 2023

View reviewed changes

openjdk bot added the integrated Pull request has been integrated label Oct 16, 2023

openjdk bot closed this Oct 16, 2023

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame #15972

8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame #15972

pchilano commented Sep 28, 2023 •

edited by openjdk bot

Loading

bridgekeeper bot commented Sep 28, 2023

openjdk bot commented Sep 28, 2023

mlbridge bot commented Sep 28, 2023 •

edited

Loading

lmesnik Sep 29, 2023

pchilano Sep 29, 2023

lmesnik Sep 29, 2023

pchilano Sep 29, 2023

dholmes-ora Sep 29, 2023

pchilano Sep 29, 2023

dholmes-ora Sep 29, 2023

pchilano Sep 29, 2023

dholmes-ora Sep 29, 2023

pchilano Sep 29, 2023

dholmes-ora Oct 3, 2023

pchilano Oct 3, 2023

theRealAph Sep 29, 2023 •

edited

Loading

pchilano Sep 29, 2023

theRealAph Sep 29, 2023

pchilano Sep 29, 2023

theRealAph Oct 2, 2023 •

edited

Loading

pchilano Oct 2, 2023

theRealAph Oct 3, 2023

theRealAph Oct 3, 2023

pchilano Oct 3, 2023

theRealAph Oct 4, 2023

lmesnik left a comment

openjdk bot commented Sep 29, 2023 •

edited

Loading

pchilano commented Oct 9, 2023

pchilano commented Oct 9, 2023

dholmes-ora commented Oct 15, 2023

pchilano commented Oct 16, 2023

pchilano commented Oct 16, 2023

openjdk bot commented Oct 16, 2023

openjdk bot commented Oct 16, 2023

8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame #15972

8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame #15972

Conversation

pchilano commented Sep 28, 2023 • edited by openjdk bot Loading

Progress

Issue

Reviewers

Reviewing

Webrev

bridgekeeper bot commented Sep 28, 2023

openjdk bot commented Sep 28, 2023

mlbridge bot commented Sep 28, 2023 • edited Loading

Webrevs

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

theRealAph Sep 29, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

theRealAph Oct 2, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lmesnik left a comment

Choose a reason for hiding this comment

openjdk bot commented Sep 29, 2023 • edited Loading

pchilano commented Oct 9, 2023

pchilano commented Oct 9, 2023

dholmes-ora commented Oct 15, 2023

pchilano commented Oct 16, 2023

pchilano commented Oct 16, 2023

openjdk bot commented Oct 16, 2023

openjdk bot commented Oct 16, 2023

pchilano commented Sep 28, 2023 •

edited by openjdk bot

Loading

mlbridge bot commented Sep 28, 2023 •

edited

Loading

theRealAph Sep 29, 2023 •

edited

Loading

theRealAph Oct 2, 2023 •

edited

Loading

openjdk bot commented Sep 29, 2023 •

edited

Loading