-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8316309: AArch64: VMError::print_native_stack() crashes on Java native method frame #15972
Conversation
👋 Welcome back pchilanomate! A progress list of the required criteria for merging this PR into |
Webrevs
|
* @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" | ||
* @requires os.family != "windows" | ||
* @library /test/lib | ||
* @run main/othervm StackWalkNativeToJava |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be driver instead of main/othervm here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
commands.addAll(Arrays.asList(extraFlags)); | ||
commands.add("StackWalkNativeToJava$TestNativeToJavaNative"); | ||
|
||
ProcessBuilder pb = ProcessTools.createJavaProcessBuilder(commands); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test ignores any external VM flags. Pleas add
@requires vm.flagless
to the test header to don't run this test with any additional VM flags
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
public void callNativeMethod() throws Exception { | ||
Object obj = new Object(); | ||
obj.wait(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, the aim here is to call a native method that will complete by throwing an exception, so you can abort the VM. A comment to that affect would be good. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
test.callVMMethod(); | ||
} | ||
|
||
public void callVMMethod() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again a comment outlining how you expect this to abort the VM would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -431,7 +431,7 @@ static frame next_frame(frame fr, Thread* t) { | |||
if (!t->is_in_full_stack((address)(fr.real_fp() + 1))) { | |||
return invalid; | |||
} | |||
if (fr.is_java_frame() || fr.is_native_frame() || fr.is_runtime_frame()) { | |||
if (fr.is_interpreted_frame() || (fr.cb() != nullptr && fr.cb()->frame_size() > 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the fix is unclear to me. How do the old conditions relate to the new ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second part of the condition includes the previous checks for is_compiled_frame(), is_native_frame(), is_runtime_frame() plus any other frame that would use sender_for_compiled_frame() when calling frame::sender(), like the safepoint stub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what is left that is not covered by that condition? Just wondering if there is a simpler form that makes it somewhat clearer what this actually tests for as the old condition was easily understandable and the new one is obscure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entry frame and upcall stub frame cases, which do not expect a call to frame::sender() unless there are more frames to walk (see those respective cases in frame::sender_raw()). The native frame case. And then I guess all cases where we crash in some stub routine where the pc belongs to the CodeCache but doesn't match any particular frame (interpreter, compiled, runtime stub, etc), like if we crash in the generate_cont_thaw() stub.
CodeBlob* cb = CodeCache::find_blob(pc); | ||
bool use_codeblob = cb != nullptr && cb->frame_size() > 0; | ||
assert(!use_codeblob || !Interpreter::contains(pc), "should not be an interpreter frame"); | ||
intptr_t* sender_sp = use_codeblob ? (fr->link() + frame::metadata_words - cb->frame_size()) : fr->link(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assuming that, if the caller is a native frame, the current FP will point to the lowest word in the caller's stack frame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a native frame the current FP would point to the current's frame lowest address. The value stored there would be the sender's FP. If the sender is also a native frame, then that value would just point to the lowest address of that frame. If the sender is a frame associated with some CodeBlob and we know its size then the sender's FP would point two words below the highest address of that frame (unless the sender's FP value is wrong but then _unextended_sp would be wrong anyways if we set it to the sender's FP as the old code) and we can calculate the actual _unextended_sp so that when getting the sender of that frame we don't crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a native frame the current FP would point to the current's frame lowest address. The value stored there would be the sender's FP.
Yes to both of those.
If the sender is also a native frame, then that value would just point to the lowest address of that frame.
Maybe. I think that's the way GCC works, but not the ABI. All the ABI guarantees is that the frame pointers form a chain. And I don't know that GCC always does it this way, e.g. with variable-size arrays. At least this needs a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, is your concern about setting the right _unextended_sp/_sp value when the caller is also a native frame? If that's the case then yes we would need to know where the frame records are stored to know what to do. I guess the assumption was already that they were stored at the lowest address and that's why we pass fr->link() for the sp when constructing the sender frame. But in any case that sp value is not used to get the sender so even if the value is not correct we won't crash. If that was your concern I can add a comment for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I get it now. The sender_sp
may or may not really be the actual SP in a native frame (although with current GCC it is) but we do know that there is a chain of {frame pointer, PC} pairs. If we find a PC that is in a code buffer somewhere in that chain, we can find the size of the corresponding Java stack frame, and by subtraction the "real" SP of that Java frame.
So why is MacOS different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I get it now. The sender_sp may or may not really be the actual SP in a native frame (although with current GCC it is) but we do know that there is a chain of {frame pointer, PC} pairs. If we find a PC that is in a code buffer somewhere in that chain, we can find the size of the corresponding Java stack frame, and by subtraction the "real" SP of that Java frame.
Exactly.
So why is MacOS different?
Clang saves the frame records at the bottom of the frame (highest address), so using fr->sender_sp() works fine there. I can change it to have the same fix as gcc if we don't want to rely on that assumption. The only reason why I went with that simpler fix is that I think knowing that the sender sp is always two words above the current rfp would allow to walk the stack in some cases whereas with the other fix we would crash. Like if a frame passes the os::is_first_C_frame() check but fr->link() is not really the sender's rfp, doing rfp + 2 would still give a valid sender sp, whereas with the other calculation it would set a wrong value. But maybe that's very unlikely. I still added a new test that will fail if the location of the frame records change. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So why is MacOS different?
Clang saves the frame records at the bottom of the frame (highest address), so using fr->sender_sp() works fine there.
Huh, so it does. I never knew that.
I can change it to have the same fix as gcc if we don't want to rely on that assumption. The only reason why I went with that simpler fix is that I think knowing that the sender sp is always two words above the current rfp would allow to walk the stack in some cases whereas with the other fix we would crash.
I guess so, but clang might change that tomorrow. But OK, from what you say it makes sense.
Like if a frame passes the os::is_first_C_frame() check but fr->link() is not really the sender's rfp,
We're confident of two things: the frame pointers are a continuous chain through foreign code, and we can unwind frames we create ourselves in the VM. As to where exactly the frame pointer chain is in the stack frame, there are no guarantees: it might be in the middle, and indeed it is in the middle if a function has any local variables that are variable-sized arrays.
doing rfp + 2 would still give a valid sender sp, whereas with the other calculation it would set a wrong value. But maybe that's very unlikely. I still added a new test that will fail if the location of the frame records change. What do you think?
That's a good idea.
Depending on the internal details of some other open source project is pathological coupling. At the best this is a code smell. Having said that, to fix it properly we'd have to use an unwinder library, and that's a much bigger project. So, OK, I guess this patch is an improvement.
One final thing, though. I'm looking at jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/aarch64/AARCH64Frame.java
and I see AARCH64Frame::sender
if (cb != null) {
return senderForCompiledFrame(map, cb);
}
// Must be native-compiled frame, i.e. the marshaling code for native
// methods that exists in the core system.
return new AARCH64Frame(getSenderSP(), getLink(), getSenderPC());
We try to keep the agent code and the HotSpot frame code in step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB, I'm not suggesting you should fix AARCH64Frame.java in this, just noting that this looks like the same bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's also the same code we have in frame::sender_raw(). I'm not sure how we can get to that native frame case when we use frame::sender() though. So if we start from the last Java frame we shouldn't find a native frame in the chain. And when we start from os::current_frame() we use VMError::print_native_stack() which uses os::get_sender_for_C_frame() for native frames. The only case I see is the debug utility ps(), which in case there is no last Java frame, creates a frame with os::current_frame() and then uses frame::sender(). So seems either we should change that last line in sender_raw() to be os::get_sender_for_C_frame() or replace it with an assert(false, "") and fix ps().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's also the same code we have in frame::sender_raw(). I'm not sure how we can get to that native frame case when we use frame::sender() though. So if we start from the last Java frame we shouldn't find a native frame in the chain. And when we start from os::current_frame() we use VMError::print_native_stack() which uses os::get_sender_for_C_frame() for native frames. The only case I see is the debug utility ps(), which in case there is no last Java frame, creates a frame with os::current_frame() and then uses frame::sender(). So seems either we should change that last line in sender_raw() to be os::get_sender_for_C_frame() or replace it with an assert(false, "") and fix ps().
Yes, I see. OK for now, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes. The test looks good for me.
@pchilano This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 79 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
Thanks for the reviews @lmesnik and @theRealAph. |
@dholmes-ora are you okay with the last version? |
@pchilano sorry you were waiting for me. I'm not familiar enough with the Aarch64 code to Review it. My comments were just in passing on the shared code. |
Ok, thanks for taking a look anyways. |
/integrate |
Going to push as commit 2d38495.
Your commit was automatically rebased without conflicts. |
Please review the following patch. As explained in the bug comments the problem is that os::get_sender_for_C_frame() always constructs a frame as if the sender is also a native C/C++ frame. Setting a correct value for _unextended_sp is important to avoid crashes if this value is later used to get that frame's caller, which will happen if we end up calling frame::sender_for_compiled_frame().
The issue exists on aarch64 for both linux and macos but the fix for linux is different. The "Procedure Call Standard for the Arm 64-bit Architecture" doesn't specify a location for the frame record within a stack frame (6.4.6), and gcc happens to choose to save it the top of the frame (lowest address) rather than the bottom. This means that changing fr->link() for fr->sender_sp() won't work. The fix is to use the value of fr->link() but adjusted using the code blob frame size before setting it as the _unextended_sp of the sender frame. While working on this fix I realized the issue is not only when the sender is a native nmethod but with all frames associated with a CodeBlob with a frame size > 0 (runtime stub, safepoint stub, etc) so the check takes that into account. I also made a small fix to next_frame() since these mentioned frames should also use frame::sender().
I created a new test to verify that walking the stack over a native nmethod or runtime stub now works okay. I'll try to add a reliable test case for walking over a safepoint stub too. I tested the fix by running the new test and also running tiers1-4 in mach5. I'll run the upper tiers too.
Thanks,
Patricio
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15972/head:pull/15972
$ git checkout pull/15972
Update a local copy of the PR:
$ git checkout pull/15972
$ git pull https://git.openjdk.org/jdk.git pull/15972/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 15972
View PR using the GUI difftool:
$ git pr show -t 15972
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15972.diff
Webrev
Link to Webrev Comment