8313796: AsyncGetCallTrace crash on unreadable interpreter method pointer #15178

Closed
wants to merge 3 commits into from

Conversation

richardstartin
Contributor

@richardstartin richardstartin commented Aug 7, 2023

We have observed invalid pointers to the interpreted method at Datadog. The fix is based on a discussion with and a code snippet from @parttimenerd.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8313796: AsyncGetCallTrace crash on unreadable interpreter method pointer (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15178/head:pull/15178
$ git checkout pull/15178

Update a local copy of the PR:
$ git checkout pull/15178
$ git pull https://git.openjdk.org/jdk.git pull/15178/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15178

View PR using the GUI difftool:
$ git pr show -t 15178

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15178.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Aug 7, 2023

👋 Welcome back richardstartin! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Aug 7, 2023

@richardstartin The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Aug 7, 2023
@richardstartin richardstartin force-pushed the JDK-8313796 branch 2 times, most recently from 366684e to d9c92a1 Compare August 7, 2023 13:57
@richardstartin richardstartin marked this pull request as ready for review August 7, 2023 17:00
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 7, 2023
@richardstartin richardstartin changed the title JDK-8313796: AsyncGetCallTrace crash on unreadable interpreter method pointer 8313796: AsyncGetCallTrace crash on unreadable interpreter method pointer Aug 7, 2023
@mlbridge

mlbridge bot commented Aug 7, 2023

Webrevs

if (m_addr == nullptr || !os::is_readable_pointer(m_addr)) {
  return false;
}
Method* m = *m_addr;

Contributor

Wouldn't it make more sense to define a function which takes a pointer to a (possible) method pointer and returns true if the method is valid?

Contributor Author

It's not clear to me how the proposed solution could make less sense than another solution - could you elaborate please? Are you concerned about code repetition? I must say this code is already rather repetitive across the different architectures and, while I don't want to make it any worse in that respect, I'm trying to make the smallest possible change to prevent the observed crash from recurring.

Comment on lines 510 to 514
Method** m_addr = interpreter_frame_method_addr();
if (m_addr == nullptr || !os::is_readable_pointer(m_addr)) {
  return false;
}
Method* m = *m_addr;
Contributor

Suggested change
- Method** m_addr = interpreter_frame_method_addr();
- if (m_addr == nullptr || !os::is_readable_pointer(m_addr)) {
-   return false;
- }
- Method* m = *m_addr;
+ Method* m_addr = interpreter_frame_method_addr();
+ if (m_addr == nullptr) {
+   return false;
+ }
+ Method* m = SafeFetchN(m_addr, nullptr);
+ if (m == nullptr) {
+   return false;
+ }

Reason: more robust against changes in memory map, faster.
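
As written, the suggested lines do not appear to compile against the `intptr_t SafeFetchN(intptr_t* address, intptr_t defaultval)` signature quoted later in this thread: `interpreter_frame_method_addr()` yields a `Method**`, and `SafeFetchN` works in `intptr_t`. A minimal sketch of the intended shape follows, assuming a stand-in `Method` type and an injected fetch function in place of the real HotSpot `SafeFetchN`. The names `read_frame_method`, `SafeFetchFn`, `plain_fetch`, and `faulting_fetch` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for HotSpot's Method; only its address matters in this sketch.
struct Method { int id; };

static_assert(sizeof(intptr_t) == sizeof(Method*), "pointer must fit in intptr_t");

// Same shape as the SafeFetchN declaration quoted in this thread:
// returns the word at adr, or errval if reading adr would fault.
using SafeFetchFn = intptr_t (*)(intptr_t* adr, intptr_t errval);

// Hypothetical helper: null-check the slot address, read the slot through the
// safe fetch, then null-check the Method* that was read.
inline bool read_frame_method(Method** m_addr, SafeFetchFn safe_fetch, Method** out) {
  assert(safe_fetch != nullptr && out != nullptr);
  if (m_addr == nullptr) {
    return false;  // no slot to read
  }
  intptr_t raw = safe_fetch(reinterpret_cast<intptr_t*>(m_addr), 0);
  Method* m = reinterpret_cast<Method*>(raw);
  if (m == nullptr) {
    return false;  // slot was unreadable, or it held a null Method*
  }
  *out = m;
  return true;
}

// Test doubles, not real safe fetches: one reads the word directly,
// the other simulates a fault by returning the default value.
inline intptr_t plain_fetch(intptr_t* adr, intptr_t /*errval*/) {
  intptr_t v;
  std::memcpy(&v, adr, sizeof v);
  return v;
}
inline intptr_t faulting_fetch(intptr_t* /*adr*/, intptr_t errval) {
  return errval;
}
```

Using null as the default value means an unreadable slot and a slot that holds null take the same rejection path, which is the behaviour the suggestion asks for. The result only says the slot was readable at the time of the fetch; it says nothing about whether the `Method` it points to is still valid.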

Contributor Author

Thanks for raising this. It doesn't look like SafeFetchN is async signal safe because it calls into this:

template <class T>
ATTRIBUTE_NO_ASAN static bool _SafeFetchXX_internal(const T *adr, T* result) {

  T n = 0;

  // Set up a jump buffer. Anchor its pointer in TLS. Then read from the unsafe address.
  // If that address was invalid, we fault, and in the signal handler we will jump back
  // to the jump point.
  sigjmp_buf jb;
  if (sigsetjmp(jb, 1) != 0) {
    // We faulted. Reset TLS slot, then return.
    pthread_setspecific(g_jmpbuf_key, nullptr);
    *result = 0;
    return false;
  }

  // Anchor jump buffer in TLS
  pthread_setspecific(g_jmpbuf_key, &jb);

  // unsafe access
  n = *adr;

  // Still here... All went well, adr was valid.
  // Reset TLS slot, then return result.
  pthread_setspecific(g_jmpbuf_key, nullptr);
  *result = n;

  return true;

}

This means we shouldn't be calling os::is_readable_pointer here either, because it calls SafeFetch32, which calls into the same function above. I had originally intended to perform only a null check here (because that's the condition we've actually observed) and will push a change to revert to the null check.

Contributor

> Thanks for raising this. It doesn't look like SafeFetchN is async signal safe because it calls into this:
>
> template <class T>
> ATTRIBUTE_NO_ASAN static bool _SafeFetchXX_internal(const T *adr, T* result) {
>
> This means we shouldn't be calling os::is_readable_pointer either here, because it calls SafeFetch32 which calls into the same function above. I had originally intended to simply perform a null check here (because that's the condition we've actually observed) and will push a change to revert to the null check.

That's not how SafeFetchN is normally implemented. On Linux and BSD it's more like

    # Support for intptr_t SafeFetchN(intptr_t* address, intptr_t defaultval);
    #
    #  x0 : address
    #  x1 : defaultval
SafeFetchN_impl:
_SafeFetchN_fault:
    ldr      x0, [x0]
    ret
_SafeFetchN_continuation:
    mov      x0, x1
    ret

which should be fine.
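
The stub can be this small because the load is the only instruction that may fault and it sits at a known label. A minimal sketch of the handler side, assuming the signal handler can read and rewrite the interrupted program counter: if the faulting pc equals the `_SafeFetchN_fault` label, execution resumes at `_SafeFetchN_continuation`, which returns the default value. The `FaultSite` type and `redirect_safefetch_fault` function are hypothetical names, and a real handler would take the pc from a platform `ucontext_t` rather than a plain integer.

```cpp
#include <cstddef>
#include <cstdint>

// One entry per stub: the address of the single instruction that may fault,
// and the address to resume at when it does.
struct FaultSite {
  uintptr_t fault_pc;
  uintptr_t continuation_pc;
};

// Returns true and rewrites *pc if the fault came from a known stub load.
// The body only compares and assigns integers: no locks, no allocation,
// no thread-local storage, no jump buffers.
inline bool redirect_safefetch_fault(uintptr_t* pc, const FaultSite* sites, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    if (*pc == sites[i].fault_pc) {
      *pc = sites[i].continuation_pc;
      return true;
    }
  }
  return false;  // not a stub fault; the caller falls through to normal crash handling
}
```

Compared with the `sigsetjmp`-based fallback quoted earlier in the thread, nothing has to be set up before the read, which is why the assembly variant is a better fit for code that itself runs inside a signal handler.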

Contributor

@tstuefe implemented it explicitly to be signal safe.

Contributor Author

OK, let's go with your suggestion, thanks for explaining. I'm skeptical that this can actually be a non-null bad pointer, as we've only seen this crash happen once, and the pointer was null in that instance. But this solution looks robust, so thanks for suggesting it.

Contributor

@fisk this might be solved in a new version of the API. I already spoke with @apangin about this and he's positive :)

Contributor Author

@tstuefe @fisk I hadn't appreciated that the cause was probably concurrent method unloading. We don't have a core dump, just the backtrace from the crash and the disassembly from objdump, so all I knew was that the pointer was null, not why. This is not the sort of thing that reproduces readily. I don't have as much context about the adjacent JVM mechanisms as others in this thread and am just trying to fix a crash based on the evidence I have.

The null pointer seems to be a symptom rather than a cause, and it doesn't appear there's anything we can do about concurrent method unloading interfering with AsyncGetCallTrace, so I wonder how worthwhile attempting to fix this is. On the one hand, it will sometimes still crash another way. On the other hand, the fix would shrink the window in which unloading a method can cause a crash to the subsequent uses of the pointer, whereas that window in AsyncGetCallTrace is currently the duration of the unwind preceding the current frame. Let me know what you think about proceeding, and I'll submit a fix with the null check, which would have been sufficient to avoid the observed segfault. Thanks.

Member

@tstuefe tstuefe Aug 8, 2023

@richardstartin I definitely think a patch in the proposed form (first check for null, then check again with SafeFetch) makes a lot of sense. It's not perfect, but it will reduce the chance of crashes happening. And it is very simple and backportable.

Contributor

Hardening the API is always a good idea, especially if it doesn't have a performance impact. We generally don't know in which state ASGCT is called. I added comparable checks at many places before.

Contributor Author

I prepared a new patch incorporating the suggestions made in this thread: #15193. That seems more straightforward to review, given the force push and how the discussion has gone round in circles.


- Method* m = *interpreter_frame_method_addr();
+ Method** m_addr = interpreter_frame_method_addr();
+ if (m_addr == nullptr || !os::is_readable_pointer(m_addr)) {

Just wondering, why check is_readable_pointer and then dereference, instead of using SafeFetch, which does both in one shot? Especially since os::is_readable_pointer is implemented with SafeFetch anyway.
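
One way to see the overlap: a readability probe built on a safe fetch has to read the address anyway. A minimal sketch follows, assuming an injected 32-bit safe fetch with the same contract as `SafeFetch32` (return the value at the address, or the supplied default on a fault). It probes twice with different defaults, because a single probe cannot tell a fault from memory that happens to hold the default. The names `is_readable_sketch`, `SafeFetch32Fn`, `plain_fetch32`, and `faulting_fetch32`, and the sentinel values, are illustrative assumptions and not a copy of `os::is_readable_pointer`.

```cpp
#include <cstdint>
#include <cstring>

// Contract assumed here: returns the 32-bit value at adr, or errval on a fault.
using SafeFetch32Fn = int32_t (*)(const int32_t* adr, int32_t errval);

inline bool is_readable_sketch(const void* p, SafeFetch32Fn safe_fetch32) {
  // Align down to a 4-byte boundary so the probe is a naturally aligned load.
  const uintptr_t aligned = reinterpret_cast<uintptr_t>(p) & ~static_cast<uintptr_t>(3);
  const int32_t* adr = reinterpret_cast<const int32_t*>(aligned);
  const int32_t sentinel_a = 0x0BADCAFE;
  const int32_t sentinel_b = 0x0DEFACED;
  // Readable if either probe returns something other than its own default.
  return safe_fetch32(adr, sentinel_a) != sentinel_a ||
         safe_fetch32(adr, sentinel_b) != sentinel_b;
}

// Test doubles, not real safe fetches: one reads directly, one simulates a fault.
inline int32_t plain_fetch32(const int32_t* adr, int32_t /*errval*/) {
  int32_t v;
  std::memcpy(&v, adr, sizeof v);
  return v;
}
inline int32_t faulting_fetch32(const int32_t* /*adr*/, int32_t errval) {
  return errval;
}
```

A probe followed by a plain dereference therefore costs up to two guarded reads plus one unguarded read, and the unguarded read can still fault if the mapping changes in between. A single safe fetch of the slot with a null default avoids both issues when a null result is rejected anyway.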

Contributor Author

Thanks for the question. I've just removed the is_readable_pointer check since it's not necessary. We only observed a null pointer so the null check is sufficient.

Contributor Author

I implemented @theRealAph's suggestion instead; please take a look.

@openjdk

openjdk bot commented Aug 7, 2023

@richardstartin Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

@richardstartin
Contributor Author

I apologise for force-pushing as I was not aware of OpenJDK etiquette regarding this.

@richardstartin
Contributor Author

I intend to revert to the null check I originally planned. I'll follow up with another PR rather than force-push again.

Member

@tstuefe tstuefe left a comment

Hi Richard,

Looks good.

It is still not perfectly safe since the method could go out of scope concurrently while you are using it.

Cheers, Thomas


Oh, I see you closed it. Ah, well.
