async-profiler causes SIGSEGV nmethod::get_deopt_original_pc #73

Closed
stIncMale opened this issue Dec 1, 2017 · 13 comments

@stIncMale

Environment

  • async-profiler version
    1063a82
  • java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
  • uname -a
    Linux app1-us-va.blabla.com 3.13.0-101-generic #148-Ubuntu SMP Thu Oct 20 22:08:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Description
We are repeatedly launching async-profiler on one of our production servers as follows:
profiler.sh -d 600 -i 1000000 -f prof.txt -o summary,traces=200,flat=200,collapsed=samples -e cpu -b 5000000 <pid>
Eventually (within about 1-2 hours) it always leads to the same fatal error (for the full log, see https://gist.github.com/stIncMale/802da958cb928758029e4567b7e943cd):

#  SIGSEGV (0xb) at pc=0x00007f142aaf166e, pid=5951, tid=139719484446464
#
# JRE version: Java(TM) SE Runtime Environment (8.0_25-b17) (build 1.8.0_25-b17)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8ab66e]  nmethod::get_deopt_original_pc(frame const*)+0x6e
@apangin
Collaborator

apangin commented Dec 2, 2017

The crash log indicates the problem is inside AsyncGetCallTrace. I believe it is a JVM bug, probably caused by a race condition, but I need to investigate this a bit deeper.

In the long term I plan to get rid of the AsyncGetCallTrace call altogether: #66

@stIncMale
Author

Thanks for the assessment! It seems like there is a long way to go before the situation can potentially improve :(

@apangin
Collaborator

apangin commented Dec 5, 2017

I did some research on the issue and found that AsyncGetCallTrace could not reconstruct the top frame of the stack trace, since the PC pointed into a compiled method but outside its code section.

This should not normally happen, because the PC points to the real code being executed. So I guess async-profiler may have corrupted the PC with the pop-frame trick in Profiler::getJavaTraceAsync.

Can you please retry with the trick disabled and see if it makes things better?

-   if (trace.num_frames == ticks_unknown_Java) {
+   if (false) {
        // If current Java stack is not walkable (e.g. the top frame is not fully constructed),
        // try to manually pop the top frame off, hoping that the previous frame is walkable.
        // This is a temporary workaround for AsyncGetCallTrace issues,
        // see https://bugs.openjdk.java.net/browse/JDK-8178287

If it does not help, then the problem is indeed inside the JVM.
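
For readers not familiar with the trick: roughly, the profiler rewrites the PC/SP in the signal handler's ucontext so that the caller's frame appears on top, calls AsyncGetCallTrace again, and then restores the registers. A minimal sketch of that idea for Linux/x86_64 follows; this is an illustration, not the actual Profiler::getJavaTraceAsync code, and the ASGCT declarations, the ticks_unknown_Java value, and the "return address sits at *SP" assumption are simplifications made here.

```cpp
// Rough illustration of the pop-frame retry, not the actual async-profiler source.
// Linux/x86_64 only; the ASGCT_* declarations mirror the well-known (unofficial)
// AsyncGetCallTrace interface exported by libjvm.so.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1            // for REG_RIP/REG_RSP in <sys/ucontext.h> and RTLD_DEFAULT
#endif
#include <dlfcn.h>
#include <jni.h>
#include <stdint.h>
#include <ucontext.h>

typedef struct {
    jint bci;                    // bytecode index (or a negative marker)
    jmethodID method_id;
} ASGCT_CallFrame;

typedef struct {
    JNIEnv* env;
    jint num_frames;             // negative values are error codes
    ASGCT_CallFrame* frames;
} ASGCT_CallTrace;

typedef void (*AsyncGetCallTraceFn)(ASGCT_CallTrace*, jint, void*);
static AsyncGetCallTraceFn agct;

static const jint ticks_unknown_Java = -5;  // HotSpot's marker: top Java frame is not walkable

static void init_agct() {
    // AsyncGetCallTrace is exported by libjvm.so but has no public header.
    agct = (AsyncGetCallTraceFn)dlsym(RTLD_DEFAULT, "AsyncGetCallTrace");
}

// Called from the SIGPROF handler with the interrupted thread's ucontext.
static void sample_stack(ASGCT_CallTrace* trace, jint max_depth, ucontext_t* uc) {
    agct(trace, max_depth, uc);

    if (trace->num_frames == ticks_unknown_Java) {
        // Save the interrupted register state so it can be restored afterwards.
        greg_t saved_pc = uc->uc_mcontext.gregs[REG_RIP];
        greg_t saved_sp = uc->uc_mcontext.gregs[REG_RSP];

        // Assume the top frame has only pushed the return address so far:
        // treat *SP as the caller's PC and pop it off the stack.
        uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(*(uintptr_t*)saved_sp);
        uc->uc_mcontext.gregs[REG_RSP] = saved_sp + sizeof(uintptr_t);

        // Retry with the (hopefully walkable) caller's frame on top. If the assumption
        // above is wrong, RIP now points to garbage - which is exactly the kind of
        // input that can crash AsyncGetCallTrace, as seen in this issue.
        agct(trace, max_depth, uc);

        // Restore the original context.
        uc->uc_mcontext.gregs[REG_RIP] = saved_pc;
        uc->uc_mcontext.gregs[REG_RSP] = saved_sp;
    }
}
```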

BTW, JDK 8u25 is a bit too old - it has several major issues that may crash the JVM (we did see those crashes in our projects), so I kindly suggest (regardless of async-profiler) switching to a newer JDK update.

@stIncMale
Author

Thank you. I will try profiling without the trick, and report back to you in a day or two.

> JDK 8u25 is a bit too old - it has several major issues that may crash the JVM (we did see those crashes in our projects)

We have also seen several crashes, but it's not a big deal for us because our app servers are stateless. Nevertheless, I will suggest that our devops update the JVM.

@stIncMale
Author

Andrei, with the trick you mentioned disabled, I was able to repeatedly profile our production app server for 24 hours. It looked like a success, because previously (with the unmodified async-profiler, i.e. with the trick enabled) we had had two successive crashes, each after a 1-2 hour delay. So I decided to try the unmodified async-profiler once again, and it has now been profiling for more than 6 hours without failing :(

Hence, unfortunately, my experiments do not confirm anything about the proposed reason for the failure. I will continue monitoring.

@stIncMale
Author

We finally got the same crash with the unmodified async-profiler after almost 2 days of profiling. I switched to the modified version (with the trick disabled) and will monitor it for a long period of time (a week or more).

@apangin
Collaborator

apangin commented Dec 10, 2017

Thank you for testing this. It looks like the workaround really works. I think I will add more validation checks to the pop-frame path so that the modified frame cannot crash AsyncGetCallTrace. I'll come back with a proposed fix a bit later.
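
(For illustration only: the kind of sanity checks meant here could look like the sketch below. This is not the actual patch; the stack_lo/stack_hi parameters and the specific conditions are hypothetical.)

```cpp
// Sketch of sanity checks that could guard the pop-frame path: only retry
// AsyncGetCallTrace if the manufactured frame still looks plausible.
// `stack_lo`/`stack_hi` would come from e.g. pthread_attr_getstack().
#include <stdint.h>

static bool popped_frame_looks_sane(uintptr_t new_pc, uintptr_t new_sp,
                                    uintptr_t stack_lo, uintptr_t stack_hi) {
    if (new_pc == 0) {
        return false;                       // no plausible return address
    }
    if (new_sp % sizeof(uintptr_t) != 0) {
        return false;                       // stack pointer must stay word-aligned
    }
    if (new_sp < stack_lo || new_sp >= stack_hi) {
        return false;                       // popped SP left the thread's stack
    }
    return true;
}
```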

@stIncMale
Author

@apangin It's been more than a week without crashes with the modified async-profiler, so I think it's quite safe to say the fix works.

It would be great to have a solution to this problem implemented in async-profiler.

@apangin
Collaborator

apangin commented Dec 19, 2017

I appreciate your feedback. Now I'm sure where the bug is, but I'll need some spare time to fix it properly.

@apangin
Collaborator

apangin commented Feb 25, 2018

After a long pause, the fix is finally there! It conservatively avoids calling AsyncGetCallTrace when the PC does not look like a valid executable address. At the same time, it does not affect the quality of the collected stack traces in the normal case.
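
For reference, the general idea of checking the PC before handing it to AsyncGetCallTrace can be sketched as follows. This is only an illustration of the approach, not the actual async-profiler code; a real implementation would cache the executable ranges up front (and track the JIT code cache), because reading /proc/self/maps is not async-signal-safe inside a signal handler.

```cpp
// Sketch of the "validate the PC before calling AsyncGetCallTrace" idea.
// Checks whether an address lies inside an executable mapping by scanning
// /proc/self/maps (Linux only).
#include <stdint.h>
#include <stdio.h>

// Returns true if `pc` falls within a mapping that has the execute permission bit.
static bool looks_like_executable_pc(uintptr_t pc) {
    FILE* maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) {
        return false;
    }

    char line[512];
    bool found = false;
    while (fgets(line, sizeof(line), maps) != NULL) {
        unsigned long start, end;
        char perms[5];
        // Each line starts with "start-end perms", e.g. "7f142a...-7f142b... r-xp ..."
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3 &&
            perms[2] == 'x' && pc >= start && pc < end) {
            found = true;
            break;
        }
    }
    fclose(maps);
    return found;
}

// Intended use in the sampling signal handler: drop the sample instead of handing
// a bogus PC to AsyncGetCallTrace.
//
//   uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
//   if (!looks_like_executable_pc(pc)) {
//       return;   // count it as an "unknown" sample rather than risk a crash
//   }
//   AsyncGetCallTrace(&trace, max_depth, uc);
```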

@apangin apangin closed this as completed Feb 25, 2018
@stIncMale
Author

@apangin Andrei, this is great news! I have redeployed the updated profiler on our servers and will let you know in about a week if everything is OK :)

@apangin
Collaborator

apangin commented Feb 26, 2018

Ok, I'll be waiting for the feedback. Thank you!

@stIncMale
Author

Not a single crash so far. Thanks for the fix 👍
