async-profiler causes SIGSEGV nmethod::get_deopt_original_pc #73

Closed
stIncMale opened this issue Dec 1, 2017 · 13 comments

@stIncMale

Environment

  • async-profiler version
    1063a82
  • java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
  • uname -a
    Linux app1-us-va.blabla.com 3.13.0-101-generic #148-Ubuntu SMP Thu Oct 20 22:08:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Description
We are repeatedly launching async-profiler on one of our production servers as follows:
profiler.sh -d 600 -i 1000000 -f prof.txt -o summary,traces=200,flat=200,collapsed=samples -e cpu -b 5000000 <pid>
Eventually (within about 1-2 hours) it always leads to the same fatal error (for the full log, see https://gist.github.com/stIncMale/802da958cb928758029e4567b7e943cd):

#  SIGSEGV (0xb) at pc=0x00007f142aaf166e, pid=5951, tid=139719484446464
#
# JRE version: Java(TM) SE Runtime Environment (8.0_25-b17) (build 1.8.0_25-b17)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8ab66e]  nmethod::get_deopt_original_pc(frame const*)+0x6e
@apangin
Collaborator

apangin commented Dec 2, 2017

The crash log indicates the problem is inside AsyncGetCallTrace. I believe it is a JVM bug, probably caused by a race condition, but I need to investigate this a bit deeper.

In the long term I plan to get rid of the AsyncGetCallTrace call altogether: #66

@stIncMale
Author

Thanks for the assessment! It seems like there is a long way to go before the situation can potentially improve :(

@apangin
Collaborator

apangin commented Dec 5, 2017

I did some research on the issue and found that AsyncGetCallTrace could not reconstruct the top frame of the stack trace, since the PC pointed into a compiled method but outside its code section.

This should not normally happen, because the PC points to the real code being executed. So I guess async-profiler may have corrupted the PC with the pop-frame trick in Profiler::getJavaTraceAsync.

Can you please retry with the trick disabled and see if it makes things better?

-   if (trace.num_frames == ticks_unknown_Java) {
+   if (false) {
        // If current Java stack is not walkable (e.g. the top frame is not fully constructed),
        // try to manually pop the top frame off, hoping that the previous frame is walkable.
        // This is a temporary workaround for AsyncGetCallTrace issues,
        // see https://bugs.openjdk.java.net/browse/JDK-8178287

If it does not help, then the problem is indeed inside the JVM.
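
For readers not familiar with the trick: roughly, the profiler rewrites the PC/SP in the signal handler's ucontext so that the caller's frame appears on top, calls AsyncGetCallTrace again, and then restores the registers. A minimal sketch of that idea for Linux/x86_64 follows; this is an illustration, not the actual Profiler::getJavaTraceAsync code, and the ASGCT declarations, the ticks_unknown_Java value, and the "return address sits at *SP" assumption are simplifications made here.

```cpp
// Rough illustration of the pop-frame retry, not the actual async-profiler source.
// Linux/x86_64 only; the ASGCT_* declarations mirror the well-known (unofficial)
// AsyncGetCallTrace interface exported by libjvm.so.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1            // for REG_RIP/REG_RSP in <sys/ucontext.h> and RTLD_DEFAULT
#endif
#include <dlfcn.h>
#include <jni.h>
#include <stdint.h>
#include <ucontext.h>

typedef struct {
    jint bci;                    // bytecode index (or a negative marker)
    jmethodID method_id;
} ASGCT_CallFrame;

typedef struct {
    JNIEnv* env;
    jint num_frames;             // negative values are error codes
    ASGCT_CallFrame* frames;
} ASGCT_CallTrace;

typedef void (*AsyncGetCallTraceFn)(ASGCT_CallTrace*, jint, void*);
static AsyncGetCallTraceFn agct;

static const jint ticks_unknown_Java = -5;  // HotSpot's marker: top Java frame is not walkable

static void init_agct() {
    // AsyncGetCallTrace is exported by libjvm.so but has no public header.
    agct = (AsyncGetCallTraceFn)dlsym(RTLD_DEFAULT, "AsyncGetCallTrace");
}

// Called from the SIGPROF handler with the interrupted thread's ucontext.
static void sample_stack(ASGCT_CallTrace* trace, jint max_depth, ucontext_t* uc) {
    agct(trace, max_depth, uc);

    if (trace->num_frames == ticks_unknown_Java) {
        // Save the interrupted register state so it can be restored afterwards.
        greg_t saved_pc = uc->uc_mcontext.gregs[REG_RIP];
        greg_t saved_sp = uc->uc_mcontext.gregs[REG_RSP];

        // Assume the top frame has only pushed the return address so far:
        // treat *SP as the caller's PC and pop it off the stack.
        uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(*(uintptr_t*)saved_sp);
        uc->uc_mcontext.gregs[REG_RSP] = saved_sp + sizeof(uintptr_t);

        // Retry with the (hopefully walkable) caller's frame on top. If the assumption
        // above is wrong, RIP now points to garbage - which is exactly the kind of
        // input that can crash AsyncGetCallTrace, as seen in this issue.
        agct(trace, max_depth, uc);

        // Restore the original context.
        uc->uc_mcontext.gregs[REG_RIP] = saved_pc;
        uc->uc_mcontext.gregs[REG_RSP] = saved_sp;
    }
}
```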

BTW, JDK 8u25 is a bit too old - it has several major issues that may crash the JVM (we did see those crashes in our projects), so I kindly suggest (regardless of async-profiler) switching to a newer JDK update.

@stIncMale
Author

Thank you. I will try profiling without the trick, and report back to you in a day or two.

> JDK 8u25 is a bit too old - it has several major issues that may crash the JVM (we did see those crashes in our projects)

We have also seen several crashes, but it's not a big deal for us because our app servers are stateless. Nevertheless, I will suggest that our devops update the JVM.

@stIncMale
Author

Andrei, with the trick you mentioned disabled, I was able to repeatedly profile our production app server for 24 hours. It looked like a success, because previously (with the unmodified async-profiler, i.e. with the trick enabled) we had had two successive crashes, each after a 1-2 hour delay. So I decided to try the unmodified async-profiler once again, and it has now been profiling for more than 6 hours without failing :(

Hence, unfortunately, my experiments do not confirm anything about the proposed reason for the failure. I will continue monitoring.

@stIncMale
Author

We finally got the same crash with the unmodified async-profiler after almost 2 days of profiling. I switched to the modified version (with the trick disabled) and will monitor it for a long period of time (a week or more).

@apangin
Collaborator

apangin commented Dec 10, 2017

Thank you for testing this. It looks like the workaround really works. I think I will add more validation checks to the pop-frame path so that the modified frame cannot crash AsyncGetCallTrace. I'll come back with a proposed fix a bit later.
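
(For illustration only: the kind of sanity checks meant here could look like the sketch below. This is not the actual patch; the stack_lo/stack_hi parameters and the specific conditions are hypothetical.)

```cpp
// Sketch of sanity checks that could guard the pop-frame path: only retry
// AsyncGetCallTrace if the manufactured frame still looks plausible.
// `stack_lo`/`stack_hi` would come from e.g. pthread_attr_getstack().
#include <stdint.h>

static bool popped_frame_looks_sane(uintptr_t new_pc, uintptr_t new_sp,
                                    uintptr_t stack_lo, uintptr_t stack_hi) {
    if (new_pc == 0) {
        return false;                       // no plausible return address
    }
    if (new_sp % sizeof(uintptr_t) != 0) {
        return false;                       // stack pointer must stay word-aligned
    }
    if (new_sp < stack_lo || new_sp >= stack_hi) {
        return false;                       // popped SP left the thread's stack
    }
    return true;
}
```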

@stIncMale
Author

@apangin It's been more than a week without crashes with the modified async-profiler, so I think it's quite safe to say the fix works.

It would be great to have a solution to this problem implemented in async-profiler.

@apangin
Collaborator

apangin commented Dec 19, 2017

I appreciate your feedback. Now I'm sure where the bug is, but I'll need some spare time to fix it properly.

@apangin
Collaborator

apangin commented Feb 25, 2018

After a long pause, the fix is finally there! It conservatively avoids calling AsyncGetCallTrace when the PC does not look like a valid executable address. At the same time, it does not affect the quality of the collected stack traces in the normal case.
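
For reference, the general idea of checking the PC before handing it to AsyncGetCallTrace can be sketched as follows. This is only an illustration of the approach, not the actual async-profiler code; a real implementation would cache the executable ranges up front (and track the JIT code cache), because reading /proc/self/maps is not async-signal-safe inside a signal handler.

```cpp
// Sketch of the "validate the PC before calling AsyncGetCallTrace" idea.
// Checks whether an address lies inside an executable mapping by scanning
// /proc/self/maps (Linux only).
#include <stdint.h>
#include <stdio.h>

// Returns true if `pc` falls within a mapping that has the execute permission bit.
static bool looks_like_executable_pc(uintptr_t pc) {
    FILE* maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) {
        return false;
    }

    char line[512];
    bool found = false;
    while (fgets(line, sizeof(line), maps) != NULL) {
        unsigned long start, end;
        char perms[5];
        // Each line starts with "start-end perms", e.g. "7f142a...-7f142b... r-xp ..."
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3 &&
            perms[2] == 'x' && pc >= start && pc < end) {
            found = true;
            break;
        }
    }
    fclose(maps);
    return found;
}

// Intended use in the sampling signal handler: drop the sample instead of handing
// a bogus PC to AsyncGetCallTrace.
//
//   uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
//   if (!looks_like_executable_pc(pc)) {
//       return;   // count it as an "unknown" sample rather than risk a crash
//   }
//   AsyncGetCallTrace(&trace, max_depth, uc);
```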

@apangin apangin closed this as completed Feb 25, 2018
@stIncMale
Author

@apangin Andrei, this is great news! I have redeployed the updated profiler on our servers and will let you know in about a week if everything is OK :)

@apangin
Collaborator

apangin commented Feb 26, 2018

Ok, I'll be waiting for the feedback. Thank you!

@stIncMale
Author

Not a single crash so far. Thanks for the fix 👍
