async-profiler causes SIGSEGV nmethod::get_deopt_original_pc #73
Comments
The crash log indicates the problem is inside […]. In the long term I plan to get rid of […].
Thanks for the assessment! Seems like there's a long way to go before the situation improves :(
I did some research on the issue and found that AsyncGetCallTrace could not reconstruct the top frame of the stack trace, since the PC pointed to a compiled method outside its code section. This should not normally happen, because the PC points to the real code being executed. So I guess async-profiler may have spoiled the PC by the pop-frame trick in […]. Can you please retry with the trick disabled and see if it makes things better?
If it does not help, then the problem is indeed inside the JVM. BTW, JDK 8u25 is a bit too old - it has several major issues that may crash the JVM (we did see those crashes in our projects), so I kindly suggest (regardless of async-profiler) switching to a newer update of the JDK.
Thank you. I will try profiling without the trick, and report back to you in a day or two.
We have also seen several crashes, but it's not a big deal for us because our app servers are stateless. Nevertheless, I will suggest that our devops update the JVM.
Andrei, without the trick you mentioned I was able to repeatedly profile our prod app server for 24 hours without a crash. It seemed like a success, because when we tried previously (without modifications to async-profiler, i.e. with the trick enabled) we had two successive crashes with a 1-2 hour delay. So I decided to try once again without modifications to async-profiler, and it has been profiling for more than 6 hours now without failing :( Hence, unfortunately, my experiments do not confirm anything about the proposed cause of the failure. I will continue monitoring.
We finally got the same crash with the unmodified async-profiler after almost 2 days of profiling. I switched to the modified version (with the trick disabled) and will monitor it for a long period of time (a week or more).
Thank you for testing this. Looks like the workaround really works. I think I will add more validation checks to the pop-frame path so that a modified frame cannot crash AsyncGetCallTrace. I'll come back with a proposed fix a bit later.
@apangin It's been more than a week without crashes with the modified async-profiler. So I think it's quite safe to say the fix works. It would be great to have a solution for this problem implemented in async-profiler.
I appreciate your feedback. Now I'm sure where the bug is. But I'd need some spare time to fix it properly.
After a long pause, the fix is finally there! It conservatively avoids calling AsyncGetCallTrace when the PC does not look like a valid executable address. At the same time, it does not affect the quality of collected stack traces in the normal case.
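The commit itself is not shown in this thread, but the idea of the fix can be sketched. The following is a minimal, hypothetical illustration (Linux-only; the helper name and the /proc/self/maps scan are my assumptions, not async-profiler's actual implementation): before handing a sampled PC to AsyncGetCallTrace, verify that it lies inside a memory mapping with the execute bit set.

```cpp
#include <cstdio>
#include <cstdint>

// Hypothetical helper (not async-profiler's real code): return true if addr
// falls inside a mapping with execute permission, judged by scanning
// /proc/self/maps. A PC that fails this check is rejected instead of being
// passed to AsyncGetCallTrace.
static bool looks_like_executable_pc(const void* addr) {
    unsigned long pc = (unsigned long)(uintptr_t)addr;
    FILE* maps = std::fopen("/proc/self/maps", "r");
    if (maps == nullptr) return false;

    char line[512];
    bool executable = false;
    while (std::fgets(line, sizeof(line), maps)) {
        unsigned long start, end;
        char perms[5];
        // Each line begins with "start-end perms", e.g. "7f3c00000000-7f3c00021000 r-xp ..."
        if (std::sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3 &&
            pc >= start && pc < end) {
            executable = (perms[2] == 'x');  // perms is "rwxp"-style
            break;
        }
    }
    std::fclose(maps);
    return executable;
}
```

A conservative check like this can only cause a sample to be dropped (never a crash), which matches the comment above that normal stack traces are unaffected.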
@apangin Andrei, this is great news! I have redeployed the updated profiler on our servers and will let you know in about a week if everything is OK :)
Ok, I'll be waiting for the feedback. Thank you!
Not a single crash so far. Thanks for the fix 👍 |
Environment
async-profiler version: 1063a82
java -version
uname -a
Linux app1-us-va.blabla.com 3.13.0-101-generic #148-Ubuntu SMP Thu Oct 20 22:08:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Description
We are repeatedly launching async-profiler on one of our production servers as follows:

profiler.sh -d 600 -i 1000000 -f prof.txt -o summary,traces=200,flat=200,collapsed=samples -e cpu -b 5000000 <pid>
And it always eventually (in about 1-2 hours) leads to the same fatal error (for full log see https://gist.github.com/stIncMale/802da958cb928758029e4567b7e943cd):