Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8178287: AsyncGetCallTrace fails to traverse valid Java stacks #4436

Closed
wants to merge 8 commits into from

Conversation

@luhenry
Copy link
Member

@luhenry luhenry commented Jun 9, 2021

When the signal sent for AsyncGetCallTrace or JFR would land on a runtime stub (like arraycopy), a vtable stub, or the prolog of a compiled method, it wouldn't be able to detect the sender (caller) frame for multiple reasons. This patch fixes these cases through adding CodeBlob-specific frame parser which are in the best position to know how a frame is setup.

The following examples have been profiled with honest-profiler which uses AsyncGetCallTrace.

Prof1

public class Prof1 {

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000000; i++) {
            sb.append("ab");
            sb.delete(0, 1);
        }
        System.out.println(sb.length());
    }
}
  • Baseline:
Flat Profile (by method):
        (t 99.4,s 99.4) AGCT::Unknown Java[ERR=-5]
        (t  0.5,s  0.2) Prof1::main
        (t  0.2,s  0.2) java.lang.AbstractStringBuilder::append
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::ensureCapacityInternal
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::shift
        (t  0.0,s  0.0) java.lang.String::getBytes
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::putStringAt
        (t  0.0,s  0.0) java.lang.StringBuilder::delete
        (t  0.2,s  0.0) java.lang.StringBuilder::append
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::delete
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::putStringAt
  • With StubRoutinesBlob::FrameParser:
Flat Profile (by method):
        (t 98.7,s 98.7) java.lang.AbstractStringBuilder::ensureCapacityInternal
        (t  0.9,s  0.9) java.lang.AbstractStringBuilder::delete
        (t 99.8,s  0.2) Prof1::main
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]
        (t  0.0,s  0.0) AGCT::Unknown Java[ERR=-5]
        (t 98.8,s  0.0) java.lang.AbstractStringBuilder::append
        (t 98.8,s  0.0) java.lang.StringBuilder::append
        (t  0.9,s  0.0) java.lang.StringBuilder::delete

Prof2

import java.util.function.Supplier;

public class Prof2 {

    public static void main(String[] args) {
        var rand = new java.util.Random(0);
        Supplier[] suppliers = {
                () -> 0,
                () -> 1,
                () -> 2,
                () -> 3,
        };

        long sum = 0;
        for (int i = 0; i >= 0; i++) {
            sum += (int)suppliers[i % suppliers.length].get();
        }
    }
}
  • Baseline:
Flat Profile (by method):
        (t 60.7,s 60.7) AGCT::Unknown Java[ERR=-5]
        (t 39.2,s 35.2) Prof2::main
        (t  1.4,s  1.4) Prof2::lambda$main$3
        (t  1.0,s  1.0) Prof2::lambda$main$2
        (t  0.9,s  0.9) Prof2::lambda$main$1
        (t  0.7,s  0.7) Prof2::lambda$main$0
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]
        (t  0.0,s  0.0) java.lang.Thread::exit
        (t  0.9,s  0.0) Prof2$$Lambda$2.0x0000000800c00c28::get
        (t  1.0,s  0.0) Prof2$$Lambda$3.0x0000000800c01000::get
        (t  1.4,s  0.0) Prof2$$Lambda$4.0x0000000800c01220::get
        (t  0.7,s  0.0) Prof2$$Lambda$1.0x0000000800c00a08::get
  • With VtableBlob::FrameParser and nmethod::FrameParser:
Flat Profile (by method):
        (t 74.1,s 70.3) Prof2::main
        (t  6.5,s  5.5) Prof2$$Lambda$29.0x0000000800081220::get
        (t  6.6,s  5.4) Prof2$$Lambda$28.0x0000000800081000::get
        (t  5.7,s  5.0) Prof2$$Lambda$26.0x0000000800080a08::get
        (t  5.9,s  5.0) Prof2$$Lambda$27.0x0000000800080c28::get
        (t  4.9,s  4.9) AGCT::Unknown Java[ERR=-5]
        (t  1.2,s  1.2) Prof2::lambda$main$2
        (t  0.9,s  0.9) Prof2::lambda$main$3
        (t  0.9,s  0.9) Prof2::lambda$main$1
        (t  0.7,s  0.7) Prof2::lambda$main$0
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8178287: AsyncGetCallTrace fails to traverse valid Java stacks

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4436/head:pull/4436
$ git checkout pull/4436

Update a local copy of the PR:
$ git checkout pull/4436
$ git pull https://git.openjdk.java.net/jdk pull/4436/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4436

View PR using the GUI difftool:
$ git pr show -t 4436

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4436.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Jun 9, 2021

👋 Welcome back luhenry! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Jun 9, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Jun 9, 2021

@luhenry The following labels will be automatically applied to this pull request:

  • hotspot
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Jun 9, 2021

@sspitsyn
Copy link
Contributor

@sspitsyn sspitsyn commented Jun 9, 2021

Hi Ludovic,
Thank you for working on this fix in the AsyncGetCallTrace.
What version of JDK release do you intent to target?
Just wanted to make sure you know the JDK 17 development cycle will be closed tomorrow for P4 bugs and enhancements. The repository will be forked and the RDP 1 phase started.
I doubt the review of your fix will be completed by this time.
So, please, keep in mind your PR will go to 18, not 17.
Thanks,
Serguei

luhenry added 5 commits Jun 10, 2021
Whether and how a frame is setup is controlled by the code generator
for the specific CodeBlock. The CodeBlock is then in the best place to know how
to parse the sender's frame from the current frame in the given CodeBlock.

This refactoring proposes to extract this parsing out of `frame` and into a
`CodeBlock::FrameParser`. This FrameParser is then specialized in the relevant
inherited children of CodeBlock.

This change is to largely facilitate adding new supported cases for JDK-8252417
like runtime stubs.
When the signal sent for AsyncGetCallTrace or JFR would land on a stub
(like arraycopy), it wouldn't be able to detect the sender (caller)
frame because `_cb->frame_size() == 0`.

Because we fully control how the prolog and epilog of stub code is
generated, we know there are two cases:
1. A stack frame is allocated via macroAssembler->enter(), and consists
in `push rbp; mov rsp, rbp;`.
2. No stack frames are allocated and rbp is left unchanged and rsp is
decremented with the `call` instruction that push the return `pc` on the
stack.

For case 1., we can easily know the sender frame by simply looking at
rbp, especially since we know that all stubs preserve the frame pointer
(on x86 at least).

For case 2., we end up returning the sender's sender, but that already
gives us more information than what we have today.

The results are as follows:

- Baseline:

Flat Profile (by method):
        (t 99.4,s 99.4) AGCT::Unknown Java[ERR=-5]
        (t  0.5,s  0.2) Prof1::main
        (t  0.2,s  0.2) java.lang.AbstractStringBuilder::append
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::ensureCapacityInternal
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::shift
        (t  0.0,s  0.0) java.lang.String::getBytes
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::putStringAt
        (t  0.0,s  0.0) java.lang.StringBuilder::delete
        (t  0.2,s  0.0) java.lang.StringBuilder::append
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::delete
        (t  0.0,s  0.0) java.lang.AbstractStringBuilder::putStringAt

- With StubRoutinesBlob::FrameParser

Flat Profile (by method):
        (t 98.7,s 98.7) java.lang.AbstractStringBuilder::ensureCapacityInternal
        (t  0.9,s  0.9) java.lang.AbstractStringBuilder::delete
        (t 99.8,s  0.2) Prof1::main
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]
        (t  0.0,s  0.0) AGCT::Unknown Java[ERR=-5]
        (t 98.8,s  0.0) java.lang.AbstractStringBuilder::append
        (t 98.8,s  0.0) java.lang.StringBuilder::append
        (t  0.9,s  0.0) java.lang.StringBuilder::delete

The program is as follows:

```
public class Prof1 {

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000000; i++) {
            sb.append("ab");
            sb.delete(0, 1);
        }
        System.out.println(sb.length());
    }
}
```

We now account for the arraycopy stub which is called by
AbstractStringBuilder::ensureCapacityInternal. It was previously ignored
because it would not know how to parse the frame for the arraycopy stub
and would fall in the AGCT::Unknown Java[ERR=-5] section.

However, it still isn't perfect since it doesn't point to the arraycopy stub
directly.
When sampling hits the prolog of a method, Hotspot assumes it's unable
to parse the frame. This change allows to parse such frame on x86 by
specializing which instruction it's hitting in the prolog.

The results are as follows:

- Baseline:

Flat Profile (by method):
        (t 60.7,s 60.7) AGCT::Unknown Java[ERR=-5]
        (t 39.2,s 35.2) Prof2::main
        (t  1.4,s  1.4) Prof2::lambda$main$3
        (t  1.0,s  1.0) Prof2::lambda$main$2
        (t  0.9,s  0.9) Prof2::lambda$main$1
        (t  0.7,s  0.7) Prof2::lambda$main$0
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]
        (t  0.0,s  0.0) java.lang.Thread::exit
        (t  0.9,s  0.0) Prof2$$Lambda$2.0x0000000800c00c28::get
        (t  1.0,s  0.0) Prof2$$Lambda$3.0x0000000800c01000::get
        (t  1.4,s  0.0) Prof2$$Lambda$4.0x0000000800c01220::get
        (t  0.7,s  0.0) Prof2$$Lambda$1.0x0000000800c00a08::get

- With incomplete frame parsing:

Flat Profile (by method):
        (t 39.3,s 39.3) AGCT::Unknown Java[ERR=-5]
        (t 40.3,s 36.1) Prof2::main
        (t  6.4,s  5.3) Prof2$$Lambda$28.0x0000000800081000::get
        (t  6.1,s  5.1) Prof2$$Lambda$29.0x0000000800081220::get
        (t  6.0,s  5.0) Prof2$$Lambda$27.0x0000000800080c28::get
        (t  6.1,s  5.0) Prof2$$Lambda$26.0x0000000800080a08::get
        (t  1.1,s  1.1) Prof2::lambda$main$2
        (t  1.1,s  1.1) Prof2::lambda$main$0
        (t  1.0,s  1.0) Prof2::lambda$main$1
        (t  0.9,s  0.9) Prof2::lambda$main$3
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]
        (t  0.0,s  0.0) java.util.Locale::getInstance
        (t  0.0,s  0.0) AGCT::Not walkable Java[ERR=-6]
        (t  0.0,s  0.0) jdk.internal.loader.BuiltinClassLoader::loadClassOrNull
        (t  0.0,s  0.0) java.lang.ClassLoader::loadClass
        (t  0.0,s  0.0) sun.net.util.URLUtil::urlNoFragString
        (t  0.0,s  0.0) java.lang.Class::forName0
        (t  0.0,s  0.0) java.util.Locale::initDefault
        (t  0.0,s  0.0) jdk.internal.loader.BuiltinClassLoader::loadClass
        (t  0.0,s  0.0) jdk.internal.loader.URLClassPath::getLoader
        (t  0.0,s  0.0) jdk.internal.loader.URLClassPath::getResource
        (t  0.0,s  0.0) java.lang.String::toLowerCase
        (t  0.0,s  0.0) sun.launcher.LauncherHelper::loadMainClass
        (t  0.0,s  0.0) sun.launcher.LauncherHelper::checkAndLoadMain
        (t  0.0,s  0.0) java.util.Locale::<clinit>
        (t  0.0,s  0.0) jdk.internal.loader.BuiltinClassLoader::findClassOnClassPathOrNull
        (t  0.0,s  0.0) jdk.internal.loader.ClassLoaders$AppClassLoader::loadClass
        (t  0.0,s  0.0) java.lang.Class::forName

The program is as follows:

```
import java.util.function.Supplier;

public class Prof2 {

    public static void main(String[] args) {
        var rand = new java.util.Random(0);
        Supplier[] suppliers = {
                () -> 0,
                () -> 1,
                () -> 2,
                () -> 3,
        };

        long sum = 0;
        for (int i = 0; i >= 0; i++) {
            sum += (int)suppliers[i % suppliers.length].get();
        }
    }
}
```

We see that the results are particularely useful in this case as the
methods are very short (it only returns an integer), and the probability
of hitting the prolog is then very high.
The program is the following:

```
import java.util.function.Supplier;

public class Prof2 {

    public static void main(String[] args) {
        var rand = new java.util.Random(0);
        Supplier[] suppliers = {
                () -> 0,
                () -> 1,
                () -> 2,
                () -> 3,
        };

        long sum = 0;
        for (int i = 0; i >= 0; i++) {
            sum += (int)suppliers[i % suppliers.length].get();
        }
    }
}
```

The results are as follows:

- Baseline (from previous commit):

Flat Profile (by method):
            (t 39.3,s 39.3) AGCT::Unknown Java[ERR=-5]
            (t 40.3,s 36.1) Prof2::main
            (t  6.4,s  5.3) Prof2$$Lambda$28.0x0000000800081000::get
            (t  6.1,s  5.1) Prof2$$Lambda$29.0x0000000800081220::get
            (t  6.0,s  5.0) Prof2$$Lambda$27.0x0000000800080c28::get
            (t  6.1,s  5.0) Prof2$$Lambda$26.0x0000000800080a08::get
            (t  1.1,s  1.1) Prof2::lambda$main$2
            (t  1.1,s  1.1) Prof2::lambda$main$0
            (t  1.0,s  1.0) Prof2::lambda$main$1
            (t  0.9,s  0.9) Prof2::lambda$main$3
            (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]

- With unwind from vtable stub

Flat Profile (by method):
        (t 74.1,s 70.3) Prof2::main
        (t  6.5,s  5.5) Prof2$$Lambda$29.0x0000000800081220::get
        (t  6.6,s  5.4) Prof2$$Lambda$28.0x0000000800081000::get
        (t  5.7,s  5.0) Prof2$$Lambda$26.0x0000000800080a08::get
        (t  5.9,s  5.0) Prof2$$Lambda$27.0x0000000800080c28::get
        (t  4.9,s  4.9) AGCT::Unknown Java[ERR=-5]
        (t  1.2,s  1.2) Prof2::lambda$main$2
        (t  0.9,s  0.9) Prof2::lambda$main$3
        (t  0.9,s  0.9) Prof2::lambda$main$1
        (t  0.7,s  0.7) Prof2::lambda$main$0
        (t  0.1,s  0.1) AGCT::Unknown not Java[ERR=-3]

We attribute the vtable stub to the caller and not the callee, which is
already an improvement from the existing case.
@luhenry
Copy link
Member Author

@luhenry luhenry commented Jun 10, 2021

@sspitsyn thank you for the reminder. It's perfectly fine for this change to land in JDK 18. We'll see in the future if there is a demand to backport it to JDK 17 and we'll do accordingly.

Copy link

@jbachorik jbachorik left a comment

I have done a sanity check to make sure that the code was not functionally modified while moving around.
No problems found there.

src/hotspot/cpu/aarch64/codeBlob_aarch64.cpp Outdated Show resolved Hide resolved
@luhenry
Copy link
Member Author

@luhenry luhenry commented Jun 25, 2021

/label add hotspot-jfr

@openjdk openjdk bot added the hotspot-jfr label Jun 25, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Jun 25, 2021

@luhenry
The hotspot-jfr label was successfully added.

@luhenry
Copy link
Member Author

@luhenry luhenry commented Jun 29, 2021

@sspitsyn how could we take this change forward? Thank you

@sspitsyn
Copy link
Contributor

@sspitsyn sspitsyn commented Jun 30, 2021

Hi Ludovic,
You need two reviews including one (R)eviewer before integration of your fix.
Thanks,
Serguei

#include "interpreter/interpreter.hpp"
#include "runtime/frame.hpp"

bool CodeBlob::FrameParser::sender_frame(JavaThread *thread, bool check, address pc, intptr_t* sp, intptr_t* unextended_sp, intptr_t* fp, bool fp_safe,
Copy link

@tivrfoa tivrfoa Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tivrfoa, thanks for making a comment in an OpenJDK project!

All comments and discussions in the OpenJDK Community must be made available under the OpenJDK Terms of Use. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please Use "Add GitHub user tivrfoa for the summary.

If you are not an OpenJDK Author, Committer or Reviewer, simply check the box below to accept the OpenJDK Terms of Use for your comments.

Your comment will be automatically restored once you have accepted the OpenJDK Terms of Use.

@apangin
Copy link

@apangin apangin commented Jul 11, 2021

Hi Ludovic,

Thank you for working on this long-standing bug.
I like the idea of the proposed solution, but unfortunately it cannot be applied as is. Since the stack walking code runs inside a signal handler, it is very limited in things it can do. In particular, it must not allocate, acquire locks, etc. In your implementation, FrameParser does allocate though.

The issue is not just theoretical: when I ran JDK with this patch with async-profiler, I immediately got the following deadlock:

(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fa2363ca025 in __GI___pthread_mutex_lock (mutex=0x7fa235da5440 <tc_mutex>)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x00007fa235696cb6 in ThreadCritical::ThreadCritical() () from /usr/java/jdk-18/lib/server/libjvm.so
#3  0x00007fa234b6fe53 in Chunk::next_chop() () from /usr/java/jdk-18/lib/server/libjvm.so
#4  0x00007fa234e88523 in frame::safe_for_sender(JavaThread*) () from /usr/java/jdk-18/lib/server/libjvm.so
#5  0x00007fa234e838f2 in vframeStreamForte::forte_next() () from /usr/java/jdk-18/lib/server/libjvm.so
#6  0x00007fa2349fbb9b in forte_fill_call_trace_given_top(JavaThread*, ASGCT_CallTrace*, int, frame) [clone .isra.20]
    () from /usr/java/jdk-18/lib/server/libjvm.so
#7  0x00007fa234e8426e in AsyncGetCallTrace () from /usr/java/jdk-18/lib/server/libjvm.so
#8  0x00007fa228519312 in Profiler::getJavaTraceAsync(void*, ASGCT_CallFrame*, int) ()
   from /mnt/c/Users/Andrei/java/async-profiler/build/libasyncProfiler.so
#9  0x00007fa228519c72 in Profiler::recordSample(void*, unsigned long long, int, Event*) ()
   from /mnt/c/Users/Andrei/java/async-profiler/build/libasyncProfiler.so
#10 0x00007fa2285164f8 in WallClock::signalHandler(int, siginfo_t*, void*) ()
   from /mnt/c/Users/Andrei/java/async-profiler/build/libasyncProfiler.so
#11 <signal handler called>
#12 __pthread_mutex_unlock_usercnt (decr=1, mutex=0x7fa235da5440 <tc_mutex>) at pthread_mutex_unlock.c:41
#13 __GI___pthread_mutex_unlock (mutex=0x7fa235da5440 <tc_mutex>) at pthread_mutex_unlock.c:356
#14 0x00007fa235696d3b in ThreadCritical::~ThreadCritical() () from /usr/java/jdk-18/lib/server/libjvm.so
#15 0x00007fa234b6fe71 in Chunk::next_chop() () from /usr/java/jdk-18/lib/server/libjvm.so
#16 0x00007fa234d1ca62 in ClassFileParser::parse_method(ClassFileStream const*, bool, ConstantPool const*, AccessFlags*, JavaThread*) () from /usr/java/jdk-18/lib/server/libjvm.so
#17 0x00007fa234d1e338 in ClassFileParser::parse_methods(ClassFileStream const*, bool, AccessFlags*, bool*, bool*, JavaThread*) () from /usr/java/jdk-18/lib/server/libjvm.so
#18 0x00007fa234d22459 in ClassFileParser::parse_stream(ClassFileStream const*, JavaThread*) ()
   from /usr/java/jdk-18/lib/server/libjvm.so
#19 0x00007fa234d2291c in ClassFileParser::ClassFileParser(ClassFileStream*, Symbol*, ClassLoaderData*, ClassLoadInfo const*, ClassFileParser::Publicity, JavaThread*) () from /usr/java/jdk-18/lib/server/libjvm.so
#20 0x00007fa2351febb6 in KlassFactory::create_from_stream(ClassFileStream*, Symbol*, ClassLoaderData*, ClassLoadInfo const&, JavaThread*) ()
   from /usr/java/jdk-18/lib/server/libjvm.so
#21 0x00007fa235645b40 in SystemDictionary::resolve_class_from_stream(ClassFileStream*, Symbol*, Handle, ClassLoadInfo const&, JavaThread*) ()
   from /usr/java/jdk-18/lib/server/libjvm.so
#22 0x00007fa2350bad0a in jvm_define_class_common(char const*, _jobject*, signed char const*, int, _jobject*, char const*, JavaThread*) [clone .constprop.299] ()
   from /usr/java/jdk-18/lib/server/libjvm.so
#23 0x00007fa2350bae6d in JVM_DefineClassWithSource () from /usr/java/jdk-18/lib/server/libjvm.so
#24 0x00007fa236a0ee12 in Java_java_lang_ClassLoader_defineClass1 () from /usr/java/jdk-18/lib/libjava.so

@luhenry
Copy link
Member Author

@luhenry luhenry commented Jul 19, 2021

@apangin Thanks for pointing that out! I'm updating it right now and should be pushing an update very soon. I'll also add examples on how it impacts JFR.

luhenry added 2 commits Jul 19, 2021
The need for the allocation would be it non async-safe. However,
AsyncGetCallTrace is async-safe and thus can't allow for allocations.
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Aug 16, 2021

@luhenry This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@thegreystone
Copy link
Member

@thegreystone thegreystone commented Aug 17, 2021

Hi @sspitsyn! Will you be the one reviewing this?

@sspitsyn
Copy link
Contributor

@sspitsyn sspitsyn commented Aug 18, 2021

Hi Markus! I have a lack of expertize in this area and can't be the primary reviewer.
The problem is we have no test coverage to verify fixes.

@thegreystone
Copy link
Member

@thegreystone thegreystone commented Aug 18, 2021

Ok. Do you know who should be the primary reviewer? Rickard Bäckman? Nils Eliasson?

@jbachorik
Copy link

@jbachorik jbachorik commented Aug 23, 2021

@dcubed-ojdk , @dholmes-ora would you be willing/able to take a look at this change, pretty please?

@dholmes-ora
Copy link
Member

@dholmes-ora dholmes-ora commented Aug 23, 2021

Sorry @jbachorik but frames and code-blobs are not an area I know.

David

@thegreystone
Copy link
Member

@thegreystone thegreystone commented Aug 23, 2021

@dholmes-ora, do you know anyone who could be the primary reviewer for this?

@neliasso? @rickard?

@apangin
Copy link

@apangin apangin commented Aug 24, 2021

Hi @luhenry,
I had a chance to test your patch on realistic applications, and here are my observations:

  1. JVM does not crash or hang, which is already a good sign.
  2. The amount of unknown_Java samples indeed decreased, roughly by a factor of 2.
  3. Now the bad news: with this patch, AsyncGetCallTrace starts returning many orphaned (truncated) stack traces that consist of just one or a few top frames.

For example, here are the results of profiling spring-petclinic application startup.
I summarized the amount of AsyncGetCallTrace failures by type in the following table.
Click on the links in the left column to open the entire flame graphs.

  Success Failures unknown_Java not_walkable_Java unknown_not_Java not_walkable_not_Java Truncated
Baseline 86.17% 13.83% 7.55% 1.54% 2.28% 2.45% 0.01%
AGCT patch 89.75% 10.25% 3.77% 1.02% 2.26% - 3.20%
Async-profiler 98.12% 1.88% 1.45% 0.29% 0.09% - 0.05%
Async-profiler with patch 94.95% 5.05% 1.21% 0.05% 0.10% - 3.69%

My biggest concern is the large number (>3%) of truncated stack traces. The probem is, such stack traces are indistinguishable (by the tools) from normal stack traces, and therefore the tools will display misleading information to a user, without even knowing that something went wrong. In this sense I'd prefer to return an error code rather than an invalid stack trace.

The above table illustrates the problem in practice. With the proposed patch, async-profiler actually performs worse than without it, because async-profiler can recover from most AGCT failures on its own, but not when AGCT returns an invalid stack trace instead of an error code.

In view of the aforementioned, I'll ask not to commit this patch before the problem of truncated stack traces is fixed. Thank you.

@sspitsyn
Copy link
Contributor

@sspitsyn sspitsyn commented Aug 24, 2021

Ok. Do you know who should be the primary reviewer? Rickard Bäckman? Nils Eliasson?

Markus,
There is no primary reviewer for this area as it is officially unsupported.
Also, the problem is that we have no test coverage in this area.
It is not easy to take a responsibility without it.
Even though it is unsupported we do not want to add a degradation for existing profilers.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 3, 2021

Mailing list message from Ludovic Henry on hotspot-jfr-dev:

Hello,

As it's been quite a while already, I wanted to nudge this RFR forward. Who
would be best qualified to review this PR?

Thank you!
Ludovic

On Wed, Aug 18, 2021 at 9:21 PM Marcus Hirt <hirt at openjdk.java.net> wrote:

@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 3, 2021

Mailing list message from Ludovic Henry on hotspot-jfr-dev:

[Sending again now with the email address properly registered to the
mailing lists]

Hello,

As it's been quite a while already, I wanted to nudge this RFR forward. Who
would be best qualified to review this PR?

Thank you!
Ludovic

On Fri, Aug 20, 2021 at 8:49 AM Ludovic Henry <ludovic at datadoghq.com> wrote:

Hello,

As it's been quite a while already, I wanted to nudge this RFR forward.
Who would be best qualified to review this PR?

Thank you!
Ludovic

On Wed, Aug 18, 2021 at 9:21 PM Marcus Hirt <hirt at openjdk.java.net> wrote:

@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 3, 2021

Mailing list message from David Holmes on hotspot-jfr-dev:

On 24/08/2021 9:40 am, Marcus Hirt wrote:

On Mon, 23 Aug 2021 10:57:54 GMT, David Holmes <dholmes at openjdk.org> wrote:

@dcubed-ojdk , @dholmes-ora would you be willing/able to take a look at this change, pretty please?

Sorry @jbachorik but frames and code-blobs are not an area I know.

David

@dholmes-ora, do you know anyone who could be the primary reviewer for this?

I don't know sorry.

David

@luhenry
Copy link
Member Author

@luhenry luhenry commented Sep 14, 2021

Closing it for now until we figure out all the raised points.

@luhenry luhenry closed this Sep 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
7 participants