Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Userspace stack is not unwinded in most samples with offcputime.py #1641

Closed
arssher opened this issue Mar 21, 2018 · 22 comments
Closed

Userspace stack is not unwinded in most samples with offcputime.py #1641

arssher opened this issue Mar 21, 2018 · 22 comments
Milestone

Comments

@arssher
Copy link

arssher commented Mar 21, 2018

Hi,

I'm experimenting with tracing postgres using offcputime.py. However, userspace stack is not properly unwinded in most samples as shown in attached flamegraph [1], making the tool basically unusable. Postgres is compiled with -fno-omit-frame-pointer (and this is confirmed a bunch of correctly unwinded stacks). I thought that one reason for this might be glibc compiled with -omit-frame-pointer, because the last resolved symbol is epoll_pwait from libc.so. I took musl libc and compiled it with -fno-omit-frame-pointer, but the result is still the same. What can be other reasons for this and is there anything I can do about it?

Kernel version is 4.14.13.

[1] https://oc.postgrespro.ru/index.php/s/qDINPrRnlNqxXBd

@brendangregg
Copy link
Member

I'd do a normal perf on-CPU profile and see if that works:

perf record -F 99 -a -g -- sleep 30
perf script

If the stacks are broken there, then it's not a problem with bcc/BPF, and would be other problems (like glibc).

@arssher
Copy link
Author

arssher commented Mar 22, 2018

perf --call-graph fp resolves around 95% of stacks well, see [2]. The rest are sort of misplaced, or contain unknown symbols. However, I have noticed that most of correctly resolved stacktraces had been caught when the process was in userspace, while unhealthy ones contain kernel symbols. offcputime.py is sampled only in the kernel (finish_task_switch), so this is likely the same problem.

I think I should recompile the kernel with frame pointers explicitly enabled. Currently config says

ars@ars-thinkpad ~ $ cat /boot/config-4.14.13 | grep -i -E 'frame|unwind'
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_FRAME_VECTOR=y
# Frame buffer Devices
# Frame buffer hardware drivers
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# Common Clock Framework
CONFIG_FRAME_WARN=2048
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set
# CONFIG_UNWINDER_GUESS is not set

Seems like I need to say CONFIG_FRAME_POINTER=y here.

[2] https://oc.postgrespro.ru/index.php/s/IHwjDjh1fKnJw0X

@arssher
Copy link
Author

arssher commented Mar 22, 2018

Okay, I've recompiled the kernel:

ars@ars-thinkpad ~ $ cat /boot/config-$(uname -r) | grep -i frame
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_FRAME_VECTOR=y
# Frame buffer Devices
# Frame buffer hardware drivers
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# Common Clock Framework
CONFIG_FRAME_WARN=2048
CONFIG_FRAME_POINTER=y
CONFIG_UNWINDER_FRAME_POINTER=y

But nothing changed, stacks are still not resolved. Any ideas?

@brendangregg
Copy link
Member

How about testing the same off-CPU events? Eg:

perf record -e cs -a -g -- sleep 10
perf script

@arssher
Copy link
Author

arssher commented Mar 23, 2018

Here [1] is the results of

sudo perf record -e cs -a --call-graph fp -o perf_cs.out -p `pgpid.sh` -- sleep 10
sudo perf script -i perf_cs.out | stackcollapse-perf.pl > out.perf-folded && flamegraph.pl --title "perf record -e cs -a --call-graph fp -p \`pgpid.sh\`" out.perf-folded > perf_cs.svg

It shows that a lot of context switches happened inside secure_read (which is true, that's where we wait for data from the client) and a bunch of broken or semi-broken stacktraces. In this test, I've intentionally created heavy lock contention on a single lock, and I know that process actually spends > 80% of its time waiting for this lock. Here is the correct backtrace to this lock (I constantly see it when I attach with gdb at random moments):

#0  epoll_pwait (fd=19, ev=0x21de640, cnt=1, to=-1, sigs=0x0) at src/linux/epoll.c:29
#1  0x000000000085b807 in WaitEventSetWaitBlock (set=0x21de5d0, cur_timeout=-1, occurred_events=0x7ffd254e01e0, nevents=1) at /home/ars/postgres/postgres_cluster/src/backend/storage/ipc/latch.c:1048
#2  0x000000000085b6d7 in WaitEventSetWait (set=0x21de5d0, timeout=-1, occurred_events=0x7ffd254e01e0, nevents=1, wait_event_info=50331648) at /home/ars/postgres/postgres_cluster/src/backend/storage/ipc/latch.c:1000
#3  0x000000000085ae0a in WaitLatchOrSocket (latch=0x7f6c5d2e6954, wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331648) at /home/ars/postgres/postgres_cluster/src/backend/storage/ipc/latch.c:385
#4  0x000000000085ace5 in WaitLatch (latch=0x7f6c5d2e6954, wakeEvents=1, timeout=0, wait_event_info=50331648) at /home/ars/postgres/postgres_cluster/src/backend/storage/ipc/latch.c:339
#5  0x0000000000873431 in ProcSleep (locallock=0x21dbb90, lockMethodTable=0xee3c40 <default_lockmethod>) at /home/ars/postgres/postgres_cluster/src/backend/storage/lmgr/proc.c:1255
#6  0x000000000086c372 in WaitOnLock (locallock=0x21dbb90, owner=0x2296ba8) at /home/ars/postgres/postgres_cluster/src/backend/storage/lmgr/lock.c:1702
#7  0x000000000086b01b in LockAcquireExtended (locktag=0x7ffd254e0660, lockmode=8, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001') at /home/ars/postgres/postgres_cluster/src/backend/storage/lmgr/lock.c:998
#8  0x000000000086a62d in LockAcquire (locktag=0x7ffd254e0660, lockmode=8, sessionLock=0 '\000', dontWait=0 '\000') at /home/ars/postgres/postgres_cluster/src/backend/storage/lmgr/lock.c:688
#9  0x0000000000868990 in LockRelationOid (relid=16391, lockmode=8) at /home/ars/postgres/postgres_cluster/src/backend/storage/lmgr/lmgr.c:112
#10 0x000000000056fddc in RangeVarGetRelidExtended (relation=0x2260990, lockmode=8, missing_ok=0 '\000', nowait=0 '\000', callback=0x653ad5 <RangeVarCallbackForLockTable>, callback_arg=0x2260a58) at /home/ars/postgres/postgres_cluster/src/backend/catalog/namespace.c:365
#11 0x0000000000653a94 in LockTableCommand (lockstmt=0x2260a48) at /home/ars/postgres/postgres_cluster/src/backend/commands/lockcmds.c:60
#12 0x0000000000892115 in standard_ProcessUtility (pstmt=0x2260eb8, queryString=0x225ff40 "LOCK pgbench_accounts IN ACCESS EXCLUSIVE MODE;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2260f50, completionTag=0x7ffd254e0b90 "") at /home/ars/postgres/postgres_cluster/src/backend/tcop/utility.c:753
#13 0x00007f6c6efe4b24 in pathman_process_utility_hook (first_arg=0x2260eb8, queryString=0x225ff40 "LOCK pgbench_accounts IN ACCESS EXCLUSIVE MODE;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2260f50, completionTag=0x7ffd254e0b90 "") at src/hooks.c:946
#14 0x00000000008917a9 in ProcessUtility (pstmt=0x2260eb8, queryString=0x225ff40 "LOCK pgbench_accounts IN ACCESS EXCLUSIVE MODE;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2260f50, completionTag=0x7ffd254e0b90 "") at /home/ars/postgres/postgres_cluster/src/backend/tcop/utility.c:353
#15 0x0000000000890755 in PortalRunUtility (portal=0x21d13a0, pstmt=0x2260eb8, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0x2260f50, completionTag=0x7ffd254e0b90 "") at /home/ars/postgres/postgres_cluster/src/backend/tcop/pquery.c:1178
#16 0x0000000000890975 in PortalRunMulti (portal=0x21d13a0, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0x2260f50, altdest=0x2260f50, completionTag=0x7ffd254e0b90 "") at /home/ars/postgres/postgres_cluster/src/backend/tcop/pquery.c:1324
#17 0x000000000088fe94 in PortalRun (portal=0x21d13a0, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x2260f50, altdest=0x2260f50, completionTag=0x7ffd254e0b90 "") at /home/ars/postgres/postgres_cluster/src/backend/tcop/pquery.c:799
#18 0x0000000000889b96 in exec_simple_query (query_string=0x225ff40 "LOCK pgbench_accounts IN ACCESS EXCLUSIVE MODE;") at /home/ars/postgres/postgres_cluster/src/backend/tcop/postgres.c:1099
#19 0x000000000088e097 in PostgresMain (argc=1, argv=0x21e0bb8, dbname=0x21b4cd0 "postgres", username=0x21e0ab8 "ars") at /home/ars/postgres/postgres_cluster/src/backend/tcop/postgres.c:4088
#20 0x00000000007ecabc in BackendRun (port=0x21dc200) at /home/ars/postgres/postgres_cluster/src/backend/postmaster/postmaster.c:4357
#21 0x00000000007ec1cc in BackendStartup (port=0x21dc200) at /home/ars/postgres/postgres_cluster/src/backend/postmaster/postmaster.c:4029
#22 0x00000000007e8680 in ServerLoop () at /home/ars/postgres/postgres_cluster/src/backend/postmaster/postmaster.c:1753
#23 0x00000000007e7c23 in PostmasterMain (argc=3, argv=0x21b2b00) at /home/ars/postgres/postgres_cluster/src/backend/postmaster/postmaster.c:1361
#24 0x0000000000722688 in main (argc=3, argv=0x21b2b00) at /home/ars/postgres/postgres_cluster/src/backend/main/main.c:228

There is no such backtrace at [1]. There is one relatively large stack (taking 14.26% samples) containing locks-related symbols, but it was definitely wrongly walked: GrantLockLocal never calls LWLockAttemptLock.

As an experiment, I tried to generate off-cpu graph using -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit events, as described in [2]. Here is the result [3].
It is quite similar to offcputime flamegraph [4]... in a sense that they both can't walk up the main backtrace.

The moral is that this is indeed not bcc/bpf-specific problem. Stacks caught during off-cpu time analysis are almost always not unwinded, while stacks caught with usual on-cpu profiling (e.g. -e task-clock) are mostly fine.

[1] https://oc.postgrespro.ru/index.php/s/v6zNMbHxHOyMXqe
[2] http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
[3] https://oc.postgrespro.ru/index.php/s/Hff4vygjiP3jWys
[4] https://oc.postgrespro.ru/index.php/s/NkXWKdhklbNsGQP

@yonghong-song
Copy link
Collaborator

@arssher maybe you could help by creating a simple representative application which can reproduce your stack tracing behavior and folks here can help? Maybe indeed it is kernel related.
Or you could post your exact steps to reduce the issue with postgres, we may help as well.

@arssher
Copy link
Author

arssher commented Mar 25, 2018

Ok, here is the simple example which hopefully reproduces the sample problem. All the program at the bottom does is repeating 3 times 1) do some pointless number crunching 2) sleep 3 seconds.

I run it with

/usr/local/musl/bin/musl-gcc -no-pie -O0 -g -fno-omit-frame-pointer tmp2.c && ./a.out

If I invoke offcputime.py with

sudo ~/bcc/tools/offcputime.py -d -f --stack-storage-size 1024 -p  `pgrep a.out`  10 > out.offcpustacks
flamegraph.pl --color=aqua --countname=us --title="Off-CPU" < out.offcpustacks > offcpu_test.svg

It gives wrong stacks:

a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];__float128+0x9 [a.out];float+0x16 [a.out];waste_cycles+0x16 [a.out];-;swapgs_restore_regs_and_return_to_usermode;prepare_exit_to_usermode;exit_to_usermode_loop;schedule;finish_task_switch 9
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];__float128+0x9 [a.out];float+0x16 [a.out];waste_cycles+0x27 [a.out];-;swapgs_restore_regs_and_return_to_usermode;prepare_exit_to_usermode;exit_to_usermode_loop;schedule;finish_task_switch 89
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];__float128+0x9 [a.out];float+0x16 [a.out];waste_cycles+0x1f [a.out];-;swapgs_restore_regs_and_return_to_usermode;prepare_exit_to_usermode;exit_to_usermode_loop;schedule;finish_task_switch 672
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];__float128+0x9 [a.out];float+0x36 [a.out];printf+0x98 [libc.so];vfprintf+0x10e [libc.so];printf_core+0x101 [libc.so];[unknown] [libc.so];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000075

(See the flamegraph at [1]).
There is no called functions f and g, but some invalid ones, e.g. _float128.

On the other had, on-cpu profiling with perf with

sudo perf record --call-graph fp -e task-clock:ppp -F 99 -p `pgrep a.out` -o perf_fp.data sleep 10
sudo perf script -i perf_fp.data  | stackcollapse-perf.pl

Gives the proper stack

a.out;_start_c;[a.out];__libc_start_main;main;g;f;waste_cycles 454

Here is the program:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

void waste_cycles(void)
{
	long i;
	long n = 24;
	for (i = 0; i < 990000000; i++)
	{
		n = i * n + 38 - i;
	}
}

void f(void)
{
	int i;

	for (i = 0; i < 3; i++)
	{
		waste_cycles();
		printf("Going to sleep %d'th time \n", i);
		sleep(3);
		printf("Awoken\n");
	}
}

void g(void)
{
	f();
}

int main()
{
	g();
	return 0;
}

[1] https://oc.postgrespro.ru/index.php/s/U9LWVOWFhe5rlVh

@palmtenor
Copy link
Member

Can you try the following patch:

diff --git a/tools/offcputime.py b/tools/offcputime.py
index ad540ee..264e4fe 100755
--- a/tools/offcputime.py
+++ b/tools/offcputime.py
@@ -199,9 +199,9 @@ bpf_text = bpf_text.replace('MINBLOCK_US_VALUE', str(args.min_block_time))
 bpf_text = bpf_text.replace('MAXBLOCK_US_VALUE', str(args.max_block_time))

 # handle stack args
-kernel_stack_get = "stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID)"
+kernel_stack_get = "stack_traces.get_stackid(ctx, 0)"
 user_stack_get = \
-    "stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID | BPF_F_USER_STACK)"
+    "stack_traces.get_stackid(ctx, BPF_F_USER_STACK)"
 stack_context = ""
 if args.user_stacks_only:
     stack_context = "user"

Basically, disable BPF_F_REUSE_STACKID in offcputime.py?

@arssher
Copy link
Author

arssher commented Mar 26, 2018

Unfortunately, seems like nothing changed:

ars@ars-thinkpad ~/tmp/tmp $ sudo ~/bcc/tools/offcputime.py -d -f --stack-storage-size 1024 -p  `pgrep a.out`  10 > out.offcpustacks
ars@ars-thinkpad ~/tmp/tmp $ cat out.offcpustacks
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];__float128+0x9 [a.out];float+0x16 [a.out];waste_cycles+0x1f [a.out];-;swapgs_restore_regs_and_return_to_usermode;prepare_exit_to_usermode;exit_to_usermode_loop;schedule;finish_task_switch 41
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];__float128+0x9 [a.out];float+0x16 [a.out];waste_cycles+0x23 [a.out];-;swapgs_restore_regs_and_return_to_usermode;prepare_exit_to_usermode;exit_to_usermode_loop;schedule;finish_task_switch 45
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];__float128+0x9 [a.out];float+0x16 [a.out];waste_cycles+0x16 [a.out];-;swapgs_restore_regs_and_return_to_usermode;prepare_exit_to_usermode;exit_to_usermode_loop;schedule;finish_task_switch 48
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];__float128+0x9 [a.out];float+0x36 [a.out];printf+0x98 [libc.so];vfprintf+0x10e [libc.so];printf_core+0x101 [libc.so];[unknown] [libc.so];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000076

Still bogus float and printf functions instead of f and g.

yonghong-song added a commit that referenced this issue Mar 26, 2018
Fix issue #1641

The bcc user space stack is not printed out properly.

From https://en.wikipedia.org/wiki/Name_mangling,
all mangled symbols begin with _Z, and an identifier
beginning with an underscore followed by a capital is
a reserved identifier in C, so conflict with user identifiers
is avoided.

Further, from the llvm demangle code
https://github.com/llvm-mirror/libcxxabi/blob/master/src/cxa_demangle.cpp
The demangled name has the following specification:
    <mangled-name> ::= _Z <encoding>
                   ::= <type>
    extension      ::= ___Z <encoding> _block_invoke
    extension      ::= ___Z <encoding> _block_invoke<decimal-digit>+
    extension      ::= ___Z <encoding> _block_invoke_<decimal-digit>+

In the issue #1641, the function name are "f" and "g", which is demangled
to type "float" and "__float128", according to the above implementation.

In bcc case, we only care about functions, so only do demangling
for symbols starting with _Z or ___Z.

Signed-off-by: Yonghong Song <yhs@fb.com>
@yonghong-song
Copy link
Collaborator

@arssher I just pushed a pull request which should fix your issue.

@yonghong-song
Copy link
Collaborator

The issue @palmtenor pointed to is actually a real issue since in offcputime.py, pid and user stack id is recorded in the same key.

struct key_t {
    u32 pid;
    u32 tgid;
    int user_stack_id;
    int kernel_stack_id;
    char name[TASK_COMM_LEN];
};

For example, initially, <pid=100, user_stack_id=10> is recorded, and later on
user_stack_id=10 is reused and <pid=200, user_stack_id=10> is recorded.
Later on, <pid=100, user_stack_id=10> will be incorrect. One way COULD
be to prevent the second <pid=200, user_stack_id=10> by not allowing reuse.
and warn user to increase the stack size to avoid collision...

@arssher
Copy link
Author

arssher commented Mar 27, 2018

Commit 2f0705168 indeed helped a bit: now there are f and g functions instead of float. However, stack for the example shown above is still not quite correct.

sudo ~/bcc/tools/offcputime.py -d -f --stack-storage-size 1024 -p  `pgrep a.out` 10 > out.offcpustacks

Gives me the following:

ars@ars-thinkpad ~/tmp/tmp $ cat out.offcpustacks
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];g+0x9 [a.out];f+0x16 [a.out];waste_cycles+0x16 [a.out];-;swapgs_restore_regs_and_return_to_usermode;prepare_exit_to_usermode;exit_to_usermode_loop;schedule;finish_task_switch 10
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];g+0x9 [a.out];f+0x16 [a.out];waste_cycles+0x1f [a.out];-;swapgs_restore_regs_and_return_to_usermode;prepare_exit_to_usermode;exit_to_usermode_loop;schedule;finish_task_switch 82
a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];g+0x9 [a.out];f+0x36 [a.out];printf+0x98 [libc.so];vfprintf+0x10e [libc.so];printf_core+0x101 [libc.so];[unknown] [libc.so];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000082

Actual sleep is the last line, with 3000082 us. This stack contains a bunch of printf functions, which is wrong: while sleep (see the program above) is surrounded by printfs, we don't sleep in printf itself, at least that's what my sanity tells me. The correct stack should be like the one given by gdb:

main;g;f;sleep;nanosleep;__syscall_cp_c;__cp_end 10

Things become even more mad if I make the program only a little bit more complex by adding recursion to wasting cycles:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

long long waste_cycles_fib(long long n)
{
	if (n == 0 || n == 1)
		return n;
	return waste_cycles_fib(n - 1) + waste_cycles_fib(n - 2);
}

void f(void)
{
	int i;

	for (i = 0; i < 6; i++)
	{
		waste_cycles_fib(43);
		printf("Going to sleep %d'th time \n", i);
		sleep(3);
		printf("Awoken\n");
	}
}

void g(void)
{
	f();
}

int main()
{
	g();
	return 0;
}
 

Now, running it with

/usr/local/musl/bin/musl-gcc -no-pie -O0 -g -fno-omit-frame-pointer tmp.c && ./a.out

While profiling with

ars@ars-thinkpad ~/tmp/tmp $ sudo ~/bcc/tools/offcputime.py -d -f --stack-storage-size 1024 -p  `pgrep a.out` 38 > out.offcpustacks
WARNING: 1 stack traces could not be displayed.

Gives me, beside warning, the following (main, larger than 500 us) stacks:

ars@ars-thinkpad ~/tmp/tmp $ cat out.offcpustacks | awk '{ if ($NF > 500) { print } }'
a.out;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000054
a.out;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000060
a.out;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000072
a.out;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000075
a.out;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000076
a.out;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000077

@arssher
Copy link
Author

arssher commented Mar 27, 2018

As for BPF_F_REUSE_STACKID, I have applied the patch, and it looks like it didn't change anything here; while in general this probably should be fixed, in my simple example programs total number of stacks is small, and I always pass --stack-storage-size 1024 to offcputime.py.

@yonghong-song
Copy link
Collaborator

@arssher I cannot reproduce the above issues for the below incorrect stack

a.out;_start_c+0x0 [a.out];[unknown] [a.out];__libc_start_main+0x38 [libc.so];main+0x9 [a.out];g+0x9 [a.out];f+0x36 [a.out];printf+0x98 [libc.so];vfprintf+0x10e [libc.so];printf_core+0x101 [libc.so];[unknown] [libc.so];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000082

Looks like the issue starts from resolving libc.so. Maybe libc is not compiled with frame pointer on so kernel unwinding has issue here?

a.out;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];-;system_call_fast_compare_end;sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;finish_task_switch 3000054

This looks like offcputime.py tries to resolve user space stack and the user space process has exited. So it cannot even resolve the libc address.

You could use the following hack to print out the user address to make sure it is correct.

@@ -294,7 +294,7 @@ for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
         if need_delimiter:
             print("    --")
         for addr in user_stack:
-            print("    %s" % b.sym(addr, k.tgid))
+            print("    %x:%s" % (addr, b.sym(addr, k.tgid)))
         print("    %-16s %s (%d)" % ("-", k.name.decode(), k.pid))
         print("        %d\n" % v.value)

@arssher
Copy link
Author

arssher commented Mar 28, 2018

Well, I have stepped with the debugger until syscall instruction, and rbp is fine right before and after syscall. I use musl libc here, compiled with -fno-omit-frame-pointer:

ars@ars-thinkpad ~/tmp/tmp $ cat ~/musl/config.mak | grep CFLAGS
CFLAGS = -ggdb3 -fno-omit-frame-pointer
CFLAGS_AUTO = -g -Os -pipe -fno-unwind-tables -fno-asynchronous-unwind-tables -ffunction-sections -fdata-sections -Werror=implicit-function-declaration -Werror=implicit-int -Werror=pointer-sign -Werror=pointer-arith
CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -fexcess-precision=standard -frounding-math -Wa,--noexecstack
CFLAGS_MEMOPS = -fno-tree-loop-distribute-patterns
CFLAGS_NOSSP = -fno-stack-protector

Here are the addresses of wrong stacktrace, printed as you hinted:

ars@ars-thinkpad ~/tmp/tmp $ sudo ~/bcc/tools/offcputime.py -d --stack-storage-size 1024 -p  `pgrep a.out` 10
Tracing off-CPU time (us) of PID 16368 by user + kernel stack for 10 secs.
...
    ffffffffaa6a0c81:finish_task_switch [kernel]
    ffffffffaacb0c56:schedule [kernel]
    ffffffffaacb5361:do_nanosleep [kernel]
    ffffffffaa6f44fe:hrtimer_nanosleep [kernel]
    ffffffffaa6f469f:sys_nanosleep [kernel]
    ffffffffaae0008b:system_call_fast_compare_end [kernel]
    --
    7f7d67099e41:[unknown] [libc.so]
    7f7d67091f70:printf_core+0x101 [libc.so]
    7f7d670929bc:vfprintf+0x10e [libc.so]
    7f7d6709044a:printf+0x98 [libc.so]
    4005a9:f+0x36 [a.out]
    4005cb:g+0x9 [a.out]
    4005d7:main+0x9 [a.out]
    7f7d6706559e:__libc_start_main+0x38 [libc.so]
    40046e:[unknown] [a.out]
    400446:_start_c+0x0 [a.out]
    -                a.out (16368)
        3000073

And here is the map of process memory space:

ars@ars-thinkpad ~/tmp/tmp $ cat /proc/`pgrep a.out`/maps
00400000-00401000 r-xp 00000000 08:07 526443                             /home/ars/tmp/tmp/a.out
00600000-00601000 r--p 00000000 08:07 526443                             /home/ars/tmp/tmp/a.out
00601000-00602000 rw-p 00001000 08:07 526443                             /home/ars/tmp/tmp/a.out
7f7d67040000-7f7d670d3000 r-xp 00000000 08:06 809176                     /usr/local/musl/lib/libc.so
7f7d672d2000-7f7d672d3000 r--p 00092000 08:06 809176                     /usr/local/musl/lib/libc.so
7f7d672d3000-7f7d672d4000 rw-p 00093000 08:06 809176                     /usr/local/musl/lib/libc.so
7f7d672d4000-7f7d672d7000 rw-p 00000000 00:00 0 
7ffdc8ec8000-7ffdc8eea000 rw-p 00000000 00:00 0                          [stack]
7ffdc8fd7000-7ffdc8fda000 r--p 00000000 00:00 0                          [vvar]
7ffdc8fda000-7ffdc8fdc000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Let's see how sleep() is implemented and try to understand what's going on here. sleep function in musl just calls another function nanosleep:

ars@ars-thinkpad ~/tmp/tmp $ objdump -S --start-address=0x000000000005d608 --stop-address=0x000000000005d638 /usr/local/musl/lib/libc.so

/usr/local/musl/lib/libc.so:     file format elf64-x86-64


Disassembly of section .text:

000000000005d608 <sleep>:
#include <unistd.h>
#include <time.h>

unsigned sleep(unsigned seconds)
{
   5d608:	55                   	push   %rbp
	struct timespec tv = { .tv_sec = seconds, .tv_nsec = 0 };
   5d609:	89 f8                	mov    %edi,%eax
{
   5d60b:	48 89 e5             	mov    %rsp,%rbp
   5d60e:	48 83 ec 10          	sub    $0x10,%rsp
	if (nanosleep(&tv, &tv))
   5d612:	48 8d 7d f0          	lea    -0x10(%rbp),%rdi
	struct timespec tv = { .tv_sec = seconds, .tv_nsec = 0 };
   5d616:	48 89 45 f0          	mov    %rax,-0x10(%rbp)
   5d61a:	48 c7 45 f8 00 00 00 	movq   $0x0,-0x8(%rbp)
   5d621:	00 
	if (nanosleep(&tv, &tv))
   5d622:	48 89 fe             	mov    %rdi,%rsi
   5d625:	e8 14 df ff ff       	callq  5b53e <nanosleep>
   5d62a:	31 d2                	xor    %edx,%edx
   5d62c:	85 c0                	test   %eax,%eax
   5d62e:	74 04                	je     5d634 <sleep+0x2c>
		return tv.tv_sec;
   5d630:	48 8b 55 f0          	mov    -0x10(%rbp),%rdx
	return 0;
}
   5d634:	89 d0                	mov    %edx,%eax
   5d636:	c9                   	leaveq 
   5d637:	c3                   	retq 

Note that rbp is saved on enter and restored on exit as it should. nanosleep prepares the arguments and calls __syscall_cp:

ars@ars-thinkpad ~/tmp/tmp $ objdump -S --start-address=0x000000000005b53e --stop-address=0x000000000005b56a /usr/local/musl/lib/libc.so

/usr/local/musl/lib/libc.so:     file format elf64-x86-64


Disassembly of section .text:

000000000005b53e <nanosleep>:
#include <time.h>
#include "syscall.h"
#include "libc.h"

int nanosleep(const struct timespec *req, struct timespec *rem)
{
   5b53e:	55                   	push   %rbp
   5b53f:	48 89 f2             	mov    %rsi,%rdx
	return syscall_cp(SYS_nanosleep, req, rem);
   5b542:	45 31 c9             	xor    %r9d,%r9d
   5b545:	48 89 fe             	mov    %rdi,%rsi
   5b548:	45 31 c0             	xor    %r8d,%r8d
   5b54b:	31 c9                	xor    %ecx,%ecx
{
   5b54d:	48 89 e5             	mov    %rsp,%rbp
	return syscall_cp(SYS_nanosleep, req, rem);
   5b550:	48 83 ec 08          	sub    $0x8,%rsp
   5b554:	bf 23 00 00 00       	mov    $0x23,%edi
   5b559:	6a 00                	pushq  $0x0
   5b55b:	e8 c5 b1 ff ff       	callq  56725 <__syscall_cp>
   5b560:	48 89 c7             	mov    %rax,%rdi
   5b563:	e8 28 d5 fb ff       	callq  18a90 <__syscall_ret>
}
   5b568:	c9                   	leaveq 
   5b569:	c3                   	retq

Again, rbp is properly pushed and popped. __syscall_cp just invokes __syscall_cp_c:

ars@ars-thinkpad ~/tmp/tmp $ objdump -S --start-address=0x0000000000056725 --stop-address=0x000000000005672f /usr/local/musl/lib/libc.so

/usr/local/musl/lib/libc.so:     file format elf64-x86-64


Disassembly of section .text:

0000000000056725 <__syscall_cp>:
weak_alias(sccp, __syscall_cp_c);

long (__syscall_cp)(syscall_arg_t nr,
                    syscall_arg_t u, syscall_arg_t v, syscall_arg_t w,
                    syscall_arg_t x, syscall_arg_t y, syscall_arg_t z)
{
   56725:	55                   	push   %rbp
   56726:	48 89 e5             	mov    %rsp,%rbp
	return __syscall_cp_c(nr, u, v, w, x, y, z);
}
   56729:	5d                   	pop    %rbp
	return __syscall_cp_c(nr, u, v, w, x, y, z);
   5672a:	e9 4a 0b 00 00       	jmpq   57279 <__syscall_cp_c>

Here rbp is popped before the call to __syscall_cp_c and return address is not pushed at all, but that's ok, we will just return directly to nanosleep. __syscall_cp_c is the most complex piece:

ars@ars-thinkpad ~/tmp/tmp $ objdump -S --start-address=0x0000000000057279 --stop-address=0x0000000000057311 /usr/local/musl/lib/libc.so

/usr/local/musl/lib/libc.so:     file format elf64-x86-64


Disassembly of section .text:

0000000000057279 <__syscall_cp_c>:
                      syscall_arg_t, syscall_arg_t, syscall_arg_t);

long __syscall_cp_c(syscall_arg_t nr,
                    syscall_arg_t u, syscall_arg_t v, syscall_arg_t w,
                    syscall_arg_t x, syscall_arg_t y, syscall_arg_t z)
{
   57279:	55                   	push   %rbp
   5727a:	48 89 f8             	mov    %rdi,%rax
   5727d:	4d 89 c2             	mov    %r8,%r10
   57280:	48 89 f7             	mov    %rsi,%rdi
   57283:	48 89 d6             	mov    %rdx,%rsi
   57286:	48 89 ca             	mov    %rcx,%rdx
   57289:	48 89 e5             	mov    %rsp,%rbp
   5728c:	41 54                	push   %r12
   5728e:	53                   	push   %rbx
   5728f:	4c 8b 5d 10          	mov    0x10(%rbp),%r11
static inline struct pthread *__pthread_self()
{
	struct pthread *self;
	__asm__ __volatile__ ("mov %%fs:0,%0" : "=r" (self) );
   57293:	64 48 8b 1c 25 00 00 	mov    %fs:0x0,%rbx
   5729a:	00 00 
	pthread_t self;
	long r;
	int st;

	if ((st=(self=__pthread_self())->canceldisable)
   5729c:	8b 4b 4c             	mov    0x4c(%rbx),%ecx
   5729f:	85 c9                	test   %ecx,%ecx
   572a1:	74 14                	je     572b7 <__syscall_cp_c+0x3e>
	    && (st==PTHREAD_CANCEL_DISABLE || nr==SYS_close))
   572a3:	ff c9                	dec    %ecx
   572a5:	74 06                	je     572ad <__syscall_cp_c+0x34>
   572a7:	48 83 f8 03          	cmp    $0x3,%rax
   572ab:	75 0a                	jne    572b7 <__syscall_cp_c+0x3e>

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
   572ad:	4d 89 c8             	mov    %r9,%r8
	register long r9 __asm__("r9") = a6;
   572b0:	4d 89 d9             	mov    %r11,%r9
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
   572b3:	0f 05                	syscall 
		return __syscall(nr, u, v, w, x, y, z);
   572b5:	eb 51                	jmp    57308 <__syscall_cp_c+0x8f>

	r = __syscall_cp_asm(&self->cancel, nr, u, v, w, x, y, z);
   572b7:	41 53                	push   %r11
   572b9:	49 89 d0             	mov    %rdx,%r8
   572bc:	41 51                	push   %r9
   572be:	48 89 fa             	mov    %rdi,%rdx
   572c1:	48 8d 7b 48          	lea    0x48(%rbx),%rdi
   572c5:	48 89 f1             	mov    %rsi,%rcx
   572c8:	4d 89 d1             	mov    %r10,%r9
   572cb:	48 89 c6             	mov    %rax,%rsi
   572ce:	49 89 c4             	mov    %rax,%r12
   572d1:	e8 42 2b 00 00       	callq  59e18 <__syscall_cp_asm>
	if (r==-EINTR && nr!=SYS_close && self->cancel &&
   572d6:	48 83 f8 fc          	cmp    $0xfffffffffffffffc,%rax
   572da:	5a                   	pop    %rdx
   572db:	59                   	pop    %rcx
   572dc:	75 2a                	jne    57308 <__syscall_cp_c+0x8f>
   572de:	49 83 fc 03          	cmp    $0x3,%r12
   572e2:	74 24                	je     57308 <__syscall_cp_c+0x8f>
   572e4:	8b 43 48             	mov    0x48(%rbx),%eax
   572e7:	85 c0                	test   %eax,%eax
   572e9:	74 16                	je     57301 <__syscall_cp_c+0x88>
	    self->canceldisable != PTHREAD_CANCEL_DISABLE)
   572eb:	8b 43 4c             	mov    0x4c(%rbx),%eax
	if (r==-EINTR && nr!=SYS_close && self->cancel &&
   572ee:	ff c8                	dec    %eax
   572f0:	74 0f                	je     57301 <__syscall_cp_c+0x88>
		r = __cancel();
	return r;
}
   572f2:	48 8d 65 f0          	lea    -0x10(%rbp),%rsp
		r = __cancel();
   572f6:	31 c0                	xor    %eax,%eax
}
   572f8:	5b                   	pop    %rbx
   572f9:	41 5c                	pop    %r12
   572fb:	5d                   	pop    %rbp
		r = __cancel();
   572fc:	e9 45 ff ff ff       	jmpq   57246 <__cancel>
   57301:	48 c7 c0 fc ff ff ff 	mov    $0xfffffffffffffffc,%rax
}
   57308:	48 8d 65 f0          	lea    -0x10(%rbp),%rsp
   5730c:	5b                   	pop    %rbx
   5730d:	41 5c                	pop    %r12
   5730f:	5d                   	pop    %rbp
   57310:	c3                   	retq   

But again it handles rbp correctly. The syscall itself happens in the following line:

   572d1:	e8 42 2b 00 00       	callq  59e18 <__syscall_cp_asm>

Here a small piece of asm is executed:

ars@ars-thinkpad ~/tmp/tmp $ objdump -d --start-address=0x0000000000059e18 --stop-address=0x0000000000059e47 /usr/local/musl/lib/libc.so

/usr/local/musl/lib/libc.so:     file format elf64-x86-64


Disassembly of section .text:

0000000000059e18 <__syscall_cp_asm>:
   59e18:	8b 07                	mov    (%rdi),%eax
   59e1a:	85 c0                	test   %eax,%eax
   59e1c:	75 24                	jne    59e42 <__cp_cancel>
   59e1e:	49 89 fb             	mov    %rdi,%r11
   59e21:	48 89 f0             	mov    %rsi,%rax
   59e24:	48 89 d7             	mov    %rdx,%rdi
   59e27:	48 89 ce             	mov    %rcx,%rsi
   59e2a:	4c 89 c2             	mov    %r8,%rdx
   59e2d:	4d 89 ca             	mov    %r9,%r10
   59e30:	4c 8b 44 24 08       	mov    0x8(%rsp),%r8
   59e35:	4c 8b 4c 24 10       	mov    0x10(%rsp),%r9
   59e3a:	4c 89 5c 24 08       	mov    %r11,0x8(%rsp)
   59e3f:	0f 05                	syscall 

0000000000059e41 <__cp_end>:
   59e41:	c3                   	retq   

0000000000059e42 <__cp_cancel>:
   59e42:	e9 ff d3 ff ff       	jmpq   57246 <__cancel>

It does not save & restore rbp. However, it doesn't touch it either, so the only trouble happening here is the loss of the name of the caller (__syscall_cp here). The whole callchain should not be damaged.
And that's what I see in gdb: before and immediately after syscall, I see that rbp points to the beginning of __syscall_cp_c stack frame. We can print the address of its caller, who is nanosleep:

(gdb) x/5i $pc - 4
+x/5i $pc - 4
   0x7ffff7dc1e3d <__syscall_cp_asm+37>:        and    $0x8,%al
   0x7ffff7dc1e3f <__syscall_cp_asm+39>:        syscall
=> 0x7ffff7dc1e41 <__cp_end>:   retq
   0x7ffff7dc1e42 <__cp_cancel>:        jmpq   0x7ffff7dbf246 <__cancel>
(gdb) x/1xg $rbp+0x8
+x/1xg $rbp+0x8
0x7fffffffdab8: 0x00007ffff7dc3560
(gdb) info symbol 0x00007ffff7dc3560
+info symbol 0x00007ffff7dc3560
nanosleep + 34 in section .text of /lib/ld-musl-x86_64.so.1

(Disregard that here __cp_end has different virtual address then in the map at the beginning of the post -- I restarted the program to debug it).

What we see in the walked stack trace instead? The first unknown symbol is

7f7d67099e41:[unknown] [libc.so]

This is actually __cp_end label, as 0x7f7d67099e41 - 0x7f7d67040000 = 0x59e41, and this is good.
However, the next entry

7f7d67091f70:printf_core+0x101 [libc.so]

is wrong, it should be nanosleep instead. It looks like someone at the kernel side overwrites rbp without pushing it to the stack first, breaking the linked list of rbps. It probably happens immediately in the interrupt handler, because all the kernel functions are resolved well. I can objdump the code around system_call_fast_compare_end, but I don't know how system calls are exactly handled in Linux and have no idea how to debug this.

@yonghong-song yonghong-song added this to the mnsl milestone Mar 28, 2018
@yonghong-song
Copy link
Collaborator

Looks like it could be kernel related. I am not able to reproduce the issue with the latest net-next.
But I can reproduce with 4.14.13. Using sudo ./trace.py -U 'SyS_writev', on 4.14.13, the stack trace is already bad. Need further debug for the root cause.

@yonghong-song
Copy link
Collaborator

Spent quite some time to debug but did not find the root cause.

  • tried to add USDT in musl library in __stdio_write and nanosleep. Unfortunately, as soon as I tried to print the stack of these USDTs. The program is gone.
  • studied code in kernel/events/callchain.c get_perf_callchain trying to see whether
    the registers used for kernel stack and user stack are consistent or not. It should be
    consistent as finish_task_switch restored both regs and current which saved before previous finish_task_switch.
  • worst, after some hacking, I am not able to reproduce the issue any more, even if you go back to
    square zero with clean musl/test.c and clean 4.14.13 kernel code.

Could this be related to CONFIG_PREEMPT? I did not have CONFIG_PREEMPT but still can reproeuce the issue earlier.

@arssher could you share your kernel config if that can make it easy to reproduce the issue?

@arssher
Copy link
Author

arssher commented Mar 30, 2018

@yonghong-song
Copy link
Collaborator

@arssher , through some debugging. I am able to root cause the issue. The file is arch/x86/entry/entry_64.S. A link to latest stable 4.13 code https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/arch/x86/entry/entry_64.S?h=linux-4.13.y#n177 which is very similar to 4.14.13 code.

By studying the code, I found one workaround, enable kernel tracepoint raw_syscalls:sys_enter or raw_syscalls:sys_exit.

[root@localhost sys_enter]# pwd
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter
[root@localhost sys_enter]# echo 1 > enable

This will add TIF_SYSCALL_TRACEPOINT to every task in the system and effectively force slowpath for syscall processing in entry_64.S. I did not find an easy workaround than this yet. This does have overhead as it will flood the debugfs trace_pipe.

The issue has been fixed in upcoming 4.16 release and backported to stable release 4.14 and 4.15. So if you upgrade your kernel version from 4.14.13 to some later 4.14.x version, the problem will get fixed.

At the same time, I am studying whether we have easy way to fix the older kernels (4.13 or 4.9), or we could request the same patch (as in 4.14/4.15 stable release) back ported to 4.13/4.9.

@yonghong-song
Copy link
Collaborator

Looks like 4.14.18 has the fix in arch/x86/entry/entry_64.S which should avoid the wrong user stack.

The config you provided has CONFIG_AUDITSYSCALL=y, so you could manipulate audit rules for syscall auditing. If there is a syscall audit rule, the syscall handling in x86 will go to a different path where bp register will be correctly setup.

For example, I change the audit rule from

[yhs@localhost ~]$ sudo auditctl -l
-a never,task
[yhs@localhost ~]$

to

[yhs@localhost ~]$ sudo auditctl -l
-a never,exit -S umask
[yhs@localhost ~]$

and the user space will be correct.

yonghong-song added a commit that referenced this issue Apr 4, 2018
The issue comes up when I investigated issue #1641.
A func symbol defined in assembly code will be size of 0,
e.g., http://git.musl-libc.org/cgit/musl/tree/src/thread/x86_64/syscall_cp.s
symbol __cp_begin.
  .text
  .global __cp_begin
  .hidden __cp_begin
  ...
  __syscall_cp_asm:

  __cp_begin:
	mov (%rdi),%eax
	test %eax,%eax

and __cp_begin cannot be traced through bcc since
symbol resolution rejects any func symbol with size 0.

This patch removed size-must-not-zero restriction so that
the symbol like __cp_begin can be traced.
Command line: trace.py -p <pid> -U '<binary_path>:__cp_begin'

Signed-off-by: Yonghong Song <yhs@fb.com>
@arssher
Copy link
Author

arssher commented Apr 10, 2018

Sorry for the delay. Indeed, sudo auditctl -a never,exit -S umask or upgrade to 4.16 solves the problem.

I still don't quite understand where exactly rbp was clobbered, it would be interesting to know that. The line you've linked doesn't look criminal:

movq	PER_CPU_VAR(current_task), %r11

What should I have done to investigate this myself?

Thank you. Should I close the issue or you will do that once solution is found for older kernels?

@yonghong-song
Copy link
Collaborator

The related source code in 4.14.13, arch/x86/entry/entry_64.S.

        pushq   %rcx                            /* pt_regs->cx */
        pushq   $-ENOSYS                        /* pt_regs->ax */
        pushq   %r8                             /* pt_regs->r8 */
        pushq   %r9                             /* pt_regs->r9 */
        pushq   %r10                            /* pt_regs->r10 */
        pushq   %r11                            /* pt_regs->r11 */
        sub     $(6*8), %rsp                    /* pt_regs->bp, bx, r12-15 not saved */
        UNWIND_HINT_REGS extra=0

        TRACE_IRQS_OFF

        /*
         * If we need to do entry work or if we guess we'll need to do
         * exit work, go straight to the slow path.
         */
        movq    PER_CPU_VAR(current_task), %r11
        testl   $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
        jnz     entry_SYSCALL64_slow_path

entry_SYSCALL_64_fastpath:
        /*
         * Easy case: enable interrupts and issue the syscall.  If the syscall
         * needs pt_regs, we'll call a stub that disables interrupts again
         * and jumps to the slow path.
         */
        TRACE_IRQS_ON
        ENABLE_INTERRUPTS(CLBR_NONE)
#if __SYSCALL_MASK == ~0
        cmpq    $__NR_syscall_max, %rax
#else
        andl    $__SYSCALL_MASK, %eax
        cmpl    $__NR_syscall_max, %eax
#endif
        ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
        movq    %r10, %rcx

        /*
         * This call instruction is handled specially in stub_ptregs_64.
         * It might end up jumping to the slow path.  If it jumps, RAX
         * and all argument registers are clobbered.
         */
        call    *sys_call_table(, %rax, 8)
.Lentry_SYSCALL_64_after_fastpath_call:

Basically, when the syscall goes to fastpath, bp register is not saved
in pt_regs. So its value will be the same as the previous slowpath.
That is when things go wrong.

So rbp is not clobbered, but merely not saved so its value still points to somewhere
on the stack but may not be current. My workaround is to force slowpath so rbp
is saved into pt_regs properly.

I think you can close the issue. For old long term supported kernels, 4.9, 4.13, 4.15,
all latest stable kernels have fixed the issue.

@arssher arssher closed this as completed Apr 18, 2018
banh-gao pushed a commit to banh-gao/bcc that referenced this issue Jun 21, 2018
Fix issue iovisor#1641

The bcc user space stack is not printed out properly.

From https://en.wikipedia.org/wiki/Name_mangling,
all mangled symbols begin with _Z, and an identifier
beginning with an underscore followed by a capital is
a reserved identifier in C, so conflict with user identifiers
is avoided.

Further, from the llvm demangle code
https://github.com/llvm-mirror/libcxxabi/blob/master/src/cxa_demangle.cpp
The demangled name has the following specification:
    <mangled-name> ::= _Z <encoding>
                   ::= <type>
    extension      ::= ___Z <encoding> _block_invoke
    extension      ::= ___Z <encoding> _block_invoke<decimal-digit>+
    extension      ::= ___Z <encoding> _block_invoke_<decimal-digit>+

In the issue iovisor#1641, the function name are "f" and "g", which is demangled
to type "float" and "__float128", according to the above implementation.

In bcc case, we only care about functions, so only do demangling
for symbols starting with _Z or ___Z.

Signed-off-by: Yonghong Song <yhs@fb.com>
banh-gao pushed a commit to banh-gao/bcc that referenced this issue Jun 21, 2018
The issue comes up when I investigated issue iovisor#1641.
A func symbol defined in assembly code will be size of 0,
e.g., http://git.musl-libc.org/cgit/musl/tree/src/thread/x86_64/syscall_cp.s
symbol __cp_begin.
  .text
  .global __cp_begin
  .hidden __cp_begin
  ...
  __syscall_cp_asm:

  __cp_begin:
	mov (%rdi),%eax
	test %eax,%eax

and __cp_begin cannot be traced through bcc since
symbol resolution rejects any func symbol with size 0.

This patch removed size-must-not-zero restriction so that
the symbol like __cp_begin can be traced.
Command line: trace.py -p <pid> -U '<binary_path>:__cp_begin'

Signed-off-by: Yonghong Song <yhs@fb.com>
CrackerCat pushed a commit to CrackerCat/bcc that referenced this issue Jul 31, 2024
Fix issue iovisor#1641

The bcc user space stack is not printed out properly.

From https://en.wikipedia.org/wiki/Name_mangling,
all mangled symbols begin with _Z, and an identifier
beginning with an underscore followed by a capital is
a reserved identifier in C, so conflict with user identifiers
is avoided.

Further, from the llvm demangle code
https://github.com/llvm-mirror/libcxxabi/blob/master/src/cxa_demangle.cpp
The demangled name has the following specification:
    <mangled-name> ::= _Z <encoding>
                   ::= <type>
    extension      ::= ___Z <encoding> _block_invoke
    extension      ::= ___Z <encoding> _block_invoke<decimal-digit>+
    extension      ::= ___Z <encoding> _block_invoke_<decimal-digit>+

In the issue iovisor#1641, the function name are "f" and "g", which is demangled
to type "float" and "__float128", according to the above implementation.

In bcc case, we only care about functions, so only do demangling
for symbols starting with _Z or ___Z.

Signed-off-by: Yonghong Song <yhs@fb.com>
CrackerCat pushed a commit to CrackerCat/bcc that referenced this issue Jul 31, 2024
The issue comes up when I investigated issue iovisor#1641.
A func symbol defined in assembly code will be size of 0,
e.g., http://git.musl-libc.org/cgit/musl/tree/src/thread/x86_64/syscall_cp.s
symbol __cp_begin.
  .text
  .global __cp_begin
  .hidden __cp_begin
  ...
  __syscall_cp_asm:

  __cp_begin:
	mov (%rdi),%eax
	test %eax,%eax

and __cp_begin cannot be traced through bcc since
symbol resolution rejects any func symbol with size 0.

This patch removed size-must-not-zero restriction so that
the symbol like __cp_begin can be traced.
Command line: trace.py -p <pid> -U '<binary_path>:__cp_begin'

Signed-off-by: Yonghong Song <yhs@fb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants