Full symbols in a libfabric stack trace? #7939

Closed
mwheinz opened this issue Aug 10, 2022 · 7 comments

@mwheinz
Contributor

mwheinz commented Aug 10, 2022

Hey, guys,

I'm trying to track down what presents as a PSM3 error report but which I suspect is an NCCL bug. To do that, I'm trying to get a symbolic stack trace of the executable when I call abort() inside PSM3, but simply adding --enable-debug to the libfabric configure doesn't seem to work.

Any ideas? The current configure I'm using is:

./autogen.sh && ./configure --prefix=${HOME} --enable-debug --with-cuda=/usr/local/cuda-11.6 --enable-cuda-dlopen --enable-only --enable-psm3

@aingerson
Contributor

Sometimes I find I need to explicitly set CFLAGS="-g -O0" to fully enable the gdb-able build.
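
One way to apply this suggestion is to pass CFLAGS as a variable on the configure command line, which Autoconf-generated configure scripts accept. The following is only a sketch that reuses the options from the original report; it has not been verified against this particular build:

```shell
# Pass CFLAGS explicitly so the build keeps debug info and disables optimization.
./autogen.sh && ./configure CFLAGS="-g -O0" --prefix=${HOME} --enable-debug \
    --with-cuda=/usr/local/cuda-11.6 --enable-cuda-dlopen --enable-only --enable-psm3

# Confirm which flags ended up in the generated Makefile.
grep ^CFLAGS Makefile
```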

@mwheinz
Contributor Author

mwheinz commented Aug 10, 2022

> Sometimes I find I need to explicitly set CFLAGS="-g -O0" to fully enable the gdb-able build.

I'll give that a try. Thanks.

@j-xiong
Contributor

j-xiong commented Aug 10, 2022

That's weird. What is the output of grep ^CFLAGS Makefile?

@mwheinz
Contributor Author

mwheinz commented Aug 10, 2022

CFLAGS = -g -O0 -Wall -Wundef -Wpointer-arith -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fstack-protector-strong -fvisibility=hidden -Wall -Wundef -Wpointer-arith

however, the backtrace still looks like this:

[octo2:3400787] Signal code:  (-6)
[octo2:3400787] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7fbd602d3b20]
[octo2:3400787] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fbd5f7a937f]
[octo2:3400787] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fbd5f793db5]
[octo2:3400787] [ 3] /home/mwheinz/lib/libfabric.so.1(+0x9e33a)[0x7fbcec83c33a]
[octo2:3400787] [ 4] /home/mwheinz/lib/libfabric.so.1(+0x9eab7)[0x7fbcec83cab7]
[octo2:3400787] [ 5] /home/mwheinz/lib/libfabric.so.1(+0x9fa59)[0x7fbcec83da59]
[octo2:3400787] [ 6] /home/mwheinz/aws-ofi-nccl/lib/libnccl-net.so(+0x48d3)[0x7fbcd01448d3]
[octo2:3400787] [ 7] /home/mwheinz/horovod/0.22.1-ompi-4.1.3-cuda-ofi-nccl/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x177eb4)[0x7fbcfe7b7eb4]

@bsbernd

bsbernd commented Aug 10, 2022

Looks like a glibc backtrace - you can resolve this with addr2line or eu-addr2line. I had created a script for our project to automate that. Or use libbacktrace, which provides auto resolved lines (unless debug symbols are stripped).
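
The addr2line approach can be sketched as a small shell helper. Frames printed as `module(+0xOFFSET)[0xADDR]` carry an offset relative to the start of the module, and that offset can be handed to `addr2line -e module`. A plausible reason for the bare offsets, which the thread does not confirm, is that glibc's `backtrace_symbols()` can name only functions present in the dynamic symbol table, and a library built with `-fvisibility=hidden` exports few of them. The names `parse_frame`, `resolve_frames`, and the `ADDR2LINE` override are illustrative assumptions, not part of libfabric or of the script mentioned in the comment above; the sketch also assumes absolute module paths without spaces.

```shell
#!/bin/sh
# Sketch: resolve "module(+0xOFFSET)[0xADDR]" backtrace frames with addr2line.
# parse_frame, resolve_frames and ADDR2LINE are hypothetical names for illustration.

# Print "<module> <offset>" for a frame that carries only a module-relative offset.
# Frames that already name a symbol, such as "libc.so.6(abort+0x127)", print nothing.
parse_frame() {
    printf '%s\n' "$1" |
        sed -n 's|^[^/]*\(/[^()[:space:]]*\)(+\(0x[0-9a-fA-F]*\))\[0x[0-9a-fA-F]*].*$|\1 \2|p'
}

# Read a backtrace on stdin and run addr2line once per unresolved frame.
resolve_frames() {
    while IFS= read -r line; do
        frame=$(parse_frame "$line")
        [ -n "$frame" ] || continue
        module=${frame% *}    # text before the last space: the module path
        offset=${frame##* }   # text after the last space: the hex offset
        # -f: function name, -C: demangle C++, -p: one line per address,
        # -e: the binary or shared object to inspect
        "${ADDR2LINE:-addr2line}" -f -C -p -e "$module" "$offset"
    done
}
```

With the backtrace saved to a file (the name `backtrace.txt` is hypothetical), running `resolve_frames < backtrace.txt` prints one function and source location per unresolved frame, provided the modules were built with debug information and have not been stripped.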

@mwheinz
Contributor Author

mwheinz commented Aug 12, 2022

@aakefbs - I still can't figure out why abort() didn't produce the function names, but addr2line worked perfectly. Thanks.

@mwheinz mwheinz closed this as completed Aug 12, 2022
@mwheinz
Contributor Author

mwheinz commented Oct 11, 2022 via email
