
Segmentation fault with cuda-7.5 #8

Closed
Jerrynet opened this issue Jan 15, 2016 · 6 comments

@Jerrynet
I built NCCL with cuda-7.5:

make CUDA_HOME=/usr/local/cuda-7.5 test

and ran the test with the following commands:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
./build/test/all_reduce_test

which causes a segmentation fault:

# Using devices
Segmentation fault

But all tests run smoothly if I build nccl with cuda-7.0.
Is the current version of nccl not compatible with cuda-7.5?

@cliffwoolley
Collaborator

I've seen a similar issue when libcuda.so is not in the LD_LIBRARY_PATH (on
one of my systems, only libcuda.so.1 was present, and the usual libcuda.so
symlink was absent). Can you please check that? If that turns out to be
your issue, you can either run `ln -s libcuda.so.1 libcuda.so` in the
relevant directory, or modify NCCL to dlopen libcuda.so.1 instead of
libcuda.so.

Thanks,
Cliff


@cliffwoolley
Collaborator

Oh, and yes, NCCL is normally compatible with CUDA 7.5. It actually is a
bit more complete on CUDA 7.5 than on 7.0, since 7.0 lacked some of the
necessary support for the fp16 'half' datatype.


@cliffwoolley
Collaborator

Side note: assuming this is the same issue with libcuda.so that I'm referring to, we should fix the tests to fail more gracefully when the communicator cannot be created. The segfault happens when we pass a NULL communicator to some subsequent routine.

@Jerrynet
Author

@cliffwoolley Thanks for the explanation.
I created a symlink to libcuda.so.1 and now it works!
So it was the same issue with libcuda.so.

@cliffwoolley
Collaborator

Great! Glad to hear it. We'll leave this issue open to deal with the libcuda.so[.1] loading (perhaps we could try both variants before giving up) as well as to detect communicator creation failure in the test apps without segfaulting. I believe @nluehr already has fixes pending for one or both of these issues.

@nluehr
Contributor

nluehr commented Feb 12, 2016

These issues are resolved in change sets caa40b8 and 2758353.

@nluehr nluehr closed this as completed Feb 12, 2016
dfyz added a commit to dfyz/nccl that referenced this issue Sep 15, 2023
NCCL can be built with `-DPROFILE_PROXY -DENABLE_TIMER` and run
with `NCCL_PROXY_PROFILE=trace.json` environment variable to dump
the timeline of all proxy operations to `trace.json` using the
so-called Trace Event Format.

However, currently NCCL segfaults even when trying to profile very
simple operations. Just running any binary from `nccl-tests` is
enough to trigger the issue, for example:

$ cat run.sh
NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1 \
NCCL_PROXY_PROFILE="proxy_${OMPI_COMM_WORLD_RANK}.json" \
./build/all_reduce_perf -g 2
$ mpirun -n 2 ./run.sh
...
./run.sh: line 4: 372891 Segmentation fault [...]
...
$ gdb build/all_reduce_perf core
...
Program terminated with signal SIGSEGV, Segmentation fault.
57        event->timestamp[state%8] = gettime()-profilingStart;

Note that the only point of `NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1`
is to trigger network communication (and hence proxying) between the
two GPUs even when they are on the same host. If the ranks are on
different hosts, you don't need these variables to trigger the issue.

The root cause of the problem is that the proxy profiler tries to track
only `NCCL_STEPS` (i.e., 8) steps per sub at a time, since there can
never be more than `NCCL_STEPS` in flight at any given moment.
So, for any tracked step with profiling event `e`, the profiler will:
  a) fill `e->timestamp[ncclProxyProfileBegin]` at the very beginning
     of the sub, allocating a slot in `args->subs[sub].profilingEvents[step%NCCL_STEPS]`
  b) fill `e->timestamp[x]` for `x < ncclProxyProfileEnd` (e.g.,
     x = ncclProxyProfileSendGPUWait, x = ncclProxyProfileSendWait, etc.)
     when `x` happens for the step
  c) fill `e->timestamp[ncclProxyProfileEnd]` when the step is finished
  d) set `args->subs[sub].profilingEvents[step%NCCL_STEPS] = NULL` to clear the
     slot corresponding to `e`, so that subsequent steps can safely be
     written to that slot

The problem happens in a): the profiler tries to set `e->timestamp[ncclProxyProfileBegin]`
for *ALL* steps at once, ignoring the fact that we can only track 8
steps at a time. As a result, we essentially allocate slots only for the first 8 steps.
When b) first happens for e.g. step #8 (see the gdb stack trace above),
the corresponding slot is still set to `NULL`, since we cleared the slot
after step #0 finished but never re-allocated it.

This commit fixes the problem by first storing, in the proxy args, the timestamp at which
profiling began. When b) first happens, we allocate a slot for the step and copy the stored
timestamp to `e->timestamp[ncclProxyProfileBegin]`. This way, the generated
timeline is exactly the same as before, but we never try to track
more than `NCCL_STEPS` steps at once.
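The lazy slot allocation described above can be sketched as follows. This is a simplified illustration, not NCCL's actual code: `BEGIN_IDX`, `END_IDX`, the backing pool, and the helper names are all hypothetical:

```c
#include <stddef.h>

#define NCCL_STEPS 8
#define BEGIN_IDX  0   /* stands in for ncclProxyProfileBegin */
#define END_IDX    5   /* stands in for ncclProxyProfileEnd; index hypothetical */

/* Simplified sketch of the fixed scheme: a ring of NCCL_STEPS event
 * slots per sub. A step's slot is allocated lazily, the first time a
 * timestamp is recorded for it, and the stored profiling-begin time
 * is copied in at that point, instead of allocating slots for all
 * steps up front. */
typedef struct { double timestamp[8]; } event_t;

typedef struct {
    event_t *profilingEvents[NCCL_STEPS];
    event_t  pool[64];        /* backing storage for this sketch only */
    double   profilingBegin;  /* recorded once, when the sub starts */
} sub_t;

static event_t *slot_for_step(sub_t *sub, int step) {
    event_t **slot = &sub->profilingEvents[step % NCCL_STEPS];
    if (*slot == NULL) {                       /* lazy allocation */
        *slot = &sub->pool[step];
        (*slot)->timestamp[BEGIN_IDX] = sub->profilingBegin;
    }
    return *slot;
}

static void finish_step(sub_t *sub, int step, double now) {
    event_t *e = slot_for_step(sub, step);
    e->timestamp[END_IDX] = now;
    /* Clear the slot so step + NCCL_STEPS can reuse it safely. */
    sub->profilingEvents[step % NCCL_STEPS] = NULL;
}
```

With this shape, step 8 reuses slot 0 only after step 0 has been finished and cleared, so no slot is ever read through a stale `NULL` pointer.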
dfyz added a commit to dfyz/nccl that referenced this issue Nov 24, 2023
This issue was closed.