
Segmentation fault with cuda-7.5 #8

Closed
Jerrynet opened this issue Jan 15, 2016 · 6 comments

@Jerrynet
I built NCCL with cuda-7.5:

make CUDA_HOME=/usr/local/cuda-7.5 test

and ran the test with the following commands:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
./build/test/all_reduce_test

which causes a segmentation fault:

# Using devices
Segmentation fault

But all tests run smoothly if I build nccl with cuda-7.0.
Is the current version of nccl not compatible with cuda-7.5?

@cliffwoolley
Collaborator

I've seen a similar issue when libcuda.so is not in the LD_LIBRARY_PATH (on
one of my systems, only libcuda.so.1 was present, and the usual libcuda.so
symlink was absent). Can you please check that? If that turns out to be
your issue, you can either run `ln -s libcuda.so.1 libcuda.so` in the
relevant directory, or modify NCCL to dlopen libcuda.so.1 instead of
libcuda.so.

Thanks,
Cliff


@cliffwoolley
Collaborator

Oh, and yes, NCCL is normally compatible with CUDA 7.5. It actually is a
bit more complete on CUDA 7.5 than on 7.0, since 7.0 lacked some of the
necessary support for the fp16 'half' datatype.


@cliffwoolley
Collaborator

Side note: assuming this is the same issue with libcuda.so that I'm referring to, we should fix the tests to fail more gracefully when the communicator cannot be created. The segfault happens when we pass a NULL communicator to some subsequent routine.

@Jerrynet
Author

@cliffwoolley Thanks for the explanation.
I created a symlink to libcuda.so.1 and now it works!
So it was the same issue with libcuda.so.

@cliffwoolley
Collaborator

Great! Glad to hear it. We'll leave this issue open to deal with the libcuda.so[.1] loading (perhaps we could try both variants before giving up) as well as to detect communicator creation failure in the test apps without segfaulting. I believe @nluehr already has fixes pending for one or both of these issues.

@nluehr
Contributor

nluehr commented Feb 12, 2016

These issues are resolved in change sets caa40b8 and 2758353.

@nluehr nluehr closed this as completed Feb 12, 2016
dfyz added a commit to dfyz/nccl that referenced this issue Sep 15, 2023
NCCL can be built with `-DPROFILE_PROXY -DENABLE_TIMER` and run
with `NCCL_PROXY_PROFILE=trace.json` environment variable to dump
the timeline of all proxy operations to `trace.json` using the
so-called Trace Event Format.

However, currently NCCL segfaults even when trying to profile very
simple operations. Just running any binary from `nccl-tests` is
enough to trigger the issue, for example:

$ cat run.sh
NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1 \
NCCL_PROXY_PROFILE="proxy_${OMPI_COMM_WORLD_RANK}.json" \
./build/all_reduce_perf -g 2
$ mpirun -n 2 ./run.sh
...
./run.sh: line 4: 372891 Segmentation fault [...]
...
$ gdb build/all_reduce_perf core
...
Program terminated with signal SIGSEGV, Segmentation fault.
57        event->timestamp[state%8] = gettime()-profilingStart;

Note that the only point of `NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1`
is to trigger network communication (and hence proxying) between the
two GPUs even when they are on the same host. If the ranks are on
different hosts, you don't need these variables to trigger the issue.

The root cause of the problem is that the proxy profiler tries to track
only `NCCL_STEPS` (i.e., 8) steps per sub at a time, since there can
never be more than `NCCL_STEPS` in flight at any given moment.
So, for any tracked step with profiling event `e`, the profiler will:
  a) fill `e->timestamp[ncclProxyProfileBegin]` at the very beginning
     of the sub, allocating a slot in `args->subs[sub].profilingEvents[step%NCCL_STEPS]`
  b) fill `e->timestamp[x]` for `x < ncclProxyProfileEnd` (e.g.,
     x = ncclProxyProfileSendGPUWait, x = ncclProxyProfileSendWait, etc.)
     when `x` happens for the step
  c) fill `e->timestamp[ncclProxyProfileEnd]` when the step is finished
  d) set `args->subs[sub].profilingEvents[step%NCCL_STEPS] = NULL` to clear the
     slot corresponding to `e`, so that subsequent steps can safely be
     written to that slot

The problem happens in a): the profiler tries to set `e->timestamp[ncclProxyProfileBegin]`
for *ALL* steps at once, ignoring the fact that we can only track 8
steps at a time. As a result, we essentially allocate slots only for the first 8 steps.
When b) first happens for e.g. step #8 (see the gdb stack trace above),
the corresponding slot is still set to `NULL`, since we cleared the slot
after step #0 finished but never re-allocated it.

This commit fixes the problem by first storing, in the proxy args, the timestamp at which
profiling began. When b) first happens, we allocate a slot for the step and copy the stored
timestamp to `e->timestamp[ncclProxyProfileBegin]`. This way, the generated
timeline is exactly the same as before, but we never try to track
more than `NCCL_STEPS` steps at once.
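The lazy slot allocation described above can be sketched as follows. This is a simplified illustration, not NCCL's actual code: `BEGIN_IDX`, `END_IDX`, the backing pool, and the helper names are all hypothetical:

```c
#include <stddef.h>

#define NCCL_STEPS 8
#define BEGIN_IDX  0   /* stands in for ncclProxyProfileBegin */
#define END_IDX    5   /* stands in for ncclProxyProfileEnd; index hypothetical */

/* Simplified sketch of the fixed scheme: a ring of NCCL_STEPS event
 * slots per sub. A step's slot is allocated lazily, the first time a
 * timestamp is recorded for it, and the stored profiling-begin time
 * is copied in at that point, instead of allocating slots for all
 * steps up front. */
typedef struct { double timestamp[8]; } event_t;

typedef struct {
    event_t *profilingEvents[NCCL_STEPS];
    event_t  pool[64];        /* backing storage for this sketch only */
    double   profilingBegin;  /* recorded once, when the sub starts */
} sub_t;

static event_t *slot_for_step(sub_t *sub, int step) {
    event_t **slot = &sub->profilingEvents[step % NCCL_STEPS];
    if (*slot == NULL) {                       /* lazy allocation */
        *slot = &sub->pool[step];
        (*slot)->timestamp[BEGIN_IDX] = sub->profilingBegin;
    }
    return *slot;
}

static void finish_step(sub_t *sub, int step, double now) {
    event_t *e = slot_for_step(sub, step);
    e->timestamp[END_IDX] = now;
    /* Clear the slot so step + NCCL_STEPS can reuse it safely. */
    sub->profilingEvents[step % NCCL_STEPS] = NULL;
}
```

With this shape, step 8 reuses slot 0 only after step 0 has been finished and cleared, so no slot is ever read through a stale `NULL` pointer.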
dfyz added a commit to dfyz/nccl that referenced this issue Nov 24, 2023
This issue was closed.