Operations on different communicators should therefore be used at different epochs with a locking mechanism, and applications should ensure operations are submitted in the same order across ranks.
Does this mean that we should wait for any NCCL operation to complete (by calling cudaStreamSynchronize) before starting a new one on a different communicator from a different thread?
Or can we start the new operation on the other thread immediately after the previous operation has been enqueued (without a cudaStreamSynchronize call under the mutex)?
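For concreteness, here is one conservative reading of the quoted guidance, sketched in C++: a mutex held across both the enqueue and the stream synchronization, so operations on different communicators always run in distinct epochs. The function and variable names here are illustrative, not part of the NCCL API, and this is only a sketch of one interpretation, not a confirmed answer.

```cpp
// Hypothetical sketch (not from the NCCL docs): each thread drives its own
// communicator/stream but calls allreduce_epoch(), which holds a shared mutex
// until the operation has completed, so communicators are used in distinct epochs.
#include <mutex>
#include <cuda_runtime.h>
#include <nccl.h>

std::mutex nccl_mutex;  // one lock shared by all threads issuing NCCL calls

// Enqueue one allreduce on `comm`/`stream` and wait for it to finish before
// releasing the lock, so work from another communicator can never overlap
// with it on the device.
void allreduce_epoch(ncclComm_t comm, cudaStream_t stream,
                     const float* sendbuf, float* recvbuf, size_t count) {
  std::lock_guard<std::mutex> lock(nccl_mutex);
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);  // completion inside the lock = distinct epochs
}
```

The alternative in the second question would release the lock right after ncclAllReduce returns and skip the synchronization; the quoted sentence from the docs does not by itself say whether that is sufficient.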
I found the same question here: #195 (comment) (my case is launching the allreduce between GPU0 and GPU2, and between GPU1 and GPU3, concurrently).
Is it safe to launch concurrent allreduces on communicators spanning different GPUs? For example, let's say we launch an allreduce on all 4 GPUs and wait for it to complete. Then we launch an allreduce between GPU0 and GPU2 and another between GPU1 and GPU3, concurrently. Would this be safe since the GPUs used are distinct?
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#using-multiple-nccl-communicators-concurrently
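For reference, below is a minimal single-process sketch of the scenario quoted above, assuming one process drives all four GPUs and creates one communicator per GPU with ncclCommInitAll. The buffer size, in-place allreduces, and missing error checking are simplifications for illustration; it sets up the situation being asked about rather than asserting that the two pairwise allreduces may safely overlap.

```cpp
// Hypothetical sketch: a 4-GPU allreduce, synchronized, followed by two
// allreduces on the disjoint GPU pairs {0,2} and {1,3} enqueued back to back.
#include <cuda_runtime.h>
#include <nccl.h>

// Enqueue an in-place allreduce for every GPU in one communicator set,
// using a single group call issued from this thread.
static void allreduce_group(ncclComm_t* comms, const int* devs, int n,
                            float** buf, cudaStream_t* streams, size_t count) {
  ncclGroupStart();
  for (int i = 0; i < n; ++i)
    ncclAllReduce(buf[devs[i]], buf[devs[i]], count, ncclFloat, ncclSum,
                  comms[i], streams[devs[i]]);
  ncclGroupEnd();
}

int main() {
  const size_t count = 1 << 20;
  int all_devs[4] = {0, 1, 2, 3}, pairA[2] = {0, 2}, pairB[2] = {1, 3};
  float* buf[4];
  cudaStream_t streams[4];
  for (int d = 0; d < 4; ++d) {
    cudaSetDevice(d);
    cudaMalloc((void**)&buf[d], count * sizeof(float));
    cudaStreamCreate(&streams[d]);
  }

  ncclComm_t world[4], commsA[2], commsB[2];
  ncclCommInitAll(world, 4, all_devs);   // spans all 4 GPUs
  ncclCommInitAll(commsA, 2, pairA);     // spans GPUs 0 and 2
  ncclCommInitAll(commsB, 2, pairB);     // spans GPUs 1 and 3

  // Step 1: allreduce across all 4 GPUs, then wait for it to complete.
  allreduce_group(world, all_devs, 4, buf, streams, count);
  for (int d = 0; d < 4; ++d) { cudaSetDevice(d); cudaStreamSynchronize(streams[d]); }

  // Step 2: enqueue the two pairwise allreduces back to back. The GPU sets
  // are disjoint, which is exactly the case the question asks about.
  allreduce_group(commsA, pairA, 2, buf, streams, count);
  allreduce_group(commsB, pairB, 2, buf, streams, count);
  for (int d = 0; d < 4; ++d) { cudaSetDevice(d); cudaStreamSynchronize(streams[d]); }

  // Cleanup.
  for (int i = 0; i < 4; ++i) ncclCommDestroy(world[i]);
  for (int i = 0; i < 2; ++i) { ncclCommDestroy(commsA[i]); ncclCommDestroy(commsB[i]); }
  for (int d = 0; d < 4; ++d) { cudaSetDevice(d); cudaFree(buf[d]); cudaStreamDestroy(streams[d]); }
  return 0;
}
```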