Operations on different communicators should therefore be used at different epochs with a locking mechanism, and applications should ensure operations are submitted in the same order across ranks.
Does this mean that we should wait for any NCCL operation to complete (by calling cudaStreamSynchronize) before starting a new one on a different communicator from a different thread?
Or can we start the new operation on the other thread immediately after the previous operation has been enqueued (without a cudaStreamSynchronize call under the mutex)?
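For concreteness, here is one conservative reading of the quoted guidance, sketched in C++: a mutex held across both the enqueue and the stream synchronization, so operations on different communicators always run in distinct epochs. The function and variable names here are illustrative, not part of the NCCL API, and this is only a sketch of one interpretation, not a confirmed answer.

```cpp
// Hypothetical sketch (not from the NCCL docs): each thread drives its own
// communicator/stream but calls allreduce_epoch(), which holds a shared mutex
// until the operation has completed, so communicators are used in distinct epochs.
#include <mutex>
#include <cuda_runtime.h>
#include <nccl.h>

std::mutex nccl_mutex;  // one lock shared by all threads issuing NCCL calls

// Enqueue one allreduce on `comm`/`stream` and wait for it to finish before
// releasing the lock, so work from another communicator can never overlap
// with it on the device.
void allreduce_epoch(ncclComm_t comm, cudaStream_t stream,
                     const float* sendbuf, float* recvbuf, size_t count) {
  std::lock_guard<std::mutex> lock(nccl_mutex);
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);  // completion inside the lock = distinct epochs
}
```

The alternative in the second question would release the lock right after ncclAllReduce returns and skip the synchronization; the quoted sentence from the docs does not by itself say whether that is sufficient.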
I found the same question here: #195 (comment) (my case is launching the allreduce between GPU0 and GPU2, and between GPU1 and GPU3, concurrently).
Is it safe to launch concurrent allreduces on communicators spanning different GPUs? For example, let's say we launch an allreduce on all 4 GPUs and wait for it to complete. Then we launch an allreduce between GPU0 and GPU2 and another between GPU1 and GPU3, concurrently. Would this be safe since the GPUs used are distinct?
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#using-multiple-nccl-communicators-concurrently
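For reference, below is a minimal single-process sketch of the scenario quoted above, assuming one process drives all four GPUs and creates one communicator per GPU with ncclCommInitAll. The buffer size, in-place allreduces, and missing error checking are simplifications for illustration; it sets up the situation being asked about rather than asserting that the two pairwise allreduces may safely overlap.

```cpp
// Hypothetical sketch: a 4-GPU allreduce, synchronized, followed by two
// allreduces on the disjoint GPU pairs {0,2} and {1,3} enqueued back to back.
#include <cuda_runtime.h>
#include <nccl.h>

// Enqueue an in-place allreduce for every GPU in one communicator set,
// using a single group call issued from this thread.
static void allreduce_group(ncclComm_t* comms, const int* devs, int n,
                            float** buf, cudaStream_t* streams, size_t count) {
  ncclGroupStart();
  for (int i = 0; i < n; ++i)
    ncclAllReduce(buf[devs[i]], buf[devs[i]], count, ncclFloat, ncclSum,
                  comms[i], streams[devs[i]]);
  ncclGroupEnd();
}

int main() {
  const size_t count = 1 << 20;
  int all_devs[4] = {0, 1, 2, 3}, pairA[2] = {0, 2}, pairB[2] = {1, 3};
  float* buf[4];
  cudaStream_t streams[4];
  for (int d = 0; d < 4; ++d) {
    cudaSetDevice(d);
    cudaMalloc((void**)&buf[d], count * sizeof(float));
    cudaStreamCreate(&streams[d]);
  }

  ncclComm_t world[4], commsA[2], commsB[2];
  ncclCommInitAll(world, 4, all_devs);   // spans all 4 GPUs
  ncclCommInitAll(commsA, 2, pairA);     // spans GPUs 0 and 2
  ncclCommInitAll(commsB, 2, pairB);     // spans GPUs 1 and 3

  // Step 1: allreduce across all 4 GPUs, then wait for it to complete.
  allreduce_group(world, all_devs, 4, buf, streams, count);
  for (int d = 0; d < 4; ++d) { cudaSetDevice(d); cudaStreamSynchronize(streams[d]); }

  // Step 2: enqueue the two pairwise allreduces back to back. The GPU sets
  // are disjoint, which is exactly the case the question asks about.
  allreduce_group(commsA, pairA, 2, buf, streams, count);
  allreduce_group(commsB, pairB, 2, buf, streams, count);
  for (int d = 0; d < 4; ++d) { cudaSetDevice(d); cudaStreamSynchronize(streams[d]); }

  // Cleanup.
  for (int i = 0; i < 4; ++i) ncclCommDestroy(world[i]);
  for (int i = 0; i < 2; ++i) { ncclCommDestroy(commsA[i]); ncclCommDestroy(commsB[i]); }
  for (int d = 0; d < 4; ++d) { cudaSetDevice(d); cudaFree(buf[d]); cudaStreamDestroy(streams[d]); }
  return 0;
}
```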