docs/software/communication/nccl.md (13 additions, 2 deletions)
To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used.
With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it.

While the container engine does this automatically, regardless of application, the following environment variables should always be set when using NCCL:

```bash
export NCCL_NET="AWS Libfabric"
```

This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network.
Conversely, if the plugin cannot be found, applications will fail to start instead of silently falling back to a slower transport such as TCP, which would perform significantly worse than the plugin.
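To check that the plugin is actually picked up, NCCL's debug output can be inspected at startup. The following is a sketch; the application name is a placeholder and the exact wording of the log line may vary between NCCL versions:

```bash
export NCCL_NET="AWS Libfabric"
export NCCL_DEBUG=INFO   # make NCCL print initialization details to stderr

# Launch as usual, e.g. (hypothetical application name):
#   srun ./my_nccl_app
# A log line mentioning the "AWS Libfabric" network indicates the
# libfabric plugin was loaded; a hard failure at startup indicates
# it could not be found.
```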

For optimal performance, the following environment variables should also be set (these are set automatically by the container engine):

```bash
export NCCL_NET_GDR_LEVEL=PHB
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_RX_MATCH_MODE=software
```
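Taken together, the settings above can be collected in a single batch script. This is a sketch only: the job geometry and application name are placeholders, not values from this page:

```bash
#!/bin/bash
#SBATCH --nodes=2              # placeholder job geometry
#SBATCH --ntasks-per-node=4

# Force NCCL onto the libfabric plugin for the Slingshot network
export NCCL_NET="AWS Libfabric"

# Performance-related settings (set automatically by the container engine)
export NCCL_NET_GDR_LEVEL=PHB
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=32768
export FI_CXI_RX_MATCH_MODE=software

# Launch the application (hypothetical name):
#   srun ./my_nccl_app
```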

!!! warning "GPU-aware MPI with NCCL"
Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi).
Unless care is taken to ensure that the two methods of communication are not used concurrently, we recommend not using GPU-aware MPI with NCCL.