Optimized primitives for collective multi-GPU communication.

Introduction

NCCL (pronounced "Nickel") is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, and reduce-scatter. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, and NVSwitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.
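
As an illustration of the single-process, multi-GPU case, the following is a minimal all-reduce sketch, not an official sample: it assumes NCCL and the CUDA runtime are installed, handles up to 8 GPUs, and omits error checking for brevity.

```c
/* Minimal single-process, multi-GPU all-reduce sketch.
 * Assumes NCCL and the CUDA runtime are installed;
 * error checking is omitted for brevity. */
#include <nccl.h>
#include <cuda_runtime.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  if (ndev > 8) ndev = 8;  /* arrays below are sized for 8 GPUs */

  ncclComm_t comms[8];
  cudaStream_t streams[8];
  float *sendbuf[8], *recvbuf[8];
  const size_t count = 1 << 20;  /* elements per GPU */

  /* One communicator per GPU, all managed by this process.
   * NULL device list means devices 0..ndev-1. */
  ncclCommInitAll(comms, ndev, NULL);

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* Sum `count` floats across all GPUs; results land in recvbuf.
   * Group calls are required when one thread drives several GPUs. */
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```

Compilation requires linking against the CUDA runtime and NCCL (for example with `-lcudart -lnccl`), and running it requires at least one GPU.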

For more information on NCCL usage, please refer to the NCCL documentation.

What's inside

At present, the library implements the following collective operations:

  • all-reduce
  • all-gather
  • reduce-scatter
  • reduce
  • broadcast

These operations are implemented using ring algorithms and have been optimized for throughput and latency. For best performance, small operations can be either batched into larger operations or aggregated through the API.
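
The API-level aggregation mentioned above uses NCCL's group calls: operations issued between ncclGroupStart() and ncclGroupEnd() are batched and launched together, amortizing launch overhead for small messages. A sketch follows; the function name and buffer arguments are illustrative, and the communicator and stream are assumed to be already initialized.

```c
/* Sketch: aggregate several small collectives via NCCL group calls.
 * `comm` and `stream` are assumed to be an initialized ncclComm_t
 * and cudaStream_t; buffer names and counts are illustrative. */
#include <nccl.h>
#include <cuda_runtime.h>

void allreduce_batched(ncclComm_t comm, cudaStream_t stream,
                       float *a, size_t na, float *b, size_t nb) {
  ncclGroupStart();
  /* In-place all-reduces: sendbuff == recvbuff is allowed. */
  ncclAllReduce(a, a, na, ncclFloat, ncclSum, comm, stream);
  ncclAllReduce(b, b, nb, ncclFloat, ncclSum, comm, stream);
  ncclGroupEnd();  /* the grouped operations are issued together */
}
```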

Requirements

NCCL requires at least CUDA 7.0 and Kepler or newer GPUs. For PCIe based platforms, best performance is achieved when all GPUs are located on a common PCIe root complex, but multi-socket configurations are also supported.

Build

Note: the official and tested builds of NCCL can be downloaded from the NVIDIA NCCL download page. You can skip the following build steps if you choose to use the official builds.

To build the library:

$ cd nccl
$ make -j

If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with:

$ make CUDA_HOME=<path to cuda install>

NCCL will be compiled and installed in build/ unless BUILDDIR is set.

By default, NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining NVCC_GENCODE (defined in makefiles/) to only include the architecture of the target platform:

$ make -j NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"

Install

To install NCCL on the system, create a package, then install it as root.

Debian/Ubuntu:

$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make
$ ls build/pkg/deb/

RedHat/CentOS:

$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make
$ ls build/pkg/rpm/

OS-agnostic tarball:

$ make
$ ls build/pkg/txz/

Tests

Tests for NCCL are maintained separately in the nccl-tests repository:

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>

Copyright

All source code and accompanying documentation is copyright (c) 2015-2019, NVIDIA CORPORATION. All rights reserved.
