Skip to content

Latest commit

 

History

History
114 lines (72 loc) · 2.43 KB

README.md

File metadata and controls

114 lines (72 loc) · 2.43 KB

pynccl

Nvidia NCCL2 Python bindings using ctypes and numba.

Many codes and ideas of this project come from the project pyculib. It is originally as part of the distributed deep learning project called necklace.

Install

  • NCCL

Please follow the Nvidia doc here to install NCCL.

  • pynccl

from source,

python setup.py install

or just,

pip install pynccl

Usage

Environments

  • for numba
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

(the following may be no need)

export NUMBAPRO_CUDALIB=/usr/local/cuda/lib64
export NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so
export NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/
  • for NCCL
export NUMBA_NCCLLIB=/usr/lib/x86_64-linux-gnu/

export NCCL_DEBUG=INFO

export NCCL_SOCKET_IFNAME=<your-ifname-like-ens11>

Examples

  • pynccl.NcclWrp

This piece of code is an example of NcclWrp with multiprocessing for dispatching the ncclUniqueId to all processes. See the complete code here

    nc = pynccl.NcclWrp(kn, rank, gpu_i)

    if rank == 0:
        nuid = nc.get_nuid()

        for j in range(kn - 1):
            q.put((nuid, w))
    else:
        nuid, w = q.get()

    nc.set_nuid(nuid)
    nc.init_comm()
  • pynccl.Nccl

You also can use the original functions of pynccl.Nccl, see the code here

    # NOTE: do this at first of all
    cuda.select_device(gpu_i)

    nk = pynccl.Nccl()

    comm_i = nk.get_comm()
    r = nk.comm_init_rank(byref(comm_i), world_size, nuid, rank)

    stream_i = nk.get_stream()

    r = nk.group_start()

    ......

    r = nk.all_gather(p_arr_send, p_arr_recv,
                      sz,
                      pynccl.binding.ncclFloat,
                      comm_i, stream_i.handle)

    r = nk.group_end()

    stream_i.synchronize()
  • multi comms

You can create multiple NCCL communicators with different world_size and ranks list, which is something like the process group and important for distributed deep learning framework, see the code here