# 3. Distributed training with NCCL
In Section 2, we implemented distributed training with the MPI communication API. That approach is inconvenient: the data must first be converted to NumPy arrays, and all communication goes through the CPU, so it does not take advantage of direct communication between multiple GPUs. We therefore switch to the NVIDIA Collective Communications Library (NCCL), which is developed by NVIDIA.

To enable direct communication between GPUs with NCCL, we first need to create a communicator. Concretely, one process calls `ncclGetUniqueId()`, which returns an ID that all processes and threads use to synchronize and to recognize that they belong to the same communicator. Each participant then calls `ncclCommInitRank()` with that ID to create its communicator object. The key issue is that the ID must be broadcast to all participating threads and processes using some CPU-side communication system. In a conventional MPI-with-CUDA program, the broadcast can be done by calling the MPI API directly from the same native code. In our project, however, the CUDA code is called from Python, and MPI is also accessed from Python. As a result, we cannot broadcast from inside the CUDA code, but we can use the Python MPI API.

Our solution is as follows:

1. The Python program calls the CUDA backend, which obtains the ID and returns it to Python.
2. The Python program calls the Python MPI API to broadcast the ID.
3. Every process and thread now holds the same ID and calls the CUDA backend to establish the connection.

The relevant code is as follows:

Python code:
```
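from mpi4py import MPI
import needle as ndl  # assumption: `ndl` is the project's framework package


def init():
    # Query MPI for the world size and this process's rank.
    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    # Give each rank its own GPU (the rank is used as the device index).
    device = ndl.cuda(rank)
    print(f'Use cuda: {rank}')

    # Only rank 0 creates the NCCL unique ID; the other ranks receive it.
    vec = device.get_id() if rank == 0 else None
    vec = comm.bcast(vec, root=0)  # broadcast the ID over MPI

    # Every rank joins the communicator using the shared ID.
    device.init_nccl(vec, rank, size)
    return rank, size, device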
```
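
To make the broadcast step more concrete, the sketch below imitates it with only the standard library. It rests on two assumptions: that `device.get_id()` hands the ID to Python as a list of 128 byte values (pybind11's default conversion for `std::vector<uint8_t>`), and that the lowercase `comm.bcast` serializes generic Python objects with pickle, as mpi4py does. `fake_get_id` and `simulate_bcast` are hypothetical stand-ins; no MPI or GPU is involved.

```python
import os
import pickle

NCCL_UNIQUE_ID_BYTES = 128  # size of ncclUniqueId as defined in nccl.h


def fake_get_id():
    """Hypothetical stand-in for device.get_id(): a list of 128 byte values."""
    return list(os.urandom(NCCL_UNIQUE_ID_BYTES))


def simulate_bcast(obj, n_ranks):
    """Imitate a pickle-based broadcast: the root serializes the object once,
    and every rank deserializes its own copy."""
    payload = pickle.dumps(obj)
    return [pickle.loads(payload) for _ in range(n_ranks)]


root_id = fake_get_id()
received = simulate_bcast(root_id, n_ranks=4)

# Every rank ends up with an identical copy of the root's ID.
all_equal = all(copy == root_id for copy in received)
print(all_equal, len(received[0]))
```

Because the ID is only 128 bytes and is sent once at start-up, the pickle overhead of the lowercase `bcast` should be negligible here.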

CUDA code:
```
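#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>  // std::vector <-> Python list conversion

// Per-process NCCL state.
struct CudaCommAndStream {
    int nRanks, localRank, myRank;
    ncclUniqueId id;
    ncclComm_t comm;
    cudaStream_t s;
} mess;

// Select the GPU used by this process.
void SetDevice(int id) {
    mess.localRank = id;
    cudaSetDevice(id);
}

// Create the unique ID and return it as raw bytes.
std::vector<uint8_t> GetId() {
    ncclGetUniqueId(&mess.id);
    auto* p = reinterpret_cast<uint8_t*>(&mess.id);
    return std::vector<uint8_t>(p, p + NCCL_UNIQUE_ID_BYTES);
}

// Rebuild the ID from the broadcast bytes and join the communicator.
void InitNccl(const std::vector<uint8_t>& vec, int rank, int size) {
    // Guard the copy below against a malformed ID.
    if (vec.size() != NCCL_UNIQUE_ID_BYTES)
        throw std::invalid_argument("unexpected NCCL unique ID size");
    mess.nRanks = size;
    mess.myRank = rank;
    std::memcpy(&mess.id, vec.data(), NCCL_UNIQUE_ID_BYTES);
    ncclCommInitRank(&mess.comm, mess.nRanks, mess.id, mess.myRank);
    cudaStreamCreate(&mess.s);
}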
PYBIND11_MODULE(ndarray_backend_cuda, m) {
    ...
    m.def("set_device", SetDevice);
    m.def("get_id", GetId);
    m.def("init_nccl", InitNccl);
}

```
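
The byte-level round trip performed by `GetId` and `InitNccl` can also be mirrored in Python, which may help when debugging the ID outside the GPU backend. The sketch below is only an illustration: `UniqueId`, `id_to_list`, and `list_to_id` are hypothetical names, and the struct layout assumes `ncclUniqueId` is a single 128-byte character array, as declared in `nccl.h`.

```python
import ctypes

NCCL_UNIQUE_ID_BYTES = 128  # size of ncclUniqueId as defined in nccl.h


class UniqueId(ctypes.Structure):
    """Mirror of the assumed ncclUniqueId layout: one fixed-size char array."""
    _fields_ = [("internal", ctypes.c_char * NCCL_UNIQUE_ID_BYTES)]


def id_to_list(uid):
    """Like GetId: view the struct as raw bytes and copy them into a list."""
    return list(bytes(uid))


def list_to_id(values):
    """Like InitNccl's memcpy: rebuild the struct from the byte values,
    refusing input of the wrong length so the copy cannot overrun."""
    if len(values) != NCCL_UNIQUE_ID_BYTES:
        raise ValueError(
            f"expected {NCCL_UNIQUE_ID_BYTES} bytes, got {len(values)}")
    return UniqueId.from_buffer_copy(bytes(values))


original = UniqueId.from_buffer_copy(bytes(range(NCCL_UNIQUE_ID_BYTES)))
restored = list_to_id(id_to_list(original))
print(bytes(restored) == bytes(original))
```

A length check like the one in `list_to_id` is one way to ensure that a truncated or oversized list is rejected before it reaches a raw memory copy.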