# 1. Introduction

In our final project, we implement distributed training on top of the Needle framework. In distributed training, the workload of training a model is split up and shared among multiple devices such as GPUs, called nodes. These nodes work in parallel to speed up training. The two main types of distributed training are data parallelism and model parallelism. In short, data parallelism divides the training data into partitions, whereas model parallelism segments the model into parts that can run concurrently on different nodes [1]. This project implements the data parallelism approach, which the rest of this section describes in more detail.

In data parallelism, the training data is divided into partitions, where the number of partitions equals the number of available nodes, and each partition is assigned to one node.
The model is copied to each node, and each node operates on its own partition. Each node computes the gradients of the model parameters independently. The gradients from all nodes are then aggregated to obtain the average gradients. Finally, each node updates its model parameters using the average gradients.
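The steps above can be sketched in a single process with NumPy. This is only an illustration under simple assumptions (a linear model with a mean squared error loss, equal-sized partitions, plain gradient descent); the names `mse_gradient`, `replicas`, and `single_node_w` are made up for this example and are not part of Needle or of our `ddp` module.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / X.shape[0]

rng = np.random.default_rng(0)
n, d, k = 8, 3, 4            # batch size, number of features, number of nodes
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lr = 0.1

# Every node starts from an identical copy of the parameters.
replicas = [np.zeros(d) for _ in range(k)]

# 1. Split the batch into k equal partitions, one per node.
X_parts = np.split(X, k)
y_parts = np.split(y, k)

# 2. Each node computes gradients on its own partition only.
local_grads = [mse_gradient(w, Xp, yp)
               for w, Xp, yp in zip(replicas, X_parts, y_parts)]

# 3. Aggregate: average the gradients across nodes.
avg_grad = np.mean(local_grads, axis=0)

# 4. Every node applies the same update, so the replicas stay in sync.
replicas = [w - lr * avg_grad for w in replicas]

# Reference: one node processing the whole batch.
single_node_w = np.zeros(d) - lr * mse_gradient(np.zeros(d), X, y)
```

With equal partitions, every replica ends up with the same parameters as `single_node_w`, which is the behaviour data parallelism relies on.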

Here we also give a brief explanation of the mathematics behind data parallelism. Let $w$ be the model parameters, $n$ the batch size, $l_i$ the loss for data point $i$, $L=\frac{1}{n}\sum_{i=1}^{n}l_i$ the loss of the whole batch, and $k$ the number of nodes. Node $j$ is assigned $m_j$ data points, with
$$
m_1+m_2+\dots+m_{k}=n
$$
Let $s_j=m_1+\dots+m_j$ (with $s_0=0$), so that node $j$ holds the data points $\{s_{j-1}+1,s_{j-1}+2,\dots,s_j\}$, and let $L_j=\frac{1}{m_j}\sum_{i=s_{j-1}+1}^{s_j}l_i$ be the mean loss on node $j$. Then we have
$$
\begin{aligned}
\frac{\partial L}{\partial w}
&=\frac{\partial}{\partial w}\left[\frac{1}{n}\sum_{i=1}^{n}l_i\right] \\
&=\frac{1}{n}\sum_{i=1}^{n}\frac{\partial l_i}{\partial w} \\
&=\frac{m_1}{n}\frac{\partial}{\partial w}\left[\frac{1}{m_1}\sum_{i=1}^{s_1}l_i\right]
 +\frac{m_2}{n}\frac{\partial}{\partial w}\left[\frac{1}{m_2}\sum_{i=s_1+1}^{s_2}l_i\right]
 +\dots
 +\frac{m_k}{n}\frac{\partial}{\partial w}\left[\frac{1}{m_k}\sum_{i=s_{k-1}+1}^{s_k}l_i\right] \\
&=\frac{m_1}{n}\frac{\partial L_1}{\partial w}+\frac{m_2}{n}\frac{\partial L_2}{\partial w}
 +\dots+\frac{m_k}{n}\frac{\partial L_k}{\partial w}
\end{aligned}
$$
If $m_1=m_2=\dots=m_k=\frac{n}{k}$, we have
$$
\frac{\partial L}{\partial w}=\frac{1}{k}\left[\frac{\partial L_1}{\partial w}+\frac{\partial L_2}{\partial w}+\dots+\frac{\partial L_k}{\partial w}\right]
$$
where $\frac{\partial L_j}{\partial w}$ is the gradient computed by node $j$ from its own data points.
According to this equation, the average of the gradients computed by the nodes equals the gradient of the original batch [2].
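The identity can also be checked numerically. The sketch below is an illustration only: it assumes a linear model with a squared error loss, and the helper `mean_loss_gradient` and the shard sizes are chosen for this example. It compares the full-batch gradient with the weighted sum over unequal shards (weights $m_j/n$) and with the plain average over equal shards.

```python
import numpy as np

def mean_loss_gradient(w, X, y):
    """Gradient of mean(0.5 * (X @ w - y)**2) with respect to w."""
    return X.T @ (X @ w - y) / X.shape[0]

rng = np.random.default_rng(0)
n, d = 12, 3
X, y, w = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=d)

full_grad = mean_loss_gradient(w, X, y)

# Unequal shards: m = (5, 4, 3), so each node gradient is weighted by m_j / n.
sizes = [5, 4, 3]
bounds = np.cumsum(sizes)[:-1]          # split points 5 and 9
shards = zip(np.split(X, bounds), np.split(y, bounds))
weighted = sum(m / n * mean_loss_gradient(w, Xs, ys)
               for m, (Xs, ys) in zip(sizes, shards))

# Equal shards: m_j = n / k, so a plain average is enough.
k = 4
equal = np.mean([mean_loss_gradient(w, Xs, ys)
                 for Xs, ys in zip(np.split(X, k), np.split(y, k))], axis=0)

print(np.allclose(weighted, full_grad), np.allclose(equal, full_grad))
```

Note that the plain average is only exact when all shards have the same size; with unequal shards the weights $m_j/n$ are needed.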

The source code of the project can be found here: [TODO]

# 2. Usage

In this project, we tried to provide a workflow similar to the one Horovod offers [3]. Training runs in several processes at the same time, and the processes communicate with each other through the Message Passing Interface (MPI).

Let's see how it works.

The following lines of code should be placed in a Python file.
As an example, we apply distributed training to a ResNet9 model.

In [None]:
import sys
import numpy as np
sys.path.append('./python')
import needle as ndl
from apps.simple_training import train_cifar10, evaluate_cifar10
from apps.models import ResNet9

After importing what we need from the basic Needle framework, we can import the ddp (distributed data parallel) module from apps.

In [None]:
import apps.ddp as ddp

Next, we initialize everything we need.

In [None]:
# Initialize the ddp functionality; returns the rank of this process
# and the CUDA device it should use
rank, device = ddp.init()

dataset = ndl.data.CIFAR10Dataset("data/cifar-10-batches-py", train=True)

# Partition the dataset; returns a dataloader and the batch size
# for the current process
train_dataloader, bsz = ddp.partition_dataset(
    dataset=dataset, batch_size=128, device=device, dtype='float32')

model = ResNet9(device=device, dtype="float32")
# Before training, we must broadcast the parameters to all processes
ddp.broadcast_parameters(model)

model.train()
opt = ndl.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)
# After defining the optimizer, we need to wrap it so that
# it works in the distributed setting
opt = ddp.DistributedOptimizer(opt)

loss_fn = ndl.nn.SoftmaxLoss()
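To give an intuition for the partitioning step, here is a minimal sketch of how a dataset could be split by rank. It is not the code of `ddp.partition_dataset`: the helper `partition_indices`, the shared shuffle seed, and the choice to divide the global batch size by the number of processes are all assumptions made for this illustration.

```python
import random

def partition_indices(num_samples, world_size, rank, seed=0):
    """Return the sample indices owned by `rank` (hypothetical helper).

    All processes shuffle with the same seed, so they agree on the
    permutation and their shards are disjoint.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    shard_len = num_samples // world_size   # drop the remainder for equal shards
    start = rank * shard_len
    return indices[start:start + shard_len]

global_batch_size = 128
world_size = 4
# Each process uses a smaller batch so the effective batch size stays 128.
local_batch_size = global_batch_size // world_size

# The CIFAR-10 training set has 50000 images.
shards = [partition_indices(50000, world_size, r) for r in range(world_size)]
print(local_batch_size, [len(s) for s in shards])
```

Equal-sized shards matter here because averaging the gradients without weights is only exact when every process sees the same number of samples per step.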

After the initialization, the training step is very simple. The training loop is similar to what we normally write with the Needle framework.

In [None]:
n_epochs = 1
for i in range(n_epochs):
    if rank == 0:
        print(f'epoch: {i+1}/{n_epochs}')
    for batch in train_dataloader:
        opt.reset_grad()
        X, y = batch
        out = model(X)
        correct = np.sum(np.argmax(out.numpy(), axis=1) == y.numpy())
        loss = loss_fn(out, y)
        loss.backward()
        opt.step()
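The training loop can stay unchanged because the gradient synchronization is hidden inside `opt.step()`. The sketch below shows the general idea of such a wrapper, not the actual `ddp.DistributedOptimizer`: `Param`, `SGD`, `FakeComm`, and `AveragingOptimizer` are hypothetical stand-ins, and `FakeComm` only imitates an all-reduce by averaging the local gradient with fixed peer gradients (a real version would call into MPI).

```python
import numpy as np

class Param:
    """Minimal stand-in for a framework parameter (hypothetical)."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None

class SGD:
    """Plain SGD, standing in for any local optimizer."""
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr

    def step(self):
        for p in self.params:
            p.data = p.data - self.lr * p.grad

class FakeComm:
    """Imitates a communicator; the peers' gradients are fixed arrays."""
    def __init__(self, peer_grads):
        self.peer_grads = [np.asarray(g, dtype=float) for g in peer_grads]

    def allreduce_mean(self, local):
        # Mean over this worker and all peers.
        return (local + sum(self.peer_grads)) / (1 + len(self.peer_grads))

class AveragingOptimizer:
    """Averages gradients across workers, then delegates to the wrapped optimizer."""
    def __init__(self, opt, comm):
        self.opt, self.comm = opt, comm

    def step(self):
        for p in self.opt.params:
            p.grad = self.comm.allreduce_mean(p.grad)
        self.opt.step()

w = Param([1.0, 1.0])
w.grad = np.array([2.0, 4.0])      # gradient computed on this worker
opt = AveragingOptimizer(SGD([w], lr=0.1), FakeComm([[4.0, 0.0]]))
opt.step()                         # the averaged gradient is [3.0, 2.0]
```

Wrapping the optimizer, rather than changing the model or the loss, keeps the user-facing loop identical to single-device training.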

To run the script, we can use the following command, where NUM_GPU is the number of GPUs (and therefore processes) to use:
mpiexec -np NUM_GPU python train_script.py

Now we put the demo code into a Python file.

In [17]:
my_file = open("train_script.py","w+")
my_file.write('''import sys
import time
from random import Random
import numpy as np
from mpi4py import MPI
sys.path.append('./python')
sys.path.append('./apps')
from simple_training import train_cifar10, evaluate_cifar10
from models import ResNet9
import needle as ndl
import ddp

if __name__ == "__main__":
    np.random.seed(0)
    rank, device = ddp.init()

    dataset = ndl.data.CIFAR10Dataset("data/cifar-10-batches-py", train=True)

    train_set, bsz = ddp.partition_dataset(
        dataset, 128, device=device, dtype='float32')
    print(f'original dataset length: {len(dataset)}')

    model = ResNet9(device=device, dtype="float32")
    ddp.broadcast_parameters(model)

    model.train()
    correct, total_loss = 0, 0
    opt = ndl.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)
    opt = ddp.DistributedOptimizer(opt)
    loss_fn = ndl.nn.SoftmaxLoss()
    n_epochs = 1
    begin = time.time()
    for i in range(n_epochs):
        if rank == 0:
            print(f'epoch: {i+1}/{n_epochs}')
        count = 0
        for batch in train_set:
            opt.reset_grad()
            X, y = batch
            out = model(X)
            correct = np.sum(np.argmax(out.numpy(), axis=1) == y.numpy())
            loss = loss_fn(out, y)
            loss.backward()
            opt.step()
            acc = correct/(y.shape[0])
            if rank == 0 and count % 100 == 0:
                print(f'acc: {acc}; avg_loss: {loss.data.numpy()}')
            count += 1
    end = time.time()
    if rank == 0:
        print(f'Training Time: {end-begin}')
''')
my_file.close()

We use PyTorch to find how many GPUs are available.

In [20]:
import torch
num_of_gpus = torch.cuda.device_count()
print(num_of_gpus)

0


Now we use this number to run the script.

In [23]:
!mpiexec -np {num_of_gpus} python train_script.py

zsh:1: command not found: mpiexec
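The run above fails for two reasons that are visible in the output: no GPU was detected, so `-np` would be 0, and `mpiexec` is not installed on this machine. A small pre-flight check can report both problems before launching. The helper `build_launch_command` below is a hypothetical convenience for this notebook, not part of our `ddp` module; it only uses `shutil.which` from the standard library.

```python
import shutil

def build_launch_command(num_procs, script="train_script.py", launcher="mpiexec"):
    """Return the argument list for an MPI launch, or raise a clear error."""
    if num_procs < 1:
        raise ValueError(f"need at least one process, got {num_procs}")
    if shutil.which(launcher) is None:
        raise FileNotFoundError(
            f"{launcher} not found on PATH; install an MPI implementation first")
    return [launcher, "-np", str(num_procs), "python", script]
```

The returned list can be passed to `subprocess.run(..., check=True)`. On the machine used for this notebook, `build_launch_command(num_of_gpus)` would raise a `ValueError` because `num_of_gpus` is 0.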


# References

[1] Distributed training. https://learn.microsoft.com/en-us/azure/machine-learning/concept-distributed-training

[2] Data Parallelism VS Model Parallelism in Distributed Deep Learning Training. https://leimao.github.io/blog/Data-Parallelism-vs-Model-Paralelism/

[3] Horovod Framework. https://horovod.ai