# Introduction

This notebook demonstrates distributed training of a deep neural network. It follows [this tutorial](https://pytorch.org/tutorials/intermediate/dist_tuto.html).

# Initial Setup

We will use the functions implemented in `partition.py` and `train.py`.
When `train.py` is run from the terminal, it reads its configuration parameters from `config.yaml`. In our initial setup, we will use the following parameters:

In [1]:
!cat src/config.yaml

size: 2
partition_sizes: [0.5, 0.5]
custom_partition: False
params:
  lr: 0.01
  momentum: 0.5
  use_batch_norm: False
  async_op: False

Running `train.py` performs distributed training of a neural network model across 2 parallel processes, using the MNIST dataset. Here we call the internal functions of `train.py` directly, so that it is easier to see exactly which parameters we modify at each stage.
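
To make the role of `partition_sizes` concrete, here is a minimal sketch of how a list of fractions could split sample indices between processes. The helper `split_indices` and the fixed seed are hypothetical and only loosely modeled on the partitioning helper in the linked tutorial; the actual logic in `partition.py` may differ.

```python
import random

def split_indices(num_samples, fractions, seed=1234):
    """Shuffle indices 0..num_samples-1 and cut them into consecutive chunks."""
    indices = list(range(num_samples))
    # Using the same seed on every process yields the same split everywhere.
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for frac in fractions:
        end = start + int(frac * num_samples)
        parts.append(indices[start:end])
        start = end
    return parts

parts = split_indices(1000, [0.5, 0.5])
print([len(p) for p in parts])  # [500, 500]
```

Each process would then train only on the chunk that corresponds to its rank.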

In [2]:
import sys
sys.path.append('src/')

In [3]:
import yaml

from torch.multiprocessing import Process
from train import init_process, run

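try:
    with open('src/config.yaml') as file:
        config_dict = yaml.load(file, Loader=yaml.SafeLoader)
except (OSError, yaml.YAMLError) as err:
    # Stop here: the cells below cannot run without a valid configuration.
    raise RuntimeError('Could not load configuration file. Aborting.') from err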

size = config_dict['size']
partition_sizes = config_dict['partition_sizes']
custom_partition = config_dict['custom_partition']
params = config_dict['params']

def distributed_training():
    processes = []
    for rank in range(size):
        p = Process(target=init_process, 
                    args=(run, rank, size, partition_sizes,
                          custom_partition, params))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

distributed_training()

Rank 0, epoch 0:  Train Loss 1.310  Train Acc. 0.558  --|-- Val. Loss 0.645 Val. Acc. 0.799
Rank 1, epoch 0:  Train Loss 1.306  Train Acc. 0.561  --|-- Val. Loss 0.642 Val. Acc. 0.803
Rank 0, epoch 1:  Train Loss 0.548  Train Acc. 0.838  --|-- Val. Loss 0.422 Val. Acc. 0.869
Rank 1, epoch 1:  Train Loss 0.539  Train Acc. 0.835  --|-- Val. Loss 0.441 Val. Acc. 0.866
Rank 0, epoch 2:  Train Loss 0.428  Train Acc. 0.873  --|-- Val. Loss 0.363 Val. Acc. 0.893
Rank 1, epoch 2:  Train Loss 0.418  Train Acc. 0.873  --|-- Val. Loss 0.375 Val. Acc. 0.891
Rank 0, epoch 3:  Train Loss 0.369  Train Acc. 0.892  --|-- Val. Loss 0.306 Val. Acc. 0.912
Rank 1, epoch 3:  Train Loss 0.359  Train Acc. 0.895  --|-- Val. Loss 0.322 Val. Acc. 0.906
Rank 1, epoch 4:  Train Loss 0.312  Train Acc. 0.909  --|-- Val. Loss 0.288 Val. Acc. 0.913
Rank 0, epoch 4:  Train Loss 0.320  Train Acc. 0.906  --|-- Val. Loss 0.265 Val. Acc. 0.921
Rank 0, epoch 5:  Train Loss 0.290  Train Acc. 0.915  --|-- Val. Loss 0.258 Val.

We observe a decrease in the loss and an increase in accuracy on both training and validation sets.


# Unbalanced Partition

Let's see what happens if we modify the partition ratio to 70% : 30%.

In [4]:
partition_sizes = [0.7, 0.3]
distributed_training()

Rank 1, epoch 0:  Train Loss 1.689  Train Acc. 0.418  --|-- Val. Loss 0.891 Val. Acc. 0.725
Rank 1, epoch 1:  Train Loss 0.711  Train Acc. 0.779  --|-- Val. Loss 0.602 Val. Acc. 0.821
Rank 0, epoch 0:  Train Loss 1.114  Train Acc. 0.631  --|-- Val. Loss 0.526 Val. Acc. 0.837
Rank 1, epoch 2:  Train Loss 0.544  Train Acc. 0.835  --|-- Val. Loss 0.469 Val. Acc. 0.856
Rank 1, epoch 3:  Train Loss 0.457  Train Acc. 0.862  --|-- Val. Loss 0.418 Val. Acc. 0.879
Rank 0, epoch 1:  Train Loss 0.473  Train Acc. 0.857  --|-- Val. Loss 0.371 Val. Acc. 0.888
Rank 1, epoch 4:  Train Loss 0.404  Train Acc. 0.881  --|-- Val. Loss 0.384 Val. Acc. 0.887
Rank 1, epoch 5:  Train Loss 0.363  Train Acc. 0.894  --|-- Val. Loss 0.348 Val. Acc. 0.897
Rank 0, epoch 2:  Train Loss 0.360  Train Acc. 0.896  --|-- Val. Loss 0.314 Val. Acc. 0.907
Rank 1, epoch 6:  Train Loss 0.339  Train Acc. 0.898  --|-- Val. Loss 0.311 Val. Acc. 0.909
Rank 1, epoch 7:  Train Loss 0.312  Train Acc. 0.908  --|-- Val. Loss 0.285 Val.

We can see that the run has failed with an error caused by a lack of synchronization between the two processes. This can be solved by setting the `async_op` parameter of the `dist.all_reduce` function to `True`. Let's observe what happens if we do that:

In [5]:
params['async_op'] = True
distributed_training()

Rank 1, epoch 0:  Train Loss 1.997  Train Acc. 0.291  --|-- Val. Loss 1.266 Val. Acc. 0.582
Rank 1, epoch 1:  Train Loss 0.959  Train Acc. 0.698  --|-- Val. Loss 0.774 Val. Acc. 0.768
Rank 0, epoch 0:  Train Loss 1.394  Train Acc. 0.522  --|-- Val. Loss 0.645 Val. Acc. 0.801
Rank 1, epoch 2:  Train Loss 0.690  Train Acc. 0.787  --|-- Val. Loss 0.586 Val. Acc. 0.814
Rank 1, epoch 3:  Train Loss 0.577  Train Acc. 0.824  --|-- Val. Loss 0.531 Val. Acc. 0.842
Rank 0, epoch 1:  Train Loss 0.574  Train Acc. 0.826  --|-- Val. Loss 0.458 Val. Acc. 0.862
Rank 1, epoch 4:  Train Loss 0.528  Train Acc. 0.839  --|-- Val. Loss 0.502 Val. Acc. 0.847
Rank 1, epoch 5:  Train Loss 0.457  Train Acc. 0.863  --|-- Val. Loss 0.449 Val. Acc. 0.871
Rank 0, epoch 2:  Train Loss 0.437  Train Acc. 0.871  --|-- Val. Loss 0.383 Val. Acc. 0.885
Rank 1, epoch 6:  Train Loss 0.430  Train Acc. 0.871  --|-- Val. Loss 0.457 Val. Acc. 0.859
Rank 1, epoch 7:  Train Loss 0.399  Train Acc. 0.878  --|-- Val. Loss 0.369 Val.

This time, the training process has completed smoothly.
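
One plausible reading of the earlier failure with `async_op: False`, consistent with the log in which rank 1 completes about seven epochs while rank 0 completes three: a blocking all-reduce requires every process to issue the same number of collective calls. With a 70% : 30% split and the same batch size on each process, the number of batches per epoch differs, so one process runs out of work while the other is still waiting for a partner. The sketch below only computes that mismatch; the sample count of 60,000 and the per-process batch size of 64 are assumptions for illustration, not values taken from `train.py`.

```python
import math

def batches_per_epoch(num_samples, fraction, batch_size):
    """Number of mini-batches one process runs per epoch on its share of the data."""
    return math.ceil(round(num_samples * fraction) / batch_size)

# Assumed values, for illustration only.
NUM_SAMPLES, BATCH_SIZE = 60_000, 64

for rank, fraction in enumerate([0.7, 0.3]):
    n = batches_per_epoch(NUM_SAMPLES, fraction, BATCH_SIZE)
    print(f"rank {rank}: {n} batches per epoch")
```

Under these assumptions the two processes issue a different number of all-reduce calls per epoch, which is the kind of mismatch a blocking collective cannot tolerate.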

# Adding Batch Normalization

Next, let's see how the addition of Batch Normalization affects performance.

In [6]:
params['use_batch_norm'] = True
distributed_training()

Rank 1, epoch 0:  Train Loss 1.500  Train Acc. 0.511  --|-- Val. Loss 0.913 Val. Acc. 0.723
Rank 1, epoch 1:  Train Loss 0.753  Train Acc. 0.776  --|-- Val. Loss 0.623 Val. Acc. 0.817
Rank 0, epoch 0:  Train Loss 1.004  Train Acc. 0.682  --|-- Val. Loss 0.504 Val. Acc. 0.845
Rank 1, epoch 2:  Train Loss 0.557  Train Acc. 0.833  --|-- Val. Loss 0.482 Val. Acc. 0.861
Rank 1, epoch 3:  Train Loss 0.445  Train Acc. 0.869  --|-- Val. Loss 0.412 Val. Acc. 0.882
Rank 0, epoch 1:  Train Loss 0.444  Train Acc. 0.870  --|-- Val. Loss 0.357 Val. Acc. 0.898
Rank 1, epoch 4:  Train Loss 0.386  Train Acc. 0.884  --|-- Val. Loss 0.373 Val. Acc. 0.889
Rank 1, epoch 5:  Train Loss 0.333  Train Acc. 0.904  --|-- Val. Loss 0.318 Val. Acc. 0.913
Rank 1, epoch 6:  Train Loss 0.313  Train Acc. 0.908  --|-- Val. Loss 0.287 Val. Acc. 0.920
Rank 0, epoch 2:  Train Loss 0.350  Train Acc. 0.899  --|-- Val. Loss 0.280 Val. Acc. 0.916
Rank 1, epoch 7:  Train Loss 0.283  Train Acc. 0.918  --|-- Val. Loss 0.280 Val.

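We can see that a slight improvement has been achieved (for example, rank 1's validation loss at epoch 6 drops from 0.457 to 0.287, and its validation accuracy rises from 0.859 to 0.920), which can be attributed to the regularizing effect of batch normalization.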
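
For reference, the standard batch normalization transform applied to an activation $x_i$ over a mini-batch $B$ of size $m$ is shown below, where $\gamma$ and $\beta$ are learned parameters and $\epsilon$ is a small constant added for numerical stability. With a plain batch normalization layer, each process would compute these statistics from its own local mini-batch; whether `train.py` synchronizes them across processes is not shown in this notebook.

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
y_i = \gamma\,\frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta
$$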

# Training with Disjoint Subsets of Samples

Let's see what happens if we split the samples so that the first process sees only labels 0-4 and the second process sees only labels 5-9.
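
As an illustration of such a split, the sketch below groups sample indices by label range. The helper `partition_by_label` and the toy label list are hypothetical; the partitioning that `custom_partition = True` triggers in `partition.py` may be implemented differently.

```python
def partition_by_label(labels, label_groups):
    """Return one list of sample indices per group of labels."""
    parts = [[] for _ in label_groups]
    for idx, label in enumerate(labels):
        for part, group in zip(parts, label_groups):
            if label in group:
                part.append(idx)
    return parts

# Toy labels standing in for the MNIST targets.
labels = [3, 7, 0, 9, 4, 5, 1, 8]
low, high = partition_by_label(labels, [range(0, 5), range(5, 10)])
print(low)   # [0, 2, 4, 6]
print(high)  # [1, 3, 5, 7]
```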

In [7]:
partition_sizes = [0.5, 0.5]
custom_partition = True
params['async_op'] = True
distributed_training()

Rank 1, epoch 0:  Train Loss 0.745  Train Acc. 0.743  --|-- Val. Loss 3.822 Val. Acc. 0.428
Rank 0, epoch 0:  Train Loss 0.581  Train Acc. 0.824  --|-- Val. Loss 4.068 Val. Acc. 0.485
Rank 1, epoch 1:  Train Loss 0.265  Train Acc. 0.920  --|-- Val. Loss 4.191 Val. Acc. 0.450
Rank 0, epoch 1:  Train Loss 0.192  Train Acc. 0.948  --|-- Val. Loss 4.606 Val. Acc. 0.491
Rank 1, epoch 2:  Train Loss 0.188  Train Acc. 0.945  --|-- Val. Loss 4.624 Val. Acc. 0.457
Rank 0, epoch 2:  Train Loss 0.143  Train Acc. 0.960  --|-- Val. Loss 4.629 Val. Acc. 0.498
Rank 1, epoch 3:  Train Loss 0.164  Train Acc. 0.952  --|-- Val. Loss 4.731 Val. Acc. 0.459
Rank 0, epoch 3:  Train Loss 0.117  Train Acc. 0.969  --|-- Val. Loss 4.892 Val. Acc. 0.498
Rank 1, epoch 4:  Train Loss 0.133  Train Acc. 0.961  --|-- Val. Loss 4.882 Val. Acc. 0.462
Rank 0, epoch 4:  Train Loss 0.101  Train Acc. 0.974  --|-- Val. Loss 4.995 Val. Acc. 0.503
Rank 1, epoch 5:  Train Loss 0.143  Train Acc. 0.958  --|-- Val. Loss 5.056 Val.

This setup clearly leads to overfitting, since each process is exposed to only half of the labels: training accuracy rises above 0.95, while validation accuracy stays between roughly 0.43 and 0.50 and validation loss keeps growing. Simple averaging of the weights across processes cannot compensate for this, due to the highly non-linear nature of the optimization objective.
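
The averaging step mentioned above can be pictured as an element-wise mean over the values held by each process (gradients, in the linked tutorial), which is what an all-reduce sum followed by a division by the number of processes computes. The snippet below is a pure-Python stand-in that does not use `torch.distributed`; the name `all_reduce_mean` and the toy values are hypothetical.

```python
def all_reduce_mean(per_process_values):
    """Element-wise mean across processes; every process would receive this result."""
    world_size = len(per_process_values)
    return [sum(vals) / world_size for vals in zip(*per_process_values)]

# Toy gradients for a 3-parameter model held by two processes.
grads_rank0 = [0.5, -1.0, 2.0]
grads_rank1 = [1.5, 1.0, 0.0]
print(all_reduce_mean([grads_rank0, grads_rank1]))  # [1.0, 0.0, 1.0]
```

The mean is a linear operation, so it cannot by itself reconcile two processes whose updates pull toward very different solutions.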