# Distributed training


Several algorithms exist for distributed training of neural networks. There are two main classes:
 1. Data-parallel
 1. Model-parallel
 
**Data-parallel** approaches to distributed training keep a copy of the entire model on each worker, and each worker processes a different subset (mini-batch) of the training data set. **Model-parallel** approaches split the model across multiple workers, for instance by putting different layers on different GPUs. The latter is harder to implement.

<img src="images/ModelDataParallelism.svg">

All data-parallel training approaches require some method of combining results and synchronizing the model parameters across workers.

Within each of these classes, one can distinguish between the following synchronization schemes:
 1. Asynchronous
 1. Synchronous
 1. n-soft synchronous
 

## Synchronous training

In its strictest form, often referred to as the **hard-synchronous** approach, synchronous training proceeds as follows:

 1. Initialize the network parameters randomly based on the model configuration 
 1. Distribute a copy of the current parameters to each worker
 1. Perform forward and backward passes on each worker using a mini-batch of data
 1. Reduce the parameters to the master (parameter server) and average them
 1. Broadcast the **global** parameters back to the workers, then return to step 2 for as long as there is more data
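
The steps above can be sketched in a single process with NumPy. This is only an illustration under stated assumptions: the "workers" are simulated in a loop, the model is a noiseless linear regression, and the names `local_sgd_step` and `synchronous_round` are hypothetical helpers rather than part of any library.

```python
import numpy as np


def local_sgd_step(weights, x, y, lr):
    """One forward and backward pass of linear least squares on a mini-batch."""
    pred = x @ weights                        # forward pass
    grad = 2.0 * x.T @ (pred - y) / len(y)    # gradient of the mean squared error
    return weights - lr * grad


def synchronous_round(global_weights, batches, lr):
    """One hard-synchronous round: every worker reports before averaging."""
    # Steps 2-3: each simulated worker starts from a copy of the global
    # parameters and trains on its own mini-batch.
    local = [local_sgd_step(global_weights.copy(), x, y, lr) for x, y in batches]
    # Step 4: reduce to the master and average.
    return np.mean(local, axis=0)


rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])     # parameters the toy model should recover
weights = rng.normal(size=2)       # Step 1: random initialization
num_workers = 4

for _ in range(200):
    batches = []
    for _ in range(num_workers):
        x = rng.normal(size=(16, 2))
        batches.append((x, x @ true_w))
    # Step 5: the averaged result becomes the new global parameters.
    weights = synchronous_round(weights, batches, lr=0.05)
```

In a real deployment the list comprehension would be replaced by a reduce over the network, which is where the hard barrier appears: the average cannot be computed until the slowest worker has reported.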

<img src="images/ParameterAveraging.svg">

A variation of this approach reduces the weight updates rather than the weights themselves.
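
To see why the two variants can agree, consider a short derivation. It assumes (this is not stated above) that all N workers start the round from the same parameters $W_{k-1}$, use the same learning rate $\lambda_k$, and take a single plain gradient step; $\Delta W^{(i)}$ denotes the update computed by worker $i$.

$$\frac{1}{N}\sum_{i=1}^{N} W_{k}^{(i)} = \frac{1}{N}\sum_{i=1}^{N}\left(W_{k-1} + \lambda_{k}\cdot \Delta W^{(i)}\right) = W_{k-1} + \lambda_{k}\cdot \frac{1}{N}\sum_{i=1}^{N}\Delta W^{(i)}$$

Under these assumptions, averaging the weights gives the same result as applying the averaged update. The equivalence generally no longer holds once workers take several local steps between synchronizations or use optimizers with per-worker state.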

### Soft synchronization

With the approach above, the master has to collect all N weight updates, so a single slow or crashed node stalls the whole round.
A more fault-tolerant approach collects only a fraction of the weight updates (normally more than 90-95% of them) before proceeding to the averaging step.

Implementation-wise, this is normally done using a FIFO queue rather than a hard barrier; more about this later.
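
As a rough illustration of the queue-based idea, the sketch below uses Python's standard `queue.Queue` as the FIFO. The function name `aggregate_soft_sync` and the parameter `min_updates` are hypothetical, and the workers are simulated by putting updates into the queue directly.

```python
import queue

import numpy as np


def aggregate_soft_sync(update_queue, min_updates):
    """Average the first `min_updates` weight updates to arrive in the queue."""
    # Queue.get() blocks until an item is available, so the master waits
    # only for the fastest `min_updates` workers rather than for all of them.
    collected = [update_queue.get() for _ in range(min_updates)]
    return np.mean(collected, axis=0)


num_workers = 10
updates = queue.Queue()

# Simulate 9 of the 10 workers reporting; the last one is slow or crashed.
for worker_id in range(num_workers - 1):
    updates.put(np.full(3, float(worker_id)))

# Proceed once 90% of the workers have reported.
averaged = aggregate_soft_sync(updates, min_updates=9)
```

A real implementation would also have to decide what to do with updates that arrive late for a round that has already been averaged, for example by tagging each update with its step number and discarding outdated ones.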

## Asynchronous training

In the asynchronous algorithm, weight updates are applied as soon as they are available. Variations of this algorithm still perform synchronization, but only every n mini-batches.

The advantages of this approach include the speed of model convergence and low network communication overhead; the disadvantage is reduced accuracy. The challenge of asynchronous training lies in handling stale gradients, that is, gradients computed from parameters that other workers have since updated. On average, the gradient staleness is equal to N, where N is the number of workers. Intuitively, one wants to minimize the impact of stale gradients on the weight update:
$$W_{k} = W_{k-1} + \lambda_{k}\cdot \Delta W$$ 

Typically this is achieved by incorporating the gradient staleness into the learning rate schedule:
$$W_{k} = W_{k-1} + \frac{1}{N}\lambda_{k} \Delta W$$
where N (the number of workers) is the average gradient staleness.
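
The scaled update can be written as a small NumPy function. This is a minimal sketch: `staleness_scaled_update` is a hypothetical name, and its `staleness` argument plays the role of N in the formula above. Passing a different value per update, as the example does, is an assumption that goes beyond the text.

```python
import numpy as np


def staleness_scaled_update(weights, delta_w, base_lr, staleness):
    """Apply W_k = W_{k-1} + (base_lr / staleness) * delta_w."""
    # Guard against division by zero; a staleness below 1 is treated as fresh.
    scale = base_lr / max(staleness, 1)
    return weights + scale * delta_w


weights = np.zeros(2)
delta_w = np.array([1.0, -2.0])

# The same update moves the weights less when it is more stale.
fresh = staleness_scaled_update(weights, delta_w, base_lr=0.1, staleness=1)
stale = staleness_scaled_update(weights, delta_w, base_lr=0.1, staleness=4)
```

Here the update with staleness 4 changes the weights by a quarter of what the fresh update does, which is the damping effect the modified learning rate is meant to provide.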

<img src="images/StaleGradients.svg">

## Sync replicas optimizer

```python
tf.train.SyncReplicasOptimizer
```

`tf.train.SyncReplicasOptimizer` is a class that synchronizes and aggregates the weight updates from the replicas and passes them to the optimizer it wraps.