# PyTorch 1.0 Distributed Trainer with Amazon AWS

**Nathan Inkawhich, Pieter Noordhuis & Teng Li**

In this tutorial we will show how to set up, code, and run a PyTorch 1.0 distributed trainer across two multi-GPU Amazon AWS nodes. We will start by describing the AWS setup, then the PyTorch environment configuration, and finally the code for the distributed trainer. Hopefully you will find that very little code change is required to extend your current training code to a distributed application, and that most of the work is in the one-time environment setup.

## Amazon AWS Setup

### Creating the Nodes

- launch both nodes from the Deep Learning AMI
- this tutorial uses p2.8xlarge instances
- create a new security group for the nodes

### Configure Security Group

- configure the security group to allow all traffic between nodes in the group
- test that the nodes can talk to each other

```
Run on machine A: nc -l 12345
Run on machine B: echo "hello world" | nc <IP of A> 12345
```
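
The same reachability check can be scripted, which is convenient when you want to verify a port from inside the training environment. The following is a minimal sketch rather than part of the original setup: the helper name `can_connect` is hypothetical, and it assumes a listener (for example `nc -l 12345`) is already running on the target node.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

On machine B you would call something like `can_connect("<IP of A>", 12345)`. A result of `False` usually points at the security group rules or at the listener not running.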

## Environment Setup

- new conda env with python 3.6 and numpy
    - `conda create -n nightly_pt python=3.6 numpy`
    - `source activate nightly_pt`
- install pytorch
    - `pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu90/torch_nightly.html`
- install torchvision from source
    - clone it to local machine: `git clone https://github.com/pytorch/vision.git`
    - build it with: `python setup.py install`
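- Very important: set the environment variable `NCCL_SOCKET_IFNAME` to the name of the network interface the nodes communicate over (here, `NCCL_SOCKET_IFNAME=ens3`)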
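
One way to apply the `NCCL_SOCKET_IFNAME` setting is from Python, before the distributed process group is initialized. The sketch below is illustrative only: the helper names `list_interfaces` and `set_nccl_interface` are hypothetical, and `ens3` is assumed to be the interface name on these nodes, so check the list of interfaces on your own machines first.

```python
import os
import socket


def list_interfaces() -> list:
    """Return the names of the network interfaces visible on this machine."""
    return [name for _, name in socket.if_nameindex()]


def set_nccl_interface(ifname: str = "ens3") -> str:
    """Point NCCL at `ifname` by setting NCCL_SOCKET_IFNAME for this process."""
    # Assumption: this runs before the NCCL backend is initialized,
    # since the variable is read when communication is set up.
    os.environ["NCCL_SOCKET_IFNAME"] = ifname
    return os.environ["NCCL_SOCKET_IFNAME"]


print(list_interfaces())    # pick the interface the nodes share
set_nccl_interface("ens3")  # "ens3" is an assumption; adjust to your nodes
```

Exporting the variable in the shell before launching the script has the same effect and avoids touching the training code.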

## Distributed Training Code

- Most of the code here has been taken from the [PyTorch ImageNet Example](https://github.com/pytorch/examples/tree/master/imagenet) which also supports distributed training. This code provides a good starting point for a custom trainer as it has much of the boilerplate training loop, validation loop, and accuracy tracking functionality. However, you will notice that the argument parsing and other non-essential functions have been stripped out for simplicity.

- In this example we will use the [torchvision.models.resnet18](https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.resnet18) model and train it on the [torchvision.datasets.STL10](https://pytorch.org/docs/stable/torchvision/datasets.html#torchvision.datasets.STL10) dataset for simplicity. To accommodate the dimensionality mismatch between STL-10 and ResNet18, we will resize each image to 224x224 with a transform. Note that the choice of model and dataset is orthogonal to the distributed training code: you may use any dataset and model you wish and the process is the same. Let's get started!

### Imports

First, let's get the imports out of the way. The important imports here are [torch.nn.parallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel), [torch.distributed](https://pytorch.org/docs/stable/distributed.html), [torch.utils.data.distributed](https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler), and [torch.multiprocessing](https://pytorch.org/docs/stable/multiprocessing.html). It is also important to set the multiprocessing start method to *spawn*, as the default is *fork*, which may cause deadlocks when using multiple workers for data loading.

```python
import time
import sys
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.distributed as dist
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.multiprocessing import Pool, Process, set_start_method
try:
    set_start_method('spawn')
except RuntimeError:
    pass
```