<a id = 'toc'></a>
# PyTorch 101: An almost-comprehensive PyTorch tutorial for beginners

__N.B.__: This tutorial uses PyTorch version 0.3.1, which is currently installed on the Caltech GPU cluster. The newest release, 0.4.0, brings some fundamental changes, which I will point out where they are relevant. <br>

Author: *Thong Q. Nguyen* <thong@caltech.edu>

# Table of Contents
1. <a href='#intro'> Getting started</a><br>
    a. <a href='#whatis'> Why PyTorch?</a><br>
    b. <a href='#tensor'> Tensors</a><br>
    c. <a href='#variable'> Variables</a><br>
2. <a href='#build'> Building a neural network</a><br>
    a. <a href='#module'> Modules and Parameters</a><br>
    b. <a href='#loss'> Loss function</a><br>
    c. <a href='#optim'> Optimizers</a><br>
    d. <a href='#dummy'> A dummy model</a><br>
    e. <a href='#remark'> Remark on training vs. validation</a><br>
3. <a href='#loading'> Loading data</a><br>
    a. <a href='#dataset'> Dataset class</a><br>
    b. <a href='#dataloader'> DataLoader</a><br>
4. <a href='#parallel'>Parallel and distributed training</a><br>
    a. <a href='#datapara'>Data-parallel training</a><br>
    b. <a href='#distributed'>Distributed training</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;i. <a href='#backend'>Backend</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ii. <a href='#world'>World size and rank</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;iii. <a href='#init'>Distributed package initialization</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;iv. <a href='#distmodel'>Distributed model and data</a><br>
5. <a href='#conclusions'>Final words</a><br>

---


<a id='intro'></a>
## Getting started


<a id='whatis'></a>
### Why PyTorch?
PyTorch is a Python-based scientific computing package serving 2 purposes:

    1. replacement for NumPy to use the power of GPUs
    2. deep learning research platform that provides maximum flexibility and speed
    
PyTorch is widely used in the ML research community (as opposed to production, where Caffe2 and TensorFlow are more popular). It offers great flexibility for implementing new ideas, which leaves a lot of room for algorithmic creativity. In the recent DAWNBench competition, a large share of the submissions at the top of the CIFAR10 training leaderboard use PyTorch: https://dawn.cs.stanford.edu/benchmark/CIFAR10/train.html.

<a id='tensor'></a>
### Tensors
The power of PyTorch lies in its <a href='https://pytorch.org/docs/stable/tensors.html'>Tensor class </a>, which is similar to NumPy arrays. However, PyTorch tensors (which I will from now on refer to as `Tensors`) can be sent to a GPU to accelerate computing. Converting between `Tensors` and NumPy arrays is extremely easy. Note that the conversion is only possible on the CPU. Let's see how it works:

In [31]:
import torch
import os
os.environ['CUDA_VISIBLE_DEVICES']="1,2" # To select the GPU we want to use and mask out the occupied one
print(torch.__version__)
import sys
print(sys.version) # Make sure we are using Python3

0.3.1
3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]


In [2]:
# Creating a 3x3 Tensor of zeros
x = torch.zeros([3, 3])
print("x = {}".format(x))
# Create a random 3x1 Tensor 
y = torch.rand(3,1)
print("y = {}".format(y))
# Tensor has broadcast property just like numpy 
z = x + y 
print("z = x + y = {}".format(z))

x = 
 0  0  0
 0  0  0
 0  0  0
[torch.FloatTensor of size 3x3]

y = 
 0.8678
 0.1241
 0.5901
[torch.FloatTensor of size 3x1]

z = x + y = 
 0.8678  0.8678  0.8678
 0.1241  0.1241  0.1241
 0.5901  0.5901  0.5901
[torch.FloatTensor of size 3x3]



In [3]:
# Send Tensor to GPU 
if torch.cuda.is_available():
    print("Current device index: {}. Total devices: {}".format(torch.cuda.current_device(), 
                                                               torch.cuda.device_count()))
    torch.cuda.set_device(0)
    z = z.cuda()
    print("On GPU: z = {}".format(z))
    print("Notice the type has now become torch.cuda.FloatTensor.")
    # Now let's send it to a different GPU
    print("Switching to a different device...")
    torch.cuda.set_device(1)
    print("Current device index: {}".format(torch.cuda.current_device()))
    z = z.cuda()
    print("On a different GPU: z = {}".format(z))

# Send Tensor back to CPU
z = z.cpu()
print("On CPU: z = {}".format(z))

Current device index: 0. Total devices: 2
On GPU: z = 
 0.8678  0.8678  0.8678
 0.1241  0.1241  0.1241
 0.5901  0.5901  0.5901
[torch.cuda.FloatTensor of size 3x3 (GPU 0)]

Notice the type has now become torch.cuda.FloatTensor.
Switching to a different device...
Current device index: 1
On a different GPU: z = 
 0.8678  0.8678  0.8678
 0.1241  0.1241  0.1241
 0.5901  0.5901  0.5901
[torch.cuda.FloatTensor of size 3x3 (GPU 1)]

On CPU: z = 
 0.8678  0.8678  0.8678
 0.1241  0.1241  0.1241
 0.5901  0.5901  0.5901
[torch.FloatTensor of size 3x3]



In [4]:
# Convert Tensor to Numpy array and vice versa
import numpy as np
z = x + y
z = z.numpy()
print("Converted to numpy, z = {}\n{}".format(type(z), z))

z = torch.from_numpy(z)
print("Converted to Tensor, z = {}".format(z))

# Note that the conversion is not possible when the Tensor is on GPU
z = z.cuda()
try:
    z = z.numpy()
except RuntimeError as e:
    print("Error: {}".format(e))

Converted to numpy, z = <class 'numpy.ndarray'>
[[0.8677637  0.8677637  0.8677637 ]
 [0.12407911 0.12407911 0.12407911]
 [0.59006226 0.59006226 0.59006226]]
Converted to Tensor, z = 
 0.8678  0.8678  0.8678
 0.1241  0.1241  0.1241
 0.5901  0.5901  0.5901
[torch.FloatTensor of size 3x3]

Error: can't convert CUDA tensor to numpy (it doesn't support GPU arrays). Use .cpu() to move the tensor to host memory first.
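One detail of the CPU conversion is worth knowing: `torch.from_numpy` and `Tensor.numpy()` do not copy the data, so the array and the `Tensor` share the same memory. Here is a minimal sketch of that behavior (the variable names are only for illustration):

```python
import numpy as np
import torch

a = np.zeros(3, dtype=np.float32)
t = torch.from_numpy(a)    # no copy: t and a share one buffer

a[0] = 5.0                 # modify the NumPy array in place...
shared_value = float(t[0]) # ...and the Tensor sees the change

b = t.numpy()              # again no copy
t[1] = 7.0                 # modify the Tensor in place...
print(a, b)                # ...and both arrays see the change
```

If you need an independent copy, call `.clone()` on the `Tensor` or `.copy()` on the array first.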


<a id='variable'></a>
### Variables
<a href='#toc'> Back to top </a><br>

*Note*: The concept of `Variables` only exists before 0.4.0. In the 0.4.0 release, `Tensor` and `Variable` have been merged. However, it's still good to understand the design philosophy of `Variables`, which is fundamental to any neural network.

`Variable` is essentially a wrapper around `Tensor` with 1 extra component: the gradient. Recall that a deep neural network is optimized via a process called back-propagation. Given a function $f(\mathbf{x})$, where $\mathbf{x}$ is an input vector, we want to compute the gradient of $f$ at $\mathbf{x}$, i.e., $\nabla_{\mathbf{x}} f(\mathbf{x})$. This gradient is computed with the chain rule, working backward from the loss function. 

Fortunately, in PyTorch the gradient computation is handled automatically by the automatic differentiation package called *torch.autograd*. `Variable`, a class in this package, should be all you need for now.

In [5]:
from torch.autograd import Variable
x = torch.rand(3,2)
x = Variable(x)
print("x = {}".format(x))

x = Variable containing:
 0.2907  0.9151
 0.6608  0.4161
 0.0041  0.5863
[torch.FloatTensor of size 3x2]



A `Variable` has 2 components: the Tensor and the gradient. Let's look at each of them:

In [6]:
print("Tensor of x = {}".format(x.data))
print("Gradient of x = {}".format(x.grad))

Tensor of x = 
 0.2907  0.9151
 0.6608  0.4161
 0.0041  0.5863
[torch.FloatTensor of size 3x2]

Gradient of x = None


The gradient component is empty right now because no computation has been performed yet. Knowing these 2 basic concepts, we are ready to move forward to the most exciting part.
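Before moving on, here is a minimal sketch of how the gradient gets filled in once a computation is performed. The function $y = \sum_i x_i^2$ is just an example I picked, and `requires_grad=True` is what asks autograd to track the operations on `x`:

```python
import torch
from torch.autograd import Variable

# requires_grad=True asks autograd to record operations on x
x = Variable(torch.Tensor([1.0, 2.0, 3.0]), requires_grad=True)

y = (x * x).sum()  # y = x1^2 + x2^2 + x3^2
y.backward()       # backprop: fills x.grad with dy/dx = 2x

grad_values = x.grad.data.tolist()
print(grad_values)
```

Since $\partial y / \partial x_i = 2 x_i$, the printed gradient should be twice the input.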

<a id='build'></a>
## Building a neural network
<a href='#toc'> Back to top </a>

Just kidding, there are a few more concepts you need to know before building a network.

<a id='module'></a>
### Modules and Parameters
`Module` is the base class for all neural networks. A `Module` typically contains `Parameters`. `Parameters` are similar to `Variables` in the sense that they include values (weights) and their corresponding gradients. `Parameters` have some extra built-in functions to use in a network's model. `Variables` and `Tensors` are usually just used for input and output data.  
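As a quick illustration, the sketch below lists the `Parameters` held by a single built-in module. The layer sizes are arbitrary choices for this example:

```python
import torch.nn as nn

layer = nn.Linear(3, 5)  # 3 input features, 5 output features

# named_parameters() yields (name, Parameter) pairs registered on the module
shapes = {name: tuple(p.size()) for name, p in layer.named_parameters()}
print(shapes)

# Parameters are tracked by autograd by default
all_trainable = all(p.requires_grad for p in layer.parameters())
print(all_trainable)
```

The weight of a `Linear` layer is stored as `(out_features, in_features)`, which is why it shows up as 5x3 here.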

<a id='loss'></a>
### Loss function
The loss function is another module that takes `(prediction, target)` as inputs and computes how far the prediction is from the target. Calling `backward()` on the resulting loss then runs backpropagation to compute the gradients of the network's parameters. A list of built-in loss functions can be found at https://pytorch.org/docs/0.3.1/nn.html#loss-functions. 
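To make "how far the prediction is from the target" concrete, here is a small sketch comparing `nn.MSELoss` with the same quantity computed by hand. The numbers are made up, and the code assumes PyTorch 0.4.0 or later, where plain tensors can be passed directly (on 0.3.1 you would wrap them in `Variable` first):

```python
import torch
import torch.nn as nn

prediction = torch.Tensor([0.5, 2.0])
target = torch.Tensor([1.0, 1.0])

criterion = nn.MSELoss()
loss = criterion(prediction, target)

# The same quantity by hand: mean of the squared differences
manual = ((prediction - target) ** 2).mean()

loss_value = float(loss)
manual_value = float(manual)
print(loss_value, manual_value)
```

Here the squared differences are 0.25 and 1.0, so both values should be their mean, 0.625.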

<a id='optim'></a>
### Optimizers
The `Optimizer` takes the `Parameters` of the model as inputs and updates them once the gradients have been computed by backpropagating from the loss. Here is the list of available optimizers: https://pytorch.org/docs/0.3.1/optim.html#algorithms.
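For plain SGD without momentum, the update is simply `w = w - lr * grad`. The sketch below checks this on a toy parameter; the loss $\sum_i w_i^2$ and the learning rate are arbitrary choices for the example:

```python
import torch
import torch.optim as optim
from torch.autograd import Variable

w = Variable(torch.Tensor([1.0, -2.0]), requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

optimizer.zero_grad()
loss = (w * w).sum()   # d(loss)/dw = 2w
loss.backward()

# What plain SGD should produce, computed by hand before stepping
expected = (w.data - 0.1 * w.grad.data).tolist()

optimizer.step()       # in-place update of w
updated = w.data.tolist()
print(expected, updated)
```

Other optimizers (momentum, Adam, etc.) keep extra state and use a different update rule, but the calling pattern stays the same.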

<a id='dummy'></a>
### A dummy model
<a href='#toc'> Back to top </a>


Now that you have all the ingredients, let's make an ultra-simple neural network to see how all the pieces are connected together. 

In [7]:
import torch.nn as nn
import torch.nn.functional as F

# Define a model
class Net(nn.Module): ### Network always has to be a subclass of nn.Module
    def __init__(self):
        super().__init__()
        self.dense1 = nn.Linear(3,5)
        self.dense2 = nn.Linear(5,1)
    
    def forward(self, x): # All the computation steps of the input are defined in this function
        x = self.dense1(x)
        x = F.relu(x)
        x = self.dense2(x)
        x = F.relu(x) # Or you can compact all steps into x = F.relu(self.dense2(F.relu(self.dense1(x)))) 
        return x
    
net = Net()
print(net)

Net(
  (dense1): Linear(in_features=3, out_features=5, bias=True)
  (dense2): Linear(in_features=5, out_features=1, bias=True)
)


Let's look deeper at the model we just created:

In [8]:
print(list(net.parameters()))

[Parameter containing:
-0.0259  0.4909  0.3690
 0.5152  0.2457 -0.4198
-0.3849  0.1651  0.5462
-0.4988  0.2408  0.0541
-0.5344  0.5316  0.1034
[torch.FloatTensor of size 5x3]
, Parameter containing:
 0.2021
-0.3067
-0.5410
 0.2676
-0.2391
[torch.FloatTensor of size 5]
, Parameter containing:
 0.2847  0.1915 -0.2046  0.4103  0.0190
[torch.FloatTensor of size 1x5]
, Parameter containing:
 0.4292
[torch.FloatTensor of size 1]
]


In [9]:
x = net.parameters().__next__() # Get the parameter of the first layer, a 5x3 matrix.

In [10]:
print("First layer value: {}".format(x.data))
print("First layer gradient: {}".format(x.grad))

First layer value: 
-0.0259  0.4909  0.3690
 0.5152  0.2457 -0.4198
-0.3849  0.1651  0.5462
-0.4988  0.2408  0.0541
-0.5344  0.5316  0.1034
[torch.FloatTensor of size 5x3]

First layer gradient: None


We just created a model that contains 2 dense layers. The first layer's weight is a 5x3 matrix; it takes input of dimensionality `(N, C, 3)`, where `N` is the batch size and `C` is the number of channels. The output of the first layer has dimensionality `(N, C, 5)` and is passed through a ReLU activation before being fed into the second dense layer, whose weight is a 1x5 matrix. The output is again passed through a ReLU before being returned as the prediction for the given input.
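The shape bookkeeping described above can be checked with a short sketch. It rebuilds the two layers on their own (so it does not depend on the `Net` instance), uses arbitrary values `N=4` and `C=2`, and assumes PyTorch 0.4.0 or later so that plain tensors can be fed to the layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dense1 = nn.Linear(3, 5)
dense2 = nn.Linear(5, 1)

x = torch.rand(4, 2, 3)      # (N, C, 3) with N=4, C=2
h = F.relu(dense1(x))        # Linear acts on the last dimension only
out = F.relu(dense2(h))

hidden_shape = tuple(h.size())
output_shape = tuple(out.size())
print(hidden_shape, output_shape)
```

Only the last dimension changes from layer to layer; the leading dimensions are carried through untouched.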

Now we will define our loss function and optimizer:

In [11]:
criterion = nn.MSELoss() # We use MSE which is the mean squared error between the predicted output and target.

import torch.optim as optim
optimizer = optim.SGD(net.parameters(), lr=0.01)

Having all the ingredients, let's input a fake data point and see how things go:

In [12]:
input = Variable(torch.rand(1,3))
target = Variable(torch.ones(1))
print("Input = {}".format(input))
print("Target = {}".format(target))

Input = Variable containing:
 0.4169  0.1048  0.7408
[torch.FloatTensor of size 1x3]

Target = Variable containing:
 1
[torch.FloatTensor of size 1]



In [13]:
optimizer.zero_grad()   # zero the gradient buffers
prediction = net(input) # get the predicted output
loss = criterion(prediction, target) # compute the MSE of our prediction with the target 
loss.backward() # Backprop to compute the gradients

Let's look at the gradient

In [14]:
print("First layer value: {}".format(x.data))
print("First layer gradient: {}".format(x.grad))

First layer value: 
-0.0259  0.4909  0.3690
 0.5152  0.2457 -0.4198
-0.3849  0.1651  0.5462
-0.4988  0.2408  0.0541
-0.5344  0.5316  0.1034
[torch.FloatTensor of size 5x3]

First layer gradient: Variable containing:
-0.0884 -0.0222 -0.1571
 0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000
-0.1275 -0.0320 -0.2265
 0.0000  0.0000  0.0000
[torch.FloatTensor of size 5x3]



Voila! The layer's weights are unchanged, but now we have the gradient, which is also a 5x3 matrix, showing how the first layer should change to get closer to the target given the original input.

Let's update the weights with the newly computed gradients and look at the first layer again.

In [15]:
optimizer.step() # Update the weight of the parameters
print("First layer value: {}".format(x.data))
print("First layer gradient: {}".format(x.grad))

First layer value: 
-0.0250  0.4911  0.3706
 0.5152  0.2457 -0.4198
-0.3849  0.1651  0.5462
-0.4975  0.2411  0.0564
-0.5344  0.5316  0.1034
[torch.FloatTensor of size 5x3]

First layer gradient: Variable containing:
-0.0884 -0.0222 -0.1571
 0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000
-0.1275 -0.0320 -0.2265
 0.0000  0.0000  0.0000
[torch.FloatTensor of size 5x3]



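The first and fourth rows of the matrix have been updated with the learning rate of 0.01 we specified earlier; for example, the top-left entry went from $-0.0259$ to $-0.0259 - 0.01 \times (-0.0884) \approx -0.0250$. The other rows have zero gradient, so they are unchanged. The gradient is still there; this is why we have to call `optimizer.zero_grad()` at every iteration. Otherwise, the new gradients computed in the next step would be added to the old ones and result in a wrong weight update. 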
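The accumulation behavior is easy to see on a toy example. In this sketch (the function `3 * w` is an arbitrary choice), calling `backward()` twice without zeroing doubles the stored gradient:

```python
import torch
from torch.autograd import Variable

w = Variable(torch.ones(2), requires_grad=True)

(w * 3).sum().backward()
first = w.grad.data.clone().tolist()    # gradient after one backward pass

(w * 3).sum().backward()                # no zeroing in between
second = w.grad.data.clone().tolist()   # old + new gradient

w.grad.data.zero_()                     # what optimizer.zero_grad() does for each parameter
cleared = w.grad.data.tolist()
print(first, second, cleared)
```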

<a id='remark'></a>
### Remark on training vs validation
<a href='#toc'> Back to top </a>

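During training, all `Variables` and `Parameters` need to keep track of their gradients, which is quite computationally expensive. During the validation/testing phase, no gradient update is needed. In addition, some layers, such as dropout and batch normalization, behave differently in the two phases, so we need to call `net.train()` during training and `net.eval()` during testing. Note that `net.eval()` only switches the behavior of those layers; it does not turn off gradient tracking by itself. 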

When wrapping the inputs and targets into `Variables` during the validation/testing phase, you need to set the `volatile` parameter, such as `inputs, targets = Variable(inputs, volatile=True), Variable(targets, volatile=True)`, so that the memory won't explode during validation. Again, this is only true for pre-0.4.0 versions of PyTorch; from 0.4.0 on, the `volatile` flag is replaced by the `torch.no_grad()` context manager.
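The two mechanisms can be seen separately in the following sketch. It assumes PyTorch 0.4.0 or later (for `torch.no_grad()` and for feeding plain tensors to a module); the dropout layer and the tensor sizes are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# 1) train()/eval() change how some layers behave
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # each entry is either zeroed or scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)    # dropout is a no-op in eval mode

# 2) torch.no_grad() turns off gradient tracking
w = torch.ones(3, requires_grad=True)
tracked = (w * 2).sum()
with torch.no_grad():
    untracked = (w * 2).sum()

print(tracked.requires_grad, untracked.requires_grad)
```

In a validation loop you would typically use both: `net.eval()` once before the loop, and `with torch.no_grad():` around the forward passes.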

---
Hooray!!! We just went through all the basic concepts of PyTorch, including tensor, gradient, module, optimizer, and loss function, and observed how they work together. However, there is still one important component you need to learn before applying them to your own project. 

<a id='loading'></a>
## Loading Data
<a href='#toc'> Back to top </a>

Generally, your dataset will be too big to fit in your device's memory, so we have to write a data <a href="https://wiki.python.org/moin/Generators">generator</a> to feed small portions of the dataset to our network iteratively. Fortunately, there is a built-in class in PyTorch that does this for us: all we need to do is write our own Dataset class and pass it to the built-in generator class (`DataLoader`). 

<a id='dataset'></a>
### Dataset class
`torch.utils.data.Dataset` is an abstract class representing a dataset. Your custom dataset should inherit from `Dataset` and override the following methods:

- `__len__`, so that `len(dataset)` returns the size of the dataset.
- `__getitem__`, to support indexing such that `dataset[i]` returns the `i`th sample.
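Here is a minimal sketch of such a class, using a small in-memory NumPy array instead of files. `ToyDataset` and its contents are made up for this illustration:

```python
import numpy as np
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Hypothetical dataset: n samples with 3 features and a 0/1 label."""

    def __init__(self, n=10):
        self.x = np.arange(n * 3, dtype=np.float32).reshape(n, 3)
        self.y = np.arange(n) % 2

    def __len__(self):
        # Lets len(dataset) report the number of samples
        return len(self.x)

    def __getitem__(self, idx):
        # Lets dataset[idx] return one sample
        return {'x': self.x[idx], 'y': self.y[idx]}

toy = ToyDataset()
sample = toy[4]
print(len(toy), sample)
```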
     
<a id='dataloader'></a>
### DataLoader 
<a href='#toc'> Back to top </a><br>
`torch.utils.data.DataLoader` is a generator that takes our customized Dataset class as input and yields the dataset batch by batch, with the batch size that we specify.
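As a small sketch of the batching behavior, the snippet below wraps 10 random samples in the built-in `TensorDataset` (used here only to keep the example short) and iterates over them with a batch size of 4:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 10 samples with 3 features each, plus 10 dummy targets
dataset = TensorDataset(torch.rand(10, 3), torch.zeros(10))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Each iteration yields one (inputs, targets) batch
batch_sizes = [inputs.size(0) for inputs, targets in loader]
num_batches = len(loader)
print(batch_sizes, num_batches)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others; `DataLoader` has a `drop_last` option if you would rather skip it.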

The following example is excerpted from my own project. The dataset is stored in multiple HDF5 files located at `/bigdata/shared/LCDJets_Abstract_IsoLep_lt_20`. Each file in the dataset has 2 keys, `Images` and `Labels`, where `Images` is a NumPy array of shape (750, 150, 94, 5), i.e., 750 images of shape (150, 94, 5), where 5 is the number of channels and 150x94 is the WxH dimension of the images. `Labels` holds the one-hot encoded event type: QCD, $t \bar{t}$, or $W$+jets.

In [16]:
from torch.utils.data import Dataset, DataLoader
import h5py
from glob import glob

class EventImage(Dataset):

    def check_data(self, file_names): 
        '''Count the number of events in each file and mark the threshold 
        boundaries between adjacent indices coming from 2 different files'''
        num_data = 0
        thresholds = [0]
        for in_file_name in file_names:
            h5_file = h5py.File( in_file_name, 'r' )
            X = h5_file[self.feature_name]
            if hasattr(X, 'keys'):
                # keys() returns a view in Python 3, so convert it to a list before indexing
                num_data += len(X[list(X.keys())[0]])
                thresholds.append(num_data)
            else:
                num_data += len(X)
                thresholds.append(num_data)
            h5_file.close()
        return (num_data, thresholds)

    def __init__(self, dir_name, feature_name = 'Images', label_name = 'Labels'):
        self.feature_name = feature_name
        self.label_name = label_name
        self.file_names = glob(dir_name+'/*.h5')
        self.num_data, self.thresholds = self.check_data(self.file_names)

    def is_numpy_array(self, data):
        return isinstance(data, np.ndarray)

    def get_num_samples(self, data):
        """Input: dataset consisting of a numpy array or list of numpy arrays.
            Output: number of samples in the dataset"""
        if self.is_numpy_array(data):
            return len(data)
        else:
            return len(data[0])

    def load_data(self, in_file_name):
        """Loads numpy arrays from H5 file.
            If the features/labels groups contain more than one dataset,
            we load them all, alphabetically by key."""
        h5_file = h5py.File( in_file_name, 'r' )
        X = self.load_hdf5_data(h5_file[self.feature_name] )
        Y = self.load_hdf5_data(h5_file[self.label_name] )
        h5_file.close()
        return X,Y
    
    def load_hdf5_data(self, data):
        """Returns a numpy array or (possibly nested) list of numpy arrays
            corresponding to the group structure of the input HDF5 data.
            If a group has more than one key, we give its datasets alphabetically by key"""
        if hasattr(data, 'keys'):
            out = [ self.load_hdf5_data( data[key] ) for key in sorted(data.keys()) ]
        else:
            out = data[:]
        return out

    def get_data(self, data, idx):
        """Input: a numpy array or list of numpy arrays.
            Gets elements at idx for each array"""
        if self.is_numpy_array(data):
            return data[idx]
        else:
            return [arr[idx] for arr in data]

    def get_index(self, idx):
        """Translate the global index (idx) into local indexes,
        including file index and event index of that file"""
        file_index = next(i for i,v in enumerate(self.thresholds) if v > idx)
        file_index -= 1
        event_index = idx - self.thresholds[file_index]
        return file_index, event_index

    def get_thresholds(self):
        return self.thresholds

    # Below are the two functions you are required to define
    def __len__(self):
        return self.num_data

    def __getitem__(self, idx):
        file_index, event_index = self.get_index(idx)
        X, Y = self.load_data(self.file_names[file_index])
        return {'Images': self.get_data(X, event_index), 'Labels': np.argmax(self.get_data(Y, event_index))}

    
# Define the data generators from the training set and validation set. Let's try a batch size of 12.
train_set = EventImage(dir_name='/bigdata/shared/LCDJets_Abstract_IsoLep_lt_20//train/',
                       feature_name ='Images',label_name = 'Labels')
train_loader = DataLoader(train_set, batch_size=12, shuffle=True, num_workers=4)

val_set = EventImage(dir_name='/bigdata/shared/LCDJets_Abstract_IsoLep_lt_20//val/',
                     feature_name ='Images',label_name = 'Labels')
val_loader = DataLoader(val_set, batch_size=12, shuffle=True , num_workers=4)



Let's check what we got:

In [17]:
for batch_idx, data in enumerate(train_loader):
    inputs, targets = data['Images'], data['Labels']
    print("Inputs shape = {}".format(inputs.shape))
    print("Targets shape = {}".format(targets.shape))
    break

Inputs shape = torch.Size([12, 150, 94, 5])
Targets shape = torch.Size([12])


As expected, we got a pair of (input, target) from the data generator with batch size of 12. 
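A note on `get_index` in the class above: it finds the first threshold larger than the global index by scanning the list. The same lookup can be sketched with the standard `bisect` module; `locate` is a hypothetical helper, and the thresholds assume three files of 750 events each:

```python
from bisect import bisect_right

def locate(thresholds, idx):
    """Hypothetical helper: map a global index to (file_index, event_index)."""
    # bisect_right returns the position of the first threshold > idx
    file_index = bisect_right(thresholds, idx) - 1
    event_index = idx - thresholds[file_index]
    return file_index, event_index

thresholds = [0, 750, 1500, 2250]  # cumulative event counts, as built by check_data
results = [locate(thresholds, i) for i in (0, 749, 750, 2249)]
print(results)
```

This returns the same pairs as the linear scan, and only matters for speed when there are many files.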

For an end-to-end training reference, please take a look at https://github.com/pytorch/examples/blob/0.3.1/mnist/main.py


<a id='parallel'></a>
## Parallel and distributed training
<a href='#toc'> Back to top </a><br>

When dealing with a large dataset and a complex model, we want to *parallelize* the model over multiple GPUs to speed up the training. In a computer cluster with multiple nodes (such as `imperium-sm`, `culture-plate-sm`, and `flere-imsaho-sm` in the Caltech GPU cluster), we can also train the model across multiple nodes; this is called *distributed training*. PyTorch supports parallel and distributed training out of the box and takes care of the communication between GPUs.  

<a id='datapara'></a>
### DataParallel
<a href='#toc'> Back to top </a><br>

In a data parallel setting, the model is copied to multiple GPUs in 1 node. The mini-batch of samples is split into multiple smaller mini-batches that are sent to different GPUs; the model on each GPU performs the computation for its smaller mini-batch in parallel, and the outputs are synchronized across all GPUs. 

The mechanism for data parallelism is implemented in the `torch.nn.DataParallel` class. We can wrap our model in `DataParallel` and it will be parallelized over multiple GPUs in the batch dimension. The rest of the training code should be the same.

In [18]:
net = torch.nn.DataParallel(net).cuda()
print(net)

DataParallel(
  (module): Net(
    (dense1): Linear(in_features=3, out_features=5, bias=True)
    (dense2): Linear(in_features=5, out_features=1, bias=True)
  )
)
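One practical consequence of the wrapping, visible in the printout above: the original model now lives under `.module`, and the names in `state_dict()` gain a `module.` prefix. This matters when you save a checkpoint from a wrapped model and load it into an unwrapped one. A minimal sketch with a stand-in `Linear` model (not the tutorial's `Net`):

```python
import torch.nn as nn

model = nn.Linear(3, 1)            # stand-in model for illustration
wrapped = nn.DataParallel(model)

plain_keys = list(model.state_dict().keys())
wrapped_keys = list(wrapped.state_dict().keys())
print(plain_keys, wrapped_keys)

# The original model is still reachable through .module
same_object = wrapped.module is model
```

A common way around the prefix is to save `wrapped.module.state_dict()` instead of `wrapped.state_dict()`.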


<a id='distributed'></a>
### Distributed training
<a href='#toc'> Back to top </a><br>

In a distributed setting, we can train the model on multiple GPUs in 1 node just like with DataParallel training, but we can also extend it to different nodes of the cluster.

A nice example of writing a distributed training program can be found at https://github.com/pytorch/examples/blob/0.3.1/imagenet/main.py. I will go over a few concepts that we need to keep in mind when running a distributed program.

<a id='backend'></a>
#### Backend
PyTorch 0.3.1 supports 3 different backends that manage the communication across all devices: `tcp`, `gloo`, and `mpi`. Since `tcp` only supports CPU and `mpi` needs a customized installation, I recommend using the `gloo` backend on the Caltech cluster. 

<a id='world'></a>
#### World size and rank
In distributed training terminology, `world size` refers to the total number of distributed processes and `rank` is the index of a process, running from 0 to `world size` - 1. The `master process`, which manages all the `workers`, will always be rank 0. 

<a id='init'></a>
#### Distributed package initialization
<a href='#toc'> Back to top </a><br>

To start the distributed training program, we need to initialize the distributed package `torch.distributed`. There are many different initialization methods; in this tutorial I will use the environment variable initialization, in which you need to set 2 variables, `MASTER_PORT` and `MASTER_ADDR`, which indicate the port and address of the master node. The addresses of the nodes are listed in `/etc/hosts`. For example, if you decide to use `culture-plate-sm` as your master node:

In [30]:
os.environ['MASTER_PORT'] = '3466' # any free port number should work
os.environ['MASTER_ADDR'] = '10.3.10.98' # the address of culture-plate-sm, as listed in /etc/hosts

Now you can start the initialization:

```python
import torch.distributed as dist

dist.init_process_group(backend='gloo',init_method= 'env://',
                        world_size=args.world_size, rank=args.rank)
```
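The snippet above reads `args.world_size` and `args.rank`, which are not defined anywhere in this tutorial. A minimal sketch of how they might be parsed, assuming the same `--world-size` and `--rank` flags that the `multiproc.py` script further down passes along (the defaults and help texts are my own choices):

```python
import argparse

parser = argparse.ArgumentParser(description='Distributed options (illustrative)')
parser.add_argument('--world-size', type=int, default=1,
                    help='total number of distributed processes')
parser.add_argument('--rank', type=int, default=0,
                    help='index of this process, 0 is the master')

# In a real script this would be parser.parse_args() with no arguments;
# an explicit list is used here so the sketch runs anywhere
args = parser.parse_args(['--world-size', '3', '--rank', '1'])
print(args.world_size, args.rank)
```

Note that `argparse` turns the dash in `--world-size` into an underscore, hence `args.world_size`.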

In practice, `world_size` needs to be `N` > 1; however, if we did that in this tutorial, the notebook would hang until we spawned all `N` processes. 

For example, say we decide to set `world_size` = 3. We then need to open a new terminal and run the same program again with `rank` = 1, and then repeat in another terminal with `rank` = 2. In each initialization we can also change `os.environ['CUDA_VISIBLE_DEVICES']` to force each process to use a different GPU. 

If all processes are within the same node, there is a faster way that doesn't require opening `N` terminals. Write a script named `multiproc.py`:

```python
import torch
import sys
import subprocess

argslist = list(sys.argv)[1:]
world_size = torch.cuda.device_count()

if '--world-size' in argslist:
    world_size = int(argslist[argslist.index('--world-size')+1])
else:
    argslist.append('--world-size')
    argslist.append(str(world_size))

workers = []

for i in range(world_size):
    if '--rank' in argslist:
        argslist[argslist.index('--rank')+1] = str(i)
    else:
        argslist.append('--rank')
        argslist.append(str(i))
    stdout = None if i == 0 else open("GPU_"+str(i)+".log", "w")
    print(argslist)
    p = subprocess.Popen([str(sys.executable)]+argslist, stdout=stdout)
    workers.append(p)

for p in workers:
    p.wait()
```

And then run it with our main program, `main.py`, as the input parameter:

```bash
python3 multiproc.py main.py --world-size 3
```

<a id='distmodel'></a>
#### Distributed model and data
<a href='#toc'> Back to top </a><br>

Similar to `DataParallel`, we need to wrap our model inside the `torch.nn.parallel.DistributedDataParallel` class. 
```python
net = Net().cuda()
net = nn.parallel.DistributedDataParallel(net)
```
However, unlike `DataParallel`, we need to modify our data loader to distribute smaller mini-batches to different processes, using the `torch.utils.data.distributed.DistributedSampler` class:

```python
train_set = EventImage(dir_name='/bigdata/shared/LCDJets_Abstract_IsoLep_lt_20/train/', feature_name ='Images', label_name = 'Labels')
if args.world_size > 1: # Distributed training mode
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
else:
    train_sampler = None
train_loader = DataLoader(train_set, batch_size=args.batch_size, shuffle=(train_sampler is None), num_workers=args.workers, sampler=train_sampler)
```
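To give an intuition for what the sampler does, here is a plain-Python sketch of the idea of sharding a dataset by rank. It is an illustration only, not the `DistributedSampler` implementation: as far as I know, the real class also shuffles the indices and pads them so that every rank gets the same number of samples. `shard_indices` is a hypothetical helper:

```python
def shard_indices(num_samples, world_size, rank):
    """Hypothetical helper: indices of the samples handled by process `rank`."""
    # Every world_size-th index, starting at this process's rank
    return list(range(num_samples))[rank::world_size]

world_size = 3
shards = [shard_indices(10, world_size, rank) for rank in range(world_size)]
print(shards)

# Together, the shards cover every sample exactly once
covered = sorted(i for shard in shards for i in shard)
```

Each process then only iterates over its own shard, which is why `shuffle` is left to the sampler in the distributed case.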

---

<a id='conclusions'></a>
## Final words
<a href='#toc'> Back to top </a><br>

Hooray! We just went through a lot. It is certainly not everything, but it should give you a good start for writing a deep learning program in PyTorch. The ability to print out anything without opening a separate session sets PyTorch in its own league above Keras, TensorFlow, etc. in terms of convenience for debugging. More complex tasks, such as changing the target to a different dataset or modifying the network architecture after a few epochs, are much easier to do in PyTorch than in other frameworks. Even if you're not doing deep learning but ordinary numerical analysis where you have to invert gigantic matrices, writing it in PyTorch (i.e., using Tensors instead of NumPy arrays) can give you a strong performance boost on GPUs. For those reasons, learning PyTorch is a worthwhile investment.