# Tutorial: PyTorch computation using multiple GPUs

Tutorial adapted from [this PyTorch example](https://github.com/pytorch/examples/tree/master/mnist).

## Introduction

The aim of this tutorial is to use the AI TRAINING product to train a simple model on the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) with the PyTorch library, and to compare the performance of training it on one GPU versus multiple GPUs.

## Prerequisites

* a Public Cloud project
* an AI TRAINING notebook job launched with the PyTorch preset image ([documentation available here](https://docs.ovh.com/gb/en/ai-training/start-use-notebooks/))
* the notebook resources should have at least **2 GPUs**

### Step 1: Import PyTorch library

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR

### Step 2: Check that you have GPU(s) available on your notebook

In [2]:
for device_index in range(torch.cuda.device_count()):
    device = 'cuda:{}'.format(device_index)
    device_name = torch.cuda.get_device_name(device)
    print('{} ({})'.format(device, device_name))

cuda:0 (Tesla V100S-PCIE-32GB)
cuda:1 (Tesla V100S-PCIE-32GB)
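
The benchmark later in this notebook only makes sense when at least two devices are listed, so it can help to fail early with a clear message. Below is a minimal sketch of such a guard. The helper name `require_gpus` is hypothetical (it is not part of PyTorch), and it takes the device count as an argument so that the logic can be checked on a machine without any GPU.

```python
def require_gpus(available, required=2):
    """Raise a clear error when fewer GPUs than expected are visible.

    `available` is expected to be the value of torch.cuda.device_count().
    """
    if available < required:
        raise RuntimeError(
            'This tutorial needs at least {} GPUs, but only {} detected'.format(
                required, available))
    return available


# In the notebook, pass the count reported by PyTorch:
# require_gpus(torch.cuda.device_count())
```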


### Step 3: Declare the neural network to train

In [3]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 4096)
        self.fc2 = nn.Linear(4096, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
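
The input size of `fc1` (9216) is not arbitrary: it is the number of values left after the two convolutions and the pooling layer. The sketch below reproduces that arithmetic in plain Python, assuming 28x28 MNIST images and the layer parameters declared above (3x3 kernels, stride 1, no padding, 2x2 max pooling). The helper names are illustrative only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size=28, channels=64):
    """Number of features reaching fc1 for a square input image."""
    size = conv_out(image_size, kernel=3)  # conv1: 28 -> 26
    size = conv_out(size, kernel=3)        # conv2: 26 -> 24
    size = size // 2                       # max_pool2d(x, 2): 24 -> 12
    return channels * size * size          # 64 channels * 12 * 12


print(flattened_features())  # 9216
```

If you change the kernel sizes or the input resolution, the first argument of `nn.Linear` in `fc1` has to be updated accordingly.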

### Step 4: Declare train and test functions

In [4]:
def train(model, device, train_loader, test_loader, lr=1.0, gamma=0.7):
    print()
    print('Train {}'.format(device))
    optimizer = optim.Adadelta(model.parameters(), lr=lr)
    scheduler = StepLR(optimizer, step_size=1, gamma=gamma)
    for epoch in range(1, epochs + 1):
        train_one_epoch(model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
        scheduler.step()

def train_one_epoch(model, device, train_loader, optimizer, epoch):
    losses = []
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        if batch_idx % 50 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
    return losses
                

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
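
The `StepLR` scheduler used in `train` multiplies the learning rate by `gamma` every `step_size` epochs. As a rough illustration, the plain-Python sketch below lists the learning rate that would be used for each epoch with the defaults above (`lr=1.0`, `gamma=0.7`, `step_size=1`). The function name `step_lr_schedule` is hypothetical and only mirrors the decay formula; it does not call PyTorch.

```python
def step_lr_schedule(initial_lr=1.0, gamma=0.7, step_size=1, epochs=5):
    """Learning rate used during each epoch under a step decay."""
    return [initial_lr * gamma ** (epoch // step_size) for epoch in range(epochs)]


for epoch, lr in enumerate(step_lr_schedule(), start=1):
    print('Epoch {}: lr = {:.4f}'.format(epoch, lr))
```

With a single epoch, as in the benchmark of this tutorial, only the initial learning rate is ever used.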

### Step 5: Load MNIST dataset

In [5]:
transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
dataset1 = datasets.MNIST('/workspace/data', train=True, download=True, transform=transform)
dataset2 = datasets.MNIST('/workspace/data', train=False, transform=transform)
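
`ToTensor` scales pixel values to the range [0, 1], and `Normalize` then subtracts the mean and divides by the standard deviation given above. The sketch below applies the same formula to single pixel values in plain Python, to show the range the network actually sees. It is an illustration of the arithmetic only, not a replacement for the transform.

```python
MEAN, STD = 0.1307, 0.3081  # values passed to transforms.Normalize above


def normalize(pixel):
    """Apply (pixel - mean) / std to a pixel already scaled to [0, 1]."""
    return (pixel - MEAN) / STD


print('black pixel: {:.4f}'.format(normalize(0.0)))
print('white pixel: {:.4f}'.format(normalize(1.0)))
```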

### Step 6: Train model with multiple GPUs

There are several ways to use multiple GPUs to train a model. In this notebook we will use [nn.DataParallel](https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html), which splits each input batch across the available GPUs.
It is not the most efficient option: [nn.parallel.DistributedDataParallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) (DDP) usually gives better performance and can also scale across several machines, but DDP does not work well inside notebooks.
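
To get a feel for what `nn.DataParallel` does with a batch, the sketch below estimates how many samples each GPU receives and how many training steps one epoch contains. It is plain-Python arithmetic, assuming the batch is cut into near-equal chunks along the first dimension (as `torch.chunk` does) and that the MNIST training set holds 60000 images. The helper names are hypothetical.

```python
import math


def split_batch(batch_size, num_gpus):
    """Per-GPU sample counts when a batch is cut into near-equal chunks."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= chunk
    return sizes


def steps_per_epoch(dataset_size, batch_size):
    """Number of batches needed to go through the dataset once."""
    return math.ceil(dataset_size / batch_size)


print(split_batch(128, 2))          # [64, 64]
print(steps_per_epoch(60000, 128))  # 469
print(split_batch(60000 % 128, 2))  # last, smaller batch: [48, 48]
```

With a batch size of 128 and two GPUs, each device only processes 64 small images per step, and the split and gather happen on every one of the 469 steps.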

We will train our model for only one epoch to make the benchmark run fast but you can increase this value.

In [6]:
import timeit

# Input batch size for training
batch_size = 128
# Input batch size for testing
test_batch_size = 1000
# Number of epochs to train
epochs = 1

train_loader = torch.utils.data.DataLoader(dataset1, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(dataset2, batch_size=test_batch_size)

# Mono GPU benchmark
device = 'cuda:0'
model = Net().to(device)
variables = {
    'model': model, 'device': device, 'train_loader': train_loader, 'test_loader': test_loader,
}
mono_gpu_time = timeit.timeit('train(model, device, train_loader, test_loader)', globals=variables, number=1, setup="from __main__ import train")

# Multi GPU benchmark
device = 'cuda'
# -- Wrap the model with nn.DataParallel --
model = nn.DataParallel(Net()).to(device)
variables = {
    'model': model, 'device': device, 'train_loader': train_loader, 'test_loader': test_loader,
}
multi_gpu_time = timeit.timeit('train(model, device, train_loader, test_loader)', globals=variables, number=1, setup="from __main__ import train")

# Results
print('Mono GPU took {:.2f}s'.format(mono_gpu_time))
print('Multi GPU took {:.2f}s'.format(multi_gpu_time))
print('Multi GPU is {:.2f}x faster than mono GPU to train this model'.format(mono_gpu_time / multi_gpu_time))


Train cuda:0

Test set: Average loss: 0.0528, Accuracy: 9820/10000 (98%)


Train cuda

Test set: Average loss: 0.0509, Accuracy: 9835/10000 (98%)

Mono GPU took 12.42s
Multi GPU took 31.25s
Multi GPU is 0.40x faster than mono GPU to train this model
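
A ratio below 1 means the multi-GPU run was actually slower. The short sketch below redoes the arithmetic from the two timings reported above (the values are copied from this run and will differ on your hardware) and expresses the result both as a speedup and as a slowdown factor.

```python
# Timings copied from the output above, in seconds
mono_gpu_time = 12.42
multi_gpu_time = 31.25

speedup = mono_gpu_time / multi_gpu_time   # below 1: multi GPU is slower
slowdown = multi_gpu_time / mono_gpu_time  # how many times slower

print('Speedup: {:.2f}x'.format(speedup))
print('Multi GPU run was {:.2f}x slower than mono GPU'.format(slowdown))
```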


### Conclusion

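If you check the resource usage of your job, you will see that several GPUs are used.  
However, training is not faster: in the run above, the multi-GPU version took 31.25s against 12.42s on a single GPU. `nn.DataParallel` does not always improve performance. A likely reason here is that it replicates the model and scatters and gathers data across GPUs at every iteration, and for a small model such as this one that overhead can outweigh the gain.  
There are better ways to distribute the training of your model, such as using `nn.parallel.DistributedDataParallel`, or carefully selecting which parts of the model should be split across multiple GPUs.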

### Going further

* For more information about running computations with PyTorch, we advise you to refer to the [official documentation](https://pytorch.org/docs/stable/index.html).
* The resource consumption of your notebook is displayed in a monitoring dashboard. Execute the following cell to get the URL corresponding to your notebook session. The credentials needed to access this dashboard are the same as those used for the current notebook.

In [None]:
import os

if 'NOTEBOOK_ID' in os.environ:
    VARID = "var-notebook=" + os.environ['NOTEBOOK_ID']
    HOST = os.environ['NOTEBOOK_HOST']
    SUBDOMAIN = "notebook"
else:
    VARID = "var-job=" + os.environ['JOB_ID']
    HOST = os.environ['JOB_HOST']
    SUBDOMAIN = "job"


print('Your resource monitoring dashboard URL is:')
print(f'http://{HOST.replace(SUBDOMAIN, "monitoring")}/d/gpu/job-monitoring?orgId=1&from=now-5m&{VARID}&to=now')
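
To make the URL construction above easier to follow, here is the same logic as a small function applied to made-up values. The host name and identifier below are hypothetical placeholders, not real endpoints, and `dashboard_url` is an illustrative helper rather than part of any library.

```python
def dashboard_url(host, subdomain, var_id):
    """Build the monitoring URL by swapping the subdomain in the host name."""
    monitoring_host = host.replace(subdomain, 'monitoring')
    return 'http://{}/d/gpu/job-monitoring?orgId=1&from=now-5m&{}&to=now'.format(
        monitoring_host, var_id)


# Hypothetical values, for illustration only
print(dashboard_url('abc123.notebook.example.com', 'notebook', 'var-notebook=abc123'))
```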