
# Writing Device Safe PyTorch Code for Fun and Profit
Adam McSloy, 2021, Version 0.1

## Initialisation
Note that a CUDA-capable device, i.e. an Nvidia GPU, is required to execute much of the code present in this notebook.

In [1]:
import torch
from torch import Tensor
torch.set_default_tensor_type(torch.FloatTensor)
if not torch.cuda.is_available():
    raise SystemError('No CUDA device found: A CUDA enabled GPU is required for this workbook')

## Introduction

This document aims to provide a very simple introduction to writing "*device safe*" PyTorch code. When possible, PyTorch code should be made *device-safe* if it is intended to be executed across multiple, possibly heterogeneous, devices. As a general rule, any non-trivial code that is intended to be run on a non-CPU device should be made device-safe. However, device safe code is not necessary if executing exclusively on the CPU.

Within the context of this document, the term "*device safe*" is used to embody three main concepts: **device consistency**, the idea that a function's outputs should be on the same device as its inputs; **device purity**, the idea that interactions between tensors on different devices should be avoided; and finally **device agnosticism**, the idea that the results of a function should not depend on the device on which it was run. It should be noted that these are only general rules, and that many exceptions exist.

This document is intended to act only as an initial introduction to writing device safe code with PyTorch; users are encouraged to consult the PyTorch documentation for a more in-depth guide.

## PyTorch Devices
PyTorch has been designed in such a way that it allows the user to specify what device each individual tensor is placed on. This flexibility permits complex operations to be spread out over multiple devices (GPUs/CPUs), which can significantly increase performance. When a `torch.Tensor` is initialised without specifying a device to place it on:

In [2]:
new_tensor = torch.rand(4)
print(f'"new_tensor" was placed on device: {new_tensor.device}')

"new_tensor" was placed on device: cpu


then it is placed on the default device, which is normally the CPU. The keyword argument `device` can be used to tell PyTorch which device is responsible for that tensor. This keyword accepts a `torch.device` (or an equivalent string such as `'cuda:0'`); some examples of `torch.device` objects are provided below:

In [3]:
CPU = torch.device('cpu')  # <- Device representing the cpu
GPU = torch.device('cuda:0')  # <- Device representing the first gpu

These can then be used to specify the device on which a new tensor is created, like so:

In [4]:
cpu_tensor = torch.rand(10, device=CPU)
gpu_tensor = torch.rand(10, device=GPU)
print(f'"cpu_tensor" was placed on device: {cpu_tensor.device}')
print(f'"gpu_tensor" was placed on device: {gpu_tensor.device}')

"cpu_tensor" was placed on device: cpu
"gpu_tensor" was placed on device: cuda:0

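Because the `GPU` device above only exists on machines with a CUDA device, code that must also run elsewhere often selects its device at run time. The following is a minimal sketch of that pattern; the name `DEVICE` is introduced here purely for illustration and is not used elsewhere in this document. It also shows the `device` keyword being given a string in place of a `torch.device` object.

```python
import torch

# Use the first GPU when one is available, otherwise fall back to the CPU.
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# The `device` keyword accepts either a `torch.device` or an equivalent string.
from_object = torch.rand(10, device=DEVICE)
from_string = torch.rand(10, device=str(DEVICE))

print(f'"from_object" was placed on device: {from_object.device}')
print(f'"from_string" was placed on device: {from_string.device}')
```

Both tensors end up on the same device, whichever one was selected.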

At the time of writing, PyTorch supports nine different device architectures (the exact list varies between PyTorch versions):
1. **cpu**: Central Processing Unit (Default).
2. **cuda**: Compute Unified Device Architecture (an Nvidia GPU).
3. **mkldnn**
4. **opengl**
5. **opencl**
6. **ideep**
7. **hip**
8. **msnpu**
9. **xla**

This document will focus only on the first two, "cpu" and "cuda", as they are the most common and most issues are associated with cross CPU-GPU operations. At the time of writing, PyTorch only supports a single CPU device, but multiple GPUs. That is to say, `'cuda:0'`, `'cuda:1'` and `'cuda:2'` will resolve to three different devices (assuming three GPUs are connected) but only `'cpu:0'` is valid (`'cpu:1'` and `'cpu:2'` will raise an error).
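To see which of these device strings are usable on a given machine, the visible GPUs can be counted with `torch.cuda.device_count()`. The snippet below is a small sketch of this; the helper name `available_devices` is hypothetical.

```python
from typing import List

import torch


def available_devices() -> List[torch.device]:
    """List the CPU followed by every CUDA device visible to PyTorch."""
    devices = [torch.device('cpu')]
    # `device_count` returns 0 when no CUDA device is available.
    for index in range(torch.cuda.device_count()):
        devices.append(torch.device(f'cuda:{index}'))
    return devices


for dev in available_devices():
    print(dev)
```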

## Device Safe Coding   
For the most part, device agnosticism is enforced by the PyTorch module itself and requires little to no input on the part of the developer to achieve. Nevertheless, unit-tests should still check for agnosticism. Both device consistency and device purity must be explicitly built into the code by the developer. Some examples of functions that fail to uphold these two concepts are given below. It should be noted that these functions are **highly contrived** but are designed this way to highlight the issues associated with non-device safe code.

In [5]:
def multiply_by_identity_matrix_bad(mat: Tensor) -> Tensor:
    """Multiply a matrix by its identity.

    Arguments:
        mat: The matrix that is to be multiplied by an identity matrix.  
        
    Returns:
        mat_by_eye: `mat` multiplied by its identity matrix.
    
    Notes:
        This is an example of a non-device pure function.  
    """
    # Create the identity matrix
    eye = torch.eye(len(mat))
    # Multiply it by `mat` and return the result
    return mat @ eye

Upon first inspection, there does not seem to be anything overtly wrong with the above function. It even executes without issue when fed a tensor on the CPU.

In [6]:
some_input_matrix = torch.rand(10, 10, device=CPU)  # <- This would default to the CPU anyway
some_result = multiply_by_identity_matrix_bad(some_input_matrix)

However, if the input tensor is on any device other than the default (CPU), then an error is raised:

In [7]:
some_input_matrix_gpu = torch.rand(10, 10, device=GPU)  # <- Place on the GPU
some_result = multiply_by_identity_matrix_bad(some_input_matrix_gpu)

RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)

Looking at the error message, it can be seen that PyTorch has rejected an attempt to multiply together two tensors that are on different devices. In this case, the input matrix, `mat`, was on the GPU while the `torch.eye` function created a tensor on the CPU, which was the default device. Such a function is therefore **not** device pure, and so fails whenever its input is not on the default device.

An example of a non-device consistent function is given below: 

In [8]:
def build_matrix_of_ones_bad(mat: Tensor) -> Tensor:
    """Builds a matrix matching the size of `mat` filled with ones.
    
    Arguments:
        mat: Matrix to act as a template for the `ones` matrix.
        
    Returns:
        ones: A matrix of the same size as `mat` filled with ones.
    
    Notes:
        This is a rather contrived function but it is designed to highlight the
        issues associated with non-device consistent functions.     
    """
    # Build and return the ones matrix
    return torch.ones(mat.shape)

Again, the above function looks fine at first glance, and even runs without *apparent* issue on both CPU (default device) and GPU (non-default device) tensors alike.

In [9]:
input_matrix_cpu = torch.rand(5, 5, device=CPU)
input_matrix_gpu = torch.rand(5, 5, device=GPU)
result_matrix_cpu = build_matrix_of_ones_bad(input_matrix_cpu)
result_matrix_gpu = build_matrix_of_ones_bad(input_matrix_gpu)

However, upon closer inspection it can be seen that both results are on the CPU!

In [10]:
print(f'"result_matrix_cpu" is on the {result_matrix_cpu.device} (should be {CPU})')
print(f'"result_matrix_gpu" is on the {result_matrix_gpu.device} (should be {GPU})')

"result_matrix_cpu" is on the cpu (should be cpu)
"result_matrix_gpu" is on the cpu (should be cuda:0)


While this may not seem significant, it can have serious implications, as it would likely result in the next operation failing due to the result being on an unexpected device:

In [11]:
input_matrix_cpu @ result_matrix_cpu  # <- is fine
input_matrix_gpu @ result_matrix_gpu  # <- crashes as the result is on a different device

RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)

At the very least, all subsequent code would then be executed on the default device rather than the one the user specified (e.g. the GPU), which may greatly reduce performance and generally cause a lot of headaches. So the question now becomes: what is the most effective way to resolve these issues?

### The (Subjectively) "Wrong" Way
The quickest and easiest way to "fix" this would be to simply change the default device like so:

In [12]:
torch.set_default_tensor_type(torch.cuda.FloatTensor)

Now, all new tensors will be created on the GPU by default. This permits the above functions to operate on GPU tensors without issue:

In [13]:
some_input_matrix_gpu = torch.rand(10, 10, device=GPU)  # <- Place on the GPU
some_result = multiply_by_identity_matrix_bad(some_input_matrix_gpu)
input_matrix_gpu = torch.rand(5, 5, device=GPU)
result_matrix_gpu = build_matrix_of_ones_bad(input_matrix_gpu)
print(f'result_matrix_gpu is now on the {result_matrix_gpu.device} (should be cuda:0)')

result_matrix_gpu is now on the cuda:0 (should be cuda:0)


However, this has an unintended side-effect: the above functions now fail on CPU tensors, with the first raising an error and the second silently returning its result on the GPU. This is because the root cause of the problem was not fixed, but merely moved about. This may work if intending to use the GPU exclusively; however, it will fail just about everywhere else.

In [15]:
some_input_matrix = torch.rand(10, 10, device=CPU)
some_result = multiply_by_identity_matrix_bad(some_input_matrix)

RuntimeError: Tensor for 'out' is on CPU, Tensor for argument #1 'self' is on CPU, but expected them to be on GPU (while checking arguments for addmm)

and

In [16]:
input_matrix_cpu = torch.rand(5, 5, device=CPU)
result_matrix_cpu = build_matrix_of_ones_bad(input_matrix_cpu)
print(f'"result_matrix_cpu" is on the {result_matrix_cpu.device} (should be cpu)')
torch.set_default_tensor_type(torch.FloatTensor)  # <- Reset default to cpu

"result_matrix_cpu" is on the cuda:0 (should be cpu)


Similar issues would also be likely when using multiple GPUs. Clearly a more elegant solution is required.

### The "Right" Way
Thankfully, the solution to this problem is rather simple. All that is required is to ensure that any new tensors created within a function are placed on the same device as the inputs, as in the example below:

In [86]:
# Again these functions are highly contrived examples that would never actually
# be seen in the real world!

def multiply_by_identity_matrix_good(mat: Tensor) -> Tensor:
    """Multiply a matrix by its identity.

    Arguments:
        mat: The matrix that is to be multiplied by an identity matrix.  
        
    Returns:
        mat_by_eye: `mat` multiplied by its identity matrix. 
    """
    # Create the identity matrix
    eye = torch.eye(len(mat), device=mat.device)  # <- Change is here!
    # Multiply it by `mat` and return the result
    return mat @ eye

def build_matrix_of_ones_good(mat: Tensor) -> Tensor:
    """Builds a matrix matching the size of `mat` filled with ones.
    
    Arguments:
        mat: Matrix to act as a template for the `ones` matrix.
        
    Returns:
        ones: A matrix of the same size as `mat` filled with ones.    
    """
    # Build and return the ones matrix
    return torch.ones(mat.shape, device=mat.device)  # <- Change is here!

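The above functions now operate without issue. It should be noted that, in the PyTorch version used to produce this notebook, indexing tensors are technically exempt from the "don't mix device types" rule (this behaviour may differ in other versions):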

In [87]:
data = torch.tensor([1.0,2.0, 3.0, 4.0, 5.0], device=CPU)
indices = torch.tensor([0, 2, 4], device=GPU)
print(data[indices])  #  <- This does NOT crash as one might expect.

tensor([1., 3., 5.])


Nevertheless, it is strongly advised to keep them on the same device for performance reasons.
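Returning to tensor creation inside functions: PyTorch also provides creation routines that take their device (and dtype) from an existing tensor, which avoids passing `device=` by hand. The sketch below is one possible alternative to the `_good` functions above, not a replacement prescribed by this document; the `_alt` function names are hypothetical.

```python
import torch
from torch import Tensor


def build_matrix_of_ones_alt(mat: Tensor) -> Tensor:
    """Return a tensor of ones with the shape, dtype and device of `mat`."""
    return torch.ones_like(mat)


def multiply_by_identity_matrix_alt(mat: Tensor) -> Tensor:
    """Multiply a square matrix by an identity matrix on the same device."""
    # Passing dtype as well keeps the product valid for non-default dtypes.
    eye = torch.eye(len(mat), device=mat.device, dtype=mat.dtype)
    return mat @ eye


# Run on the CPU and, if one is present, on the first GPU.
devices = [torch.device('cpu')]
if torch.cuda.is_available():
    devices.append(torch.device('cuda:0'))

for dev in devices:
    mat = torch.rand(5, 5, device=dev)
    assert build_matrix_of_ones_alt(mat).device == mat.device
    assert multiply_by_identity_matrix_alt(mat).device == mat.device
```

Methods such as `Tensor.new_ones` and `Tensor.new_zeros` behave similarly when a shape different from that of the template tensor is needed.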

### Moving Things About
Once a tensor has been created it can be "*moved*" from one device to another using its `to` method like so:

In [88]:
cpu_tensor = torch.rand(4, device=CPU)  # <- create on the CPU
cpu_tensor_moved = cpu_tensor.to(GPU)  # <- make a copy on the GPU
print(f'"cpu_tensor_moved" is on the {cpu_tensor_moved.device}')

"cpu_tensor_moved" is on the cuda:0


It is important to note that the `.to` operation does not actually "*move*" the tensor, but rather creates a new copy on the specified device. This means that changes to the original tensor will not propagate to the new tensor, i.e. they do not share the same underlying memory. (If the tensor is already on the requested device, `.to` simply returns the original tensor.) For the most part, such operations should be limited to the end points of a program, i.e. the initial setup before running a calculation and the final collection of results. This is because moving data from the CPU to the GPU, and vice versa, will impact overall performance.
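The copy semantics of `.to` can be checked directly. The sketch below also runs on a CPU-only machine: it relies on `.to` returning the original tensor when no change of device or dtype is needed, and on the `copy=True` argument to force a new tensor. The GPU branch is only taken when a CUDA device is available.

```python
import torch

original = torch.zeros(4)  # created on the CPU

# No device or dtype change is needed, so the very same tensor is returned.
same = original.to(torch.device('cpu'))
print(same is original)  # True

# `copy=True` forces a new tensor, even on the same device.
forced_copy = original.to(torch.device('cpu'), copy=True)
original[0] = 1.0  # modify the original in place
print(forced_copy[0].item())  # 0.0, the copy is unaffected

if torch.cuda.is_available():
    # A cross-device transfer creates a new, independent tensor.
    on_gpu = original.to(torch.device('cuda:0'))
    original[1] = 2.0
    print(on_gpu[1].item())  # still 0.0
```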

## Unit-Testing
This section gives some hints on how unit-tests should be constructed to best test for issues that may occur when developing a device safe project. It assumes that the pytest testing framework is used.

### Setup
To start with, the following code should be appended to the `conftest.py` configuration file:

In [89]:
import pytest
import torch


def pytest_addoption(parser):
    parser.addoption(
        "--device", action="store", default="cpu", help="specify test device (cpu/cuda/etc.)"
    )


@pytest.fixture
def device(request) -> torch.device:
    """Defines the device on which each test should be run.

    Returns:
        device: The device on which the test will be run.

    """
    device_name = request.config.getoption("--device")
    if device_name == 'cuda':
        return torch.device('cuda:0')
    else:
        return torch.device(device_name)

The first function, `pytest_addoption`, adds a new optional argument, `--device`, to the `pytest` command line interface. This new argument allows the user to specify what device the tests are to be run on. For example:
```bash
pytest --device cpu
```

would run all tests on the CPU and

```bash
pytest --device cuda
```
would run all tests on the GPU. Note that if a device is not specified then it will default to `cpu`. The second function sets up a `pytest.fixture`; more information about fixtures can be found [here](https://docs.pytest.org/en/stable/fixture.html). In short, a fixture is a mechanism that allows values to be passed into test functions at run-time. For example:

In [90]:
@pytest.fixture
def an_example_fixture():
    return 10

def test_function_example(an_example_fixture):
    pass

If pytest was run on the above code, it would call the fixture function `an_example_fixture` and feed its result into any test function which takes an argument of the same name, e.g. the `test_function_example` function. For device testing only one fixture is needed, `device`. This allows the device to be passed in to the test functions as needed.

These two functions allow for tests to be run easily on any device desired without having to: i) hard code the device; ii) write a separate test for each device; or iii) loop over all devices, all of which would be incredibly inefficient.
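The `device` fixture above maps the bare name `cuda` to `cuda:0` for a reason: a CUDA tensor reports an indexed device such as `cuda:0`, and `torch.device('cuda')` does not compare equal to `torch.device('cuda:0')`, so a check such as `z.device == device` would otherwise fail. A slightly more general version of that normalisation is sketched below; `normalise_device` is a hypothetical helper, and treating an unindexed CUDA device as the first GPU is an assumption chosen to match the fixture.

```python
import torch


def normalise_device(name: str) -> torch.device:
    """Convert a device name into a `torch.device` with an explicit GPU index."""
    device = torch.device(name)
    # Assumption: an unindexed CUDA device refers to the first GPU.
    if device.type == 'cuda' and device.index is None:
        return torch.device('cuda', 0)
    return device


# Constructing a `torch.device` does not require the hardware to be present.
print(torch.device('cuda') == torch.device('cuda:0'))
print(normalise_device('cuda') == torch.device('cuda:0'))
print(normalise_device('cpu'))
```

Inside the fixture, the `if`/`else` on `device_name` could then be replaced by a single call to such a helper.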

### Testing
It is important to ensure that the test functions actually run on the correct device, as specified by the user. This is best done by manually specifying the device on which each tensor is created, using the `device` fixture like so:

In [91]:
def test_my_function(device):
    x = torch.tensor([1., 2., 3.], device=device)
    y = torch.tensor([4., 5., 6.], device=device)
    z = my_function(x, y)
    ...
    ...
    # A check would be performed here to check that
    # z is within permitted tolerance limits. 
    ...
    ...
    device_check = z.device == device
    assert device_check, 'Device persistence check'

Thus, most if not all test functions should make use of the `device` fixture argument. To test for device consistency, a short check should be performed at the end of every test to ensure the result has not slipped off the device.


Running such tests once on the CPU and again on the GPU should be enough to ensure device-safe code execution as: non-device-agnostic code would result in the tolerance check failing; non-device-pure code would result in an execution error; and non-device-consistent code would fail the `assert device_check` check.
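Putting the three checks together, a complete test might look like the sketch below. `scale_and_shift` is a hypothetical stand-in for `my_function`, and the reference values were worked out by hand for this example. The check is written as a plain function named `check_scale_and_shift` so that it can be called directly here; under pytest it would be named `test_...` and would receive `device` from the fixture defined in `conftest.py`.

```python
import torch
from torch import Tensor


def scale_and_shift(x: Tensor, y: Tensor) -> Tensor:
    """Hypothetical function under test: returns 2 * x + y + 1."""
    # The constant is created on the input's device (device purity).
    one = torch.ones(x.shape, device=x.device, dtype=x.dtype)
    return 2 * x + y + one


def check_scale_and_shift(device: torch.device) -> None:
    x = torch.tensor([1., 2., 3.], device=device)
    y = torch.tensor([4., 5., 6.], device=device)
    z = scale_and_shift(x, y)

    # Device agnosticism: the values must match the reference on any device.
    reference = torch.tensor([7., 10., 13.], device=device)
    assert torch.allclose(z, reference), 'Tolerance check'

    # Device consistency: the result must be on the same device as the inputs.
    assert z.device == device, 'Device persistence check'


# Outside of pytest the check can be called directly for each available device.
check_scale_and_shift(torch.device('cpu'))
if torch.cuda.is_available():
    check_scale_and_shift(torch.device('cuda:0'))
```

A function that broke device purity would raise an error inside `scale_and_shift` when run on the GPU, before either assertion is reached.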