# Using the GPU

GPUs have been a key ingredient in the current success of neural networks. They were primarily developed for rendering 3D graphics, and working with 3D geometry consists mainly of linear algebra on a very large scale. Hence, any problem that requires that type of computation can benefit from a GPU as well.

In this tutorial we will go over how to use PyTorch to leverage our GPU. Let's start with the usual imports.

In [1]:
# standard libraries
import math, os, time
import numpy as np

# plotting
import matplotlib.pyplot as plt

# progress bars
from tqdm.notebook import trange, tqdm

# PyTorch
import torch

PyTorch accesses Nvidia GPUs via the CUDA API, hence in PyTorch these are referred to as CUDA devices. Let's check whether we have a CUDA device available.

In [2]:
torch.cuda.is_available()

True
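
On a machine without a CUDA device this check returns `False`. A common pattern, sketched here under the assumption that falling back to the CPU is acceptable for your code, is to pick a device once and reuse it. The variable name `device` is our own choice and is not used elsewhere in this tutorial.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Number of CUDA devices PyTorch can see (0 on a CPU-only machine).
print(torch.cuda.device_count())
```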

## CPU

Let's set up a big compute task: we are going to do some computations with two large tensors.

In [3]:
a = torch.rand(10,1000,1000)
b = torch.rand(10,1000,1000)

Besides their shape and dtype, tensors have a third important property: their device.

In [4]:
print(a.device)

cpu


If a tensor's `device` is `cpu` that means the data in the tensor is resident in your computer's main memory and that any computations you do with it will be performed by the CPU.
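
As a small illustration, the three properties can be inspected side by side. This is a sketch on a throwaway tensor `t` (a name of our own) and assumes PyTorch's default settings, under which new tensors are `float32` and live on the CPU.

```python
import torch

t = torch.rand(2, 3)

print(t.shape)   # the dimensions of the tensor
print(t.dtype)   # the element type, float32 by default
print(t.device)  # where the data lives, cpu by default
```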

Let us time how long it takes the CPU to evaluate an expression involving these two tensors. For that we can use the `%%timeit` cell magic: starting a cell with `%%timeit` causes IPython to benchmark the cell. The options `-r` and `-n` specify the number of runs and the number of loops per run, respectively.

In [5]:
%%timeit -r5 -n50

c = a + b**2 + a.sin()

47.3 ms ± 617 µs per loop (mean ± std. dev. of 5 runs, 50 loops each)


The output reports the mean running time per loop together with its standard deviation. The duration depends heavily on your hardware.
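
The `%%timeit` magic only exists inside IPython and Jupyter. In a plain Python script, a comparable measurement can be sketched with the standard library's `timeit` module. The function name `compute` is our own, and the tensors are smaller than in the text to keep the example quick.

```python
import timeit
import torch

# Smaller tensors than in the text, to keep this example quick.
a = torch.rand(10, 100, 100)
b = torch.rand(10, 100, 100)

def compute():
    return a + b**2 + a.sin()

# repeat=5 and number=50 mirror the -r5 and -n50 options of %%timeit.
totals = timeit.repeat(compute, repeat=5, number=50)

# Each entry of `totals` is the time for 50 loops; divide to get per-loop times.
per_loop = [t / 50 for t in totals]
print(f"best: {min(per_loop) * 1e3:.3f} ms per loop")
```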

## CUDA

Let's see how we can make this faster by doing the computation on the GPU. The first problem is that our tensors live in our computer's main memory, which the GPU cannot access directly. A GPU comes with its own memory to work with, so we will need to move the tensors. We can copy a tensor over to the CUDA device by calling `.cuda()` on it.

In [6]:
a_cu = a.cuda()
b_cu = b.cuda()

print(a_cu.device)
print(b_cu.device)

cuda:0
cuda:0


The new tensors `a_cu` and `b_cu` are exact copies of `a` and `b` but live in the GPU's memory. We can check their `.device` property to verify what device they reside on. You can install more than one GPU in your computer, so `cuda:0` indicates the tensors reside on the first one. Now we can redo the computation, this time on the GPU.
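
An alternative to `.cuda()` is the more general `.to()` method, which takes the target device as an argument and therefore also works on machines without a GPU. A minimal sketch, assuming a CPU fallback is acceptable; the names `device` and `a_dev` are ours.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.rand(3, 3)

# Copies the data to `device`; if the tensor is already there, `a` itself is returned.
a_dev = a.to(device)

print(a_dev.device)
```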

In [7]:
%%timeit -r5 -n50

c_cu = a_cu + b_cu**2 + a_cu.sin()

The slowest run took 5.91 times longer than the fastest. This could mean that an intermediate result is being cached.
54.7 µs ± 52.9 µs per loop (mean ± std. dev. of 5 runs, 50 loops each)


Compare the running time of this cell with the previous CPU version. How much faster it is depends on your hardware, but expect 10x-100x.
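
One caveat when timing GPU code: CUDA operations are launched asynchronously, so a line such as `c_cu = a_cu + b_cu**2 + a_cu.sin()` can return as soon as the work is queued. A timing taken without waiting for the GPU may therefore measure mostly launch overhead and overstate the speed-up. The following is a minimal sketch of a timing that waits for the GPU, using a plain `time.perf_counter()` loop instead of `%%timeit`. The helper `sync` and the names `a_dev`, `b_dev` and `loops` are our own, and the code falls back to the CPU when no CUDA device is present.

```python
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
a_dev = torch.rand(10, 1000, 1000, device=device)
b_dev = torch.rand(10, 1000, 1000, device=device)

def sync():
    # Wait for all queued GPU work to finish; nothing to do on the CPU.
    if device.type == 'cuda':
        torch.cuda.synchronize()

loops = 20
sync()                                   # make sure no earlier work is pending
start = time.perf_counter()
for _ in range(loops):
    c_dev = a_dev + b_dev**2 + a_dev.sin()
sync()                                   # wait for the GPU before stopping the clock
elapsed = (time.perf_counter() - start) / loops

print(f"{elapsed * 1e3:.3f} ms per loop on {device}")
```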

The result of an operation on tensors on a certain device will also end up on that device.

In [8]:
c = a + b
c_cu = a_cu + b_cu

print(c.device)
print(c_cu.device)

cpu
cuda:0


Let's check whether the CPU and GPU yielded the same answer. Asking the CPU and GPU to yield exactly the same answer is too much: particularly for non-trivial functions such as trigonometric functions, the answer depends on the choice of algorithm. Instead we will content ourselves with checking whether all the entries are sufficiently close. The `.allclose()` function returns `True` if all entries are equal within a tolerance.

In [9]:
c.allclose(c_cu)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

This throws an exception because we cannot compare tensors located on different devices. Copying a tensor on the GPU back to main memory is accomplished by calling `.cpu()`.

In [10]:
c.allclose(c_cu.cpu())

True

This version works since both tensors are in main memory and the CPU can perform the comparison. Fortunately the CPU and GPU have arrived at the same answer.
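
The tolerance used by `.allclose()` can be adjusted through its `rtol` (relative) and `atol` (absolute) arguments. A small CPU-only sketch of the difference between approximate and exact comparison; the tensors `x` and `y` are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = x + 1e-4  # a small, deliberate perturbation

print(x.allclose(y))             # False with the default tolerances
print(x.allclose(y, atol=1e-3))  # True once the absolute tolerance is loosened
print(torch.equal(x, y))         # False, since torch.equal demands exact equality
```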

## Creating Tensors on the GPU

It is not necessary to shuffle tensors between the CPU and GPU all the time; in fact, this takes a lot of time and should be avoided. All the ways of creating tensors we saw in the previous tutorial can be used to create tensors on a CUDA device directly.

In [11]:
cuda = torch.device('cuda') # select the default CUDA device, normally cuda:0

d = torch.tensor([[1.,2.],[3.,4.]], device=cuda) # create this tensor directly on the CUDA device
e = torch.ones((2, 2), device=cuda)
f = torch.rand((2, 3), device=cuda)

d, e, f

(tensor([[1., 2.],
         [3., 4.]], device='cuda:0'),
 tensor([[1., 1.],
         [1., 1.]], device='cuda:0'),
 tensor([[0.0667, 0.1461, 0.0265],
         [0.7871, 0.3133, 0.3860]], device='cuda:0'))
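
When a tensor already lives on the desired device, the `*_like` creation functions offer another way to avoid transfers: they create a new tensor with the same shape, dtype and device as their argument. A short sketch, which may go beyond what the previous tutorial covered; `device`, `g` and `h` are names of our own, and the code falls back to the CPU when no CUDA device is present.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

d = torch.ones((2, 2), device=device)

# Both results inherit shape, dtype and device from `d`.
g = torch.zeros_like(d)
h = torch.rand_like(d)

print(g.device, h.device)
```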

## Optional Exercises

1. Use `%%timeit` to verify that generating a large random matrix on the CPU and then copying it to the GPU is slower than generating it directly on the GPU.