In [1]:
import torch
from torch import nn
from d2l import torch as d2l

## Computing Devices

We can specify the device on which a computation runs, e.g. the CPU or a GPU. By default, tensors are created and computed on the CPU. The CPU is denoted by `torch.device('cpu')` and a GPU by `torch.device('cuda')`. The `cpu` device covers all physical CPUs and memory, so PyTorch will try to use all available cores, whereas a `cuda` device refers to _one_ GPU only. With several GPUs, `torch.device(f'cuda:{i}')` selects the `i`-th one, counting from 0.
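As a small sketch of how the indexing works: constructing a `torch.device` does not touch the hardware, so the lines below should run even on a machine without a GPU. The variable names are only illustrative.

```python
import torch

# A bare 'cuda' carries no index; it refers to the current CUDA device.
generic = torch.device('cuda')
# 'cuda:1' names the second GPU explicitly.
second = torch.device('cuda:1')
# The device type and index can also be passed separately.
same_second = torch.device('cuda', 1)

print(generic.type, generic.index)
print(second.type, second.index)
print(second == same_second)
```

Whether a device with that index actually exists is only checked once a tensor is placed on it.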

In [5]:
def cpu(): 
    """Get the CPU device."""
    return torch.device('cpu')
    
def gpu(i=0): 
    """Get a GPU device."""
    return torch.device(f'cuda:{i}')

# Our second, totally theoretical, GPU is indexed here. 
cpu(), gpu(), gpu(1)

(device(type='cpu'),
 device(type='cuda', index=0),
 device(type='cuda', index=1))

In [3]:
def num_gpus():
    """Get the number of available GPUs."""
    return torch.cuda.device_count()

num_gpus()

1

In [4]:
torch.cuda.is_available()

True

Here, we define two convenient functions that let us run code even if the requested GPUs do not exist.

In [7]:
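def try_gpu(i=0):
    """Return gpu(i) if it exists, otherwise return cpu()."""
    if num_gpus() >= i + 1:
        return gpu(i)
    return cpu()

def try_all_gpus():
    """Return all available GPUs, or an empty list if there are none."""
    return [gpu(i) for i in range(num_gpus())]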

In [8]:
try_gpu(), try_gpu(10), try_all_gpus()

(device(type='cuda', index=0),
 device(type='cpu'),
 [device(type='cuda', index=0)])

## Tensors and GPUs

In [9]:
# By default, tensors are created on the CPU. The device attribute shows where a tensor lives.
x = torch.tensor([1, 2, 3])
x.device

device(type='cpu')

Whenever we operate on several tensors, they must all be on the same device. Otherwise the framework does not know where to store the result, or even where to perform the computation.
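One common way to keep operands together is to pick a single device up front and create or move everything there. The following is a minimal sketch, assuming a fallback to the CPU when no GPU is present; `device`, `a`, `b` and `c` are illustrative names.

```python
import torch

# Pick the first GPU when one is present, otherwise fall back to the CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

a = torch.tensor([1., 2., 3.], device=device)  # created directly on `device`
b = torch.ones(2, 3).to(device)                # created on the CPU, then moved

assert a.device == b.device  # both operands live on the same device
c = a + b                    # so the sum is computed and stored there too
print(c.device)
```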

### Storage on the GPU

In [10]:
# Create the tensor directly on the GPU, so it only uses GPU memory. Run the nvidia-smi command to see GPU memory usage.

X = torch.ones(2, 3, device=try_gpu())
X

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')

### Copying

The code below raises an exception because `x` is on the CPU and `X` is on the GPU.

In [11]:
x + X

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

In [14]:
# So we copy from one device to another
y = x.cuda(0)
y + X

# Now that both tensors are on the gpu, we can add them (with broadcasting)

tensor([[2., 3., 4.],
        [2., 3., 4.]], device='cuda:0')

Copying data between CPU RAM and GPU memory is slow compared with computation, which is why the framework raises an error instead of silently copying the data it needs. Be sure that you want to copy something before doing so.

Transfers also make parallelization much more challenging, since work has to wait for the data to arrive before it can proceed. One large transfer is better than many small ones, and grouping several transfers together is better than scattering single transfers throughout the code.
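The difference between the two transfer patterns can be sketched as follows. The data is made up, and `samples` and the other names are illustrative. On a machine without a GPU the code still runs, but no real transfer takes place.

```python
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# 100 small CPU tensors, standing in for individual samples (made-up data).
samples = [torch.full((4,), float(i)) for i in range(100)]

# Many small transfers: one host-to-device copy per sample.
moved_one_by_one = torch.stack([s.to(device) for s in samples])

# One large transfer: assemble the batch on the CPU, then copy it once.
moved_as_batch = torch.stack(samples).to(device)

# Both routes produce the same tensor; only the number of copies differs.
print(torch.equal(moved_one_by_one, moved_as_batch))
```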

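When we print a tensor, the framework first copies the data to main memory if it is not already there, which adds transfer overhead. In PyTorch, converting a GPU tensor to NumPy requires an explicit `.cpu()` call first. Either way, the work is then subject to Python's global interpreter lock (GIL), so everything waits for Python to finish.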
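A minimal sketch of an explicit conversion, assuming we want the copy to main memory to be visible in the code. The helper `to_numpy` is a hypothetical name, not part of PyTorch or d2l.

```python
import torch

def to_numpy(t):
    """Convert a tensor on any device to a NumPy array (hypothetical helper).

    detach() drops the autograd graph and cpu() copies to main memory
    when the tensor lives on a GPU. For a CPU tensor no copy is made,
    so the array shares memory with the tensor.
    """
    return t.detach().cpu().numpy()

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
X = torch.ones(2, 3, device=device)
arr = to_numpy(X)
print(type(arr), arr.shape)
```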

## Neural Networks and GPUs

In much the same way as individual tensors, a neural network model can specify a device. Here we make our now-familiar linear model and push it to a GPU.

In [15]:
net = nn.Sequential(nn.LazyLinear(1))

net = net.to(device=try_gpu())



In [17]:
# The result is computed on the same GPU as the model and the input.
net(X)

tensor([[0.9640],
        [0.9640]], device='cuda:0', grad_fn=<AddmmBackward0>)

In [18]:
net[0].weight.data.device

device(type='cuda', index=0)
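To check a whole model rather than a single layer, one option is to collect the devices of all its parameters. This is a sketch only: it uses `nn.Linear(3, 1)` instead of `nn.LazyLinear(1)` so that the parameters exist before the first forward pass, and it falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
net = nn.Sequential(nn.Linear(3, 1)).to(device)

# A consistently placed model yields exactly one device in this set.
param_devices = {p.device for p in net.parameters()}
print(param_devices)

# The input must be on that device as well; the output then stays there.
X = torch.ones(2, 3, device=device)
out = net(X)
print(out.device)
```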

In [20]:
@d2l.add_to_class(d2l.Trainer) #@save
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

@d2l.add_to_class(d2l.Trainer) #@save
def prepare_batch(self, batch):
    if self.gpus:
        batch = [a.to(self.gpus[0]) for a in batch]
    return batch

@d2l.add_to_class(d2l.Trainer) #@save
def prepare_model(self, model):
    model.trainer = self
    model.board.xlim = [0, self.max_epochs]
    if self.gpus:
        model.to(self.gpus[0])
    # Store the model whether or not a GPU is in use.
    self.model = model

### Summary 

We can specify different devices for storage and calculation. While training operations may be faster on a GPU, transferring data between devices is slow, so it is worth being cautious about it. One example is computing the loss for each minibatch on the GPU and then reporting it back to the user on the command line. Each report copies the value to main memory and has to wait on Python's global interpreter lock (GIL), which stalls the GPUs until the operation is completed.
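One way to limit this cost, sketched below under the assumption that a per-epoch average is enough for logging, is to accumulate the loss on the device and convert it to a Python number only once. The random tensors are stand-ins for real model outputs and targets.

```python
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(0)

running_loss = torch.zeros((), device=device)  # accumulator stays on the device
num_batches = 5
for _ in range(num_batches):
    pred = torch.rand(8, device=device)    # stand-in for a model output
    target = torch.rand(8, device=device)  # stand-in for the labels
    loss = ((pred - target) ** 2).mean()
    running_loss += loss.detach()          # no transfer to main memory here

# A single .item() call copies one scalar to main memory for reporting.
mean_loss = running_loss.item() / num_batches
print(f"mean loss: {mean_loss:.4f}")
```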

## Sanity check: let's make sure it really is faster

In [24]:
import time

In [38]:
dim = 10_000
large_cpu = torch.rand([dim, dim])
large_gpu = torch.rand([dim, dim], device=try_gpu())

(large_cpu, large_gpu)

(tensor([[0.0938, 0.0808, 0.3881,  ..., 0.2899, 0.9301, 0.5865],
         [0.7892, 0.6715, 0.8486,  ..., 0.9083, 0.1142, 0.6990],
         [0.7120, 0.7106, 0.1457,  ..., 0.0742, 0.6360, 0.3333],
         ...,
         [0.3416, 0.3835, 0.0769,  ..., 0.8155, 0.7879, 0.5694],
         [0.4846, 0.0238, 0.9867,  ..., 0.0703, 0.7277, 0.7920],
         [0.2098, 0.2972, 0.5527,  ..., 0.4929, 0.3520, 0.2023]]),
 tensor([[0.8084, 0.0359, 0.1209,  ..., 0.0593, 0.7455, 0.7505],
         [0.5535, 0.6959, 0.2243,  ..., 0.3516, 0.0651, 0.8051],
         [0.3115, 0.1264, 0.8795,  ..., 0.6168, 0.5811, 0.4920],
         ...,
         [0.0067, 0.6687, 0.9430,  ..., 0.2219, 0.6426, 0.0614],
         [0.2145, 0.6876, 0.1812,  ..., 0.2787, 0.3627, 0.6232],
         [0.0613, 0.2376, 0.1867,  ..., 0.9772, 0.2119, 0.0369]],
        device='cuda:0'))

In [39]:
n_mult = 1

t1 = time.time()

for i in range(n_mult):
    large_cpu = large_cpu @ large_cpu

print(f"CPU Time: {time.time() - t1:.3f}s")

CPU Time: 6.007s


In [40]:
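n_mult = 10

t3 = time.time()

# Caveat: CUDA kernels are launched asynchronously, so without a
# torch.cuda.synchronize() before reading the clock, this mostly measures
# launch overhead rather than the matrix products themselves.
for i in range(n_mult):
    large_gpu = large_gpu @ large_gpu

print(f"GPU Time: {time.time() - t3:.3f}s")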

GPU Time: 0.001s
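CUDA kernels are launched asynchronously, so the GPU figure above most likely reflects the time to queue the work rather than the time to finish it. The two cells also use different values of `n_mult`. A fairer comparison, as a sketch, waits for the GPU with `torch.cuda.synchronize()` before reading the clock and runs the same workload on both devices. The function `time_matmul` and its defaults are illustrative choices, and no timings are claimed here.

```python
import time
import torch

def time_matmul(device, dim=512, repeats=3):
    """Average seconds per (dim x dim) matrix product on `device`."""
    a = torch.rand(dim, dim, device=device)
    if device.type == 'cuda':
        torch.cuda.synchronize(device)  # make sure setup work has finished
    start = time.perf_counter()
    for _ in range(repeats):
        b = a @ a  # keep `a` fixed so the values do not blow up
    if device.type == 'cuda':
        torch.cuda.synchronize(device)  # wait for all queued kernels
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul(torch.device('cpu')):.4f}s per product")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul(torch.device('cuda:0')):.4f}s per product")
```

For a test closer to the original, `dim=10_000` can be passed on both devices. The first GPU call may also include one-time initialization cost, so a warm-up call before timing is worth considering.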
