# Part 4: Using GPU acceleration with PyTorch

In [0]:
# Execute this code block to install dependencies when running on Colab
try:
    import torch
except ImportError:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try:
    import torchbearer
except ImportError:
    !pip install torchbearer

Collecting torchbearer
  Downloading https://files.pythonhosted.org/packages/5a/62/79c45d98e22e87b44c9b354d1b050526de80ac8a4da777126b7c86c2bb3e/torchbearer-0.3.0.tar.gz (84kB)
    100% |████████████████████████████████| 92kB 3.5MB/s
Building wheels for collected packages: torchbearer
  Building wheel for torchbearer (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/6c/cb/69/466aef9cee879fb8f645bd602e34d45e754fb3dee2cb1a877a
Successfully built torchbearer
Installing collected packages: torchbearer
Successfully installed torchbearer-0.3.0


## Manual use of `.cuda()`

Now the magic of PyTorch comes in. So far, we've only been using the CPU for computation. Once we want to scale to bigger problems, that won't stay feasible for long.

PyTorch makes it really easy to use the GPU to accelerate computation. Consider the following code, which computes the element-wise product of two large matrices:

In [0]:
import torch

t1 = torch.randn(1000, 1000)
t2 = torch.randn(1000, 1000)
t3 = t1*t2
print(t3)

tensor([[-9.0426e-01, -1.7672e+00, -1.7828e-01,  ..., -7.8761e-01,
          2.2038e-02,  5.2148e-01],
        [ 1.2913e+00, -1.2362e+00,  4.7041e-01,  ...,  2.4664e-02,
         -1.1787e-01,  1.3503e-01],
        [-3.1094e-01, -1.3718e-02,  1.6595e-02,  ..., -2.9590e-01,
          1.7844e+00, -2.5597e-01],
        ...,
        [-2.1265e+00,  5.2993e-02, -6.7150e-01,  ...,  8.6739e-01,
          1.7463e+00,  1.8504e+00],
        [ 8.5413e-01, -1.2042e+00, -4.6478e-01,  ..., -2.5265e-01,
          4.4074e-02,  7.1030e-02],
        [ 1.4803e-03, -6.2860e-01,  8.5331e-02,  ..., -1.7224e-01,
         -3.5643e-01, -3.8763e-02]])


By sending all the tensors that we are using to the GPU, all the operations on them will also run on the GPU without having to change anything else. If you're running a version of PyTorch without CUDA support, or there is no GPU available, the following will raise an error. If CUDA is available, it will create the input matrices on the CPU, copy them to the GPU and perform the element-wise multiplication on the GPU itself:

In [0]:
t1 = torch.randn(1000, 1000).cuda()
t2 = torch.randn(1000, 1000).cuda()
t3 = t1*t2
print(t3)

tensor([[-2.3564e-01, -1.6689e+00, -1.6155e-01,  ...,  7.5903e-01,
          2.1111e+00,  1.8554e-01],
        [ 1.1255e-01,  6.8611e+00,  1.7231e+00,  ...,  3.4754e-01,
          1.5149e+00,  1.0885e+00],
        [-2.0764e-02,  2.0520e-01,  3.0891e-01,  ..., -7.8243e-01,
         -1.2221e-01, -4.7936e-02],
        ...,
        [-7.7671e-01,  2.4333e-02, -4.0971e-01,  ...,  5.1387e-01,
         -3.4297e-01,  3.0031e-01],
        [ 9.9761e-02,  2.5344e-03,  3.7340e+00,  ...,  8.8686e-01,
         -9.6254e-02,  7.9533e-01],
        [ 6.4484e-01, -7.2509e-01,  4.4806e-01,  ..., -3.1643e-01,
          2.4096e-02,  6.4055e-01]], device='cuda:0')


If you're running this workbook in Colab, enable GPU acceleration now (`Runtime->Runtime Type`, then select `GPU` in the hardware accelerator pull-down). You'll then need to re-run all cells up to this point.

If you were able to run the above with hardware acceleration, the print-out of the result tensor ends with `device='cuda:0'`, showing that it is a `torch.cuda.FloatTensor` stored on the first GPU device (GPU 0). If you wanted to copy the tensor back to the CPU, you would use the `.cpu()` method.
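
As a minimal sketch of the round trip described above, the following moves a small tensor to the GPU when one is available and then copies it back with `.cpu()`. The CPU fallback is an assumption added here so that the example also runs on machines without CUDA.

```python
import torch

# Assumption for this sketch: fall back to the CPU so it runs without a GPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

t = torch.randn(3, 3).to(device)
print(t.device, t.is_cuda)  # where the tensor currently lives

# .cpu() returns a copy in host memory; if the tensor is already on the CPU,
# no copy is made and the same tensor is returned.
t_cpu = t.cpu()
print(t_cpu.device, t_cpu.is_cuda)
```

Checking `.device` or `.is_cuda` like this is a quick way to confirm where a tensor lives before combining it with others.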

## Writing platform agnostic code

Most of the time you'd like to write code that is device agnostic; that is, it will run on a GPU if one is available and otherwise fall back to the CPU. The recommended way to do this is as follows:

In [0]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
t1 = torch.randn(1000, 1000).to(device)
t2 = torch.randn(1000, 1000).to(device)
t3 = t1*t2
print(t3)

tensor([[ 5.1110e-01,  3.5608e-01, -2.8273e-01,  ...,  1.0561e+00,
          8.3376e-03,  7.5762e-02],
        [ 7.8799e-02,  2.0945e-01,  3.9617e-01,  ..., -1.1587e-01,
         -2.5375e-02,  6.0923e-02],
        [ 2.4760e+00,  4.7810e-01, -4.9006e-01,  ..., -2.9310e+00,
          2.6578e-02,  8.2460e-01],
        ...,
        [ 4.4136e-01,  3.7089e-01, -2.3190e-02,  ..., -3.5786e+00,
          7.1675e-01, -5.2569e-01],
        [ 2.6383e-02, -1.0069e+00, -2.3215e-01,  ...,  4.2130e-01,
          2.9363e-01,  3.0418e-03],
        [-2.2897e+00,  2.5124e+00,  2.0408e-02,  ..., -2.6447e-01,
          2.9882e+00, -3.8330e-01]], device='cuda:0')
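
A possible variation on the cell above, offered as a sketch rather than as part of the lab: PyTorch's tensor factory functions accept a `device` argument, so the matrices can be allocated directly on the chosen device instead of being created on the CPU and then copied with `.to(device)`.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Allocate directly on the target device, skipping the CPU-to-GPU copy.
t1 = torch.randn(1000, 1000, device=device)
t2 = torch.randn(1000, 1000, device=device)
t3 = t1 * t2

# The result of an operation lives on the same device as its operands.
print(t3.device)
```

Both styles give the same result; the `device=` form simply avoids one intermediate allocation and transfer per tensor.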


## Accelerating neural net training

If you wanted to accelerate the training of a neural net using raw PyTorch, you would have to copy both the model and the training data to the GPU. Unless you were using a really small dataset like MNIST, you would typically _stream_ the batches of training data to the GPU as you used them in the training loop:

```python
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = BaselineModel(784, 784, 10).to(device)

loss_function = ...
optimiser = ...

for epoch in range(10):
    for data in trainloader:
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        optimiser.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimiser.step()
```
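
To make the pattern above concrete, here is a self-contained sketch that can be run as-is. The random data, the small two-layer network, the SGD settings and the three epochs are all hypothetical stand-ins chosen for illustration; they are not the lab's `BaselineModel`, loss or optimiser.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in data: 256 random vectors of 784 values, labels in 0..9.
features = torch.randn(256, 784)
targets = torch.randint(0, 10, (256,))
trainloader = DataLoader(TensorDataset(features, targets), batch_size=64, shuffle=True)

# Hypothetical stand-in model; .to(device) moves its parameters to the device.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
loss_function = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    running_loss = 0.0
    for inputs, labels in trainloader:
        # Stream only the current batch to the device.
        inputs, labels = inputs.to(device), labels.to(device)

        optimiser.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimiser.step()
        running_loss += loss.item()  # .item() copies the scalar loss back to Python
    epoch_losses.append(running_loss / len(trainloader))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

Note that the dataset itself stays in CPU memory; only the model and one batch at a time occupy GPU memory.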

Using Torchbearer, the training loop becomes much simpler. You just tell the `Trial` to run on the GPU and that's it:

```python
from torchbearer import Trial

model = BetterCNN()

loss_function = ...
optimiser = ...

device = "cuda:0" if torch.cuda.is_available() else "cpu"
trial = Trial(model, optimiser, loss_function, metrics=['loss', 'accuracy']).to(device)
trial.with_generators(trainloader)
trial.run(epochs=10)
```


## Multiple GPUs

Using multiple GPUs is beyond the scope of the lab, but if you have multiple CUDA devices, they can be referred to by index: `cuda:0`, `cuda:1`, `cuda:2`, etc. You have to be careful not to mix operations on different devices, and you would need to carefully orchestrate the movement of data between devices (which can slow your code down so much that using the CPU would actually be faster).
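
The following sketch illustrates the indexing and the mixing problem. It assumes nothing about your hardware: the device list is simply empty on a CPU-only machine, and the two-GPU branch only runs when at least two CUDA devices are visible.

```python
import torch

# Names of the CUDA devices PyTorch can see; empty on a CPU-only machine.
n_gpus = torch.cuda.device_count()
devices = [f"cuda:{i}" for i in range(n_gpus)]
print(devices if devices else "no CUDA devices found")

if n_gpus >= 2:
    a = torch.randn(4, 4, device="cuda:0")
    b = torch.randn(4, 4, device="cuda:1")
    try:
        a + b  # operands live on different devices
    except RuntimeError as err:
        print("mixed-device operation failed:", err)

    # Explicitly copy one operand so that both are on the same device.
    c = a + b.to(a.device)
    print(c.device)
```

Each explicit `.to(...)` between devices is a data transfer, which is why careless placement can cost more time than the GPUs save.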

## Questions

__Answer the following questions (enter the answer in the box below each one):__

__1.__ What features of GPUs allow them to perform computations faster than a typical CPU?

**Answer: GPUs have a massively parallel processing architecture consisting of thousands of smaller, simpler cores designed to handle many tasks simultaneously. On NVIDIA GPUs, CUDA (Compute Unified Device Architecture) exposes these cores as thread processors for data-intensive calculations, and the threads can exchange, synchronise and share data. This parallel stream architecture focuses on executing a large number of concurrent threads, each at a lower speed, rather than executing a single thread rapidly. A CPU, by contrast, consists of only a few cores optimized for serial processing and has much weaker parallel-processing capability.**

__2.__ What is the biggest limiting factor for training large models with current generation GPUs?

**Answer: Training large models means working with huge amounts of data, and GPU memory capacity is the biggest limiting factor: it prevents a GPU from handling terabyte-scale data on the device. Once the data is larger than the GPU memory, performance decreases significantly, because transfers to the device, limited by the bandwidth and latency of the PCIe bus, become the primary bottleneck.**