# GPUs

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukeconibear/swd6_hpp/blob/main/docs/07_GPUs.ipynb)

In [58]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    pass

GPUs (graphics processing units) are optimised for numerical operations, while CPUs (central processing units) perform general computation.

GPU hardware is designed for data parallelism, where high throughput is achieved when the GPU applies the same operation to many different elements at once.

You could use other types of accelerators too, though we're not going to cover those here.

## [Numba for CUDA GPUs](http://numba.pydata.org/numba-doc/latest/cuda/index.html)

Earlier we covered how Numba works on single CPUs with [`@njit`](https://numba.readthedocs.io/en/stable/glossary.html#term-nopython-mode) and multiple CPUs with `parallel=True`.

As a recap:

In [33]:
import numpy as np
from numba import njit

In [49]:
x = np.arange(1.e7, dtype=np.int64)

So, for a single CPU:

In [50]:
@njit
def my_serial_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

In [51]:
%%timeit
my_serial_function_for_cpu(x)

145 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


And, for multiple CPUs:

In [52]:
@njit(parallel=True)
def my_parallel_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

In [53]:
%%timeit
my_parallel_function_for_cpu(x)

46.4 ms ± 961 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


*Note, here we used `njit` as this automates the parallelisation process.*

*This is in contrast to `vectorize` where manual effort is required for parallelisation.*

```{note} 
If you're in Colab or have a local CUDA GPU, you can follow along with this section (uncomment the code).

For those in Colab, ensure the session is using a GPU by going to: Runtime > Change runtime type > Hardware accelerator = GPU.
```

### `vectorize` for GPUs

Numba also works on [CUDA](https://developer.nvidia.com/how-to-cuda-python) GPUs using [`@vectorize`](https://numba.pydata.org/numba-doc/latest/user/vectorize.html) or [`@cuda.jit`](https://numba.readthedocs.io/en/stable/cuda/kernels.html).

GPUs are suitable for bigger data sizes (> 1 MB) and high compute intensities.

Using a GPU adds overhead from moving data to and from the device ([memory management](https://numba.pydata.org/numba-doc/dev/cuda/memory.html)).
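As a rough check of whether an array clears that size threshold, its memory footprint is the number of elements multiplied by the bytes per element. The sketch below is plain Python. The helper name `array_size_mb` is made up for illustration, and it assumes 1 MB means 10^6 bytes.

```python
def array_size_mb(n_elements, bytes_per_element):
    """Return the size of an array in megabytes (assuming 1 MB = 1e6 bytes)."""
    return n_elements * bytes_per_element / 1e6


# An array of 1e7 float32 values (4 bytes each), as used in this section.
size_mb = array_size_mb(10_000_000, 4)
print(f"{size_mb} MB")  # prints "40.0 MB"
```

At 40 MB, such an array is well above the 1 MB guideline.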

Similar to our examples in the compiler lesson, we need to specify the types and target in the signature (i.e., the decorator arguments).

Here, the types are specified slightly differently, i.e., as output types(input types).

*Note, not all NumPy code will work on the GPU ([supported](https://numba.pydata.org/numba-doc/latest/reference/numpysupported.html)). In the following example, we will need to use the `math` library instead.*

For example:

In [101]:
import math
from numba import vectorize, float32

In [66]:
x = np.arange(1.e7, dtype=np.float32)

In [65]:
@vectorize(['float32(float32)'], target='cuda')
def my_serial_function_for_gpu(x):
    return math.cos(x) ** 2 + math.sin(x) ** 2

In [81]:
# %%timeit
# my_serial_function_for_gpu(x)

Numba also supports generalized ufuncs (covered in the compiler lesson) on the GPU using [`guvectorize`](http://numba.pydata.org/numba-doc/latest/cuda/ufunc.html#generalized-cuda-ufuncs).

### Custom CUDA kernels

Kernel functions are GPU functions called from CPU code.

Kernels cannot explicitly return a value. Instead, all result data must be written to an array passed to the function (e.g., called `out`). This array can then be transferred back.

Kernels work over a grid of threads. This grid needs to be defined in terms of the number of blocks in the grid and the number of threads per block. Each thread uses its index in this grid to decide which element of the `out` array to write. The index can be found using `cuda.grid(1)` for a 1D grid.
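To make the indexing concrete, the sketch below emulates a 1D launch on the CPU in plain Python. It assumes that, for a 1D grid, the global index is the block index times the block size plus the thread index within the block, which is what `cuda.grid(1)` computes. The names `emulate_launch` and `add_one_kernel` are hypothetical, and nothing here runs on a GPU.

```python
def add_one_kernel(index, x, out):
    """Kernel body: writes its result into `out` rather than returning it."""
    if index < len(out):  # guard threads that fall beyond the array
        out[index] = x[index] + 1


def emulate_launch(kernel, blocks_per_grid, threads_per_block, *args):
    """Run the kernel body once per (block, thread) pair, serially on the CPU."""
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            # Equivalent of cuda.grid(1) for a 1D grid.
            index = block_idx * threads_per_block + thread_idx
            kernel(index, *args)


x = list(range(10))
out = [0] * len(x)
emulate_launch(add_one_kernel, 3, 4, x, out)  # 3 blocks x 4 threads = 12 threads
print(out)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

On a real GPU the threads run concurrently rather than in a loop, but each one still handles the element at its own index.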

CUDA kernels are compiled using the `numba.cuda.jit` decorator.

*Note, `numba.cuda.jit` is different to `numba.jit`, which is for CPUs.*

In [82]:
# from numba import cuda

In [83]:
# print(cuda.gpus)

This should return a message similar to `<Managed Device 0>`.

You can also run the bash command `nvidia-smi` within an IPython cell:

In [84]:
# !nvidia-smi

This returns something like the table below. This shows we have access to an [NVIDIA Tesla T4 GPU](https://www.nvidia.com/en-gb/data-center/tesla-t4/).

```bash
Tue Feb 22 13:59:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P0    30W /  70W |    144MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

So, a simple example that adds two arrays together, element by element:

In [85]:
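# @cuda.jit
# def add_kernel(x, y, out):
#     index = cuda.grid(1)      # this thread's position in the 1D grid
#     if index < out.size:      # skip threads that fall beyond the array
#         out[index] = x[index] + y[index]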

Let's define some input variables:

In [86]:
# n = 4096
# x = np.arange(n).astype(np.int32) # [0...4095] on the host
# y = np.ones_like(x)               # [1...1] on the host

Now, let's move these input variables from the host (CPU) to the device (GPU) for the work:

In [87]:
# x_on_device = cuda.to_device(x)
# y_on_device = cuda.to_device(y)
# out_on_device = cuda.device_array_like(x_on_device)

Now, we [choose the block size](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html#choosing-the-block-size), by defining how many blocks are in the grid and how many threads are in each of those blocks.

Multiplied together, these two numbers give the total number of threads in the grid (for our 1D example).

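Some rules of thumb are:
- Threads per block should be a multiple of 32 (the warp size), with values between 128 and 512 being typical.
- Blocks per grid should be large enough to keep the whole GPU busy and to cover every element of the data.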
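When the array length is not an exact multiple of the block size, a common approach is to round the number of blocks up so that every element gets a thread. A minimal sketch in plain Python, where the helper name `blocks_needed` is an assumption for illustration:

```python
def blocks_needed(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block  # ceiling division


print(blocks_needed(4096, 128))  # 32
print(blocks_needed(4097, 128))  # 33
```

With rounding up, the grid can contain more threads than elements, so the kernel then needs a bounds check (for example, `if index < out.size:`) before it writes.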

In [88]:
# blocks_per_grid = 32
# threads_per_block = 128

Now, we can call the kernel function.

First, we give the launch configuration (blocks per grid and threads per block) in square brackets after the kernel name.

Then, we pass the input/output variables as arguments to the function.

In [89]:
# add_kernel[blocks_per_grid, threads_per_block](x_on_device, y_on_device, out_on_device)

Kernel launches are asynchronous and kernels don't return a value, so we wait for the device (GPU) to finish and then copy the output array back to the host (CPU) to get the result.

In [None]:
# cuda.synchronize()
# print(out_on_device.copy_to_host())
# # Should be [   1    2    3 ... 4094 4095 4096]

For more information on CUDA, see the training courses:

- [HPC5: Introduction to GPU programming with CUDA](https://arc.leeds.ac.uk/training/courses/hpc5/)
- NVIDIA workshop on [Fundamentals of Accelerated Computing with CUDA Python](https://www.nvidia.com/en-us/training/instructor-led-workshops/fundamentals-of-accelerated-computing-with-cuda-python/)
    - Detailed look at [custom CUDA kernels](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) and [GPU memory management](https://numba.pydata.org/numba-doc/dev/cuda/memory.html).

## [RAPIDS](https://developer.nvidia.com/rapids)

RAPIDS is a range of accelerated data science libraries from NVIDIA.

There is a wide variety of tools matching up to familiar libraries:

- Arrays and matrices
    - [CuPy](https://cupy.dev/) for NumPy and SciPy
- Tabular data
    - [cuDF](https://docs.rapids.ai/api/cudf/stable/) for pandas
- Machine learning
    - [cuML](https://docs.rapids.ai/api/cuml/stable/) for scikit-learn
    - [XGBoost](https://rapids.ai/xgboost.html) on GPUs
- Graphs and networks
    - [cuGraph](https://docs.rapids.ai/api/cugraph/stable/) for [NetworkX](https://networkx.org/)
- Multiple GPUs
    - [Dask with CUDA](https://rapids.ai/dask.html), cuDF, cuML, and others.
    - [Dask-MPI with GPUs](http://mpi.dask.org/en/latest/gpu.html)

### [CuPy](https://cupy.dev/)

**NumPy for the CPU**

In [None]:
import numpy as np

In [91]:
x_cpu = np.random.rand(1_000, 1_000)
y_cpu = np.random.rand(1_000, 1_000)
z_cpu = np.dot(x_cpu, y_cpu)

**CuPy for the GPU**

In [97]:
# import cupy as cp

In [98]:
# x_gpu = cp.random.rand(1_000, 1_000)
# y_gpu = cp.random.rand(1_000, 1_000)
# z_gpu = cp.dot(x_gpu, y_gpu)

You can move arrays between the CPU and GPU as follows:

In [100]:
# z_cpu = cp.asnumpy(z_gpu)  # from gpu to cpu

In [99]:
# z_gpu = cp.asarray(z_cpu)  # from cpu to gpu

For more information on RAPIDS, see the training courses:

- NVIDIA workshop on [Fundamentals of Accelerated Data Science (RAPIDS)](https://www.nvidia.com/en-us/training/instructor-led-workshops/fundamentals-of-accelerated-data-science/).

## Diagnostics

Similar to the Dask Dashboard, NVIDIA has a GPU Dashboard called [`NVDashboard`](https://github.com/rapidsai/jupyterlab-nvdashboard).

These real-time diagnostics are provided via a Bokeh server and a Jupyter Lab extension.  

They are a great way to monitor your GPU utilisation, resources, throughput, and more.

More information is [here](https://developer.nvidia.com/blog/gpu-dashboards-in-jupyter-lab/).

![SegmentLocal](images/NVIDIA_GPUDashboard.gif "segment")

*[Image source](https://developer.nvidia.com/blog/gpu-dashboards-in-jupyter-lab/)*

## [JAX](https://jax.readthedocs.io/en/latest/index.html)

...

## Exercise

...

```{admonition} Key Points

- Use Numba to write custom code for CUDA GPUs.
- Use RAPIDS libraries to move your data science code to GPUs.

```

## Further information

### Good practices

- Test out ideas on CPUs first, before moving to expensive GPUs.
- Consider whether the calculation is worth the additional overhead of sending data to and from the GPU.
- Minimise data transfers between the host (CPU) and the device (GPU).
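
A back-of-the-envelope model can help when judging whether the transfer overhead is worth it. The sketch below compares an estimated GPU time (transfer there and back, plus compute) against a CPU time. It assumes the output is the same size as the input. All numbers in the usage lines are hypothetical placeholders rather than measurements, and `gpu_worthwhile` is a made-up helper name.

```python
def gpu_worthwhile(n_bytes, transfer_bytes_per_s, gpu_compute_s, cpu_compute_s):
    """Return True if transfer (both ways) plus GPU compute beats the CPU time."""
    transfer_s = 2 * n_bytes / transfer_bytes_per_s  # host -> device -> host
    return transfer_s + gpu_compute_s < cpu_compute_s


# Hypothetical: 40 MB array, 10 GB/s link, 1 ms on the GPU, 46 ms on the CPU.
print(gpu_worthwhile(40e6, 10e9, 0.001, 0.046))  # True
# Same transfer, but the CPU only needs 5 ms, so the copy dominates.
print(gpu_worthwhile(40e6, 10e9, 0.001, 0.005))  # False
```

Timing the real code on your own hardware is the only reliable check. The model is just a way to see when the copy is likely to dominate.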

### Other options

- [pycuda](https://documen.tician.de/pycuda/)
    - An alternative to Numba for accessing NVIDIA's CUDA GPUs.
- Many libraries can use GPUs automatically if they can detect one e.g., [`TensorFlow`](https://www.tensorflow.org/install/gpu) and [`PyTorch`](https://pytorch.org/docs/stable/notes/cuda.html).

### Resources

- [CuPy - Sean Farley](https://www.youtube.com/watch?v=_AKDqw6li58), PyBay 2019.  
- [cuDF - Mark Harris](https://www.youtube.com/watch?v=lV7rtDW94do), PyCon AU 2019.  