How to use Awkward Arrays in Numba's CUDA target
================================================

Awkward Array defines extensions to the Numba compiler so that Numba can understand Awkward Array types and use them to generate efficient compiled code for execution on GPUs or multicore CPUs. The programming effort required is small: adding a function decorator instructs Numba to compile for the GPU.

In [1]:
import numpy as np
import awkward as ak
import numba
from numba import cuda
import cupy as cp

Numba's entry point registration happens too late for the Awkward CUDA extension, so we need to register it manually:

In [2]:
ak.numba.register_and_check()

Note that CUDA's execution model differs from the traditional sequential model used for programming CPUs. In CUDA, the code you write is executed by many threads at once (often hundreds or thousands). You model your solution by defining a thread hierarchy of grid, blocks, and threads.

In [3]:
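# One-dimensional launch, matching the cuda.grid(1) indexing used in the kernel below:
# 4096 blocks * 256 threads = 1,048,576 threads, enough to cover 1,000,000 lists.
blockspergrid = 4096
threadsperblock = 256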
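
A one-dimensional launch needs at least as many threads as there are elements to process. The following is a minimal sketch of how the number of blocks could be derived from the problem size instead of being hard-coded; the helper name `blocks_needed` and the block size of 256 are assumptions made for illustration, not part of Awkward Array or Numba.

```python
def blocks_needed(n_items, threads_per_block):
    """Smallest number of blocks such that blocks * threads_per_block >= n_items."""
    # Integer ceiling division: round up so the last, partially used block is included.
    return (n_items + threads_per_block - 1) // threads_per_block


n_items = 1_000_000          # number of elements, one thread per element
threads_per_block = 256      # illustrative choice (assumption)
blocks_per_grid = blocks_needed(n_items, threads_per_block)
total_threads = blocks_per_grid * threads_per_block

print(blocks_per_grid, total_threads)
```

Because the total is rounded up, a few threads in the last block have no element to work on, which is why a kernel usually guards its body with a bounds check.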

Writing CUDA kernels that understand Awkward Array types 
--------------------------------------------------------

The `@cuda.jit` decorator creates a CUDA kernel, a GPU function that is meant to be called from CPU code. For the kernel to understand Awkward Array types, the decorator's `extensions` argument must include the `ak.numba.cuda` object, which prepares `ak.Array` arguments to be passed into Numba's default argument marshalling logic.

In [4]:
@cuda.jit(extensions=[ak.numba.cuda])
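def average(out, array):
    # absolute index of this thread along the grid's first (x) dimension
    tid = cuda.grid(1)
    # threads past the end of the array have nothing to do
    if tid < len(array):
        out[tid] = 0
        # accumulate the mean of the tid-th list, one element at a time
        for i in range(len(array[tid])):
            out[tid] = out[tid] + array[tid][i]/len(array[tid])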

Kernels cannot return a value. Results must be written to an `out` array passed to the function (to compute a scalar, you would typically pass a one-element array).

Kernels declare their thread hierarchy when called, i.e. the number of thread blocks and the number of threads per block. While a kernel is compiled once, it can be called multiple times with different block or grid sizes. In the kernel above, `tid` is the absolute position of the current thread in the entire grid of blocks.
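
To make the thread indexing concrete, here is a pure-Python sketch that emulates a one-dimensional launch sequentially on the CPU. It is not how Numba executes a kernel; the names `average_thread` and `launch` are hypothetical, and plain nested lists stand in for an Awkward Array. The index formula is the one `cuda.grid(1)` evaluates for a one-dimensional grid.

```python
def average_thread(out, array, block_idx, thread_idx, block_dim):
    # Absolute position of this thread: offset within the block plus
    # the number of threads in all preceding blocks.
    tid = thread_idx + block_idx * block_dim
    if tid < len(array):  # surplus threads past the end do nothing
        out[tid] = 0.0
        for i in range(len(array[tid])):
            out[tid] = out[tid] + array[tid][i] / len(array[tid])


def launch(kernel, blocks_per_grid, threads_per_block, *args):
    # Sequential stand-in for a launch; a GPU would run these threads concurrently.
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            kernel(*args, block_idx, thread_idx, threads_per_block)


array = [[1.0, 3.0], [], [2.0, 4.0, 6.0, 8.0]]
out = [None] * len(array)
launch(average_thread, 2, 2, out, array)  # 2 blocks * 2 threads = 4 threads for 3 lists

print(out)  # [2.0, 0.0, 5.0]
```

Note that in this emulation an empty list keeps its initial value of `0.0`, since the inner loop never runs for it.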

Memory management
-----------------

Allocating and managing memory is the user's responsibility, for example, transferring device memory back to the host when a kernel finishes. The `ak.numba.cuda` extension only accepts an `ak.Array` with the `"cuda"` backend, so the array data are already on the device and do not need to be copied.

In [5]:
counts = ak.Array(cp.random.poisson(1.5, 1000000))
content = ak.Array(cp.random.normal(0, 45.0, int(ak.sum(counts))))
array = ak.unflatten(content, counts)
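
The cell above builds a jagged array from a flat `content` buffer and a list of per-entry `counts`, which is why `content` is created with exactly `sum(counts)` values. The idea can be sketched on plain Python lists as follows; `unflatten_lists` is a hypothetical helper written for illustration and is not Awkward Array's implementation.

```python
def unflatten_lists(content, counts):
    """Split a flat list into consecutive sublists whose lengths are given by counts."""
    if sum(counts) != len(content):
        raise ValueError("counts must sum to len(content)")
    nested, start = [], 0
    for count in counts:
        # Each sublist takes the next `count` values from the flat buffer.
        nested.append(content[start:start + count])
        start += count
    return nested


counts = [2, 0, 3]
content = [1.1, 2.2, 3.3, 4.4, 5.5]
nested = unflatten_lists(content, counts)

print(nested)  # [[1.1, 2.2], [], [3.3, 4.4, 5.5]]
```

A count of zero produces an empty list, so data drawn from a Poisson distribution will typically contain some empty entries.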

For all but the simplest algorithms, it is important that you carefully consider how to use and access memory in order to minimize bandwidth requirements and contention.

Awkward Array can operate on CUDA-device arrays through the `cupy` library. Let's allocate the result array with `cp.empty`:

In [6]:
result = cp.empty(len(array), dtype=np.float32)

Kernel invocation
-----------------

Numba can use the CUDA Array Interface (`__cuda_array_interface__`) to obtain a zero-copy reference to the CuPy array. We can launch the Numba kernel on our source `array` and target `result` as follows:

In [7]:
average[blockspergrid, threadsperblock](result, array)

The result matches our expectations:

In [8]:
cpu_array = ak.to_backend(array, "cpu")

In [9]:
check_result = ak.mean(cpu_array, axis=-1)

In [10]:
ak.operations.isclose(check_result[0:5], ak.Array(result[0:5], backend="cpu"))
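
The comparison uses a tolerance rather than exact equality. One reason is that `result` was allocated as 32-bit floats, while the CPU reference is presumably computed in double precision. The sketch below illustrates the effect with the standard library only; the helper `to_float32` and the tolerance value are assumptions chosen for illustration.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


reference = 0.1                   # double-precision value
stored = to_float32(reference)    # what a float32 output buffer would hold

exact_match = stored == reference                             # False
close_match = math.isclose(stored, reference, rel_tol=1e-5)   # True

print(exact_match, close_match)
```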