# Extending the CUDA target with the High-Level API

Quick preamble: we'll disable low-occupancy and implicit-copy warnings for this notebook, since they'll just generate irrelevant noise:

In [1]:
from numba import config
config.CUDA_LOW_OCCUPANCY_WARNINGS = False
config.CUDA_WARN_ON_IMPLICIT_COPY = False

## What is this?

Using the [High-level Extension API](https://numba.readthedocs.io/en/latest/extending/high-level.html) is the most straightforward way to extend Numba. Compared to the Low-level API:

- Extension code can often be written in pure Python, with little reference to:
  - Numba's type inference mechanism
  - LLVM IR and IR builders
- Extension code is much more compact - an extension definition can fit inside a single function, instead of needing separate typing and lowering functions
- There are some limitations:
  - Defining new types, data models, and type inference rules still requires the low-level API

This notebook demonstrates extending the CUDA target using the High-level API through a few examples.

## Example 1: Overloading Functions

Let's implement an overload of the `len()` function, for grid groups. The (moderately useless / absurd) aim is that we can write, for example:

```python
grid = cuda.cg.this_grid()
print("Grid size is", len(grid))
```

in a kernel, and get the size of the grid printed out.

To use the high-level extension API with CUDA, we need the CUDA target and the `overload` decorator:

In [2]:
from numba import cuda
from numba.extending import overload

Now we'll implement our overload:

In [3]:
@overload(len, target='cuda')
def grid_group_len(seq):
    if isinstance(seq, cuda.types.GridGroup):
        def len_impl(seq):
            n = cuda.gridsize(1)
            return n
        return len_impl

Notes on the implementation:

- The `@overload` decorator defines an overload.
  - We need to specify what function is being overloaded - here it is `len()`
  - The target here is `"cuda"`, but if we set it to `"generic"` then this overload can be used with both the CPU and CUDA targets.
  - The default target is the CPU (for annoying historical reasons), so if we forget the `target` kwarg, our overload won't work on CUDA!
- The decorated function is called with the types of the arguments, rather than their values, as its arguments
  - In this case the function accepts one argument, `seq`.
  - Typing is implemented by inspecting the types of the arguments.
  - If we can successfully type this function with these arguments, then an implementation should be returned.
  - If the typing does not succeed, we return `None` so that Numba knows it should try another overload of `len()`.
- The returned implementation (`len_impl` in this case) should be a Python function:
  - This function implements the logic of our overloaded function, and is compiled by Numba.

Now let's use our overload in a kernel:

In [4]:
@cuda.jit
def f():
    if cuda.grid(1) == 0:
        print("Grid size is", len(cuda.cg.this_grid()))

f[1, 1]()
f[1, 2]()
f[1, 3]()
cuda.synchronize()

Grid size is 1
Grid size is 2
Grid size is 3


... Success!

## Example 2: Overloading Methods

Overloading methods is similar to overloading functions, except that:

- The `@overload_method()` decorator is used,
- its first argument is the type for which the method is implemented, and
- the second argument is the name of the method.

A couple more imports we need:

In [5]:
from numba.extending import overload_method
from numba import types

The CUDA target presently doesn't support the `sum()` method of NumPy arrays - we'll implement a cut-down version of it to demonstrate method overloading in the CUDA target:

In [6]:
@overload_method(types.Array, 'sum', target='cuda')
def array_sum(arr):
    if arr.ndim != 1:
        # Only implement 1D for this quick example
        return None

    def sum_impl(arr):
        res = 0
        for i in range(len(arr)):
            res += arr[i]
        return res
    return sum_impl

The first argument to the overload method (`arr`) is the type of the receiver.

Now we can use the method in a kernel:

In [7]:
@cuda.jit
def f(arr):
    print("Sum is", arr.sum())


import numpy as np

f[1, 1](np.arange(5))
f[1, 1](np.arange(10))
cuda.synchronize()

Sum is 10
Sum is 45


## Example 3: Overloading attributes

For overloading attributes, we have the `@overload_attribute` decorator - as with `@overload_method`, the decorator takes a type and an attribute name.

In [8]:
from numba.extending import overload_attribute

Let's add an `.nbytes` attribute to arrays in CUDA kernels:

In [9]:
@overload_attribute(types.Array, 'nbytes', target='cuda')
def array_nbytes(arr):
    def nbytes_impl(arr):
        return arr.size * arr.itemsize
    return nbytes_impl

It is immediately available for use:

In [10]:
@cuda.jit
def f(arr):
    print("Nbytes is", arr.nbytes)


f[1, 1](np.arange(5))
f[1, 1](np.arange(10))
cuda.synchronize()

Nbytes is 40
Nbytes is 80
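
These values can be cross-checked on the host with NumPy alone. This is a small sketch that assumes the arrays hold 64-bit integers, which is what the printed results of 40 and 80 bytes imply. `expected_nbytes` is a helper name invented here.

```python
import numpy as np


def expected_nbytes(arr):
    # Same formula as the nbytes_impl overload: element count times element size
    return arr.size * arr.itemsize


for n in (5, 10):
    # The dtype is fixed to int64 so the result does not depend on the
    # platform's default integer type
    arr = np.arange(n, dtype=np.int64)
    print(n, expected_nbytes(arr), arr.nbytes)
```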


## Example 4: Using intrinsics for lower-level control

The `@overload` family of decorators provides convenient extensions with pure Python implementations, but what if an extension needs constructs that cannot be expressed in pure Python? In that case, the `@intrinsic` decorator can be used to build LLVM IR directly. Intrinsics can be called from both `@cuda.jit`-decorated functions and overloads.

To define intrinsics for the CUDA target, we import the CUDA-specific `intrinsic` decorator:

In [11]:
from numba.cuda.extending import intrinsic

We'll write an intrinsic to implement the [`clock64()` time function](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#time-function), which is not presently implemented in Numba. The prototype of this function in CUDA C is:

```C
long long int clock64();
```

NVCC translates this to a read of a PTX special register, `%clock64`. For example, generated code may look like:

```asm
mov.u64 	%rd1, %clock64;
```

There is no way to express this only in Python, so we need an intrinsic here. Its definition could look like:

In [12]:
from llvmlite import ir

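@intrinsic
def cuda_clock64(typingctx):
    # No arguments; the intrinsic returns an unsigned 64-bit integer
    sig = types.uint64()

    def codegen(context, builder, sig, args):
        # Inline PTX that copies the %clock64 special register into a
        # 64-bit output register (the "=l" constraint)
        function_type = ir.FunctionType(ir.IntType(64), [])
        instruction = "mov.u64 $0, %clock64;"
        clock64 = ir.InlineAsm(function_type, instruction, "=l",
                               side_effect=True)
        return builder.call(clock64, [])

    return sig, codegen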

Some remarks on the implementation:

- The decorator itself requires no arguments.
- The arguments to the decorated function are a typing context, and the types of the arguments to the function
  - Here we don't need to use the typing context, but it can occasionally be useful
      - e.g. for checking if two types will unify with `typingctx.unify_types(arg1, arg2)`
  - We also don't have any arguments to this function, but they would appear after `typingctx` if we did.
- The function should return a tuple of the signature and a function to generate code
  - Or `None`, if the intrinsic couldn't be typed for these arguments
- The code generation function is just like a normal lowering function:
  - It receives `context, builder, sig, args`
  - It should build appropriate LLVM IR and return the instruction holding the result (if there is one).
- The body of the code generation function is just a standard pattern for emitting inline PTX.

Let's try this in a CUDA kernel:

In [13]:
@cuda.jit('void()')
def f():
    print("1. Clock value is", cuda_clock64())
    print("2. Clock value is", cuda_clock64())


f[1, 1]()
cuda.synchronize()

1. Clock value is 6670192
2. Clock value is 6723668


It is normal for the clock values to differ from run to run, but the second should be greater than the first.
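
To see what the inline PTX pattern produces without a GPU, the same llvmlite calls can be used to build a standalone function and print its textual IR. This is only a sketch: the module and function names (`clock64_demo`, `read_clock64`) are invented here, and the IR is printed but not compiled, so no NVPTX backend is required.

```python
from llvmlite import ir

# Hypothetical module and function names, used only for this illustration
module = ir.Module(name="clock64_demo")
function_type = ir.FunctionType(ir.IntType(64), [])
function = ir.Function(module, function_type, name="read_clock64")
builder = ir.IRBuilder(function.append_basic_block("entry"))

# Same pattern as in the intrinsic: an inline-asm "callee" with one
# 64-bit output register ("=l") and no inputs
clock64 = ir.InlineAsm(function_type, "mov.u64 $0, %clock64;", "=l",
                       side_effect=True)
result = builder.call(clock64, [])
builder.ret(result)

llvm_ir = str(module)
print(llvm_ir)
```

The printed IR contains a `call` whose callee is an `asm sideeffect` expression holding the PTX string and the constraint.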

## Note: mutable structures

The high-level extension API also supports [implementing mutable structures](https://numba.readthedocs.io/en/latest/extending/high-level.html#implementing-mutable-structures) with `StructRef`, but this is not yet supported on the CUDA target. This is because mutable structures are heap-allocated and passed by reference, and allocation within a kernel is not yet supported on the CUDA target.

Support for mutable structures / `StructRef` in CUDA is planned for a future release.

## Further info:

- [High-level extension API documentation](https://numba.readthedocs.io/en/latest/extending/high-level.html)
- [An example using the High-level API and `@overload`](https://numba.readthedocs.io/en/latest/extending/overloading-guide.html) - this example focuses on a use case for the CPU target that presently doesn't map well to CUDA, but the ideas explained are generally applicable.
- [Intrinsics in the CUDA target](https://github.com/numba/numba/blob/main/numba/cuda/intrinsics.py) - some implementations in Numba use the high-level API - see these as an example of implementing intrinsics and overloads. These include `cuda.grid()`, `cuda.gridsize()`, `syncthreads()`, etc.