# CUDA Computing With numba

## Overview

In this section we introduce numba for CUDA computing, using a few very simple examples.

## CUDA computing with numba

### Checking the computing environment

Assuming that you have numba installed on your machine, open a terminal and type:

```
numba -s
```

This should output various information about your computing environment, including a section with CUDA-related information:

```
__CUDA Information__
CUDA Device Initialized                       : True
CUDA Driver Version                           : 11.7
CUDA Runtime Version                          : 11.7
CUDA NVIDIA Bindings Available                : False
CUDA NVIDIA Bindings In Use                   : False
CUDA Minor Version Compatibility Available    : False
CUDA Minor Version Compatibility Needed       : False
CUDA Minor Version Compatibility In Use       : False
CUDA Detect Output:
Found 1 CUDA devices
id 0    b'NVIDIA GeForce RTX 3060 Laptop GPU'                              [SUPPORTED]
                      Compute Capability: 8.6
                           PCI Device ID: 0
                              PCI Bus ID: 1
                                    UUID: GPU-9fb8755a-a9d8-27aa-b653-1ea2536e5efe
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
	1/1 devices are supported

CUDA Libraries Test Output:
Finding driver from candidates: libcuda.so, libcuda.so.1, /usr/lib/libcuda.so, /usr/lib/libcuda.so.1, /usr/lib64/libcuda.so, /usr/lib64/libcuda.so.1...
Using loader <class 'ctypes.CDLL'>
	trying to load driver...	ok, loaded from libcuda.so
Finding nvvm from System
	named  libnvvm.so.4.0.0
	trying to open library...	ok
Finding cudart from System
	named  libcudart.so.11.7.99
	trying to open library...	ok
Finding cudadevrt from System
	named  libcudadevrt.a
Finding libdevice from System
	trying to open library...	ok
```
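
The same check can also be made from within Python. The following is a minimal sketch, assuming that `cuda.is_available()` and `cuda.gpus` behave as in recent numba releases; the helper name `cuda_status` is hypothetical and not part of numba.

```python
def cuda_status():
    """Return a short, human-readable description of CUDA availability."""
    try:
        from numba import cuda
    except ImportError:
        return "numba is not installed"

    # is_available() returns False when no usable driver or device is found
    if not cuda.is_available():
        return "numba is installed, but no usable CUDA driver or device was found"

    # cuda.gpus lists the devices that numba can see
    return f"CUDA is available with {len(cuda.gpus)} device(s)"


print(cuda_status())
```

If this reports that CUDA is not available, the kernels in the rest of this section will not run on that machine.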

### Basic example

In general, when working with CUDA via numba, we use the ```@cuda.jit``` decorator to instruct
numba to generate code for the GPU. However, it is the programmer's responsibility to tell numba how to distribute the computation.
The work is divided among threads, threads are grouped into blocks, and the blocks together form the grid. Let's see an example below.

```python
from numba import cuda
import numpy as np


@cuda.jit
def double(my_array):
    # cuda.grid(1) returns this thread's absolute index in a 1D grid
    position = cuda.grid(1)
    my_array[position] *= 2


# specify the number of blocks and
# the number of threads per block
blocks_per_grid = 50
threads_per_block = 20

my_array = np.ones(1000)
double[blocks_per_grid, threads_per_block](my_array)
assert (my_array == 2).all()
```



The code above launches 50 × 20 = 1000 GPU threads, exactly one per array element. Threads in the same block can also exchange data quickly through shared memory.
Nevertheless, the number of elements cannot always be divided evenly among the threads. Try to execute the code below and see what happens.

```python
blocks_per_grid = 17
threads_per_block = 62

my_array = np.ones(1000)
double[blocks_per_grid, threads_per_block](my_array)
assert (my_array == 2).all()
```

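The launch above starts 17 × 62 = 1054 threads for only 1000 elements, so 54 threads index past the end of the array.
Such out-of-bounds accesses are undefined behaviour: the code may crash, or it may silently corrupt memory. One way to avoid this is shown
in the code below.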

```python
@cuda.jit
def double_array(array):

    # get the thread index
    tidx = cuda.grid(1)
    # threads at or beyond the end of the array do nothing
    if tidx >= array.shape[0]:
        return
    array[tidx] *= 2


blocks_per_grid = 17
threads_per_block = 62

array = np.ones(1000)
double_array[blocks_per_grid, threads_per_block](array)
assert (array == 2).all()
```
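
Rather than picking the number of blocks by hand, it is common to derive it from the array length with a ceiling division, so that the grid always covers every element and a bounds check in the kernel idles the surplus threads. The following is a minimal pure-Python sketch of that arithmetic; the helper name `blocks_needed` is hypothetical and not part of numba.

```python
def blocks_needed(n_elements, threads_per_block):
    """Smallest number of blocks whose threads cover n_elements (ceiling division)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


n_elements = 1000
threads_per_block = 62
blocks_per_grid = blocks_needed(n_elements, threads_per_block)

total_threads = blocks_per_grid * threads_per_block
# threads that fall outside the array and must be skipped by the bounds check
surplus_threads = total_threads - n_elements

print(blocks_per_grid, total_threads, surplus_threads)  # 17 1054 54
```

The computed `blocks_per_grid` can then be used as the first launch parameter, as in `double_array[blocks_per_grid, threads_per_block](array)`.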



### Working with threads and blocks

We have already discussed the notion of grids and blocks in the general CUDA presentation. Let's see how we can utilise this
with numba. Recall that each thread has access to the index of the block it is running in, ```cuda.blockIdx```,
as well as the block dimension, ```cuda.blockDim```. It also has access to its own index within the block, ```cuda.threadIdx```. Putting all these together,
we are able to calculate the global thread index. Let's see an example.

```python
@cuda.jit
def multiply_matrix_elements(matrix, factor: int):

    # get the x-position (column index)
    x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x

    # get the y-position (row index)
    y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

    # make check so that we do not
    # get out of bounds: the matrix is indexed as [y, x],
    # so x is bounded by shape[1] and y by shape[0]
    if x >= matrix.shape[1]:
        return
    if y >= matrix.shape[0]:
        return

    matrix[y, x] *= factor
```

```python
threads_per_block_2d = 16, 16
blocks_per_grid_2d = 63, 63

mat = np.ones((1000, 1000))
multiply_matrix_elements[blocks_per_grid_2d, threads_per_block_2d](mat, 2)
print((mat == 2).all())
```

```
True
```
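
To see how the block and thread indices combine in two dimensions, the index arithmetic can be replayed on the CPU. The following is a pure-Python sketch, not a numba API: the function `emulate_2d_launch` and the small 5 × 7 shape are hypothetical, chosen only for illustration. Because the element is addressed as `[y, x]`, `x` runs over the columns and `y` over the rows, which matters as soon as the matrix is not square.

```python
def emulate_2d_launch(shape, blocks_per_grid, threads_per_block):
    """Count, on the CPU, how often each matrix element is visited by a 2D launch."""
    rows, cols = shape
    visits = [[0] * cols for _ in range(rows)]
    for block_y in range(blocks_per_grid[1]):
        for block_x in range(blocks_per_grid[0]):
            for thread_y in range(threads_per_block[1]):
                for thread_x in range(threads_per_block[0]):
                    # same formula as blockIdx * blockDim + threadIdx
                    x = block_x * threads_per_block[0] + thread_x
                    y = block_y * threads_per_block[1] + thread_y
                    # bounds check: x is a column index, y is a row index
                    if x >= cols or y >= rows:
                        continue
                    visits[y][x] += 1
    return visits


shape = (5, 7)                 # 5 rows, 7 columns
threads_per_block_2d = (4, 4)
blocks_per_grid_2d = (2, 2)    # ceil(7 / 4) blocks along x, ceil(5 / 4) along y

visits = emulate_2d_launch(shape, blocks_per_grid_2d, threads_per_block_2d)
print(all(count == 1 for row in visits for count in row))  # True
```

Every element is visited exactly once, and the threads that fall outside the matrix are skipped by the bounds check. The same reasoning explains the launch above: 63 blocks of 16 threads give 1008 threads per dimension, enough to cover 1000 elements.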


## Summary

This section outlined how to use numba to run computations on GPUs. In particular, we saw that numba supports CUDA via the ```@cuda.jit``` decorator.
Numba greatly simplifies GPU computing on CUDA-enabled devices. Still, as the programmer you need to specify the number of blocks
and the number of threads per block, and guard against threads that fall outside the data.

## References

1. _Fast Python for Data Science_, Manning Publications.