# Numba Overview

*   Just-in-time (JIT) compiler for Python
*   Specify functions to be compiled using decorators: `@<decorator> def func_name ...`
*   CPU usage: Decorated functions are compiled on their first call (once per combination of argument types); subsequent calls run at machine-code speed
*   GPU usage: Decorated functions compiled into CUDA kernels and device functions for running on a GPU
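
As a rough illustration of the CPU case, the sketch below decorates a small loop with `@njit` and times two calls. The function name `sum_of_squares` is made up for this example, and the fallback decorator is only there so the snippet still runs where Numba is not installed (in that case there is no speed-up to observe).

```python
import time

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: fall back to plain Python if Numba is missing.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # A simple numeric loop, the kind of code a JIT compiler handles well.
    total = 0
    for i in range(n):
        total += i * i
    return total


start = time.perf_counter()
first = sum_of_squares(100_000)      # with Numba, this call includes compilation
first_call_s = time.perf_counter() - start

start = time.perf_counter()
second = sum_of_squares(100_000)     # reuses the already compiled machine code
second_call_s = time.perf_counter() - start

print(first == second, first_call_s, second_call_s)
```

With Numba installed, the first timing is typically much larger than the second, because it includes the one-off compilation.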

# Using Numba to Program GPUs

## Requirements

*   Numba package available via `conda install numba` or `pip install numba` (or by compiling from source)
*   Requires a CUDA-enabled GPU with compute capability 2.0+ (recent Numba releases require a higher minimum; check the documentation for your version), an up-to-date NVIDIA driver, and the cudatoolkit package
 *   cudatoolkit available via `conda install cudatoolkit` for Anaconda
 *   Other Numba installations require separate installation of CUDA Toolkit (e.g., via CUDA SDK installer)

## How to use it
*   Write a kernel in Python, using only the subset of Python that Numba supports in CUDA code (e.g., exception handling via `try ... except` is not supported)
*   Mark the function with a Numba CUDA decorator


In [None]:
from numba import cuda
@cuda.jit
def my_kernel(...):
	...

*	Numba compiles it into a CUDA kernel
*	Numba CUDA kernels interface directly with NumPy arrays, which are transferred automatically between the CPU and GPU


## Advantages and disadvantages
* Advantages: Pure Python code; avoids some low-level hassles of CUDA
* Disadvantages: Numba offers only a subset of all CUDA capabilities (e.g., dynamic parallelism and texture memory are not available yet)


## Writing and executing CUDA kernels
*	As in CUDA programming, you need to think in terms of grids/blocks/threads and global/shared/local GPU memory
*	A kernel launch must specify the number of blocks per grid and the number of threads per block

In [None]:
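import math
# my_array and increment_by_one are defined in the full example further down
threadsperblock = 32
# Round up so that every element of my_array is covered by a thread
blockspergrid = math.ceil(my_array.size / threadsperblock)
increment_by_one[blockspergrid, threadsperblock](my_array)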
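
The rounding-up step in the launch configuration can be sketched in plain Python. The helper names `blocks_per_grid` and `blocks_per_grid_2d` are hypothetical, not part of Numba; they use integer ceiling division, which gives the same result as `math.ceil(n / t)` without going through floating point.

```python
def blocks_per_grid(n_elements, threads_per_block):
    """Smallest number of blocks whose threads cover n_elements (1D)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def blocks_per_grid_2d(shape, threads_per_block):
    """Same calculation applied per axis for a 2D launch."""
    return tuple(
        (n + t - 1) // t for n, t in zip(shape, threads_per_block)
    )


print(blocks_per_grid(8, 4))                          # 2
print(blocks_per_grid(1000, 32))                      # 32
print(blocks_per_grid_2d((1080, 1920), (16, 16)))     # (68, 120)
```

Because the grid is rounded up, the last block may contain threads with no element to process, which is why kernels usually guard their work with a bounds check.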

*	Numba provides the following objects and functions for locating a thread within the thread hierarchy:
 *   `numba.cuda.threadIdx` - index of the current thread within its block
 *   `numba.cuda.blockIdx` - index of the current block within the grid
 *   `numba.cuda.blockDim` - number of threads per block
 *   `numba.cuda.gridDim` - number of blocks in the grid
 *   `numba.cuda.grid(ndim)` - absolute position of the current thread in the entire grid of blocks (`ndim` is the number of dimensions declared when instantiating the kernel)
 *   `numba.cuda.gridsize(ndim)` - absolute size/shape, in threads, of the entire grid of blocks (`ndim` as above)
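
To make the relationship between these quantities concrete, here is a CPU-only emulation of a one-dimensional launch in plain Python. It is a sketch, not how Numba runs kernels: the function name `simulate_1d_launch` is invented for this example, and the two loops run sequentially where a GPU would run the threads in parallel.

```python
def simulate_1d_launch(data, blockspergrid, threadsperblock):
    """Emulate a 1D kernel that adds 1 to each element of data."""
    block_dim = threadsperblock    # plays the role of cuda.blockDim.x
    grid_dim = blockspergrid       # plays the role of cuda.gridDim.x
    for block_idx in range(grid_dim):            # cuda.blockIdx.x
        for thread_idx in range(block_dim):      # cuda.threadIdx.x
            # This is the value cuda.grid(1) returns for the thread.
            pos = thread_idx + block_idx * block_dim
            if pos < len(data):    # surplus threads in the last block do nothing
                data[pos] += 1
    return grid_dim * block_dim    # corresponds to cuda.gridsize(1)


values = [3, 5, 9, 13, 7, 2, 4, 1, 8]
total_threads = simulate_1d_launch(values, blockspergrid=3, threadsperblock=4)
print(values)          # [4, 6, 10, 14, 8, 3, 5, 2, 9]
print(total_threads)   # 12
```

Nine elements with four threads per block need three blocks, so twelve threads are launched and the last three are masked out by the bounds check.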


In [8]:
from numba import cuda
import numpy as np
import math

@cuda.jit
def increment_by_one(an_array):
    pos = cuda.grid(1)
    if pos < an_array.size:
        an_array[pos] += 1

my_array = np.array([3, 5, 9, 13, 7, 2, 4, 1])

threadsperblock = 4
blockspergrid = math.ceil(my_array.size / threadsperblock)
increment_by_one[blockspergrid, threadsperblock](my_array)

print(threadsperblock)
print(blockspergrid)
print(my_array)


4
2
[ 4  6 10 14  8  3  5  2]


## Memory Management in Numba
* Numba automatically transfers NumPy arrays passed to a kernel to the device
* By default, Numba copies these arrays back to the host when the kernel finishes
* Manual transfer functions avoid unnecessary copies (e.g., copying read-only input arrays back to the host)
 *   `numba.cuda.device_array(shape, dtype=np.float64, strides=None, order='C', stream=0)` - allocate an empty device ndarray (like `numpy.empty()`)
 *   `numba.cuda.device_array_like(ary, stream=0)` - same as `device_array()`, except that shape and dtype are taken from the given array
 *   `numba.cuda.to_device(obj, stream=0, copy=True, to=None)` - allocate and transfer a NumPy ndarray to the device
   *		For example, copying a NumPy array from host to device:


In [None]:
ary = np.arange(10)
d_ary = cuda.to_device(ary)
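
A fuller round trip might look like the sketch below: copy to the device, launch a kernel on the device array, and copy the result back explicitly with `copy_to_host()`. The wrapper name `increment_with_explicit_transfers` is hypothetical, and the CPU fallback is an assumption added so the snippet also runs on machines without Numba or a CUDA GPU.

```python
import numpy as np


def increment_with_explicit_transfers(host_array):
    """Return host_array + 1, computed on the GPU when one is available."""
    try:
        from numba import cuda
        gpu_ok = cuda.is_available()
    except ImportError:
        gpu_ok = False
    if not gpu_ok:
        # Fallback for this sketch only: same result computed with NumPy.
        return host_array + 1

    @cuda.jit
    def increment_by_one(an_array):
        pos = cuda.grid(1)
        if pos < an_array.size:
            an_array[pos] += 1

    d_ary = cuda.to_device(host_array)       # explicit host -> device copy
    threadsperblock = 32
    blockspergrid = (host_array.size + threadsperblock - 1) // threadsperblock
    # Passing a device array means Numba does not copy anything back itself.
    increment_by_one[blockspergrid, threadsperblock](d_ary)
    return d_ary.copy_to_host()              # explicit device -> host copy


ary = np.arange(10)
result = increment_with_explicit_transfers(ary)
print(result)
```

Since the kernel works on the device copy, the original host array `ary` is left unchanged; only the array returned by `copy_to_host()` holds the incremented values.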

* Numba also supports:
 *	Using streams to transfer arrays
 *	Accessing device arrays
 *	Pinned memory
 *	Mapped memory
 *	Accessing global, shared, and local device memory

* Numba queues up device memory deallocations when no more references exist
 *	Deallocations can be paused temporarily if necessary with `numba.cuda.defer_cleanup()`


## Reductions
* Reductions are simpler to write in Numba than in CUDA C/C++: decorate a binary function with `@cuda.reduce` and Numba generates the reduction kernel


In [None]:
@cuda.reduce
def sum_reduce(a, b):
    return a + b
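
Calling the reduction could look like the following sketch, which also compares it against a sequential CPU reference built with `functools.reduce`. The names `add`, `A`, `expected`, and `got` are illustrative, and the NumPy fallback is an assumption so that the snippet still runs without Numba or a CUDA GPU.

```python
from functools import reduce

import numpy as np


def add(a, b):
    return a + b


A = np.arange(1, 1235, dtype=np.float64)   # 1.0, 2.0, ..., 1234.0
expected = reduce(add, A)                   # sequential CPU reference

try:
    from numba import cuda
    gpu_ok = cuda.is_available()
except ImportError:
    gpu_ok = False

if gpu_ok:
    sum_reduce = cuda.reduce(add)           # same as decorating add with @cuda.reduce
    got = sum_reduce(A)                     # reduction runs on the GPU
else:
    got = A.sum()                           # fallback for this sketch only

print(expected, got)
```

The values here are small whole numbers, so the sum is exact in `float64` regardless of the order in which the GPU combines them; with general floating-point data, a parallel reduction can differ from the sequential result by rounding error.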

# References
* Numba documentation: https://numba.pydata.org/numba-doc/dev/index.html