# Numba

[Numba](http://numba.pydata.org/) is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code. 

Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model. Kernels written in Numba appear to have direct access to NumPy arrays. NumPy arrays are transferred between the CPU and the GPU automatically.

## What is a kernel?

A kernel is similar to a function: it is a block of code which takes some inputs and is executed by a processor.

The differences between a function and a kernel are:
- A kernel cannot return anything; it must instead modify memory
- A kernel must be launched with a thread hierarchy (blocks and threads per block)

## What are grids, threads and blocks (and warps)?

[Threads and blocks](https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)) are how you instruct your GPU to process some code in parallel. Our GPU is a parallel processor, so we need to specify how many times we want our kernel to be executed.

Threads in the same block have the benefit of sharing some fast memory between them, but there are a limited number of cores on each GPU, so we need to break our work down into blocks which will be run one after another when there are more blocks than the GPU can run at once.

<figure>

![Threads, blocks and warps](images/threads-blocks-warps.png)

<figcaption style="text-align: center;"> 
    
Image source <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/">https://docs.nvidia.com/cuda/cuda-c-programming-guide/</a>
    
</figcaption>
</figure>


### What??

Don't worry too much about this now. Just take away the idea that **we need to specify the number of times we want our kernel to be called**, and that is given as two numbers which are multiplied together to give your overall grid size.

Rules of thumb for threads per block:
- Should be a multiple of the warp size (32)
- A good place to start is 128-512 but benchmarking is required to determine the optimal value.
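
To make the "two numbers multiplied together" idea concrete, here is a minimal sketch in plain Python (no GPU needed). The helper name `launch_config` and its default of 128 threads per block are illustrative choices, not part of Numba.

```python
import math

WARP_SIZE = 32  # threads per warp, as in the rule of thumb above


def launch_config(n_items, threads_per_block=128):
    """Return (blocks, threads_per_block) so that blocks * threads >= n_items.

    Hypothetical helper for illustration; not a Numba function.
    """
    if threads_per_block % WARP_SIZE != 0:
        raise ValueError("threads_per_block should be a multiple of the warp size")
    # Round up so that every item gets a thread.
    blocks = math.ceil(n_items / threads_per_block)
    return blocks, threads_per_block


blocks, threads = launch_config(1_000)
print(blocks, threads, blocks * threads)  # 8 128 1024
```

Here 8 blocks of 128 threads give 1024 threads in total, slightly more than the 1000 items we asked for, because the block count is rounded up.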


## Hello world

Let's dig in with some code and hopefully things will become more clear.

To start off let's write a simple CPU based Python function which we will call repeatedly within a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions). From a Python perspective list comprehensions can be a good jumping off point for parallel computing because they feel somewhat parallel already.

In [None]:
data = range(10)

def foo(i):
    return i
    
[foo(i) for i in data]

Here we have our `foo` function return its index value and use a list comprehension to iterate over our data, which is generated by `range`.

Next we will step by step convert this over to a CUDA kernel and run it on our GPU with numba CUDA.

First we need to remember that our kernel cannot return anything. Instead we will use an output list to store the values we would return.

In [None]:
data = range(10)
output = []

def foo(i):
    output.append(i)
    
[foo(i) for i in data]

output

Our next challenge is that our output array on our GPU must have a fixed length. We can't start off with an empty array and keep appending things. So let's use numpy to create an `ndarray` with the same length as our input data. We will also convert our input `range` to a `numpy` array, as that's what we can move to our GPU.

In [None]:
import numpy as np

In [None]:
data = np.asarray(range(10))
output = np.zeros(len(data))

def foo(i):
    output[i] = i
    
[foo(i) for i in data]

output

Now that our pure Python function behaves like a kernel let's use numba to convert it into one.

In [None]:
from numba import cuda

In [None]:
data = np.asarray(range(10))
output = np.zeros(len(data))

@cuda.jit
def foo(input_array, output_array):
    i = cuda.grid(1)
    output_array[i] = i
    
foo[1, len(data)](data, output)

output

**Woo the above code ran on our GPU!**

Now let's unpack this a bit.

To convert our CPU function into a GPU kernel we need to add the `@cuda.jit` decorator. This tells numba to compile our function down to CUDA-compatible GPU code (PTX) at runtime.

Next we changed our kernel's inputs to `input_array` and `output_array`. This is because our kernel needs a reference to both arrays in order to interact with them. (More on this later).

But what about `i`? Instead of passing our function the index each time we call it, we can rely on a nice CUDA function called `cuda.grid`, which allows our kernel to get its own thread index while it is running.

Lastly we make a funny looking function call `foo[blocks, threads](input, output)`. In order to run our kernel on the GPU in parallel we need to specify how many times we want it to run. Kernels are configured using square brackets, passing the number of blocks and the number of threads per block. With our array being only `10` elements long we specify `1` block of `10` threads, which means our kernel will be executed `10` times. Then we pass our arguments as normal.
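
As a rough mental model, `cuda.grid(1)` combines the block index, the block width and the thread index into one flat position. The plain-Python sketch below emulates a launch sequentially on the CPU. The names `emulate_launch` and `kernel_body` are hypothetical, and a real GPU runs these iterations in parallel rather than in a loop.

```python
def emulate_launch(kernel_body, blocks, threads_per_block, *args):
    """Call kernel_body once per (block, thread) pair, like a 1D CUDA launch.

    Sequential CPU emulation for illustration only; not how Numba runs kernels.
    """
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same arithmetic as cuda.grid(1):
            # blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            kernel_body(i, *args)


def kernel_body(i, output_array):
    # Each "thread" writes its own index, like the foo kernel above.
    output_array[i] = i


output = [0] * 10
emulate_launch(kernel_body, 1, 10, output)  # mirrors foo[1, 10](...)
print(output)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Launching 2 blocks of 5 threads would fill the same ten slots, since only the product of the two numbers decides how many times the body runs.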

## Something a little bigger

Now that we've run our first CUDA kernel with Numba let's try something a little bigger.

This time we are going to take a large array and double every number in it. We will do it first in pure Python on the CPU and then in a CUDA kernel on the GPU.

Let's start with a large array of 30 million random numbers, and an output array of equal length.

In [None]:
random_array = np.random.random((30_000_000))
random_array

In [None]:
output = np.zeros_like(random_array)
output

Then in Python let's iterate over this array and double each item into the output array. We can time the cell to see how long this takes.

In [None]:
%%time

def foo(i):
    output[i] = random_array[i] * 2
    
[foo(i) for i in range(len(random_array))]

output

For me, the CPU takes around 10 seconds to do this calculation.

Next let's write a CUDA kernel which does exactly the same thing. The only difference from the previous example is that we set our thread count to a fixed value of `128` and then calculate how many blocks we need to cover the whole array.

In [None]:
import math

In [None]:
%%time

output = np.zeros_like(random_array)

threads = 128
blocks = math.ceil(random_array.shape[0] / threads)

@cuda.jit
def foo(input_array, output_array):
    i = cuda.grid(1)
    output_array[i] = input_array[i] * 2
    
foo[blocks, threads](random_array, output)
output

Hooray, this is now well over an order of magnitude faster, taking only a few hundred milliseconds (and that figure includes compiling the kernel on its first call).
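
One caveat: 30,000,000 happens to divide evenly by 128, so every thread has an element to work on. When the array length is not a multiple of the thread count, rounding the block count up launches a few spare threads, and the kernel should skip them. In a Numba kernel this is typically written as `if i < output_array.size:` around the body. The sketch below emulates that on the CPU with plain Python; `emulate_launch` and `double_kernel` are hypothetical names.

```python
import math


def double_kernel(i, input_array, output_array):
    # Guard: spare threads past the end of the array do nothing.
    if i < len(output_array):
        output_array[i] = input_array[i] * 2


def emulate_launch(kernel_body, blocks, threads_per_block, *args):
    """Sequential stand-in for a 1D CUDA launch (illustration only)."""
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            kernel_body(block_idx * threads_per_block + thread_idx, *args)


data = list(range(10))
output = [0] * len(data)

threads = 4
blocks = math.ceil(len(data) / threads)  # 3 blocks -> 12 threads for 10 items

emulate_launch(double_kernel, blocks, threads, data, output)
print(output)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Without the guard, the two spare threads (indices 10 and 11) would index past the end of the arrays, which is an `IndexError` in this emulation and an out-of-bounds memory access on a GPU.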

The savvy among you may be wondering though, numpy is already a C based optimised library and we are comparing our GPU kernel with some pure Python code. What if we did it in numpy?

Well you would be right, for this example numpy is still faster than our GPU.

In [None]:
%%time 

random_array * 2

The reason for this is memory management.

## Memory management

Earlier we discussed that the CPU and GPU are effectively two separate computers. Each of these computers has its own memory.

All of the data we've worked with so far has been created with numpy on the CPU. So in order for `numba` to work with this data on the GPU it has been quietly copying data back and forth for us.

This data movement comes at a penalty.

We also have the option of being in control of the data ourselves. We can explicitly move our numpy arrays to the GPU ahead of time with `cuda.to_device`.

In [None]:
gpu_random_array = cuda.to_device(random_array)
gpu_output = cuda.to_device(np.zeros_like(random_array))

In [None]:
gpu_random_array

Now if we run our kernel again and pass it our GPU memory arrays, we see it does in fact outperform numpy.

In [None]:
%%timeit -n 100
foo[blocks, threads](gpu_random_array, gpu_output)
cuda.synchronize()  # launches are asynchronous; wait so the timing covers the actual work

However our output result is also still on the GPU. We explicitly copied it there, so we need to explicitly copy it back with `copy_to_host()`.

In [None]:
gpu_output.copy_to_host()

Both of these data movement operations take time. But the calculation we are performing here is trivial. As the calculation within our kernel gets more complex the percentage of time spent copying data becomes smaller and smaller.

Memory management is useful in other places too. We may wish to write code where we write many kernels and chain them together. It would be inefficient to copy that data to the GPU and back again between each kernel call.

```python
# move array to GPU
foo[blocks, threads](data, output)
# move data back to CPU

# move array to GPU
bar[blocks, threads](data, output)
# move data back to CPU

# move array to GPU
baz[blocks, threads](data, output)
# move data back to CPU
```

So by explicitly moving our data to the GPU once, and copying it back only at the end, we can cut down on this time and be more in control of our computation.

```python
# move array to GPU
data = cuda.to_device(data)
output = cuda.to_device(output)

foo[blocks, threads](data, output)
bar[blocks, threads](data, output)
baz[blocks, threads](data, output)

# move data back to CPU
data = data.copy_to_host()
output = output.copy_to_host()
```
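
A back-of-the-envelope estimate shows why this matters. The sketch below assumes two 30-million-element `float64` arrays (8 bytes per element, as produced by `np.random.random` and `np.zeros_like` above). It also assumes that, when handed NumPy arrays, Numba copies every array argument to the GPU before the launch and back afterwards. It counts bytes only; real timings depend on your hardware.

```python
BYTES_PER_FLOAT64 = 8
N_ELEMENTS = 30_000_000
N_ARRAYS = 2   # data and output
N_KERNELS = 3  # foo, bar, baz

array_bytes = N_ELEMENTS * BYTES_PER_FLOAT64

# Implicit transfers (assumed): each launch copies every array up and back.
implicit_bytes = N_KERNELS * N_ARRAYS * 2 * array_bytes

# Explicit transfers: each array goes up once and comes back once.
explicit_bytes = N_ARRAYS * 2 * array_bytes

print(f"per array: {array_bytes / 1e6:.0f} MB")     # per array: 240 MB
print(f"implicit:  {implicit_bytes / 1e9:.2f} GB")  # implicit:  2.88 GB
print(f"explicit:  {explicit_bytes / 1e9:.2f} GB")  # explicit:  0.96 GB
```

Under these assumptions the chained version moves a third of the data, and the saving grows with every extra kernel in the chain.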