# Introduction to Numba

Last class we saw how powerful parallel computing is, and that your own processor can exploit local parallelism. However, once you get to the nitty-gritty it gets really messy. The multiprocessing module requires you to distribute the workload manually, which is a complex and error-prone task. Enter Numba, a Just-In-Time (JIT) compiler for Python. Numba takes the code you wrote and compiles it to machine code tailored to your hardware. Before we go into the specifics, you need to know what a JIT compiler is.

## Compilers vs Interpreters

CPython (the standard Python implementation) is an interpreter, which is why Python is usually called an "interpreted language". While you should know the differences between compiled and interpreted execution (and if you don't, you really need to take ICS 51), here's a short video to serve as a refresher:

In [1]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/_C5AHaS1mOA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

Let's focus on the advantages of compiled computation:
- Running time is faster, because the program is already machine code and no translation happens while it runs
- Code can be optimized. Because the compiler sees the whole program it is translating, it can find better ways to run your code
- Many errors are caught before the program runs. Since your code has to compile in order to run at all, syntax errors (and, in statically typed languages, type errors) never reach execution

On the other hand, interpreters have advantages too:
- Faster initial deployment. There's no need to translate the whole program up front; it is a "pay as you go" model.
- Easier debugging. Most problems will arise right after running the problematic instruction.
- Allows for dynamic coding, where a programmer can change code without having to recompile the whole thing
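
To make the "pay as you go" point concrete, here is a minimal sketch in plain Python. The function `describe` and its deliberate typo are made up for illustration: the interpreter is happy to run the program until the faulty line is actually executed, whereas a syntax error is rejected before anything runs.

```python
# Hypothetical function with a bug hidden in a rarely used branch.
def describe(n):
    if n >= 0:
        return "non-negative"
    # 'lable' is a deliberate typo: this name is never defined.
    return lable

# The function definition is accepted and the good branch runs fine.
print(describe(5))

# The NameError only shows up when the bad line is actually executed.
try:
    describe(-1)
except NameError as err:
    print("failed at run time:", err)

# A syntax error, by contrast, is rejected before any code runs.
try:
    compile("def broken(:\n    pass", "<example>", "exec")
except SyntaxError as err:
    print("rejected up front:", err.msg)
```

A compiler for a statically checked language would typically refuse to build a program containing the undefined name, which is the "all or nothing" behaviour described in the first list.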

But what if we could exploit the advantages of both? That is the idea behind JIT compilers.

## JIT

A JIT compiler runs like an interpreter, but instead of going line by line it picks blocks of code (usually functions), compiles them, runs them, and then moves on to the next block. This generally makes JIT-compiled code faster than interpreted code, and it also lets the compiler apply local optimizations to each block it compiles.

Numba is a JIT compiler for Python (the name comes from NumPy + Mamba, which is a very fast snake). It takes in Python code, separates it into compilable blocks and sends them to LLVM, which is the backend compiler.

Because of its nature, Numba works best with optimized distributions of NumPy.

Numba is not included with Python so you will need to install it:

In [2]:
!pip install numba



Let's see an example of how numba optimizes code:

In [1]:
from numba import jit
import numpy as np
x = np.arange(1000000).reshape(1000, 1000)

In [10]:
%%time
def go_slow(a): # Plain Python: interpreted on every call, never compiled by Numba
    trace = 0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

go_slow(x)

Wall time: 8.99 ms


In [11]:
%%time

@jit(nopython=True) # Set "nopython" mode for best performance, equivalent to @njit
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

go_fast(x)

Wall time: 310 ms


Et voilà!... wait a minute, that is not faster. In fact it is about 35 times slower. What is going on? Simple: we are timing the function's definition along with its first call, and Numba compiles the function on that first call. Therefore, what we are seeing there is mostly the time it took to compile the function. Let's run it again, but now timing only the call. We'll use timeit to get more accurate results:

In [21]:
%%timeit
go_slow(x)

10.7 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [22]:
%%timeit
go_fast(x)

6.57 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


There you have it: in the timeit runs above, go_fast takes roughly 40% less time than go_slow. I know it is not that impressive, but this loop uses a NumPy function, so even the interpreted version is already optimized. Let's go ahead and look into how to use the Numba decorators:

## Nopython mode

Numba operates through the @jit decorator. The decorator takes arguments, and one of them enables nopython mode. In this mode the function is compiled entirely to machine code that runs without the Python interpreter, and compilation fails with an error if Numba cannot handle something in the function. When nopython is not set to True (the default in Numba releases before 0.59), @jit may instead use object mode: if nopython compilation fails (as in the case where we are using libraries unknown to Numba), it falls back to code that goes through the Python interpreter, which is much slower. Newer releases make nopython mode the default for @jit.

Numba is an incredibly sophisticated tool that includes decorators for stencil computation, GPU programming and more. For this class we are going to focus on the basic functionality to optimize your code and to exploit local parallelism.

## Nogil mode

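Setting nogil=True makes Numba release the GIL (Global Interpreter Lock) while the compiled function runs. This only takes effect when the function compiles in nopython mode, and it allows several threads to run the function at the same time (multi-threading).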


In [27]:
%%time
@jit(nogil=True)
def f(x, y):
    return x + y
f(2,3)

Wall time: 47 ms


## Inlining functions

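One of the major optimizations that Numba does is inlining. This means that when one compiled function calls another, the compiler can paste the body of the called function directly into the caller, which removes the call overhead and lets the combined code be optimized as a whole (whether a given call is actually inlined is up to the optimizer). We'll look into a more detailed example later on, but oftentimes inlining will be faster than running the functions separately. In the following example, hypot calls the function square twice. See how it is faster inside the jit-compiled function than doing the same operation separately, where each call to square has to cross from the interpreter into compiled code and back.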

In [30]:
import math
@jit
def square(x):
    return x ** 2

@jit
def hypot(x, y):
    return math.sqrt(square(x) + square(y))

In [35]:
%%timeit
hypot(3,4)

267 ns ± 9.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [36]:
%%timeit
math.sqrt(square(3) + square(4))

675 ns ± 23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


## Parallel mode

The parallel mode attempts to parallelize your code automatically. This works for specific operations on arrays; the most popular ones are:
- unary operators: + - ~
- binary operators: + - * / /? % | >> ^ << & ** //
- comparison operators: == != < <= > >=
- reduce()
- Numpy array functions like zeros, ones, arange, linspace, dot, sum, prod, min, max, argmin, argmax, etc.


In [26]:
%%time
@jit(nopython=True, parallel=True)
def f(x, y):
    return x + y
f(2,3)

Wall time: 58 ms


### Explicit parallelism

Sometimes a loop is "embarrassingly" parallel but the parallel argument alone still won't yield better performance. For those cases we can use loops that are explicitly parallel. This is done through Numba's own prange. However, you need to ensure that there are no cross-iteration dependencies (more on that topic later on); a simple reduction such as the running sum below is fine. Here's an example with a one-dimensional array:

In [38]:
from numba import njit, prange

@njit(parallel=True)
def prange_test(A):
    s = 0
    # Without "parallel=True" in the jit-decorator
    # the prange statement is equivalent to range
    for i in prange(A.shape[0]):
        s += A[i]
    return s
def range_test(A):
    s=0
    for i in range(A.shape[0]):
        s += A[i]
    return s

In [49]:
A=np.random.rand(10000)
print(A)

[0.24267202 0.90299416 0.68211224 ... 0.92767258 0.53343004 0.27722563]


In [55]:
%%timeit
range_test(A)

2.38 ms ± 47.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [57]:
%%timeit
prange_test(A)

12.7 µs ± 591 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


However, see what happens if there is a cross-iteration dependence:

In [86]:
@njit(parallel=True)
def prange_dep_test(A):
    s = 0
    B=np.copy(A)
    # Without "parallel=True" in the jit-decorator
    # the prange statement is equivalent to range
    for i in prange(1,B.shape[0]-1):
        B[i]=B[i]+B[i+1]*B[i-1]
        s += B[i]
    return s
def range_dep_test(A):
    s=0
    B=np.copy(A)
    for i in range(1,B.shape[0]-1):
        B[i]=B[i]+B[i+1]*B[i-1]
        s += B[i]
    return s

In [87]:
B=np.random.rand(1000)

In [88]:

prange_dep_test(B)

740.0015226632324

In [89]:
range_dep_test(B)

974.3861328097464

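Now they are not the same. That's because of the cross-iteration dependence: each iteration reads B[i-1], which the previous iteration has just modified, so the result depends on the order in which iterations run, and prange does not guarantee that order. You have to be very careful when you are using prange.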

## Final words

There is a lot more you can do with Numba, and as always the official documentation is your first point of reference (https://numba.pydata.org/numba-doc/latest/index.html). We are going to keep playing with this tool in our next session, but in the meantime get familiar with it and try to think of opportunities to use it in your code.