# Numba - Just In Time Compilation for Python

Some technologies are so easy to use, and the effects are so powerful, that they appear to be magic.  Numba is one such technology.  

Let's start with Paul's prime sieve code from earlier.

In [1]:
import math
def sieve_primes(n):
    a = [True for x in range(n + 1)]
    i = 2
    while i <= math.sqrt(n):
        if a[i]:
            for j in range(i*i, n + 1, i):
                a[j] = False
        i += 1
    return [i for i in range(2, len(a)) if a[i]]

In [2]:
#Check it's working OK
sieve_primes(30)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

In [3]:
#Time it for all primes less than 5 million
N = 5000000
original_speed = %timeit -o sieve_primes(N)

1.27 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


And now with Numba

In [4]:
from numba import jit

@jit
def numba_sieve_primes(n):
    a = [True for x in range(n + 1)]
    i = 2
    while i <= math.sqrt(n):
        if a[i]:
            for j in range(i*i, n + 1, i):
                a[j] = False
        i += 1
    return [i for i in range(2, len(a)) if a[i]]

In [5]:
#Check it's working OK
numba_sieve_primes(30)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

In [6]:
#Time it 
numba_speed = %timeit -o numba_sieve_primes(N)

211 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
print('Numba is {0:.1f} times faster than pure Python for the primes function'.format(original_speed.best/numba_speed.best))

Numba is 6.0 times faster than pure Python for the primes function


**A factor of 6 speedup, and all we did was type `@jit`.**
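
The sieve above stores its flags in a Python list. As a hedged sketch, the variant below keeps the same algorithm but stores the flags in a NumPy boolean array, a data structure Numba generally handles well. The name `numpy_sieve_primes` is hypothetical, the loop test `i * i <= n` is an equivalent rewrite of `i <= math.sqrt(n)`, and no timing is claimed for it here. The `try`/`except` stand-in is only an assumption so that the sketch still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):  # stand-in: run as plain Python when Numba is missing
        return func


@njit
def numpy_sieve_primes(n):
    # One boolean flag per integer 0..n, stored in a NumPy array
    is_prime = np.ones(n + 1, dtype=np.bool_)
    is_prime[:2] = False  # 0 and 1 are not prime
    i = 2
    while i * i <= n:
        if is_prime[i]:
            # Cross out every multiple of i, starting at i*i
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
        i += 1
    # Indices whose flag is still True are the primes
    return np.nonzero(is_prime)[0]


print(numpy_sieve_primes(30))
```

Note that this version returns a NumPy array of primes rather than a Python list; call `.tolist()` on the result if a list is needed.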

## What is Numba doing?
### Just in time compilation and compilation overhead

By adding the `@jit` decorator to our Python function, we are asking Numba to try to compile the function to machine code. The compilation happens the first time the function is called, hence the name **JIT: just-in-time** compilation.  The compiler in question is the powerful open source compiler framework LLVM.

This means that the first time a Numba function is called there will be a delay due to the compilation, but all subsequent calls with the same argument types will be fast.  Here's an explicit demonstration using `monte_carlo_pi`.

In [8]:
from numba import jit
import random
import time

@jit
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [9]:
start = time.time()
pi = monte_carlo_pi(N)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

Elapsed (with compilation) = 0.3076920509338379


In [10]:
start = time.time()
pi = monte_carlo_pi(N)
end = time.time()
print("Elapsed (no compilation) = %s" % (end - start))

Elapsed (no compilation) = 0.05726218223571777
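
Compilation is triggered per combination of argument types, not just once per function. The sketch below illustrates this with a hypothetical function `add`, using the `signatures` attribute of a compiled Numba function to list the specialisations compiled so far. The `try`/`except` stand-in is an assumption so that the sketch still runs, uncompiled, where Numba is not installed.

```python
try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):  # stand-in: run as plain Python when Numba is missing
        return func


@njit
def add(a, b):
    return a + b


add(1, 2)        # first call with integers: compiles an integer version
add(1.5, 2.0)    # first call with floats: compiles a second, float version
add(3, 4)        # integers again: reuses the version compiled earlier

# A compiled Numba function records one signature per argument-type combination
print(getattr(add, "signatures", "Numba is not installed"))
```

Each new combination of argument types pays the compilation delay once, so it makes sense to time a function only after it has been called with the types of interest.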


Now that we have compilation out of the way, let's time it more robustly using `%timeit` and compare with a version that doesn't make use of Numba.

In [11]:
numba_time = %timeit -o monte_carlo_pi(N)

53.4 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
#Plain Python version that doesn't use numba
def python_monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [13]:
python_time = %timeit -o python_monte_carlo_pi(N)

1.9 s ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
print('Numba version of monte_carlo_pi is {0:.1f} times faster than plain Python'.format(python_time.best/numba_time.best))

Numba version of monte_carlo_pi is 36.6 times faster than plain Python


## Numba doesn't work on everything

Numba has a lot of functionality but there is still a lot of Python that it does not know about and hence cannot compile to machine code.  The following function is given as an example in the Numba documentation as one that Numba can't do anything with:

In [15]:
from numba import jit
import pandas as pd

x = {'a': [1, 2, 3], 'b': [20, 30, 40]}

@jit
def use_pandas(a): 
    df = pd.DataFrame.from_dict(a) 
    df += 1                        
    return df.cov()                

In [16]:
use_pandas(x)

Compilation is falling back to object mode WITH looplifting enabled because Function "use_pandas" failed type inference due to: non-precise type pyobject
During: typing of argument at <ipython-input-15-8c1b6d8ea258> (8)

File "<ipython-input-15-8c1b6d8ea258>", line 8:
def use_pandas(a): 
    df = pd.DataFrame.from_dict(a) 
    ^

  @jit

File "<ipython-input-15-8c1b6d8ea258>", line 7:
@jit
def use_pandas(a): 
^

Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit https://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "<ipython-input-15-8c1b6d8ea258>", line 7:
@jit
def use_pandas(a): 
^



|   | a | b |
|---|---|---|
| a | 1.0 | 10.0 |
| b | 10.0 | 100.0 |


You might expect an error message when Numba is given a function it can't compile, but instead everything seems to have just worked, apart from some warnings.

To dig deeper, let's create a version of this function without any numba decorations.

In [17]:
def use_pandas_nonumba(a): 
    df = pd.DataFrame.from_dict(a) 
    df += 1                        
    return df.cov()   

In [18]:
%timeit use_pandas_nonumba(x)

841 µs ± 9.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [19]:
%timeit use_pandas(x)

920 µs ± 36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


The numba version is actually slightly slower!  Why?

What is happening behind the scenes is that Numba tries to compile the function, realises it doesn't know anything about pandas and so gives up on producing pure machine code. As the warning says, it falls back to "object mode", in which the function essentially runs as a normal Python routine.  All of the checking adds a little overhead and we get the opposite of what we were trying to achieve: slightly slower code instead of much faster code.

You might prefer an error to a silent fallback. If so, you can use the `nopython` flag, which tells Numba that you only want it to produce pure compiled functions with no interaction with the Python interpreter at all.

In [20]:
@jit(nopython=True)
def use_pandas(a): 
    df = pd.DataFrame.from_dict(a) 
    df += 1                        
    return df.cov()  

In [21]:
# This will now result in an error rather than a slightly slower version of the Python function
use_pandas(x)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at <ipython-input-20-a5bacb76db3f> (3)

File "<ipython-input-20-a5bacb76db3f>", line 3:
def use_pandas(a): 
    df = pd.DataFrame.from_dict(a) 
    ^

This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class 'dict'>
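
One possible workaround, sketched here under stated assumptions rather than taken from the Numba documentation, is to keep pandas outside the compiled function and pass it only a plain NumPy array (for example one obtained with `df.to_numpy()`). The function name `cov_plus_one` is hypothetical, and the covariance is written out as explicit loops so that it relies only on basic array operations. The `try`/`except` stand-in is an assumption so that the sketch still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):  # stand-in: run as plain Python when Numba is missing
        return func


@njit
def cov_plus_one(values):
    # values: 2-D float array, one row per observation, one column per variable
    shifted = values + 1.0
    n_rows, n_cols = shifted.shape
    means = np.zeros(n_cols)
    for j in range(n_cols):
        means[j] = shifted[:, j].mean()
    cov = np.zeros((n_cols, n_cols))
    for a in range(n_cols):
        for b in range(n_cols):
            s = 0.0
            for i in range(n_rows):
                s += (shifted[i, a] - means[a]) * (shifted[i, b] - means[b])
            cov[a, b] = s / (n_rows - 1)  # sample covariance, like DataFrame.cov()
    return cov


# Same numbers as the dictionary x above, arranged as columns a and b
data = np.array([[1.0, 20.0], [2.0, 30.0], [3.0, 40.0]])
print(cov_plus_one(data))
```

For this data the result agrees with the covariance table that `use_pandas` produced earlier. Whether the compiled version is actually faster for such a small input has not been measured here.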


## Going Parallel

It is also possible to create parallel code using Numba.  The easiest way to proceed is to add `parallel=True` to the Numba decorator for a function that's already working in `nopython` mode and change `range` to its parallel equivalent `prange`.

In [22]:
from numba import prange

@jit(nopython=True,parallel=True)
def parallel_monte_carlo_pi(nsamples):
    acc = 0
    for i in prange(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [23]:
numba_parallel_time = %timeit -o parallel_monte_carlo_pi(N)

25.5 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Recall how long this took in serial mode

In [24]:
numba_serial_time = %timeit -o monte_carlo_pi(N)

53.8 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


and the original, plain Python version

In [25]:
python_time = %timeit -o python_monte_carlo_pi(N)

1.93 s ± 7.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
print('The parallel numba version was {0:.1f} times faster than the serial numba version and'.
      format(numba_serial_time.best/numba_parallel_time.best))
print('it was {0:.1f} times faster than the plain python version'.
     format(python_time.best/numba_parallel_time.best))

The parallel numba version was 2.7 times faster than the serial numba version and
it was 100.0 times faster than the plain python version


Numba automatically ensures that each thread will get a different random number seed but, at the time of writing, I can't figure out how to determine what these seeds are.  

It should also be noted that the underlying random number generator used by **random.random** is the [Mersenne Twister](https://en.wikipedia.org/wiki/Mersenne_Twister), which has a very large period.  Although it is likely that multiple random streams created with different seeds won't overlap, it isn't guaranteed.

A discussion on this topic can be found at https://github.com/numba/numba/issues/2486 
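
For the serial case, reproducible results can be obtained by seeding from inside the compiled function, since Numba keeps its own generator state separate from that of the Python interpreter. The sketch below uses a hypothetical function `seeded_monte_carlo_pi` and covers serial execution only; it does not answer the question of what the per-thread seeds are in parallel mode. The `try`/`except` stand-in is an assumption so that the sketch still runs, uncompiled, where Numba is not installed.

```python
import random

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):  # stand-in: run as plain Python when Numba is missing
        return func


@njit
def seeded_monte_carlo_pi(nsamples, seed):
    # Seeding inside the compiled function sets the state of the generator
    # that the compiled code actually uses
    random.seed(seed)
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


first = seeded_monte_carlo_pi(100_000, 12345)
second = seeded_monte_carlo_pi(100_000, 12345)
print(first, second, first == second)
```

The two calls use the same seed and therefore draw the same sequence of random numbers, so the two estimates are identical.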

## Exercises

1. Produce a parallel Numba version of the `sieve_primes` function.
2. Try to write a parallel Numba version of the Asian option pricing routine below.
3. If you have any Python code that does something of interest to you, try to write an accelerated version in Numba.

In [None]:
import math # This is the standard Python math module. Not the numpy one
import random # This is the standard Python random module.  Not the numpy one

def Asian(so,k,r,v,t,m,n):
    """
    I have not identified what the arguments mean since the original MATLAB code didn't either. 
    This doc-string will be updated once I learn what they are!
    """
    dt = t/m
    AsianPayoffSum = 0
    for i in range(1,n+1):
        s = so
        stSum = so
        at = so
        for j in range(1,m+1):
            st = s * math.exp(((r-v**2/2)*dt) + (v*random.normalvariate(0,1)*math.sqrt(dt)))
            stSum = stSum + st 
            at = stSum/(j+1)
            s = st
        AsianPayoff = max(at-k,0)
        AsianPayoffSum = AsianPayoffSum + AsianPayoff
    AsianCall = math.exp(-r*t)*(AsianPayoffSum/n)
    return AsianCall

## Numba - Next steps

This session has provided just an outline of some of the functionality provided by Numba.  Some of the topics that haven't been covered include

* GPU Support - Numba has strong support for [NVIDIA GPUs](http://numba.pydata.org/numba-doc/latest/cuda/index.html) and some [support for AMD](http://numba.pydata.org/numba-doc/latest/roc/index.html).
* Numba has support for different [threading layers](http://numba.pydata.org/numba-doc/latest/user/threading-layer.html) including OpenMP and Intel Threading Building Blocks (TBB).
* Numba can make use of the [Intel Short Vector Math Library (SVML)](http://numba.pydata.org/numba-doc/latest/user/performance-tips.html#intel-svml) for fast evaluation of trigonometric functions, square roots, exponentials and many more.
* The code compiled by Numba makes use of many of your CPU's features including SIMD instructions.  Detailed diagnostics concerning the compilation can be obtained.
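
As a starting point for the diagnostics mentioned in the last bullet, the sketch below asks Numba for its report on how a parallel loop was transformed, using the `parallel_diagnostics` method available on functions compiled with `parallel=True`. The function `parallel_sum_squares` is a hypothetical example, and the `try`/`except` stand-ins are an assumption so that the sketch still runs, serially and uncompiled, where Numba is not installed.

```python
try:
    from numba import njit, prange
except ImportError:
    # Stand-ins (an assumption for portability): run serially as plain Python
    prange = range

    def njit(**options):
        return lambda func: func


@njit(parallel=True)
def parallel_sum_squares(n):
    total = 0.0
    # prange marks the loop as parallel; total is treated as a reduction
    for i in prange(n):
        total += float(i * i)
    return total


result = parallel_sum_squares(1_000)
print(result)

# Only compiled Numba functions provide the diagnostics report
if hasattr(parallel_sum_squares, "parallel_diagnostics"):
    parallel_sum_squares.parallel_diagnostics(level=1)
```

Higher values of `level` (up to 4) give progressively more detail in the report.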