# Cython and Numba

[Cython](http://cython.org) compiles ordinary Python code to C.  It also extends Python with optional type annotations that make the resulting C code very efficient.

In [1]:
# Enable inline Cython in IPython notebook
%load_ext cython

In [2]:
%%cython

# Cython accepts Python code unmodified
a = 0
for i in range(10):
    a += i
print(a)

45


In [3]:
%%cython --annotate

# --annotate = show the generated C code for each line and highlight lines that fall back on slow Python operations

a = 0
for i in range(10):
    a += i
print(a)

45


Looking at the C code:
* every read or write of `a` goes through a Python dictionary (the module's global namespace)
* likewise for `i`
* adding two integers involves a generic method call (because they might not be integers!)
* the expression `range(10)` materializes a list.  Cython can't assume that the result is a list (you could have redefined `range` to return something else), so there's code to create and advance an iterator.
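
The dynamic behaviour listed above can be mimicked in plain Python. The sketch below is only an illustration, not the C code that Cython emits; the names `namespace`, `a_from_dict`, and `iterator` are made up for this example.

```python
# Module-level variables live in a dictionary: run the loop inside an
# explicit namespace dict and read the result back out of it.
namespace = {}
exec("a = 0\nfor i in range(10):\n    a += i", namespace)
a_from_dict = namespace["a"]

# Roughly the steps hidden behind `for i in range(10): a += i`
# when nothing is known about the types involved.
a = 0
iterator = iter(range(10))       # range could have been redefined, so iterate generically
while True:
    try:
        i = next(iterator)       # advance the iterator
    except StopIteration:
        break
    a = type(a).__add__(a, i)    # "+" is looked up as a method on the operand's type

print(a_from_dict, a)
```

Both `a_from_dict` and `a` end up as 45, matching the cell output above. Each of these generic steps is what the `cdef` declarations below let Cython replace with plain C operations.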

Let's fix this step by step.

In [4]:
%%cython --annotate

cdef int a = 0       # Cython annotation: a is an integer
for i in range(10):
    a += i
print(a)

45


In [5]:
%%cython --annotate

# --annotate = show the generated C code for each line and highlight lines that fall back on slow Python operations

cdef int a = 0       # Cython annotation: a is an integer
cdef int i           # i is an integer
for i in range(10):
    a += i
print(a)

45


Now the inner loop looks like what you would write in C.

## How much faster is the typed code?

#### Baseline Python

In [6]:
def last_three_digits_of_fib(n):
    a, b = 1, 1
    i = 2
    while i < n:
        a, b = b, (a + b) % 1000
        i += 1
    return b

In [7]:
last_three_digits_of_fib(10)

55

In [8]:
last_three_digits_of_fib(1000)

875
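
As a sanity check, reducing modulo 1000 at every step should give the same last three digits as computing the exact Fibonacci number first. The sketch below assumes the convention F(1) = F(2) = 1 used by the function above; `exact_fib` is a hypothetical helper added only for the comparison.

```python
def last_three_digits_of_fib(n):
    # Same function as in the notebook cell above
    a, b = 1, 1
    i = 2
    while i < n:
        a, b = b, (a + b) % 1000
        i += 1
    return b


def exact_fib(n):
    # Hypothetical helper: exact n-th Fibonacci number (F(1) = F(2) = 1),
    # relying on Python's arbitrary-precision integers
    a, b = 1, 1
    for _ in range(n - 2):
        a, b = b, a + b
    return b


# The two approaches agree on the last three digits for every n tried here
for n in range(1, 200):
    assert last_three_digits_of_fib(n) == exact_fib(n) % 1000
```

The modular version keeps every intermediate value below 2000, which is what later makes it safe to declare the variables as C `int`.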

In [9]:
%timeit last_three_digits_of_fib(1000)

10000 loops, best of 3: 136 µs per loop
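
Outside IPython, a comparable measurement can be taken with the standard-library `timeit` module. This is a minimal sketch: the loop count `number = 1000` is an arbitrary choice for the example, whereas `%timeit` picks it automatically.

```python
import timeit


def last_three_digits_of_fib(n):
    # Same function as in the notebook cell above
    a, b = 1, 1
    i = 2
    while i < n:
        a, b = b, (a + b) % 1000
        i += 1
    return b


# Three runs, each calling the function `number` times
number = 1000
runs = timeit.repeat(lambda: last_three_digits_of_fib(1000), repeat=3, number=number)

# Report the best run, converted to microseconds per call
best_us = min(runs) / number * 1e6
print("best of 3: %.1f us per loop" % best_us)
```

Taking the minimum of the runs mirrors the "best of 3" figure that `%timeit` reports; the absolute number will depend on the machine and the Python version.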


#### Cython, without annotations

In [10]:
%%cython --annotate

def last_three_digits_of_fib_2(n):
    a, b = 1, 1
    i = 2
    while i < n:
        a, b = b, (a + b) % 1000
        i += 1
    return b

In [11]:
%timeit last_three_digits_of_fib_2(1000)

10000 loops, best of 3: 68.9 µs per loop


The difference is mostly due to removing the Python bytecode interpreter.

#### Cython, full annotations

In [12]:
%%cython --annotate

def last_three_digits_of_fib_3(int n):
    cdef int a, b, i
    a, b = 1, 1
    i = 2
    while i < n:
        a, b = b, (a + b) % 1000
        i += 1
    return b

In [13]:
%timeit last_three_digits_of_fib_3(1000)

100000 loops, best of 3: 6.73 µs per loop


136 µs -> 6.7 µs per loop (about 20x faster) by removing Python crud!

**Beware**: _"Premature optimization is the root of all evil"_ - Donald Knuth

In [14]:
# NumPy code is already pretty optimized
import numpy as np
data = np.random.random(1000)  # 1000 uniform random numbers in [0,1)

In [15]:
%timeit np.mean(data)

The slowest run took 51.73 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 8.7 µs per loop


In [16]:
def my_stupid_mean(vec):
    total = 0.0
    for x in vec:
        total += x
    return total / len(vec)

In [17]:
%timeit my_stupid_mean(data)

10000 loops, best of 3: 134 µs per loop


In [18]:
%%cython --annotate

cimport numpy as np  # Import Cython-specific (compile-time) info about NumPy
cimport cython
@cython.boundscheck(False)   # Skip index bounds checks inside this function
def cython_mean(np.ndarray[np.double_t, ndim=1] vec):  # NumPy-specific element type
    
    cdef double total = 0.0     # Declare a C type for every local variable
    cdef int i                  # Access elements via an integer index
    cdef int num = vec.shape[0] # Use this syntax to get array sizes
    
    for i in range(num):
        total += vec[i]
    return total / num

In [19]:
%timeit cython_mean(data)

The slowest run took 18.61 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.55 µs per loop


But sometimes the effort really does pay off:
* Pure Python: 134 µs
* NumPy built-in: 8.7 µs
* Cython equivalent: 1.55 µs  (took me about 15 minutes to annotate correctly!)
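
To put the comparison in relative terms, the speedup factors can be worked out from the `%timeit` figures reported above. The dictionary below simply copies those measurements; the variable names are chosen for this example.

```python
# Timings in microseconds, copied from the %timeit outputs above
timings_us = {
    "Pure Python": 134.0,
    "NumPy built-in": 8.7,
    "Cython equivalent": 1.55,
}

# Speedup of each variant relative to the pure Python loop
baseline = timings_us["Pure Python"]
speedups = {name: baseline / t for name, t in timings_us.items()}

for name, factor in speedups.items():
    print("%-18s %6.1fx" % (name, factor))
```

On these numbers NumPy is roughly 15x faster than the pure Python loop and the annotated Cython version roughly 86x faster, though single microsecond-scale timings like these are noisy.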

## Numba

[Numba](http://numba.pydata.org/) is a relatively new package that promises to automate much of the type annotation.  It brands itself as a JIT (just-in-time) compiler for Python.  With almost no effort, it can match or beat Cython most of the time.

It is not completely mature yet (painful to install!).  _Switch to Anaconda IPython Notebook here_: `/anaconda/bin/ipython notebook`

In [20]:
import numba

In [21]:
# Signature string: takes a 1-D float64 array (f8[:]) and returns a float64 (f8)
my_smart_mean = numba.jit("f8(f8[:])")(my_stupid_mean)

In [22]:
%timeit my_smart_mean(data)

The slowest run took 145.12 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.44 µs per loop


_Not bad!!_

In [23]:
# Without a type signature, compilation is delayed until the
# function is first called with concrete arguments
speedy_last_three_digits_of_fib = numba.jit()(last_three_digits_of_fib)

In [24]:
%timeit speedy_last_three_digits_of_fib(1000)

The slowest run took 10079.58 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.75 µs per loop


_Slightly faster than the annotated Cython version!_

In [25]:
# Idiomatic Numba
@numba.jit    # Function decorator
def my_super_smart_mean(data):
    total = 0.0
    for x in data:
        total += x
    return total / len(data)

In [26]:
%timeit my_super_smart_mean(data)

The slowest run took 46533.71 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.45 µs per loop


[Note: the first call to the function incurs JIT compilation overhead, which is why the slowest run above took so much longer than the fastest]

In [27]:
%timeit my_super_smart_mean(data)

The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.68 µs per loop
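
The first-call overhead noted above can be observed directly by timing two consecutive calls. This is a rough sketch under a few assumptions: it uses the integer-only Fibonacci function so that no NumPy array is needed, `time_call` is a hypothetical helper, and the `ImportError` fallback exists only so the snippet still runs where Numba is not installed (in which case there is no compilation step to observe).

```python
import time

try:
    import numba
    jit = numba.jit(nopython=True)   # decorator; compilation happens lazily on first call
except ImportError:
    # Assumption for portability: without Numba, keep the plain Python
    # function, so no first-call overhead will be visible.
    def jit(func):
        return func


@jit
def fib_last_three(n):
    # Same algorithm as last_three_digits_of_fib above
    a, b = 1, 1
    i = 2
    while i < n:
        a, b = b, (a + b) % 1000
        i += 1
    return b


def time_call(func, *args):
    # Hypothetical helper: result and wall-clock seconds for a single call
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


first_result, first_seconds = time_call(fib_last_three, 1000)    # includes compilation
second_result, second_seconds = time_call(fib_last_three, 1000)  # reuses the compiled code
print("first call:  %.6f s" % first_seconds)
print("second call: %.6f s" % second_seconds)
```

With Numba available, the first call is expected to be far slower than the second, which is consistent with the large "slowest run" ratios that `%timeit` warns about above.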
