Numba: fast 'for' loops in Python
Authors: Pierre Ablin, Mathurin Massias

Numba is a Python package that performs just-in-time (JIT) compilation. It can greatly accelerate Python for loops, and it supports a large subset of Python and Numpy operations.

To install it, simply run conda install numba or pip install numba. Be sure to have an up-to-date version: pip install --upgrade numba.

First example
Say you want to compute $\sum_{i=1}^n \frac{1}{i^2}$. The following code does it in pure Python:


In [11]:
def sum_python(n):
    output = 0.
    for i in range(1, n + 1):
        output += 1. / i ** 2
    return output

In [12]:
%timeit sum_python(10000)


5.15 ms ± 690 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
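As a quick sanity check (not part of the original notebook), the value can be compared with the limit of the series, which is $\pi^2 / 6$; the remainder after $n$ terms is close to $1/n$. The sketch below restates sum_python so that it runs on its own.

```python
import math

def sum_python(n):
    output = 0.
    for i in range(1, n + 1):
        output += 1. / i ** 2
    return output

limit = math.pi ** 2 / 6      # limit of the series as n goes to infinity
partial = sum_python(10000)
gap = limit - partial         # remainder of the series, close to 1 / n
print(partial, limit, gap)
```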


Numpy
To accelerate this loop, you can vectorize it using Numpy:

In [18]:
import numpy as np

def sum_numpy(n):
    return np.sum(1. / np.arange(1, n + 1) ** 2)

In [19]:
%timeit sum_numpy(10000)


54.9 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Numba
You can also use the @njit decorator from Numba. Simply put it on top of the Python function:

In [20]:
from numba import njit

@njit
def sum_numba(n):
    output = 0.
    for i in range(1, n + 1):
        output += 1. / i ** 2
    return output

In [21]:
%timeit sum_numba(10000)

16.7 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


With the timings above, this is about 300 times faster than the pure Python code and, for this example, about 3 times faster than Numpy!
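One caveat when timing: an @njit function is compiled the first time it is called with a given argument type, so that first call includes the compilation time. This is probably why %timeit reports "1 loop each" for the Numba function. A minimal sketch that times the first and second calls separately is given below. The try/except fallback is an assumption added for portability: it only lets the snippet run where Numba is not installed, in which case nothing is compiled and both calls take similar time.

```python
import time

try:
    from numba import njit
except ImportError:
    # Fallback (assumption, for portability only): run as plain Python
    def njit(func):
        return func

@njit
def sum_numba(n):
    output = 0.
    for i in range(1, n + 1):
        output += 1. / i ** 2
    return output

t0 = time.perf_counter()
first = sum_numba(10000)      # first call: triggers compilation for an int argument
t1 = time.perf_counter()
second = sum_numba(10000)     # second call: reuses the compiled code
t2 = time.perf_counter()

print("first call :", t1 - t0, "s")
print("second call:", t2 - t1, "s")
```

Calling the function once before running %timeit keeps the compilation time out of the measurement.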

Second example: stochastic gradients
Numba is particularly handy when coding a stochastic algorithm. Computing a single stochastic gradient is a very fast operation, so when the loop over iterations is written in pure Python, the interpreter overhead of each iteration can dominate the running time.

Take the ridge regression problem $\min_x f(x)$ with $f(x) = \frac 1n \sum_{i=1}^n f_i(x)$, where:

$$f_i(x) = \frac{1}{2}(a_i^\top x- b_i)^2 + \frac \lambda 2 \|x\|_2^2$$
We have the stochastic gradients: $\nabla f_i(x) = (a_i^\top x - b_i) a_i + \lambda x$, and the full batch gradient: $\nabla f(x) = \frac1n A^{\top}(A x - b) + \lambda x$
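The two formulas can be checked against each other numerically: averaging the $n$ stochastic gradients should give the full batch gradient. The sketch below does this on small random data. The helper name check_gradient_formulas, the sizes and the seed are hypothetical choices made for this check only.

```python
import numpy as np

def check_gradient_formulas(n=20, p=5, lam=0.1, seed=0):
    # Hypothetical helper: compares the mean of the stochastic gradients
    # with the closed-form full batch gradient on random data.
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, p))
    b = rng.standard_normal(n)
    x = rng.standard_normal(p)

    # Stochastic gradients (a_i^T x - b_i) a_i + lam * x, one per sample
    grads = [(np.dot(A[i], x) - b[i]) * A[i] + lam * x for i in range(n)]
    mean_grad = np.mean(grads, axis=0)

    # Full batch gradient (1/n) A^T (A x - b) + lam * x
    full_grad = A.T @ (A @ x - b) / n + lam * x
    return np.allclose(mean_grad, full_grad)

print(check_gradient_formulas())
```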

In [22]:
n, p = 100, 100

A = np.random.randn(n, p)
b = np.random.randn(n)

lam = 0.1
x = np.zeros(p)

Numpy

In [25]:
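def grad_i(x, i, A, b, lam):
    # Stochastic gradient of f_i at x: (a_i^T x - b_i) a_i + lam * x
    ai = A[i]
    return (np.dot(ai, x) - b[i]) * ai + lam * x


def sgd(x, max_iter, step, A, b, lam):
    # SGD that cycles through the rows of A in order; x is updated in place
    n, _ = A.shape
    for i in range(max_iter):
        x -= step * grad_i(x, i % n, A, b, lam)
    return x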

In [27]:
%timeit grad_i(x, 0, A, b, lam)


4.6 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [28]:
%timeit sgd(x, 1000, 0.0001, A, b, lam)


10.3 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
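Note that sgd updates x in place through x -= ..., so each call made by %timeit starts from the iterate left by the previous call rather than from zero. This does not change the cost per iteration, but it matters when the result itself is of interest. A small sketch of the effect follows, on hypothetical small data (A_small and b_small are names introduced here), with x0.copy() used to leave the starting point untouched.

```python
import numpy as np

def grad_i(x, i, A, b, lam):
    ai = A[i]
    return (np.dot(ai, x) - b[i]) * ai + lam * x

def sgd(x, max_iter, step, A, b, lam):
    n, _ = A.shape
    for i in range(max_iter):
        x -= step * grad_i(x, i % n, A, b, lam)   # in-place update of the caller's array
    return x

rng = np.random.default_rng(0)
A_small = rng.standard_normal((10, 3))   # hypothetical small problem
b_small = rng.standard_normal(10)

x0 = np.zeros(3)
out_copy = sgd(x0.copy(), 100, 0.0001, A_small, b_small, 0.1)
unchanged = not x0.any()          # x0 is still all zeros after the call on a copy

out_inplace = sgd(x0, 100, 0.0001, A_small, b_small, 0.1)
aliased = out_inplace is x0       # the returned array is the input array itself
print(unchanged, aliased)
```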


Numba


In [29]:
@njit
def grad_i(x, i, A, b, lam):
    ai = A[i]
    return (np.dot(ai, x) - b[i]) * ai + lam * x


@njit
def sgd(x, max_iter, step, A, b, lam):
    n, _ = A.shape
    for i in range(max_iter):
        x -= step * grad_i(x, i % n, A, b, lam)
    return x

In [30]:
%timeit grad_i(x, 0, A, b, lam)


The slowest run took 9.00 times longer than the fastest. This could mean that an intermediate result is being cached.
3.18 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [31]:
%timeit sgd(x, 1000, 0.0001, A, b, lam)


453 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
