## Coming soon in `numba` 0.34

You can install the release candidate as of 07/09/2017 from the `numba` conda channel:

```
conda install -c numba numba
```

In [1]:
import numpy
from numba import njit

Define some reasonably expensive operation in a function.

In [2]:
def do_trig(x, y):
    z = numpy.sin(x**2) + numpy.cos(y)
    return z

We can start with 1000 x 1000 arrays

In [3]:
x = numpy.random.random((1000, 1000))
y = numpy.random.random((1000, 1000))

In [4]:
%timeit do_trig(x, y)

14.6 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Now let's `jit` this function.  What do we expect to get out of this?  Probably nothing, honestly.  As we've seen, `numpy` is pretty good at what it does.

In [5]:
do_trig_jit = njit()(do_trig)

In [6]:
%timeit do_trig_jit(x, y)

36.1 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In the run shown here it's actually slower than the bare `numpy` version (36.1 ms vs. 14.6 ms). So yeah, no improvement.

### BUT

Starting in version 0.34, with help from the Intel Parallel Accelerator team, you can now pass a `parallel` keyword argument to `jit` and `njit`. 

Like this:

In [7]:
do_trig_jit_par = njit(parallel=True)(do_trig)

How do we think this will run?

In [8]:
%timeit do_trig_jit_par(x, y)

19 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


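Not bad for changing a single line: in the run shown here that's roughly a 1.9x speedup over the plain `jit` version (19 ms vs. 36.1 ms), although bare `numpy` (14.6 ms) was still quicker.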

And what if we unroll the array operations like we've seen before?  Does that help us out?

In [9]:
@njit
def do_trig(x, y):
    z = numpy.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            z[i, j] = numpy.sin(x[i, j]**2) + numpy.cos(y[i, j])
    return z

In [10]:
%timeit do_trig(x, y)

35.5 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Hmm, that's actually a hair faster than before.  Cool!

Now let's parallelize it!

In [11]:
@njit(parallel=True)
def do_trig(x, y):
    z = numpy.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            z[i, j] = numpy.sin(x[i, j]**2) + numpy.cos(y[i, j])
    return z

In [12]:
%timeit do_trig(x, y)

36.1 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


What happened?

Well, automatic parallelization is a _pretty hard_ problem.  (This is a massive understatement).

Basically, parallel `jit` is "limited" to parallelizing whole-array operations; it doesn't look inside explicit `for` loops over `range`. So in this case, unrolling the loops will hurt you.
Blarg.

### FAQ that I just made up

- Why didn't you tell us about this before?

It is brand new. The numba team is great, but has a really bad habit of releasing new features 5-10 days before I run a tutorial.

- Is regular `jit` just dead now?

It honestly might be. I've only started playing around with parallel `jit`, but I haven't seen any speed _hits_ from turning it on when there are no array operations to parallelize.

- Is all of that stuff about `vectorize` just useless now?

Short answer: no.  Long answer: Let's check it out!

In [13]:
from numba import vectorize
import math

Recall that we define the function as if it operates on scalars, then apply the vectorize decorator.

In [14]:
@vectorize
def do_trig_vec(x, y):
    z = math.sin(x**2) + math.cos(y)
    return z

In [15]:
%timeit do_trig_vec(x, y)

39.6 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Roughly equivalent to the `jit` versions in this run (39.6 ms vs. about 36 ms). Now let's type our inputs and run it in `parallel`.

In [16]:
@vectorize('float64(float64, float64)', target='parallel')
def do_trig_vec_par(x, y):
    z = math.sin(x**2) + math.cos(y)
    return z

In [17]:
%timeit do_trig_vec_par(x, y)

20.3 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Yowza! So yeah, `vectorize` with `target='parallel'` is still a strong performer when you have element-wise operations (20.3 ms here, neck and neck with the 19 ms from parallel `jit`), but if you have a big mess of stuff that you just want to speed up, then parallel `jit` is an awesome and easy way to try to boost performance.

In [18]:
a = x
b = y
c = numpy.random.random(a.shape)

In [19]:
%%timeit
b**2 - 4 * a * c

12.3 ms ± 752 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [20]:
def discrim(a, b, c):
    return b**2 - 4 * a * c

In [21]:
discrim_vec = vectorize()(discrim)

In [22]:
%timeit discrim_vec(a, b, c)

5.24 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [23]:
discrim_vec_par = vectorize('float64(float64, float64, float64)', target='parallel')(discrim)

In [24]:
%timeit discrim_vec_par(a, b, c)

5.33 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [25]:
discrim_jit = njit()(discrim)

In [26]:
%timeit discrim_jit(a, b, c)

4.87 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
discrim_jit_par = njit(parallel=True)(discrim)

In [28]:
%timeit discrim_jit_par(a, b, c)

5.34 ms ± 696 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
