<center>
    
# 1 - Introduction to `Numba`
    
<img src="imgs/numba_logo.png" alt="numba" width="300"/>

</center>

# Two approaches to writing scientific or numerical software:

- Traditionally, everything was written in **C** or **C++**, and **Python** wrappers were then added as a user-friendly interface to the code (*bottom-up*).
- Nowadays, everything is written in **Python** and, only where performance requires it, the code is sped up with **Cython** or **Numba** (*top-down*).

<center>
<img src="imgs/two_approaches.png" alt="numba" width="400"/>
   
##### "For day-to-day scientific data exploration, speed-of-development is primary, and speed-of-execution is often secondary."
    
Jake Vanderplas.

</center>

# Just-in-time Compilation with Numba

<center>
<img src="imgs/compiled_vs_interpreted.png" alt="numba" width="500"/>
</center>


- Numba compiles Python functions *on the fly* to machine code using LLVM.
- Easy to use: just *decorate* your Python function with `@numba.jit`.
- Works with NumPy arrays.
- Enables parallelization (use all the CPU cores in your machine).

In [None]:
import os
import time
import numpy as np
import urllib.request
import numba
from utils import show_images
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.util import img_as_float

# 0) Let's load some data! 

For this introduction we are going to use an image that fits in memory. We are going to load an RGB [picture of Manhattan](https://unsplash.com/photos/5ULk8EgE8tg) taken by Miltiadis Fragkidis. 

In [None]:
# Download image (approx. 2 MB)
url = "https://unsplash.com/photos/5ULk8EgE8tg/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc2OTAxMjYx&force=true"
os.makedirs("data", exist_ok=True)  # make sure the target folder exists
manhattan_image = os.path.join("data", "manhattan.jpg")
urllib.request.urlretrieve(url, manhattan_image);

In [None]:
# Load image and convert to float

img = img_as_float(imread(manhattan_image))

print(f"The image has shape {img.shape}")
print(f"The full image has {np.prod(img.shape[:2]) / 1e6 : .0f} MPix, and occupies {img.nbytes / 1e6:.0f} MB in RAM")

# RGB -> Grayscale
img_gray = rgb2gray(img)

# Plot
show_images(images=[img, img], zoom=[False, True], titles="Original ")

# Uniform filter 

As part of a pipeline, we want to smooth the image with a 2-D uniform (box) filter of the form:

$$
\frac{1}{9} \cdot \begin{pmatrix} 
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1 \\
\end{pmatrix} 
$$

This filter is applied at each pixel of the image, via a convolution.  
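
To make the operation concrete, the sketch below applies the kernel to a single 3x3 patch with NumPy. The patch values are arbitrary illustration data. Because all weights are equal, the result is simply the mean of the patch, and since the kernel is symmetric, convolution and correlation give the same answer.

```python
import numpy as np

# A made-up 3x3 neighborhood around one pixel
patch = np.arange(9, dtype=np.float64).reshape(3, 3)

# The uniform kernel: nine equal weights that sum to 1
kernel = np.full((3, 3), 1 / 9)

# Output value for the central pixel: weighted sum of the neighborhood
center = np.sum(patch * kernel)

print(center, patch.mean())
```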

## 1) Let's first attempt this with pure Python code and time the performance. 

In [None]:
def uniform_filter(image):
    out = np.zeros_like(image)
    for i in range(1, image.shape[0] - 1):
        for j in range(1, image.shape[1] - 1):
            out[i, j] = (
                  image[i + -1, j + -1]
                + image[i + -1, j +  0]
                + image[i + -1, j +  1]
                + image[i +  0, j + -1]
                + image[i +  0, j +  0]
                + image[i +  0, j +  1]
                + image[i +  1, j + -1]
                + image[i +  1, j +  0]
                + image[i +  1, j +  1]
            ) / 9
    return out

In [None]:
%time smooth_img_gray = uniform_filter(img_gray)

show_images(
    images=[img_gray, smooth_img_gray, img_gray - smooth_img_gray],
    titles=["Original", "Smoothed", "Difference"],
    zoom=True,
    cmap=[None, None, "Accent"],
)

On my machine, this took around 26 seconds. 

Python is very slow at `for` loops because it uses [dynamic typing](https://stackoverflow.com/a/1517670): at each iteration, the interpreter must check the types of the operands before it can carry out the operation. 
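
A small illustration of what dynamic typing means in practice (the function `add` is a made-up example): the same function body works for integers, floats and strings, so the interpreter has to work out what `+` means every time the line runs.

```python
def add(a, b):
    # The meaning of "+" depends on the run-time types of a and b,
    # so it is looked up again on every call
    return a + b


print(add(1, 2))           # integer addition
print(add(1.5, 2))         # float addition
print(add("nu", "mba"))    # string concatenation
```

Inside a double loop over millions of pixels, this per-operation lookup is repeated for every addition and every index access.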



`Numba` can inspect the code at run time and optimize away repeated or unneeded operations, which can result in large speed-ups. This is known as Just-in-time (JIT) compilation. `Numba` compiles Python functions that are decorated with `@numba.jit`. 

In [None]:
import numba

@numba.jit 
def numba_uniform_filter(image):
    out = np.zeros_like(image)
    for i in range(1, image.shape[0] - 1):
        for j in range(1, image.shape[1] - 1):
            out[i, j] = (
                  image[i + -1, j + -1]
                + image[i + -1, j +  0]
                + image[i + -1, j +  1]
                + image[i +  0, j + -1]
                + image[i +  0, j +  0]
                + image[i +  0, j +  1]
                + image[i +  1, j + -1]
                + image[i +  1, j +  0]
                + image[i +  1, j +  1]
            ) / 9
    return out

# We could also reuse the previous function and compile it. 
# Feel free to try if this would yield the same functionality:

# @numba.jit 
# def numba_uniform_filter(image):
#     return uniform_filter(image)


In [None]:
# The first time that we run the decorated Python function, Numba will compile it
numba_uniform_filter(img_gray)
# Time without compilation
%timeit smooth_img_gray = numba_uniform_filter(img_gray)

On my machine, this took around 82 milliseconds. 

That means that adding a single line of code (`@numba.jit`) makes our filter about **317 times faster**.

### Providing arguments to `@jit`
Numba allows us to provide some arguments that can accelerate the code even more. 

In [None]:
import numba
@numba.jit(parallel=True, nogil=True, fastmath=True)
def numba_uniform_filter(x):
    out = np.zeros_like(x)
    # rule of thumb -- parallelize the outermost loop
    for i in numba.prange(1, x.shape[0] - 1):  
        for j in range(1, x.shape[1] - 1):
            out[i, j] = (
                  x[i + -1, j + -1]
                + x[i + -1, j +  0]
                + x[i + -1, j +  1]
                + x[i +  0, j + -1]
                + x[i +  0, j +  0]
                + x[i +  0, j +  1]
                + x[i +  1, j + -1]
                + x[i +  1, j +  0]
                + x[i +  1, j +  1]
            ) / 9
    return out

In [None]:
# Time execution

# 1) The first time that we run the decorated Python function, Numba will compile it
numba_uniform_filter(img_gray)

# 2) Time without compilation
%timeit smooth_img_gray = numba_uniform_filter(img_gray)

Let's compare that with a NumPy implementation and with SciPy's optimized `scipy.ndimage.uniform_filter`.

In [None]:
def numpy_uniform_filter(image):
    out = np.zeros_like(image)
    
    out[1:-1, 1:-1] += image[2:, 2:]
    out[1:-1, 1:-1] += image[2:, 1:-1]
    out[1:-1, 1:-1] += image[2:,:-2]
    out[1:-1, 1:-1] += image[1:-1, 2:]
    out[1:-1, 1:-1] += image[1:-1,1:-1]
    out[1:-1, 1:-1] += image[1:-1,:-2]
    out[1:-1, 1:-1] += image[:-2, 2:]
    out[1:-1, 1:-1] += image[:-2,1:-1]
    out[1:-1, 1:-1] += image[:-2,:-2]
    out /= 9
    return out

%timeit smooth_img_gray = numpy_uniform_filter(img_gray)

In [None]:
import scipy.ndimage

%timeit smooth_img_gray = scipy.ndimage.uniform_filter(img_gray, size=3, mode="constant", cval=0)
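
Before comparing timings, it is worth checking that the implementations compute the same thing. The sketch below repeats the slicing idea on a small random array (synthetic data, not the Manhattan image) and compares it with SciPy. It assumes the two versions should agree only on interior pixels: the hand-written filters leave the one-pixel border at zero, whereas SciPy pads the image with zeros and still computes a value there.

```python
import numpy as np
import scipy.ndimage

rng = np.random.default_rng(0)
image = rng.random((32, 48))
rows, cols = image.shape

# Slicing-based 3x3 mean on the interior; the border stays at zero
sliced = np.zeros_like(image)
for di in (-1, 0, 1):
    for dj in (-1, 0, 1):
        sliced[1:-1, 1:-1] += image[1 + di : rows - 1 + di, 1 + dj : cols - 1 + dj]
sliced /= 9

reference = scipy.ndimage.uniform_filter(image, size=3, mode="constant", cval=0)

interior_match = np.allclose(sliced[1:-1, 1:-1], reference[1:-1, 1:-1])
full_match = np.allclose(sliced, reference)
print("interior agrees:", interior_match)
print("whole image agrees:", full_match)
```

The border difference is a design choice rather than a bug, but it should be kept in mind when comparing outputs.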

# Stencils

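Stencils are a class of linear operators where each output $y[i]$ is given by a weighted linear combination of the neighborhood of $x[i]$: 

$$y[i] = \sum_{k\in\mathcal{N}}\alpha_{k}\, x[i-k]$$

where $\mathcal{N}$ is the set of neighbor offsets and $\alpha_{k}$ is the weight of offset $k$.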

Notable examples include multi-dimensional convolution and correlation.

**Numba** provides the [`@stencil` decorator](https://numba.pydata.org/numba-doc/latest/user/stencil.html) so that users can easily specify a stencil kernel; Numba then generates the looping code necessary to apply that kernel to an input array. 

Thus, the stencil decorator allows clearer, more concise code and, in conjunction with the `parallel` jit option, enables higher performance by parallelizing the stencil execution.

#### Let's create a Numba stencil, and JIT compile it.  

In [None]:
@numba.stencil
def _smooth_stencil(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) / 9


@numba.jit(parallel=True)
def smooth_stencil(x):
    return _smooth_stencil(x)

# Time execution

# 1) The first time that we run the decorated Python function, Numba will compile it
smooth_stencil(img_gray)

# 2) Time without compilation
%timeit smooth_img_gray = smooth_stencil(img_gray)

# Numba benchmarking

| Implementation                          | Time      | Speed-up (approx.) |
|-----------------------------------------|-----------|--------------------|
| Single CPU core (pure Python)           |  26 s     |  x1                |
| Single CPU core (NumPy)                 | 260 ms    |  x100              |
| Single CPU core (SciPy)                 | 178 ms    |  x150              |
| Single CPU core (Numba)                 |  82 ms    |  x300              |
| 16 CPU cores (Numba, `parallel=True`)   |  50 ms    |  x500              |
| 16 CPU cores (Numba, stencil)           |  25 ms    |  x1000             |


### Some notes:
The argument `nopython=True` in Numba's `@jit` decorator selects a compilation mode that generates code that does not access the Python C API. This compilation mode produces the fastest code, but requires that the native types of all values in the function can be inferred. In Numba versions before 0.59, unless otherwise instructed, the `@jit` decorator automatically falls back to object mode if `nopython` mode cannot be used; from version 0.59 onwards, `nopython` mode is the default and compilation fails with an error instead.

As a side note, if compilation time is an issue, Numba JIT supports on-disk caching of compiled functions and also has an [Ahead-Of-Time](https://numba.readthedocs.io/en/stable/user/pycc.html) compilation mode.

# Bonus:
Other JIT [options](https://numba.readthedocs.io/en/stable/user/jit.html#compilation-options) to play with: 
- fastmath
- nopython
- cache
- nogil

# Bonus:
[Numba's `@guvectorize`](https://numba.pydata.org/numba-doc/dev/user/vectorize.html)