# Numba Code Demo

In [161]:
import warnings
warnings.filterwarnings("ignore")
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Demo 1: Calculating Pi using Monte Carlo simulation

Basic usage of the Numba JIT compiler

In [162]:
from numba import jit
import random
import timeit

### Without Numba JIT compiler

In [163]:
def monte_carlo_pi_no_jit(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
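The estimator rests on an area ratio. As a sketch of the reasoning, assume `random.random()` draws $x$ and $y$ independently and uniformly from $[0, 1)$. The point $(x, y)$ then lands inside the quarter disc of radius 1 with probability

$$
P(x^2 + y^2 < 1) = \frac{\text{area of quarter disc}}{\text{area of unit square}} = \frac{\pi / 4}{1} = \frac{\pi}{4},
\qquad\text{so}\qquad
\pi \approx 4 \cdot \frac{\texttt{acc}}{\texttt{nsamples}}.
$$

Under the same assumption, the standard error of the estimate is $4\sqrt{\tfrac{\pi}{4}\left(1 - \tfrac{\pi}{4}\right) / n} \approx 1.64 / \sqrt{n}$, which is roughly $0.016$ for $n = 10^4$ samples. Results that differ from $\pi$ in the second decimal place are therefore expected at this sample size.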


Profile the code with the `%timeit`, `%time`, and `%prun` IPython magics

In [165]:
%time monte_carlo_pi_no_jit(int(1e4))  # only calls the function once

%timeit monte_carlo_pi_no_jit(int(1e4))

Wall time: 6.99 ms
5.58 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


`%prun` reports the time spent in each function on the call stack

In [166]:
%prun monte_carlo_pi_no_jit(int(1e4))
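Outside IPython, a comparable report can be produced with the standard-library `cProfile` and `pstats` modules. The following is a minimal sketch, not the notebook's own code: it redefines the function so the snippet is self-contained, and the variable names `profiler`, `buffer`, and `report` are illustrative.

```python
import cProfile
import io
import pstats
import random


def monte_carlo_pi_no_jit(nsamples):
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


# Collect profiling data only while the function of interest runs.
profiler = cProfile.Profile()
profiler.enable()
estimate = monte_carlo_pi_no_jit(10_000)
profiler.disable()

# Write the report into a string, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists one row per function, which makes it visible that most calls go to `random.random()` inside the loop.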

 

### With Numba JIT compiler

**IMPORTANT**: always try `nopython=True` (or the equivalent `@njit`) first

In [168]:
@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

You can also call the `jit` decorator as a regular function.

`jitted` is then equivalent to `monte_carlo_pi`.

In [169]:
jitted = jit(nopython=True)(monte_carlo_pi_no_jit)

The first call is slow because Numba compiles the function on first use

In [171]:
%time monte_carlo_pi_no_jit(int(1e4)) 
%time jitted(int(1e4))
%time monte_carlo_pi(int(1e4))

Wall time: 4.97 ms
Wall time: 0 ns
Wall time: 0 ns


3.1748
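A `%time` reading such as `0 ns` most likely reflects the resolution of the wall clock rather than an instantaneous call. The sketch below uses `time.perf_counter` and a hypothetical helper, `time_call`, to separate the first (compiling) call from later ones. The `try/except` fallback is an assumption added so that the snippet still runs where Numba is not installed; in that case no compilation happens and all calls take similar time.

```python
import random
import time

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, fall back to a no-op decorator so the sketch still runs.
    def njit(func):
        return func


@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


def time_call(func, *args):
    """Return (elapsed seconds, result) for one call, using a high-resolution clock."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


# With Numba installed, the first call includes the compilation time.
first_elapsed, _ = time_call(monte_carlo_pi, 10_000)
later_elapsed = [time_call(monte_carlo_pi, 10_000)[0] for _ in range(3)]

print(f"first call:  {first_elapsed * 1e3:.3f} ms")
print(f"later calls: {[f'{t * 1e3:.3f} ms' for t in later_elapsed]}")
```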

The second and later calls are very fast

In [172]:
%time monte_carlo_pi_no_jit(int(1e4)) 
%time jitted(int(1e4))
%time monte_carlo_pi(int(1e4))

Wall time: 5.01 ms
Wall time: 959 µs
Wall time: 0 ns


3.106

### Achieve ~40x speedup

In [173]:
n_points = int(1e4)
repeat_times = 1000

time_jit = timeit.timeit(lambda: monte_carlo_pi(n_points), number=repeat_times)
print(f"Time with JIT: {time_jit} sec")

time_nonjit = timeit.timeit(lambda: monte_carlo_pi_no_jit(n_points), number=repeat_times)
print(f"Time without JIT: {time_nonjit} sec\n"
      f"Speed up: {time_nonjit/time_jit: .1f}x")

Time with JIT: 0.12212749999889638 sec
Time without JIT: 5.021202199997788 sec
Speed up:  41.1x


## Demo 2: Numba Object mode

Comparison of `nopython` mode vs. `object` mode

In [None]:
import numpy as np
from numba import njit, vectorize, prange

In [None]:
# @njit
def object_function(input_list):
    output_list = []
    for item in input_list:
        if item % 2 == 0:
            output_list.append(2)
        else:
            output_list.append(1)
    return output_list

### `njit` or `jit(nopython=True)` gives the best performance

In [None]:
njitted_function = jit(nopython=True)(object_function)
# njitted_function = njit()(object_function)

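### `jit(forceobj=True)` (object mode) gives the best compatibility (slow)

In [None]:
# forceobj=True requests object mode explicitly. A bare jit() only falls back to
# object mode in older Numba releases, and only when nopython compilation fails;
# from Numba 0.59 on, a bare jit() compiles in nopython mode.
jitted_function = jit(forceobj=True)(object_function)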

### Comparison

In [None]:
test_list = [*range(100000)]
test_ndarray = np.arange(100000)

#### Original for-loop

In [None]:
%time object_function(test_list)[0:5]

#### Compiled with `nopython` mode

In [None]:
# %time njitted_function(test_list)[0:5]
%time njitted_function(test_ndarray)[0:5]

#### Compiled with `object` mode

In [None]:
%time jitted_function(test_list)[0:5]

## Demo 3: Vectorization with Numba

In [None]:
import numpy as np
from numba import vectorize

In [None]:
@vectorize(nopython=True)
def non_list_function(item):
    if item % 2 == 0:
        return 2
    else:
        return 1

test_ndarray = np.arange(100000)

#### First time is slow due to compilation

In [None]:
%time non_list_function(test_ndarray)

#### Second time is a lot faster

In [None]:
%time non_list_function(test_ndarray)

#### Profile the function more accurately with `%timeit`

In [None]:
%timeit non_list_function(test_ndarray)

## Demo 4: Parallelization with Numba


In [180]:
import numpy as np
from numba import njit, prange

In [181]:
@njit(parallel=True)
def prange_forloop(A):
    s = 0
    # Without "parallel=True" in the jit-decorator
    # the prange statement is equivalent to range
    for i in prange(A.shape[0]):
        s += A[i]
    return s

def regular_forloop(A):
    s = 0
    for i in range(A.shape[0]):
        s += A[i]
    return s

test_ndarray = np.arange(100000)

#### First time is slow due to compilation

In [183]:
%time np.sum(test_ndarray)
%time prange_forloop(test_ndarray)

Wall time: 1 ms
Wall time: 0 ns


4999950000

#### Profile the function more accurately with `%timeit`

In [184]:
%timeit np.sum(test_ndarray)

37 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


#### Second time is a lot faster

In [185]:
%timeit prange_forloop(test_ndarray)

14.3 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [186]:
%timeit regular_forloop(test_ndarray)

19.4 ms ± 529 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Demo 5: Programming a CUDA-enabled GPU with Numba

You will need to install `cudatoolkit` for this to work

Due to the firewall, we cannot run `conda install cudatoolkit` directly

Download the package from [here](https://anaconda.org/anaconda/cudatoolkit/11.0.221/download/win-64/cudatoolkit-11.0.221-h74a9793_0.tar.bz2), then run `conda install [package name]` in a terminal

In [174]:
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
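Numba CUDA kernels are launched with a configuration in square brackets, `kernel[blocks_per_grid, threads_per_block](...)`, and `cuda.grid(2)` then gives each thread its own `(i, j)` position. The sketch below shows one way to derive such a configuration for this demo's 10000 × 10000 matrices. The 16 × 16 block size and the variable names are illustrative assumptions, not values from the notebook. The launch itself is left as a comment because it needs a CUDA-capable GPU.

```python
import math

# Assumption: 16 x 16 threads per block is an illustrative choice, not taken from the demo.
threads_per_block = (16, 16)
matrix_shape = (10000, 10000)  # shape of C in this demo

# One thread per output element: round up so the grid covers every row and column.
blocks_per_grid = tuple(
    math.ceil(size / threads)
    for size, threads in zip(matrix_shape, threads_per_block)
)
print(blocks_per_grid)  # (625, 625)

# On a machine with a CUDA-capable GPU, the kernel would then be launched as:
# matmul[blocks_per_grid, threads_per_block](a, b, c)
```

When the matrix size is not a multiple of the block size, rounding up creates some threads that fall outside the matrix. The `if i < C.shape[0] and j < C.shape[1]` check in the kernel makes those threads do nothing.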

In [175]:
a = np.random.random((10000, 10000))
b = np.random.random((10000, 10000))
c = np.empty_like(a)  # for storing the results

`a`, `b`, and `c` each take about 800 MB (10000 × 10000 `float64` values at 8 bytes each)

### `Numpy` native matrix multiplication

In [177]:
%time _ = np.matmul(a, b)

Wall time: 12.8 s


### `CUDA` accelerated matrix multiplication

In [178]:
%time matmul(a, b, c)

Wall time: 1.52 s


In [179]:
%time matmul(a, b, c)

Wall time: 943 ms
