<h1><center>Python Computation Stack</center></h1>

* __cupy:__ NumPy-like API that runs on the GPU through CUDA kernels.
* __modin:__ Pandas-like API that spreads work across multiple cores.
* __numba (JIT):__ JIT compiler that translates Python code to machine code through LLVM.
* __numpy:__ numpy :D 
* __dask:__ Parallel computing library. The same API runs locally or on a cluster, using lazy evaluation and a task graph (DAG).
* __cython (AOT):__ Compiles Python-like code to C/C++ extension modules; also used to call C/C++ code from Python.
* __pycuda:__ Python interface to Nvidia's CUDA parallel computation API.
* __pypy:__ Alternative implementation of the Python language (an alternative to CPython) with its own JIT compiler.
* __pythran (AOT):__ AOT compiler for a subset of the Python language.

<h1><center>Python GPU Computation Stack</center></h1>

![GPU Stack](./gpu_stack.png)

<h1><center>Setup GPU computation stack</center></h1>


```bash
conda create --name gpu_stack python=3.8 -y
conda activate gpu_stack
conda install ipykernel jupyter nb_conda_kernels pandas numba cudatoolkit tbb
conda install -c conda-forge cupy cudnn cutensor nccl
conda install -c numba icc_rt
numba -s
```

If everything is installed correctly, `numba -s` should print the hardware, driver, and library information.

<!-- #TODO: Write what all is gpu libs. -->

<h1><center>Numba official example</center></h1>

In [89]:
from numba import njit, prange, jit
import numba as nb
import random
import numpy as np

@njit(nogil=True, parallel=True, fastmath=True, target_backend = 'cuda')
def monte_carlo_pi(nsamples):
    acc = 0
    for i in prange(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [91]:
%time monte_carlo_pi(1000_000)
%time monte_carlo_pi.py_func(1000_000)

CPU times: user 18.7 ms, sys: 585 µs, total: 19.3 ms
Wall time: 3.12 ms
CPU times: user 345 ms, sys: 0 ns, total: 345 ms
Wall time: 345 ms


3.14218

In [105]:
def pi_np(n):
    x = np.random.uniform(size = n)
    y = np.random.uniform(size = n)
    return 4 * np.sum(x**2 + y**2 < 1)/n

@njit(nogil=True, parallel=True, fastmath=True)
def monte_carlo_pi_cpu(nsamples):
    acc = 0
    for i in prange(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

%timeit pi_np(1000_000)
%timeit monte_carlo_pi_cpu(1000_000)

23.8 ms ± 952 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.69 ms ± 93.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


* __Note:__ Since Numba is a JIT compiler, the first call includes compilation, so it takes as long as or longer than the pure Python version. Repeated runs, for example with the `timeit` magic, show the performance improvement.
* __Note:__ If `monte_carlo_pi` is a Numba function, use `monte_carlo_pi.py_func(1000_000)` to call the pure Python version of the JIT-compiled function.
* __Note:__ If your runtime is in the `ms` range, it is better to use `from time import perf_counter`. It is a high-resolution clock and is more accurate than `time.time` for short intervals.
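The timing notes above can be sketched with the standard library alone. This is a minimal sketch, not the notebook's own benchmark: `time_calls` and `sum_of_squares` are hypothetical names introduced for illustration, and a plain Python function stands in for the JIT-compiled one. With a Numba function, the first measurement would include compilation time.

```python
from time import perf_counter


def time_calls(fn, *args, repeats=5):
    """Return (first_call_seconds, best_of_remaining_seconds)."""
    start = perf_counter()
    fn(*args)
    first = perf_counter() - start  # for a JIT function this includes compilation

    rest = []
    for _ in range(repeats):
        start = perf_counter()
        fn(*args)
        rest.append(perf_counter() - start)
    return first, min(rest)


def sum_of_squares(n):
    # Hypothetical stand-in workload; replace with the function under test.
    return sum(i * i for i in range(n))


first, best = time_calls(sum_of_squares, 100_000)
print(f"first call: {first * 1e3:.3f} ms, best of rest: {best * 1e3:.3f} ms")
```

Reporting the first call separately keeps the one-off compilation cost from distorting the steady-state number.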

<h1><center>Numba pitfalls</center></h1>

In [111]:
def spam_py(n):
    return n * [1]

@jit
def spam_jit(n):
    return n * [1]

@njit
def spam_njit(n):
    return n * [1]

@njit
def spam_njit_context(n):
    with nb.objmode(res='int64[:]'):
        res = np.asarray(n * [1])
    return res
        

# spam_py(3)
# spam_jit(3)
## NOTE: The following function will fail. We will solve this issue in spam_njit_context
# spam_njit(3) 
# spam_njit_context(3)

* __Note:__ `spam_njit` will fail. With `njit`, Numba raises an error instead of falling back to object mode. In many cases this is better if we want to ensure speed.

* __Note:__ If it is absolutely necessary to use pure Python inside a Numba function, you can use the `nb.objmode` context manager. See `spam_njit_context` for an example.

<h1><center>Functions in python</center></h1>

In [114]:
import dis

def cond():
    x = 3
    if x < 5:
        return 'yes'
    else:
        return 'no'

# Abstract syntax tree (AST) --> raw bytecode --> numeric bytecode
# dis.dis(cond)
# cond.__code__.co_code
# list(cond.__code__.co_code)
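The commented-out calls above can be sketched as a small runnable example. It uses only the standard `dis` module; the exact opcode names and byte values depend on the CPython version, so no particular output is assumed.

```python
import dis


def cond():
    x = 3
    if x < 5:
        return 'yes'
    else:
        return 'no'


# Human-readable view: one opcode name per bytecode instruction.
opnames = [ins.opname for ins in dis.get_instructions(cond)]
print(opnames)

# Raw view: the same instructions as a bytes object...
raw = cond.__code__.co_code
print(type(raw), len(raw))

# ...and as plain integers (the "numeric bytecode").
print(list(raw)[:8])
```

This bytecode is the starting point that Numba analyzes when it compiles a function.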

<h1><center>Numba functions</center></h1>

In [115]:
@nb.jit
def compute(n):
    return n * 2

In [116]:
compute

CPUDispatcher(<function compute at 0x7f408c9dd820>)

In [117]:
compute.overloads

OrderedDict()

In [122]:
compute(3.1)

6.2

In [123]:
compute.overloads

OrderedDict([((int64,),
              CompileResult(typing_context=<numba.core.typing.context.Context object at 0x7f40c8678be0>, target_context=<numba.core.cpu.CPUContext object at 0x7f404ad7fbe0>, entry_point=<built-in method compute of _dynfunc._Closure object at 0x7f404a8d8b80>, typing_error=None, type_annotation=<numba.core.annotations.type_annotations.TypeAnnotation object at 0x7f404a639550>, signature=(int64,) -> int64, objectmode=False, lifted=(), fndesc=<function descriptor 'compute$139'>, library=<Library 'compute' at 0x7f404a6396d0>, call_helper=None, environment=<Environment '_ZN08NumbaEnv8__main__13compute_24139B46c8tJTIeFCjyCbUFRqqOAFv_2fYRdE1AT0EZmkCAA_3d_3dEx' >, metadata={'parfor_diagnostics': ParforDiagnostics, 'parfors': {}, 'pipeline_times': {'nopython': OrderedDict([('0_translate_bytecode', pass_timings(init=2.4320033844560385e-06, run=0.0004631690026144497, finalize=2.1909945644438267e-06)), ('1_fixup_args', pass_timings(init=1.5139958122745156e-06, run=2.175002009

* __Note:__ `compute.overloads` is empty until the function is first called. After a call, it holds the compiled version for the argument types that were passed (the `int64` entry above comes from a call with an integer argument). Whenever a new data type is passed, a new version of the function is compiled and stored in this cache alongside the existing ones.

<h1><center>Rule of thumb for choosing a library</center></h1>

![Rule of thumb](./tree_1.png)

<h1><center>Take away</center></h1>

Three takeaways:

* If you have GPU(s), try CuPy first! 
* If you only have a CPU, use Numba first.
  * Numba supports more NumPy functions
  * If it works, try Pythran to get more performance 
* Each solution supports a different set of NumPy functions.
  * You can easily find out which function doesn't work (program stops :P )
  * Check its documentation to see which functions are provided
  * If A doesn't work, B might work! 

<h1><center>Numba to LLVM 1</center></h1>

![Numba to LLVM 1](./numba_to_llvm_1.png)

<h1><center>Numba to LLVM 2</center></h1>

![Numba to LLVM 2](./pycode_to_llvm.png)

<h1><center>Glimpse into LLVM IR code as string</center></h1>

In [88]:
# TODO: Glimpse into LLVM IR code as string

<h1><center>Numba to Sales Folks</center></h1>

![Numba to Sales Folks](./meme.png)

<h1><center>numba ufuncs + vectorize</center></h1>

In [130]:
# NumPy ufuncs (universal functions such as np.add and np.sin)
# support broadcasting, data type handling, accumulate & reduce.
# Functions that operate on scalars are “universal functions” or ufuncs (see @vectorize below).

np.add(1,2)
np.add(1, [2,3])
np.add([[1,2]],[[3],[4]])
np.add.accumulate([2,3,4,5])

# Numba ufuncs + vectorize
# Write the function for a single element and add static typing.
@nb.vectorize("(int64, int64)")
def add(x, y):
    return x + y
 
add(1,2)
add(1, [2,3,4])
add.accumulate([2,3,4,5])

# np testing: all_close

array([ 2,  5,  9, 14])
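The ufunc features mentioned in the comments (broadcasting, `accumulate`, `reduce`) can be checked with `np.testing`. This sketch uses plain NumPy only; a ufunc built with `@nb.vectorize` is expected to expose the same methods.

```python
import numpy as np

row = np.array([[1, 2]])      # shape (1, 2)
col = np.array([[3], [4]])    # shape (2, 1)

# Broadcasting stretches both operands to shape (2, 2).
summed = np.add(row, col)

# accumulate keeps every partial result, reduce keeps only the last one.
running = np.add.accumulate([2, 3, 4, 5])
total = np.add.reduce([2, 3, 4, 5])

np.testing.assert_array_equal(summed, [[4, 5], [5, 6]])
np.testing.assert_array_equal(running, [2, 5, 9, 14])
assert total == 14

# For floating point results, compare with a tolerance instead of exact equality.
np.testing.assert_allclose(np.add(0.1, 0.2), 0.3)
```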

<h1><center>numba gufuncs + guvectorize</center></h1>

In [131]:
# Functions that operate on higher-dimensional arrays and scalars are
# “generalized universal functions” or gufuncs (see @guvectorize below).

from math import sqrt
from numba import njit, jit, guvectorize
import timeit
import numpy as np

@njit
def square_sum(arr):
    a = 0.
    for i in range(arr.size):
        a = sqrt(a**2 + arr[i]**2)  # sqrt and square are cpu-intensive!
    return a

@guvectorize(["void(float64[:], float64[:])"], "(n) -> ()", target="parallel", nopython=True)
def row_sum_gu(input, output) :
    output[0] = square_sum(input)

@jit(nopython=True)
def row_sum_jit(input_array, output_array) :
    m, n = input_array.shape
    for i in range(m) :
        output_array[i] = square_sum(input_array[i,:])
    return output_array

In [132]:
rows = int(64)
columns = int(1e6)

input_array = np.random.random((rows, columns))
output_array = np.zeros((rows))
output_array2 = np.zeros((rows))

np.testing.assert_equal(row_sum_jit(input_array, output_array), row_sum_gu(input_array, output_array2))
%timeit row_sum_jit(input_array, output_array.copy())
%timeit row_sum_gu(input_array, output_array.copy())

484 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
50.5 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


<h1><center>numba stencil</center></h1>

In [77]:
import numpy as np
from numba import stencil

def conv_op(a, b):
    for i in range(a.shape[0]):
        if i-1 < 0 or i+1 >= a.shape[0]:
            b[i] = 0
        else:
            b[i] = a[i-1] + a[i] + a[i+1]

In [78]:
input_arr = np.arange(1_000_000)
output_arr = np.empty_like(input_arr)

%time conv_op(input_arr,output_arr)

CPU times: user 565 ms, sys: 3.95 ms, total: 569 ms
Wall time: 568 ms


In [79]:
@stencil
def conv_op(a):
    return a[-1] + a[0] + a[1]

In [82]:
%time output_arr = conv_op(input_arr)
%time output_arr = conv_op(input_arr)
%time output_arr = conv_op(input_arr)

# Reference:
# https://coderzcolumn.com/tutorials/python/numba-stencil-decorator
# TODO:
# http://jakevdp.github.io/blog/2013/08/07/conways-game-of-life/
# https://towardsdatascience.com/math-neural-network-from-scratch-in-python-d6da9f29ce65

CPU times: user 117 ms, sys: 43 µs, total: 117 ms
Wall time: 115 ms
CPU times: user 91.5 ms, sys: 0 ns, total: 91.5 ms
Wall time: 91.5 ms
CPU times: user 91.4 ms, sys: 0 ns, total: 91.4 ms
Wall time: 91.4 ms
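To cross-check the stencil output, the same three-point neighbourhood sum can be written with NumPy slicing. This is a sketch for verification only; `three_point_sum` is a hypothetical helper, and it assumes the border elements are set to 0, as in the loop version above.

```python
import numpy as np


def three_point_sum(a):
    """Sum of each element and its two neighbours; borders are left at 0."""
    out = np.zeros_like(a)
    # a[:-2] is the left neighbour, a[1:-1] the centre, a[2:] the right neighbour.
    out[1:-1] = a[:-2] + a[1:-1] + a[2:]
    return out


small = np.arange(10)
print(three_point_sum(small))
# -> [ 0  3  6  9 12 15 18 21 24  0]
```

Comparing this against the stencil result with `np.testing.assert_array_equal` is a quick sanity check before trusting the timings.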


<h1><center>numba cfunc</center></h1>

In [81]:
from numba import cfunc

@cfunc("float64(float64, float64)")
def c_add(x, y):
    return x + y

print(c_add.ctypes(4.0, 5.0))

9.0


<h1><center>CUDA as target backend</center></h1>

In [68]:
from numba import cuda

@cuda.jit
def multiply(a, b, c): 
    i1, i2 = cuda.grid(2)
    the_sum = 0
    for k in range(b.shape[0]):
        the_sum += a[i1][k]*b[k][i2]
    c[i1, i2] = the_sum

In [67]:
a = np.arange(6).reshape(2,3)
b = np.arange(12).reshape(3,4)

d_a = cuda.to_device(a) # Sending stuff to GPU
d_b = cuda.to_device(b) # Sending stuff to GPU
c = np.zeros((a.shape[0], b.shape[1]))
d_c = cuda.to_device(c) # Sending stuff to GPU
multiply[(1,), (2,4)](d_a, d_b, d_c)
print(d_c.copy_to_host()) # Getting things back in host

[[20. 23. 26. 29.]
 [56. 68. 80. 92.]]




<h1><center>How to check single & packed instructions</center></h1>

In [84]:
# Check NB 2

<h1><center>Using Numba with Pandas</center></h1>

In [83]:
import pandas as pd
data = pd.Series(range(1_000_000))
roll = data.rolling(10)
def f(x):
    return np.sum(x) + 5
%timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)

387 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


<h1><center>My two cents regarding using numba</center></h1>

* Numba functions take longer on the first call because that is when they are compiled. Do not be surprised. 
* If your function falls back to object mode, it may run slower. To ensure that this does not happen, use either `njit` or `jit(nopython=True)`. If you are using `jit`, take a close look at the warnings. By the way, `@jit(nopython=True)` and `@njit` are the same thing.
* Do not use Python lists. Use NumPy arrays for scientific computing.
* Whenever possible use `@vectorize`. Write the function for a scalar and decorate it with `@vectorize`; it will then work for both scalars and arrays. `ufuncs` and `@vectorize` go together.
* When writing optimized code, note that things will not always be pythonic. The code will look more like C or Fortran, but that's alright. For example, `for index in indexes` will change to `for index in range(len(indexes))`.
* `float32`s are great. Use them wherever the reduced precision is acceptable.
* Remember that Numba supports a limited set of functions. A good starting point is to check the Numba documentation on supported Python and NumPy features.
* To run on multiple threads, use `@njit(nogil=True)`, which releases the GIL. Without it, you will not benefit from using `ThreadPoolExecutor`.
* Using a thread pool executor is similar in effect to using the `parallel=True` flag in the `@njit` decorator. Make sure the problem is `embarrassingly parallel`. If you use `parallel=True`, use `prange` in place of `range` and install TBB.

* If you don't care much about floating point precision, `fastmath=True` is your friend.
* LLVM takes care of the backend and the different architectures, so targeting the CPU or the GPU is less stressful.
* To check all the dependencies use `numba -s`.
* MKL, BLAS, SVML, and TBB are great; explore them. If you have an Intel CPU, MKL + Intel Python is great for generating synthetic data.
* If you have a CUDA-enabled GPU, use `target='cuda'`.
* `numba.stencil` is great for convolution, sliding windows, or any other neighborhood computation. For C callbacks use `numba.cfunc`.
* NumPy ships a test suite (`np.testing`) with all sorts of numerical comparisons. Use it wherever possible.
* Measure, measure & measure. Snakeviz is a good profiler if you prefer web UIs. Prioritize what needs to be optimized.
* `deepcopy` does not work with numba.
* Talk about the default profiler which comes with numba. `from time import perf_counter`. We can use `foo.py_func()` to call the original Python function.
* Talk about single and multiple instructions in numba.
* Example of how we are using this to calculate DTW distance using numba in multiplier.
* NOTE: `pdb` does not work out of the box with numba.
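Three of the points above (index-based loops, `float32`, and the NumPy test helpers) can be illustrated together. This is a minimal sketch in plain NumPy: `norm_loop` is a hypothetical function written in the C-like style that Numba compiles well, and the tolerances are assumptions chosen for this example, not recommended values.

```python
import numpy as np

rng = np.random.default_rng(0)
x64 = rng.random(10_000)          # float64 reference data
x32 = x64.astype(np.float32)      # same data at half the memory footprint


def norm_loop(arr):
    # Explicit index loop instead of "for value in arr".
    acc = 0.0
    for i in range(len(arr)):
        acc += arr[i] * arr[i]
    return acc ** 0.5


reference = np.linalg.norm(x64)

# The hand-written loop should agree with NumPy to near machine precision.
np.testing.assert_allclose(norm_loop(x64), reference, rtol=1e-10)

# float32 is less precise, so the comparison needs a looser tolerance.
np.testing.assert_allclose(np.linalg.norm(x32), reference, rtol=1e-4)
```

Adding `@njit` to a function like `norm_loop` is the usual next step; checks such as these help confirm that the optimized version still matches the reference.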