# Python accelerators

## General principle for performance with Python (not fully valid for PyPy)

Don't use the Python interpreter (and small Python objects) too often for computationally demanding tasks.

Pure Python 

&emsp;&emsp;&emsp;&emsp; → Numpy 

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; → Numpy without too many loops (vectorized) 

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;→ C extensions

But ⚠️ ⚠️ ⚠️ writing a C extension by hand is **not a good idea**! ⚠️ ⚠️ ⚠️
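The first steps of the progression above can be sketched as follows. This is only an illustration with hypothetical function names (not taken from the talk): the same sum of squares written in pure Python, with a NumPy array but an explicit loop, and vectorized.

```python
import numpy as np


def sum_of_squares_python(values):
    """Pure Python: the interpreter handles every element (small float objects)."""
    total = 0.0
    for value in values:
        total += value * value
    return total


def sum_of_squares_numpy_loop(arr):
    """NumPy array, but still an explicit Python loop over the elements."""
    total = 0.0
    for i in range(arr.size):
        total += arr[i] * arr[i]
    return total


def sum_of_squares_vectorized(arr):
    """Vectorized NumPy: the loop runs in compiled code, not in the interpreter."""
    return float(np.dot(arr, arr))


data = np.arange(1000, dtype=np.float64)
results = (
    sum_of_squares_python(data.tolist()),
    sum_of_squares_numpy_loop(data),
    sum_of_squares_vectorized(data),
)
```

All three give the same value; only the number of trips through the interpreter differs.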

### *No need to leave the Python language to avoid overusing the Python interpreter!*

# Tools to 

- compile Python 

- write C extensions without writing C

Cython, Numba, Pythran, Transonic, ...


## Cython

- Language: superset of Python

- A great mix of Python / C / CPython C API! 

  Very powerful but a tool for experts!

- Easy to study where the interpreter is used (`cython --annotate`).

- Very mature

- Now able to use Pythran internally...

My experience: large Cython extensions are difficult to maintain

## Numba: (per-method) JIT for Python-Numpy code

- Very simple to use (just add a few decorators) 🙂

In [6]:
from numba import jit

@jit
def myfunc(x):
    return x**2

- "nopython" mode (fast, and the GIL can be released with `nogil=True`) 🙂

- Also a "python" mode (called "object mode" in Numba) 🙂

- GPU and Cupy 😀

- Methods (of classes) 🙂
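A minimal sketch of the "nopython" mode mentioned above, using `njit` (Numba's shorthand for `jit(nopython=True)`). The function `pairwise_sum` is a hypothetical example, and the no-op fallback decorator is an assumption added only so that the snippet also runs (slowly) without Numba installed.

```python
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption for this sketch: without Numba, fall back to a no-op
    # decorator so that the plain Python function is used.
    def njit(func):
        return func


@njit
def pairwise_sum(x):
    # Explicit loops are fine here: in nopython mode they are compiled.
    n = x.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(x[i] - x[j])
    return total


x = np.arange(5, dtype=np.float64)
result = pairwise_sum(x)
```

With Numba available, the first call triggers the compilation for `float64` arrays; later calls reuse the compiled version.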

## Python decorators

In [7]:
def mydecorator(func):
    # do something with the function
    print(func)
    # return a(nother) function
    return func

In [8]:
@mydecorator
def myfunc(x):
    return x**2

<function myfunc at 0x7fc5bd76f378>


This mysterious syntax with `@` is just syntactic sugar for:

In [9]:
def myfunc(x):
    return x**2

myfunc = mydecorator(myfunc)

<function myfunc at 0x7fc5bd76f598>
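The decorator above returns the function unchanged. Accelerators such as Numba rely on the other possibility: returning *another* function that replaces the original. A small sketch of that pattern, with a hypothetical `timed` decorator that only uses the standard library:

```python
import functools
import time


def timed(func):
    """Return a wrapper that records how long each call to func takes."""

    @functools.wraps(func)  # keep func.__name__ and func.__doc__
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_duration = time.perf_counter() - start
        return result

    wrapper.last_duration = None
    return wrapper


@timed
def square(x):
    return x**2


value = square(3)
```

After the call, `square` is actually `wrapper`, and `square.last_duration` holds the duration of the last call in seconds.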


## Numba: (per-method) JIT for Python-Numpy code

- Sometimes not as efficient as it could be 🙁

  (usually slower than Pythran / Julia / C++)

<p class="small"><br></p>

- Only JIT 🙁

<p class="small"><br></p>

- Not good to optimize high-level NumPy code 🙁

## Pythran: AOT compiler for modules using Python-Numpy

Transpiles Python to efficient C++

- Good to optimize *high-level NumPy code* 😎

- Extensions never use the Python interpreter (pure C++ ⇒ no GIL) 🙂

- Can produce C++ that can be used without Python

- Usually **very efficient** (sometimes faster than Julia)

    - High and low level optimizations
    
      (Python optimizations and C++ compilation)

    - SIMD 🤩 (with [xsimd](https://github.com/QuantStack/xsimd)) 

    - Understands OpenMP directives 🤗 !

- Can [use and make PyCapsules](https://serge-sans-paille.github.io/pythran-stories/the-capsule-corporation.html) (functions operating in the native world) 🙂

### High level transformations

In [10]:
from black import format_str, FileMode
from pythran.toolchain import generate_py
import gast as ast
import astunparse

def print_optimized(src):    
    optimized_py = generate_py("bar", src)
    tree = ast.parse(optimized_py)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            fdef = node
            fdef.body = [node for node in fdef.body[:-1] if not isinstance(node, ast.Pass)] + [fdef.body[-1]]
    optimized_code = astunparse.unparse(tree)
    print(format_str(optimized_code, mode=FileMode(line_length=82)))


In [11]:
# range analysis
print_optimized("""
def f(x):
    y = 1 if x else 2
    return y == 3
""")

def f(x):
    return 0



In [12]:
# inlining
print_optimized("""
def foo(a):
    return  a + 1
def bar(b, c):
    return foo(b), foo(2 * c)
""")

def foo(a):
    return a + 1


def bar(b, c):
    return ((b + 1), ((2 * c) + 1))



In [13]:
# unroll loops
print_optimized("""
def foo():
    ret = 0
    for i in range(1, 3):
        for j in range(1, 4):
            ret += i * j
    return ret
""")

def foo():
    ret = 0
    ret += 1
    ret += 2
    ret += 3
    ret += 2
    ret += 4
    ret += 6
    return ret



In [14]:
# constant propagation
print_optimized("""
def fib(n):
    return n if n< 2 else fib(n-1) + fib(n-2)
    
def bar(): 
    return [fib(i) for i in [1, 2, 8, 20]]
""")

import functools as __pythran_import_functools


def fib(n):
    return n if (n < 2) else (fib((n - 1)) + fib((n - 2)))


def bar():
    return [1, 1, 21, 6765]


def bar_lambda0(i):
    return fib(i)



In [15]:
# advanced transformations
print_optimized("""
import numpy as np
def wsum(v, w, x, y, z):
    return sum(np.array([v, w, x, y, z]) * (.1, .2, .3, .2, .1))
""")

import numpy as __pythran_import_numpy


def wsum(v, w, x, y, z):
    return __builtin__.sum(
        ((v * 0.1), (w * 0.2), (x * 0.3), (y * 0.2), (z * 0.1))
    )



## Pythran: AOT compiler for modules using Python-Numpy

- Compile only full modules (⇒ refactoring needed 🙁)

- Only "nopython" mode

    * limited to a subset of Python
    
        - only homogeneous list / dict 🤷‍♀️
        - no methods (of classes) 😢 and no user-defined classes
    
    * limited to a few extension packages (Numpy + bits of Scipy)
    
    * pythranized functions can't call Python functions

- No JIT: need types (written manually in comments)

- Lengthy ⌛️ and memory intensive compilations (especially with gcc, less with clang)

- Debugging 🐜 Pythran requires C++ skills!

- No GPU (maybe with [OpenMP 4](https://www.openmp.org/updates/openmp-accelerator-support-gpus/)?)

- <img src="./fig/logo_intel.png" alt="Intel" align="left" style="width: 7%; margin-bottom: 2px; margin-right: 5px;"> compilers unable to compile Pythran C++11 👎
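To illustrate "types written manually in comments": a minimal sketch of a module that Pythran could compile. The file name `kernels.py` and the function `rms` are hypothetical; the `#pythran export` comment declares the argument types, and the file remains valid Python.

```python
# Contents of a hypothetical module kernels.py
import numpy as np


#pythran export rms(float64[:])
def rms(values):
    """Root mean square of a 1D array of floats."""
    return np.sqrt(np.mean(values**2))


# Without compilation, the module simply runs as plain Python/NumPy code.
value = rms(np.array([3.0, 4.0]))
```

Running `pythran kernels.py` should then produce a native extension module with the same name, so the calling code does not need to change.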

# First conclusions

- Python: a great language & ecosystem for science & data

- Performance issues, especially for crunching numbers 🔢

  *⇒ need to accelerate the "numerical kernels"*

- Many good accelerators and compilers for Python-Numpy code

  - All have pros and cons!

  **⇒ We shouldn't have to write specialized code for one accelerator!**

# Make your numerical Python code fly at transonic speed 🚀 !

## Transonic is landing 🛬 !

*Pure Python package (>= 3.6) to easily accelerate modern Python-Numpy code with different accelerators*

**Work in progress!** Current state: one backend based on Pythran!

<div align="middle">
    <a href="https://transonic.readthedocs.io">https://transonic.readthedocs.io</a>    
</div>

- Keep your Python-Numpy code clean and "natural" 🧘

- Clean type annotations (🐍 3)

- Easily mix Python code and compiled functions

- JIT based on AOT compilers

- Methods (of classes) and blocks of code


## Transonic: examples from real-life packages

- JIT (`@jit`)

  [fluidsim/solvers/plate2d/output/correlations_freq.py](https://bitbucket.org/fluiddyn/fluidsim/src/default/fluidsim/solvers/plate2d/output/correlations_freq.py)

- AOT compilation for functions and methods (`@boost`)

  [fluidfft/fft3d/operators.py](https://bitbucket.org/fluiddyn/fluidfft/src/default/fluidfft/fft3d/operators.py)

- Blocks of code (with `if ts.is_transpiled:`)

  [fluidsim/base/time_stepping/pseudo_spect.py](https://bitbucket.org/fluiddyn/fluidsim/src/default/fluidsim/base/time_stepping/pseudo_spect.py)
  
- Parallelism with a class (adapted from Olivier Borderies)

  [omp/tsp.py](https://gitlab.com/paugier/tsp-pythran/blob/fluid-omp/tsp.py) (OpenMP) and
  [tsp_concurrent.py](https://gitlab.com/paugier/tsp-pythran/blob/fluid/tsp_concurrent.py) (concurrent - threads)
  
  Also compatible with MPI!
 
Also works well in *simple scripts* and *IPython / Jupyter*.
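For readers who do not follow the links, here is a minimal sketch of what `@boost` looks like on a function. The function `row_sum` is hypothetical, and the string annotation `"float[:, :]"` is assumed to follow Transonic's annotation syntax. The no-op fallback is an assumption added only so that the snippet runs without Transonic installed.

```python
import numpy as np

try:
    from transonic import boost
except ImportError:
    # Assumption for this sketch: without Transonic, use a no-op decorator,
    # which mimics what happens when no compiled version is available.
    def boost(func):
        return func


@boost
def row_sum(arr: "float[:, :]", index: int):
    # Clean Python-Numpy code; the types come from the annotations.
    return arr[index].sum()


result = row_sum(np.ones((2, 3)), 1)
```

The code stays plain Python: the same file runs with or without the compiled version.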

## Transonic: how does it work?

- AST analyses (using [Beniget](https://github.com/serge-sans-paille/beniget), no import at compilation time)

In [24]:
# abstract syntax tree
import ast
tree = ast.parse("great_tool = 'Beniget'")
assign = tree.body[0]
# .value replaces the deprecated .s attribute (Python >= 3.8)
print(f"{assign.value.value} is a {assign.targets[0].id}")

Beniget is a great_tool


- Write the (Pythran) files when needed

- Compile the (Pythran) files when needed

- Use the fast solutions when available
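The first step (AST analysis without importing the code) can be sketched with the standard library alone. This is not Transonic's actual implementation, which relies on Beniget; `decorated_functions` and the sample source are hypothetical and only show the idea of finding decorated functions from the source text.

```python
import ast

SOURCE = '''
from transonic import boost, jit

@boost
def fast(a: "float[:]"):
    return a.sum()

@jit
def lazy(a):
    return a * 2

def plain(a):
    return a
'''


def decorated_functions(source, decorator_names=("boost", "jit")):
    """Map function name -> decorator name, without importing the source."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        for decorator in node.decorator_list:
            # Only bare names such as @boost are handled in this sketch.
            if isinstance(decorator, ast.Name) and decorator.id in decorator_names:
                found[node.name] = decorator.id
    return found


functions = decorated_functions(SOURCE)
```

Because the source is only parsed, never executed, the analysis works even when the packages it imports are not installed.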

# Cupy

https://cupy.chainer.org/

Numpy API executed on GPU (Cuda)

# PyTorch

https://pytorch.org/