

# Introduction
<hr style = "border:2px solid black" ></hr>


**What?** NumPy vs. Numba vs. Cython



# Imports
<hr style = "border:2px solid black" ></hr>

In [1]:
import random
import numpy as np
import numba

# Options to speed up the computation
<hr style = "border:2px solid black" ></hr>


- Vectorization via `NumPy`, which makes use of NumPy's vectorization capabilities to improve execution time.
- Dynamic compiling via `Numba`, which dynamically compiles pure Python code using LLVM technology.
- Static compiling via `Cython`, a hybrid language that combines Python and C; it allows one, for instance, to use static type declarations and to statically compile the adjusted code.
- Multiprocessing and hyperthreading are not considered here because they are just a different way to run code that has been optimised in one of the three ways just described.



# Python
<hr style = "border:2px solid black" ></hr>


- Let's write a simple algorithm that computes the average of some random numbers.
- This is done in pure Python and sets the baseline for the comparisons that follow.



In [3]:
def average_py(n):
    s = 0  
    for i in range(n):
        s += random.random()  
    return s / n  

In [4]:
n = 10000000  

In [5]:
# Time it once
%time average_py(n)  

CPU times: user 1.37 s, sys: 5.5 ms, total: 1.38 s
Wall time: 1.38 s


0.5000439315770325

In [6]:
# Time it several times for a more reliable estimate
%timeit average_py(n)  

1.38 s ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
# Uses a list comprehension instead of the function.
%time sum([random.random() for _ in range(n)]) / n  

CPU times: user 1.43 s, sys: 161 ms, total: 1.59 s
Wall time: 1.59 s


0.5000041921827179
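
The `%time` and `%timeit` magics only exist inside IPython/Jupyter. As a minimal sketch, assuming the same baseline function, the standard-library `timeit` module gives a comparable measurement in a plain Python script. The helper name `best_time` and the smaller `n` are illustrative choices, not part of the original notebook.

```python
import random
import timeit


def average_py(n):
    """Average n uniform random numbers with a plain Python loop."""
    s = 0.0
    for _ in range(n):
        s += random.random()
    return s / n


def best_time(func, n, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    timer = timeit.Timer(lambda: func(n))
    return min(timer.repeat(repeat=repeat, number=1))


n = 200_000  # much smaller than the notebook's n, to keep the demo quick
print(f"average_py: {best_time(average_py, n):.4f} s for n={n}")
```

Taking the minimum over several runs is one common convention for reducing noise from other processes; `%timeit` reports mean and standard deviation instead.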

# NumPy 
<hr style = "border:2px solid black" ></hr>


- The strength of NumPy lies in its vectorization capabilities.
- The looping takes place one level deeper based on optimized and compiled routines provided by `NumPy`.



In [8]:
def average_np(n):
    s = np.random.random(n)
    return s.mean()

In [9]:
%time average_np(n)

CPU times: user 108 ms, sys: 27.3 ms, total: 135 ms
Wall time: 134 ms


0.4998682171843504

In [10]:
%timeit average_np(n)

95.4 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [11]:
# How much data did we actually use?
s = np.random.random(n)
s.nbytes  

80000000


- **Where is the catch?** Keep an eye on the memory (RAM)!
- The price you pay for the speedup (which does not mean it is bad!) is significantly higher memory usage: here, 10,000,000 values at 8 bytes each, or 80 MB.
- This is because NumPy attains its speed by preallocating data that can then be processed in the compiled layer.
- As a consequence, this approach, as written, cannot work with **streamed data**. Depending on the algorithm or problem at hand, the increased memory usage might even be prohibitively large.
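
If the memory footprint is a concern, one common workaround (not part of the original comparison) is to generate and reduce the data in fixed-size chunks. The sketch below assumes a hypothetical helper `average_np_chunked` with an illustrative `chunk_size`; it keeps at most one chunk in memory at a time, at the cost of a short Python-level loop.

```python
import numpy as np


def average_np_chunked(n, chunk_size=1_000_000):
    """Average n uniform random numbers, holding at most chunk_size values at once."""
    total = 0.0
    remaining = n
    while remaining > 0:
        size = min(chunk_size, remaining)
        # Each chunk is summed in NumPy's compiled layer, then discarded.
        total += np.random.random(size).sum()
        remaining -= size
    return total / n


n = 10_000_000
print(average_np_chunked(n))
# Memory held by one chunk of float64 values: chunk_size * 8 bytes
print(np.dtype(np.float64).itemsize * 1_000_000)  # 8000000
```

With these settings a chunk occupies about 8 MB instead of the 80 MB needed for the full array.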



# Numba
<hr style = "border:2px solid black" ></hr>


- Numba is a package that dynamically compiles pure Python code using LLVM.
- The combination of pure Python with Numba beats the NumPy version **and preserves** the memory efficiency of the original loop-based implementation.
- In simple cases such as this one, applying Numba comes with hardly any programming overhead.



In [12]:
average_nb = numba.jit(average_py)  

In [13]:
# The first call triggers compilation at runtime, which adds some overhead.
%time average_nb(n)

CPU times: user 809 ms, sys: 188 ms, total: 997 ms
Wall time: 617 ms


0.5000909046501579

In [14]:
# From the second call on (with the same input data types), the compiled code is reused and execution is faster.
%time average_nb(n)  

CPU times: user 67.5 ms, sys: 591 µs, total: 68.1 ms
Wall time: 68.1 ms


0.5000941871987323

In [15]:
# Time it several times for a more reliable estimate.
%timeit average_nb(n)  

67.3 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
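
The same idea is often written with a decorator instead of wrapping an existing function. The following is a minimal sketch, assuming Numba's `njit` decorator; the no-op fallback and the `timed` helper are illustrative additions so that the snippet still runs (uncompiled) where Numba is not installed.

```python
import random
import time

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: a no-op decorator stands in
    # for Numba, so the function simply runs as plain Python.
    def njit(func):
        return func


@njit
def average_nb(n):
    """Average n uniform random numbers; compiled on first call if Numba is present."""
    s = 0.0
    for _ in range(n):
        s += random.random()
    return s / n


def timed(func, n):
    """Return (result, elapsed seconds) for a single call of func(n)."""
    start = time.perf_counter()
    result = func(n)
    return result, time.perf_counter() - start


n = 1_000_000
_, first = timed(average_nb, n)   # includes compilation when Numba is present
_, second = timed(average_nb, n)  # reuses the already compiled machine code
print(f"first call: {first:.3f} s, second call: {second:.3f} s")
```

With Numba installed, the first call should be noticeably slower than the second, mirroring the timings above.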



- **Where is the catch?** There is no free lunch.
- The application of Numba sometimes seems like magic when one compares the performance of the Python code to the compiled version, especially given its ease of use.
- However, there are many use cases for which Numba is not suited and for which performance gains are hardly observed or even impossible to achieve.



# Cython
<hr style = "border:2px solid black" ></hr>


- `Cython` allows one to statically compile Python code.
- However, it is not as simple to apply as Numba, since the code generally needs to be changed to see significant speed improvements.
- The code is therefore refined in several steps, as shown below.
    


In [16]:
%load_ext Cython

In [17]:
%%cython -a
import random  
def average_cy1(int n):  
    """
    Adds static type declarations for the variables n, i, and s.
    """
    cdef int i  
    cdef float s = 0  
    for i in range(n):
        s += random.random()
    return s / n

In [18]:
%time average_cy1(n)

CPU times: user 475 ms, sys: 2.46 ms, total: 477 ms
Wall time: 475 ms


0.49994465708732605

In [19]:
%timeit average_cy1(n)

477 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



- Some speedup is observed, but not even close to that achieved by, for example, the NumPy version. 
- A bit more Cython optimization is necessary to beat even the Numba version:
- The further optimized Cython version below, `average_cy2`, is a bit faster than the Numba version. However, the effort has also been a bit larger. Unlike the NumPy version, Cython preserves the memory efficiency of the original loop-based implementation.



In [20]:
%%cython

# Imports a random number generator from C.
from libc.stdlib cimport rand  

# Imports a constant value for the scaling of the random numbers.
cdef extern from 'limits.h':  
    int INT_MAX  
cdef int i
cdef float rn
for i in range(5):
    # Adds uniformly distributed random numbers from the interval (0, 1), after scaling
    rn = rand() / INT_MAX  
    print(rn)



0.3835020661354065
0.5194163918495178
0.8309653401374817
0.03457210958003998
0.05346163362264633


In [21]:
%%cython -a
# Imports a random number generator from C.
from libc.stdlib cimport rand  

# Imports a constant value for the scaling of the random numbers.
cdef extern from 'limits.h':  
    int INT_MAX  
def average_cy2(int n):
    cdef int i
    cdef float s = 0
    for i in range(n):
        # Adds uniformly distributed random numbers from the interval (0, 1), after scaling
        s += rand() / INT_MAX  
    return s / n



In [22]:
%time average_cy2(n)

CPU times: user 67.3 ms, sys: 257 µs, total: 67.6 ms
Wall time: 67.4 ms


0.500017523765564

In [23]:
%timeit average_cy2(n)

52.8 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



- **If there is so much to add, why would a developer use it?**
- Cython allows developers to tweak code for performance as much as possible or as little as sensible, starting with a pure Python version, for instance, and adding more and more elements from C to the code.
- Essentially, this improvement **at will** is the main selling point of `Cython`.
- The compilation step itself can also be parameterized to further optimize the compiled version.  



# References
<hr style = "border:2px solid black" ></hr>


- https://github.com/yhilpisch/py4fi2nd/blob/master/code/ch10/10_performance_python.ipynb
- https://llvm.org/
- Hilpisch, Yves. Python for finance: mastering data-driven finance. O'Reilly Media, 2018.

