# Simple Test between NumPy, Numexpr, Numba, and Cython

## for complex square root of big matrices

The calculation is
$$\sqrt{A\ L^2 + B}\ ,$$
where
- $A$ and $B$ are complex matrices (complex128) with dimension (#frequencies, #layers), and
- $L$ is a real matrix (float64) with dimension (#offsets, #wavenumbers).

Their dimensions are broadcast (`A[:, None, :, None]`; `B[:, None, :, None]`; and `L[None, :, None, :]`) so that the result is a complex matrix of dimension (#frequencies, #offsets, #layers, #wavenumbers), which can become quite big.
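To make the broadcasting concrete, here is a minimal sketch with small stand-in sizes (the sizes and the all-ones inputs are assumptions for illustration only). It also estimates the memory needed for the result of the big test case used further down, assuming 16 bytes per complex128 element.

```python
import numpy as np

# Small stand-in sizes (assumed for illustration only)
nfre, noff, nlay, nwav = 3, 4, 2, 5

a = np.ones((nfre, nlay), dtype=np.complex128)
b = np.ones((nfre, nlay), dtype=np.complex128)
l = np.ones((noff, nwav), dtype=np.float64)

# Insert length-1 axes so that the arrays broadcast against each other
a_bc = a[:, None, :, None]   # shape (nfre, 1, nlay, 1)
b_bc = b[:, None, :, None]   # shape (nfre, 1, nlay, 1)
l_bc = l[None, :, None, :]   # shape (1, noff, 1, nwav)

result = np.sqrt(a_bc * l_bc**2 + b_bc)
print(result.shape, result.dtype)

# Memory for the result of the big test case (complex128 = 16 bytes/element)
big_shape = (101, 201, 10, 301)
n_elements = int(np.prod(big_shape))
print(n_elements, 'elements,', n_elements * 16 / 1e9, 'GB')
```

The inputs stay small; only the broadcast result (and any temporaries of the same shape) takes up the full four-dimensional size.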

Here I compare my attempts at `NumPy`, `Numexpr`, `Numba`, and `Cython` implementations. `Numexpr` was very fast; however, it used 4 threads by default, whereas all the others used only 1 thread. So I restricted `Numexpr` to 1 thread too, which then makes it the slowest of the four.

In this comparison `Numba` and `Cython` perform about the same, give or take (the timings vary if you run the notebook a few times); the `Numba` implementation is slightly easier to write.

**Please let me know if you know how to improve any of these!**

In [1]:
import numba
import cython
import numexpr
import numpy as np

%load_ext cython

## NumPy

In [2]:
def test_numpy(a, b, l):
    return np.sqrt(a*l**2 + b)

## Numexpr

In [3]:
numexpr.set_num_threads(1)  # Set number of threads to 1 for a fair comparison


def test_numexpr(a, b, l):
    return numexpr.evaluate("sqrt(a*l**2 + b)")

## Numba

In [4]:
@numba.jit(nopython=True, nogil=True)
def test_numba(a, b, l, out):
    for nf in range(out.shape[0]):
        for no in range(out.shape[1]):
            for nl in range(out.shape[2]):
                for ni in range(out.shape[3]):
                    out[nf, no, nl, ni] = np.sqrt(a[nf, 0, nl, 0] * l[0, no, 0, ni] ** 2 + b[nf, 0, nl, 0])
    return out
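The explicit indices in the loop, `a[nf, 0, nl, 0]` and `l[0, no, 0, ni]`, do by hand what broadcasting does for the NumPy version. As a sketch, the same loop body can be run as plain Python (without the `numba` decorator, so it is slow) on a tiny example to check that the two agree. The name `loop_reference` and the sizes are assumptions for illustration.

```python
import numpy as np

def loop_reference(a, b, l, out):
    # Same loop body as test_numba, but as plain (slow) Python
    for nf in range(out.shape[0]):
        for no in range(out.shape[1]):
            for nl in range(out.shape[2]):
                for ni in range(out.shape[3]):
                    out[nf, no, nl, ni] = np.sqrt(
                        a[nf, 0, nl, 0] * l[0, no, 0, ni] ** 2 + b[nf, 0, nl, 0])
    return out

# Tiny inputs (assumed sizes), already in their broadcast shapes
np.random.seed(1)
a = (np.random.rand(2, 3) + 1j*np.random.rand(2, 3))[:, None, :, None]
b = (np.random.rand(2, 3) + 1j*np.random.rand(2, 3))[:, None, :, None]
l = np.random.rand(4, 5)[None, :, None, :]

loop_result = loop_reference(a, b, l, np.empty((2, 4, 3, 5), dtype=complex))
broadcast_result = np.sqrt(a*l**2 + b)
print(np.allclose(loop_result, broadcast_result, atol=0, rtol=1e-10))
```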

## Cython

In [5]:
%%cython -a
import cython

# Complex square root from the C standard library
cdef extern from "complex.h":
    double complex csqrt(double complex z)

@cython.boundscheck(False)  # No bounds checking on array indices
@cython.wraparound(False)   # No negative (wrap-around) indexing
def test_cython(double complex [:,:,:,:] a, double complex [:,:,:,:] b,
                double [:,:,:,:] l, double complex [:,:,:,:] out):
    cdef size_t nf, no, nl, nw
    # Loop over frequencies, offsets, layers, and wavenumbers; result goes into out
    for nf in range(out.shape[0]):
        for no in range(out.shape[1]):
            for nl in range(out.shape[2]):
                for nw in range(out.shape[3]):
                    out[nf, no, nl, nw] = csqrt(a[nf, 0, nl, 0] * l[0, no, 0, nw] ** 2 + b[nf, 0, nl, 0])

## Run comparison for a small and a big matrix

In [6]:
lay = [1, 10]
fre = [10, 101]
off = [11, 201]
wav = [51, 301] 

for i in range(2):
    nlay, nfre, noff, nwav = lay[i], fre[i], off[i], wav[i]
    a = np.random.rand(nfre, nlay) + 1j*np.random.rand(nfre, nlay)
    b = np.random.rand(nfre, nlay) + 1j*np.random.rand(nfre, nlay)
    l = np.random.rand(noff, nwav)
    
    # Broadcast shapes
    a_bc = a[:, None, :, None]
    b_bc = b[:, None, :, None]
    l_bc = l[None, :, None, :]
    
    # Output shape
    out_shape = (nfre, noff, nlay, nwav)
    
    print('= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =')
    print('  Shape Test Matrix ::', out_shape, '; total # elements:: '+str(nfre*noff*nlay*nwav))
    print('-----------------------------------------------------------------------------')
  
    print('NumPy   ::  ', end='')
    %timeit test_numpy(a_bc, b_bc, l_bc)
    # Get NumPy result for comparison
    numpy_result = test_numpy(a_bc, b_bc, l_bc)
    
    print('Numexpr ::  ', end='')
    %timeit test_numexpr(a_bc, b_bc, l_bc)
    # Ensure it agrees with NumPy
    numexpr_result = test_numexpr(a_bc, b_bc, l_bc)
    if not np.allclose(numpy_result, numexpr_result, atol=0, rtol=1e-10):
        print('* FAIL, DOES NOT AGREE WITH NumPy RESULT!')
    
    print('Numba   ::  ', end='')
    %timeit test_numba(a_bc, b_bc, l_bc, np.empty(out_shape, dtype=complex))
    # Ensure it agrees with NumPy
    numba_result = np.empty(out_shape, dtype=complex)
    test_numba(a_bc, b_bc, l_bc, numba_result)
    if not np.allclose(numpy_result, numba_result, atol=0, rtol=1e-10):
        print('* FAIL, DOES NOT AGREE WITH NumPy RESULT!')
    
    print('Cython  ::  ', end='')
    %timeit test_cython(a_bc, b_bc, l_bc, np.empty(out_shape, dtype=complex))
    # Ensure it agrees with NumPy
    cython_result = np.empty(out_shape, dtype=complex)
    test_cython(a_bc, b_bc, l_bc, cython_result)
    if not np.allclose(numpy_result, cython_result, atol=0, rtol=1e-10):
        print('* FAIL, DOES NOT AGREE WITH NumPy RESULT!')
        
    print()

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
  Shape Test Matrix :: (10, 11, 1, 51) ; total # elements:: 5610
-----------------------------------------------------------------------------
NumPy   ::  211 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numexpr ::  450 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numba   ::  184 µs ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Cython  ::  210 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
  Shape Test Matrix :: (101, 201, 10, 301) ; total # elements:: 61106010
-----------------------------------------------------------------------------
NumPy   ::  3.4 s ± 186 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numexpr ::  4.94 s ± 48.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba   ::  2.17 s ± 29.7 ms per loop (mean ± std. dev. of 7

In [7]:
from empymod import versions
versions('HTML', add_pckg=[cython, numba], ncol=5)

Fri Jun 29 07:34:01 2018 CDT

| Value | Item | Value | Item | Value | Item | Value | Item | Value | Item |
|---|---|---|---|---|---|---|---|---|---|
| Linux | OS | 4 | CPU(s) | 1.13.3 | numpy | 1.1.0 | scipy | 1.7.1 | empymod |
| 6.4.0 | IPython | 2.6.5 | numexpr | 2.2.2 | matplotlib | 0.28.3 | cython | 0.38.1+1.gc42707d0f.dirty | numba |

3.6.5 |Anaconda custom (64-bit)| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]

Intel(R) Math Kernel Library Version 2018.0.3 Product Build 20180406 for Intel(R) 64 architecture applications
