# Expression Evaluation (User Defined Functions)

So far we have seen that ironArray supports evaluating expressions passed as strings or as simple Python statements.  There is another, more flexible way of evaluating expressions: User Defined Functions, or UDFs for short.

UDFs are small functions written in a simple subset of Python.  They are passed to the LLVM compiler inside ironArray, which generates a binary optimized specifically for the local CPU.  In addition, this binary uses the [Intel SVML library](https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-short-vector-math-library-operations/overview-intrinsics-for-short-vector-math-library-svml-functions.html) to accelerate the evaluation of transcendental functions.

Let's see how this works.

In [1]:
%load_ext memprofiler
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import iarray as ia

In [2]:
%%time
precip1 = ia.load("precip1.iarr")
precip2 = ia.load("precip2.iarr")
precip3 = ia.load("precip3.iarr")

CPU times: user 26.8 s, sys: 4.82 s, total: 31.6 s
Wall time: 6.24 s


Now, let's define a simple function that computes the element-wise mean of these three datasets:

In [3]:
from iarray.udf import jit, Array, float32

@jit()
def mean(out: Array(float32, 3),
         p1: Array(float32, 3),
         p2: Array(float32, 3),
         p3: Array(float32, 3)) -> int:

    l = p1.shape[0]
    m = p1.shape[1]
    n = p1.shape[2]

    for i in range(l):
        for j in range(m):
            for k in range(n):
                value = p1[i,j,k] + p2[i,j,k] + p3[i,j,k]
                out[i,j,k] = value / 3

    return 0

and create the ironArray expression from this User Defined Function with:

In [4]:
%%time
precip_expr = ia.expr_from_udf(mean, [precip1, precip2, precip3])

CPU times: user 22.2 ms, sys: 9.27 ms, total: 31.5 ms
Wall time: 43.2 ms


As can be seen, converting the user defined function into a native ironArray expression is fast: it takes only a few tens of milliseconds.  As always, to perform the actual evaluation we have to call `.eval()` on the expression:

In [5]:
%%mprof_run iarray-mean
precip_mean = precip_expr.eval()
precip_mean

<IArray (720, 721, 1440) np.float32>

memprofiler: used 435.21 MiB RAM (peak of 491.55 MiB) in 1.0935 s, total RAM usage 1549.15 MiB
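
To make the calling convention more concrete, here is a minimal sketch that runs the same loop body as plain Python on small NumPy arrays, without `@jit` or ironArray.  It assumes that the kernel is handed an output buffer followed by inputs of the same shape, and that returning 0 signals success, as the signature above suggests.  The name `mean_kernel` and the tiny random inputs are hypothetical and only there for illustration:

```python
import numpy as np


def mean_kernel(out, p1, p2, p3):
    # Same body as the UDF above, but interpreted by plain Python (no @jit).
    l, m, n = p1.shape
    for i in range(l):
        for j in range(m):
            for k in range(n):
                value = p1[i, j, k] + p2[i, j, k] + p3[i, j, k]
                out[i, j, k] = value / 3
    return 0


# Tiny stand-ins for the precipitation arrays (illustrative data only).
rng = np.random.default_rng(0)
shape = (2, 3, 4)
a, b, c = (rng.random(shape, dtype=np.float32) for _ in range(3))

out = np.empty(shape, dtype=np.float32)
status = mean_kernel(out, a, b, c)

# The loop should agree with the vectorized formula.
expected = (a + b + c) / 3
print(status, np.allclose(out, expected))
```

Interpreted like this, the triple loop would be far too slow for the full arrays; the point of `@jit` is that the same code gets compiled to native code instead.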


Let's compare this time with the evaluation via a regular lazy expression:

In [6]:
precip_expr2 = (precip1 + precip2 + precip3) / 3

In [7]:
%%mprof_run iarray-mean2
precip_mean2 = precip_expr2.eval()
precip_mean2

<IArray (720, 721, 1440) np.float32>

memprofiler: used 382.66 MiB RAM (peak of 453.57 MiB) in 1.1170 s, total RAM usage 1931.82 MiB
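
As a rough sanity check of these memory figures, the size of the result can be worked out by hand, assuming it were stored as a plain, uncompressed float32 buffer.  The shape and dtype come from the output above; the rest is just arithmetic:

```python
import math

shape = (720, 721, 1440)   # shape reported for the result above
itemsize = 4               # bytes per float32 element

nbytes = math.prod(shape) * itemsize
size_mib = nbytes / 2**20

print(f"{nbytes} bytes = {size_mib:.1f} MiB")
```

This is several times more than the few hundred MiB that memprofiler reports for either evaluation, presumably because ironArray keeps the result in compressed form.  Note that memprofiler measures the RAM usage of the process rather than the container itself, so the comparison is only indicative.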


Ok, so the times are very close (about 1.09 s versus 1.12 s).  It turns out that UDFs are compiled and executed with the very same LLVM machinery that ironArray uses for lazy expressions, which explains why the times are similar.  It is up to the user to choose one or the other depending on their needs.
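
Besides timings, it is worth checking that both approaches give the same values.  Below is a small sketch of such a check on NumPy arrays; `max_abs_diff` is a hypothetical helper (not part of ironArray), and the small random arrays stand in for the real data:

```python
import numpy as np


def max_abs_diff(a, b):
    """Return the largest absolute element-wise difference between a and b."""
    a = np.asarray(a)
    b = np.asarray(b)
    return float(np.max(np.abs(a - b)))


# Two mathematically equivalent ways of computing a mean in float32.
rng = np.random.default_rng(1)
x, y, z = (rng.random((4, 5), dtype=np.float32) for _ in range(3))
r1 = (x + y + z) / 3
r2 = x / 3 + y / 3 + z / 3

diff = max_abs_diff(r1, r2)
print(diff)  # tiny, but not necessarily zero because of float32 rounding
```

With the arrays from this notebook, the call would be `max_abs_diff(precip_mean.data, precip_mean2.data)`, at the cost of converting both results into NumPy arrays first.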

Now, let's use an expression with some transcendental functions.  The computation itself is meaningless for precipitation data; we are doing this just as an indication of the efficiency of the computational engine inside ironArray:

In [8]:
import math

@jit()
def trans(out: Array(float32, 3),
          p1: Array(float32, 3),
          p2: Array(float32, 3),
          p3: Array(float32, 3)) -> int:

    l = p1.shape[0]
    m = p1.shape[1]
    n = p1.shape[2]

    for i in range(l):
        for j in range(m):
            for k in range(n):
                value = math.sin(p1[i,j,k]) * math.sin(p2[i,j,k]) + math.cos(p2[i,j,k])
                value *= math.tan(p1[i,j,k])
                value += math.sqrt(p3[i,j,k]) * 2
                out[i,j,k] = value

    return 0

In [9]:
%%time
precip_expr = ia.expr_from_udf(trans, [precip1, precip2, precip3])

CPU times: user 21.8 ms, sys: 5.34 ms, total: 27.2 ms
Wall time: 39 ms


In [10]:
%%mprof_run iarray-trans
precip_trans = precip_expr.eval()
precip_trans

<IArray (720, 721, 1440) np.float32>

memprofiler: used 645.20 MiB RAM (peak of 718.22 MiB) in 1.2722 s, total RAM usage 2579.38 MiB


In this case we see that evaluating transcendental functions costs almost the same as plain arithmetic operations (addition, subtraction, multiplication, division...): about 1.27 s here versus 1.09 s for the mean, or roughly 16% more.  This is very significant, because transcendental functions have traditionally taken much longer than plain arithmetic; not anymore, thanks to SVML.  Let's compare these times against NumPy:

In [11]:
%%time
p1_ = precip1.data
p2_ = precip2.data
p3_ = precip3.data

CPU times: user 11.2 s, sys: 2.12 s, total: 13.3 s
Wall time: 4.5 s


In [None]:
%%mprof_run np_trans
np_result = (np.tan(p1_) * (np.sin(p1_) * np.sin(p2_) + np.cos(p2_)) + np.sqrt(p3_) * 2)

This is really slow, but that is to be expected: NumPy does not have support for SVML (at the time of writing, at least), and transcendental functions have traditionally been expensive to execute on a regular CPU.  The speed of ironArray in this case comes from a good mix of compiler optimization (via LLVM) and SIMD usage (via SVML).
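
Since the UDF and the NumPy expression are written quite differently, here is a small sketch checking that they compute the same thing.  It runs the body of `trans` as plain Python (no `@jit`, no ironArray) on tiny random arrays and compares the result with the vectorized NumPy formula; `trans_kernel` and the test data are hypothetical names used only for this illustration:

```python
import math

import numpy as np


def trans_kernel(out, p1, p2, p3):
    # Same body as the `trans` UDF, interpreted by plain Python.
    l, m, n = p1.shape
    for i in range(l):
        for j in range(m):
            for k in range(n):
                value = math.sin(p1[i, j, k]) * math.sin(p2[i, j, k]) + math.cos(p2[i, j, k])
                value *= math.tan(p1[i, j, k])
                value += math.sqrt(p3[i, j, k]) * 2
                out[i, j, k] = value
    return 0


rng = np.random.default_rng(42)
shape = (2, 3, 4)
a, b, c = (rng.random(shape, dtype=np.float32) for _ in range(3))

out = np.empty(shape, dtype=np.float32)
trans_kernel(out, a, b, c)

# Vectorized formula, as in the NumPy cell above.
vectorized = np.tan(a) * (np.sin(a) * np.sin(b) + np.cos(b)) + np.sqrt(c) * 2

# Plain Python's math functions work in double precision, so compare
# with a tolerance instead of expecting bit-identical float32 results.
same = np.allclose(out, vectorized, rtol=1e-5)
print(same)
```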

TODO:
* Compare with numba