# Defining `ufuncs` using `vectorize`

You have been able to define your own NumPy [`ufuncs`](http://docs.scipy.org/doc/numpy/reference/ufuncs.html) for quite some time, but it's a little involved.  

You can read through the [documentation](http://docs.scipy.org/doc/numpy/user/c-info.ufunc-tutorial.html); the example posted there is a ufunc that computes

$$f(a) = \log \left(\frac{a}{1-a}\right)$$

It looks like this:

```c
static void double_logit(char **args, npy_intp *dimensions,
                            npy_intp* steps, void* data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];

    double tmp;

    for (i = 0; i < n; i++) {
        /*BEGIN main ufunc computation*/
        tmp = *(double *)in;
        tmp /= 1-tmp;
        *((double *)out) = log(tmp);
        /*END main ufunc computation*/

        in += in_step;
        out += out_step;
    }
}
```

And **note**, that's just for a `double`.  If you want `float`, `long double`, etc., you have to write a version for each of those, too.  Then you have to create a `setup.py` file to build and install it.  And I left out a bunch of boilerplate that sets up the import hooks, etc.

# Say "thank you" to the Numba devs

We can use Numba to define ufuncs without all of the pain.

In [1]:
import numpy
import math

Let's define a function that operates on two inputs.

In [2]:
def trig(a, b):
    return math.sin(a**2) * math.exp(b)

In [3]:
trig(1, 1)

2.2873552871788423

Seems reasonable.  However, the `math` library only works on scalars.  If we try to pass in arrays, we'll get an error.

In [4]:
a = numpy.ones((5,5))
b = numpy.ones((5,5))

In [5]:
trig(a, b)

TypeError: only size-1 arrays can be converted to Python scalars

In [6]:
from numba import vectorize

In [7]:
vec_trig = vectorize()(trig)

In [8]:
vec_trig(a, b)

array([[2.28735529, 2.28735529, 2.28735529, 2.28735529, 2.28735529],
       [2.28735529, 2.28735529, 2.28735529, 2.28735529, 2.28735529],
       [2.28735529, 2.28735529, 2.28735529, 2.28735529, 2.28735529],
       [2.28735529, 2.28735529, 2.28735529, 2.28735529, 2.28735529],
       [2.28735529, 2.28735529, 2.28735529, 2.28735529, 2.28735529]])

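And just like that, the scalar function `trig` is now a NumPy `ufunc` called `vec_trig`.

Note that this is a "dynamic ufunc": no signature was given, so Numba compiles a loop for each new set of input types when the function is first called with them.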

How does it compare to just using NumPy?  Let's check.

In [9]:
def numpy_trig(a, b):
    return numpy.sin(a**2) * numpy.exp(b)

In [10]:
a = numpy.random.random((1000, 1000))
b = numpy.random.random((1000, 1000))

In [11]:
%timeit vec_trig(a, b)

46.5 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
%timeit numpy_trig(a, b)

17.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


What happens if we do specify a signature?  Is there a speed boost?

In [13]:
vec_trig = vectorize('float64(float64, float64)')(trig)

In [14]:
%timeit vec_trig(a, b)

44.4 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


No, not really.  But if we provide a signature, we can also pass the `target` keyword argument.

In [15]:
vec_trig = vectorize('float64(float64, float64)', target='parallel')(trig)

In [16]:
%timeit vec_trig(a, b)

25.2 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Automatic multicore operations!

**Note**: `target='parallel'` is not always the best option.  There is overhead in setting up the threading, so if the individual scalar operations that make up a `ufunc` are simple you'll probably get better performance in serial.  If the individual operations are more expensive (like trig!) then parallel is (usually) a good option.

In [17]:
vec_trig_cuda = vectorize('float64(float64, float64)', target='cuda')(trig)

In [18]:
# %timeit vec_trig_cuda(a, b)

### Passing multiple signatures

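If you use multiple signatures, they have to be listed in order from most specific to least specific (for example, `float32` before `float64`); otherwise type-based dispatch will not pick the loop you expect.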

In [19]:
@vectorize(['int32(int32, int32)',
            'int64(int64, int64)',
            'float32(float32, float32)',
            'float64(float64, float64)'])
def trig(a, b):
    return math.sin(a**2) * math.exp(b)

In [20]:
trig(1, 1)

2

In [21]:
trig(1., 1.)

2.2873552871788423

In [22]:
trig.ntypes

4

## [Exercise: Clipping an array](./exercises/07.Vectorize.Exercises.ipynb#Exercise:-Clipping-an-array)

Yes, NumPy has a `clip` ufunc already, but let's pretend it doesn't.  

Create a Numba vectorized ufunc that takes a vector `a`, a lower limit `amin` and an upper limit `amax`.  It should return the vector `a` with all values clipped such that $a_{min} \le a \le a_{max}$:

In [23]:
# %load snippets/clip.py
def my_clip(a, amin, amax):
    result = a
    if a < amin:
        result = amin
    elif a > amax:
        result = amax
    return result

vec_truncate_serial = vectorize('float64(float64, float64, float64)')(my_clip)
vec_truncate_par = vectorize('float64(float64, float64, float64)', target='parallel')(my_clip)

In [24]:
a = numpy.random.random((5000))

In [25]:
amin = .2
amax = .6

In [26]:
%timeit vec_truncate_serial(a, amin, amax)

27.5 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [27]:
%timeit vec_truncate_par(a, amin, amax)

48.3 µs ± 5.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [28]:
%timeit numpy.clip(a, amin, amax)

11 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [29]:
a = numpy.random.random((100000))

In [30]:
%timeit vec_truncate_serial(a, amin, amax)

968 µs ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [31]:
%timeit vec_truncate_par(a, amin, amax)

762 µs ± 90.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [32]:
%timeit numpy.clip(a, amin, amax)

416 µs ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## [Exercise: Create `logit` ufunc](./exercises/07.Vectorize.Exercises.ipynb#Exercise:-Create-logit-ufunc)

Recall from above that this is a ufunc which performs this operation:

$$f(a) = \log \left(\frac{a}{1-a}\right)$$

In [33]:
# %load snippets/logit.py
@vectorize('float64(float64)')
def logit(a):
    return math.log(a / (1 - a))

In [34]:
a = numpy.random.random((5000))

In [35]:
logit(a)

array([-0.6989712 , -1.98974674,  0.43825071, ..., -1.28457118,
       -0.46629495, -0.92129208])

In [36]:
%timeit logit(a)

164 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [37]:
%timeit numpy.log(a / (1 - a))

174 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


## Performance of `vectorize` vs. regular array-wide operations

In [38]:
@vectorize
def discriminant(a, b, c):
    return b**2 - 4 * a * c

In [39]:
a = numpy.arange(10000)
b = numpy.arange(10000)
c = numpy.arange(10000)

In [40]:
%timeit discriminant(a, b, c)

17.8 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [41]:
%timeit b**2 - 4 * a * c

39.7 µs ± 976 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


What's going on?

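* Each array operation creates a temporary copy, so `b**2 - 4 * a * c` allocates several intermediate arrays
* Each of these arrays is loaded into and out of cache repeatedly, while the vectorized version computes every element in a single pass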
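The two points above can be spelled out in plain NumPy. This sketch writes out the temporaries that the array expression creates and compares them with a single fused loop, which is roughly what the compiled scalar kernel does for each element. The temporary names `t1`, `t2`, `t3` are illustrative, and the Python loop is only there to show the structure, not to be fast.

```python
import numpy as np

a = np.arange(1000, dtype=np.float64)
b = np.arange(1000, dtype=np.float64)
c = np.arange(1000, dtype=np.float64)

# Array-wide version: every operator makes a full pass and a new array
t1 = b**2            # temporary 1
t2 = 4 * a           # temporary 2
t3 = t2 * c          # temporary 3
stepwise = t1 - t3   # final result, a fourth allocation

# Fused version: one pass over the data, one output array, no temporaries
fused = np.empty_like(a)
for i in range(a.size):
    fused[i] = b[i]**2 - 4 * a[i] * c[i]

print(np.array_equal(stepwise, fused))
```

Both versions produce the same values; the difference is how many times memory is traversed and allocated along the way.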

In [42]:
del a, b, c