# Rewrite of bottleneck internals #92

Closed · opened this Issue Oct 19, 2014 · 26 comments

## Comments

### kwgoodman (Owner) commented Oct 19, 2014

Here's a proposal for a rewrite of the bottleneck internals. The main goals:

- Easier to add functions to bottleneck
- Easier to maintain
- Less C code

The easiest way to explain what I have in mind is with an example written in pure python:

```python
import numpy as np

def nanmean_2d_float64_axis0(a):
    algo = Nanmean()
    return reduce_2d_float64_axis0(a, algo)

def reduce_2d_float64_axis0(a, algo):
    n0, n1 = a.shape
    y = np.empty(n1)
    for j in range(n1):
        algo.clear()
        for i in range(n0):
            ai = a[i, j]
            algo.append(ai)
        y[j] = algo.calc()
    return y

class Nanmean(object):

    def __init__(self):
        self.count = 0
        self.asum = 0

    def append(self, ai):
        if ai == ai:
            self.asum += ai
            self.count += 1

    def calc(self):
        if self.count > 0:
            return self.asum / self.count
        else:
            return np.nan

    def clear(self):
        self.asum = 0
        self.count = 0
```

Note that:

- `nanmean_2d_float64_axis0` keeps the same name and signature as in the current code.
- `reduce_2d_float64_axis0` can be (re)used with any reduce function, such as nansum or median.
- The algo, here `Nanmean`, only has to be written for the 1d case.

To add a new function, the user only has to think about the 1d case and then write the algo class (I've never written a cython class, but I assume it can be done). Bottleneck should take care of the rest. A similar setup would be used for moving window functions. I can't picture what all this would look like in practice, but I just wanted to record the thought.
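To make the extensibility claim concrete, here is what adding nansum might look like under this scheme. This is a pure-python sketch with hypothetical names (`reduce_2d_axis0`, `Nansum`); the generic driver is repeated so the snippet runs standalone:

```python
import numpy as np

def reduce_2d_axis0(a, algo):
    # generic driver from the proposal: reusable with any reduce algo
    n0, n1 = a.shape
    y = np.empty(n1)
    for j in range(n1):
        algo.clear()
        for i in range(n0):
            algo.append(a[i, j])
        y[j] = algo.calc()
    return y

class Nansum(object):
    # the only new code a contributor would write: the 1d algo
    def __init__(self):
        self.clear()
    def append(self, ai):
        if ai == ai:  # False only when ai is NaN
            self.asum += ai
    def calc(self):
        return self.asum
    def clear(self):
        self.asum = 0.0

a = np.array([[1.0, np.nan], [3.0, 4.0]])
result = reduce_2d_axis0(a, Nansum())  # array([4., 4.])
```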


### shoyer (Contributor) commented Oct 19, 2014

Coincidentally, this week I have also been thinking about how to rewrite the bottleneck internals. I'll lay out my thoughts, and then respond to yours. But let me preface this by saying that I'm in 100% agreement with your goals here.

First, I think bottleneck should get out of the business of writing separate functions for each number of dimensions. Instead, we should use numpy's generalized ufuncs.

Second, I think we should consider eventually moving the core of bottleneck from Cython to Numba. For the simple cases I tried, Numba is not quite as fast as the Cython found in bottleneck, but it is usually pretty close (within 20%). However, functions written for Numba are much more readable than Cython (they run unmodified as Python code), and Numba already handles writing generalized ufuncs.

I have a working prototype (see this gist and numba/numba#825), but there are still a few barriers to making this work elegantly. Most notably, generalized ufuncs really should have an `axis` argument that allows for specifying core dimensions (numpy/numpy#5197).

It's also almost certainly possible to write gufuncs from Cython instead of Numba, though my googling could not turn up any complete examples. This tutorial gets pretty close. I think this is probably a better option than writing functions like `reduce_2d_float64_axis0` ourselves.

Doing this right with gufuncs and/or numba will require some upstream work, but it seems achievable in the intermediate term (perhaps even with numpy 1.10). If we switch to using numba instead of cython, that would be a significant enough change that we should probably create a new project with a new import name (perhaps `bottleneck2`?).

As for your proposal, my main concerns are:

- The `Nanmean` class is an improvement over the multi-dimensional template language, but it's still much less obvious than the 1d function.
- I'm pretty sure we won't be able to fit every existing bottleneck function into this model. Some of them (e.g., `nanvar`) use two passes over the data.
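The core-dimension semantics of a generalized ufunc can be sketched in pure numpy with `np.vectorize`'s `signature` argument. This is slow and only illustrative, but it shows how a single 1d kernel can cover every ndim; a compiled gufunc would be the fast equivalent:

```python
import numpy as np

def _nanmean_1d(a):
    # the 1d kernel: the only piece an author would need to write
    asum = 0.0
    count = 0
    for ai in a:
        if ai == ai:  # False only for NaN
            asum += ai
            count += 1
    return asum / count if count > 0 else np.nan

# "(n)->()" declares one core dimension; numpy loops over the rest
nanmean = np.vectorize(_nanmean_1d, signature='(n)->()')

a = np.array([[1.0, np.nan], [3.0, 4.0]])
out1 = nanmean(a)                       # reduces the last axis
out0 = nanmean(np.moveaxis(a, 0, -1))   # other axes: the missing axis= workaround
```

Reducing along any other axis requires moving it last by hand, which is exactly the gufunc `axis` argument gap mentioned above.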
### kwgoodman (Owner) commented Oct 19, 2014

Interesting. Your proposal would certainly make it easier to add new functions to bottleneck, and it would lower the barrier to user contributions. Is your proposal to do this in steps? Numpy gufuncs first, then numba later?
### shoyer (Contributor) commented Oct 19, 2014

> Is your proposal to do this in steps? Numpy gufuncs first, then numba later?

That's certainly an option, but my C/Cython is pretty bad, and that would end up being wasted work later, if/when we switch to numba. My inclination is to try out both at once (in a separate/parallel project with as much code and as many tests copied from bottleneck as possible), and count on the numba devs to smooth out any rough edges we identify once we have some performance benchmarks. Numba is not quite fast enough yet (especially pending the fix for numba/numba#815), but it is being very actively developed these days. We still need a little bit of boilerplate to make things work right now, but eventually that should go away. My hope is that the axis selection bit could end up in numpy pretty quickly.

Personally, I am more excited about trying this out with numba than about hooking up gufuncs to 1d versions of the existing Cython code. But that's also a perfectly reasonable option, and there's no reason it couldn't be tried out in parallel. Switching to numba at this point would indeed mean a slight performance hit, and there are a few bottleneck modules written in C that would be harder to convert over.

In my fantasy world, writing a performant `finitesum` with bottleneck (e.g., as requested on the mailing list recently) is as easy as the following:

```python
import numpy as np
from bottleneck2 import ndreduce

@ndreduce
def finitesum(a):
    asum = 0.0
    for ai in a.flat:
        if np.isfinite(ai):
            asum += ai
    return asum
```

I can see this as feasible with numba, not so much with Cython :).
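For reference, a plain-numpy `finitesum` (no numba, no cython) can be written today at the cost of a temporary boolean mask; the function name here is just illustrative:

```python
import numpy as np

def finitesum(a):
    # np.isfinite screens out NaN as well as +/-inf
    a = np.asarray(a)
    return a[np.isfinite(a)].sum()

finitesum([1.0, np.inf, np.nan, 2.0])  # 3.0
```

The mask and fancy-indexed copy are exactly the overhead a compiled single-pass loop would avoid.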

### Nodd commented Oct 20, 2014

Just keep in mind that using numba adds a runtime dependency to bottleneck, whereas bottleneck can currently be installed without cython. Also, numba is a pain to install if you're not using anaconda (that was a few months ago, it may have changed since).
### shoyer (Contributor) commented Oct 20, 2014

 @Nodd Agreed, numba is a pretty bad dependency to add. That's why I suggested making it technically a separate project (e.g., called `bottleneck2` or something else entirely).
### kwgoodman (Owner) commented Oct 20, 2014

I converted the python code in my proposal to cython:

```cython
import numpy as np
cimport numpy as np
import cython
from numpy cimport NPY_FLOAT64 as NPY_float64
from numpy cimport PyArray_EMPTY, PyArray_DIMS, import_array
import_array()

def nanmean_2d_float64_axis0(np.ndarray[np.float64_t, ndim=2] a):
    algo = Nanmean()
    return reduce_2d_float64_axis0(a, algo)

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[np.float64_t, ndim=1] reduce_2d_float64_axis0(
        np.ndarray[np.float64_t, ndim=2] a, Nanmean algo):
    cdef Py_ssize_t i0, i1
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef Py_ssize_t n0 = dim[0]
    cdef Py_ssize_t n1 = dim[1]
    cdef np.npy_intp *dims = [n1]
    cdef np.ndarray[np.float64_t, ndim=1] y = PyArray_EMPTY(1, dims,
                                                            NPY_float64, 0)
    for i1 in range(n1):
        algo.clear()
        for i0 in range(n0):
            algo.append(a[i0, i1])
        y[i1] = algo.calc()
    return y

cdef class Nanmean:
    cdef int count
    cdef double asum, ai

    def __init__(self):
        self.asum = 0.0
        self.count = 0

    cdef void append(self, double ai):
        if ai == ai:
            self.asum += ai
            self.count += 1

    cdef double calc(self):
        cdef double y
        if self.count > 0:
            y = self.asum / self.count
        else:
            y = np.nan
        return y

    cdef void clear(self):
        self.asum = 0.0
        self.count = 0
```

But it is only half as fast as bn.nanmean(). nanmean is such a simple calculation that nearly all of the time is spent looping over the array elements, so calling Nanmean.append at each array element adds too much overhead.
### kwgoodman (Owner) commented Oct 20, 2014

I have a new proposal for a rewrite of the bottleneck internals. This one seems to be speed-competitive with the current bottleneck.

```cython
import numpy as np
cimport numpy as np
import cython
from numpy cimport NPY_FLOAT64 as NPY_float64
from numpy cimport PyArray_EMPTY, PyArray_DIMS, import_array
import_array()

# ---------------------------------------------------------------------------
# user or bottleneck developer would write this function:

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double nanmean_1d_double(double* x, int step, int n0):
    cdef int count = 0
    cdef double asum = 0, ai
    cdef int i
    for i in range(n0):
        ai = x[i * step]
        if ai == ai:
            asum += ai
            count += 1
    if count > 0:
        ai = asum / count
    else:
        ai = np.nan
    return ai

# ---------------------------------------------------------------------------
# bottleneck's templating system would auto-generate these functions:

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis0(np.ndarray[np.float64_t, ndim=2] a):
    cdef Py_ssize_t i
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef int step
    cdef np.npy_intp *dims = [n1]
    cdef np.ndarray[np.float64_t, ndim=1] y
    y = PyArray_EMPTY(1, dims, NPY_float64, 0)
    cdef double* aptr = &a[0, 0]
    step = n1
    for i in range(n1):
        y[i] = nanmean_1d_double(aptr + i, step, n0)
    return y

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis1(np.ndarray[np.float64_t, ndim=2] a):
    cdef Py_ssize_t i
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef int step
    cdef np.npy_intp *dims = [n0]
    cdef np.ndarray[np.float64_t, ndim=1] y
    y = PyArray_EMPTY(1, dims, NPY_float64, 0)
    cdef double* aptr = &a[0, 0]
    step = 1
    for i in range(n1):
        y[i] = nanmean_1d_double(aptr + i * n0, step, n0)
    return y

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axisNone(np.ndarray[np.float64_t, ndim=2] a):
    cdef Py_ssize_t i
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef int step
    cdef np.npy_intp *dims = [n0]
    cdef double y = 0
    cdef double* aptr = &a[0, 0]
    step = 1
    for i in range(n1):
        y += nanmean_1d_double(aptr + i * n0, step, n0)
    return y / n1  # hack
```
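The pointer arithmetic in the sketch above can be checked in pure python with flat indexing: for a C-contiguous (n0, n1) array, column j is the elements flat[j], flat[j + n1], flat[j + 2*n1], ..., which is exactly the (start, step, length) triple handed to the 1d kernel. A pure-python stand-in for `nanmean_1d_double`:

```python
import numpy as np

def nanmean_1d(flat, start, step, n):
    # mimic nanmean_1d_double(x, step, n0) over a strided view
    asum = 0.0
    count = 0
    for i in range(n):
        ai = flat[start + i * step]
        if ai == ai:  # NaN check
            asum += ai
            count += 1
    return asum / count if count > 0 else float('nan')

a = np.arange(12, dtype=np.float64).reshape(3, 4)
flat = a.ravel()  # C order: rows are contiguous
n0, n1 = a.shape
# axis=0: column j starts at offset j and strides by n1
col_means = [nanmean_1d(flat, j, n1, n0) for j in range(n1)]  # [4.0, 5.0, 6.0, 7.0]
```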
### shoyer (Contributor) commented Oct 20, 2014

 @kwgoodman can that pointer math really handle any numpy.ndarray? Even arrays that are non-contiguous or don't have their own data (e.g., from slicing)?
### kwgoodman (Owner) commented Oct 20, 2014

Ugh, taking a transpose breaks the current code:

```
In [11]: a = np.random.rand(4,2)

In [12]: bn.nanmean(a,0)
Out[12]: array([ 0.37224901,  0.30965753])

In [13]: nanmean_2d_float64_axis0(a)
Out[13]: array([ 0.37224901,  0.30965753])

In [14]: a = a.T

In [15]: bn.nanmean(a,0)
Out[15]: array([ 0.32749017,  0.6195802 ,  0.22710758,  0.18963513])

In [16]: nanmean_2d_float64_axis0(a)
Out[16]: array([ 0.46717604,  0.08742171,  0.27732198,  0.53189335])
```
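The failure is about memory layout, not the math: `a.T` is a view over the same buffer with the strides swapped, so it is no longer C-contiguous and the hard-coded `step = n1` walks the wrong elements. A quick check:

```python
import numpy as np

a = np.zeros((4, 2))
print(a.strides)                # (16, 8) for float64 in C order
b = a.T                         # same buffer, no copy
print(b.strides)                # (8, 16): strides swapped, data untouched
print(b.flags['C_CONTIGUOUS'])  # False
```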
### shoyer (Contributor) commented Oct 21, 2014

OK, so I went ahead and started another project for Numba-accelerated functions: https://github.com/shoyer/numbagg. Still happy to roll this back into bottleneck at some point if/when the time is right.
### kwgoodman (Owner) commented Oct 21, 2014

numbagg is very impressive! What are timings currently like for small input arrays? Something like:

```
In [1]: from bottleneck.func import nanmean_2d_float64_axis1

In [2]: a = np.random.RandomState(42).randn(5, 5)

In [3]: timeit nanmean_2d_float64_axis1(a)
1000000 loops, best of 3: 778 ns per loop

In [4]: a = np.random.RandomState(42).randn(100000, 5)

In [5]: timeit nanmean_2d_float64_axis1(a)
1000 loops, best of 3: 515 µs per loop
```
### shoyer (Contributor) commented Oct 21, 2014

Here are a few more benchmarks:

```python
for size in [(5, 5), (100000, 5), (1000, 1000)]:
    for axis in [0, 1, None]:
        a = np.random.RandomState(42).randn(*size)
        print('size=%s, axis=%s:' % (size, axis))
        %timeit bottleneck.nanmean(a, axis=axis)
        %timeit numbagg.nanmean(a, axis=axis)
```

```
size=(5, 5), axis=0:
100000 loops, best of 3: 2.46 µs per loop
100000 loops, best of 3: 5.13 µs per loop
size=(5, 5), axis=1:
100000 loops, best of 3: 2.34 µs per loop
100000 loops, best of 3: 5.24 µs per loop
size=(5, 5), axis=None:
1000000 loops, best of 3: 1.58 µs per loop
1000000 loops, best of 3: 1.04 µs per loop
size=(100000, 5), axis=0:
1000 loops, best of 3: 719 µs per loop
1000 loops, best of 3: 771 µs per loop
size=(100000, 5), axis=1:
1000 loops, best of 3: 579 µs per loop
1000 loops, best of 3: 838 µs per loop
size=(100000, 5), axis=None:
1000 loops, best of 3: 480 µs per loop
1000 loops, best of 3: 505 µs per loop
size=(1000, 1000), axis=0:
100 loops, best of 3: 7.86 ms per loop
100 loops, best of 3: 7.9 ms per loop
size=(1000, 1000), axis=1:
1000 loops, best of 3: 1.01 ms per loop
1000 loops, best of 3: 1.08 ms per loop
size=(1000, 1000), axis=None:
1000 loops, best of 3: 1.01 ms per loop
1000 loops, best of 3: 1.1 ms per loop
```

For comparison, here are my times for that first test case using `bottleneck.func.nanmean_2d_float64_axis*` directly for `axis=0,1,None`:

```
100000 loops, best of 3: 1.71 µs per loop
1000000 loops, best of 3: 1.73 µs per loop
1000000 loops, best of 3: 1.09 µs per loop
```

So numbagg is actually faster than even the direct bottleneck function for `axis=None`, but there is some overhead if the axis is specified. We could probably get that down with some profiling.
### kwgoodman (Owner) commented Oct 21, 2014

That is fast! For axis=None, bottleneck returns np.float(answer) while numbagg returns answer. Could that explain the overhead difference for axis=None?
### shoyer (Contributor) commented Oct 21, 2014

Yep, looks like `np.float(answer)` adds a few hundred ns of overhead.

```
>>> %timeit np.float(1.0)
10000000 loops, best of 3: 152 ns per loop
```
### shoyer (Contributor) commented Oct 21, 2014

Just pushed a new commit with simpler logic for parsing `axis` when it's an integer, which speeds up most of the above benchmarks (I edited my comment).
### kwgoodman (Owner) commented Oct 22, 2014

Here's a nansum along axis=None for arbitrary ndim:

```cython
def nansum_float64_axisNone(np.ndarray a):
    cdef int axis = -1
    cdef np.flatiter ita
    ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef Py_ssize_t stride = a.strides[axis], length = a.shape[axis], i
    cdef np.float64_t asum = 0, ai
    while np.PyArray_ITER_NOTDONE(ita):
        for i in range(length):
            ai = (<np.float64_t*>((<char*>np.PyArray_ITER_DATA(ita)) +
                                  i * stride))[0]
            if ai == ai:
                asum += ai
        np.PyArray_ITER_NEXT(ita)
    return asum
```

It is the same speed as bottleneck, except that it is much, much faster when the array is Fortran ordered.
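In pure python, the `PyArray_IterAllButAxis` pattern amounts to visiting every 1d line along one axis. A sketch with `np.ndindex` (the name `nansum_axisNone` is just illustrative) shows why stride-aware iteration handles Fortran-ordered input for free:

```python
import numpy as np

def nansum_axisNone(a):
    # iterate over all axes except the last, like PyArray_IterAllButAxis
    asum = 0.0
    for idx in np.ndindex(*a.shape[:-1]):
        for ai in a[idx]:  # a 1d line along the last axis
            if ai == ai:   # NaN check
                asum += ai
    return asum

a = np.array([[1.0, np.nan], [3.0, 4.0]])
nansum_axisNone(a)                     # 8.0
nansum_axisNone(np.asfortranarray(a))  # 8.0: indexing follows strides
```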
### shoyer (Contributor) commented Oct 22, 2014

 @kwgoodman very nice! Now I just need the numba devs to expose something like that :).

### kwgoodman (Owner) commented Oct 23, 2014

And here's a prototype for moving window functions. Note that it works for arbitrary ndim and axis, which leaves only dtype to deal with. The speed is the same as the current bottleneck.

```cython
def move_nanmean_float64(a, int window, int axis):
    cdef int n = a.shape[axis]
    if (window < 1) or (window > n):
        raise ValueError(MOVE_WINDOW_ERR_MSG % (window, n))
    y = np.empty(a.shape, np.float64)
    cdef Py_ssize_t count, i
    cdef double asum, aold, ai, yi
    cdef np.flatiter ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef np.flatiter ity = np.PyArray_IterAllButAxis(y, &axis)
    cdef int astride = a.strides[axis]
    cdef int ystride = y.strides[axis]
    while np.PyArray_ITER_NOTDONE(ita):
        with nogil:
            asum = 0
            count = 0
            for i in range(window - 1):
                ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) +
                                i*astride))[0]
                if ai == ai:
                    asum += ai
                    count += 1
                (<double*>((<char*>np.PyArray_ITER_DATA(ity)) +
                           i*ystride))[0] = NAN
            i = window - 1
            ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*astride))[0]
            if ai == ai:
                asum += ai
                count += 1
            if count > 0:
                yi = asum / count
            else:
                yi = NAN
            (<double*>((<char*>np.PyArray_ITER_DATA(ity)) + i*ystride))[0] = yi
            for i in range(window, n):
                ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) +
                                i*astride))[0]
                if ai == ai:
                    asum += ai
                    count += 1
                aold = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) +
                                  (i-window)*astride))[0]
                if aold == aold:
                    asum -= aold
                    count -= 1
                if count > 0:
                    yi = asum / count
                else:
                    yi = NAN
                (<double*>((<char*>np.PyArray_ITER_DATA(ity)) +
                           i*ystride))[0] = yi
        np.PyArray_ITER_NEXT(ita)
        np.PyArray_ITER_NEXT(ity)
    return y
```

OK, so we have a prototype for reduction along axis=None (shown in a previous comment in this thread) and now a prototype for moving window functions. What's left? The big one is reduction along a given axis. I don't know how to write that one. Anyone know?
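The running-sum update at the heart of the prototype is easier to see in a 1d pure-python sketch (hypothetical name `move_nanmean_1d`): add the entering element, subtract the leaving one, and track the count of non-NaN values in the window:

```python
import numpy as np

def move_nanmean_1d(a, window):
    n = len(a)
    y = np.full(n, np.nan)
    asum = 0.0
    count = 0
    for i in range(n):
        ai = a[i]
        if ai == ai:          # element entering the window
            asum += ai
            count += 1
        if i >= window:       # element leaving the window
            aold = a[i - window]
            if aold == aold:
                asum -= aold
                count -= 1
        if i >= window - 1:   # window is full
            y[i] = asum / count if count > 0 else np.nan
    return y

move_nanmean_1d(np.array([1.0, 2.0, np.nan, 4.0]), window=2)
# array([nan, 1.5, 2. , 4. ])
```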
### kwgoodman (Owner) commented Oct 24, 2014

So far I've proposed prototypes for functions that reduce over all axes (axis=None) and functions that do not reduce at all (e.g., moving window functions). What's missing is a prototype of a function that reduces over a single axis. Well, here it is:

```cython
def nansum_float64_axisint(np.ndarray a, int axis):
    cdef int ndim = a.ndim
    if ndim <= 1:
        raise ValueError("`a.ndim` must be greater than 1")
    if axis < 0:
        axis += ndim
    if (axis >= ndim) or (axis < 0):
        raise ValueError("axis(=%d) out of bounds" % axis)
    cdef np.flatiter ita
    ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef Py_ssize_t length = a.shape[axis], i
    cdef int stride = a.strides[axis]
    # temp python hack
    cdef list shape = []
    for i in range(ndim):
        if i != axis:
            shape.append(a.shape[i])
    y = np.empty(shape, np.float64)
    cdef np.flatiter ity = np.PyArray_IterNew(y)
    cdef np.float64_t asum, ai
    while np.PyArray_ITER_NOTDONE(ita):
        asum = 0
        for i in range(length):
            ai = (<np.float64_t*>((<char*>np.PyArray_ITER_DATA(ita)) +
                                  i*stride))[0]
            if ai == ai:
                asum += ai
        (<np.float64_t*>np.PyArray_ITER_DATA(ity))[0] = asum
        np.PyArray_ITER_NEXT(ita)
        np.PyArray_ITER_NEXT(ity)
    return y
```

This function works for arbitrary ndim > 1 and arbitrary axis, leaving only the different dtypes to deal with.
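The same iterate-all-but-one-axis idea can be expressed in pure python: move the reduced axis to the end, then walk 1d lines. This sketch uses the hypothetical name `nansum_axisint`, with `np.moveaxis` standing in for the iterator's axis bookkeeping:

```python
import numpy as np

def nansum_axisint(a, axis):
    # move the reduced axis last, then visit every 1d line along it
    b = np.moveaxis(np.asarray(a, dtype=np.float64), axis, -1)
    y = np.empty(b.shape[:-1])
    for idx in np.ndindex(*b.shape[:-1]):
        asum = 0.0
        for ai in b[idx]:  # 1d line along the reduced axis
            if ai == ai:   # NaN check
                asum += ai
        y[idx] = asum
    return y

a = np.array([[1.0, np.nan, 3.0], [4.0, 5.0, np.nan]])
nansum_axisint(a, axis=0)  # array([5., 5., 3.])
nansum_axisint(a, axis=1)  # array([4., 9.])
```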
### shoyer (Contributor) commented Oct 24, 2014

 @kwgoodman This is very impressive! This code is definitely a step up from the template system.
### shoyer (Contributor) commented Jan 3, 2015

 @kwgoodman It looks like you've been making steady progress on the rewrite branch, but I stumbled across an (ancient/closed) numpy pull request that implemented `nansum` and other aggregation functions as numpy ufuncs that I thought you might find interesting. It's written in C: https://github.com/numpy/numpy/pull/30/files
### kwgoodman (Owner) commented Jan 5, 2015

 I'm curious about how performance compares (bottleneck rewrite versus numpy pull request) if anyone is interested in testing.
### shoyer (Contributor) commented Jan 29, 2015

Have you seen Cython's fused types? http://docs.cython.org/src/userguide/fusedtypes.html Using fused types, such as one covering the integer and float dtypes, could be a nice way to remove a lot of redundancy in this rewrite.
### kwgoodman (Owner) commented Jan 29, 2015

Yep, fused types were in the rewrite branch, but I ended up removing them. I forget the exact issue; that was at an early stage of the rewrite, so it might be a good idea to revisit them and rediscover what issues, if any, remain.
### kwgoodman (Owner) commented Jan 30, 2015

I need to finish up, ugh, the docstrings. The addition of `min_count` means a rewrite of the moving window docstrings. I'm kinda burned out on the rewrite at this point, so any and all suggestions are welcome on this moving window docstring template:

```python
"""
Moving window sum along the specified axis, optionally ignoring NaNs.

Parameters
----------
arr : ndarray
    Input array. If `arr` is not an array, a conversion is attempted.
window : int
    The number of elements in the moving window.
min_count : {int, None}, optional
    If the number of non-NaN values in a window is less than `min_count`,
    then a value of NaN is assigned to the window. By default `min_count`
    is None, which is equivalent to setting `min_count` equal to `window`.
axis : int, optional
    The axis over which the window is moved. By default the last axis
    (axis=-1) is used. An axis of None is not allowed.

Returns
-------
y : ndarray
    The moving sum of the input array along the specified axis. The output
    has the same shape as the input.

Examples
--------
>>> arr = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
>>> bn.move_sum(arr, window=2)
array([ nan,   3.,   5.,  nan,  nan])
>>> bn.move_sum(arr, window=2, min_count=1)
array([ 1.,  3.,  5.,  3.,  5.])

"""
```
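For what it's worth, the `min_count` semantics described above are easy to pin down with a 1d pure-python sketch (hypothetical `move_sum`), which reproduces both docstring examples:

```python
import numpy as np

def move_sum(a, window, min_count=None):
    if min_count is None:
        min_count = window  # default: a full window of non-NaN values
    n = len(a)
    y = np.full(n, np.nan)
    asum = 0.0
    count = 0
    for i in range(n):
        ai = a[i]
        if ai == ai:          # element entering the window
            asum += ai
            count += 1
        if i >= window:       # element leaving the window
            aold = a[i - window]
            if aold == aold:
                asum -= aold
                count -= 1
        if count >= min_count:  # otherwise the window's value stays NaN
            y[i] = asum
    return y

arr = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
move_sum(arr, window=2)               # [nan,  3.,  5., nan, nan]
move_sum(arr, window=2, min_count=1)  # [ 1.,  3.,  5.,  3.,  5.]
```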
### kwgoodman (Owner) commented Feb 4, 2015

 I merged the rewrite branch into master.

### kwgoodman closed this Feb 4, 2015
