Rewrite of bottleneck internals #92

Closed
kwgoodman opened this Issue Oct 19, 2014 · 26 comments

@kwgoodman
Owner

kwgoodman commented Oct 19, 2014

Here's a proposal for a rewrite of the bottleneck internals. The main goals:

  • Easier to add functions to bottleneck
  • Easier to maintain
  • Less C code

The easiest way to explain what I have in mind is with an example written in pure python:

import numpy as np


def nanmean_2d_float64_axis0(a):
    algo = Nanmean()
    return reduce_2d_float64_axis0(a, algo)


def reduce_2d_float64_axis0(a, algo):
    n0, n1 = a.shape
    y = np.empty(n1)
    for j in range(n1):
        algo.clear()
        for i in range(n0):
            ai = a[i, j]
            algo.append(ai)
        y[j] = algo.calc()
    return y


class Nanmean(object):

    def __init__(self):
        self.count = 0
        self.asum = 0

    def append(self, ai):
        if ai == ai:
            self.asum += ai
            self.count += 1

    def calc(self):
        if self.count > 0:
            return self.asum / self.count
        else:
            return np.nan

    def clear(self):
        self.asum = 0
        self.count = 0

Note that:

  • nanmean_2d_float64_axis0 keeps the same name and signature as in the current code.
  • reduce_2d_float64_axis0 can be (re)used with any reduce function such as
    nansum or median.
  • The algo, here Nanmean, only has to be written for the 1d case.

To add a new function, the user only has to think about the 1d case and then write the algo class (I've never written a cython class but I assume it can be done). Bottleneck should take care of the rest.
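For example, adding nansum would only require a new algo class; the reducer is reused unchanged. A pure-Python sketch (the reducer is repeated here so the snippet is self-contained, and Nansum is a hypothetical class, not existing bottleneck code):

```python
import numpy as np


def reduce_2d_float64_axis0(a, algo):
    # same generic reducer as above, reused unchanged
    n0, n1 = a.shape
    y = np.empty(n1)
    for j in range(n1):
        algo.clear()
        for i in range(n0):
            algo.append(a[i, j])
        y[j] = algo.calc()
    return y


class Nansum(object):
    # only the 1d accumulation logic has to be written

    def __init__(self):
        self.asum = 0.0

    def append(self, ai):
        if ai == ai:  # skip NaNs
            self.asum += ai

    def calc(self):
        return self.asum

    def clear(self):
        self.asum = 0.0


def nansum_2d_float64_axis0(a):
    return reduce_2d_float64_axis0(a, Nansum())
```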

A similar setup would be used for moving window functions.

I can't picture what all this would look like in practice, but just wanted to record the thought.

@shoyer

shoyer Oct 19, 2014

Contributor

Coincidentally, this week I have also been thinking about how to rewrite bottleneck internals. I'll lay out my thoughts, and then respond to yours. But let me preface this by saying that I'm in 100% agreement with your goals here.

First, I think bottleneck should get out of the business of writing separate functions for each different number of dimensions. Instead, we should use numpy's generalized ufuncs.

Second, I think we should consider eventually moving the core of bottleneck from Cython to Numba. For the simple cases I tried, Numba is not quite as fast as the Cython found in bottleneck, but it is usually pretty close (within 20%). However, the functions written for Numba are much more readable than Cython (they run unmodified as Python code), and Numba already handles writing generalized ufuncs.

I have a working prototype (see this gist and numba/numba#825), but there are still a few barriers to making this work elegantly. Most notably, generalized ufuncs really should have an axis argument that allows for specifying core dimensions (numpy/numpy#5197).

It's also almost certainly possible to write gufuncs from Cython instead of Numba, though my googling could not turn up any complete examples. This tutorial gets pretty close. I think this is probably a better option than writing functions like reduce_2d_float64_axis0 ourselves.

Doing this right with gufuncs and/or numba will require some upstream work, but it seems possibly achievable in the intermediate term future (perhaps even with numpy 1.10). If we switch to using numba instead of cython, that would be a significant enough change that we should probably create a new project with a new import name (perhaps bottleneck2?).

As for your proposal, my main concerns are:

  1. The Nanmean class is an improvement over the multi-dimensional template language, but it's still much less obvious than the 1d function.
  2. I'm pretty sure we won't be able to fit every existing bottleneck function into this model. Some of them (e.g., nanvar) use two passes over the data.
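To make the gufunc idea concrete: with a core signature of (n)->(), the kernel is written once for the 1d case and the ufunc machinery loops it over any outer dimensions. np.vectorize with a signature argument has the same broadcasting semantics (though none of the speed), so it can sketch the interface; numba's @guvectorize compiles kernels of the same shape:

```python
import numpy as np


def _nanmean_1d(a):
    # the kernel is written only for the 1d case
    mask = a == a  # True for non-NaN entries
    return a[mask].mean() if mask.any() else np.nan


# gufunc-style core-dimension broadcasting (illustrative only, not fast)
nanmean_gu = np.vectorize(_nanmean_1d, signature='(n)->()')

a = np.array([[1.0, np.nan], [2.0, 3.0]])
nanmean_gu(a)                      # reduces along the last axis
nanmean_gu(np.moveaxis(a, 0, -1))  # "axis" selection by moving the axis into core position
```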
@kwgoodman

kwgoodman Oct 19, 2014

Owner

Interesting.

Your proposal would certainly make it easier to add new functions to bottleneck. And would lower the barrier to user contributions.

Is your proposal to do this in steps? Numpy gufuncs first, then numba later?


@shoyer

shoyer Oct 19, 2014

Contributor

> Is your proposal to do this in steps? Numpy gufuncs first, then numba later?

That's certainly an option, but my C/Cython is pretty bad, and that would end up being wasted work later, if/when we switch to numba.

My inclination is to try out both at once (in a separate/parallel project with as much copied code/tests from bottleneck as possible), and count on the numba devs to smooth out any rough edges we identify once we have some performance benchmarks. Numba is not quite fast enough yet (especially pending the fix for numba/numba#815), but it is being very actively developed these days.

We do still need a little bit of boilerplate to make things work right now, but eventually that should go away. My hope is that the axis selection bit could end up in numpy pretty quickly.

Personally, I am more excited about trying this out with numba than hooking up gufuncs to 1d versions of the existing Cython code. But that's also a perfectly reasonable option, and there's no reason it couldn't be tried out in parallel. Switching to numba at this point would indeed mean a slight performance hit, and there are a few bottleneck modules written in C that would be harder to convert over.

In my fantasy world, writing a performant finitesum with bottleneck (e.g., as requested on the mailing list recently) is as easy as the following:

import numpy as np

from bottleneck2 import ndreduce

@ndreduce
def finitesum(a):
    asum = 0.0
    for ai in a.flat:
        if np.isfinite(ai):
            asum += ai
    return asum

I can see this as feasible with numba, not so much with Cython :).
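Pending numba, the intended interface can at least be mocked in pure Python. Here is a hypothetical stand-in for the ndreduce decorator (slow, built on np.apply_along_axis, and not how numbagg actually implements it):

```python
import numpy as np


def ndreduce(func):
    # hypothetical pure-Python stand-in: adds an axis argument to a
    # scalar-returning reduction written for the 1d/flat case
    def wrapper(a, axis=None):
        a = np.asarray(a)
        if axis is None:
            return func(a)
        return np.apply_along_axis(func, axis, a)
    return wrapper


@ndreduce
def finitesum(a):
    asum = 0.0
    for ai in a.flat:
        if np.isfinite(ai):
            asum += ai
    return asum
```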


@Nodd

Nodd Oct 20, 2014

Just keep in mind that using numba adds a dependency to bottleneck, whereas it is currently possible to install bottleneck without cython.
Also, numba is a pain to install if you're not using anaconda. (That was a few months ago; it may have changed since.)


@shoyer

shoyer Oct 20, 2014

Contributor

@Nodd Agreed, numba is a pretty bad dependency to add. That's why I suggested making it technically a separate project (e.g., called bottleneck2 or something else entirely).


@kwgoodman

kwgoodman Oct 20, 2014

Owner

I converted the python code in my proposal to cython:

import numpy as np
cimport numpy as np
import cython
from numpy cimport NPY_FLOAT64 as NPY_float64
from numpy cimport PyArray_EMPTY, PyArray_DIMS, import_array
import_array()


def nanmean_2d_float64_axis0(np.ndarray[np.float64_t, ndim=2] a):
    algo = Nanmean()
    return reduce_2d_float64_axis0(a, algo)


@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[np.float64_t, ndim=1] reduce_2d_float64_axis0(
        np.ndarray[np.float64_t, ndim=2] a, Nanmean algo):

    cdef Py_ssize_t i0, i1
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef Py_ssize_t n0 = dim[0]
    cdef Py_ssize_t n1 = dim[1]
    cdef np.npy_intp *dims = [n1]
    cdef np.ndarray[np.float64_t, ndim=1] y = PyArray_EMPTY(1, dims,
        NPY_float64, 0)

    for i1 in range(n1):
        algo.clear()
        for i0 in range(n0):
            algo.append(a[i0, i1])
        y[i1] = algo.calc()

    return y


cdef class Nanmean:

    cdef int count
    cdef double asum, ai

    def __init__(self):
        self.asum = 0.0
        self.count = 0

    cdef void append(self, double ai):
        if ai == ai:
            self.asum += ai
            self.count += 1

    cdef double calc(self):
        cdef double y
        if self.count > 0:
            y = self.asum / self.count
        else:
            y = np.nan
        return y

    cdef void clear(self):
        self.asum = 0.0
        self.count = 0

But it is only half as fast as bn.nanmean().

nanmean is such a simple calculation that almost all of the time is spent looping over the array elements, so calling Nanmean.append at each array element adds too much overhead.
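The dispatch cost is visible even in pure Python: the method-based accumulation and the inlined loop do the same arithmetic, but the former pays a call per element (a sketch; absolute timings are machine-dependent):

```python
import timeit

import numpy as np


class Acc(object):
    def __init__(self):
        self.asum = 0.0

    def append(self, ai):
        if ai == ai:
            self.asum += ai


def via_method(a):
    # one Python-level method dispatch per element
    acc = Acc()
    for ai in a:
        acc.append(ai)
    return acc.asum


def inlined(a):
    # same arithmetic with the accumulation inlined
    asum = 0.0
    for ai in a:
        if ai == ai:
            asum += ai
    return asum


a = np.arange(10000, dtype=np.float64)
assert via_method(a) == inlined(a)
print(timeit.timeit(lambda: via_method(a), number=10))
print(timeit.timeit(lambda: inlined(a), number=10))
```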


@kwgoodman

kwgoodman Oct 20, 2014

Owner

I have a new proposal for a rewrite of the bottleneck internals. This one seems to be speed competitive with the current bottleneck.

import numpy as np
cimport numpy as np
import cython
from numpy cimport NPY_FLOAT64 as NPY_float64
from numpy cimport PyArray_EMPTY, PyArray_DIMS, import_array
import_array()


# ---------------------------------------------------------------------------
# user or bottleneck developer would write this function:

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double nanmean_1d_double(double* x, int step, int n0):

    cdef int count = 0
    cdef double asum = 0, ai
    cdef int i

    for i in range(n0):
        ai = x[i * step]
        if ai == ai:
            asum += ai
            count += 1

    if count > 0:
        ai = asum / count
    else:
        ai = np.nan

    return ai


# ---------------------------------------------------------------------------
# bottleneck's templating system would auto-generate these functions:

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis0(np.ndarray[np.float64_t, ndim=2] a):

    cdef Py_ssize_t i
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef int step
    cdef np.npy_intp *dims = [n1]
    cdef np.ndarray[np.float64_t, ndim=1] y

    y = PyArray_EMPTY(1, dims, NPY_float64, 0)

    cdef double* aptr = &a[0, 0]
    step = n1

    for i in range(n1):
        y[i] = nanmean_1d_double(aptr + i, step, n0)

    return y


@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis1(np.ndarray[np.float64_t, ndim=2] a):

    cdef Py_ssize_t i
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef int step
    cdef np.npy_intp *dims = [n0]
    cdef np.ndarray[np.float64_t, ndim=1] y

    y = PyArray_EMPTY(1, dims, NPY_float64, 0)

    cdef double* aptr = &a[0, 0]
    step = 1

    for i in range(n0):
        y[i] = nanmean_1d_double(aptr + i * n1, step, n1)

    return y


@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axisNone(np.ndarray[np.float64_t, ndim=2] a):

    cdef Py_ssize_t i
    cdef np.npy_intp *dim
    dim = PyArray_DIMS(a)
    cdef int n0 = dim[0]
    cdef int n1 = dim[1]
    cdef int step
    cdef np.npy_intp *dims = [n0]
    cdef double y = 0

    cdef double* aptr = &a[0, 0]
    step = 1

    for i in range(n1):
        y += nanmean_1d_double(aptr + i * n0, step, n0)

    return y / n1  # hack
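
The # hack in the axisNone version deserves a flag: dividing the sum of per-chunk means by the number of chunks only equals the true nanmean when every chunk contributes the same number of non-NaN values. A quick demonstration of where it goes wrong:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, np.nan]])
row_means = np.nanmean(a, axis=1)   # [1.5, 3.0]
mean_of_means = np.mean(row_means)  # 2.25
true_mean = np.nanmean(a)           # 2.0 -- the hack disagrees once NaNs appear
```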
@shoyer

shoyer Oct 20, 2014

Contributor

@kwgoodman can that pointer math really handle any numpy.ndarray? Even arrays that are non-contiguous or don't have their own data (e.g., from slicing)?


@kwgoodman

kwgoodman Oct 20, 2014

Owner

Ugh, taking a transpose breaks the current code:

In [11]: a = np.random.rand(4,2)
In [12]: bn.nanmean(a,0)
Out[12]: array([ 0.37224901,  0.30965753])
In [13]: nanmean_2d_float64_axis0(a)
Out[13]: array([ 0.37224901,  0.30965753])

In [14]: a = a.T
In [15]: bn.nanmean(a,0)
Out[15]: array([ 0.32749017,  0.6195802 ,  0.22710758,  0.18963513])
In [16]: nanmean_2d_float64_axis0(a)
Out[16]: array([ 0.46717604,  0.08742171,  0.27732198,  0.53189335])
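
The underlying issue is that the pointer arithmetic assumes C-contiguous memory, while a transpose is just a view with swapped strides; a quick check:

```python
import numpy as np

a = np.zeros((4, 2))            # C ordered float64: rows are contiguous
t = a.T                         # a view of the same memory with swapped strides

print(a.strides)                # (16, 8)
print(t.strides)                # (8, 16)
print(t.flags['C_CONTIGUOUS'])  # False: aptr + i*step walks the wrong elements
```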
@shoyer

shoyer Oct 21, 2014

Contributor

OK, so I went ahead and started another project for Numba accelerated functions: https://github.com/shoyer/numbagg

Still happy to roll this back into bottleneck at some point if/when the time is right.


@kwgoodman

kwgoodman Oct 21, 2014

Owner

numbagg is very impressive! What are timings currently like for small input arrays? Something like:

In [1]: from bottleneck.func import nanmean_2d_float64_axis1
In [2]: a = np.random.RandomState(42).randn(5, 5)
In [3]: timeit nanmean_2d_float64_axis1(a)
1000000 loops, best of 3: 778 ns per loop
In [4]: a = np.random.RandomState(42).randn(100000, 5)
In [5]: timeit nanmean_2d_float64_axis1(a)
1000 loops, best of 3: 515 µs per loop
@shoyer

shoyer Oct 21, 2014

Contributor

Here are a few more benchmarks:

for size in [(5, 5), (100000, 5), (1000, 1000)]:
    for axis in [0, 1, None]:
        a = np.random.RandomState(42).randn(*size) 
        print('size=%s, axis=%s:' % (size, axis))
        %timeit bottleneck.nanmean(a, axis=axis)
        %timeit numbagg.nanmean(a, axis=axis)
size=(5, 5), axis=0:
100000 loops, best of 3: 2.46 µs per loop
100000 loops, best of 3: 5.13 µs per loop
size=(5, 5), axis=1:
100000 loops, best of 3: 2.34 µs per loop
100000 loops, best of 3: 5.24 µs per loop
size=(5, 5), axis=None:
1000000 loops, best of 3: 1.58 µs per loop
1000000 loops, best of 3: 1.04 µs per loop
size=(100000, 5), axis=0:
1000 loops, best of 3: 719 µs per loop
1000 loops, best of 3: 771 µs per loop
size=(100000, 5), axis=1:
1000 loops, best of 3: 579 µs per loop
1000 loops, best of 3: 838 µs per loop
size=(100000, 5), axis=None:
1000 loops, best of 3: 480 µs per loop
1000 loops, best of 3: 505 µs per loop
size=(1000, 1000), axis=0:
100 loops, best of 3: 7.86 ms per loop
100 loops, best of 3: 7.9 ms per loop
size=(1000, 1000), axis=1:
1000 loops, best of 3: 1.01 ms per loop
1000 loops, best of 3: 1.08 ms per loop
size=(1000, 1000), axis=None:
1000 loops, best of 3: 1.01 ms per loop
1000 loops, best of 3: 1.1 ms per loop

For comparison, here are my times for that first test case using bottleneck.func.nanmean_2d_float64_axis* directly for axis=0,1,None:

100000 loops, best of 3: 1.71 µs per loop
1000000 loops, best of 3: 1.73 µs per loop
1000000 loops, best of 3: 1.09 µs per loop

So numbagg is actually faster than even the direct bottleneck function for axis=None, but there is some overhead if the axis is specified. We could probably get that down with some profiling.


@kwgoodman

kwgoodman Oct 21, 2014

Owner

That is fast!

For axis=None bottleneck returns np.float(answer) while numbagg returns answer. Could that explain the overhead difference for axis=None?


@shoyer

shoyer Oct 21, 2014

Contributor

Yep, looks like np.float(answer) adds a few hundred ns of overhead.

>>> %timeit np.float(1.0)
10000000 loops, best of 3: 152 ns per loop
@shoyer

shoyer Oct 21, 2014

Contributor

Just pushed a new commit with simpler logic for parsing axis when it's an integer, which speeds up most of the above benchmarks (I edited my comment).


@kwgoodman

kwgoodman Oct 22, 2014

Owner

Here's a nansum along axis=None for arbitrary ndim:

import numpy as np
cimport numpy as np
from numpy cimport import_array
import_array()


def nansum_float64_axisNone(np.ndarray a):
    cdef int axis=-1
    cdef np.flatiter ita
    ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef Py_ssize_t stride = a.strides[axis], length = a.shape[axis], i
    cdef np.float64_t asum = 0, ai
    while np.PyArray_ITER_NOTDONE(ita):
        for i in range(length):
            ai = (<np.float64_t*>((<char*>np.PyArray_ITER_DATA(ita)) + i * stride))[0]
            if ai == ai:
                asum += ai
        np.PyArray_ITER_NEXT(ita)
    return asum

Same speed as bottleneck, except for being much, much faster when the array is fortran ordered.
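In NumPy terms the iterator pattern is: visit every index of all axes except one, then run the 1d kernel on each remaining line. A slow pure-Python analogue of what PyArray_IterAllButAxis does (illustration only, not the Cython code above):

```python
import numpy as np


def nansum_axisNone(a):
    a = np.asarray(a, dtype=np.float64)
    total = 0.0
    # outer loop: all indices of every axis except the last,
    # like PyArray_IterAllButAxis with axis=-1
    for idx in np.ndindex(*a.shape[:-1]):
        # inner loop: the remaining 1d line, skipping NaNs
        for ai in a[idx]:
            if ai == ai:
                total += ai
    return total
```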


@shoyer

shoyer Oct 22, 2014

Contributor

@kwgoodman very nice! Now I just need the numba devs to expose something like that :).


@kwgoodman

kwgoodman Oct 23, 2014

Owner

And here's a prototype for moving window functions. Note that it works for arbitrary ndim and axis, which leaves only dtype to deal with. The speed is the same as the current bottleneck.

def move_nanmean_float64(a, int window, int axis):

    cdef int n = a.shape[axis]
    if (window < 1) or (window > n):
        raise ValueError(MOVE_WINDOW_ERR_MSG % (window, n))
    y = np.empty(a.shape, np.float64)

    cdef Py_ssize_t count, i
    cdef double asum, aold, ai, yi

    cdef np.flatiter ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef np.flatiter ity = np.PyArray_IterAllButAxis(y, &axis)

    cdef int astride = a.strides[axis]
    cdef int ystride = y.strides[axis]

    while np.PyArray_ITER_NOTDONE(ita):

        with nogil:

            asum = 0
            count = 0

            for i in range(window - 1):
                ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*astride))[0]
                if ai == ai:
                    asum += ai
                    count += 1
                (<double*>((<char*>np.PyArray_ITER_DATA(ity)) + i*ystride))[0] = NAN

            i = window - 1
            ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*astride))[0]
            if ai == ai:
                asum += ai
                count += 1
            if count > 0:
                yi = asum / count
            else:
                yi = NAN
            (<double*>((<char*>np.PyArray_ITER_DATA(ity)) + i*ystride))[0] = yi

            for i in range(window, n):
                ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*astride))[0]
                if ai == ai:
                    asum += ai
                    count += 1
                aold = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + (i-window)*astride))[0]
                if aold == aold:
                    asum -= aold
                    count -= 1
                if count > 0:
                    yi = asum / count
                else:
                    yi = NAN
                (<double*>((<char*>np.PyArray_ITER_DATA(ity)) + i*ystride))[0] = yi

            np.PyArray_ITER_NEXT(ita)
            np.PyArray_ITER_NEXT(ity)

    return y

OK, so we have a prototype for reduction along axis=None (shown in a previous comment in this thread) and now a prototype for moving window functions. What's left? The big one is reduction along a given axis. I don't know how to write that one. Anyone know?

Owner

kwgoodman commented Oct 23, 2014

And here's a prototype for moving window functions. Note that it works for arbitrary ndim and axis, which leaves only dtype to deal with. The speed is the same as the current bottleneck.

def move_nanmean_float64(a, int window, int axis):

    cdef int n = a.shape[axis]
    if (window < 1) or (window > n):
        raise ValueError(MOVE_WINDOW_ERR_MSG % (window, n))
    y = np.empty(a.shape, np.float64)

    cdef Py_ssize_t count, i
    cdef double asum, aold, ai, yi

    cdef np.flatiter ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef np.flatiter ity = np.PyArray_IterAllButAxis(y, &axis)

    cdef int astride = a.strides[axis]
    cdef int ystride = y.strides[axis]

    while np.PyArray_ITER_NOTDONE(ita):

        with nogil:

            asum = 0
            count = 0

            for i in range(window - 1):
                ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*astride))[0]
                if ai == ai:
                    asum += ai
                    count += 1
                (<double*>((<char*>np.PyArray_ITER_DATA(ity)) + i*ystride))[0] = NAN

            i = window - 1
            ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*astride))[0]
            if ai == ai:
                asum += ai
                count += 1
            if count > 0:
                yi = asum / count
            else:
                yi = NAN
            (<double*>((<char*>np.PyArray_ITER_DATA(ity)) + i*ystride))[0] = yi

            for i in range(window, n):
                ai = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*astride))[0]
                if ai == ai:
                    asum += ai
                    count += 1
                aold = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + (i-window)*astride))[0]
                if aold == aold:
                    asum -= aold
                    count -= 1
                if count > 0:
                    yi = asum / count
                else:
                    yi = NAN
                (<double*>((<char*>np.PyArray_ITER_DATA(ity)) + i*ystride))[0] = yi

            np.PyArray_ITER_NEXT(ita)
            np.PyArray_ITER_NEXT(ity)

    return y

OK, so we have a prototype for reduction along axis=None (shown in a previous comment in this thread) and now a prototype for moving window functions. What's left? The big one is reduction along a given axis. I don't know how to write that one. Anyone know?
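For reference, the per-slice update that the Cython prototype performs can be sketched in pure Python for the 1d case. This is only an illustration of the add-the-new-element, subtract-the-old-element running sum; `move_nanmean_1d` is a hypothetical helper name, not part of bottleneck:

```python
import numpy as np


def move_nanmean_1d(a, window):
    """Pure-Python sketch of the running-sum update above (1d case only)."""
    n = len(a)
    y = np.empty(n)
    asum = 0.0
    count = 0
    for i in range(n):
        ai = a[i]
        if ai == ai:  # ai is not NaN
            asum += ai
            count += 1
        if i >= window:  # subtract the element that just left the window
            aold = a[i - window]
            if aold == aold:
                asum -= aold
                count -= 1
        if i < window - 1:  # window not yet full
            y[i] = np.nan
        else:
            y[i] = asum / count if count > 0 else np.nan
    return y
```

The Cython version does exactly this along one axis at a time, with the outer `while np.PyArray_ITER_NOTDONE(ita)` loop supplying each 1d slice.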


Owner

kwgoodman commented Oct 24, 2014

So far I've proposed prototypes for functions that reduce over all axes (axis=None) and functions that do not reduce at all (e.g. moving window functions).

What's missing is a prototype of a function that reduces over a single axis. Well, here it is:

def nansum_float64_axisint(np.ndarray a, int axis):

    cdef int ndim = a.ndim
    if ndim <= 1:
        raise ValueError("`a.ndim` must be greater than 1")
    if axis < 0:
        axis += ndim
    if (axis >= ndim) or (axis < 0):
        raise ValueError("axis(=%d) out of bounds" % axis)

    cdef np.flatiter ita
    ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef Py_ssize_t length = a.shape[axis], i
    cdef int stride = a.strides[axis]

    # temp python hack
    cdef list shape = []
    for i in range(ndim):
        if i != axis:
            shape.append(a.shape[i])

    y = np.empty(shape, np.float64)
    cdef np.flatiter ity = np.PyArray_IterNew(y)

    cdef np.float64_t asum, ai

    while np.PyArray_ITER_NOTDONE(ita):
        asum = 0
        for i in range(length):
            ai = (<np.float64_t*>((<char*>np.PyArray_ITER_DATA(ita)) + i*stride))[0]
            if ai == ai:
                asum += ai
        (<double*>((<char*>np.PyArray_ITER_DATA(ity))))[0] = asum
        np.PyArray_ITER_NEXT(ita)
        np.PyArray_ITER_NEXT(ity)

    return y

This function works for arbitrary ndim > 1 and arbitrary axis, leaving only the different dtypes to deal with.
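The traversal can be mocked up in pure NumPy to see what `PyArray_IterAllButAxis` is doing: move the reduction axis to the end, then loop over every remaining position. `nansum_along_axis` below is a hypothetical illustration only (and far slower than the Cython version):

```python
import numpy as np


def nansum_along_axis(a, axis):
    """Sketch of the iterate-over-all-but-axis strategy in pure NumPy."""
    a = np.asarray(a, dtype=np.float64)
    b = np.moveaxis(a, axis, -1)       # reduction axis becomes the last axis
    out_shape = b.shape[:-1]           # shape of the reduced output
    flat = b.reshape(-1, b.shape[-1])  # one row per "all but axis" position
    y = np.empty(flat.shape[0])
    for k in range(flat.shape[0]):     # plays the role of the Cython while loop
        asum = 0.0
        for ai in flat[k]:
            if ai == ai:               # skip NaNs
                asum += ai
        y[k] = asum
    return y.reshape(out_shape)
```

Each row of `flat` corresponds to one stop of the `np.flatiter`, and the inner `for` loop over `length` elements is the strided read in the Cython code.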


Contributor

shoyer commented Oct 24, 2014

@kwgoodman This is very impressive! This code is definitely a step up from the template system.


Contributor

shoyer commented Jan 3, 2015

@kwgoodman It looks like you've been making steady progress on the rewrite branch, but I stumbled across an (ancient/closed) numpy pull request that implemented nansum and other aggregation functions as numpy ufuncs that I thought you might find interesting. It's written in C: https://github.com/numpy/numpy/pull/30/files


Owner

kwgoodman commented Jan 5, 2015

I'm curious about how performance compares (bottleneck rewrite versus numpy pull request) if anyone is interested in testing.


Contributor

shoyer commented Jan 29, 2015

Have you seen Cython's fused types? http://docs.cython.org/src/userguide/fusedtypes.html

Using fused types like integer and float could be a nice way to remove a lot of redundancy in this rewrite.


Owner

kwgoodman commented Jan 29, 2015

Yep, fused types were in the rewrite branch, but I ended up removing them. I forget the exact issue. That was at an early stage of the rewrite, so it might be a good idea to revisit and rediscover what issues, if any, there are.

Owner

kwgoodman commented Jan 30, 2015

I need to finish up, ugh, the docstrings. The addition of min_count means a rewrite of the moving window docstrings. I'm kinda burned out on the rewrite at this point, so any and all suggestions are welcome on this moving window docstring template:

"""
Moving window sum along the specified axis, optionally ignoring NaNs.

Parameters
----------
arr : ndarray
    Input array. If `arr` is not an array, a conversion is attempted.
window : int
    The number of elements in the moving window.
min_count : {int, None}, optional
    If the number of non-NaN values in a window is less than `min_count`,
    then a value of NaN is assigned to the window. By default `min_count`
    is None, which is equivalent to setting `min_count` equal to `window`.
axis : int, optional
    The axis over which the window is moved. By default the last axis
    (axis=-1) is used. An axis of None is not allowed.

Returns
-------
y : ndarray
    The moving sum of the input array along the specified axis. The output
    has the same shape as the input.

Examples
--------
>>> arr = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
>>> bn.move_sum(arr, window=2)
array([ nan,   3.,   5.,  nan,  nan])
>>> bn.move_sum(arr, window=2, min_count=1)
array([ 1.,  3.,  5.,  3.,  5.])

"""

Owner

kwgoodman commented Feb 4, 2015

I merged the rewrite branch into master.

@kwgoodman kwgoodman closed this Feb 4, 2015
