Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: implement digitize with PyArray_SearchSorted #5101

Merged
merged 4 commits into from
Sep 30, 2014

Conversation

jaimefrio
Copy link
Member

Closes #2656 and replaces #4184.

@jaimefrio
Copy link
Member Author

Some timings follow... Using binary search starts being faster for haystack sizes of ~30, although the exact crossover depends also on the size of the needle, and is different for increasing and decreasing haystacks. Haystacks of 1000 items are now close to 10x faster, but very small haystacks of only 2 items may be up to 3x slower.

For each needle and haystack sizes, there are four benchmarks: with increasing and decreasing haystack, and with the two possible values of right. First timing is current master, second is this PR:

haystack size: 2, needle size: 10                       master    this PR
    np.digitize(needle, haystack_inc, right=False) --> 8.210e-07 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.642e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 2, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.642e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 2.463e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 2.053e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 2.463e-06
haystack size: 2, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 4.105e-06 6.158e-06
    np.digitize(needle, haystack_dec, right=False) --> 4.516e-06 8.210e-06
    np.digitize(needle, haystack_inc, right=True)  --> 3.695e-06 9.442e-06
    np.digitize(needle, haystack_dec, right=True)  --> 4.105e-06 1.067e-05
haystack size: 5, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.642e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 5, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 1.642e-06 2.053e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.642e-06 2.463e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 2.874e-06
haystack size: 5, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 5.337e-06 8.210e-06
    np.digitize(needle, haystack_dec, right=False) --> 6.158e-06 9.442e-06
    np.digitize(needle, haystack_inc, right=True)  --> 5.337e-06 1.149e-05
    np.digitize(needle, haystack_dec, right=True)  --> 5.747e-06 1.273e-05
haystack size: 10, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.232e-06
haystack size: 10, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 1.642e-06 2.053e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.642e-06 2.463e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.642e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.642e-06 3.284e-06
haystack size: 10, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 9.031e-06 1.108e-05
    np.digitize(needle, haystack_dec, right=False) --> 1.067e-05 1.232e-05
    np.digitize(needle, haystack_inc, right=True)  --> 9.031e-06 1.396e-05
    np.digitize(needle, haystack_dec, right=True)  --> 9.853e-06 1.519e-05
haystack size: 15, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 15, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 1.642e-06 2.053e-06
    np.digitize(needle, haystack_dec, right=False) --> 2.053e-06 2.874e-06
    np.digitize(needle, haystack_inc, right=True)  --> 2.053e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=True)  --> 2.053e-06 3.284e-06
haystack size: 15, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 1.273e-05 1.396e-05
    np.digitize(needle, haystack_dec, right=False) --> 1.519e-05 1.519e-05
    np.digitize(needle, haystack_inc, right=True)  --> 1.314e-05 1.560e-05
    np.digitize(needle, haystack_dec, right=True)  --> 1.396e-05 1.724e-05
haystack size: 20, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 20, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 2.053e-06 2.053e-06
    np.digitize(needle, haystack_dec, right=False) --> 2.053e-06 2.874e-06
    np.digitize(needle, haystack_inc, right=True)  --> 2.053e-06 2.874e-06
    np.digitize(needle, haystack_dec, right=True)  --> 2.053e-06 3.284e-06
haystack size: 20, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 1.642e-05 1.765e-05
    np.digitize(needle, haystack_dec, right=False) --> 1.765e-05 1.806e-05
    np.digitize(needle, haystack_inc, right=True)  --> 1.642e-05 1.806e-05
    np.digitize(needle, haystack_dec, right=True)  --> 1.724e-05 2.012e-05
haystack size: 25, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 25, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 2.463e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=False) --> 2.463e-06 2.874e-06
    np.digitize(needle, haystack_inc, right=True)  --> 2.463e-06 2.874e-06
    np.digitize(needle, haystack_dec, right=True)  --> 2.463e-06 3.284e-06
haystack size: 25, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 1.971e-05 1.929e-05
    np.digitize(needle, haystack_dec, right=False) --> 2.094e-05 2.094e-05
    np.digitize(needle, haystack_inc, right=True)  --> 1.929e-05 2.012e-05
    np.digitize(needle, haystack_dec, right=True)  --> 2.012e-05 2.176e-05
haystack size: 30, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 30, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 2.463e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=False) --> 2.874e-06 2.874e-06
    np.digitize(needle, haystack_inc, right=True)  --> 2.874e-06 2.874e-06
    np.digitize(needle, haystack_dec, right=True)  --> 2.874e-06 3.284e-06
haystack size: 30, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 2.258e-05 2.094e-05
    np.digitize(needle, haystack_dec, right=False) --> 2.381e-05 2.176e-05
    np.digitize(needle, haystack_inc, right=True)  --> 2.217e-05 2.094e-05
    np.digitize(needle, haystack_dec, right=True)  --> 2.299e-05 2.217e-05
haystack size: 35, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.642e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 35, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 2.874e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=False) --> 3.284e-06 2.874e-06
    np.digitize(needle, haystack_inc, right=True)  --> 2.874e-06 2.874e-06
    np.digitize(needle, haystack_dec, right=True)  --> 3.284e-06 3.695e-06
haystack size: 35, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 2.545e-05 2.217e-05
    np.digitize(needle, haystack_dec, right=False) --> 2.627e-05 2.340e-05
    np.digitize(needle, haystack_inc, right=True)  --> 2.545e-05 2.258e-05
    np.digitize(needle, haystack_dec, right=True)  --> 2.586e-05 2.381e-05
haystack size: 40, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.642e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 40, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 3.284e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=False) --> 3.284e-06 3.284e-06
    np.digitize(needle, haystack_inc, right=True)  --> 3.284e-06 2.874e-06
    np.digitize(needle, haystack_dec, right=True)  --> 3.284e-06 3.695e-06
haystack size: 40, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 2.792e-05 2.381e-05
    np.digitize(needle, haystack_dec, right=False) --> 2.874e-05 2.545e-05
    np.digitize(needle, haystack_inc, right=True)  --> 2.792e-05 2.422e-05
    np.digitize(needle, haystack_dec, right=True)  --> 2.874e-05 2.586e-05
haystack size: 45, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.642e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 45, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 3.284e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=False) --> 3.695e-06 3.284e-06
    np.digitize(needle, haystack_inc, right=True)  --> 3.695e-06 2.874e-06
    np.digitize(needle, haystack_dec, right=True)  --> 3.695e-06 3.695e-06
haystack size: 45, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 2.997e-05 2.463e-05
    np.digitize(needle, haystack_dec, right=False) --> 3.079e-05 2.586e-05
    np.digitize(needle, haystack_inc, right=True)  --> 2.997e-05 2.504e-05
    np.digitize(needle, haystack_dec, right=True)  --> 3.120e-05 2.627e-05
haystack size: 50, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.232e-06 1.642e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.232e-06 1.232e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.232e-06 1.642e-06
haystack size: 50, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 3.695e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=False) --> 4.105e-06 3.284e-06
    np.digitize(needle, haystack_inc, right=True)  --> 3.695e-06 2.874e-06
    np.digitize(needle, haystack_dec, right=True)  --> 4.105e-06 3.695e-06
haystack size: 50, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 3.284e-05 2.586e-05
    np.digitize(needle, haystack_dec, right=False) --> 3.325e-05 2.709e-05
    np.digitize(needle, haystack_inc, right=True)  --> 3.284e-05 2.586e-05
    np.digitize(needle, haystack_dec, right=True)  --> 3.366e-05 2.709e-05
haystack size: 100, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 1.642e-06 1.642e-06
    np.digitize(needle, haystack_dec, right=False) --> 1.642e-06 1.642e-06
    np.digitize(needle, haystack_inc, right=True)  --> 1.642e-06 1.642e-06
    np.digitize(needle, haystack_dec, right=True)  --> 1.642e-06 1.642e-06
haystack size: 100, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 6.158e-06 2.874e-06
    np.digitize(needle, haystack_dec, right=False) --> 6.568e-06 2.874e-06
    np.digitize(needle, haystack_inc, right=True)  --> 6.158e-06 3.284e-06
    np.digitize(needle, haystack_dec, right=True)  --> 6.568e-06 3.695e-06
haystack size: 100, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 5.583e-05 3.366e-05
    np.digitize(needle, haystack_dec, right=False) --> 5.624e-05 3.489e-05
    np.digitize(needle, haystack_inc, right=True)  --> 5.994e-05 3.654e-05
    np.digitize(needle, haystack_dec, right=True)  --> 5.665e-05 3.654e-05
haystack size: 1000, needle size: 10
    np.digitize(needle, haystack_inc, right=False) --> 7.800e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=False) --> 8.210e-06 2.874e-06
    np.digitize(needle, haystack_inc, right=True)  --> 7.800e-06 2.463e-06
    np.digitize(needle, haystack_dec, right=True)  --> 8.210e-06 2.874e-06
haystack size: 1000, needle size: 100
    np.digitize(needle, haystack_inc, right=False) --> 4.721e-05 4.516e-06
    np.digitize(needle, haystack_dec, right=False) --> 4.762e-05 4.926e-06
    np.digitize(needle, haystack_inc, right=True)  --> 4.721e-05 4.926e-06
    np.digitize(needle, haystack_dec, right=True)  --> 4.762e-05 5.337e-06
haystack size: 1000, needle size: 1000
    np.digitize(needle, haystack_inc, right=False) --> 4.762e-04 5.870e-05
    np.digitize(needle, haystack_dec, right=False) --> 4.766e-04 5.994e-05
    np.digitize(needle, haystack_inc, right=True)  --> 4.762e-04 7.307e-05
    np.digitize(needle, haystack_dec, right=True)  --> 4.766e-04 6.979e-05

@jaimefrio
Copy link
Member Author

If anyone wants to replicate the benchmarks on a different system, this is the code I have used:

from __future__ import division
import numpy as np
from timeit import repeat

haystack_sizes = [2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 1000]
needle_sizes = [10, 100, 1000]

codes = ['np.digitize(needle, haystack_inc, right=False)',
         'np.digitize(needle, haystack_dec, right=False)',
         'np.digitize(needle, haystack_inc, right=True) ',
         'np.digitize(needle, haystack_dec, right=True) ',
         ]
setup = 'from __main__ import np, needle, haystack_inc, haystack_dec'

for h_size in haystack_sizes:
    haystack_inc = np.linspace(0, 1, h_size+2)[1:-1]
    haystack_dec = np.linspace(1, 0, h_size+2)[1:-1]
    for n_size in needle_sizes:
        np.random.seed(0)
        needle = np.random.rand(n_size)
        print 'haystack size: {}, needle size: {}'.format(h_size, n_size)
        t = min(repeat(codes[0], setup, repeat=10, number=1))
        reps = max(10, int(.1/t))
        for code in codes:
            t = min(repeat(code, setup, repeat=reps, number=1))
            print '    {} --> {:.3e}'.format(code, t)

return NULL;
}
/* If bins is decreasing, ret has bins from end, not start */
if (monotonic == -1) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyArray_SearchSorted will release the GIL for the heavy lifting of this function. The other potentially lengthy operation is this subtraction from the array length for decreasing bins arrays. Is it worth releasing the GIL here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not really significant but also does no harm, there is a macro that only releases it if a counter is higher than 500 to avoid unnecessary overheads, that can be used here.

@skuschel
Copy link

Am I right, that this PR basically removes the algorithm, that was used by digitize up to now? Why would we remove a working algorithm? I think we shouldnt touch digitize, nor searchsorted. People might want to choose one or the other algorithm for a reason (Anyways documentation should cross link between both of course).
But as soon as the user tries to accomplish a task (requesting a histogram) he is no longer asking for a special algorithm. In this case the user is asking for a histogram and doesnt want to think about how its done. Thus in this case the numpy code should just make a (parameter dependent) decision which algorithm to use in order to create the histogram fastest. Surely choosing an algorithm doesnt make sure that this will always be the optimal choice. But having a good guess is fair enough. histogramdd is currently basically useless for huge data sets (and many bins, but everyone would use a couple hundred bins if you have 100e6 data points). The actual problem is histogramdd's runtime on huge data sets.

@njsmith
Copy link
Member

njsmith commented Sep 23, 2014

On 23 Sep 2014 18:15, "Stephan Kuschel" notifications@github.com wrote:

Am I right, that this PR basically removes the algorithm, that was used
by digitize up to now? Why would we remove a working algorithm?

We should always remove every algorithm (or other code) that we can, unless
there is some good reason to keep it :-). The best case is to have one
piece of code that Just Works in all cases; if that's not possible then we
should clearly document the different algorithms used and how they are
selected. The worst case is what we have now, with two functions that are
documented to do exactly the same thing using two different implementations.

@@ -4845,14 +4845,14 @@ def luf(lamdaexpr, *args, **kwargs):
Parameters
----------
x : array_like
Input array to be binned. It has to be 1-dimensional.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if it isn't 1-D?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return is of the same shape as x, which is the standard searchsorted behavior. We can restrict this to 1D arrays if there is a convincing reason for this, but it seemed unnecessary to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. The change should be documented with a .. versionadded:: and mentioned in the release notes.

@skuschel
Copy link

Sorry for the confusion! I totally agree: If there is an algorithm performing better in every situation, it should replace the old one completely. But in this case the removed algorithm scales (a little) better for small data sets. Thats why nothing changed in #2656. Its not desirable to have 2 functions with identical output, but different scalings for different problem sizes -- still -- it can happen. You are right, the documentation doesnt state this clearly.

Anyways, I came to this because I tried to find out what slows down the histogram2d function on large arrays with a couple 100 bins (see third post in #2656 from 3 months ago). My personal use is around 700x700 bins and 100e6 data points. So this is actually a different discussion, but this PR should speed up my use case as well.

Thanks to everyone for your fast response!

@jaimefrio
Copy link
Member Author

I have added a note to the docstring and the release notes, and the GIL is now released with the thresholded macro for the monotonicity check and the subtraction from the number of bins for decreasing bins.

@@ -4877,6 +4877,13 @@ def luf(lamdaexpr, *args, **kwargs):
attempting to index `bins` with the indices that `digitize` returns
will result in an IndexError.

.. versionadded:: 1.10.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like the right place. Hmm, I'll make a comment up above.

@charris
Copy link
Member

charris commented Sep 25, 2014

Needs rebase, probably because of the notes.

@charris
Copy link
Member

charris commented Sep 30, 2014

LGTM. I assume this gives the same result as the old digitize for decreasing bins?

@jaimefrio
Copy link
Member Author

Yes, there is a test checking that for all 4 cases (increasing/decreasing x right/left). And the whole block starting at L283 is about getting the value right for decreasing bins.

@charris
Copy link
Member

charris commented Sep 30, 2014

OK, thanks. In it goes.

charris added a commit that referenced this pull request Sep 30, 2014
ENH: implement `digitize` with `PyArray_SearchSorted`
@charris charris merged commit 8cdc5f5 into numpy:master Sep 30, 2014
@jaimefrio jaimefrio deleted the digitize_as_searchsorted branch September 30, 2014 04:56
jaimefrio added a commit to jaimefrio/numpy that referenced this pull request Dec 8, 2014
The new searchsorted-based digitize introduced in numpy#5101 segfaults
when it should raise a TypeError.
@jrmlhermitte
Copy link
Contributor

Could you please update the help on digitize to discuss this issue? This will greatly help scientific users transitioning to python. Thanks.

@jaimefrio
Copy link
Member Author

There already is a note on the docstring stating that a binary search is used, see the development version of the docs.Is that what you were missing?

@jrmlhermitte
Copy link
Contributor

Ah okay thanks, I am using numpy version 1.8.2 on python 3.4.0 (taken from an Ubuntu repository), so this may have been the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

numpy.digitize uses a linear search when it should be using binary search (Trac #2064)
6 participants