ENH: implement `digitize` with `PyArray_SearchSorted` #5101
Conversation
Some timings follow. Using binary search starts being faster for haystack sizes of ~30, although the exact crossover also depends on the size of the needle, and differs for increasing and decreasing haystacks. Haystacks of 1000 items are now close to 10x faster, but very small haystacks of only 2 items may be up to 3x slower. For each needle and haystack size there are four benchmarks: with an increasing and a decreasing haystack, and with the two possible values of `right`.
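The crossover behavior described above can be probed with a sketch like the following. This is not the author's original benchmark script (which is not reproduced here); `linear_digitize` is a hypothetical naive reference standing in for the old linear-scan approach, and the sizes are illustrative:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)

def linear_digitize(x, bins):
    """Naive O(n*m) reference for increasing bins: for each value,
    count how many bin edges it is >= to."""
    return np.sum(x[:, None] >= bins[None, :], axis=1)

# Compare the binary-search-based np.digitize against the linear scan
# for a few haystack (bins) sizes around the reported ~30 crossover.
for nbins in (2, 30, 1000):
    bins = np.sort(rng.random(nbins))
    x = rng.random(100)
    t_bin = timeit.timeit(lambda: np.digitize(x, bins), number=200)
    t_lin = timeit.timeit(lambda: linear_digitize(x, bins), number=200)
    print(f"bins={nbins:5d}  digitize={t_bin:.4f}s  linear={t_lin:.4f}s")
```

The absolute numbers depend on the machine; the point is only the trend as the number of bins grows.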
If anyone wants to replicate the benchmarks on a different system, this is the code I have used:
        return NULL;
    }
    /* If bins is decreasing, ret has bins from end, not start */
    if (monotonic == -1) {
`PyArray_SearchSorted` will release the GIL for the heavy lifting of this function. The other potentially lengthy operation is this subtraction from the array length for decreasing `bins` arrays. Is it worth releasing the GIL here?
Probably not really significant, but it also does no harm. There is a macro that only releases the GIL if an element count is higher than 500, to avoid unnecessary overhead; that can be used here.
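The subtraction under discussion implements, in C, the equivalence sketched below in pure Python (an illustration of the technique, not the actual source): for decreasing bins, search the reversed (increasing) bins with the opposite `side`, then count indices from the end.

```python
import numpy as np

def digitize_decreasing(x, bins, right=False):
    """Sketch: reduce digitize over monotonically decreasing bins to
    searchsorted over the reversed bins, then subtract from len(bins)."""
    x = np.asarray(x)
    side = 'left' if right else 'right'
    return len(bins) - np.searchsorted(bins[::-1], x, side=side)

bins = np.array([4.0, 3.0, 2.0, 1.0])   # decreasing bin edges
x = np.array([5.0, 3.5, 2.0, 0.5])
print(digitize_decreasing(x, bins))      # matches np.digitize(x, bins)
```

The `side` flip compensates for the reversal: a half-open interval that was closed on the left in the decreasing order becomes closed on the right after reversing.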
Am I right that this PR basically removes the algorithm that was used by `digitize` up to now? Why would we remove a working algorithm? I think we shouldn't touch `digitize` or `searchsorted`. People might want to choose one or the other algorithm for a reason (the documentation should cross-link between both, of course).
On 23 Sep 2014 18:15, "Stephan Kuschel" notifications@github.com wrote:
We should always remove every algorithm (or other code) that we can, unless …
@@ -4845,14 +4845,14 @@ def luf(lamdaexpr, *args, **kwargs):
    Parameters
    ----------
    x : array_like
        Input array to be binned. It has to be 1-dimensional.
What happens if it isn't 1-D?
The return is of the same shape as `x`, which is the standard `searchsorted` behavior. We can restrict this to 1-D arrays if there is a convincing reason for it, but it seemed unnecessary to me.
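The shape-preserving behavior is easy to check directly; with this change applied, `digitize` accepts non-1-D input and returns indices with the same shape as `x` (a small illustration):

```python
import numpy as np

bins = np.array([0.0, 1.0, 2.0])
x = np.array([[0.5, 1.5],
              [-1.0, 2.5]])       # 2-D input, no longer rejected

inds = np.digitize(x, bins)
print(inds.shape)                  # same shape as x: (2, 2)
print(inds)
```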
That makes sense. The change should be documented with a `.. versionadded::` and mentioned in the release notes.
Sorry for the confusion! I totally agree: if there is an algorithm performing better in every situation, it should replace the old one completely. But in this case the removed algorithm scales (a little) better for small data sets. That's why nothing changed in #2656. It's not desirable to have two functions with identical output but different scaling for different problem sizes; still, it can happen. You are right, the documentation doesn't state this clearly. Anyway, I came to this because I tried to find out what slows down the histogram2d function on large arrays with a couple hundred bins (see the third post in #2656 from 3 months ago). My personal use is around 700x700 bins and 100e6 data points. So this is actually a different discussion, but this PR should speed up my use case as well. Thanks to everyone for your fast response!
I have added a note to the docstring and the release notes, and the GIL is now released with the thresholded macro for the monotonicity check and the subtraction from the number of bins for decreasing bins.
@@ -4877,6 +4877,13 @@ def luf(lamdaexpr, *args, **kwargs):
    attempting to index `bins` with the indices that `digitize` returns
    will result in an IndexError.

    .. versionadded:: 1.10.0
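The IndexError warned about in the docstring hunk above can be demonstrated directly: out-of-range values map to `0` and `len(bins)`, and the latter is one past the last valid index into `bins`:

```python
import numpy as np

bins = np.array([1.0, 2.0, 3.0])
x = np.array([0.5, 2.5, 10.0])     # first value below all bins, last above all

inds = np.digitize(x, bins)        # last index equals len(bins)
try:
    bins[inds]                     # len(bins) is out of bounds here
except IndexError as exc:
    print("IndexError:", exc)
```

Users who want to index `bins` with the result need to clip or otherwise handle the two out-of-range sentinel values first.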
This doesn't look like the right place. Hmm, I'll make a comment up above.
Needs rebase, probably because of the notes.
Force-pushed from 6dcf19c to 0730abe.
LGTM. I assume this gives the same result as the old digitize for decreasing bins?
Yes, there is a test checking that for all 4 cases (increasing/decreasing × right/left). And the whole block starting at L283 is about getting the value right for decreasing bins.
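The four cases mentioned can be spot-checked from Python. This is a sketch of the kind of consistency check, not the actual test-suite code; it relies on the documented equivalence between `digitize` and `searchsorted` sides:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.7, 5.0, -1.0])
inc = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # increasing bins
dec = inc[::-1]                              # decreasing bins

for right in (False, True):
    side = 'left' if right else 'right'
    # Increasing bins: digitize is searchsorted with the matching side.
    assert np.array_equal(np.digitize(x, inc, right=right),
                          np.searchsorted(inc, x, side=side))
    # Decreasing bins: search the reversed bins, then count from the end.
    assert np.array_equal(np.digitize(x, dec, right=right),
                          len(dec) - np.searchsorted(dec[::-1], x, side=side))
print("all 4 cases agree")
```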
OK, thanks. In it goes.
ENH: implement `digitize` with `PyArray_SearchSorted`
The new searchsorted-based digitize introduced in numpy#5101 segfaults when it should raise a TypeError.
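In fixed versions, invalid input dtypes raise `TypeError` instead of crashing. As one illustration (this may not be the exact input from the linked report), complex input is rejected:

```python
import numpy as np

# Complex values have no total order, so digitize rejects them with a
# TypeError rather than attempting the binary search.
try:
    np.digitize(np.array([1 + 2j]), np.array([0.0, 1.0]))
except TypeError as exc:
    print("TypeError:", exc)
```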
Could you please update the help on `digitize` to discuss this issue? This will greatly help scientific users transitioning to Python. Thanks.
There already is a note in the docstring stating that a binary search is used; see the development version of the docs. Is that what you were missing?
Ah okay, thanks. I am using numpy version 1.8.2 on Python 3.4.0 (taken from an Ubuntu repository), so this may have been the issue.
Closes #2656 and replaces #4184.