interp(): add storage of previous value for next index lookup #5594
Conversation
You can generalize this a bit by not only searching the next element, but instead checking offset 1, then offset 2, offset 4, etc., and when you overshoot, bisecting the remainder. I think that's called the hunt method. It's log2(n) faster when the searched value is close to the previously searched one.
Numerical Recipes is where they call that the hunt method. You can browse the book online; it's section "3.1.1. Searching with Correlated Values." That is also how Timsort finds its merge points when it enters "galloping mode" -- in the CPython source code for the list object there is an implementation, and a nice discussion of its performance in the CPython source. We would basically make the search 2x slower than bisection in the worst case, but get results comparable to a linear search for almost-ordered keys. I remember Julian suggesting this would be a good addition to the…
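The "hunt" strategy described above can be sketched as follows. This is a hypothetical standalone function, not numpy's actual code: starting from a previous hit, it probes offsets 1, 2, 4, ... until the key is bracketed, then bisects the remainder.

```c
#include <stddef.h>

/* Hunt (galloping) search sketch: return the number of elements of the
 * sorted array arr that are <= key, i.e. the upper-bound insertion point.
 * The name and return convention are illustrative only. */
static size_t hunt_search(const double *arr, size_t len, double key,
                          size_t guess)
{
    size_t lo = 0, hi = len, step = 1;

    if (len == 0) {
        return 0;
    }
    if (guess >= len) {
        guess = len - 1;
    }
    if (arr[guess] <= key) {
        /* gallop upward: probe guess+1, guess+2, guess+4, ... */
        lo = guess;
        while (guess + step < len && arr[guess + step] <= key) {
            lo = guess + step;
            step *= 2;
        }
        hi = (guess + step < len) ? guess + step : len;
    }
    else {
        /* gallop downward: probe guess-1, guess-2, guess-4, ... */
        hi = guess;
        while (step <= guess && arr[guess - step] > key) {
            hi = guess - step;
            step *= 2;
        }
        lo = (step <= guess) ? guess - step : 0;
    }
    /* bisect the bracketed remainder */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] <= key) {
            lo = mid + 1;
        }
        else {
            hi = mid;
        }
    }
    return lo;
}
```

When the key lands close to `guess`, only a handful of probes are needed; in the worst case the gallop roughly doubles the number of comparisons versus a plain bisection of the whole array, which matches the 2x figure mentioned above.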
If we're thinking about speeding up binary search, then it might make sense…
That's a nice one too, although having a division in every iteration could negate the benefits except for very large arrays. I guess there is no way of telling other than implementing it...
I don't think interpolation is going to be a win: when using the assumption of uniformity, you can do a hunt by storing the last offset, with less CPU usage (divisions are super expensive). Interpolation is more useful for one-shot searches into uniform sets; here we most likely have highly correlated searches.
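For reference, the interpolation-search idea being weighed here looks roughly like this (an illustrative sketch, not proposed code); the division inside the loop is the per-iteration cost under discussion:

```c
#include <stddef.h>

/* Interpolation search sketch: instead of probing the midpoint, estimate the
 * key's position from the values at the ends of the current range, assuming
 * roughly uniform spacing.  Returns the number of elements <= key. */
static size_t interpolation_search(const double *arr, size_t len, double key)
{
    size_t lo = 0, hi = len;

    while (lo < hi) {
        size_t last = hi - 1, mid;
        if (key < arr[lo]) {
            return lo;      /* everything in [lo, hi) is larger */
        }
        if (key >= arr[last]) {
            return hi;      /* everything in [lo, hi) is <= key */
        }
        /* position estimate; this division runs on every iteration */
        mid = lo + (size_t)((key - arr[lo]) / (arr[last] - arr[lo])
                            * (double)(last - lo));
        if (arr[mid] <= key) {
            lo = mid + 1;
        }
        else {
            hi = mid;
        }
    }
    return lo;
}
```

On uniformly spaced keys this converges in O(log log n) probes, but each probe pays for a floating-point subtraction and division, which is the trade-off Julian is pointing out.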
Using the hunt method may be useful. OTOH, once you have to start looking farther away than what is in the current memory cache, it may not matter so much which method is used. What's clear is that always doing a binary search of the full array is much too slow, especially in the (in my experience) very common case of an ordered target array with spacing finer than or similar to that of the original array. In those cases, looking +/- 1 is almost always right, and the speed-up of this alone is substantial. When +/- 1 isn't right, checking whether the bisection can use +/- 8 from the current index might give a noticeable improvement over using the full array, but I guess that is a marginal gain compared to the benefit of using +/- 1 from the previous index. For me the essential point is that…
You can do a lot of divisions in the time saved by avoiding even one cache miss.

Nathaniel J. Smith -- http://vorpus.org
Not that many, actually: main memory bandwidth is sufficient to feed 1-4 dividing cores without starvation in sequential access. Of course, for this case latency is more relevant and probably more significant. Might be worth trying out. @newville can you provide the benchmark that speeds up by a factor of 3? I guess it's just interpolating onto a finer grid, but it would still be useful to have the original to compare.
@juliantaylor Yes, I can work on a benchmark script and report results. Is just using timeit.timeit() sufficient, or is there another benchmark framework to use? It might take me several days to find the time to do this.
It doesn't need to be anything complex; I was actually just interested in whatever script you have been using to test the results of this branch.
@juliantaylor OK, I couldn't resist. A benchmark script and results are at https://gist.github.com/newville/b36b54eb47637979e256 The tests aren't exhaustive, but clearly show an important performance gain. For what I would consider the most common use cases, speedups range from 3 to 7. As the array to be interpolated onto gets noisy (that is, unordered relative to the original…
@juliantaylor, @jaimefrio: Any thoughts on this PR, or suggestions on how to make the case better? It seems like a relatively small change to the code, with no change in API and a clear performance improvement, and all tests pass. Is there more that should be done to make this PR acceptable?
I messed around with using hunt, and it's 40% slower for your good cases and not significantly faster for others, so this is probably good.
{
    npy_intp imin = 0;
    npy_intp imax = len;

    if (key > arr[len - 1]) {
        return len;
    } else if (key < arr[0]) {
}
else if ... {
Lots of coding style slip-ups. Take a look at…
@charris I believe the style guide is now followed.
    guess = len - 2;
}

/* check most likely values: guess, guess + 1, guess - 1 */
Hmm, might be better to move this section up before line 513 as the more likely case. There are some redundant computations also. Might be possible to make this a loop of some sort. Just sayin', might be worth a second look.
@charris I don't understand the comment about "before line:513" -- the code between lines 513:518 ensures `guess` is in range.

Re: redundancies (I think the only ones are the accesses of `arr[guess]`, `arr[guess + 1]`, and `arr[guess - 1]`): I thought this did not warrant making local variables, preferring to let the compiler optimize this.

The code between lines 532:538 could be removed -- it's probably too machine-dependent, and seems to help only for a limited kind of "not well-ordered" data. Otherwise, it's not likely to do any harm, but that might be faint praise.
LGTM. I wonder if it is worth worrying about the decreasing case? I have the feeling that the logic could be made more compact, but that might just be wishful thinking...
@charris I probably would have agreed that the "decreasing case" is rare. But, as it turns out, the problem I was having that led me to look into this, and to the PR, was a usage in scikit-image (https://github.com/scikit-image/scikit-image/blob/master/skimage/transform/radon_transform.py#L244) that does, in fact, use a "decreasing case" much of the time. My initial thought was that…
What is bothering me about this is the number of checks that are made, together with a fair amount of index arithmetic. For searches into a big array, that probably doesn't matter, but for smaller searches it might make a difference. I wonder if it might not be faster to just check a smaller range, or to use the method that starts with a guess and increments by an offset in the right direction, the offset doubling on each failure, and do that maybe two or three times before falling back on binary search. Some timing numbers would be useful here.
@charris I put some timing numbers at https://gist.github.com/newville/b36b54eb47637979e256 two weeks ago. The tests could be expanded, though there are no doubt many ways that could be done. Do you have concrete suggestions for changes? Numpy's interp() is typically described in the scipy community as the fastest way to do interpolation. A first-time contributor makes a pull request with a relatively small change to internal C code that goes back many years (at least since numpy 1.0.4), making a top-level numpy routine 4x faster. How often does that happen? The response was not (ever) "thanks!" but "did you try another method?" and "index arithmetic bothers me". Well, index arithmetic and looking locally (+1, +0, -1) gives a huge performance benefit for very many cases, and most likely the (by far) most common cases. What you suggest is the "hunt" method, already discussed. I do hope the numpy team is not always like this.
@newville I think the idea is fine and the implementation is correct, but I think the implementation can be improved. For instance, the same index is computed several times. I'm asking that you take a second look and see if you can improve it and make it beautiful.
Hi Matt, you're right that speedups of this kind are very welcome and don't occur every day. So thank you for working on this! Most of us are a bit overloaded and tend to focus on code / getting things mergeable, and may sometimes forget to say "thank you" or "good job".
For instance, if…

The if statements compile to…
I don't think that's going to be very significant.
Maybe, but (IMHO) it is easier to read, and note that each floating-point value is loaded just once.
@charris -- isn't that the job of the compiler? I find your version with pointer offsets much uglier than mine with index arithmetic.
Compilers aren't quite that brilliant. I always thought that a proper optimizer would invent the FFT, but it isn't so ;) Taking out the offset checks for comparison…
Note that three floating-point registers are used, as well as longer instructions like… That said, times are probably dominated by the floating-point comparisons, so it might be good to use the fact that…
I don't think that code will run any differently on current hardware, and even if it does, the difference is going to be really minimal.
@juliantaylor Depends on the data, of course. For dense points jumping around past the first…

Thanks @newville.
interp(): add storage of previous value for next index lookup
    guess = 0;
}
else if (guess >= len - 1) {
    guess = len - 2;
With `len == 1` or `len == 2` you can go into this branch with `guess == 0` or `guess == 1`, which will leave `guess` pointing outside the array, and cause a segmentation fault in the next few lines.

Not sure what the best solution is. I think the ideal situation would be to special-case anything with `len < 4` (I would suggest a linear search ignoring `guess`), and for the larger arrays make sure in this normalization that `guess >= 1` and `guess <= len - 3`, so that the testing of the three most likely intervals can be done with no extra checking with a couple of nested branches, e.g.:

assert(guess >= 1 && guess <= len - 3);
if (key > arr[guess]) {
    if (key <= arr[guess + 1]) {
        return guess;
    }
    else if (key <= arr[guess + 2]) {
        return guess + 1;
    }
    imin = guess + 2;
}
else if (key > arr[guess - 1]) {
    return guess - 1;
}
else {
    imax = guess - 1;
}
Nice catch. Swapping the ifs and making the `else if` an `if` should work too. I'll do some testing with avoiding range checks in a followup.
Oops.
I like your solution though, it is more in line with what I was thinking.
Sorry that I hadn't considered very short arrays. I would just put an `if (len > 16) { }` test around the whole set of tests involving the value of `guess`, and not bother trying to optimize for very small arrays.
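The small-array special case suggested above could be as simple as a plain linear scan that ignores the guess entirely (a sketch; the function name and the cutoff at which to use it are illustrative):

```c
#include <stddef.h>

/* Linear search for tiny arrays: returns the number of elements of the
 * sorted array arr that are <= key.  For very short arrays this avoids all
 * the guess bookkeeping and branches, and the scan stays in one cache line. */
static size_t linear_search_small(const double *arr, size_t len, double key)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (arr[i] > key) {
            break;
        }
    }
    return i;
}
```

A caller would dispatch on length first, e.g. `if (len <= 16) return linear_search_small(arr, len, key);` before any of the guess-based logic runs.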
This PR addresses #5593 by changing `binary_search()` to `binary_search_with_guess()`, with an added `guess` parameter that is used as a first guess (including +/- 1 indices) at the correct index. The calls to `binary_search_with_guess()` in `arr_interp()` store the index found for array element i as the guess for element i+1.

For very many cases, this guess will be close enough (+/- 1) to completely avoid the binary search of the array. Though binary search is fast, using the previous index and testing values in its immediate vicinity (which are likely to still be in active memory) can give a substantial speedup in many common cases, and nearly no penalty in pathological cases (say, randomly ordered values of `x`).

In a test case using scikit-image's iradon transform, linear interpolation of ~85M well- (but not perfectly-) ordered values was sped up by more than a factor of 3 with this change.