
interp(): add storage of previous value for next index lookup #5594

Merged
merged 4 commits into from Mar 9, 2015

Conversation

newville
Contributor

This PR addresses #5593 by changing binary_search() to binary_search_with_guess(), adding a guess parameter that is used as a first guess (including the +/- 1 neighbouring indices) at the correct index. The calls to binary_search_with_guess() in arr_interp() store the index found for array element i as the guess for element i+1.

In very many cases, this guess will be close enough (+/- 1) to avoid the binary search of the array entirely. Though binary search is fast, using the previous index and testing the values in its immediate vicinity (which are likely still in cache) gives a substantial speedup in many common cases, with nearly no penalty in pathological cases (say, randomly ordered values of x).

In a test case using scikit-image's iradon transform, linear interpolation of ~85M well- (but not perfectly-) ordered values was sped up by more than a factor of 3 with this change.
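
In sketch form, the approach looks roughly like the following (illustrative only; the function name, the len >= 4 cutoff, and the exact bounds handling are assumptions, not the code in this PR):

#include <numpy/npy_common.h>   /* npy_intp */

/*
 * Sketch of a guess-assisted search over an ascending arr (len >= 1).
 * Returns len if key > arr[len - 1], -1 if key < arr[0], and otherwise
 * the index i with arr[i] <= key < arr[i + 1].
 */
static npy_intp
search_with_guess(const double *arr, double key, npy_intp len, npy_intp guess)
{
    npy_intp imin = 0;
    npy_intp imax = len;

    if (key > arr[len - 1]) {
        return len;
    }
    if (key < arr[0]) {
        return -1;
    }

    if (len >= 4) {
        /* clamp guess so arr[guess - 1] .. arr[guess + 1] are all in range */
        if (guess < 1) {
            guess = 1;
        }
        else if (guess > len - 2) {
            guess = len - 2;
        }

        /* most likely cases first: key lands in or next to the previous interval */
        if (key >= arr[guess]) {
            if (key < arr[guess + 1]) {
                return guess;           /* same interval as the previous key */
            }
            imin = guess + 1;           /* key is further right: bisect the tail */
        }
        else if (key >= arr[guess - 1]) {
            return guess - 1;           /* one interval to the left */
        }
        else {
            imax = guess;               /* key is further left: bisect the head */
        }
    }

    /* ordinary bisection over whatever range remains */
    while (imin < imax) {
        npy_intp imid = imin + ((imax - imin) >> 1);
        if (key >= arr[imid]) {
            imin = imid + 1;
        }
        else {
            imax = imid;
        }
    }
    return imin - 1;
}

arr_interp() would then carry the result forward, passing the index found for x[i] as the guess for x[i+1].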

@juliantaylor
Contributor

You can generalize this a bit by not only checking the next element, but checking offset 1, then offset 2, then offset 4, etc. When you overshoot, you bisect the remainder. I think that's called the hunt method. It's log2 n faster when the searched value is close to the previously searched one.
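
In sketch form, that hunt/galloping search looks roughly like this (illustrative only, not code from this branch; the name and the return convention are assumptions):

#include <numpy/npy_common.h>   /* npy_intp */

/*
 * Gallop away from the last hit at offsets 1, 2, 4, ... until the key
 * is bracketed, then bisect only that bracket.  For ascending arr,
 * returns the last index i with arr[i] <= key, or -1 if key < arr[0].
 */
static npy_intp
hunt(const double *arr, double key, npy_intp len, npy_intp last)
{
    npy_intp lo, hi, step = 1;

    if (last < 0) {
        last = 0;
    }
    else if (last > len - 1) {
        last = len - 1;
    }

    if (key >= arr[last]) {
        /* gallop right: probe last + 1, last + 2, last + 4, ... */
        lo = last;
        hi = last + 1;
        while (hi < len && key >= arr[hi]) {
            lo = hi;
            hi += step;
            step <<= 1;
        }
        if (hi > len) {
            hi = len;
        }
    }
    else {
        /* gallop left: probe last - 1, last - 2, last - 4, ... */
        hi = last;
        lo = last - 1;
        while (lo >= 0 && key < arr[lo]) {
            hi = lo;
            lo -= step;
            step <<= 1;
        }
        if (lo < 0) {
            lo = -1;
        }
    }

    /* bisect the bracket: arr[lo] <= key < arr[hi] by construction */
    while (hi - lo > 1) {
        npy_intp mid = lo + ((hi - lo) >> 1);
        if (key >= arr[mid]) {
            lo = mid;
        }
        else {
            hi = mid;
        }
    }
    return lo;
}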

@jaimefrio
Member

Numerical Recipes is where they call that the hunt method. You can browse the book online here; it's section "3.1.1. Searching with Correlated Values." That is also how Timsort finds its merge points when it enters "galloping mode." In the CPython source code for the list object there is an implementation in the gallop_left and gallop_right functions, starting here.

There is a nice discussion of performance in the CPython source starting here. We would basically make the search 2x slower than bisection in the worst case, but get results comparable to a linear search for almost-ordered keys.

I remember Julian suggesting this would be a good addition to the searchsorted function; I may look into it again. At some point I looked at making interp use searchsorted, but because searchsorted handles NaNs properly, the comparisons are about 2x slower than the ad hoc implementation here. Maybe if we made it faster for typical inputs by making it smarter, we could consolidate all binary searches in one place?

@njsmith
Member

njsmith commented Feb 23, 2015

If we're thinking about speeding up binary search, then it might make sense to also think about using interpolation search, which gives O(log log n) cache misses for a trivial increase in CPU. The idea is very simple: if you're looking for a name starting with Y, then you open the phone book towards the end, not the middle.
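
For comparison, a sketch of interpolation search (illustrative only, not part of this PR; clamping the probe into the open bracket is an added safeguard so skewed data degrades gracefully rather than stepping out of range):

#include <numpy/npy_common.h>   /* npy_intp */

/*
 * Interpolation search over ascending arr (len >= 1).  For roughly
 * uniformly spaced data the probe lands very close to the target, so
 * only O(log log n) iterations are needed; for skewed data it degrades
 * toward a linear number of steps.  Returns the last index i with
 * arr[i] <= key, or -1 if key < arr[0].
 */
static npy_intp
interpolation_search(const double *arr, double key, npy_intp len)
{
    npy_intp lo = 0, hi = len - 1;

    if (key < arr[0]) {
        return -1;
    }
    if (key >= arr[hi]) {
        return hi;
    }
    /* invariant: arr[lo] <= key < arr[hi] */
    while (hi - lo > 1) {
        double span = arr[hi] - arr[lo];
        npy_intp mid;

        if (span > 0) {
            /* open the "phone book" where the key ought to be */
            mid = lo + (npy_intp)((hi - lo) * ((key - arr[lo]) / span));
        }
        else {
            mid = lo + ((hi - lo) >> 1);    /* flat region: plain bisection */
        }
        /* keep the probe strictly inside (lo, hi) */
        if (mid <= lo) {
            mid = lo + 1;
        }
        else if (mid >= hi) {
            mid = hi - 1;
        }
        if (key >= arr[mid]) {
            lo = mid;
        }
        else {
            hi = mid;
        }
    }
    return lo;
}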

@jaimefrio
Member

That's a nice one too, although having a division in every iteration could negate the benefits except for very large arrays. I guess there is no way of telling other than implementing it...

@juliantaylor
Contributor

I don't think interpolation is going to be a win; when using the assumption of uniformity you can do a hunt, storing the last offset, with less CPU usage (divisions are super expensive). Interpolation is more useful for one-shot searches into uniform sets; here we most likely have highly correlated searches.

@newville
Contributor Author

Using the hunt method may be useful. OTOH, once you have to start looking farther away than what is in the current memory cache it may not matter so much what method is used. What's clear is that always doing a binary search of the full array is much too slow, especially in the (in my experience) very common case of an ordered target array with spacing finer than or similar to that of the original array. In those cases, looking +/- 1 is almost always right, and the speed-up of this alone is substantial. When +/-1 isn't right, checking if the bisection can use +/- 8 from the current index might give a noticeable improvement over using the full array, but I guess that is a marginal gain compared to the benefit of using +/- 1 from the previous index.

For me the essential point is that interp() is almost always used with an array x that is highly correlated with, and of similar or finer spacing than, the original xp.

@njsmith
Member

njsmith commented Feb 23, 2015

You can do a lot of divisions in the time saved by avoiding even one cache miss on a modern CPU. But yeah, interpolation makes more sense for searchsorted than interp.

@juliantaylor
Contributor

Not that many, actually; main memory bandwidth is sufficient to feed 1-4 dividing cores without starvation in sequential access. Of course, for this case latency is more relevant and probably more significant. Might be worth trying out.

@newville can you provide the benchmark that speeds up by 3? I guess it's just interpolating onto a finer grid, but it would still be useful to have the original to compare against.

@newville
Contributor Author

@juliantaylor Yes, I can work on a benchmark script and report results. Is just using timeit.timeit() sufficient or is there another benchmark framework to use? It might take me several days to find the time to do this.

@juliantaylor
Contributor

It doesn't need to be anything complex; I was actually just interested in whatever script you have been using to test the results of this branch.

@newville
Contributor Author

@juliantaylor OK, I couldn't resist. A benchmark script and results are at https://gist.github.com/newville/b36b54eb47637979e256

The test isn't exhaustive, but it clearly shows an important performance gain. For what I would consider the most common use cases, speedups range from 3x to 7x. As the array to be interpolated onto gets noisy (that is, unordered relative to the original xp), the speedup diminishes. For interpolating onto truly randomly ordered arrays, the penalty is 10%.

@newville
Contributor Author

newville commented Mar 3, 2015

@juliantaylor, @jaimefrio: Any thoughts on this PR, or suggestions on how to make the case better? It seems like a relatively small change to the code, with no change in API, a clear performance improvement, and all tests passing. Is there more that should be done to make this PR acceptable?

@juliantaylor
Contributor

I messed around with using hunt, and it's 40% slower for your good cases and not significantly faster for others, so this is probably good.

{
    npy_intp imin = 0;
    npy_intp imax = len;

    if (key > arr[len - 1]) {
        return len;
    } else if (key < arr[0]) {
Member

}
else if ... {

@charris
Member

charris commented Mar 8, 2015

Lots of coding style slip-ups. Take a look at doc/C_STYLE_GUIDE.rst.txt.

@newville
Contributor Author

newville commented Mar 8, 2015

@charris I believe the style-guide is now followed.

    guess = len - 2;
}

/* check most likely values: guess, guess + 1, guess - 1 */
Member

Hmm, might be better to move this section up before line:513 as the more likely case. Some redundant computations also. Might be possible to make this a loop of some sort. Just sayin' might be worth a second look.

Contributor Author

@charris I don't understand the comment about "before line:513" -- the code between lines 513:518 ensures guess is in range.

Re: redundancies (I think only the accesses of arr[guess], arr[guess + 1], and arr[guess - 1]): I thought this did not warrant making local variables, and that the compiler could be left to optimize it.

The code between lines 532:538 could be removed -- it's probably too machine-dependent, and seems to help only for a limited kind of "not well-ordered" data. Otherwise, it's not likely to do any harm, but that might be faint praise.

@charris
Member

charris commented Mar 9, 2015

LGTM. I wonder if it is worth worrying about the decreasing case? I have the feeling that the logic could be made more compact, but that might just be wishful thinking...

@newville
Contributor Author

newville commented Mar 9, 2015

@charris I probably would have agreed that the "decreasing case" is rare. But, as it turns out, the problem I was having that led me to look into this, and to the PR, was a usage in scikit-image (https://github.com/scikit-image/scikit-image/blob/master/skimage/transform/radon_transform.py#L244) that does, in fact, use a "decreasing case" much of the time. My initial thought was that np.interp() was just way too slow for the decreasing case... but it turned out to be no better for the increasing case -- it's just too slow in general. Anyway, I'd suggest keeping the "decreasing case" option.

@charris
Member

charris commented Mar 9, 2015

What is bothering me about this is the number of checks that are made, together with a fair amount of index arithmetic. For searches into a big array, that probably doesn't matter, but for smaller searches it might make a difference. I wonder if it might not be faster to just check a smaller range, or to use the method that starts with a guess and increments by an offset in the right direction, the offset doubling on each failure, and do that maybe two or three times before falling back on binary search.

Some timing numbers would be useful here.

@newville
Contributor Author

newville commented Mar 9, 2015

@charris I put some timing numbers at https://gist.github.com/newville/b36b54eb47637979e256 two weeks ago. The tests could be expanded, though there are no doubt many ways that could be done. Do you have concrete suggestions for changes?

Numpy's interp() is typically described in the scipy community as the fastest way to do interpolation. A first-time contributor makes a pull request with a relatively small change to internal C code that goes back many years (at least since numpy 1.0.4) and makes a top-level numpy routine 4x faster. How often does that happen? The response was not (ever) "thanks!" but "did you try another method?" and "index arithmetic bothers me". Well, index arithmetic and looking locally (+1, +0, -1) gives a huge performance benefit for very many cases, and most likely the (by far) most common cases. What you suggest is the "hunt" method, already discussed. I do hope the numpy team is not always like this.

@charris
Member

charris commented Mar 9, 2015

@newville I think the idea is fine and the implementation is correct, but I think the implementation can be improved. For instance, the same index is computed several times. I'm asking that you take a second look and see if you can improve it and make it beautiful.

@rgommers
Member

rgommers commented Mar 9, 2015

Hi Matt, you're right that speedups of this kind are very welcome and don't occur every day. So thank you for working on this! Most of us are a bit overloaded and tend to focus on code / getting things mergeable, and may sometimes forget to say "thank you" or "good job".

@charris
Member

charris commented Mar 9, 2015

For instance, if pguess is a pointer, then pguess[0], pguess[1], etc. will have the offset compiled into the instruction. In the toy function

int test(double *a, double key, int guess)
{
    double *pguess = &a[guess - 2];

    if (key > pguess[0] && key <= pguess[1]) {
        return guess - 1;
    }
    if (key > pguess[1] && key <= pguess[2]) {
        return guess;
    }
    if (key > pguess[2] && key <= pguess[3]) {
        return guess + 1;
    }
    return -1;    /* key falls outside the guessed neighborhood */
}

The if statements compile to

    ucomisd (%rax), %xmm0
    movsd   8(%rax), %xmm1
    jbe .L4
    ucomisd %xmm0, %xmm1
    jnb .L19
.L4:
    ucomisd %xmm1, %xmm0
    movsd   16(%rax), %xmm1
    jbe .L8
    ucomisd %xmm0, %xmm1
    jnb .L12
.L8:
    ucomisd %xmm1, %xmm0
    jbe .L9
    movsd   24(%rax), %xmm1
    ucomisd %xmm0, %xmm1
    jb  .L9

@juliantaylor
Contributor

I don't think that's going to be very significant.
It might be worthwhile peeling the first and last loop iteration to skip the boundary checks in the binary search, kind of like quicksort avoids the boundary checks.

@charris
Member

charris commented Mar 9, 2015

Maybe, but (IMHO) it is easier to read, and note that each floating value is loaded just once.

@newville
Contributor Author

newville commented Mar 9, 2015

@charris -- isn't that the job of the compiler? I find your version with pointer offsets much uglier than mine with index arithmetic.

@charris
Member

charris commented Mar 9, 2015

Compilers aren't quite that brilliant. I always thought that a proper optimizer would invent the FFT, but it isn't so ;) Taking out the offset checks, for comparison:

    leaq    0(,%rdx,8), %rax
    ucomisd %xmm1, %xmm0
    movsd   8(%rdi,%rax), %xmm2
    jbe .L3
    ucomisd %xmm0, %xmm2
    jnb .L9
.L3:
    ucomisd %xmm2, %xmm0
    jbe .L5
    movsd   16(%rdi,%rax), %xmm2
    ucomisd %xmm0, %xmm2
    jnb .L19
.L5:
    ucomisd -8(%rdi,%rax), %xmm0
    jbe .L8
    ucomisd %xmm0, %xmm1
    jb  .L8
    leal    -1(%rsi), %eax
    ret

Note that three floating-point registers are used, as well as longer instructions like -8(%rdi,%rax), %xmm0, where %rdi holds guess. Modern hardware does that faster than it used to, but it still isn't as simple. Folks can disagree about readability, but from my point of view the logical expression is shorter and the range 0..3 is easy to comprehend.

That said, times are probably dominated by the floating-point comparisons, so it might be good to use the fact that failing key > arr[guess] implies key <= arr[guess] (and vice versa) and avoid some of the comparisons.

@juliantaylor
Contributor

I don't think that code will run any differently on current hardware; even if it does, the difference is going to be really minimal.
I think there are other areas that offer larger gains, like skipping some range checks or vectorizing the slope computation.
This is basically a twice-unrolled hunt, and for our common use case it seems better than a generic hunt, so I'm happy with merging it as is; I may follow up with some improvements later.

@charris
Member

charris commented Mar 9, 2015

@juliantaylor Depends on the data, of course. For dense points jumping around past the first if, I think you might see 10% to 15% gains, which would be right up your alley ;) But if you want to work on this I'll put it in.

Thanks @newville .

charris added a commit that referenced this pull request Mar 9, 2015
interp(): add storage of previous value for next index lookup
charris merged commit 0650e63 into numpy:master Mar 9, 2015
    guess = 0;
}
else if (guess >= len - 1) {
    guess = len - 2;
Member

With len == 1 or len == 2 you can go into this branch with guess == 0 or guess == 1, which will leave guess pointing outside the array, and cause a segmentation fault in the next few lines.

Not sure what the best solution is. I think the ideal situation would be to special-case anything with len < 4 (I would suggest a linear search ignoring guess), and for the larger arrays make sure in this normalization that guess >= 1 and guess <= len - 3, so that the testing of the three most likely intervals can be done with no extra checking with a couple of nested branches, e.g.:

assert(guess >=1 && guess <= len-3);

if (key > arr[guess]) {
    if (key <= arr[guess + 1]) {
        return guess;
    }
    else if (key <= arr[guess + 2]) {
        return guess + 1;
    }
    imin = guess + 2;
}
else if (key > arr[guess - 1]) {
    return guess - 1;
}
else {
    imax = guess - 1;
}

Contributor

Nice catch. Swapping the ifs and making the else if an if should work too. I'll do some testing with avoiding range checks in a follow-up.

Member

Oops.

I like your solution though, it is more in line with what I was thinking.

Contributor Author

Sorry that I hadn't considered very short arrays. I would just put an if (len > 16) { } test around the whole set of tests involving the value of guess, and not bother trying to optimize for very small arrays.
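
In sketch form, the small-array fallback discussed above could be a plain linear scan that ignores guess (illustrative only; the name, the return convention, and the cutoff at which it kicks in are assumptions):

#include <numpy/npy_common.h>   /* npy_intp */

/*
 * Hypothetical tiny-array fallback (not the merged code): for very
 * short xp, ignore guess entirely and scan linearly.  Returns the last
 * index i with arr[i] <= key, or -1 if key < arr[0].
 */
static npy_intp
linear_search_small(const double *arr, double key, npy_intp len)
{
    npy_intp i = 0;

    while (i < len && key >= arr[i]) {
        i++;    /* walk forward until arr[i] > key or the end is reached */
    }
    return i - 1;
}

The guess-based fast path would then run only when len exceeds whatever cutoff is chosen.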
