
interp(): add storage of previous value for next index lookup #5594

Merged
merged 4 commits into from Mar 9, 2015

Conversation

newville
Contributor

This PR addresses #5593 by changing binary_search() to binary_search_with_guess(), adding a guess parameter that is used as a first guess (including the +/- 1 neighbouring indices) at the correct index. The calls to binary_search_with_guess() in arr_interp() store the index found for array element i as the guess for element i+1.

In very many cases, this guess will be close enough (+/- 1) to avoid the binary search of the array entirely. Though binary search is fast, using the previous index and testing the values in its immediate vicinity (which are likely still in cache) gives a substantial speedup in many common cases, with nearly no penalty in pathological cases (say, randomly ordered values of x).

In a test case using scikit-image's iradon transform, linear interpolation of ~85M well- (but not perfectly-) ordered values was sped up by more than a factor of 3 with this change.
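
In sketch form, the approach looks roughly like the following (illustrative only; the function name, the len >= 4 cutoff, and the exact bounds handling are assumptions, not the code in this PR):

#include <numpy/npy_common.h>   /* npy_intp */

/*
 * Sketch of a guess-assisted search over an ascending arr (len >= 1).
 * Returns len if key > arr[len - 1], -1 if key < arr[0], and otherwise
 * the index i with arr[i] <= key < arr[i + 1].
 */
static npy_intp
search_with_guess(const double *arr, double key, npy_intp len, npy_intp guess)
{
    npy_intp imin = 0;
    npy_intp imax = len;

    if (key > arr[len - 1]) {
        return len;
    }
    if (key < arr[0]) {
        return -1;
    }

    if (len >= 4) {
        /* clamp guess so arr[guess - 1] .. arr[guess + 1] are all in range */
        if (guess < 1) {
            guess = 1;
        }
        else if (guess > len - 2) {
            guess = len - 2;
        }

        /* most likely cases first: key lands in or next to the previous interval */
        if (key >= arr[guess]) {
            if (key < arr[guess + 1]) {
                return guess;           /* same interval as the previous key */
            }
            imin = guess + 1;           /* key is further right: bisect the tail */
        }
        else if (key >= arr[guess - 1]) {
            return guess - 1;           /* one interval to the left */
        }
        else {
            imax = guess;               /* key is further left: bisect the head */
        }
    }

    /* ordinary bisection over whatever range remains */
    while (imin < imax) {
        npy_intp imid = imin + ((imax - imin) >> 1);
        if (key >= arr[imid]) {
            imin = imid + 1;
        }
        else {
            imax = imid;
        }
    }
    return imin - 1;
}

arr_interp() would then carry the result forward, passing the index found for x[i] as the guess for x[i+1].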

@juliantaylor
Contributor

You can generalize this a bit by not only checking the next element, but checking offset 1, then offset 2, then offset 4, etc. When you overshoot, you bisect the remainder. I think that's called the hunt method. It's log2 n faster when the searched value is close to the previously searched one.
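
In sketch form, that hunt/galloping search looks roughly like this (illustrative only, not code from this branch; the name and the return convention are assumptions):

#include <numpy/npy_common.h>   /* npy_intp */

/*
 * Gallop away from the last hit at offsets 1, 2, 4, ... until the key
 * is bracketed, then bisect only that bracket.  For ascending arr,
 * returns the last index i with arr[i] <= key, or -1 if key < arr[0].
 */
static npy_intp
hunt(const double *arr, double key, npy_intp len, npy_intp last)
{
    npy_intp lo, hi, step = 1;

    if (last < 0) {
        last = 0;
    }
    else if (last > len - 1) {
        last = len - 1;
    }

    if (key >= arr[last]) {
        /* gallop right: probe last + 1, last + 2, last + 4, ... */
        lo = last;
        hi = last + 1;
        while (hi < len && key >= arr[hi]) {
            lo = hi;
            hi += step;
            step <<= 1;
        }
        if (hi > len) {
            hi = len;
        }
    }
    else {
        /* gallop left: probe last - 1, last - 2, last - 4, ... */
        hi = last;
        lo = last - 1;
        while (lo >= 0 && key < arr[lo]) {
            hi = lo;
            lo -= step;
            step <<= 1;
        }
        if (lo < 0) {
            lo = -1;
        }
    }

    /* bisect the bracket: arr[lo] <= key < arr[hi] by construction */
    while (hi - lo > 1) {
        npy_intp mid = lo + ((hi - lo) >> 1);
        if (key >= arr[mid]) {
            lo = mid;
        }
        else {
            hi = mid;
        }
    }
    return lo;
}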

@jaimefrio
Member

Numerical Recipes is where they call that the hunt method. You can browse the book online here; it's section "3.1.1. Searching with Correlated Values." That is also how Timsort finds its merge points when it enters "galloping mode." In the CPython source code for the list object there is an implementation in the gallop_left and gallop_right functions, starting here.

There is a nice discussion of performance in the CPython source starting here. We would basically make the search 2x slower than bisection in the worst case, but get results comparable to a linear search for almost-ordered keys.

I remember Julian suggesting this would be a good addition to the searchsorted function; I may look into it again. At some point I looked at making interp use searchsorted, but because searchsorted handles NaNs properly, the comparisons are about 2x slower than the ad hoc implementation here. Maybe if we made it faster for typical inputs by making it smarter, we could consolidate all binary searches in one place?

@njsmith
Member

njsmith commented Feb 23, 2015

If we're thinking about speeding up binary search, then it might make sense to also think about using interpolation search, which gives O(log log n) cache misses for a trivial increase in CPU. The idea is very simple: if you're looking for a name starting with Y, then you open the phone book towards the end, not the middle.
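
For comparison, a sketch of interpolation search (illustrative only, not part of this PR; clamping the probe into the open bracket is an added safeguard so skewed data degrades gracefully rather than stepping out of range):

#include <numpy/npy_common.h>   /* npy_intp */

/*
 * Interpolation search over ascending arr (len >= 1).  For roughly
 * uniformly spaced data the probe lands very close to the target, so
 * only O(log log n) iterations are needed; for skewed data it degrades
 * toward a linear number of steps.  Returns the last index i with
 * arr[i] <= key, or -1 if key < arr[0].
 */
static npy_intp
interpolation_search(const double *arr, double key, npy_intp len)
{
    npy_intp lo = 0, hi = len - 1;

    if (key < arr[0]) {
        return -1;
    }
    if (key >= arr[hi]) {
        return hi;
    }
    /* invariant: arr[lo] <= key < arr[hi] */
    while (hi - lo > 1) {
        double span = arr[hi] - arr[lo];
        npy_intp mid;

        if (span > 0) {
            /* open the "phone book" where the key ought to be */
            mid = lo + (npy_intp)((hi - lo) * ((key - arr[lo]) / span));
        }
        else {
            mid = lo + ((hi - lo) >> 1);    /* flat region: plain bisection */
        }
        /* keep the probe strictly inside (lo, hi) */
        if (mid <= lo) {
            mid = lo + 1;
        }
        else if (mid >= hi) {
            mid = hi - 1;
        }
        if (key >= arr[mid]) {
            lo = mid;
        }
        else {
            hi = mid;
        }
    }
    return lo;
}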

@jaimefrio
Member

That's a nice one too, although having a division in every iteration could negate the benefits except for very large arrays. I guess there is no way of telling other than implementing it...

@juliantaylor
Contributor

I don't think interpolation is going to be a win; when using the assumption of uniformity you can do a hunt, storing the last offset, with less CPU usage (divisions are super expensive). Interpolation is more useful for one-shot searches into uniform sets; here we most likely have highly correlated searches.

@newville
Contributor Author

Using the hunt method may be useful. OTOH, once you have to start looking farther away than what is in the current memory cache it may not matter so much what method is used. What's clear is that always doing a binary search of the full array is much too slow, especially in the (in my experience) very common case of an ordered target array with spacing finer than or similar to that of the original array. In those cases, looking +/- 1 is almost always right, and the speed-up of this alone is substantial. When +/-1 isn't right, checking if the bisection can use +/- 8 from the current index might give a noticeable improvement over using the full array, but I guess that is a marginal gain compared to the benefit of using +/- 1 from the previous index.

For me the essential point is that interp() is almost always used with an array x that is highly correlated with, and of similar or finer spacing than, the original xp.

@njsmith
Member

njsmith commented Feb 23, 2015

You can do a lot of divisions in the time saved by avoiding even one cache miss on a modern CPU. But yeah, interpolation makes more sense for searchsorted than interp.

@juliantaylor
Contributor

Not that many, actually; main memory bandwidth is sufficient to feed 1-4 dividing cores without starvation in sequential access. Of course, for this case latency is more relevant and probably more significant. Might be worth trying out.

@newville can you provide the benchmark that speeds up by 3? I guess it's just interpolating onto a finer grid, but it would still be useful to have the original to compare against.

@newville
Contributor Author

@juliantaylor Yes, I can work on a benchmark script and report results. Is just using timeit.timeit() sufficient or is there another benchmark framework to use? It might take me several days to find the time to do this.

@juliantaylor
Contributor

It doesn't need to be anything complex; I was actually just interested in whatever script you have been using to test the results of this branch.

@newville
Contributor Author

@juliantaylor OK, I couldn't resist. A benchmark script and results are at https://gist.github.com/newville/b36b54eb47637979e256

The test isn't exhaustive, but it clearly shows an important performance gain. For what I would consider the most common use cases, speedups range from 3x to 7x. As the array to be interpolated onto gets noisy (that is, unordered relative to the original xp), the speedup diminishes. For interpolating onto truly randomly ordered arrays, the penalty is 10%.

@newville
Contributor Author

newville commented Mar 3, 2015

@juliantaylor, @jaimefrio: Any thoughts on this PR, or suggestions on how to make the case better? It seems like a relatively small change to the code, with no change in API, a clear performance improvement, and all tests passing. Is there more that should be done to make this PR acceptable?

@juliantaylor
Contributor

I messed around with using hunt, and it's 40% slower for your good cases and not significantly faster for others, so this is probably good.

{
    npy_intp imin = 0;
    npy_intp imax = len;

    if (key > arr[len - 1]) {
        return len;
    } else if (key < arr[0]) {
Member

}
else if ... {

@charris
Member

charris commented Mar 8, 2015

Lots of coding style slip-ups. Take a look at doc/C_STYLE_GUIDE.rst.txt.

@newville
Contributor Author

newville commented Mar 8, 2015

@charris I believe the style-guide is now followed.

    guess = len - 2;
}

/* check most likely values: guess, guess + 1, guess - 1 */
Member

Hmm, might be better to move this section up before line:513 as the more likely case. Some redundant computations also. Might be possible to make this a loop of some sort. Just sayin' might be worth a second look.

Contributor Author

@charris I don't understand the comment about "before line:513" -- the code between lines 513:518 ensures guess is in range.

Re: redundancies (I think only the accesses of arr[guess], arr[guess + 1], and arr[guess - 1]): I thought this did not warrant making local variables, and that the compiler could be left to optimize it.

The code between lines 532:538 could be removed -- it's probably too machine-dependent, and seems to help only for a limited kind of "not well-ordered" data. Otherwise, it's not likely to do any harm, but that might be faint praise.

@charris
Member

charris commented Mar 9, 2015

LGTM. I wonder if it is worth worrying about the decreasing case? I have the feeling that the logic could be made more compact, but that might just be wishful thinking...

@newville
Contributor Author

newville commented Mar 9, 2015

@charris I probably would have agreed that the "decreasing case" is rare. But, as it turns out, the problem I was having that led me to look into this, and to the PR, was a usage in scikit-image (https://github.com/scikit-image/scikit-image/blob/master/skimage/transform/radon_transform.py#L244) that does, in fact, use a "decreasing case" much of the time. My initial thought was that np.interp() was just way too slow for the decreasing case... but it turned out to be no better for the increasing case -- it's just too slow in general. Anyway, I'd suggest keeping the "decreasing case" option.

@charris
Member

charris commented Mar 9, 2015

What is bothering me about this is the number of checks that are made, together with a fair amount of index arithmetic. For searches into a big array, that probably doesn't matter, but for smaller searches it might make a difference. I wonder if it might not be faster to just check a smaller range, or to use the method that starts with a guess and increments by an offset in the right direction, the offset doubling on each failure, and do that maybe two or three times before falling back on binary search.

Some timing numbers would be useful here.

@newville
Contributor Author

newville commented Mar 9, 2015

@charris I put some timing numbers at https://gist.github.com/newville/b36b54eb47637979e256 two weeks ago. The tests could be expanded, though there are no doubt many ways that could be done. Do you have concrete suggestions for changes?

Numpy's interp() is typically described in the scipy community as the fastest way to do interpolation. A first-time contributor makes a pull request with a relatively small change to internal C code that goes back many years (at least since numpy 1.0.4) and makes a top-level numpy routine 4x faster. How often does that happen? The response was not (ever) "thanks!" but "did you try another method?" and "index arithmetic bothers me". Well, index arithmetic and looking locally (+1, +0, -1) gives a huge performance benefit for very many cases, and most likely the (by far) most common cases. What you suggest is the "hunt" method, already discussed. I do hope the numpy team is not always like this.

@charris
Member

charris commented Mar 9, 2015

@newville I think the idea is fine and the implementation is correct, but I think the implementation can be improved. For instance, the same index is computed several times. I'm asking that you take a second look and see if you can improve it and make it beautiful.

@rgommers
Member

rgommers commented Mar 9, 2015

Hi Matt, you're right that speedups of this kind are very welcome and don't occur every day. So thank you for working on this! Most of us are a bit overloaded and tend to focus on code / getting things mergeable, and may sometimes forget to say "thank you" or "good job".

@charris
Member

charris commented Mar 9, 2015

For instance, if pguess is a pointer, then pguess[0], pguess[1], etc. will have the offset compiled into the instruction. In the toy function

int test(double *a, double key, int guess)
{
    double *pguess = &a[guess - 2];

    if (key > pguess[0] && key <= pguess[1]) {
        return guess - 1;
    }
    if (key > pguess[1] && key <= pguess[2]) {
        return guess;
    }
    if (key > pguess[2] && key <= pguess[3]) {
        return guess + 1;
    }
    return -1;    /* key falls outside the guessed neighborhood */
}

The if statements compile to

    ucomisd (%rax), %xmm0
    movsd   8(%rax), %xmm1
    jbe .L4
    ucomisd %xmm0, %xmm1
    jnb .L19
.L4:
    ucomisd %xmm1, %xmm0
    movsd   16(%rax), %xmm1
    jbe .L8
    ucomisd %xmm0, %xmm1
    jnb .L12
.L8:
    ucomisd %xmm1, %xmm0
    jbe .L9
    movsd   24(%rax), %xmm1
    ucomisd %xmm0, %xmm1
    jb  .L9

@juliantaylor
Contributor

I don't think that's going to be very significant.
It might be worthwhile peeling the first and last loop iteration to skip the boundary checks in the binary search, kind of like quicksort avoids the boundary checks.

@charris
Member

charris commented Mar 9, 2015

Maybe, but (IMHO) it is easier to read, and note that each floating value is loaded just once.

@newville
Contributor Author

newville commented Mar 9, 2015

@charris -- isn't that the job of the compiler? I find your version with pointer offsets much uglier than mine with index arithmetic.

@charris
Member

charris commented Mar 9, 2015

Compilers aren't quite that brilliant. I always thought that a proper optimizer would invent the FFT, but it isn't so ;) Taking out the offset checks, for comparison:

    leaq    0(,%rdx,8), %rax
    ucomisd %xmm1, %xmm0
    movsd   8(%rdi,%rax), %xmm2
    jbe .L3
    ucomisd %xmm0, %xmm2
    jnb .L9
.L3:
    ucomisd %xmm2, %xmm0
    jbe .L5
    movsd   16(%rdi,%rax), %xmm2
    ucomisd %xmm0, %xmm2
    jnb .L19
.L5:
    ucomisd -8(%rdi,%rax), %xmm0
    jbe .L8
    ucomisd %xmm0, %xmm1
    jb  .L8
    leal    -1(%rsi), %eax
    ret

Note that three floating-point registers are used, as well as longer instructions like -8(%rdi,%rax), %xmm0, where %rdi holds guess. Modern hardware does that faster than it used to, but it still isn't as simple. Folks can disagree about readability, but from my point of view the logical expression is shorter and the range 0..3 is easy to comprehend.

That said, times are probably dominated by the floating-point comparisons, so it might be good to use the fact that failing key > arr[guess] implies key <= arr[guess] (and vice versa) and avoid some of the comparisons.

@juliantaylor
Contributor

I don't think that code will run any differently on current hardware; even if it does, the difference is going to be really minimal.
I think there are other areas that offer larger gains, like skipping some range checks or vectorizing the slope computation.
This is basically a twice-unrolled hunt, and for our common use case it seems better than a generic hunt, so I'm happy with merging it as is; I may follow up with some improvements later.

@charris
Member

charris commented Mar 9, 2015

@juliantaylor Depends on the data, of course. For dense points jumping around past the first if, I think you might see 10% to 15% gains, which would be right up your alley ;) But if you want to work on this I'll put it in.

Thanks @newville .

charris added a commit that referenced this pull request Mar 9, 2015
interp(): add storage of previous value for next index lookup
charris merged commit 0650e63 into numpy:master Mar 9, 2015
    guess = 0;
}
else if (guess >= len - 1) {
    guess = len - 2;
Member

With len == 1 or len == 2 you can go into this branch with guess == 0 or guess == 1, which will leave guess pointing outside the array, and cause a segmentation fault in the next few lines.

Not sure what the best solution is. I think the ideal situation would be to special-case anything with len < 4 (I would suggest a linear search ignoring guess), and for the larger arrays make sure in this normalization that guess >= 1 and guess <= len - 3, so that the testing of the three most likely intervals can be done with no extra checking with a couple of nested branches, e.g.:

assert(guess >=1 && guess <= len-3);

if (key > arr[guess]) {
    if (key <= arr[guess + 1]) {
        return guess;
    }
    else if (key <= arr[guess + 2]) {
        return guess + 1;
    }
    imin = guess + 2;
}
else if (key > arr[guess - 1]) {
    return guess - 1;
}
else {
    imax = guess - 1;
}

Contributor

Nice catch. Swapping the ifs and making the else if an if should work too. I'll do some testing with avoiding range checks in a follow-up.

Member

Oops.

I like your solution though, it is more in line with what I was thinking.

Contributor Author

Sorry that I hadn't considered very short arrays. I would just put an if (len > 16) { } test around the whole set of tests involving the value of guess, and not bother trying to optimize for very small arrays.
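
In sketch form, the small-array fallback discussed above could be a plain linear scan that ignores guess (illustrative only; the name, the return convention, and the cutoff at which it kicks in are assumptions):

#include <numpy/npy_common.h>   /* npy_intp */

/*
 * Hypothetical tiny-array fallback (not the merged code): for very
 * short xp, ignore guess entirely and scan linearly.  Returns the last
 * index i with arr[i] <= key, or -1 if key < arr[0].
 */
static npy_intp
linear_search_small(const double *arr, double key, npy_intp len)
{
    npy_intp i = 0;

    while (i < len && key >= arr[i]) {
        i++;    /* walk forward until arr[i] > key or the end is reached */
    }
    return i - 1;
}

The guess-based fast path would then run only when len exceeds whatever cutoff is chosen.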
