
np.any and np.all short-circuiting (Trac #1237) #1835

Closed
thouis opened this issue Oct 19, 2012 · 17 comments
Labels
01 - Enhancement · component: numpy._core · Priority: high (High priority, also add milestones for urgent issues)

Comments

@thouis (Contributor) commented Oct 19, 2012

Original ticket http://projects.scipy.org/numpy/ticket/1237 on 2009-09-24 by trac user lucaciti, assigned to unknown.

I noticed that Python's builtin any can be faster than numpy's any (and similarly for all).
Then I wondered why.
I realized that numpy implements any as logical_or.reduce (and all as logical_and.reduce).
This means that numpy cannot take advantage of short-circuiting.
Looking at the timings confirmed my suspicion.
I think Python fetches one element at a time from the array and, as soon as any of them is true, returns True.
NumPy, instead, goes on to the end of the array even if the very first element is already true.
Looking at the code, I think I found a way to fix it.
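(A small pure-Python illustration, not from the ticket, of the short-circuiting that builtin any() performs: feeding it through a counting generator shows it stops consuming elements at the first True.)

```python
def gen_with_counter(values, counter):
    # Yield values while counting how many are actually consumed.
    for v in values:
        counter[0] += 1
        yield v

seen = [0]
data = [False] * 5 + [True] + [False] * 100

assert any(gen_with_counter(data, seen)) is True
assert seen[0] == 6   # builtin any() stopped right after the first True
```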

In the following table, any(x) is Python's builtin version, np.any(x) is numpy's current one, and *np.any(x)* is my patched version.

'1.4.0.dev7417'

x = np.zeros(100000, dtype=bool)
x[i] = True
%timeit any(x)
%timeit np.any(x)

x = np.ones(100000, dtype=bool)
x[i] = False
%timeit all(x)
%timeit np.all(x)

ANY

i          any(x)        np.any(x)    *np.any(x)*
//         6.84 ms       831 µs        189  µs
50000      3.41 ms       832 µs        98   µs
10000      683  µs       831 µs        24.7 µs
1000       68.9 µs       859 µs        8.41 µs
100        7.92 µs       888 µs        6.9  µs
10         1.42 µs       832 µs        6.68 µs
0          712  ns       831 µs        6.65 µs


ALL

i          all(x)        np.all(x)    *np.all(x)*
//         6.65 ms       676 µs        300 µs
50000      3.32 ms       677 µs        154 µs
10000      666  µs       676 µs        36.4 µs
1000       67.9 µs       686 µs        9.86 µs
100        7.53 µs       677 µs        7.26 µs
10         1.39 µs       676 µs        7.06 µs
0          716  ns       678 µs        6.96 µs
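The idea behind the short-circuiting reduce can be sketched in Python (a hypothetical rendering, not the attached C patch, which operates on raw array buffers):

```python
def any_reduce_shortcircuit(arr):
    """Sketch of the patched reduce: keep the running value in a local
    variable and break as soon as it becomes True, instead of
    unconditionally OR-ing every element as logical_or.reduce does."""
    acc = False
    for v in arr:
        acc = acc or bool(v)
        if acc:           # short-circuit: the rest cannot change the result
            break
    return acc

assert any_reduce_shortcircuit([0, 0, 1, 0]) is True
assert any_reduce_shortcircuit([0, 0, 0]) is False
```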
@thouis (Contributor Author) commented Oct 19, 2012

Attachment added by trac user lucaciti on 2009-09-24: fix1237.patch

@thouis (Contributor Author) commented Oct 19, 2012

trac user lucaciti wrote on 2009-09-24

Please note that in the table the rows marked "//" are cases where the condition is never satisfied; the improvement we see between the starred and non-starred versions on those rows therefore cannot be due to short-circuiting.

I suspect it is due to the fact that

  • the loop is performed incrementing 1 pointer instead of 3
  • the value that is carried on from one cycle of the loop to the next is stored in a variable (possibly a register), instead of dereferenced from a pointer twice for each cycle

Therefore, I think the same trick I used for logical_and and logical_or could be extended to all functions, and all reduction operations would benefit from it (e.g. the ubiquitous x.sum(), x.mean(), ...).
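In Python terms, the accumulator trick looks like the following sketch (assumption: in the C version the accumulator can live in a register and only a single input pointer is advanced):

```python
def sum_reduce(values):
    # Keep the running total in a local variable and walk a single input
    # sequence, rather than re-reading an output location on every
    # iteration; this mirrors the specialized reduce loop described above.
    acc = 0
    for v in values:
        acc += v
    return acc

assert sum_reduce([1, 2, 3, 4]) == 10
```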

One thing I do not get is why the timings for "any" and "all" differ, and why their order changes between the starred and non-starred versions. The only differences are the operation (and/or) and the short-circuit check (==0 vs !=0).

@thouis (Contributor Author) commented Oct 19, 2012

@charris wrote on 2009-09-25

What happens if you do the same trick for add?

@thouis (Contributor Author) commented Oct 19, 2012

trac user lucaciti wrote on 2009-09-25

I specialized the loops for the common operations so that they use one loop for in-place operations and a different one for reductions.

It looks like the specialized in-place loop only gives a slight boost (at least with the vector lengths and data types I have tried).

The specialized reduce loop, instead, gives a significant boost: around 60% faster. Since many operations rely on reductions (e.g. mean, std, ...), this can have a widespread positive effect on speed.

Below are the results of the tests I have done so far. For "any" and "all", only the most favourable case (a short-circuit at the first element) and the least favourable one ("zeros" for "any", "ones" for "all") are shown.

In [1]: import sys
In [2]: sys.path.insert(0, '/numpy') # VS '/numpy-fastred'
In [3]: import numpy as np
In [4]: x = np.arange(10000)
In [5]: y = np.ones_like(x)
In [6]: _ip.magic("timeit np.add(x, y, x)")
100000 loops, best of 3: 14.1 µs per loop # VS 13.5 µs
In [7]: _ip.magic("timeit x.sum()")
10000 loops, best of 3: 27.4 µs per loop # VS 10.9 µs
In [8]: x = np.arange(10000.)
In [9]: y = np.ones_like(x)
In [10]: _ip.magic("timeit np.add(x, y, x)")
100000 loops, best of 3: 14.4 µs per loop # VS 14.4 µs
In [11]: _ip.magic("timeit x.sum()")
10000 loops, best of 3: 39.4 µs per loop # VS 15 µs
In [12]: _ip.magic("timeit x.std()")
10000 loops, best of 3: 120 µs per loop # VS 70.8 µs
In [13]: x = np.ones(10000, dtype=bool)
In [14]: _ip.magic("timeit x.all()")
10000 loops, best of 3: 40.3 µs per loop # VS 19.4 µs
In [15]: _ip.magic("timeit x.any()")
10000 loops, best of 3: 48.6 µs per loop # VS 3.32 µs
In [16]: x = np.zeros(10000, dtype=bool)
In [17]: _ip.magic("timeit x.all()")
10000 loops, best of 3: 40.2 µs per loop # VS 3.33 µs
In [18]: _ip.magic("timeit x.any()")
10000 loops, best of 3: 49.8 µs per loop # VS 11.3 µs

@thouis (Contributor Author) commented Oct 19, 2012

Attachment added by trac user lucaciti on 2009-09-25: fix-1237_rev2.patch

@thouis (Contributor Author) commented Oct 19, 2012

@charris wrote on 2009-09-25

Looks like the reduce part is worth doing. Can you think up a pair of macros for this case? I'm thinking of something that looks like

IF_REDUCE_BINARY_LOOP {
    blah;
}
ELSE_BINARY_LOOP {
    blah-de-blah;
}

I kind of detest burying control flow in macros, but sometimes it is the only way to make all the code look alike, which is important in massively repetitive stuff like the loops.

If you come up with something, make it a separate patch.

@thouis (Contributor Author) commented Oct 19, 2012

trac user lucaciti wrote on 2009-09-25

Hello!

You mean the reduce and the short-circuiting?
I'll submit a patch implementing both.

I agree with you, I like the "if" outside the macro better. I think the solution I found isn't that bad. Let me know if you meant something different.

@thouis (Contributor Author) commented Oct 19, 2012

Attachment added by trac user lucaciti on 2009-09-25: fix-1237_rev3.patch

@thouis (Contributor Author) commented Oct 19, 2012

@charris wrote on 2009-09-25

Looks better, but the use of a macro argument bothers me; the idea is to hide most of the repetitive stuff that doesn't serve to clarify the code. I don't think the in-place case is worth the hassle. Style-wise, the rest of the code has been using

}
else {

I'm not adamant about that, and I seem to be in the minority in that preference, but until it is discussed on the list and someone (else) steps forward to change all the current code, I would prefer it to conform to current usage.

Make two patches: one for the macros, a second for the loop changes.

@thouis (Contributor Author) commented Oct 19, 2012

trac user lucaciti wrote on 2009-09-25

I am fine with the style thing and the two patches.

But concerning the "use of a macro argument", I do not see how to remove it. The "repeat" expansion happens before the preprocessor, right?
Since the initialization parameter, io1, needs to know the type of the data, it cannot live inside the macro, am I wrong? Using *iop1 instead of io1 prevents the compiler from keeping the value in a register and forces it to move it to and from memory on each cycle.

If you have a better solution, you can implement it differently; I honestly do not mind.

Thanks!
Luca

@thouis (Contributor Author) commented Oct 19, 2012

@charris wrote on 2009-09-25

I was hinting that you could bring the style thing up on the list, where I suspect you will find plenty of support. I was instigating a revolt against myself ;)

On the macro argument thing, the Boolean section is an anomaly because the output type differs from the input type. There are some other functions like that, but they are all unary. Hmm, I think the template argument should take at most the type and that the variable name io1 should be hardwired. That would make the function of the template a bit clearer. Let me think about it a bit more.

@thouis (Contributor Author) commented Oct 19, 2012

Attachment added by trac user lucaciti on 2009-09-26: fix-1237_rev4.patch

@thouis (Contributor Author) commented Oct 19, 2012

trac user lucaciti wrote on 2009-09-26

The macro thing looks better now.

I tend to use "else" between the two braces, but it is a matter of taste. I think I understand why numpy programmers like the other way better: they mentally associate the open brace with a semicolon, just ignore the closing brace... et voilà, they can pretend they are programming in Python :-D

@thouis (Contributor Author) commented Oct 19, 2012

@charris wrote on 2009-09-26

Looks good, thanks. Maybe use BINARY_REDUCE_LOOP, but if you don't do that I'll commit it as is tomorrow.

I prefer the not-two-braces thing because my aging eyes find it a bit easier to catch the "else" than the "}" when scanning down the edge of the code, especially if "else" is colorized. Most people do prefer the two-braces style, however, and it takes up less space.

@thouis (Contributor Author) commented Oct 19, 2012

Attachment added by trac user lucaciti on 2009-09-27: fix-1237_rev5.patch

@thouis (Contributor Author) commented Oct 19, 2012

@charris wrote on 2009-09-27

Applied in r7420, thanks.

I removed some trailing whitespace and changed

if (isblah)
    break;

to

if (isblah) {
    break;
}

You can probably set up your editor to highlight trailing whitespace. There are also some scripts that will do that.

@thouis (Contributor Author) commented Oct 19, 2012

trac user lucaciti wrote on 2009-09-27

Thanks! Perfect.
