MAINT: Speed up np.maximum/np.minimum by up to 15x when operating on two inputs #13207
Conversation
I adjusted the title - this specifically targets
Using NPY_GCC_OPT_3 seems like a win. There is a remark about not using it too much, since it can lead to code bloat. Any opinions on the cost of it here?
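For context, here is a rough sketch of what such a per-function optimization attribute does; this assumes NPY_GCC_OPT_3 expands to GCC's optimize attribute on GCC builds (and to nothing elsewhere), and the function and type names below are illustrative rather than the actual loops.c.src code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the macro: per-function -O3 on GCC, a no-op elsewhere,
 * so only the marked hot loops pay the extra code-size cost rather than
 * the whole translation unit. */
#if defined(__GNUC__)
#define OPT_3 __attribute__((optimize("O3")))
#else
#define OPT_3
#endif

/* Illustrative inner loop: with the attribute, GCC compiles just this
 * function at -O3 (which enables the tree vectorizer). */
static void OPT_3
int32_maximum_contig(const int32_t *in1, const int32_t *in2,
                     int32_t *out, ptrdiff_t n)
{
    for (ptrdiff_t i = 0; i < n; i++) {
        out[i] = in1[i] > in2[i] ? in1[i] : in2[i];
    }
}
```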
I can do an actual comparison sometime tomorrow, but a similar change to
Unfortunately,
On my machine (Ubuntu 18.04 x64) the .so size grows by 2.4% to 16MB.
numpy/core/src/umath/loops.c.src
Outdated
}
}
BINARY_LOOP_FAST(@type@, @type@,
if (in1 == NPY_DATETIME_NAT) {
I don't think we need a fast loop for datetime
does the compiler vectorize this?
Nope, with gcc 7.3 and -ftree-vectorize -fopt-info-vec-missed you get:
numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: control flow in loop.
numpy/core/src/umath/loops.c.src:1225:5: note: bad loop form.
numpy/core/src/umath/loops.c.src:1225:5: note: not consecutive access n_21 = *dimensions_20(D);
numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: no grouped stores in basic block.
numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: not enough data-refs in basic block.
numpy/core/src/umath/loops.c.src:1228:12: note: not consecutive access in1_22 = MEM[(npy_datetime *)ip1_32];
numpy/core/src/umath/loops.c.src:1228:12: note: not consecutive access in2_23 = MEM[(npy_datetime *)ip2_31];
numpy/core/src/umath/loops.c.src:1228:12: note: not vectorized: no grouped stores in basic block.
numpy/core/src/umath/loops.c.src:1229:36: note: not vectorized: not enough data-refs in basic block.
Also, the 25% speedup in the asv benchmarks requires both -O3 and the fast loop.
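To make the diagnostics above concrete, here is a hedged sketch of the loop shape they describe (illustrative names, not the actual loops.c.src code): the NaT check puts control flow inside the loop body, which is exactly the "control flow in loop" note, so gcc 7.3 gives up on vectorizing it even at -O3.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: NaT is stored as the minimum int64 and propagates through
 * maximum/minimum, similar to NaN for floats. */
#define DATETIME_NAT INT64_MIN

static void
datetime_maximum_contig(const int64_t *in1, const int64_t *in2,
                        int64_t *out, ptrdiff_t n)
{
    for (ptrdiff_t i = 0; i < n; i++) {
        /* The branch below is the "control flow in loop" that the
         * vectorizer complains about. */
        if (in1[i] == DATETIME_NAT || in2[i] == DATETIME_NAT) {
            out[i] = DATETIME_NAT;
        }
        else {
            out[i] = in1[i] > in2[i] ? in1[i] : in2[i];
        }
    }
}
```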
numpy/core/src/umath/loops.c.src
Outdated
const npy_half in2 = *(npy_half *)ip2;
*((npy_half *)op1) = (@OP@(in1, in2) || npy_half_isnan(in1)) ? in1 : in2;
}
BINARY_LOOP_FAST(npy_half, npy_half,
Similar to datetime, performance of float16 is probably not a big concern: it is an input/output format, and computations should be done on data cast to float32.
Thanks, removed
The fast loop was not applied to many of these because it seemed like unnecessary bloat for little-used functions or types. That said, I have no issue with bloating a bit for faster integer and float operations; in the end it is mostly just size on disk and somewhat more download bandwidth for distribution.
So it sounds like this PR is almost ready to go. @qwhelan, could you address the indent and exclude datetime/float16?
34a8d2f to 8b5196d
@mattip Done
LGTM. I guess this should get a mention in the Improvement section of the release notes; we have already noted
Can we conclude on the bloat discussion first? Can one of you summarize the current state?
On my machine using gcc-7, the change from the commit just before this one to 8b5196d is
Thanks @mattip. My take on this is:
tl;dr: let's merge this, and let's try not to make these kinds of changes a habit. EDIT: will send this to the list as well.
Perhaps we should enable this only for int8. Looking at the benchmark results, the win for 32-bit float and int is 3x, whereas the win for int8 is 10-15x. 16-bit int does not show up - was it unchanged or did I miss it? In any case, let's hold off on merging this until we see how enabling the fast loop for np.maximum.reduce affects benchmarks and code bloat.
class MinMax(UFuncBenchmark):
    def time_fmax(self, dtype):
        np.fmax(self.arr, self.arr)
Having spent too much time understanding benchmarks, I would suggest preallocating and prefaulting your output arrays.
param_names = ['dtype']

def setup(self, dtype):
    self.arr = np.full(100000, 1, dtype=dtype)
self.arr = np.full(100000, 1, dtype=dtype)
self.arr_2 = np.full(100000, 1, dtype=dtype)
self.out = np.full(100000, 1, dtype=dtype)
np.fmax(self.arr, self.arr)

def time_fmin(self, dtype):
    np.fmin(self.arr, self.arr)
np.fmin(self.arr, self.arr_2, out=self.out)
Using arr as both inputs is something of an edge case. Using the out parameter will ensure you aren't allocating a new array (which can be slow or fast depending on the previous state of the RAM).
I won't have bandwidth to do that, so anyone interested should feel free to do so.
On Thu, Jun 9, 2022, 5:26 PM Charles Harris wrote:
@qwhelan Needs a rebase. I'm wondering if these functions have been vectorized (SIMD). @seberg Do you recall?
Yeah, we have a whole file for it here: https://github.com/numpy/numpy/blob/49c560c22f137907ea6a240591e49b004f28444b/numpy/core/src/umath/loops_minmax.dispatch.c.src. That said, I guess there are a couple of places here that could still use the attribute, for example the datetime (and maybe half) loops. There may be a general question whether we should be adding such optimization attributes basically by default to most things in
I will close this. Thanks @qwhelan for showing the potential of SIMD compilation.
For calls of the form np.minimum(arr_1, arr_2), the runtime does not scale with sizeof(type) as we would generally expect. Notably, the integer types all take about the same amount of time to run irrespective of the size of the data, which indicates we're not constrained on memory bandwidth. Using the existing BINARY_LOOP_FAST macro and enabling -O3 is sufficient to get some broad speedups. Please note that this does not impact the more common np.min(arr) reduction call - I will have a separate PR addressing those functions in the near future.
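For readers who have not seen the macro, here is a simplified sketch of the idea behind a fast binary loop; it is not the actual BINARY_LOOP_FAST definition from loops.c.src, and the names are illustrative. The point is to detect the common case where both inputs and the output are contiguous, run a tight indexed loop that the compiler can auto-vectorize at -O3, and fall back to the generic strided loop otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified ufunc-style inner loop for int32 maximum. Real NumPy inner
 * loops take (args, dimensions, steps, func); this sketch keeps only the
 * parts relevant to the fast path. */
static void
int32_maximum_loop(char **args, const ptrdiff_t *dimensions,
                   const ptrdiff_t *steps)
{
    const ptrdiff_t n = dimensions[0];

    if (steps[0] == sizeof(int32_t) &&
        steps[1] == sizeof(int32_t) &&
        steps[2] == sizeof(int32_t)) {
        /* Fast path: everything contiguous, so a simple indexed loop that
         * the auto-vectorizer can turn into SIMD max instructions. */
        const int32_t *in1 = (const int32_t *)args[0];
        const int32_t *in2 = (const int32_t *)args[1];
        int32_t *out = (int32_t *)args[2];
        for (ptrdiff_t i = 0; i < n; i++) {
            out[i] = in1[i] > in2[i] ? in1[i] : in2[i];
        }
    }
    else {
        /* Generic strided fallback, analogous to the plain BINARY_LOOP. */
        char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];
        for (ptrdiff_t i = 0; i < n;
             i++, ip1 += steps[0], ip2 += steps[1], op1 += steps[2]) {
            const int32_t a = *(const int32_t *)ip1;
            const int32_t b = *(const int32_t *)ip2;
            *(int32_t *)op1 = a > b ? a : b;
        }
    }
}
```

The runtime contiguity check is cheap next to the loop itself, so the cost of duplicating the body shows up as binary size (the bloat discussed above) rather than runtime.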