
MAINT: Speed up np.maximum/np.minimum by up to 15x when operating on two inputs #13207

Closed · wants to merge 1 commit
Conversation

@qwhelan (Contributor) commented Mar 29, 2019

For calls of the form np.minimum(arr_1, arr_2), the runtime does not scale with sizeof(type) as we would generally expect:

[100.00%] ··· bench_ufunc.MinMax.time_minimum                                                                                                                             ok
[100.00%] ··· ================= ============
                    dtype
              ----------------- ------------
                     bool        6.54±0.2μs    <----|
                    uint8         88.9±2μs     <----|--- these are the same size
                     int8         85.1±1μs     <----|
                    int32         75.9±2μs
                    uint64        82.3±1μs
                    int64         82.6±1μs
                   float16        795±20μs
                   float32        102±3μs
                   float64        86.7±1μs
                  complex128      178±4μs
                datetime64[ns]    101±2μs
               timedelta64[ns]    101±2μs
              ================= ============

Notably, the integer types all take about the same amount of time to run irrespective of their element size, which indicates we're not constrained by memory bandwidth. Using the existing BINARY_LOOP_FAST macro and enabling -O3 is sufficient to get some broad speedups:
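The "these are the same size" annotation in the table can be checked directly from dtype metadata; a quick illustration:

```python
import numpy as np

# bool, uint8, and int8 all occupy one byte per element, so a loop that is
# limited purely by memory bandwidth should take roughly the same time for
# all three -- the benchmark above shows they do not.
for name in ["bool", "uint8", "int8", "int32", "int64"]:
    print(name, np.dtype(name).itemsize)
```

Since bool is ~13x faster than uint8/int8 at identical byte counts, the bottleneck must be in the inner loop, not memory.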

asv compare HEAD^ HEAD -s --sort ratio

Benchmarks that have improved:

       before           after         ratio
     [db5fcc8e]       [34a8d2f9]
     <maximum_speedup~1>       <maximum_speedup>
-        99.4±2μs         78.3±3μs     0.79  bench_ufunc.MinMax.time_fmin('datetime64[ns]')
-        97.3±1μs         75.7±3μs     0.78  bench_ufunc.MinMax.time_minimum('datetime64[ns]')
-        99.1±2μs         76.3±1μs     0.77  bench_ufunc.MinMax.time_maximum('timedelta64[ns]')
-        99.2±1μs         75.3±3μs     0.76  bench_ufunc.MinMax.time_minimum('timedelta64[ns]')
-        83.1±3μs         62.3±1μs     0.75  bench_ufunc.MinMax.time_maximum('float64')
-        105±10μs         78.3±3μs     0.75  bench_ufunc.MinMax.time_fmax('timedelta64[ns]')
-        82.1±2μs         61.1±3μs     0.74  bench_ufunc.MinMax.time_minimum('float64')
-         102±2μs        71.3±10μs     0.70  bench_ufunc.MinMax.time_fmax('float64')
-        83.5±3μs         31.4±2μs     0.38  bench_ufunc.MinMax.time_maximum('float32')
-        73.2±3μs       25.4±0.5μs     0.35  bench_ufunc.MinMax.time_maximum('int32')
-      76.8±0.7μs       26.5±0.7μs     0.35  bench_ufunc.MinMax.time_fmax('int32')
-        76.6±4μs       25.7±0.5μs     0.34  bench_ufunc.MinMax.time_minimum('int32')
-        77.9±4μs       25.2±0.7μs     0.32  bench_ufunc.MinMax.time_fmin('int32')
-        98.2±2μs         31.7±1μs     0.32  bench_ufunc.MinMax.time_minimum('float32')
-        98.7±3μs       31.4±0.8μs     0.32  bench_ufunc.MinMax.time_fmax('float32')
-        98.8±2μs       30.3±0.8μs     0.31  bench_ufunc.MinMax.time_fmin('float32')
-        78.5±2μs       7.08±0.2μs     0.09  bench_ufunc.MinMax.time_minimum('int8')
-       95.7±20μs         8.59±1μs     0.09  bench_ufunc.MinMax.time_fmin('int8')
-      88.5±0.6μs      7.18±0.09μs     0.08  bench_ufunc.MinMax.time_maximum('int8')
-        89.7±2μs       7.11±0.3μs     0.08  bench_ufunc.MinMax.time_fmax('int8')
-        81.6±1μs       5.37±0.2μs     0.07  bench_ufunc.MinMax.time_fmax('uint8')
-      80.5±0.5μs       5.23±0.2μs     0.06  bench_ufunc.MinMax.time_maximum('uint8')
-        87.1±1μs       5.38±0.1μs     0.06  bench_ufunc.MinMax.time_minimum('uint8')
-       92.1±20μs       5.37±0.4μs     0.06  bench_ufunc.MinMax.time_fmin('uint8')

Benchmarks that have stayed the same:

       before           after         ratio
     [db5fcc8e]       [34a8d2f9]
     <maximum_speedup~1>       <maximum_speedup>
         907±20μs         975±20μs     1.07  bench_ufunc.MinMax.time_maximum('float16')
          171±7μs         183±10μs     1.07  bench_ufunc.MinMax.time_minimum('complex128')
          817±9μs         872±70μs     1.07  bench_ufunc.MinMax.time_minimum('float16')
       6.01±0.1μs       6.31±0.2μs     1.05  bench_ufunc.MinMax.time_minimum('bool')
         78.9±2μs         82.2±2μs     1.04  bench_ufunc.MinMax.time_minimum('uint64')
        861±200μs         895±20μs     1.04  bench_ufunc.MinMax.time_fmin('float16')
       11.1±0.3μs       11.5±0.8μs     1.03  bench_reduce.MinMax.time_min(<class 'numpy.float64'>)
       11.0±0.4μs       11.3±0.1μs     1.03  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
       7.55±0.2μs       7.75±0.1μs     1.03  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)
         953±10μs         969±10μs     1.02  bench_ufunc.MinMax.time_fmax('float16')
          206±5μs          207±7μs     1.01  bench_ufunc.MinMax.time_fmax('complex128')
       5.83±0.3μs       5.86±0.1μs     1.00  bench_ufunc.MinMax.time_fmax('bool')
       17.0±0.3μs       17.1±0.6μs     1.00  bench_reduce.MinMax.time_min(<class 'numpy.int64'>)
       7.80±0.4μs       7.83±0.3μs     1.00  bench_reduce.MinMax.time_max(<class 'numpy.float32'>)
       17.2±0.3μs       17.2±0.4μs     1.00  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
      5.93±0.06μs       5.90±0.2μs     1.00  bench_ufunc.MinMax.time_maximum('bool')
          176±3μs          175±8μs     0.99  bench_ufunc.MinMax.time_maximum('complex128')
       6.63±0.8μs       6.44±0.1μs     0.97  bench_ufunc.MinMax.time_fmin('bool')
          206±4μs          198±3μs     0.96  bench_ufunc.MinMax.time_fmin('complex128')
         83.7±6μs         80.0±2μs     0.96  bench_ufunc.MinMax.time_fmin('uint64')
       72.4±0.8μs         68.3±2μs     0.94  bench_ufunc.MinMax.time_minimum('int64')
         73.9±2μs         69.2±2μs     0.94  bench_ufunc.MinMax.time_fmax('int64')
         76.4±1μs         71.2±2μs     0.93  bench_ufunc.MinMax.time_maximum('int64')
       76.4±0.8μs         70.0±5μs     0.92  bench_ufunc.MinMax.time_fmin('int64')
         73.5±2μs         66.9±2μs     0.91  bench_ufunc.MinMax.time_fmax('uint64')
       77.8±0.6μs         69.9±2μs    ~0.90  bench_ufunc.MinMax.time_maximum('uint64')
         94.1±5μs         82.4±8μs    ~0.88  bench_ufunc.MinMax.time_fmin('timedelta64[ns]')
         98.1±3μs         82.8±5μs    ~0.84  bench_ufunc.MinMax.time_fmax('datetime64[ns]')
         84.9±3μs         66.2±6μs    ~0.78  bench_ufunc.MinMax.time_fmin('float64')
          100±2μs         74.7±4μs    ~0.74  bench_ufunc.MinMax.time_maximum('datetime64[ns]')

Please note that this does not impact the more common np.min(arr) reduction call - I will have a separate PR addressing those functions in the near future.
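For clarity, the two call forms can be distinguished with a small example (np.min is the reduction, equivalent to np.minimum.reduce; only the binary elementwise path is touched here):

```python
import numpy as np

a = np.array([3, 1, 4])
b = np.array([2, 7, 1])

# Elementwise binary ufunc: one output element per input pair.
# This is the loop this PR speeds up.
elementwise = np.minimum(a, b)
print(elementwise)

# Reduction over a single array: a different inner loop, untouched here.
reduction = np.min(a)
print(reduction)
```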

@eric-wieser eric-wieser changed the title PERF: speed up np.max/np.min by up to 15x when operating on two inputs PERF: speed up np.maximum/np.minimum by up to 15x when operating on two inputs Mar 29, 2019
@eric-wieser (Member):

I adjusted the title: this specifically targets np.maximum, not np.max (== np.maximum.reduce).

@charris charris changed the title PERF: speed up np.maximum/np.minimum by up to 15x when operating on two inputs MAINT: Speed up np.maximum/np.minimum by up to 15x when operating on two inputs Apr 3, 2019
@mattip (Member) commented Apr 18, 2019

Using NPY_GCC_OPT_3 seems like a win. There is a remark about not using it too much since it can lead to code bloat. Any opinions on the cost of it here?

@qwhelan (Contributor, author) commented Apr 18, 2019

I can do an actual comparison sometime tomorrow, but a similar change to np.abs() increased loops.o by 0.16% (#13271). This will likely be a bit more due to more types and logic.

@qwhelan (Contributor, author) commented Apr 22, 2019

Unfortunately, BINARY_LOOP_FAST causes quite a bit more bloat (~7%) than the example I gave. Most of the macro deals with potential aliasing and scalar-comparison cases; the bloat could be cut in half or more if those cases were dropped for these functions. That said, this increase seems a bit much for np.maximum, given that the reduction version is far more common.
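As a rough illustration of why the macro generates so much code, its case split can be sketched in Python pseudocode (a loose sketch with hypothetical names; the real macro operates on raw pointers and strides in C, and each branch expands to a separately compiled loop):

```python
def binary_loop_fast_sketch(a, b, out, op):
    """Rough Python sketch of the case analysis in BINARY_LOOP_FAST.

    Each branch corresponds to a separate specialized C loop, which is
    where the code-size bloat comes from.
    """
    n = len(out)
    if len(a) == 1:            # in1 is a stride-0 scalar broadcast
        a0 = a[0]
        for i in range(n):     # simple loop the compiler can vectorize
            out[i] = op(a0, b[i])
    elif len(b) == 1:          # in2 is a stride-0 scalar broadcast
        b0 = b[0]
        for i in range(n):
            out[i] = op(a[i], b0)
    else:                      # general elementwise case
        for i in range(n):
            out[i] = op(a[i], b[i])
    return out
```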

@mattip (Member) commented Apr 22, 2019

On my machine (Ubuntu 18.04 x64) the .so size grows by 2.4% to 16MB.

    }
}
BINARY_LOOP_FAST(@type@, @type@,
    if (in1 == NPY_DATETIME_NAT) {
Contributor review comment:

I don't think we need a fast loop for datetime. Does the compiler vectorize this?

Contributor (author) reply:

Nope, with gcc 7.3 and -ftree-vectorize -fopt-info-vec-missed you get:

numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: control flow in loop.
numpy/core/src/umath/loops.c.src:1225:5: note: bad loop form.
numpy/core/src/umath/loops.c.src:1225:5: note: not consecutive access n_21 = *dimensions_20(D);
numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: no grouped stores in basic block.
numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: not enough data-refs in basic block.
numpy/core/src/umath/loops.c.src:1228:12: note: not consecutive access in1_22 = MEM[(npy_datetime *)ip1_32];
numpy/core/src/umath/loops.c.src:1228:12: note: not consecutive access in2_23 = MEM[(npy_datetime *)ip2_31];
numpy/core/src/umath/loops.c.src:1228:12: note: not vectorized: no grouped stores in basic block.
numpy/core/src/umath/loops.c.src:1229:36: note: not vectorized: not enough data-refs in basic block.

Also, the 25% speedup in the asv benchmarks requires both -O3 and the fast loop.
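The NAT branch is exactly the control flow the vectorizer complains about; its behavioral effect can be seen from Python (in recent NumPy versions, NaT propagates through np.minimum, while np.fmin skips it, mirroring the NaN handling of fmin on floats — the dates below are arbitrary example values):

```python
import numpy as np

t = np.array(["2019-03-29", "NaT"], dtype="datetime64[D]")
u = np.array(["2019-04-03", "2019-04-03"], dtype="datetime64[D]")

# np.minimum propagates NaT -- hence the `in1 == NPY_DATETIME_NAT` check...
m = np.minimum(t, u)
print(m)

# ...while np.fmin skips NaT, analogous to fmin ignoring NaN for floats.
f = np.fmin(t, u)
print(f)
```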

        const npy_half in2 = *(npy_half *)ip2;
        *((npy_half *)op1) = (@OP@(in1, in2) || npy_half_isnan(in1)) ? in1 : in2;
    }
    BINARY_LOOP_FAST(npy_half, npy_half,
Contributor review comment:

Similar to datetime, performance of float16 is probably not a big concern: it is an input/output format, and computations should be done on data cast to float32.
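The usage pattern the reviewer describes — treat float16 as a storage format and compute after casting up — might look like this (a generic illustration with made-up values, not NumPy internals):

```python
import numpy as np

# float16 arrays as the storage format.
a = np.array([1.5, 2.0], dtype=np.float16)
b = np.array([1.0, 3.0], dtype=np.float16)

# Compute in float32, then cast the result back down for storage.
res = np.minimum(a.astype(np.float32), b.astype(np.float32)).astype(np.float16)
print(res)
```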

Contributor (author) reply:

Thanks, removed

@juliantaylor (Contributor) commented Apr 24, 2019

The fast loop was not applied to many of these because it seemed like unnecessary bloat for little-used functions or types.

That said, I have no issue with bloating a bit for faster integer and float operations; in the end it is mostly just size on disk and somewhat more download bandwidth for distribution.
Datetime and half floats IMO do not need the fast loop, but then again those are only three types compared to the many different integers, so it probably makes no difference in bloat size and avoids having to look at them again in similar changes in the future.

@mattip (Member) commented Apr 25, 2019

So it sounds like this PR is almost ready to go. @qwhelan, could you address the indentation and exclude datetime/float16?

@qwhelan (Contributor, author) commented Apr 29, 2019

@mattip Done

@mattip (Member) commented Apr 29, 2019

LGTM. I guess this should get a mention in the Improvements section of the release notes; we have already noted numpy.exp and numpy.log, so putting it next to those seems appropriate.

@rgommers (Member):

Can we conclude on the bloat discussion first? Can one of you summarize the current state?

@mattip (Member) commented Apr 29, 2019

On my machine using gcc-7, the change from the commit just before this one to 8b5196d is
17192648 bytes -> 17611176 bytes, a change of 408kB out of 16789kB, or around 2.4%. The mailing list discussion did not provide a hard criterion for a decision, although the general direction was that disk size does not matter as much as download size.

@rgommers (Member) commented Apr 30, 2019

Thanks @mattip. My take on this is:

  • This case is borderline. If we wrote down a hard criterion, this likely would not meet it.
  • Rationale: if we got 100 PRs like this, the average performance of NumPy for a user would not change much; however, we would by then have blown up the size NumPy takes up (disk/RAM/download/etc.) by a factor of ~2.4.
  • However, we won't get 100 PRs like this, so judging this based on such a criterion isn't quite right.
  • We have this PR now, and it's good to go. Presumably it helps @qwhelan significantly. So I'm +0.5 for merging it.
  • Also note that Cython has the same problem: taking one function and putting it in a .pyx file adds a huge amount of bloat (example: scipy.ndimage.label). We had the same discussion there, but it never became a practical issue because there were not many other PRs like that.

tl;dr: let's merge this, and let's try not to make these kinds of changes a habit.

EDIT: will send this to the list as well

@mattip (Member) commented Apr 30, 2019

Perhaps we should enable this only for int8.

Looking at the benchmark results, the win for 32-bit float and int is 3x, whereas the win for int8 is 10-15x. 16-bit int does not show up - was it unchanged, or did I miss it?

In any case, let's hold off on merging this till we see how enabling the fast loop for np.maximum.reduce affects benchmarks and code bloat.

Base automatically changed from master to main March 4, 2021 02:04

class MinMax(UFuncBenchmark):
    def time_fmax(self, dtype):
        np.fmax(self.arr, self.arr)
Contributor review comment:

Having spent too much time understanding benchmarks, I would suggest preallocating and prefaulting your output arrays.

    param_names = ['dtype']

    def setup(self, dtype):
        self.arr = np.full(100000, 1, dtype=dtype)
Contributor review comment:

Suggested change:
  self.arr = np.full(100000, 1, dtype=dtype)
+ self.arr_2 = np.full(100000, 1, dtype=dtype)
+ self.out = np.full(100000, 1, dtype=dtype)

        np.fmax(self.arr, self.arr)

    def time_fmin(self, dtype):
        np.fmin(self.arr, self.arr)
Contributor review comment:

Suggested change:
- np.fmin(self.arr, self.arr)
+ np.fmin(self.arr, self.arr_2, out=self.out)

Using arr as both inputs is something of an edge case. Using the out parameter ensures you aren't allocating a new array (which can be slow or fast depending on the previous state of the RAM).
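Putting the two suggestions together, the benchmark class might look like this (an illustrative asv-style sketch; the UFuncBenchmark base class is omitted here to keep it self-contained):

```python
import numpy as np

class MinMax:
    """Illustrative asv-style benchmark with preallocated, prefaulted output."""
    param_names = ['dtype']

    def setup(self, dtype):
        n = 100000
        # np.full writes every element, so all three arrays are faulted in
        # (pages resident in RAM) before the timed section runs.
        self.arr = np.full(n, 1, dtype=dtype)
        self.arr_2 = np.full(n, 1, dtype=dtype)
        self.out = np.full(n, 1, dtype=dtype)

    def time_fmin(self, dtype):
        # out= avoids allocating a fresh result array on every call.
        np.fmin(self.arr, self.arr_2, out=self.out)
```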

@charris (Member) commented Jun 10, 2022

@qwhelan Needs a rebase. I'm wondering if these functions have been vectorized (SIMD). @seberg Do you recall?

@charris charris added the triage review Issue/PR to be discussed at the next triage meeting label Jun 10, 2022
@qwhelan (Contributor, author) commented Jun 10, 2022 via email

@seberg (Member) commented Jun 10, 2022

Yeah, we have a whole file for it here: https://github.com/numpy/numpy/blob/49c560c22f137907ea6a240591e49b004f28444b/numpy/core/src/umath/loops_minmax.dispatch.c.src

That said, I guess there are a couple of places here that could still use the attribute, for example the datetime (and maybe half) loops.

There may be a general question of whether we should be adding such optimization attributes by default to most things in loops.c.src.

@mattip mattip removed the triage review Issue/PR to be discussed at the next triage meeting label Jun 15, 2022
@mattip (Member) commented Jun 15, 2022

I will close this. Thanks @qwhelan for showing the potential of SIMD compilation.

@mattip mattip closed this Jun 15, 2022
8 participants