
MAINT: Speed up np.maximum/np.minimum by up to 15x when operating on two inputs #13207

Closed · wants to merge 1 commit
Conversation

@qwhelan (Contributor) commented Mar 29, 2019

For calls of the form np.minimum(arr_1, arr_2), the runtime does not scale with sizeof(type) as we would generally expect:

[100.00%] ··· bench_ufunc.MinMax.time_minimum                                                                                                                             ok
[100.00%] ··· ================= ============
                    dtype
              ----------------- ------------
                     bool        6.54±0.2μs    <----|
                    uint8         88.9±2μs     <----|--- these are the same size
                     int8         85.1±1μs     <----|
                    int32         75.9±2μs
                    uint64        82.3±1μs
                    int64         82.6±1μs
                   float16        795±20μs
                   float32        102±3μs
                   float64        86.7±1μs
                  complex128      178±4μs
                datetime64[ns]    101±2μs
               timedelta64[ns]    101±2μs
              ================= ============

Notably, the integer types all take about the same amount of time to run irrespective of their element size, which indicates we're not constrained by memory bandwidth. Using the existing BINARY_LOOP_FAST macro and enabling -O3 is sufficient to get some broad speedups:
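The "these are the same size" annotation in the table can be checked directly from dtype metadata; a quick illustration:

```python
import numpy as np

# bool, uint8, and int8 all occupy one byte per element, so a loop that is
# limited purely by memory bandwidth should take roughly the same time for
# all three -- the benchmark above shows they do not.
for name in ["bool", "uint8", "int8", "int32", "int64"]:
    print(name, np.dtype(name).itemsize)
```

Since bool is ~13x faster than uint8/int8 at identical byte counts, the bottleneck must be in the inner loop, not memory.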

asv compare HEAD^ HEAD -s --sort ratio

Benchmarks that have improved:

       before           after         ratio
     [db5fcc8e]       [34a8d2f9]
     <maximum_speedup~1>       <maximum_speedup>
-        99.4±2μs         78.3±3μs     0.79  bench_ufunc.MinMax.time_fmin('datetime64[ns]')
-        97.3±1μs         75.7±3μs     0.78  bench_ufunc.MinMax.time_minimum('datetime64[ns]')
-        99.1±2μs         76.3±1μs     0.77  bench_ufunc.MinMax.time_maximum('timedelta64[ns]')
-        99.2±1μs         75.3±3μs     0.76  bench_ufunc.MinMax.time_minimum('timedelta64[ns]')
-        83.1±3μs         62.3±1μs     0.75  bench_ufunc.MinMax.time_maximum('float64')
-        105±10μs         78.3±3μs     0.75  bench_ufunc.MinMax.time_fmax('timedelta64[ns]')
-        82.1±2μs         61.1±3μs     0.74  bench_ufunc.MinMax.time_minimum('float64')
-         102±2μs        71.3±10μs     0.70  bench_ufunc.MinMax.time_fmax('float64')
-        83.5±3μs         31.4±2μs     0.38  bench_ufunc.MinMax.time_maximum('float32')
-        73.2±3μs       25.4±0.5μs     0.35  bench_ufunc.MinMax.time_maximum('int32')
-      76.8±0.7μs       26.5±0.7μs     0.35  bench_ufunc.MinMax.time_fmax('int32')
-        76.6±4μs       25.7±0.5μs     0.34  bench_ufunc.MinMax.time_minimum('int32')
-        77.9±4μs       25.2±0.7μs     0.32  bench_ufunc.MinMax.time_fmin('int32')
-        98.2±2μs         31.7±1μs     0.32  bench_ufunc.MinMax.time_minimum('float32')
-        98.7±3μs       31.4±0.8μs     0.32  bench_ufunc.MinMax.time_fmax('float32')
-        98.8±2μs       30.3±0.8μs     0.31  bench_ufunc.MinMax.time_fmin('float32')
-        78.5±2μs       7.08±0.2μs     0.09  bench_ufunc.MinMax.time_minimum('int8')
-       95.7±20μs         8.59±1μs     0.09  bench_ufunc.MinMax.time_fmin('int8')
-      88.5±0.6μs      7.18±0.09μs     0.08  bench_ufunc.MinMax.time_maximum('int8')
-        89.7±2μs       7.11±0.3μs     0.08  bench_ufunc.MinMax.time_fmax('int8')
-        81.6±1μs       5.37±0.2μs     0.07  bench_ufunc.MinMax.time_fmax('uint8')
-      80.5±0.5μs       5.23±0.2μs     0.06  bench_ufunc.MinMax.time_maximum('uint8')
-        87.1±1μs       5.38±0.1μs     0.06  bench_ufunc.MinMax.time_minimum('uint8')
-       92.1±20μs       5.37±0.4μs     0.06  bench_ufunc.MinMax.time_fmin('uint8')

Benchmarks that have stayed the same:

       before           after         ratio
     [db5fcc8e]       [34a8d2f9]
     <maximum_speedup~1>       <maximum_speedup>
         907±20μs         975±20μs     1.07  bench_ufunc.MinMax.time_maximum('float16')
          171±7μs         183±10μs     1.07  bench_ufunc.MinMax.time_minimum('complex128')
          817±9μs         872±70μs     1.07  bench_ufunc.MinMax.time_minimum('float16')
       6.01±0.1μs       6.31±0.2μs     1.05  bench_ufunc.MinMax.time_minimum('bool')
         78.9±2μs         82.2±2μs     1.04  bench_ufunc.MinMax.time_minimum('uint64')
        861±200μs         895±20μs     1.04  bench_ufunc.MinMax.time_fmin('float16')
       11.1±0.3μs       11.5±0.8μs     1.03  bench_reduce.MinMax.time_min(<class 'numpy.float64'>)
       11.0±0.4μs       11.3±0.1μs     1.03  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
       7.55±0.2μs       7.75±0.1μs     1.03  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)
         953±10μs         969±10μs     1.02  bench_ufunc.MinMax.time_fmax('float16')
          206±5μs          207±7μs     1.01  bench_ufunc.MinMax.time_fmax('complex128')
       5.83±0.3μs       5.86±0.1μs     1.00  bench_ufunc.MinMax.time_fmax('bool')
       17.0±0.3μs       17.1±0.6μs     1.00  bench_reduce.MinMax.time_min(<class 'numpy.int64'>)
       7.80±0.4μs       7.83±0.3μs     1.00  bench_reduce.MinMax.time_max(<class 'numpy.float32'>)
       17.2±0.3μs       17.2±0.4μs     1.00  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
      5.93±0.06μs       5.90±0.2μs     1.00  bench_ufunc.MinMax.time_maximum('bool')
          176±3μs          175±8μs     0.99  bench_ufunc.MinMax.time_maximum('complex128')
       6.63±0.8μs       6.44±0.1μs     0.97  bench_ufunc.MinMax.time_fmin('bool')
          206±4μs          198±3μs     0.96  bench_ufunc.MinMax.time_fmin('complex128')
         83.7±6μs         80.0±2μs     0.96  bench_ufunc.MinMax.time_fmin('uint64')
       72.4±0.8μs         68.3±2μs     0.94  bench_ufunc.MinMax.time_minimum('int64')
         73.9±2μs         69.2±2μs     0.94  bench_ufunc.MinMax.time_fmax('int64')
         76.4±1μs         71.2±2μs     0.93  bench_ufunc.MinMax.time_maximum('int64')
       76.4±0.8μs         70.0±5μs     0.92  bench_ufunc.MinMax.time_fmin('int64')
         73.5±2μs         66.9±2μs     0.91  bench_ufunc.MinMax.time_fmax('uint64')
       77.8±0.6μs         69.9±2μs    ~0.90  bench_ufunc.MinMax.time_maximum('uint64')
         94.1±5μs         82.4±8μs    ~0.88  bench_ufunc.MinMax.time_fmin('timedelta64[ns]')
         98.1±3μs         82.8±5μs    ~0.84  bench_ufunc.MinMax.time_fmax('datetime64[ns]')
         84.9±3μs         66.2±6μs    ~0.78  bench_ufunc.MinMax.time_fmin('float64')
          100±2μs         74.7±4μs    ~0.74  bench_ufunc.MinMax.time_maximum('datetime64[ns]')

Please note that this does not impact the more common np.min(arr) reduction call - I will have a separate PR addressing those functions in the near future.
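For clarity, the two call forms can be distinguished with a small example (np.min is the reduction, equivalent to np.minimum.reduce; only the binary elementwise path is touched here):

```python
import numpy as np

a = np.array([3, 1, 4])
b = np.array([2, 7, 1])

# Elementwise binary ufunc: one output element per input pair.
# This is the loop this PR speeds up.
elementwise = np.minimum(a, b)
print(elementwise)

# Reduction over a single array: a different inner loop, untouched here.
reduction = np.min(a)
print(reduction)
```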

@eric-wieser eric-wieser changed the title PERF: speed up np.max/np.min by up to 15x when operating on two inputs PERF: speed up np.maximum/np.minimum by up to 15x when operating on two inputs Mar 29, 2019
@eric-wieser (Member):

I adjusted the title: this specifically targets np.maximum, not np.max (== np.maximum.reduce).

@charris charris changed the title PERF: speed up np.maximum/np.minimum by up to 15x when operating on two inputs MAINT: Speed up np.maximum/np.minimum by up to 15x when operating on two inputs Apr 3, 2019
@mattip (Member) commented Apr 18, 2019

Using NPY_GCC_OPT_3 seems like a win. There is a remark about not using it too much since it can lead to code bloat. Any opinions on the cost of it here?

@qwhelan (Contributor, author) commented Apr 18, 2019

I can do an actual comparison sometime tomorrow, but a similar change to np.abs() increased loops.o by 0.16% (#13271). This will likely be a bit more due to more types and logic.

@qwhelan (Contributor, author) commented Apr 22, 2019

Unfortunately, BINARY_LOOP_FAST causes quite a bit more bloat (~7%) than the example I gave. Most of the macro deals with potential aliasing and scalar-comparison cases; the bloat could be cut in half or more if those cases were dropped for these functions. That said, this increase seems a bit much for np.maximum, given that the reduction version is far more common.
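As a rough illustration of why the macro generates so much code, its case split can be sketched in Python pseudocode (a loose sketch with hypothetical names; the real macro operates on raw pointers and strides in C, and each branch expands to a separately compiled loop):

```python
def binary_loop_fast_sketch(a, b, out, op):
    """Rough Python sketch of the case analysis in BINARY_LOOP_FAST.

    Each branch corresponds to a separate specialized C loop, which is
    where the code-size bloat comes from.
    """
    n = len(out)
    if len(a) == 1:            # in1 is a stride-0 scalar broadcast
        a0 = a[0]
        for i in range(n):     # simple loop the compiler can vectorize
            out[i] = op(a0, b[i])
    elif len(b) == 1:          # in2 is a stride-0 scalar broadcast
        b0 = b[0]
        for i in range(n):
            out[i] = op(a[i], b0)
    else:                      # general elementwise case
        for i in range(n):
            out[i] = op(a[i], b[i])
    return out
```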

@mattip (Member) commented Apr 22, 2019

On my machine (Ubuntu 18.04 x64) the .so size grows by 2.4% to 16MB.

    }
}
BINARY_LOOP_FAST(@type@, @type@,
    if (in1 == NPY_DATETIME_NAT) {
Contributor review comment:

I don't think we need a fast loop for datetime. Does the compiler vectorize this?

Contributor (author) reply:

Nope, with gcc 7.3 and -ftree-vectorize -fopt-info-vec-missed you get:

numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: control flow in loop.
numpy/core/src/umath/loops.c.src:1225:5: note: bad loop form.
numpy/core/src/umath/loops.c.src:1225:5: note: not consecutive access n_21 = *dimensions_20(D);
numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: no grouped stores in basic block.
numpy/core/src/umath/loops.c.src:1225:5: note: not vectorized: not enough data-refs in basic block.
numpy/core/src/umath/loops.c.src:1228:12: note: not consecutive access in1_22 = MEM[(npy_datetime *)ip1_32];
numpy/core/src/umath/loops.c.src:1228:12: note: not consecutive access in2_23 = MEM[(npy_datetime *)ip2_31];
numpy/core/src/umath/loops.c.src:1228:12: note: not vectorized: no grouped stores in basic block.
numpy/core/src/umath/loops.c.src:1229:36: note: not vectorized: not enough data-refs in basic block.

Also, the 25% speedup in the asv benchmarks requires both -O3 and the fast loop.
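The NAT branch is exactly the control flow the vectorizer complains about; its behavioral effect can be seen from Python (in recent NumPy versions, NaT propagates through np.minimum, while np.fmin skips it, mirroring the NaN handling of fmin on floats — the dates below are arbitrary example values):

```python
import numpy as np

t = np.array(["2019-03-29", "NaT"], dtype="datetime64[D]")
u = np.array(["2019-04-03", "2019-04-03"], dtype="datetime64[D]")

# np.minimum propagates NaT -- hence the `in1 == NPY_DATETIME_NAT` check...
m = np.minimum(t, u)
print(m)

# ...while np.fmin skips NaT, analogous to fmin ignoring NaN for floats.
f = np.fmin(t, u)
print(f)
```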

        const npy_half in2 = *(npy_half *)ip2;
        *((npy_half *)op1) = (@OP@(in1, in2) || npy_half_isnan(in1)) ? in1 : in2;
    }
    BINARY_LOOP_FAST(npy_half, npy_half,
Contributor review comment:

Similar to datetime, performance of float16 is probably not a big concern: it is an input/output format, and computations should be done on data cast to float32.
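The usage pattern the reviewer describes — treat float16 as a storage format and compute after casting up — might look like this (a generic illustration with made-up values, not NumPy internals):

```python
import numpy as np

# float16 arrays as the storage format.
a = np.array([1.5, 2.0], dtype=np.float16)
b = np.array([1.0, 3.0], dtype=np.float16)

# Compute in float32, then cast the result back down for storage.
res = np.minimum(a.astype(np.float32), b.astype(np.float32)).astype(np.float16)
print(res)
```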

Contributor (author) reply:

Thanks, removed

@juliantaylor (Contributor) commented Apr 24, 2019

The fast loop was not applied to many of these because it seemed like unnecessary bloat for little-used functions or types.

That said, I have no issue with bloating a bit for faster integer and float operations; in the end it is mostly just size on disk and somewhat more download bandwidth for distribution.
Datetime and half floats IMO do not need the fast loop, but then again those are only three types compared to the many different integers, so it probably makes no difference in bloat size and avoids having to look at them again in similar changes in the future.

@mattip (Member) commented Apr 25, 2019

So it sounds like this PR is almost ready to go. @qwhelan, could you address the indentation and exclude datetime/float16?

@qwhelan (Contributor, author) commented Apr 29, 2019

@mattip Done

@mattip (Member) commented Apr 29, 2019

LGTM. I guess this should get a mention in the Improvements section of the release notes; we have already noted numpy.exp and numpy.log, so putting it next to those seems appropriate.

@rgommers (Member):

Can we conclude on the bloat discussion first? Can one of you summarize the current state?

@mattip (Member) commented Apr 29, 2019

On my machine using gcc-7, the change from the commit just before this one to 8b5196d is
17192648 bytes -> 17611176 bytes, a change of 408kB out of 16789kB, or around 2.4%. The mailing list discussion did not provide a hard criterion for a decision, although the general direction was that disk size does not matter as much as download size.

@rgommers (Member) commented Apr 30, 2019

Thanks @mattip. My take on this is:

  • This case is borderline. If we wrote down a hard criterion, this likely would not meet it.
  • Rationale: if we got 100 PRs like this, the average performance of NumPy for a user would not change much; however, we would by then have blown up the size NumPy takes up (disk/RAM/download/etc.) by a factor of ~2.4.
  • However, we won't get 100 PRs like this, so judging this based on such a criterion isn't quite right.
  • We have this PR now, and it's good to go. Presumably it helps @qwhelan significantly. So I'm +0.5 for merging it.
  • Also note that Cython has the same problem: taking one function and putting it in a .pyx file adds a huge amount of bloat (example: scipy.ndimage.label). We had the same discussion there, but it never became a practical issue because there were not many other PRs like that.

tl;dr: let's merge this, and let's try not to make these kinds of changes a habit.

EDIT: will send this to the list as well

@mattip (Member) commented Apr 30, 2019

Perhaps we should enable this only for int8.

Looking at the benchmark results, the win for 32-bit float and int is 3x, whereas the win for int8 is 10-15x. 16-bit int does not show up - was it unchanged, or did I miss it?

In any case, let's hold off on merging this till we see how enabling the fast loop for np.maximum.reduce affects benchmarks and code bloat.

Base automatically changed from master to main March 4, 2021 02:04

class MinMax(UFuncBenchmark):
    def time_fmax(self, dtype):
        np.fmax(self.arr, self.arr)
Contributor review comment:

Having spent too much time understanding benchmarks, I would suggest preallocating and prefaulting your output arrays.

    param_names = ['dtype']

    def setup(self, dtype):
        self.arr = np.full(100000, 1, dtype=dtype)
Contributor review comment:

Suggested change:
  self.arr = np.full(100000, 1, dtype=dtype)
+ self.arr_2 = np.full(100000, 1, dtype=dtype)
+ self.out = np.full(100000, 1, dtype=dtype)

        np.fmax(self.arr, self.arr)

    def time_fmin(self, dtype):
        np.fmin(self.arr, self.arr)
Contributor review comment:

Suggested change:
- np.fmin(self.arr, self.arr)
+ np.fmin(self.arr, self.arr_2, out=self.out)

Using arr as both inputs is something of an edge case. Using the out parameter ensures you aren't allocating a new array (which can be slow or fast depending on the previous state of the RAM).
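Putting the two suggestions together, the benchmark class might look like this (an illustrative asv-style sketch; the UFuncBenchmark base class is omitted here to keep it self-contained):

```python
import numpy as np

class MinMax:
    """Illustrative asv-style benchmark with preallocated, prefaulted output."""
    param_names = ['dtype']

    def setup(self, dtype):
        n = 100000
        # np.full writes every element, so all three arrays are faulted in
        # (pages resident in RAM) before the timed section runs.
        self.arr = np.full(n, 1, dtype=dtype)
        self.arr_2 = np.full(n, 1, dtype=dtype)
        self.out = np.full(n, 1, dtype=dtype)

    def time_fmin(self, dtype):
        # out= avoids allocating a fresh result array on every call.
        np.fmin(self.arr, self.arr_2, out=self.out)
```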

@charris (Member) commented Jun 10, 2022

@qwhelan Needs a rebase. I'm wondering if these functions have been vectorized (SIMD). @seberg Do you recall?

@charris charris added the triage review Issue/PR to be discussed at the next triage meeting label Jun 10, 2022
@qwhelan (Contributor, author) commented Jun 10, 2022 via email

@seberg (Member) commented Jun 10, 2022

Yeah, we have a whole file for it here: https://github.com/numpy/numpy/blob/49c560c22f137907ea6a240591e49b004f28444b/numpy/core/src/umath/loops_minmax.dispatch.c.src

That said, I guess there are a couple of places here that could still use the attribute, for example the datetime (and maybe half) loops.

There may be a general question of whether we should be adding such optimization attributes by default to most things in loops.c.src.

@mattip mattip removed the triage review Issue/PR to be discussed at the next triage meeting label Jun 15, 2022
@mattip (Member) commented Jun 15, 2022

I will close this. Thanks @qwhelan for showing the potential of SIMD compilation.

@mattip mattip closed this Jun 15, 2022
8 participants