BUG: min/max is slow, re-implement using NEON (#17989) #20131

Merged (10 commits) on Jan 11, 2022

Contributor

@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering commented Oct 18, 2021

This fixes #17989 by adding ARM NEON implementations for min/max and fmin/fmax.

Before: Rosetta faster than native arm64 by 1.2x - 8.6x.
After: Native arm64 faster than Rosetta by 1.6x - 6.7x. (2.8x - 15.5x improvement)

Benchmarks

       before           after         ratio
     [b0e1a445]       [8301ffd7]
     <main>           <gh-issue-17989/improve-neon-min-max>
+     32.6±0.04μs      37.5±0.08μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 1, 'd')
+     32.6±0.06μs      37.5±0.04μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 1, 'd')
+     37.8±0.09μs      43.2±0.09μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'f')
+     37.7±0.09μs       42.9±0.1μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 2, 'd')
+      37.9±0.2μs      43.0±0.02μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 2, 'd')
+     37.7±0.01μs         42.3±1μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 2, 2, 'd')
+     34.2±0.07μs      38.1±0.05μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 2, 'f')
+     32.6±0.03μs      35.8±0.04μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 1, 'f')
+      37.1±0.1μs       40.3±0.1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 2, 'd')
+      37.2±0.1μs      40.3±0.04μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 4, 'f')
+     37.1±0.09μs      40.3±0.07μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 1, 2, 'd')
+      68.6±0.5μs       74.2±0.3μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'd')
+      37.1±0.2μs       40.0±0.1μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 1, 2, 'd')
+        2.42±0μs      2.61±0.05μs     1.08  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int16'>)
+      69.1±0.7μs       73.5±0.7μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 4, 4, 'd')
+      54.7±0.3μs       58.0±0.2μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 4, 'd')
+      54.5±0.2μs       57.8±0.2μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 2, 4, 'd')
+     3.78±0.04μs      4.00±0.02μs     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'str'>)
+      54.8±0.2μs       57.9±0.3μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 4, 'd')
+     3.68±0.01μs      3.87±0.02μs     1.05  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'object'>)
+      69.6±0.2μs       73.1±0.2μs     1.05  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'd')
+         229±2μs        241±0.2μs     1.05  bench_random.Bounded.time_bounded('PCG64', [<class 'numpy.uint64'>, 1535])
-      73.0±0.8μs       69.5±0.2μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 4, 'd')
-      37.6±0.1μs       35.7±0.3μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 1, 4, 'f')
-     88.7±0.04μs       84.2±0.7μs     0.95  bench_lib.Pad.time_pad((256, 128, 1), 1, 'wrap')
-      57.9±0.2μs       54.8±0.2μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 4, 'd')
-      39.9±0.2μs      37.2±0.04μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 2, 'd')
-     2.66±0.01μs      2.47±0.01μs     0.93  bench_lib.Nan.time_nanmin(200, 0)
-     2.65±0.02μs      2.46±0.04μs     0.93  bench_lib.Nan.time_nanmin(200, 50.0)
-     2.64±0.01μs      2.45±0.01μs     0.93  bench_lib.Nan.time_nanmax(200, 90.0)
-        2.64±0μs      2.44±0.02μs     0.92  bench_lib.Nan.time_nanmax(200, 0)
-     2.68±0.02μs         2.48±0μs     0.92  bench_lib.Nan.time_nanmax(200, 2.0)
-     40.2±0.01μs       37.1±0.1μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 4, 'f')
-        2.69±0μs         2.47±0μs     0.92  bench_lib.Nan.time_nanmin(200, 2.0)
-     2.70±0.02μs      2.48±0.02μs     0.92  bench_lib.Nan.time_nanmax(200, 0.1)
-        2.70±0μs         2.47±0μs     0.91  bench_lib.Nan.time_nanmin(200, 90.0)
-        2.70±0μs         2.46±0μs     0.91  bench_lib.Nan.time_nanmin(200, 0.1)
-        2.70±0μs      2.42±0.01μs     0.90  bench_lib.Nan.time_nanmax(200, 50.0)
-      11.8±0.6ms       10.6±0.6ms     0.89  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'str'>)
-      42.7±0.1μs      37.8±0.02μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 2, 'd')
-     42.8±0.03μs       37.8±0.2μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 2, 'd')
-      43.1±0.2μs      37.7±0.09μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'f')
-     37.5±0.07μs      32.6±0.06μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 1, 'd')
-     41.7±0.03μs      36.3±0.07μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 4, 'd')
-       166±0.8μs          144±1μs     0.87  bench_ufunc.UFunc.time_ufunc_types('fmin')
-      11.6±0.8ms      10.0±0.01ms     0.87  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'str'>)
-       167±0.9μs          144±2μs     0.86  bench_ufunc.UFunc.time_ufunc_types('minimum')
-         168±4μs        143±0.5μs     0.85  bench_ufunc.UFunc.time_ufunc_types('fmax')
-         167±1μs        142±0.8μs     0.85  bench_ufunc.UFunc.time_ufunc_types('maximum')
-        7.10±0μs      4.97±0.01μs     0.70  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 2)
-     7.11±0.07μs      4.96±0.01μs     0.70  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 2)
-     7.05±0.07μs         4.68±0μs     0.66  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 4)
-        7.13±0μs      4.68±0.01μs     0.66  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 4)
-       461±0.2μs          297±7μs     0.64  bench_app.MaxesOfDots.time_it
-     7.04±0.07μs         3.95±0μs     0.56  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 2)
-     7.06±0.06μs      3.95±0.01μs     0.56  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 2)
-     7.09±0.06μs         3.24±0μs     0.46  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 1)
-     7.12±0.07μs      3.25±0.02μs     0.46  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 1)
-     14.5±0.02μs         3.98±0μs     0.27  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
-      14.6±0.1μs      4.00±0.01μs     0.27  bench_reduce.MinMax.time_min(<class 'numpy.int64'>)
-     6.88±0.06μs         1.34±0μs     0.19  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 1)
-        7.00±0μs         1.33±0μs     0.19  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 1)
-     39.4±0.01μs      3.95±0.01μs     0.10  bench_reduce.MinMax.time_min(<class 'numpy.float64'>)
-     39.4±0.01μs      3.95±0.02μs     0.10  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
-      254±0.02μs       22.8±0.2μs     0.09  bench_lib.Nan.time_nanmax(200000, 50.0)
-       253±0.1μs       22.7±0.1μs     0.09  bench_lib.Nan.time_nanmin(200000, 0)
-      254±0.06μs      22.7±0.09μs     0.09  bench_lib.Nan.time_nanmin(200000, 2.0)
-      254±0.01μs      22.7±0.03μs     0.09  bench_lib.Nan.time_nanmin(200000, 0.1)
-      254±0.04μs      22.7±0.02μs     0.09  bench_lib.Nan.time_nanmin(200000, 50.0)
-       253±0.1μs      22.7±0.04μs     0.09  bench_lib.Nan.time_nanmax(200000, 0.1)
-      253±0.03μs      22.7±0.04μs     0.09  bench_lib.Nan.time_nanmin(200000, 90.0)
-      253±0.02μs      22.7±0.07μs     0.09  bench_lib.Nan.time_nanmax(200000, 0)
-      254±0.03μs      22.7±0.02μs     0.09  bench_lib.Nan.time_nanmax(200000, 90.0)
-      254±0.09μs      22.7±0.04μs     0.09  bench_lib.Nan.time_nanmax(200000, 2.0)
-     39.2±0.01μs      2.51±0.01μs     0.06  bench_reduce.MinMax.time_max(<class 'numpy.float32'>)
-     39.2±0.01μs      2.50±0.01μs     0.06  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)

Size change of _multiarray_umath.cpython-39-darwin.so:

Before: 3,890,723
After: 3,924,035
Change: +33,312 (~ +0.856 %)
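
As a reading aid: the `ratio` column is `after / before`, so the speedup is its reciprocal. The sketch below recomputes a few speedups and the size change from the figures quoted above. The helper name `speedup` and the shortened row labels are only for illustration.

```python
# Figures copied from the benchmark table and the size report above.
rows = {
    "MinMax.time_max(float32)": (39.2, 2.51),   # (before in µs, after in µs)
    "MinMax.time_max(float64)": (39.4, 3.95),
    "MinMax.time_max(int64)": (14.5, 3.98),
    "Nan.time_nanmax(200000, 50.0)": (254.0, 22.8),
}


def speedup(before, after):
    """How many times faster 'after' is; the reciprocal of the ratio column."""
    return before / after


for name, (before, after) in rows.items():
    ratio = after / before
    print(f"{name}: ratio {ratio:.2f}, speedup {speedup(before, after):.1f}x")

# Size change of the extension module, in bytes.
size_before, size_after = 3_890_723, 3_924_035
growth = size_after - size_before
print(f"size change: +{growth:,} bytes ({100 * growth / size_before:.3f} %)")
```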

@Developer-Ecosystem-Engineering changed the title from "BUG: NEON min/max is slow (#17989)" to "BUG: min/max is slow, re-implement using NEON (#17989)" on Oct 18, 2021
@charris
Member

charris commented Oct 19, 2021

OT: @Developer-Ecosystem-Engineering I notice that your previous commit messages lack line breaks, which does not work well with text meant to be read in a terminal. Going forward it would be good to use hard line breaks. Depending on your workflow, it should be possible for a commit to bring up an editor configured for commit messages, which should take care of that.

@seberg added the "component: SIMD" label (Issues in SIMD (fast instruction sets) code or machinery) on Oct 20, 2021
Member

@rgommers rgommers left a comment

Thanks @Developer-Ecosystem-Engineering! This looks pretty clean and the benchmark and size change numbers are convincing.

There's a lot of code here and this is not my area of expertise, so I hope @seiko2plus, @Qiyu8, @ganesh-k13 or someone else can review in detail and see if this is the right approach.

*/

// Implementation below assumes longlong and ulonglong are 64-bit.
#if @HAVE_NEON_IMPL@
Member

This implementation uses the universal intrinsics framework, but it has to be specific to Neon anyway because of the "longlong is 64-bit" assumption? It's true for some other platforms as well, at least MSVC comes to mind.

Contributor Author

> it has to be specific to Neon anyway because of the "longlong is 64-bit" assumption?

Yes, that's correct.
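
A quick way to check the "longlong is 64-bit" assumption on a given machine is to ask the C ABI directly. The sketch below uses Python's `ctypes`. It only reports the widths on the platform it runs on and says nothing about how the NumPy build itself guards the assumption.

```python
import ctypes

# Width in bits of the C `long long` types on the platform running this script.
longlong_bits = 8 * ctypes.sizeof(ctypes.c_longlong)
ulonglong_bits = 8 * ctypes.sizeof(ctypes.c_ulonglong)

# For contrast: `long` is 32 bits on 64-bit Windows (MSVC) and 64 bits on
# most 64-bit Unix-like systems, so it is the less portable of the two.
long_bits = 8 * ctypes.sizeof(ctypes.c_long)

print(f"long long: {longlong_bits} bits, unsigned long long: {ulonglong_bits} bits")
print(f"long: {long_bits} bits")
```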

@mattip
Member

mattip commented Oct 27, 2021

@seiko2plus, @Qiyu8, @ganesh-k13 could you take a look?

It seems a future PR could extend this to other architectures. Do the npyv_reduce_max_u8 and friends' naming scheme match with what we would want?

@ganesh-k13
Member

Still a rookie here in SIMD :). One thing I noticed is the new file loops_minmax.dispatch.c.src. Since it seems to use only integers, can we use loops_arithmetic.dispatch.c.src itself?

Member

@seiko2plus seiko2plus left a comment

This patch requires the modifications mentioned in the following suggestions to be compatible with all architectures. The proposed modifications do not require you to add or support any special intrinsics for these architectures, only to redesign the implementation so that it is friendlier to them.
However, once you are done with these changes, I will follow up with tweaks, cleanup, and benchmarks, since we already have a raw SIMD implementation of max and min in simd.inc.

Resolved review threads (outdated) on:
- numpy/core/src/common/simd/neon/math.h (2)
- numpy/core/src/umath/loops.c.src (2)
- numpy/core/code_generators/generate_umath.py
- numpy/core/src/umath/loops.h.src (2)
- numpy/core/src/umath/loops_minmax.dispatch.c.src

static inline npy_intp
simd_reduce_@TYPE@_@kind@(char **args, npy_intp const *dimensions, npy_intp const *steps, npy_intp i)
{
Member

@seiko2plus seiko2plus Nov 1, 2021

Here is a fact: any ufunc's reduction has an identity, which is the value of the first element of the second input array. @seberg, please correct me if I'm wrong. That means the input and output arrays can't be empty; the length is always greater than zero.

In light of the above, your SIMD kernel should use this identity value as the default value for the final accumulator vector, so that we can loop vector by vector when dealing with small arrays or with SIMD widths larger than 128 bits. See the following example:

```c
// any ufunc reduce always has an identity
// which is the value of the first element of the second input array
assert(len > 0);
const int nlanes = npyv_nlanes_@sfx@;
npyv_@sfx@ acc = npyv_setall_@sfx@(op1[0]); // final accumulator
for (; len >= nlanes*8; len -= nlanes*8, ip += nlanes*8) {
    // unroll goes here
}
for (; len >= nlanes; len -= nlanes, ip += nlanes) {
    acc = npyv_@vop@_@sfx@(npyv_load_@sfx@(ip), acc);
}
npyv_lanetype_@sfx@ r = npyv_reduce_@vop@_@sfx@(acc);
for (; len > 0; --len, ++ip) {
    const npyv_lanetype_@sfx@ a = *ip;
    r = SCALAR_OP(r, a);
}
*op1 = r;
```
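
The loop structure suggested above can be mimicked in plain Python, with lists standing in for SIMD vectors. This is only an illustration of the control flow (broadcast seed, vector body, horizontal reduce, scalar tail). The function name `simd_style_reduce` and the default `nlanes=4` are made up for the example, and the unrolled loop is left out.

```python
from functools import reduce


def simd_style_reduce(values, initial, op, nlanes=4):
    """Emulate the vector-by-vector reduction with lists as 'vectors'.

    `initial` plays the role of op1[0]; `op` is a scalar max or min.
    """
    n = len(values)
    acc = [initial] * nlanes                  # like setall: broadcast the seed
    i = 0
    # Vector body: combine one full vector of input with the accumulator.
    while n - i >= nlanes:
        chunk = values[i:i + nlanes]          # like a vector load
        acc = [op(c, a) for c, a in zip(chunk, acc)]
        i += nlanes
    r = reduce(op, acc)                       # horizontal reduce across lanes
    # Scalar tail: fewer than nlanes elements remain.
    for v in values[i:]:
        r = op(r, v)
    return r


data = [3, -7, 12, 5, 9, 0, -1, 4, 8, 11]
print(simd_style_reduce(data, data[0], max))  # 12
print(simd_style_reduce(data, data[0], min))  # -7
```

Because the accumulator starts from the seed rather than from a loaded vector, the same code handles inputs shorter than one vector: the vector body simply does not run and the tail loop does all the work.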

Member

Actually... no, I do not think you can. Reading the 0th element is not actually well defined currently: the macro checking for a "reduction" doesn't ensure that it actually is a reduction (which indeed would be guaranteed to have one element), only that it effectively is one.

This is super obscure though, and pretty much implausible to hit (if you are in the "reduce" branch). So the fix may be that NumPy should guarantee to never call an inner loop with N == 0 (*dimensions == 0)?
(This might be one of the reasons NumPy historically "over-allocates" empty arrays, but that habit is only one of those 80% band-aid fixes...)

Member

Thanks for the clarification; then we have to test empty arrays.
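
A minimal sketch of the N == 0 concern and the kind of test it calls for, in plain Python rather than the C inner loop. `reduce_inner_loop` is a hypothetical stand-in: `out` plays the role of the in/out operand, and the loop must not touch it when there is nothing to reduce.

```python
def reduce_inner_loop(out, values, op):
    """Hypothetical inner loop: out[0] is both the seed and the result.

    With no input elements the function returns before reading out[0],
    so it stays safe even when that slot does not exist.
    """
    if len(values) == 0:
        return
    r = out[0]
    for v in values:
        r = op(r, v)
    out[0] = r


# Regular call: the accumulator already holds the first element.
out = [5]
reduce_inner_loop(out, [2, 9, 4], max)
print(out[0])  # 9

# Empty call: zero-size storage must be neither read nor written.
empty_out = []
reduce_inner_loop(empty_out, [], max)
print(empty_out)  # []
```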

@seiko2plus
Member

@ganesh-k13, separating the SIMD kernels into multiple files can increase the readability, speed up the build and reduce the binary size, so I don't see any issue with having a new dispatch-able source.

@seiko2plus
Member

seiko2plus commented Nov 1, 2021

@mattip,

> Do the npyv_reduce_max_u8 and friends' naming scheme match with what we would want?

Yes, but the contributor should emulate any missing intrinsics inside the dispatchable source using universal intrinsics, so that we can later tweak them and move them to the main interface.

@seiko2plus
Member

seiko2plus commented Nov 1, 2021

Our docs don't explain the behavior of the sign of zero for max/fmax/min/fmin, any ideas? Should we unify the behavior across all architectures, leave it as is, or let it depend on the behavior of the native instructions?
EDIT: forgot to type "or" :).

@charris
Member

charris commented Nov 1, 2021

> Our docs don't explain the behavior of the sign of zero for max/fmax/min/fmin, any ideas?

IIRC, both -0.0 and 0.0 are treated as 0.0 in comparisons, so I guess the question is what should be returned if either is a max or min.
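
The point about comparisons can be checked in plain Python, whose floats are IEEE 754 doubles on common platforms: the two zeros compare equal, so a comparison alone cannot pick between them, and the sign has to be inspected separately. The `first_wins` helper is just an illustration of a comparison-based max, not NumPy code.

```python
import math

pos, neg = 0.0, -0.0

print(pos == neg)             # True: the comparison treats them as equal
print(pos < neg, neg < pos)   # False False: neither sorts first
print(math.copysign(1.0, pos), math.copysign(1.0, neg))  # 1.0 -1.0


def first_wins(a, b):
    """Comparison-based max: on a tie the first operand is returned."""
    return a if a >= b else b


# Which zero comes back depends only on the argument order.
print(first_wins(pos, neg), first_wins(neg, pos))  # 0.0 -0.0
```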

@seberg
Member

seberg commented Nov 1, 2021

I don't think we currently have any guarantees, and C99 seems to say that it is undefined for fmin (implementations may sort -0.0 first, but are not required to). So to me, sorting -0.0 first seems nice, but it is nothing we currently guarantee.

@seiko2plus
Member

We are not using C99 fmax/fmin. According to our code in loops.c.src, we already have a unified behavior, which is simply to treat positive and negative zero as equivalent and give priority to the first operand (except when AVX512 is enabled, where priority goes to the second operand).

If you execute the following code:

```python
import numpy as np
from numpy.core import _simd as simd
from numpy.core._multiarray_umath import __cpu_baseline__ as cpu_baseline

# Interleave +0.0 and -0.0 so that each lane compares zeros of opposite sign.
zp = np.array([ 0.], dtype=np.float32).repeat(8)
zn = np.array([-0.], dtype=np.float32).repeat(8)
a = np.ravel(np.column_stack((zp, zn)))
b = np.ravel(np.column_stack((zn, zp)))
reduce = np.concatenate((a, b), axis=None).repeat(2)

print("Operands:")
print("\tfirst:", a)
print("\tsecond:", b)
print("\treduce:", reduce)

# What the NumPy ufuncs return on this build.
print("\nNumPy behavior:")
print(" CPU Features:", np.lib.utils._opt_info())
print("\tnp.minimum:", np.minimum(a, b))
print("\tnp.maximum:", np.maximum(a, b))
print("\tnp.fmin:", np.fmin(a, b))
print("\tnp.fmax:", np.fmax(a, b))
print("\tnp.minimum.reduce:", np.minimum.reduce(reduce))
print("\tnp.maximum.reduce:", np.maximum.reduce(reduce))
print("\tnp.fmin.reduce:", np.fmin.reduce(reduce))
print("\tnp.fmax.reduce:", np.fmax.reduce(reduce))

# What the native min/max of each enabled SIMD target return.
print("\nHW behavior:")
for k, v in simd.targets.items():
    if k == "baseline":
        fname = f"baseline({', '.join(cpu_baseline)})"
    else:
        fname = k.split('__') # multi-target
        fname = ', '.join(fname)
    if not v:
        print(f"\tescape target {fname}, not supported by current CPU")
        continue
    print(f"\tWith {fname} enabled:")
    print("\t\tmin:", v.min_f32(v.load_f32(a), v.load_f32(b)))
    print("\t\tmax:", v.max_f32(v.load_f32(a), v.load_f32(b)))
```

The output

On x86
Operands:
	first: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	second: [-0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0.]
	reduce: [ 0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0.
 -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0. -0. -0.  0.  0.
 -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.
  0.  0. -0. -0.  0.  0. -0. -0.  0.  0.]

NumPy behavior:
 CPU Features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CLX? AVX512_CNL? AVX512_ICL?
	np.minimum: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.maximum: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.fmin: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.fmax: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.minimum.reduce: 0.0
	np.maximum.reduce: 0.0
	np.fmin.reduce: 0.0
	np.fmax.reduce: 0.0

HW behavior:
	escape target AVX512_SKX, not supported by current CPU
	escape target AVX512F, not supported by current CPU
	With FMA3, AVX2 enabled:
		min: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0]>
		max: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0]>
	With SSE42 enabled:
		min: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0]>
		max: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0]>
	With baseline(SSE, SSE2, SSE3) enabled:
		min: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0]>
		max: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0]>
On x86 (AVX512)
Operands:
	first: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	second: [-0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0.]
	reduce: [ 0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0.
 -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0. -0. -0.  0.  0.
 -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.
  0.  0. -0. -0.  0.  0. -0. -0.  0.  0.]

NumPy behavior:
 CPU Features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F* AVX512CD* AVX512_KNL? AVX512_KNM? AVX512_SKX* AVX512_CLX* AVX512_CNL* AVX512_ICL*
	np.minimum: [-0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0.]
	np.maximum: [-0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0.]
	np.fmin: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.fmax: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.minimum.reduce: 0.0
	np.maximum.reduce: 0.0
	np.fmin.reduce: 0.0
	np.fmax.reduce: 0.0

HW behavior:
	With AVX512_SKX enabled:
		min: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0]>
		max: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0]>
	With AVX512F enabled:
		min: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0]>
		max: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0]>
	With FMA3, AVX2 enabled:
		min: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0]>
		max: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0]>
	With SSE42 enabled:
		min: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0]>
		max: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0]>
	With baseline(SSE, SSE2, SSE3) enabled:
		min: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0]>
		max: <npyv_f32 of [-0.0, 0.0, -0.0, 0.0]>
On aarch64
Operands:
	first: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	second: [-0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0.]
	reduce: [ 0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0.
 -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0. -0. -0.  0.  0.
 -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.
  0.  0. -0. -0.  0.  0. -0. -0.  0.  0.]

NumPy behavior:
 CPU Features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP? ASIMDDP?
	np.minimum: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.maximum: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.fmin: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.fmax: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.minimum.reduce: 0.0
	np.maximum.reduce: 0.0
	np.fmin.reduce: 0.0
	np.fmax.reduce: 0.0

HW behavior:
	With baseline(NEON, NEON_FP16, NEON_VFPV4, ASIMD) enabled:
		min: <npyv_f32 of [-0.0, -0.0, -0.0, -0.0]>
		max: <npyv_f32 of [0.0, 0.0, 0.0, 0.0]>
On ppc64le
Operands:
	first: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	second: [-0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0.]
	reduce: [ 0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0.
 -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0. -0. -0.  0.  0.
 -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.  0.  0. -0. -0.
  0.  0. -0. -0.  0.  0. -0. -0.  0.  0.]

NumPy behavior:
 CPU Features: VSX VSX2 VSX3*
	np.minimum: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.maximum: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.fmin: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.fmax: [ 0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.  0. -0.]
	np.minimum.reduce: 0.0
	np.maximum.reduce: 0.0
	np.fmin.reduce: 0.0
	np.fmax.reduce: 0.0

HW behavior:
	With VSX3 enabled:
		min: <npyv_f32 of [-0.0, -0.0, -0.0, -0.0]>
		max: <npyv_f32 of [0.0, 0.0, 0.0, 0.0]>
	With baseline(VSX, VSX2) enabled:
		min: <npyv_f32 of [-0.0, -0.0, -0.0, -0.0]>
		max: <npyv_f32 of [0.0, 0.0, 0.0, 0.0]>

On aarch64 and ppc64le, the native max/min instructions respect the sign of zero. That brings me back to my previous question of what we are supposed to do: should we force all architectures, including AVX512F, to follow our current unified behavior, or just do what compilers do for fmax/fmin and follow the native instructions?
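
The outputs above show three tie rules for the two zeros: the NumPy ufuncs without AVX512 return the first operand, the x86 vector min/max return the second, and the aarch64 and ppc64le instructions order -0.0 below 0.0. The sketch below models the three rules for min in plain Python. The function names are made up for illustration and are not NumPy APIs, and NaN handling is left out.

```python
import math


def is_neg_zero(x):
    return x == 0.0 and math.copysign(1.0, x) < 0


def min_first_wins(a, b):
    # Tie goes to the first operand (np.minimum in the non-AVX512 outputs).
    return a if a <= b else b


def min_second_wins(a, b):
    # Tie goes to the second operand (the x86 "HW behavior" lines).
    return a if a < b else b


def min_sign_aware(a, b):
    # Orders -0.0 below 0.0 (the aarch64 and ppc64le "HW behavior" lines).
    # The extra zero check is per-element work a plain comparison avoids.
    if a == 0.0 and b == 0.0:
        return a if is_neg_zero(a) else b
    return a if a < b else b


first = [0.0, -0.0] * 2
second = [-0.0, 0.0] * 2
for fn in (min_first_wins, min_second_wins, min_sign_aware):
    print(fn.__name__, [fn(a, b) for a, b in zip(first, second)])
```

The first two rules differ only in which operand survives a tie, so they cost the same; only the sign-aware rule needs the additional check.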

@seberg
Member

seberg commented Nov 2, 2021

Does this code change the behaviour to enforcing -0.0 < 0.0 (for the purpose of sorting)? In some sense, I think that would be better and it would be weird not to allow it. There is an argument to be made that we would change min/max, but not sorting here.
I would be willing to say that enforcing -0.0 < 0.0 is OK, and if it was an easy choice to enforce it everywhere (sorting related) that would not be bad. But since the C standard does not enforce it, I am willing to bet it is not worth the trouble.
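For the sorting side of this, a rough sketch of what "enforcing -0.0 < 0.0" could look like in plain Python. The key function `signed_zero_key` is hypothetical and only illustrates the ordering; it says nothing about how NumPy's sort is implemented, and NaN handling is left out.

```python
import math


def signed_zero_key(x):
    """Sort key that breaks ties between equal values by sign bit."""
    # Tuples compare element by element: -0.0 == 0.0 ties on the first
    # element, then the sign (-1.0 or 1.0) decides the order.
    return (x, math.copysign(1.0, x))


values = [0.0, -0.0, 1.0, -1.0, 0.0, -0.0]
ordered = sorted(values, key=signed_zero_key)
print(ordered)
```

Without the key, a stable sort keeps the zeros in their input order, since they all compare equal.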

@seiko2plus
Member

> Does this code change the behaviour to enforcing -0.0 < 0.0 (for the purpose of sorting)?

Yes, but only on aarch64 and ppc64le. We could enforce it on x86 too, but that would cost extra cycles per iteration, since the native max/min operations there don't handle signed zero.

> But since the C standard does not enforce it, I am willing to bet it is not worth the trouble.

Alright, then there is no need to unify this behavior across all architectures.
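As a scalar illustration of where the extra cost would come from, here is a minimal sketch assuming a comparison-based min/max with an added sign-bit check. The function names are hypothetical, NaN handling is omitted, and this is not the code in the PR. The operands follow the `first`/`second` pattern from the output above.

```python
import math


def min_signed_zero(a, b):
    """Minimum that treats -0.0 as smaller than 0.0."""
    if a == 0.0 and b == 0.0:
        # Extra work: the comparison cannot tell the zeros apart,
        # so the sign bit has to be inspected explicitly.
        return a if math.copysign(1.0, a) < 0.0 else b
    return a if a < b else b


def max_signed_zero(a, b):
    """Maximum that treats 0.0 as larger than -0.0."""
    if a == 0.0 and b == 0.0:
        return a if math.copysign(1.0, a) > 0.0 else b
    return a if a > b else b


first = [0.0, -0.0] * 8
second = [-0.0, 0.0] * 8
mins = [min_signed_zero(a, b) for a, b in zip(first, second)]
maxs = [max_signed_zero(a, b) for a, b in zip(first, second)]
print(mins)
print(maxs)
```

With these operands every minimum is `-0.0` and every maximum is `0.0`, matching the "HW behavior" lines reported for aarch64 and ppc64le.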

@Developer-Ecosystem-Engineering
Contributor Author

Thank you all for the feedback, working on a response!

@Developer-Ecosystem-Engineering
Contributor Author

Developer-Ecosystem-Engineering commented Nov 18, 2021

Thank you @seiko2plus for the excellent example.

Reorganized code so that it can be used for other architectures. Core implementations and unroll factors should be the same as before for ARM NEON. Beyond reorganizing, we've added default implementations using universal intrinsics for non-ARM-NEON. Additionally, we've moved most min, max, fmin, fmax implementations to a new dispatchable source file: numpy/core/src/umath/loops_minmax.dispatch.c.src
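For readers less familiar with the four functions named here, a minimal scalar sketch of how the two families differ on NaN: `minimum`/`maximum` propagate NaN, while `fmin`/`fmax` ignore it when the other operand is a number. The pure-Python functions below are illustrative stand-ins, not the implementations in the new source file.

```python
import math


def minimum(a, b):
    """NaN-propagating minimum, in the spirit of np.minimum."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a < b else b


def fmin(a, b):
    """NaN-ignoring minimum, in the spirit of np.fmin and C fmin."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a < b else b


print(minimum(1.0, math.nan))  # NaN propagates
print(fmin(1.0, math.nan))     # NaN is ignored, the number is returned
```

The maximum variants follow the same pattern with the comparison reversed.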

Testing

  • Apple silicon M1 native (arm64 / aarch64) -- No test failures
  • Apple silicon M1 Rosetta (x86_64) -- No new test failures
  • iMacPro1,1 (AVX512F) -- No test failures

Benchmarks

  • Apple silicon M1 native (arm64 / aarch64)
    • Similar improvements as before reorg (comparison below)
  • x86_64 (both Apple silicon M1 Rosetta and iMacPro1,1 AVX512F)
    • Some x86_64 benchmarks are better, some are worse

Apple silicon M1 native (arm64 / aarch64) comparison to original implementation / before reorg:

M1 benchmark
       before           after         ratio
     [559ddede]       [a3463b09]
     <gh-issue-17989/improve-neon-min-max>       <gh-issue-17989/feedback/round-1>
+     6.45±0.04μs      7.07±0.09μs     1.10  bench_lib.Nan.time_nanargmin(200, 0.1)
+      32.1±0.3μs       35.2±0.2μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 1, 'd')
+     29.1±0.02μs      31.8±0.05μs     1.10  bench_core.Core.time_array_int_l1000
+      69.0±0.2μs         75.3±3μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 4, 'f')
+        92.0±1μs       99.5±0.5μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 4, 'd')
+      9.29±0.1μs       9.99±0.5μs     1.08  bench_ma.UFunc.time_1d(True, True, 10)
+       338±0.6μs         362±10μs     1.07  bench_function_base.Sort.time_sort('quick', 'int16', ('random',))
+     4.21±0.03μs       4.48±0.2μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'str'>)
+     12.3±0.06μs       13.1±0.7μs     1.06  bench_function_base.Median.time_even_small
+        1.27±0μs      1.35±0.06μs     1.06  bench_itemselection.PutMask.time_dense(False, 'float16')
+         139±1ns          147±6ns     1.06  bench_core.Core.time_array_1
+     33.7±0.01μs         35.5±2μs     1.05  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 4, 'f')
+      69.4±0.1μs       73.1±0.2μs     1.05  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 4, 'f')
+      225±0.09μs          237±9μs     1.05  bench_random.Bounded.time_bounded('PCG64', [<class 'numpy.uint32'>, 2047])
-      15.7±0.5μs      14.9±0.03μs     0.95  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int64'>)
-        34.2±2μs      32.0±0.03μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 2, 'f')
-     1.03±0.05ms          955±3μs     0.92  bench_lib.Nan.time_nanargmax(200000, 50.0)
-     6.97±0.08μs      6.43±0.02μs     0.92  bench_ma.UFunc.time_scalar(True, False, 10)
-        5.41±0μs      4.98±0.01μs     0.92  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('subtract', 2, 'F')
-     22.4±0.01μs      20.6±0.02μs     0.92  bench_core.Core.time_array_float64_l1000
-     1.51±0.01ms         1.38±0ms     0.92  bench_core.CorrConv.time_correlate(1000, 10000, 'same')
-      10.1±0.2μs      9.27±0.09μs     0.92  bench_ufunc.UFunc.time_ufunc_types('invert')
-     8.50±0.02μs      7.80±0.09μs     0.92  bench_indexing.ScalarIndexing.time_assign_cast(1)
-      29.5±0.2μs      26.6±0.03μs     0.90  bench_ma.Concatenate.time_it('masked', 100)
-     2.09±0.02ms         1.87±0ms     0.90  bench_ma.UFunc.time_2d(True, True, 1000)
-        298±10μs        267±0.3μs     0.89  bench_app.MaxesOfDots.time_it
-      10.7±0.2μs      9.60±0.02μs     0.89  bench_ma.UFunc.time_1d(True, True, 100)
-         567±3μs          505±2μs     0.89  bench_lib.Nan.time_nanargmax(200000, 90.0)
-       342±0.9μs          282±5μs     0.83  bench_lib.Nan.time_nanargmax(200000, 2.0)
-       307±0.7μs        244±0.8μs     0.80  bench_lib.Nan.time_nanargmax(200000, 0.1)
-         309±1μs        241±0.1μs     0.78  bench_lib.Nan.time_nanargmax(200000, 0)
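When reading these comparison tables, it helps to recompute the ratio (after / before) and check whether the change is larger than the reported uncertainties. The sketch below is a small helper written for that purpose. The name `summarize` and the "within noise" rule (difference no larger than the sum of the two ± values) are assumptions made for illustration, not part of the benchmark tooling. Ratios recomputed from the rounded timings can differ from the printed column in the last digit.

```python
import re

# value ± error unit; accepts both GREEK SMALL LETTER MU and MICRO SIGN.
_TIMING = re.compile(r"([\d.]+)\u00b1([\d.]+)(ns|[\u03bc\u00b5]s|ms|s)")
_SCALE = {"ns": 1e-9, "\u03bcs": 1e-6, "\u00b5s": 1e-6, "ms": 1e-3, "s": 1.0}


def summarize(line):
    """Return (ratio, within_noise) for one row of a comparison table."""
    # The first two timing fields on a row are the "before" and "after" columns.
    (b, b_err, b_unit), (a, a_err, a_unit) = _TIMING.findall(line)[:2]
    before = float(b) * _SCALE[b_unit]
    after = float(a) * _SCALE[a_unit]
    spread = float(b_err) * _SCALE[b_unit] + float(a_err) * _SCALE[a_unit]
    ratio = after / before
    # Assumed rule of thumb: a change smaller than the combined
    # uncertainty is treated as indistinguishable from noise.
    within_noise = abs(after - before) <= spread
    return ratio, within_noise


rows = [
    "+     6.45±0.04μs      7.07±0.09μs     1.10  bench_lib.Nan.time_nanargmin(200, 0.1)",
    "+     33.7±0.01μs         35.5±2μs     1.05  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 4, 'f')",
    "-         309±1μs        241±0.1μs     0.78  bench_lib.Nan.time_nanargmax(200000, 0)",
]
for row in rows:
    ratio, noisy = summarize(row)
    print(f"{ratio:.2f} within_noise={noisy}")
```

Under this rule the `reciprocal` row falls within noise, while the `nanargmin` and `nanargmax` rows do not.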
AVX512F min/max compare
       before           after         ratio
     [b0e1a445]       [f62fb2bf]
     <main>           <gh-issue-17989/feedback/round-1>
+      10.6±0.1μs        144±100μs    13.60  bench_ufunc.UFunc.time_ufunc_types('bitwise_not')
+      16.6±0.2μs        140±100μs     8.44  bench_ufunc.UFunc.time_ufunc_types('bitwise_or')
+     5.33±0.09μs       19.5±0.4μs     3.65  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 4)
+     5.13±0.04μs       17.9±0.2μs     3.50  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 2)
+     5.35±0.05μs       17.0±0.9μs     3.18  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 4)
+      5.23±0.1μs       15.1±0.2μs     2.88  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 2)
+      6.62±0.1μs       18.7±0.4μs     2.83  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 2)
+      6.63±0.1μs       18.7±0.3μs     2.82  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 2)
+         190±2μs        516±200μs     2.72  bench_ufunc.UFunc.time_ufunc_types('negative')
+     7.49±0.09μs       19.5±0.4μs     2.61  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 4)
+     7.46±0.07μs       18.7±0.2μs     2.50  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 4)
+         122±1μs        250±100μs     2.06  bench_indexing.Indexing.time_op('indexes_', ':,I', '')
+         271±3μs        550±300μs     2.03  bench_ufunc.UFunc.time_ufunc_types('less_equal')
+        654±20μs         962±30μs     1.47  bench_ufunc.UFunc.time_ufunc_types('sqrt')
+     3.46±0.05μs      4.88±0.07μs     1.41  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 1)
+     3.47±0.04μs      4.85±0.03μs     1.40  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 1)
+         110±3μs          152±5μs     1.38  bench_function_base.Sort.time_sort('merge', 'float64', ('sorted_block', 100))
+     1.63±0.02μs       2.26±0.2μs     1.38  bench_itemselection.PutMask.time_sparse(False, 'longfloat')
+      34.6±0.4μs       46.8±0.9μs     1.35  bench_function_base.Sort.time_argsort('heap', 'float64', ('uniform',))
+     1.66±0.02μs      2.25±0.09μs     1.35  bench_itemselection.PutMask.time_sparse(False, 'complex128')
+         129±2μs          173±4μs     1.34  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 100))
+     7.67±0.06μs         10.1±1μs     1.32  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'str'>)
+      37.9±0.3μs         49.7±1μs     1.31  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)
+      75.2±0.8μs       97.6±0.8μs     1.30  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
+     1.61±0.01μs       2.09±0.1μs     1.30  bench_itemselection.PutMask.time_sparse(False, 'float16')
+      27.7±0.6μs         35.8±1μs     1.30  bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 100))
+      38.2±0.1μs       48.9±0.7μs     1.28  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
+     1.07±0.01μs      1.36±0.05μs     1.28  bench_itemselection.PutMask.time_sparse(True, 'float32')
+     1.61±0.02μs       2.05±0.1μs     1.27  bench_itemselection.PutMask.time_sparse(False, 'int16')
+     5.41±0.04μs      6.87±0.09μs     1.27  bench_core.UnpackBits.time_unpackbits_little
+      64.4±0.4μs         81.6±5μs     1.27  bench_function_base.Sort.time_sort('merge', 'float64', ('sorted_block', 1000))
+     1.10±0.03μs      1.39±0.07μs     1.26  bench_itemselection.PutMask.time_sparse(True, 'complex64')
+     1.95±0.04μs      2.45±0.07μs     1.26  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
+        71.0±2μs         89.0±1μs     1.25  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 1000))
+      28.8±0.6μs       36.0±0.7μs     1.25  bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 10))
+         120±6μs          151±6μs     1.25  bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))
+     1.08±0.02μs      1.35±0.08μs     1.25  bench_itemselection.PutMask.time_sparse(True, 'int32')
+      7.92±0.1μs         9.82±1μs     1.24  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
+     1.95±0.05ms       2.39±0.1ms     1.23  bench_linalg.Eindot.time_tensordot_a_b_axes_1_0_0_1
+     1.10±0.05μs      1.35±0.07μs     1.22  bench_itemselection.PutMask.time_sparse(True, 'float64')
+       137±0.7μs          167±6μs     1.22  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
+      62.3±0.7μs         75.4±3μs     1.21  bench_ufunc.UFunc.time_ufunc_types('signbit')
+     1.69±0.03μs      2.05±0.09μs     1.21  bench_itemselection.PutMask.time_dense(False, 'int16')
+     1.10±0.01μs       1.33±0.1μs     1.21  bench_itemselection.PutMask.time_sparse(True, 'int64')
+     1.98±0.06μs       2.38±0.1μs     1.21  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'bool'>)
+     1.78±0.02μs       2.15±0.1μs     1.20  bench_itemselection.PutMask.time_dense(False, 'complex128')
+         380±3ns          458±9ns     1.20  bench_array_coercion.ArrayCoercionSmall.time_asanyarray_dtype(1)
+        28.7±3μs       34.5±0.5μs     1.20  bench_function_base.Sort.time_sort('merge', 'int16', ('random',))
+     1.18±0.01μs      1.42±0.03μs     1.20  bench_itemselection.PutMask.time_sparse(True, 'complex128')
+      24.0±0.3μs         28.7±3μs     1.19  bench_ma.UFunc.time_scalar_1d(False, True, 100)
+      12.5±0.2ms       14.9±0.2ms     1.19  bench_lib.Unique.time_unique(200000, 2.0)
+         181±3μs          216±4μs     1.19  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 10))
+     6.48±0.09ms       7.70±0.5ms     1.19  bench_lib.Unique.time_unique(200000, 90.0)
+         168±1μs        199±100μs     1.18  bench_ufunc.UFunc.time_ufunc_types('isnan')
+         164±1μs          194±8μs     1.18  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'object'>)
+     10.3±0.09μs         12.1±1μs     1.18  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'object'>)
+         196±4μs          231±4μs     1.18  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'object'>)
+     10.9±0.04ms       12.8±0.2ms     1.18  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'object'>)
+      55.2±0.3μs         64.9±2μs     1.18  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'object'>)
+     1.53±0.02μs      1.80±0.04μs     1.18  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'bool'>)
+      10.6±0.2μs       12.5±0.1μs     1.17  bench_indexing.ScalarIndexing.time_assign_cast(0)
+     1.52±0.01μs      1.78±0.02μs     1.17  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
+      72.5±0.2μs         84.9±2μs     1.17  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'str'>)
+     1.29±0.03μs      1.51±0.06μs     1.17  bench_itemselection.PutMask.time_dense(True, 'float64')
+      11.2±0.1μs       13.1±0.3μs     1.17  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'negative'>, 1, 1, 'f')
+     4.12±0.03μs       4.80±0.2μs     1.17  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int64')
+      31.3±0.4μs         36.4±2μs     1.16  bench_function_base.Sort.time_sort('merge', 'int16', ('reversed',))
+     1.64±0.02μs      1.90±0.03μs     1.16  bench_core.Core.time_ones_100
+     1.27±0.04μs      1.47±0.09μs     1.16  bench_itemselection.PutMask.time_dense(True, 'int64')
+        64.2±2μs         74.3±2μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 4, 'f')
+     5.54±0.09μs       6.41±0.4μs     1.16  bench_indexing.ScalarIndexing.time_index(0)
+     16.3±0.06ms       18.8±0.5ms     1.15  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'object'>)
+      12.2±0.1ms         14.1±1ms     1.15  bench_lib.Unique.time_unique(200000, 0)
+        626±20ns         723±30ns     1.15  bench_array_coercion.ArrayCoercionSmall.time_asanyarray([1])
+       109±0.5μs          125±2μs     1.15  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'object'>)
+      99.6±0.3μs          115±5μs     1.15  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
+         132±2μs          152±3μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 1, 'd')
+     4.07±0.02μs       4.68±0.2μs     1.15  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'longfloat')
+     1.88±0.02μs       2.16±0.1μs     1.15  bench_itemselection.PutMask.time_sparse(False, 'int64')
+     1.30±0.01μs      1.49±0.06μs     1.15  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.])))
+         384±1ns         440±10ns     1.15  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(1)
+        368±10ns         422±10ns     1.15  bench_array_coercion.ArrayCoercionSmall.time_array_subok(1)
+         494±7μs         566±20μs     1.15  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
+      12.4±0.2ms       14.2±0.9ms     1.15  bench_lib.Unique.time_unique(200000, 0.1)
+     1.20±0.02μs      1.37±0.06μs     1.14  bench_itemselection.PutMask.time_sparse(True, 'longfloat')
+     1.31±0.01μs      1.50±0.03μs     1.14  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0, None))
+         138±1μs          158±3μs     1.14  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'bool'>)
+     1.87±0.06μs       2.14±0.1μs     1.14  bench_itemselection.PutMask.time_sparse(False, 'complex64')
+    10.00±0.06ms       11.4±0.6ms     1.14  bench_lib.Unique.time_unique(200000, 50.0)
+         390±2ns          445±6ns     1.14  bench_array_coercion.ArrayCoercionSmall.time_asanyarray_dtype(5)
+     1.89±0.04μs       2.15±0.1μs     1.14  bench_itemselection.PutMask.time_sparse(False, 'float64')
+      22.1±0.2ms       25.1±0.5ms     1.14  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10000)
+     4.72±0.06μs       5.36±0.3μs     1.14  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int32')
+         336±3ns         382±10ns     1.14  bench_array_coercion.ArrayCoercionSmall.time_asanyarray(1)
+         438±5ns          497±7ns     1.14  bench_ufunc.Scalar.time_add_scalar
+      8.52±0.1μs       9.68±0.3μs     1.14  bench_indexing.ScalarIndexing.time_index(2)
+        673±10ns         764±30ns     1.13  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True))
+     5.55±0.09μs       6.29±0.4μs     1.13  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int16')
+     4.07±0.06μs       4.62±0.2μs     1.13  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'complex128')
+         132±2μs          149±2μs     1.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 1, 'd')
+        78.5±1μs         88.7±1μs     1.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 1, 4, 'f')
+         707±6μs         798±60μs     1.13  bench_lib.Pad.time_pad((1024, 1024), 1, 'wrap')
+     5.33±0.06μs       6.01±0.4μs     1.13  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'int16')
+     5.77±0.03μs       6.50±0.3μs     1.13  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'str'>)
+         154±3μs          173±7μs     1.13  bench_function_base.Sort.time_sort('merge', 'float64', ('sorted_block', 10))
+     5.53±0.08ms      6.22±0.04ms     1.13  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'object'>)
+         221±3μs          248±6μs     1.12  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'str'>)
+        367±10ns         411±20ns     1.12  bench_array_coercion.ArrayCoercionSmall.time_ascontiguousarray(5)
+     2.79±0.01μs      3.13±0.08μs     1.12  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 1)
+        66.1±1μs       74.1±0.8μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 2, 'f')
+     3.35±0.03μs      3.75±0.07μs     1.12  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int32')
+        91.1±2μs          102±2μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 1, 2, 'd')
+      23.3±0.3ms       26.1±0.5ms     1.12  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('bool'), 30000)
+         237±1ms          265±4ms     1.12  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('bool'), 300000)
+         396±4ns         443±20ns     1.12  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(5)
+      13.3±0.1μs       14.8±0.5μs     1.12  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int64'>)
+     4.70±0.08μs       5.24±0.4μs     1.12  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int16')
+        96.4±2μs          107±1μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 2, 'f')
+     4.70±0.07μs       5.24±0.3μs     1.11  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float32')
+     5.36±0.08μs       5.97±0.4μs     1.11  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'float16')
+         487±6ns         542±20ns     1.11  bench_array_coercion.ArrayCoercionSmall.time_array_all_kwargs(5)
+         165±2μs          184±9μs     1.11  bench_lib.Pad.time_pad((256, 128, 1), 1, 'reflect')
+         305±2ns         339±20ns     1.11  bench_array_coercion.ArrayCoercionSmall.time_asarray(5)
+     5.13±0.04μs       5.70±0.4μs     1.11  bench_ma.Indexing.time_1d(False, 2, 100)
+     3.18±0.06μs      3.53±0.09μs     1.11  bench_ufunc_strides.AVX_ldexp.time_ufunc('d', 1)
+      8.71±0.1μs       9.66±0.6μs     1.11  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int32')
+      11.7±0.3μs       13.0±0.3μs     1.11  bench_lib.Unique.time_unique(200, 2.0)
+         389±7μs         431±30μs     1.11  bench_linalg.Eindot.time_einsum_i_ij_j
+     1.81±0.01μs      2.00±0.09μs     1.11  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(range(0, 3))
+         132±2μs          146±1μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 1, 'd')
+     5.59±0.03μs       6.18±0.3μs     1.11  bench_ma.Indexing.time_1d(True, 1, 10)
+         649±4μs         718±20μs     1.11  bench_core.VarComplex.time_var(100000)
+        1.16±0μs      1.28±0.03μs     1.11  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'bool'>)
+         619±6μs         684±10μs     1.11  bench_function_base.Sort.time_argsort('merge', 'float64', ('random',))
+      10.3±0.1μs       11.3±0.3μs     1.10  bench_function_base.Sort.time_sort('merge', 'float64', ('reversed',))
+     5.47±0.06μs       6.04±0.4μs     1.10  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float16')
+     6.18±0.09μs       6.81±0.3μs     1.10  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'object'>)
+     4.37±0.03μs      4.82±0.09μs     1.10  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int64'>)
+     1.78±0.01μs      1.96±0.02μs     1.10  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'object'>)
+      46.1±0.4μs         50.8±1μs     1.10  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (4, 4))
+       234±0.8μs         258±50μs     1.10  bench_ufunc.UFunc.time_ufunc_types('sign')
+         397±1μs         437±20μs     1.10  bench_random.Random.time_rng('binomial 10 0.5')
+         223±1μs          245±7μs     1.10  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'str'>)
+     5.12±0.09μs       5.61±0.3μs     1.10  bench_ma.Indexing.time_1d(False, 1, 10)
+         452±2ns          495±5ns     1.10  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(array([5]))
+         120±1μs        131±0.9μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 4, 'f')
+     3.45±0.05μs       3.78±0.1μs     1.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
+      13.8±0.2μs       15.1±0.4μs     1.09  bench_ma.UFunc.time_2d(False, True, 10)
+        77.8±1μs       84.9±0.4μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 1, 4, 'f')
+      91.7±0.8μs      100.0±0.8μs     1.09  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
+       153±0.7μs          167±2μs     1.09  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'str'>)
+        800±10μs         872±40μs     1.09  bench_lib.Pad.time_pad((1024, 1024), (0, 32), 'edge')
+     5.65±0.03μs       6.16±0.6μs     1.09  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)
+         297±7μs          324±4μs     1.09  bench_core.UnpackBits.time_unpackbits_axis1_little
+     4.36±0.05μs      4.73±0.05μs     1.09  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int8'>)
+     2.33±0.04μs      2.53±0.04μs     1.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
+     5.35±0.05μs       5.80±0.2μs     1.08  bench_ma.Indexing.time_0d(False, 1, 10)
+      14.2±0.1μs       15.4±0.5μs     1.08  bench_ma.Concatenate.time_it('ndarray', 2)
+        91.7±1μs         99.3±1μs     1.08  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'bool'>)
+      6.48±0.1μs       7.01±0.6μs     1.08  bench_lib.Nan.time_nancumprod(200, 90.0)
+     1.17±0.01μs      1.27±0.02μs     1.08  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
+     4.59±0.04μs      4.96±0.08μs     1.08  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'numpy.int16'>)
+      14.3±0.2ms       15.4±0.2ms     1.08  bench_linalg.Eindot.time_einsum_ij_jk_a_b
+         337±6μs          363±8μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log'>, 2, 1, 'd')
+     2.80±0.01ms      3.01±0.07ms     1.08  bench_ufunc.UFunc.time_ufunc_types('arctanh')
+      59.1±0.3μs       63.7±0.5μs     1.08  bench_shape_base.Block2D.time_block2d((512, 512), 'uint8', (4, 4))
+         515±2ns          554±6ns     1.08  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), array(3.)))
+     2.80±0.06ms      3.01±0.05ms     1.08  bench_ufunc.UFunc.time_ufunc_types('sinh')
+     3.93±0.01μs       4.23±0.1μs     1.07  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'bool'>)
+      3.68±0.04s       3.96±0.02s     1.07  bench_ufunc_strides.Mandelbrot.time_mandel
+        621±10ns         667±20ns     1.07  bench_ufunc.Scalar.time_add_scalar_conv
+     5.11±0.09μs       5.49±0.2μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int64'>)
+         337±3μs          362±5μs     1.07  bench_ufunc.UFunc.time_ufunc_types('spacing')
+        549±10ns          589±2ns     1.07  bench_array_coercion.ArrayCoercionSmall.time_array_invalid_kwarg(5)
+     1.22±0.01μs      1.30±0.03μs     1.07  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'object'>)
+         551±6ns          590±8ns     1.07  bench_core.Core.time_array_l1
+     2.63±0.03μs      2.81±0.05μs     1.07  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
+         194±2μs          207±3μs     1.07  bench_ufunc.UFunc.time_ufunc_types('isinf')
+        825±10ms         880±10ms     1.07  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('int64'), 300000)
+         243±3ns          259±4ns     1.07  bench_array_coercion.ArrayCoercionSmall.time_array_no_copy(array([5]))
+        543±10ns          578±3ns     1.06  bench_array_coercion.ArrayCoercionSmall.time_array_invalid_kwarg(1)
+     7.53±0.02μs       8.02±0.2μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'str'>)
+     4.78±0.03μs      5.08±0.08μs     1.06  bench_core.CorrConv.time_correlate(1000, 10, 'same')
+         577±3ns          614±6ns     1.06  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.)))
+     2.44±0.03μs      2.59±0.03μs     1.06  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'object'>)
+         150±2μs          159±1μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 1, 'd')
+         370±2μs          391±6μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'object'>)
+        530±10μs          559±4μs     1.06  bench_function_base.Sort.time_sort('merge', 'float64', ('random',))
+       135±0.6ms          142±8ms     1.05  bench_core.CorrConv.time_convolve(100000, 10000, 'full')
+      58.5±0.2μs       61.4±0.6μs     1.05  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 1, 1, 'd')
-         154±2μs          145±1μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 4, 'f')
-     4.17±0.04μs      3.93±0.05μs     0.94  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int32')
-      5.33±0.1μs      5.01±0.04μs     0.94  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int32'>)
-     4.16±0.04μs      3.91±0.06μs     0.94  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float32')
-     13.0±0.09μs       12.2±0.1μs     0.94  bench_core.PackBits.time_packbits_axis1(<class 'bool'>)
-       195±0.5μs          182±6μs     0.93  bench_reduce.AddReduceSeparate.time_reduce(0, 'float32')
-         121±2μs          113±2μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 2, 'f')
-      51.6±0.8μs       47.6±0.5μs     0.92  bench_shape_base.Block.time_block_simple_row_wise(100)
-        775±10μs          713±8μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 2, 'd')
-         130±1μs          119±2μs     0.92  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint8', (2, 2))
-         298±3μs          272±4μs     0.91  bench_core.UnpackBits.time_unpackbits_axis1
-         495±5μs          453±9μs     0.91  bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))
-         232±3μs          210±8μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 2, 'd')
-         292±2μs          263±1μs     0.90  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))
-         153±1μs          138±1μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 1, 'f')
-         388±3μs          346±6μs     0.89  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100))
-         398±5μs          355±6μs     0.89  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10))
-         696±3μs         621±10μs     0.89  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 10))
-        97.0±3μs       86.4±0.7μs     0.89  bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))
-         312±8μs          277±1μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 4, 'd')
-         180±4μs          160±1μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 4, 'f')
-      74.6±0.9μs       66.0±0.6μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 2, 'f')
-         155±2μs          137±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 4, 'f')
-       103±0.4μs         91.1±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 4, 'f')
-         490±8μs          431±6μs     0.88  bench_function_base.Sort.time_argsort('quick', 'int64', ('random',))
-         152±3μs          134±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 2, 'f')
-       146±0.5μs          128±2μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 2, 'f')
-       105±0.8μs       91.7±0.3μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 4, 'f')
-     4.17±0.09μs       3.65±0.2μs     0.88  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'longfloat')
-     4.10±0.05μs       3.59±0.3μs     0.87  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex64')
-         147±2μs          128±2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 2, 'f')
-         148±2μs          130±1μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 2, 'f')
-     4.12±0.04μs       3.59±0.2μs     0.87  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int64')
-         271±3μs          236±4μs     0.87  bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))
-         686±4μs         597±20μs     0.87  bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 100))
-      7.44±0.3μs       6.43±0.4μs     0.86  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'float64')
-         317±2μs          274±2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 4, 'd')
-     5.19±0.07μs       4.48±0.3μs     0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'complex128')
-         789±8μs         680±20μs     0.86  bench_function_base.Sort.time_argsort('heap', 'int64', ('random',))
-        99.2±4μs         85.4±3μs     0.86  bench_lib.Nan.time_nanmax(200000, 0)
-         151±4μs          130±2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 1, 'f')
-     1.79±0.03ms      1.54±0.05ms     0.86  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 8, 'constant')
-        90.7±5μs         77.9±1μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'negative'>, 1, 4, 'f')
-         150±2μs        128±0.6μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 1, 'f')
-        86.2±2μs       73.7±0.8μs     0.85  bench_function_base.Sort.time_argsort('quick', 'int16', ('reversed',))
-         312±7μs          267±6μs     0.85  bench_function_base.Select.time_select_larger
-      7.54±0.2μs       6.45±0.4μs     0.85  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'int64')
-     4.12±0.02μs       3.52±0.2μs     0.85  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float64')
-         659±9μs         561±10μs     0.85  bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 10))
-         146±2μs          124±1μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 1, 'f')
-         537±6μs         455±10μs     0.85  bench_function_base.Sort.time_argsort('heap', 'float64', ('reversed',))
-      4.21±0.1μs       3.56±0.3μs     0.85  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex128')
-        87.1±2μs         73.1±1μs     0.84  bench_function_base.Sort.time_argsort('quick', 'int64', ('uniform',))
-      4.37±0.2μs       3.66±0.2μs     0.84  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int64')
-      7.11±0.1μs       5.96±0.3μs     0.84  bench_function_base.Sort.time_sort('merge', 'int16', ('ordered',))
-      7.72±0.4μs       6.46±0.4μs     0.84  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex64')
-         184±3μs          153±6μs     0.84  bench_function_base.Bincount.time_weights
-         137±6μs          114±2μs     0.83  bench_function_base.Bincount.time_bincount
-         862±9μs         717±20μs     0.83  bench_reduce.AddReduceSeparate.time_reduce(0, 'complex64')
-         148±2μs          122±2μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 1, 'f')
-         490±9μs          405±5μs     0.83  bench_function_base.Sort.time_argsort('heap', 'int64', ('reversed',))
-     13.6±0.02μs       11.3±0.2μs     0.82  bench_function_base.Sort.time_argsort('merge', 'float64', ('reversed',))
-      56.2±0.1μs       46.4±0.6μs     0.82  bench_function_base.Sort.time_argsort('quick', 'int16', ('ordered',))
-      4.34±0.2μs       3.55±0.2μs     0.82  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int32')
-        654±40μs         528±10μs     0.81  bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 1000))
-        839±30μs         675±20μs     0.80  bench_ufunc.UFunc.time_ufunc_types('multiply')
-         109±2μs         86.9±6μs     0.80  bench_lib.Nan.time_nanmin(200000, 0.1)
-     1.17±0.01ms         922±20μs     0.79  bench_core.PackBits.time_packbits_axis0(<class 'numpy.uint64'>)
-         108±2μs         84.9±3μs     0.79  bench_lib.Nan.time_nanmin(200000, 0)
-        8.50±1μs       6.56±0.3μs     0.77  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex128')
-      35.0±0.4μs       26.5±0.5μs     0.76  bench_io.CopyTo.time_copyto_sparse
-       126±0.9μs       93.0±0.7μs     0.74  bench_lib.Nan.time_nanmax(200000, 2.0)
-     1.03±0.01ms         727±20μs     0.71  bench_core.PackBits.time_packbits_axis1(<class 'numpy.uint64'>)
-        52.5±2μs       37.0±0.6μs     0.70  bench_core.PackBits.time_packbits(<class 'numpy.uint64'>)
-        357±10μs          234±8μs     0.66  bench_core.PackBits.time_packbits_axis0(<class 'bool'>)
-         144±1μs         87.5±5μs     0.61  bench_lib.Nan.time_nanmin(200000, 2.0)
-         281±2μs       92.3±0.6μs     0.33  bench_lib.Nan.time_nanmin(200000, 90.0)
-         299±3μs         87.9±5μs     0.29  bench_lib.Nan.time_nanmax(200000, 90.0)
-         773±6μs         95.5±4μs     0.12  bench_lib.Nan.time_nanmax(200000, 50.0)
-        772±10μs         87.9±5μs     0.11  bench_lib.Nan.time_nanmin(200000, 50.0)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
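
For anyone reading these tables: the ratio column appears to be the "after" time divided by the "before" time, so values above 1 are slowdowns and values below 1 are speedups. A minimal sketch for recomputing a row's ratio from its two timing cells follows. The helper names `to_seconds` and `ratio` are hypothetical and are not part of asv, and the parser only assumes the `value±spread<unit>` cell format seen in this output.

```python
import re

# Multipliers to seconds for the units that appear in the tables in this thread.
UNITS = {"ns": 1e-9, "μs": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(cell: str) -> float:
    """Convert a timing cell such as '5.10±0.1μs' to seconds (the ± spread is ignored)."""
    m = re.fullmatch(r"([\d.]+)±[\d.]+(ns|μs|ms|s)", cell)
    if m is None:
        raise ValueError(f"unrecognized timing cell: {cell!r}")
    return float(m.group(1)) * UNITS[m.group(2)]


def ratio(before: str, after: str) -> float:
    """Return after/before, matching how the ratio column reads in these tables."""
    return to_seconds(after) / to_seconds(before)


if __name__ == "__main__":
    # Two rows copied from the tables in this comment; units may differ per cell.
    print(round(ratio("773±6μs", "95.5±4μs"), 2))
    print(round(ratio("1.17±0.01ms", "922±20μs"), 2))
```

Because the two cells in a row can use different units (for example `ms` before and `μs` after), converting both to seconds first avoids misreading such rows.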
RERUN: AVX512F min/max compare
       before           after         ratio
     [b0e1a445]       [82801074]
     <main>           <minmax>  
+      5.10±0.1μs       17.5±0.3μs     3.43  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 2)
+      5.42±0.1μs       18.6±0.4μs     3.43  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 4)
+       167±0.8μs         523±90μs     3.13  bench_ufunc.UFunc.time_ufunc_types('conj')
+     5.35±0.04μs       16.2±0.2μs     3.04  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 4)
+      5.12±0.3μs       14.7±0.3μs     2.87  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 2)
+     6.48±0.06μs       18.1±0.2μs     2.79  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 2)
+     6.48±0.05μs       17.8±0.1μs     2.75  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 2)
+     7.46±0.05μs       18.4±0.3μs     2.47  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 4)
+     7.52±0.05μs       18.1±0.2μs     2.41  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 4)
+       189±0.5μs        405±200μs     2.14  bench_ufunc.UFunc.time_ufunc_types('positive')
+         238±2μs        462±200μs     1.94  bench_ufunc.UFunc.time_ufunc_types('abs')
+         119±4μs        175±0.8μs     1.47  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 100))
+        67.4±2μs       94.9±0.6μs     1.41  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 1000))
+     3.42±0.02μs       4.76±0.1μs     1.39  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 1)
+     3.42±0.02μs      4.74±0.02μs     1.39  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 1)
+         734±2μs       1.02±0.3ms     1.38  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int16'>)
+     8.30±0.07ms         11.5±2ms     1.38  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 8, 'wrap')
+      33.9±0.2μs       46.0±0.2μs     1.36  bench_function_base.Sort.time_argsort('heap', 'float64', ('uniform',))
+       111±0.3μs        146±0.8μs     1.32  bench_function_base.Sort.time_sort('merge', 'float64', ('sorted_block', 100))
+     1.64±0.01μs      2.12±0.01μs     1.29  bench_itemselection.PutMask.time_sparse(False, 'longfloat')
+        61.2±2μs         79.0±2μs     1.29  bench_function_base.Sort.time_sort('merge', 'float64', ('sorted_block', 1000))
+     1.64±0.02μs      2.12±0.01μs     1.29  bench_itemselection.PutMask.time_sparse(False, 'complex128')
+       179±0.9μs          230±2μs     1.28  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 10))
+      74.5±0.2μs       95.3±0.5μs     1.28  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
+     3.60±0.02μs      4.56±0.01μs     1.27  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'complex64')
+     5.32±0.05μs      6.73±0.06μs     1.27  bench_core.UnpackBits.time_unpackbits_little
+     1.57±0.02μs      1.98±0.03μs     1.26  bench_itemselection.PutMask.time_sparse(False, 'int16')
+      54.7±0.1ms        68.4±10ms     1.25  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'object'>)
+      27.5±0.9μs       34.2±0.2μs     1.24  bench_function_base.Sort.time_sort('merge', 'int16', ('random',))
+     38.2±0.09μs       47.5±0.3μs     1.24  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
+      38.3±0.1μs       47.5±0.2μs     1.24  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)
+     1.57±0.01μs      1.94±0.02μs     1.24  bench_itemselection.PutMask.time_sparse(False, 'float16')
+       187±0.3μs         230±40μs     1.23  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'object'>)
+     1.05±0.01μs         1.26±0μs     1.20  bench_itemselection.PutMask.time_sparse(True, 'complex64')
+      29.1±0.3μs       34.7±0.3μs     1.19  bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 10))
+      28.4±0.8μs         33.6±1μs     1.18  bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 100))
+         120±4μs        142±0.4μs     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 4, 'd')
+     1.75±0.02μs      2.05±0.01μs     1.18  bench_itemselection.PutMask.time_dense(False, 'longfloat')
+     1.93±0.01μs      2.27±0.03μs     1.17  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
+     1.06±0.01μs      1.25±0.01μs     1.17  bench_itemselection.PutMask.time_sparse(True, 'float64')
+     1.06±0.01μs      1.25±0.01μs     1.17  bench_itemselection.PutMask.time_sparse(True, 'int32')
+     1.06±0.01μs      1.25±0.01μs     1.17  bench_itemselection.PutMask.time_sparse(True, 'int64')
+     1.78±0.01μs      2.07±0.01μs     1.17  bench_itemselection.PutMask.time_dense(False, 'complex128')
+      31.4±0.4μs       36.4±0.3μs     1.16  bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 1000))
+      11.1±0.1μs      12.9±0.09μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'negative'>, 1, 1, 'f')
+      61.5±0.1μs       71.1±0.9μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 4, 'f')
+     10.2±0.08μs       11.8±0.8μs     1.16  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'object'>)
+        601±10μs          693±4μs     1.15  bench_function_base.Sort.time_argsort('merge', 'float64', ('random',))
+     1.09±0.01μs      1.25±0.01μs     1.15  bench_itemselection.PutMask.time_sparse(True, 'float32')
+       122±0.7μs          141±4μs     1.15  bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))
+        62.2±1μs       71.5±0.3μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 4, 'f')
+      31.0±0.2μs         35.6±2μs     1.15  bench_function_base.Sort.time_sort('merge', 'int16', ('reversed',))
+     1.54±0.02μs      1.77±0.02μs     1.15  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
+        1.94±0μs      2.23±0.02μs     1.15  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'bool'>)
+       148±0.4μs          170±1μs     1.14  bench_function_base.Sort.time_sort('merge', 'float64', ('sorted_block', 10))
+         286±6μs          327±3μs     1.14  bench_core.UnpackBits.time_unpackbits_axis1_little
+        847±40μs          967±4μs     1.14  bench_ufunc.UFunc.time_ufunc_types('divide')
+         162±1μs          185±2μs     1.14  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'object'>)
+     7.74±0.04μs      8.82±0.04μs     1.14  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
+     1.66±0.01μs      1.88±0.01μs     1.14  bench_itemselection.PutMask.time_dense(False, 'float16')
+      72.6±0.4μs         82.3±2μs     1.13  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'str'>)
+      93.6±0.9μs        106±0.5μs     1.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 4, 2, 'f')
+      91.8±0.2μs        104±0.3μs     1.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 2, 4, 'f')
+     5.40±0.02ms      6.11±0.02ms     1.13  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'object'>)
+         260±2μs          294±1μs     1.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 4, 'd')
+         383±4ns          433±5ns     1.13  bench_array_coercion.ArrayCoercionSmall.time_asarray_dtype(1)
+      9.86±0.3μs      11.1±0.03μs     1.13  bench_function_base.Sort.time_sort('merge', 'float64', ('reversed',))
+     16.2±0.04ms      18.3±0.09ms     1.13  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'object'>)
+     1.55±0.01μs      1.75±0.02μs     1.12  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'bool'>)
+      31.0±0.2μs       34.8±0.1μs     1.12  bench_core.Core.time_array_float_l1000_dtype
+      88.7±0.9μs       99.5±0.7μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'negative'>, 4, 2, 'f')
+      72.7±0.2μs       81.5±0.5μs     1.12  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'str'>)
+     1.67±0.01μs      1.87±0.02μs     1.12  bench_itemselection.PutMask.time_dense(False, 'int16')
+         474±6ns         530±30ns     1.12  bench_array_coercion.ArrayCoercionSmall.time_array_all_kwargs(1)
+       109±0.8μs          122±1μs     1.12  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'object'>)
+     10.8±0.05ms       12.1±0.1ms     1.12  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'object'>)
+         119±1μs        132±0.3μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 2, 'd')
+         119±1μs        132±0.4μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 4, 'f')
+       129±0.5μs        144±0.7μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 1, 'd')
+         130±1μs        145±0.7μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 1, 'd')
+     1.15±0.01μs      1.28±0.01μs     1.11  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'bool'>)
+     6.50±0.05μs       7.23±0.3μs     1.11  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'object'>)
+       130±0.6μs          145±1μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 1, 'd')
+     5.40±0.02μs      6.00±0.02μs     1.11  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)
+     3.34±0.02μs      3.70±0.02μs     1.11  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int32')
+     1.14±0.02μs         1.26±0μs     1.11  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
+      46.7±0.5μs       51.6±0.7μs     1.10  bench_shape_base.Block.time_block_simple_row_wise(100)
+        362±10μs          400±2μs     1.10  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))
+     5.44±0.02μs      6.01±0.03μs     1.10  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
+       138±0.7μs          152±5μs     1.10  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
+     7.87±0.03μs      8.64±0.08μs     1.10  bench_reduce.MinMax.time_min(<class 'numpy.float64'>)
+       232±0.6ns        255±0.8ns     1.10  bench_array_coercion.ArrayCoercionSmall.time_array_no_copy(array([5]))
+         319±3ns          351±5ns     1.10  bench_array_coercion.ArrayCoercionSmall.time_asanyarray(1)
+        278±10μs          304±2μs     1.09  bench_function_base.Sort.time_sort('merge', 'int64', ('random',))
+         306±2ns         335±10ns     1.09  bench_array_coercion.ArrayCoercionSmall.time_asarray(5)
+       120±0.8μs        132±0.6μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 2, 'f')
+      58.1±0.3μs       63.5±0.2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'negative'>, 1, 1, 'd')
+         284±9μs          310±2μs     1.09  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))
+     6.98±0.03μs       7.61±0.2μs     1.09  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'str'>)
+      6.46±0.3μs       7.04±0.3μs     1.09  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'object'>)
+         119±1μs          130±3μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 4, 4, 'f')
+         152±1μs          166±1μs     1.09  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'str'>)
+         151±3μs          164±2μs     1.09  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'str'>)
+       125±0.5μs          136±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 1, 2, 'd')
+     2.33±0.01μs      2.52±0.01μs     1.08  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'object'>)
+         316±2ns         343±10ns     1.08  bench_array_coercion.ArrayCoercionSmall.time_asarray(1)
+     1.17±0.01μs      1.27±0.01μs     1.08  bench_itemselection.PutMask.time_sparse(True, 'complex128')
+     28.2±0.04μs      30.5±0.06μs     1.08  bench_function_base.Sort.time_argsort('heap', 'int16', ('uniform',))
+         445±2μs          481±2μs     1.08  bench_function_base.Sort.time_sort('quick', 'float64', ('random',))
+         689±3μs          743±3μs     1.08  bench_lib.Pad.time_pad((1024, 1024), 1, 'reflect')
+         220±2μs        237±0.8μs     1.08  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'str'>)
+     1.85±0.01μs      1.99±0.01μs     1.08  bench_itemselection.PutMask.time_sparse(False, 'int64')
+     8.62±0.06μs      9.28±0.05μs     1.08  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float32')
+         460±2μs          496±2μs     1.08  bench_random.RNG.time_64bit('MT19937')
+         161±2μs        173±0.9μs     1.08  bench_lib.Pad.time_pad((256, 128, 1), 1, 'wrap')
+         639±7ns          687±9ns     1.08  bench_array_coercion.ArrayCoercionSmall.time_array_dtype_not_kwargs([1])
+     1.85±0.01μs      1.99±0.01μs     1.08  bench_itemselection.PutMask.time_sparse(False, 'complex64')
+     3.15±0.01μs      3.39±0.02μs     1.08  bench_ufunc_strides.AVX_ldexp.time_ufunc('d', 1)
+     4.63±0.03μs      4.98±0.03μs     1.08  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float16')
+     12.1±0.03ms      13.0±0.05ms     1.07  bench_lib.Unique.time_unique(200000, 0)
+     4.79±0.02μs      5.14±0.09μs     1.07  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int64'>)
+     2.76±0.01μs      2.97±0.02μs     1.07  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 1)
+     4.66±0.01μs       5.00±0.1μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int16'>)
+         169±1μs          181±5μs     1.07  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 2, 'd')
+       161±0.5μs        172±0.3μs     1.07  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint8', (4, 4))
+     1.77±0.01μs      1.90±0.02μs     1.07  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'object'>)
+     12.2±0.01ms      13.1±0.03ms     1.07  bench_lib.Unique.time_unique(200000, 2.0)
+         446±2μs         477±20μs     1.07  bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 100))
+     2.80±0.02μs      3.00±0.05μs     1.07  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 1)
+     12.2±0.04ms      13.1±0.03ms     1.07  bench_lib.Unique.time_unique(200000, 0.1)
+      46.1±0.3μs       49.3±0.1μs     1.07  bench_core.PackBits.time_packbits_little(<class 'numpy.uint64'>)
+     5.28±0.03μs      5.63±0.02μs     1.07  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'float16')
+         222±4μs        237±0.9μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'str'>)
+     4.86±0.07μs      5.18±0.04μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int16'>)
+         384±2μs          409±3μs     1.07  bench_linalg.Eindot.time_einsum_i_ij_j
+     4.89±0.03μs      5.21±0.09μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int64'>)
+      29.9±0.2μs       31.8±0.9μs     1.07  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>)
+     4.64±0.02μs      4.94±0.02μs     1.07  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int16')
+         788±2μs          839±4μs     1.06  bench_lib.Pad.time_pad((1024, 1024), 8, 'reflect')
+         694±5μs          738±2μs     1.06  bench_lib.Pad.time_pad((1024, 1024), 1, 'wrap')
+     7.76±0.05μs      8.25±0.06μs     1.06  bench_shape_base.Block.time_no_lists(10)
+      4.01±0.1ms      4.26±0.03ms     1.06  bench_lib.Pad.time_pad((256, 128, 1), 8, 'wrap')
+     6.44±0.06μs       6.85±0.2μs     1.06  bench_lib.Nan.time_nancumsum(200, 90.0)
+     3.89±0.02μs      4.13±0.07μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'bool'>)
+         432±6ns         459±10ns     1.06  bench_core.Core.time_arange_100
+     10.7±0.06μs      11.3±0.08μs     1.06  bench_function_base.Where.time_1
+      24.8±0.1μs       26.4±0.6μs     1.06  bench_ma.UFunc.time_1d(True, False, 1000)
+      3.56±0.01s       3.78±0.02s     1.06  bench_ufunc_strides.Mandelbrot.time_mandel
+     5.44±0.02μs      5.77±0.05μs     1.06  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float16')
+     4.67±0.03μs      4.96±0.03μs     1.06  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float32')
+        517±20μs          549±2μs     1.06  bench_function_base.Sort.time_sort('merge', 'float64', ('random',))
+     5.32±0.03μs      5.65±0.04μs     1.06  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'int16')
+     4.39±0.03μs      4.66±0.09μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int64'>)
+     4.82±0.02μs       5.11±0.1μs     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int32'>)
+     11.3±0.07μs      11.9±0.04μs     1.06  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
+      46.7±0.4μs       49.5±0.6μs     1.06  bench_lib.Nan.time_nanvar(200, 0.1)
+         146±1μs          154±1μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 1, 'd')
+      7.73±0.1μs      8.18±0.05μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'str'>)
+     8.39±0.04μs      8.87±0.08μs     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'object'>)
+     4.84±0.05μs      5.12±0.04μs     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int8'>)
+     8.73±0.05μs      9.23±0.03μs     1.06  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int32')
+       276±0.8μs          291±2μs     1.05  bench_ufunc.UFunc.time_ufunc_types('nextafter')
+     1.89±0.02μs      2.00±0.01μs     1.05  bench_itemselection.PutMask.time_sparse(False, 'float64')
+     4.68±0.01μs       4.93±0.1μs     1.05  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int16'>)
+       186±0.3μs          196±1μs     1.05  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'object'>)
+     4.82±0.03μs      5.08±0.05μs     1.05  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int64'>)
+     3.99±0.01μs      4.20±0.04μs     1.05  bench_reduce.AnyAll.time_any_slow
+     5.42±0.02μs      5.71±0.03μs     1.05  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int16')
+     7.37±0.07μs      7.75±0.03μs     1.05  bench_lib.Nan.time_nansum(200, 2.0)
+     6.50±0.03ms      6.83±0.04ms     1.05  bench_lib.Unique.time_unique(200000, 90.0)
-         938±3μs          893±5μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cosh'>, 4, 4, 'd')
-      32.9±0.2μs       31.3±0.2μs     0.95  bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float32'>)
-         585±7ns          556±4ns     0.95  bench_scalar.ScalarMath.time_abs('longfloat')
-         224±3μs        214±0.8μs     0.95  bench_ufunc.UFunc.time_ufunc_types('floor')
-      11.0±0.1μs      10.4±0.08μs     0.95  bench_ma.UFunc.time_scalar(True, False, 1000)
-         747±8μs          711±6μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsin'>, 4, 4, 'f')
-     5.60±0.04μs       5.33±0.1μs     0.95  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float16')
-        779±10ns         741±20ns     0.95  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int64'>)
-     1.01±0.02ms          954±2μs     0.95  bench_reduce.AddReduceSeparate.time_reduce(0, 'complex128')
-     1.29±0.01μs      1.22±0.01μs     0.95  bench_scalar.ScalarMath.time_power_of_two('int64')
-         321±2μs          304±2μs     0.95  bench_ufunc.UFunc.time_ufunc_types('logical_xor')
-     12.9±0.08μs      12.2±0.03μs     0.95  bench_core.PackBits.time_packbits_axis1(<class 'bool'>)
-         745±5μs          706±6μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsin'>, 4, 1, 'f')
-         387±3μs          366±2μs     0.95  bench_function_base.Sort.time_argsort('quick', 'int16', ('sorted_block', 100))
-         599±6μs          567±4μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsinh'>, 4, 1, 'f')
-      53.2±0.7μs      50.3±0.07μs     0.95  bench_ufunc.UFunc.time_ufunc_types('ldexp')
-      25.0±0.5μs       23.6±0.2μs     0.95  bench_scalar.ScalarMath.time_power_of_two('complex64')
-     3.34±0.01μs      3.16±0.01μs     0.94  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'complex64')
-         379±2μs         358±10μs     0.94  bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',))
-      22.6±0.1μs      21.3±0.05μs     0.94  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (2, 2))
-     3.34±0.02μs      3.15±0.02μs     0.94  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float64')
-     4.38±0.03μs      4.13±0.01μs     0.94  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int16')
-     4.57±0.07μs      4.31±0.02μs     0.94  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex256')
-     4.42±0.07μs      4.16±0.03μs     0.94  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float16')
-         393±2μs          369±1μs     0.94  bench_function_base.Sort.time_argsort('quick', 'int16', ('sorted_block', 10))
-      96.2±0.9μs       90.4±0.5μs     0.94  bench_lib.Nan.time_nanmax(200000, 0)
-     7.36±0.03μs      6.92±0.02μs     0.94  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int16')
-     6.46±0.03μs      6.07±0.02μs     0.94  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex256')
-     4.70±0.02μs      4.41±0.05μs     0.94  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float16')
-     5.59±0.03μs      5.25±0.02μs     0.94  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'int16')
-     5.92±0.02μs      5.55±0.03μs     0.94  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int64')
-        700±10ns          657±5ns     0.94  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), out=array(3.), subok=True, where=True))
-     4.68±0.01μs      4.39±0.02μs     0.94  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int16')
-     40.5±0.09μs       38.0±0.4μs     0.94  bench_linalg.Linalg.time_op('norm', 'float16')
-     2.76±0.03μs      2.58±0.01μs     0.94  bench_ufunc_strides.AVX_ldexp.time_ufunc('f', 1)
-     2.58±0.04μs      2.42±0.03μs     0.94  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
-         569±9μs          532±5μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 4, 2, 'f')
-     5.94±0.05μs      5.56±0.03μs     0.94  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'longfloat')
-         609±2μs          570±4μs     0.94  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))
-      40.1±0.4μs       37.5±0.2μs     0.94  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 1, 'edge')
-     4.09±0.06μs      3.83±0.03μs     0.94  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex128')
-         570±5μs          533±8μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 4, 1, 'f')
-     5.95±0.02μs      5.56±0.02μs     0.93  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex64')
-       294±0.9μs          274±1μs     0.93  bench_function_base.Sort.time_argsort('quick', 'int16', ('sorted_block', 1000))
-     13.0±0.07μs      12.1±0.06μs     0.93  bench_function_base.Sort.time_argsort('merge', 'float64', ('uniform',))
-     7.44±0.04μs      6.93±0.05μs     0.93  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float16')
-     3.34±0.02μs      3.12±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int32')
-     4.07±0.03μs      3.79±0.03μs     0.93  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'longfloat')
-      28.4±0.2μs      26.4±0.08μs     0.93  bench_function_base.Sort.time_argsort('heap', 'int64', ('uniform',))
-     5.96±0.02μs      5.55±0.02μs     0.93  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float64')
-     1.15±0.01μs      1.07±0.02μs     0.93  bench_itemselection.PutMask.time_sparse(True, 'complex256')
-         148±3μs          138±4μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 4, 'f')
-         607±8μs          563±4μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsinh'>, 4, 2, 'f')
-         154±2μs        142±0.9μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 4, 'f')
-      24.7±0.2ms       22.9±0.1ms     0.93  bench_linalg.Eindot.time_einsum_ijk_jil_kl
-      3.74±0.1μs      3.47±0.03μs     0.93  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-       135±0.3μs        125±0.4μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 4, 'f')
-     13.0±0.03μs      12.1±0.05μs     0.92  bench_function_base.Sort.time_argsort('merge', 'float64', ('ordered',))
-      3.67±0.1μs      3.39±0.02μs     0.92  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex256')
-     5.98±0.02μs      5.53±0.03μs     0.92  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex128')
-      5.28±0.3μs      4.88±0.02μs     0.92  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float16')
-         151±2μs          140±2μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 1, 'f')
-     8.81±0.02μs      8.13±0.07μs     0.92  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int32')
-         525±2μs          484±2μs     0.92  bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))
-     8.79±0.04μs      8.11±0.04μs     0.92  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float32')
-      3.20±0.1ms         2.95±0ms     0.92  bench_lib.Pad.time_pad((256, 128, 1), 8, 'reflect')
-     7.40±0.04μs      6.81±0.04μs     0.92  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex256')
-      3.41±0.1μs      3.13±0.02μs     0.92  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex64')
-     7.22±0.09ms      6.63±0.02ms     0.92  bench_reduce.AddReduceSeparate.time_reduce(0, 'float16')
-     3.41±0.09μs      3.13±0.03μs     0.92  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int64')
-      40.2±0.4μs       36.8±0.3μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 1, 'f')
-      3.44±0.1μs      3.15±0.03μs     0.92  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex128')
-      3.44±0.1μs      3.16±0.02μs     0.92  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'longfloat')
-      47.2±0.7μs       43.0±0.1μs     0.91  bench_lib.Pad.time_pad((4, 4, 4, 4), 1, 'reflect')
-      4.18±0.1μs      3.81±0.01μs     0.91  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float16')
-         292±1μs         266±10μs     0.91  bench_function_base.Select.time_select_larger
-       145±0.4μs          132±1μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 2, 'f')
-      4.83±0.2μs      4.39±0.02μs     0.91  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float32')
-         151±2μs          137±2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 1, 'f')
-       149±0.7μs          134±2μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 4, 'f')
-         288±1μs          260±1μs     0.90  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))
-      39.6±0.4μs       35.7±0.1μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 2, 1, 'f')
-      4.23±0.2μs      3.80±0.02μs     0.90  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int16')
-         114±3μs        102±0.6μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 2, 'f')
-         394±2μs          353±2μs     0.90  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10))
-        637±40μs          570±5μs     0.89  bench_ufunc.UFunc.time_ufunc_types('maximum')
-      85.0±0.4μs         76.0±1μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 4, 'f')
-         382±2μs          342±1μs     0.89  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100))
-         146±1μs        130±0.6μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 2, 'f')
-         104±1μs       92.9±0.8μs     0.89  bench_lib.Nan.time_nanmax(200000, 0.1)
-     1.23±0.02μs      1.10±0.01μs     0.89  bench_itemselection.PutMask.time_sparse(True, 'float16')
-     1.23±0.01μs      1.09±0.01μs     0.89  bench_itemselection.PutMask.time_sparse(True, 'int16')
-         820±4μs          728±4μs     0.89  bench_function_base.Sort.time_argsort('heap', 'float64', ('random',))
-     2.06±0.02ms      1.83±0.01ms     0.89  bench_reduce.AddReduceSeparate.time_reduce(1, 'float16')
-       103±0.3μs         91.0±2μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 1, 2, 'd')
-      82.4±0.6μs       72.7±0.3μs     0.88  bench_function_base.Sort.time_argsort('quick', 'int16', ('reversed',))
-         476±3μs          420±2μs     0.88  bench_function_base.Sort.time_argsort('quick', 'int64', ('random',))
-      90.8±0.1μs         80.1±2μs     0.88  bench_indexing.Indexing.time_op('indexes_', 'I', '')
-       144±0.7μs        127±0.7μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 2, 'f')
-         735±9μs          646±3μs     0.88  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 100))
-      4.34±0.2μs      3.81±0.02μs     0.88  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex256')
-      2.82±0.1ms      2.47±0.01ms     0.88  bench_ufunc.UFunc.time_ufunc_types('tan')
-         146±1μs          128±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 1, 'f')
-         678±4μs          593±3μs     0.88  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 1000))
-       144±0.7μs        126±0.8μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 2, 'f')
-      5.01±0.2μs      4.37±0.02μs     0.87  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int32')
-       312±0.7μs          272±1μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 4, 'd')
-         683±3μs          595±1μs     0.87  bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 10))
-         523±3μs          456±3μs     0.87  bench_function_base.Sort.time_argsort('heap', 'float64', ('reversed',))
-      70.2±0.3μs       61.0±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 1, 'd')
-         771±5μs          667±3μs     0.87  bench_function_base.Sort.time_argsort('heap', 'int64', ('random',))
-         146±1μs        126±0.9μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 1, 'f')
-     1.98±0.02μs      1.71±0.02μs     0.86  bench_itemselection.PutMask.time_dense(False, 'float32')
-         444±5μs          382±3μs     0.86  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'object'>)
-         671±3μs          577±4μs     0.86  bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 100))
-      84.6±0.6μs       72.5±0.4μs     0.86  bench_function_base.Sort.time_argsort('quick', 'int64', ('uniform',))
-       142±0.7μs        122±0.6μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 1, 'f')
-      44.6±0.5ms      38.1±0.09ms     0.86  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'object'>)
-         646±5μs          550±3μs     0.85  bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 10))
-      98.2±0.5μs         83.5±3μs     0.85  bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))
-         435±1μs          369±2μs     0.85  bench_function_base.Sort.time_argsort('heap', 'int64', ('ordered',))
-     2.00±0.01μs         1.70±0μs     0.85  bench_itemselection.PutMask.time_dense(False, 'int32')
-         143±1μs        121±0.9μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 1, 'f')
-     13.3±0.07μs      11.1±0.08μs     0.84  bench_function_base.Sort.time_argsort('merge', 'float64', ('reversed',))
-     7.20±0.05μs      6.03±0.03μs     0.84  bench_function_base.Sort.time_sort('merge', 'int16', ('ordered',))
-         617±4μs          517±1μs     0.84  bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 1000))
-         497±2μs          416±1μs     0.84  bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))
-         480±2μs          401±2μs     0.83  bench_function_base.Sort.time_argsort('heap', 'int64', ('reversed',))
-      4.09±0.5μs      3.41±0.01μs     0.83  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float64')
-     7.26±0.08μs      6.04±0.02μs     0.83  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'longfloat')
-        1.07±0ms          894±4μs     0.83  bench_core.PackBits.time_packbits_axis0(<class 'numpy.uint64'>)
-         136±1μs        113±0.5μs     0.83  bench_function_base.Bincount.time_bincount
-     7.30±0.04μs      6.07±0.03μs     0.83  bench_function_base.Sort.time_sort('merge', 'int16', ('uniform',))
-     7.26±0.03μs      6.04±0.03μs     0.83  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex128')
-     4.14±0.06μs      3.44±0.03μs     0.83  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float32')
-     4.07±0.07μs      3.38±0.01μs     0.83  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'float64')
-     4.06±0.03μs      3.37±0.04μs     0.83  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex64')
-      5.11±0.1μs      4.23±0.02μs     0.83  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'complex128')
-     4.09±0.03μs      3.38±0.01μs     0.83  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'int64')
-     7.29±0.02μs      6.03±0.03μs     0.83  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'complex64')
-     4.16±0.03μs      3.43±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int32')
-     4.12±0.05μs      3.39±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex64')
-     7.31±0.03μs      6.01±0.03μs     0.82  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'float64')
-         853±3μs          698±3μs     0.82  bench_reduce.AddReduceSeparate.time_reduce(0, 'complex64')
-      7.36±0.2μs      6.00±0.04μs     0.82  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'raise', 'int64')
-      4.18±0.1μs      3.41±0.02μs     0.82  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int64')
-       177±0.6μs          144±1μs     0.81  bench_function_base.Bincount.time_weights
-         109±2μs         87.7±5μs     0.81  bench_lib.Nan.time_nanmin(200000, 0.1)
-      4.23±0.1μs      3.41±0.04μs     0.81  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'longfloat')
-      55.0±0.2μs       44.3±0.2μs     0.81  bench_function_base.Sort.time_argsort('quick', 'int16', ('ordered',))
-      34.8±0.2μs         28.0±1μs     0.80  bench_io.CopyTo.time_copyto_sparse
-       104±0.3μs       82.0±0.4μs     0.79  bench_lib.Nan.time_nanmin(200000, 0)
-      4.31±0.2μs      3.40±0.01μs     0.79  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'int32')
-      4.33±0.6μs      3.39±0.02μs     0.78  bench_itemselection.Take.time_contiguous((1000, 1), 'raise', 'complex128')
-      4.34±0.2μs      3.39±0.01μs     0.78  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'float64')
-        881±20μs          649±1μs     0.74  bench_ufunc.UFunc.time_ufunc_types('sqrt')
-      50.7±0.2μs       36.6±0.2μs     0.72  bench_core.PackBits.time_packbits(<class 'numpy.uint64'>)
-        1.00±0ms          707±2μs     0.70  bench_core.PackBits.time_packbits_axis1(<class 'numpy.uint64'>)
-         126±1μs         87.5±4μs     0.69  bench_lib.Nan.time_nanmax(200000, 2.0)
-         143±1μs       92.0±0.7μs     0.64  bench_lib.Nan.time_nanmin(200000, 2.0)
-         347±2μs        222±0.7μs     0.64  bench_core.PackBits.time_packbits_axis0(<class 'bool'>)
-       520±300μs          261±7μs     0.50  bench_ufunc.UFunc.time_ufunc_types('equal')
-       472±200μs          217±1μs     0.46  bench_ufunc.UFunc.time_ufunc_types('deg2rad')
-       476±200μs        217±0.8μs     0.46  bench_ufunc.UFunc.time_ufunc_types('radians')
-         276±2μs       82.9±0.2μs     0.30  bench_lib.Nan.time_nanmin(200000, 90.0)
-        650±20μs          191±6μs     0.29  bench_ufunc.UFunc.time_ufunc_types('negative')
-       298±0.7μs         83.4±1μs     0.28  bench_lib.Nan.time_nanmax(200000, 90.0)
-         764±4μs       91.6±0.2μs     0.12  bench_lib.Nan.time_nanmin(200000, 50.0)
-         757±2μs         87.0±4μs     0.11  bench_lib.Nan.time_nanmax(200000, 50.0)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Edit (seiko2plus): put the benchmark into <details>

Edit: Added the re-run as requested

@seiko2plus
Member

@Developer-Ecosystem-Engineering, in the AVX512 benchmark I see a lot of regressions that are not related to your patch; my guess is that ASV is affected by the installed version of NumPy. Please remove NumPy from your environment via pip, install the latest NumPy from the main branch, make sure as_min_max is rebased against main, and then re-run the performance tests.

@rgommers
Member

Thanks @Developer-Ecosystem-Engineering!

The ComprehensiveTests Linux_conda failure is unrelated. The TravisCI failures on Linux aarch64 and ppc64le do look related though - could you have a look at those?

Just copying one failure, there are multiple:

___________________________ TestMaximum.test_reduce ____________________________

self = <numpy.core.tests.test_umath.TestMaximum object at 0xfffeda72dd60>

    def test_reduce(self):
        dflt = np.typecodes['AllFloat']
        dint = np.typecodes['AllInteger']
        seq1 = np.arange(11)
        seq2 = seq1[::-1]
        func = np.maximum.reduce
        for dt in dint:
            tmp1 = seq1.astype(dt)
            tmp2 = seq2.astype(dt)
            assert_equal(func(tmp1), 10)
            assert_equal(func(tmp2), 10)
        for dt in dflt:
            tmp1 = seq1.astype(dt)
            tmp2 = seq2.astype(dt)
>           assert_equal(func(tmp1), 10)
E           AssertionError: 
E           Items are not equal:
E            ACTUAL: 0.0
E            DESIRED: 10

dflt       = 'efdgFDG'
dint       = 'bBhHiIlLqQpP'
dt         = 'g'
func       = <built-in method reduce of numpy.ufunc object at 0xfffefed57a40>
self       = <numpy.core.tests.test_umath.TestMaximum object at 0xfffeda72dd60>
seq1       = array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
seq2       = array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0])
tmp1       = array([0.e+00, 1.e+00, 2.e+00, 3.e+00, 4.e+00, 5.e+00, 6.e+00, 7.e+00,
       8.e+00, 9.e+00, 1.e+01], dtype=float128)
tmp2       = array([1.e+01, 9.e+00, 8.e+00, 7.e+00, 6.e+00, 5.e+00, 4.e+00, 3.e+00,
       2.e+00, 1.e+00, 0.e+00], dtype=float128)

../../builds/venv/lib/python3.10/site-packages/numpy-1.23.0.dev0+35.g51284a404-py3.10-linux-aarch64.egg/numpy/core/tests/test_umath.py:1681: AssertionError

@Developer-Ecosystem-Engineering
Contributor Author

We've re-run the benchmarks and updated the comment with the results.

@r-devulap
Member

This patch significantly regresses performance on SKX with AVX-512:

       before           after         ratio
     [a688ed68]       [9fe353e0]
     <main>           <as_min_max>
+     7.77±0.07μs       26.8±0.2μs     3.45  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 2)
+     6.66±0.04μs       21.1±0.2μs     3.17  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 2)
+      9.36±0.7μs      28.4±0.06μs     3.04  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 4)
+      7.52±0.1μs       22.3±0.3μs     2.97  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 4)
+     6.66±0.04μs       19.2±0.1μs     2.88  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 2)
+     7.52±0.09μs       20.0±0.3μs     2.66  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 4)
+     7.77±0.06μs       20.4±0.1μs     2.63  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 2)
+      9.20±0.4μs       21.7±0.2μs     2.36  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 4)
+     3.85±0.04μs      7.39±0.01μs     1.92  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 1)
+     3.85±0.02μs      7.36±0.01μs     1.91  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 1)
+     2.53±0.02μs      4.27±0.02μs     1.69  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 1)
+     2.54±0.03μs      4.26±0.01μs     1.68  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 1)
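
For anyone reading these tables: the ratio column is after divided by before, so a `+` row with a ratio above 1 is a slowdown and a `-` row with a ratio below 1 is a speedup. A minimal sketch for parsing one row follows. The helper names `to_seconds` and `parse_row` are made up for illustration and are not part of asv.

```python
import re

# Scale factors to seconds for the units that appear in the tables.
# Both the Greek mu and the micro sign are accepted for microseconds.
_UNITS = {"ns": 1e-9, "\u03bcs": 1e-6, "\u00b5s": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}

# A timing cell looks like "7.77±0.07μs": value, optional ± spread, unit.
_TIME = re.compile(r"([0-9.]+)(?:\u00b1[0-9.]+)?(" + "|".join(_UNITS) + ")")


def to_seconds(cell):
    """Convert a timing cell to seconds, ignoring the ± spread."""
    match = _TIME.fullmatch(cell)
    if match is None:
        raise ValueError(f"unrecognised timing cell: {cell!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def parse_row(line):
    """Split one comparison row into (marker, before_s, after_s, ratio, name)."""
    marker, before, after, ratio, name = line.split(None, 4)
    return marker, to_seconds(before), to_seconds(after), float(ratio), name


row = ("+     7.77\u00b10.07\u03bcs       26.8\u00b10.2\u03bcs     3.45  "
       "bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 2)")
marker, before, after, ratio, name = parse_row(row)
# Recompute the ratio from the two timings and compare with the printed one.
print(marker, round(after / before, 2), ratio)
```

Recomputing the ratio from the two timing columns is a quick sanity check when rows are copied between comments.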

@Developer-Ecosystem-Engineering
Contributor Author

The float128 failures appear to be related to the requested removal of the check that long double is 64-bit. If we add the check back, the tests should pass again, but then arm64 Linux will not receive the optimized versions.
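
To see which case a given platform falls into, a small sketch like the one below can help. It only compares storage sizes through `ctypes`; it is not the check used in the patch, and the helper name `long_double_is_64bit` is hypothetical.

```python
import ctypes


def long_double_is_64bit():
    """Return True when C `long double` has the same storage size as `double`."""
    return ctypes.sizeof(ctypes.c_longdouble) == ctypes.sizeof(ctypes.c_double)


# On macOS arm64 the two sizes match; on Linux aarch64 long double is 16 bytes,
# which is why the failing dtype in the traceback is float128.
print("long double bytes:", ctypes.sizeof(ctypes.c_longdouble))
print("same size as double:", long_double_is_64bit())
```

Note that storage size is only a proxy: a 16-byte long double can hold either an 80-bit extended or a 128-bit quad value, but an 8-byte one is necessarily the same as double.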

@r-devulap
Member

@mattip Thanks, I am taking a look.

@r-devulap
Member

@mattip LGTM. Thanks!

Member

@seiko2plus seiko2plus left a comment

Sorry for the late reply, due to my marriage, and honeymoon :).
Thank you for your cooperation and effort. I have made some extra enhancements to the current implementation to avoid any performance regression on x86, and also improved the benchmark tests as indicated in the commit messages. I still need a few hours to run benchmarks across all supported architectures to verify my latest changes.

@charris
Member

charris commented Jan 5, 2022

due to my marriage, and honeymoon :).

Congratulations Sayed.

  - Avoid unrolling vectorized max/min loops by x6/x8 when SIMD width > 128
    to avoid a memory bandwidth bottleneck
  - tune reduce max/min
  - vectorize non-contiguous max/min
  - fix code style
  - call npyv_cleanup() at end of inner loop
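
The unrolling point in the commit message above can be illustrated with a plain-Python sketch. This is not the NumPy C code: `choose_unroll`, `unrolled_max`, and the fallback factor of 2 for wide vectors are assumptions made up for illustration, and NaN handling is ignored. The idea is that an unrolled reduction keeps several independent running maxima and merges them at the end, and that the unroll depth is picked from the SIMD width.

```python
def choose_unroll(simd_width_bits, narrow_unroll=8, wide_unroll=2):
    """Hypothetical policy: deep unrolling only for vectors of 128 bits or less."""
    return narrow_unroll if simd_width_bits <= 128 else wide_unroll


def unrolled_max(values, unroll):
    """Reduce with `unroll` independent running maxima, then merge them."""
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    acc = [values[0]] * unroll
    # Process the part of the input that fills whole groups of `unroll` items.
    full = len(values) - len(values) % unroll
    for i in range(0, full, unroll):
        for lane in range(unroll):
            if values[i + lane] > acc[lane]:
                acc[lane] = values[i + lane]
    # Merge the accumulators, then handle the leftover tail one item at a time.
    result = max(acc)
    for v in values[full:]:
        if v > result:
            result = v
    return result


data = list(range(11))
print(unrolled_max(data, choose_unroll(128)), unrolled_max(data[::-1], choose_unroll(512)))
```

With wide registers each iteration already moves a lot of data, so extra accumulators add little once the loop is limited by memory bandwidth, which is the bottleneck the commit message refers to.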
Member

@seiko2plus seiko2plus left a comment

I am satisfied with the performance. The improvement covers operations on all integer types and floating-point precisions, on all supported architectures. For the record, the removed raw x86 SIMD code (SSE, AVX512) only supported max and min for single and double precision.
The one performance downgrade is the argmax operation for single precision. It shouldn't be related to the changes this PR covers, but it is somehow still affected. I am not sure exactly why, since the performance of the reduce operations for both fmax and maximum has increased. However, the current SIMD code of argmax needs improvement and should be replaced with universal intrinsics. Please check the following performance benchmarks for more information:

X86

CPU
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:                        4
CPU MHz:                         3410.808
BogoMIPS:                        5999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        24.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
                                 constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                                 tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep
                                 bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat
                                 pku ospke
OS
Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Python 3.8.10
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

AVX512_SKX(before) vs AVX512_SKX(after)
unset NPY_DISABLE_CPU_FEATURES
python runtests.py -n --bench-compare parent/main "max|min" -- --sort ratio
       before           after         ratio
     [f224ca3c]       [b49819a6]
+       122±0.2μs          218±2μs     1.79  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
+       431±0.4μs          483±1μs     1.12  bench_ufunc.UFunc.time_ufunc_types('fmax')
+         423±1μs          466±2μs     1.10  bench_ufunc.UFunc.time_ufunc_types('fmin')
+      93.4±0.2μs         98.9±1μs     1.06  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 1, 'd')
-       103±0.6μs       98.3±0.8μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'I')
-      86.9±0.4μs       82.5±0.7μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'H')
-        87.3±1μs         82.6±1μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'I')
-        75.1±1μs         70.7±1μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'I')
-      73.1±0.2μs       68.6±0.5μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'i')
-      74.3±0.9μs       69.6±0.6μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'h')
-      73.7±0.5μs       69.0±0.4μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'h')
-        76.2±1μs         71.3±1μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'h')
-      73.3±0.3μs       68.4±0.4μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'i')
-      74.5±0.9μs       69.5±0.4μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'i')
-      75.0±0.9μs       69.8±0.9μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'i')
-      73.2±0.2μs       68.1±0.3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'I')
-      82.9±0.8μs       77.2±0.5μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'H')
-        74.8±1μs         69.4±1μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'I')
-      82.4±0.9μs       76.5±0.8μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'H')
-      6.08±0.2μs      5.64±0.06μs     0.93  bench_lib.Nan.time_nanmin(200, 0.1)
-      6.15±0.1μs      5.70±0.05μs     0.93  bench_lib.Nan.time_nanmin(200, 2.0)
-      6.19±0.1μs      5.71±0.06μs     0.92  bench_lib.Nan.time_nanmin(200, 50.0)
-        1.01±0ms          931±5μs     0.92  bench_lib.Nan.time_nanargmin(200000, 90.0)
-      71.5±0.5μs       65.9±0.2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'h')
-      72.6±0.3μs       66.8±0.1μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'h')
-      72.7±0.3μs       66.8±0.2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'h')
-      6.10±0.2μs      5.61±0.08μs     0.92  bench_lib.Nan.time_nanmax(200, 0)
-      71.2±0.2μs       65.4±0.4μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'h')
-      6.14±0.2μs      5.63±0.03μs     0.92  bench_lib.Nan.time_nanmin(200, 0)
-      72.5±0.2μs       66.5±0.5μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'h')
-        1.01±0ms          926±4μs     0.92  bench_lib.Nan.time_nanargmax(200000, 90.0)
-      80.1±0.2μs       73.4±0.5μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'H')
-      79.9±0.3μs       73.1±0.1μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'I')
-        82.1±1μs         75.1±1μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'I')
-      81.1±0.7μs       74.1±0.8μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'H')
-        82.1±2μs       75.1±0.9μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'H')
-      81.5±0.6μs       74.4±0.7μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'I')
-      73.2±0.2μs       66.8±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'h')
-      6.17±0.2μs      5.63±0.03μs     0.91  bench_lib.Nan.time_nanmax(200, 0.1)
-      70.1±0.3μs       63.8±0.1μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'h')
-      69.6±0.2μs       63.2±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'h')
-      6.23±0.1μs      5.65±0.01μs     0.91  bench_lib.Nan.time_nanmax(200, 50.0)
-      70.7±0.2μs      64.1±0.03μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'h')
-      68.7±0.1μs       62.2±0.1μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'h')
-      71.9±0.4μs       65.2±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'h')
-      72.0±0.2μs       65.2±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'h')
-      69.8±0.1μs      63.1±0.08μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'h')
-      6.26±0.2μs      5.66±0.01μs     0.90  bench_lib.Nan.time_nanmax(200, 2.0)
-      69.0±0.1μs       62.4±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'h')
-      70.2±0.1μs       63.5±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'h')
-      80.4±0.2μs       72.6±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'H')
-      78.9±0.2μs       71.1±0.5μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'H')
-      78.8±0.1μs       71.0±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'H')
-      78.0±0.1μs       70.3±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'H')
-      68.9±0.2μs       62.1±0.1μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'h')
-      79.2±0.3μs       71.3±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'H')
-      80.4±0.2μs       72.3±0.3μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'H')
-      77.3±0.2μs       69.4±0.1μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'H')
-      80.4±0.3μs       72.2±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'H')
-      6.33±0.1μs      5.68±0.02μs     0.90  bench_lib.Nan.time_nanmax(200, 90.0)
-      6.37±0.1μs      5.71±0.03μs     0.90  bench_lib.Nan.time_nanmin(200, 90.0)
-      80.0±0.2μs       71.7±0.4μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'H')
-      77.1±0.2μs       69.1±0.1μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'H')
-      77.7±0.3μs       69.5±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'H')
-      76.4±0.2μs       68.3±0.1μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'H')
-      76.4±0.2μs       68.2±0.1μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'H')
-      77.7±0.1μs      69.4±0.09μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'H')
-      79.6±0.3μs       71.0±0.2μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'H')
-      76.5±0.1μs      68.3±0.06μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'H')
-         559±2μs          492±1μs     0.88  bench_lib.Nan.time_nanargmax(200000, 2.0)
-       560±0.7μs          493±2μs     0.88  bench_lib.Nan.time_nanargmin(200000, 2.0)
-      93.9±0.5μs       81.2±0.8μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'h')
-      94.0±0.2μs       81.2±0.3μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'H')
-      93.5±0.2μs       80.3±0.4μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'H')
-      93.8±0.3μs       80.4±0.8μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'h')
-         449±2μs          385±2μs     0.86  bench_lib.Nan.time_nanargmin(200000, 0.1)
-      93.5±0.3μs       80.1±0.4μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'H')
-         446±2μs        380±0.9μs     0.85  bench_lib.Nan.time_nanargmin(200000, 0)
-         444±1μs          378±1μs     0.85  bench_lib.Nan.time_nanargmax(200000, 0)
-         449±2μs          381±2μs     0.85  bench_lib.Nan.time_nanargmax(200000, 0.1)
-      94.0±0.4μs       79.3±0.8μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'h')
-      78.1±0.3μs       65.3±0.2μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'B')
-      78.0±0.3μs       65.1±0.4μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'b')
-     78.0±0.09μs       64.8±0.3μs     0.83  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'b')
-     6.38±0.09μs      5.28±0.04μs     0.83  bench_reduce.MinMax.time_max(<class 'numpy.float32'>)
-     6.37±0.07μs      5.26±0.02μs     0.83  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)
-      85.4±0.2μs       69.2±0.2μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'B')
-      76.9±0.1μs       61.9±0.3μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'b')
-      84.4±0.2μs       67.7±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'B')
-      84.4±0.2μs       67.7±0.1μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'B')
-      77.1±0.2μs       61.8±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'b')
-      93.4±0.2μs       74.9±0.8μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'H')
-      77.1±0.2μs      61.6±0.09μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'b')
-      77.4±0.3μs       61.8±0.3μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'B')
-      84.7±0.2μs       67.5±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'B')
-      93.5±0.2μs       74.6±0.5μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'h')
-      77.1±0.2μs       61.5±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'b')
-      77.2±0.1μs       61.5±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'b')
-      77.2±0.2μs       61.5±0.1μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'b')
-      83.9±0.1μs       66.8±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'B')
-      83.5±0.1μs       66.3±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'B')
-      84.0±0.1μs      66.7±0.09μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'B')
-      77.4±0.2μs       61.4±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'B')
-      93.1±0.2μs       73.8±0.7μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'h')
-      83.1±0.1μs       65.8±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'B')
-      93.0±0.2μs         73.7±1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'H')
-      83.7±0.1μs       66.3±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'B')
-      84.1±0.1μs       66.6±0.3μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'B')
-      92.9±0.7μs       73.5±0.5μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'H')
-      83.4±0.2μs       65.9±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'B')
-      76.5±0.2μs       60.4±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'b')
-      83.2±0.2μs       65.7±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'B')
-      77.7±0.3μs       61.3±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'B')
-     83.7±0.07μs       65.9±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'B')
-      76.7±0.2μs       60.4±0.3μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'b')
-      76.7±0.2μs       60.4±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'B')
-      83.1±0.2μs       65.5±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'B')
-      83.3±0.2μs       65.6±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'B')
-      82.9±0.1μs       65.2±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'B')
-     83.0±0.08μs       65.4±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'B')
-      82.9±0.1μs       65.2±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'B')
-      76.8±0.4μs       60.4±0.3μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'B')
-     82.8±0.06μs       65.1±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'B')
-      82.8±0.1μs       65.1±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'B')
-      83.1±0.2μs      65.3±0.06μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'B')
-      76.4±0.2μs       60.0±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'b')
-     82.8±0.02μs      64.9±0.04μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'B')
-     82.8±0.06μs      64.9±0.08μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'B')
-     82.8±0.02μs       64.9±0.1μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'B')
-     82.7±0.02μs      64.9±0.08μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'B')
-     82.8±0.02μs      64.9±0.09μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'B')
-      76.5±0.2μs       59.9±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'b')
-      93.2±0.8μs         73.0±1μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'h')
-      76.6±0.1μs       59.8±0.1μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'b')
-      75.9±0.1μs       59.2±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'B')
-      76.1±0.2μs       59.3±0.4μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'B')
-      76.8±0.2μs       59.8±0.5μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'b')
-      76.1±0.1μs       59.3±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'b')
-      76.6±0.1μs       59.5±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'B')
-      76.2±0.2μs       59.0±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'B')
-     76.3±0.09μs       59.0±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'b')
-      76.0±0.3μs       58.7±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'b')
-      76.2±0.2μs       58.8±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'b')
-      76.4±0.2μs       59.0±0.4μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'b')
-      75.6±0.2μs       58.4±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'b')
-      76.4±0.2μs       58.9±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'b')
-      75.7±0.2μs       58.4±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'B')
-     75.7±0.09μs       58.3±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'B')
-      75.9±0.2μs       58.4±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'b')
-      93.2±0.3μs         71.7±1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'H')
-      75.9±0.2μs       58.4±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'b')
-      75.7±0.3μs       58.1±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'b')
-      75.5±0.2μs       58.0±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'b')
-      75.6±0.1μs       58.1±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'b')
-     75.7±0.08μs       58.1±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'B')
-     75.4±0.09μs      57.8±0.07μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'b')
-     75.9±0.08μs       58.2±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'b')
-     75.5±0.05μs       57.9±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'b')
-      75.8±0.2μs       58.1±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'B')
-      75.9±0.2μs       58.1±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'b')
-      75.8±0.1μs      58.1±0.04μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'b')
-      75.9±0.1μs      58.1±0.07μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'B')
-      75.5±0.1μs      57.8±0.09μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'b')
-      75.3±0.1μs       57.7±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'B')
-     75.7±0.08μs       58.0±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'b')
-      76.1±0.2μs       58.2±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'b')
-     75.5±0.05μs       57.7±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'b')
-      75.4±0.1μs      57.7±0.09μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'B')
-      75.5±0.1μs       57.7±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'b')
-      93.6±0.3μs         71.5±1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'h')
-      76.0±0.3μs       58.1±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'B')
-     75.4±0.06μs      57.6±0.08μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'b')
-     76.1±0.08μs      58.2±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'b')
-     75.4±0.03μs      57.5±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'B')
-     75.4±0.05μs       57.5±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'b')
-     75.3±0.07μs      57.5±0.03μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'b')
-      75.4±0.1μs      57.5±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'B')
-     75.4±0.07μs      57.5±0.04μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'b')
-      76.1±0.3μs       58.1±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'b')
-     75.5±0.08μs      57.6±0.06μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'B')
-     75.5±0.07μs      57.6±0.08μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'b')
-      75.5±0.1μs       57.5±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'B')
-     75.6±0.06μs       57.6±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'b')
-     75.4±0.04μs      57.5±0.08μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'b')
-      75.5±0.1μs       57.5±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'b')
-     75.3±0.08μs      57.4±0.06μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'b')
-     75.5±0.08μs      57.5±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'B')
-      75.4±0.1μs      57.5±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'b')
-      75.5±0.1μs      57.5±0.08μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'b')
-      75.6±0.2μs      57.6±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'B')
-      75.6±0.1μs       57.6±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'b')
-      76.2±0.1μs       58.0±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'b')
-     75.4±0.06μs      57.4±0.04μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'B')
-     75.4±0.06μs      57.3±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'b')
-     75.5±0.06μs      57.4±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'B')
-     93.0±0.09μs       69.6±0.6μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'H')
-      93.0±0.6μs       69.2±0.4μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'h')
-      93.1±0.2μs       69.2±0.8μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'h')
-      92.8±0.5μs       68.8±0.3μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'H')
-      92.9±0.2μs       67.1±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'H')
-      92.9±0.2μs       66.9±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'H')
-      93.1±0.2μs       67.0±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'H')
-      93.1±0.2μs       67.0±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'h')
-      92.9±0.1μs       66.8±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'H')
-      93.1±0.2μs       66.9±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'h')
-      93.2±0.1μs       66.8±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'h')
-      93.0±0.1μs       66.5±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'h')
-      92.5±0.2μs       65.8±0.4μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'H')
-     92.5±0.06μs       65.7±0.2μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'H')
-      92.7±0.2μs       65.6±0.4μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'h')
-      92.8±0.2μs       65.5±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'h')
-      92.7±0.1μs       65.3±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'h')
-      92.6±0.2μs       65.2±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'H')
-      92.6±0.2μs       65.1±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'H')
-      92.9±0.1μs       65.0±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'h')
-      92.1±0.2μs       64.2±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'h')
-      92.1±0.1μs       64.2±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'H')
-         537±6μs          373±1μs     0.69  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 4, 'd')
-      91.6±0.1μs       63.5±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'H')
-      91.9±0.2μs       63.7±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'H')
-      91.4±0.2μs       63.2±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'H')
-      91.6±0.1μs      63.3±0.07μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'h')
-     91.6±0.07μs       63.1±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'h')
-     91.5±0.08μs       63.0±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'h')
-     91.7±0.08μs       63.1±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'H')
-     92.2±0.06μs      63.4±0.08μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'h')
-      90.9±0.2μs       62.4±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'H')
-     90.9±0.09μs       62.4±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'h')
-     90.8±0.09μs       62.4±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'h')
-      90.9±0.1μs       62.3±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'H')
-      90.8±0.1μs      62.3±0.06μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'H')
-     90.8±0.09μs       62.2±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'h')
-         544±6μs        372±0.9μs     0.68  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 4, 'd')
-     9.33±0.05μs      6.30±0.03μs     0.67  bench_reduce.MinMax.time_min(<class 'numpy.float64'>)
-       184±0.9μs        124±0.3μs     0.67  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-     9.40±0.06μs      6.30±0.01μs     0.67  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
-         533±3μs          312±1μs     0.58  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 4, 'd')
-         531±5μs        310±0.5μs     0.58  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 4, 'd')
-         539±6μs          312±1μs     0.58  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 2, 'd')
-         539±5μs        309±0.9μs     0.57  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'd')
-         544±6μs          312±1μs     0.57  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 2, 'd')
-         544±5μs        310±0.3μs     0.57  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 4, 'd')
-         526±2μs        280±0.4μs     0.53  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 4, 'd')
-        583±30μs         306±10μs     0.53  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 4, 'd')
-         532±3μs          280±1μs     0.53  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 4, 'd')
-       139±0.4μs         72.2±1μs     0.52  bench_lib.Nan.time_nanmin(200000, 0)
-         538±7μs          280±1μs     0.52  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 1, 'd')
-        584±30μs         303±10μs     0.52  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 4, 'd')
-      142±0.09μs         73.5±1μs     0.52  bench_lib.Nan.time_nanmin(200000, 0.1)
-         544±9μs          279±1μs     0.51  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 1, 'd')
-         506±2μs          248±3μs     0.49  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 4, 'd')
-         512±2μs          248±2μs     0.48  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 4, 'd')
-         520±4μs        249±0.7μs     0.48  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 2, 'd')
-         524±2μs        250±0.6μs     0.48  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 2, 'd')
-         526±2μs        249±0.8μs     0.47  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 2, 'd')
-         531±2μs        248±0.3μs     0.47  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 2, 'd')
-      70.6±0.2μs         31.5±1μs     0.45  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'I')
-      71.0±0.2μs       31.3±0.7μs     0.44  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'i')
-        540±20μs          236±8μs     0.44  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 4, 'd')
-        540±30μs         236±10μs     0.44  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 4, 'd')
-        545±20μs         236±10μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 4, 'd')
-      70.9±0.2μs       30.7±0.7μs     0.43  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'i')
-        545±20μs          236±8μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 4, 'd')
-       507±0.9μs        217±0.6μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 2, 'd')
-         509±2μs        218±0.6μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 2, 'd')
-         507±5μs          217±1μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 1, 'd')
-       514±0.4μs          217±1μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 2, 'd')
-         515±2μs        218±0.5μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 1, 'd')
-         514±5μs        217±0.8μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 1, 'd')
-         516±2μs        217±0.5μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 2, 'd')
-         520±2μs        216±0.7μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 1, 'd')
-       177±0.3μs       72.9±0.6μs     0.41  bench_lib.Nan.time_nanmin(200000, 2.0)
-      78.2±0.1μs       31.0±0.6μs     0.40  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'I')
-        519±20μs          198±8μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 4, 'd')
-        527±20μs          199±6μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 4, 'd')
-       497±0.8μs          186±1μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 4, 'f')
-       494±0.4μs        185±0.4μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 2, 'd')
-         496±2μs          185±1μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 4, 'f')
-         500±2μs        185±0.4μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 1, 'd')
-       501±0.5μs        185±0.3μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 2, 'd')
-       504±0.6μs        186±0.3μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 1, 'd')
-         507±1μs        185±0.7μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 1, 'd')
-       203±0.2μs       73.3±0.5μs     0.36  bench_lib.Nan.time_nanmax(200000, 0.1)
-         512±2μs        185±0.3μs     0.36  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 1, 'd')
-     15.3±0.05μs      5.52±0.03μs     0.36  bench_reduce.MinMax.time_min(<class 'numpy.int64'>)
-     15.3±0.07μs      5.51±0.03μs     0.36  bench_reduce.MinMax.time_max(<class 'numpy.uint64'>)
-     15.3±0.08μs      5.49±0.03μs     0.36  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
-       199±0.2μs         70.9±1μs     0.36  bench_lib.Nan.time_nanmax(200000, 0)
-         486±2μs        155±0.4μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 2, 'd')
-       488±0.4μs       155±0.09μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'f')
-       489±0.5μs        155±0.3μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 4, 'f')
-         490±2μs        155±0.4μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 4, 'f')
-         490±1μs        155±0.6μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 4, 'f')
-         488±1μs        154±0.2μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 2, 'd')
-     13.7±0.01μs      4.33±0.05μs     0.32  bench_reduce.FMinMax.time_min(<class 'numpy.float64'>)
-       494±0.8μs        155±0.8μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 2, 'd')
-         494±1μs        155±0.6μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 2, 'f')
-         494±1μs        155±0.7μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 2, 'f')
-       231±0.2μs         72.3±1μs     0.31  bench_lib.Nan.time_nanmax(200000, 2.0)
-       493±0.8μs        154±0.4μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 1, 'd')
-         495±1μs        155±0.6μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 2, 'd')
-       501±0.4μs        154±0.3μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 1, 'd')
-     15.3±0.08μs      4.47±0.02μs     0.29  bench_reduce.MinMax.time_max(<class 'numpy.int32'>)
-     15.3±0.03μs      4.46±0.03μs     0.29  bench_reduce.MinMax.time_max(<class 'numpy.uint32'>)
-     15.4±0.04μs      4.47±0.01μs     0.29  bench_reduce.MinMax.time_min(<class 'numpy.int32'>)
-         487±1μs        140±0.5μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 4, 'f')
-         486±2μs        140±0.6μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 4, 'f')
-       487±0.8μs        139±0.6μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 4, 'f')
-       488±0.3μs        140±0.6μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 4, 'f')
-       493±0.8μs        139±0.6μs     0.28  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 1, 'f')
-         494±2μs        139±0.4μs     0.28  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 1, 'f')
-     21.2±0.09μs      5.52±0.03μs     0.26  bench_reduce.MinMax.time_min(<class 'numpy.uint64'>)
-       482±0.5μs        125±0.3μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 2, 'd')
-     15.3±0.04μs      3.96±0.04μs     0.26  bench_reduce.MinMax.time_max(<class 'numpy.uint16'>)
-     15.3±0.04μs      3.95±0.02μs     0.26  bench_reduce.MinMax.time_max(<class 'numpy.int16'>)
-       484±0.4μs        125±0.3μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 4, 'f')
-       485±0.6μs        125±0.4μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 4, 'f')
-       489±0.6μs        126±0.9μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 2, 'f')
-     15.3±0.05μs      3.92±0.02μs     0.26  bench_reduce.MinMax.time_min(<class 'numpy.int16'>)
-         489±2μs        125±0.8μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 2, 'f')
-       488±0.8μs        125±0.4μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 2, 'f')
-       488±0.6μs        125±0.2μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 2, 'f')
-       486±0.5μs        124±0.2μs     0.25  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 1, 'd')
-       491±0.8μs        125±0.9μs     0.25  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 2, 'd')
-       487±0.8μs        124±0.2μs     0.25  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 1, 'd')
-         494±1μs        124±0.6μs     0.25  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 1, 'd')
-       495±0.3μs        124±0.4μs     0.25  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 1, 'd')
-     15.3±0.06μs      3.78±0.01μs     0.25  bench_reduce.MinMax.time_max(<class 'numpy.uint8'>)
-     15.3±0.05μs      3.71±0.02μs     0.24  bench_reduce.MinMax.time_min(<class 'numpy.int8'>)
-     15.4±0.05μs      3.70±0.02μs     0.24  bench_reduce.MinMax.time_max(<class 'numpy.int8'>)
-     13.7±0.02μs      3.23±0.05μs     0.24  bench_reduce.FMinMax.time_min(<class 'numpy.float32'>)
-     13.7±0.01μs      3.19±0.04μs     0.23  bench_reduce.FMinMax.time_max(<class 'numpy.float32'>)
-       482±0.5μs        110±0.4μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 4, 'f')
-       481±0.3μs        110±0.3μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 4, 'f')
-       483±0.7μs        110±0.2μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 4, 'f')
-       482±0.3μs        109±0.3μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 4, 'f')
-       486±0.8μs        109±0.2μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 2, 'f')
-         488±1μs        110±0.3μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 1, 'f')
-       485±0.8μs        109±0.3μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 2, 'f')
-       487±0.4μs        109±0.3μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 1, 'f')
-         486±1μs        109±0.3μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 2, 'f')
-       487±0.6μs        109±0.3μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 2, 'f')
-       488±0.5μs        109±0.4μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 1, 'f')
-         489±2μs        109±0.3μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 1, 'f')
-     19.6±0.01μs      4.33±0.04μs     0.22  bench_reduce.FMinMax.time_max(<class 'numpy.float64'>)
-     21.2±0.04μs      4.45±0.02μs     0.21  bench_reduce.MinMax.time_min(<class 'numpy.uint32'>)
-       481±0.4μs       98.1±0.5μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 1, 'd')
-       489±0.7μs       98.3±0.5μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 1, 'd')
-        497±10μs         98.3±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 4, 'f')
-        497±10μs         98.1±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 4, 'f')
-       483±0.3μs         94.7±1μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 2, 'f')
-       483±0.4μs         94.5±1μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 2, 'f')
-         486±1μs       93.6±0.2μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 1, 'f')
-       486±0.4μs       93.5±0.6μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 1, 'f')
-       485±0.7μs       93.3±0.2μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 1, 'f')
-       486±0.8μs       93.4±0.6μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 1, 'f')
-     21.2±0.04μs      3.98±0.01μs     0.19  bench_reduce.MinMax.time_min(<class 'numpy.uint16'>)
-     21.3±0.06μs      3.73±0.02μs     0.18  bench_reduce.MinMax.time_min(<class 'numpy.uint8'>)
-       481±0.8μs       79.2±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 2, 'f')
-       481±0.2μs       78.8±0.4μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 2, 'f')
-       481±0.3μs       78.7±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 2, 'f')
-       483±0.5μs       79.1±0.2μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 1, 'f')
-       481±0.6μs       78.6±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 2, 'f')
-       483±0.2μs       79.0±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 1, 'f')
-       524±0.7μs         74.2±1μs     0.14  bench_lib.Nan.time_nanmax(200000, 90.0)
-       529±0.5μs       72.7±0.3μs     0.14  bench_lib.Nan.time_nanmin(200000, 90.0)
-       478±0.5μs       62.6±0.6μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 2, 'f')
-       478±0.3μs       61.6±0.9μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 2, 'f')
-       481±0.5μs       59.1±0.9μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 1, 'f')
-       480±0.7μs       58.9±0.8μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 1, 'f')
-       480±0.3μs       58.6±0.7μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 1, 'f')
-       480±0.8μs       58.4±0.3μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 1, 'f')
-     68.0±0.04μs       8.15±0.2μs     0.12  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'h')
-      75.7±0.1μs       8.48±0.2μs     0.11  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'H')
-      90.4±0.1μs       8.35±0.3μs     0.09  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'H')
-      90.3±0.1μs       8.29±0.1μs     0.09  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'h')
-        1.01±0ms       74.0±0.5μs     0.07  bench_lib.Nan.time_nanmax(200000, 50.0)
-       477±0.7μs         34.3±1μs     0.07  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 1, 'f')
-        1.01±0ms       72.6±0.5μs     0.07  bench_lib.Nan.time_nanmin(200000, 50.0)
-         476±1μs       33.8±0.6μs     0.07  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 1, 'f')
-     75.3±0.03μs       4.57±0.1μs     0.06  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'b')
-     75.3±0.05μs       4.55±0.2μs     0.06  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'b')
-     75.3±0.02μs       4.55±0.1μs     0.06  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'B')
-     82.7±0.05μs       4.57±0.1μs     0.06  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'B')
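For anyone reading these tables: the `ratio` column is the "after" time divided by the "before" time, so a leading `-` with a ratio below 1 is an improvement and a leading `+` is a regression. Below is a minimal sketch that recomputes the ratio and the equivalent speedup from one row copied from the table above. The helper names (`to_seconds`, `parse_row`) are hypothetical and not part of asv or NumPy, and the parsing assumes the column layout shown here.

```python
import re

# One row copied verbatim from the benchmark output above.
ROW = "-     82.7±0.05μs       4.57±0.1μs     0.06  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'B')"

# Unit suffixes as they appear in the output, converted to seconds.
UNITS = {"ns": 1e-9, "μs": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(cell):
    """Convert a cell such as '82.7±0.05μs' to seconds, ignoring the ± spread."""
    m = re.fullmatch(r"([\d.]+)±[\d.]+(ns|μs|ms|s)", cell)
    if m is None:
        raise ValueError(f"unrecognised timing cell: {cell!r}")
    return float(m.group(1)) * UNITS[m.group(2)]


def parse_row(row):
    """Return (before_seconds, after_seconds, reported_ratio, benchmark_name)."""
    # Columns: marker, before, after, ratio, then the benchmark name (may contain spaces).
    _marker, before, after, ratio, name = row.split(None, 4)
    return to_seconds(before), to_seconds(after), float(ratio), name


before, after, reported, name = parse_row(ROW)
computed = after / before   # what the ratio column reports
speedup = before / after    # the same change expressed as "N times faster"
print(f"{name}: ratio {computed:.2f} (reported {reported}), {speedup:.1f}x faster")
```

The speedup is simply the reciprocal of the ratio, which can make the small ratios at the bottom of the table easier to compare.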
AVX2
export NPY_DISABLE_CPU_FEATURES="AVX512F AVX512_SKX"
python runtests.py -n --bench-compare parent/main "max|min" -- --sort ratio
+       122±0.3μs          220±2μs     1.81  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-        87.5±1μs       83.2±0.8μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'I')
-      86.6±0.5μs       82.3±0.4μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'H')
-      72.9±0.5μs       69.1±0.4μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'i')
-      72.9±0.4μs       69.0±0.3μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'I')
-       422±0.7μs          398±2μs     0.94  bench_ufunc.UFunc.time_ufunc_types('maximum')
-        74.5±1μs       69.7±0.6μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'h')
-        74.8±1μs       69.8±0.6μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'i')
-      74.2±0.5μs       69.2±0.7μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'h')
-      82.6±0.8μs       77.0±0.7μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'H')
-      73.8±0.4μs      68.8±0.07μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'i')
-      83.1±0.6μs       77.2±0.6μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'H')
-        81.8±1μs         76.0±1μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'H')
-         983±3μs          911±5μs     0.93  bench_lib.Nan.time_nanargmax(200000, 90.0)
-      75.1±0.8μs       69.6±0.9μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'i')
-      72.6±0.3μs       67.2±0.5μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'h')
-      73.0±0.1μs       67.5±0.3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'h')
-      71.3±0.2μs       65.9±0.1μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'h')
-      71.5±0.3μs       66.1±0.4μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'h')
-      80.1±0.4μs       73.9±0.4μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'H')
-         988±6μs          911±3μs     0.92  bench_lib.Nan.time_nanargmin(200000, 90.0)
-      80.9±0.7μs       74.6±0.8μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'H')
-      73.2±0.5μs       67.4±0.2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'h')
-      73.1±0.2μs       67.1±0.4μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'h')
-      82.0±0.9μs         75.0±1μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'I')
-      80.1±0.3μs      73.3±0.08μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'I')
-      71.7±0.2μs       65.4±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'h')
-      69.6±0.1μs       63.5±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'h')
-      70.2±0.2μs       64.0±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'h')
-      81.8±0.4μs       74.5±0.6μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'I')
-      68.7±0.1μs       62.6±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'h')
-     70.0±0.03μs       63.7±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'h')
-      68.7±0.2μs       62.5±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'h')
-     69.8±0.08μs       63.4±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'h')
-      80.2±0.3μs       72.9±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'H')
-      80.2±0.3μs       72.8±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'H')
-       419±0.7μs       380±0.5μs     0.91  bench_ufunc.UFunc.time_ufunc_types('minimum')
-      71.0±0.3μs       64.3±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'h')
-      68.9±0.2μs       62.4±0.1μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'h')
-      79.8±0.2μs       72.0±0.5μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'H')
-      79.1±0.3μs       71.4±0.5μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'H')
-      78.9±0.2μs       71.2±0.4μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'H')
-      72.1±0.2μs       65.0±0.6μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'h')
-      79.4±0.1μs       71.6±0.3μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'H')
-      80.6±0.1μs       72.6±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'H')
-         543±2μs          489±2μs     0.90  bench_lib.Nan.time_nanargmax(200000, 2.0)
-      77.3±0.1μs      69.4±0.07μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'H')
-      78.4±0.2μs       70.3±0.3μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'H')
-      79.7±0.4μs       71.4±0.2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'H')
-      76.4±0.1μs       68.4±0.1μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'H')
-      77.7±0.2μs      69.6±0.07μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'H')
-      77.9±0.2μs       69.8±0.1μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'H')
-      76.5±0.2μs       68.4±0.2μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'H')
-      76.6±0.1μs       68.4±0.2μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'H')
-     77.6±0.08μs       69.3±0.1μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'H')
-         548±4μs        485±0.9μs     0.89  bench_lib.Nan.time_nanargmin(200000, 2.0)
-     6.29±0.01μs       5.57±0.1μs     0.89  bench_reduce.MinMax.time_max(<class 'numpy.float32'>)
-        6.28±0μs      5.54±0.09μs     0.88  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)
-         435±2μs          376±2μs     0.86  bench_lib.Nan.time_nanargmax(200000, 0)
-         438±1μs          378±2μs     0.86  bench_lib.Nan.time_nanargmax(200000, 0.1)
-      94.5±0.3μs       81.3±0.6μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'H')
-         436±2μs          375±2μs     0.86  bench_lib.Nan.time_nanargmin(200000, 0)
-      94.1±0.3μs       80.8±0.5μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'h')
-      93.8±0.5μs         80.3±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'h')
-      94.1±0.3μs       80.5±0.2μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'H')
-         442±2μs          378±2μs     0.86  bench_lib.Nan.time_nanargmin(200000, 0.1)
-      93.9±0.2μs       80.0±0.5μs     0.85  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'h')
-      94.2±0.6μs       80.2±0.6μs     0.85  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'H')
-      78.3±0.3μs       65.1±0.4μs     0.83  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'b')
-      77.8±0.2μs       64.6±0.2μs     0.83  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'b')
-      78.7±0.3μs       64.8±0.2μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'B')
-     9.28±0.01μs       7.64±0.1μs     0.82  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
-     9.23±0.01μs       7.59±0.1μs     0.82  bench_reduce.MinMax.time_min(<class 'numpy.float64'>)
-      85.3±0.1μs       69.2±0.2μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'B')
-      77.0±0.2μs       61.8±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'b')
-      84.6±0.1μs       67.8±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'B')
-      84.7±0.1μs      67.8±0.08μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'B')
-      84.6±0.5μs       67.7±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'B')
-      93.1±0.2μs         74.5±1μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'H')
-      93.4±0.7μs       74.6±0.9μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'H')
-      77.3±0.1μs       61.7±0.4μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'b')
-      77.2±0.2μs       61.6±0.2μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'b')
-      93.8±0.2μs         74.8±1μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'H')
-      93.4±0.2μs       74.5±0.8μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'h')
-      77.5±0.2μs       61.7±0.3μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'B')
-      94.0±0.3μs       74.9±0.8μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'h')
-      77.5±0.1μs      61.7±0.07μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'b')
-      84.0±0.2μs       66.9±0.4μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'B')
-     77.5±0.06μs       61.5±0.7μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'B')
-      84.2±0.2μs       66.8±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'B')
-      77.5±0.3μs       61.4±0.3μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'b')
-      83.7±0.1μs       66.3±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'B')
-      84.3±0.2μs       66.8±0.3μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'B')
-      93.2±0.6μs       73.8±0.8μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'h')
-      77.4±0.3μs       61.3±0.3μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'b')
-     83.6±0.05μs       66.1±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'B')
-      83.2±0.1μs       65.7±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'B')
-      76.6±0.1μs       60.5±0.6μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'b')
-     76.7±0.09μs       60.5±0.4μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'b')
-      77.8±0.2μs       61.4±0.3μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'B')
-      83.4±0.2μs       65.8±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'B')
-      76.6±0.2μs       60.5±0.4μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'b')
-      83.3±0.1μs       65.7±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'B')
-      83.3±0.3μs       65.7±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'B')
-      83.7±0.1μs       66.0±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'B')
-      83.2±0.1μs       65.5±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'B')
-      77.0±0.4μs       60.6±0.4μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'B')
-      83.1±0.1μs       65.3±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'B')
-     82.8±0.05μs      65.1±0.06μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'B')
-      76.8±0.1μs       60.4±0.6μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'b')
-      83.1±0.2μs      65.3±0.06μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'B')
-     82.8±0.02μs      65.1±0.08μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'B')
-      82.9±0.3μs       65.2±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'B')
-     82.8±0.04μs      65.0±0.04μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'B')
-     82.8±0.02μs      65.0±0.06μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'B')
-      83.5±0.1μs       65.5±0.1μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'B')
-      83.0±0.2μs      65.0±0.03μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'B')
-      83.3±0.2μs       65.2±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'B')
-     82.9±0.06μs      64.8±0.03μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'B')
-      76.8±0.2μs       60.1±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'B')
-      76.0±0.2μs       59.3±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'b')
-     76.1±0.09μs       59.0±0.4μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'b')
-      76.3±0.2μs       59.1±0.4μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'b')
-      77.3±0.3μs       59.8±0.6μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'b')
-      76.5±0.2μs       59.2±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'b')
-     76.1±0.07μs       58.8±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'b')
-      77.3±0.2μs       59.7±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'b')
-      77.3±0.3μs       59.7±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'B')
-      76.4±0.2μs       58.9±0.4μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'B')
-      76.4±0.2μs       58.9±0.5μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'b')
-      75.7±0.1μs       58.3±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'B')
-      75.8±0.2μs       58.3±0.6μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'B')
-      76.5±0.1μs       58.9±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'B')
-      76.5±0.1μs       58.8±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'B')
-      76.2±0.3μs       58.6±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'B')
-      75.9±0.2μs       58.3±0.4μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'b')
-      93.6±0.2μs       72.0±0.8μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'H')
-      76.0±0.2μs       58.4±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'B')
-      75.9±0.1μs       58.3±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'b')
-      75.7±0.2μs       58.2±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'b')
-      76.0±0.2μs       58.3±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'b')
-      75.9±0.2μs       58.2±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'b')
-      75.9±0.2μs       58.2±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'b')
-      75.8±0.2μs       58.1±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'b')
-      76.0±0.2μs       58.2±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'B')
-      75.5±0.1μs       57.8±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'B')
-      75.9±0.3μs       58.1±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'b')
-      75.5±0.1μs       57.8±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'b')
-      75.6±0.1μs       57.9±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'B')
-     75.4±0.03μs       57.6±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'b')
-      75.4±0.1μs      57.7±0.06μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'B')
-     75.4±0.03μs      57.6±0.06μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'B')
-      75.6±0.1μs       57.8±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'b')
-      76.0±0.2μs       58.1±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'B')
-     75.4±0.04μs      57.6±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'b')
-      76.0±0.2μs       58.0±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'b')
-      76.0±0.2μs       58.1±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'b')
-     75.4±0.02μs      57.6±0.04μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'b')
-     75.4±0.06μs      57.6±0.09μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'B')
-     75.5±0.03μs      57.6±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'b')
-     75.5±0.04μs       57.6±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'b')
-      75.5±0.2μs      57.6±0.06μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'B')
-     75.9±0.07μs       57.9±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'b')
-     75.4±0.07μs      57.6±0.02μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'B')
-     75.4±0.03μs      57.5±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'B')
-      76.0±0.2μs       58.0±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'b')
-     75.4±0.04μs      57.5±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'B')
-     75.4±0.06μs       57.5±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'b')
-     75.5±0.06μs      57.6±0.03μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'b')
-     75.6±0.06μs      57.6±0.09μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'b')
-      75.6±0.1μs      57.7±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'b')
-     75.4±0.03μs      57.5±0.03μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'b')
-      93.2±0.2μs         71.0±1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'h')
-     75.3±0.04μs      57.4±0.06μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'b')
-     75.4±0.06μs      57.5±0.06μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'b')
-     75.7±0.07μs      57.7±0.05μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'b')
-      75.8±0.1μs      57.8±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'b')
-     75.6±0.07μs      57.6±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'b')
-      75.7±0.1μs       57.6±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'B')
-      75.6±0.1μs      57.6±0.04μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'b')
-      75.5±0.1μs      57.5±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'b')
-     75.5±0.07μs      57.4±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'b')
-      93.3±0.2μs       70.2±0.8μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'H')
-      93.2±0.2μs       70.0±0.4μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'h')
-      93.1±0.4μs       69.4±0.4μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'h')
-      93.2±0.6μs       69.3±0.6μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'H')
-      92.9±0.1μs       67.4±0.2μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'H')
-      93.3±0.2μs       67.4±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'H')
-      93.0±0.3μs       67.2±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'h')
-     93.0±0.08μs       67.1±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'h')
-      93.1±0.3μs       67.0±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'h')
-      93.1±0.2μs       67.1±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'h')
-      93.2±0.2μs       67.1±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'H')
-         529±3μs          381±5μs     0.72  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 4, 'd')
-      93.3±0.2μs       67.1±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'H')
-      92.5±0.1μs       66.2±0.5μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'H')
-      92.9±0.4μs       66.2±0.4μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'h')
-      92.6±0.1μs       65.9±0.2μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'H')
-      92.9±0.2μs       66.0±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'h')
-      92.9±0.2μs       65.7±0.3μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'H')
-      92.6±0.2μs       65.4±0.3μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'h')
-         544±4μs          384±6μs     0.71  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 4, 'd')
-      92.8±0.1μs       65.3±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'H')
-      92.9±0.2μs       65.1±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'h')
-     92.3±0.08μs       64.6±0.3μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'H')
-     91.8±0.08μs       64.0±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'H')
-      92.4±0.1μs       64.4±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'h')
-     91.7±0.06μs       63.7±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'H')
-     92.0±0.09μs      63.8±0.07μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'h')
-      91.7±0.2μs       63.5±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'h')
-      91.5±0.1μs       63.3±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'H')
-      91.8±0.1μs      63.5±0.08μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'h')
-      91.5±0.1μs       63.3±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'H')
-      90.8±0.1μs       62.5±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'H')
-      92.0±0.2μs       63.4±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'h')
-     90.8±0.08μs       62.4±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'H')
-      90.8±0.1μs       62.5±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'H')
-      90.9±0.2μs       62.4±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'h')
-     91.0±0.09μs       62.4±0.2μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'h')
-      91.0±0.1μs       62.3±0.2μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'h')
-         184±1μs        124±0.2μs     0.67  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-         527±5μs          318±3μs     0.60  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 4, 'd')
-         530±3μs          319±2μs     0.60  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 4, 'd')
-         531±3μs          318±1μs     0.60  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 2, 'd')
-         644±4μs          380±8μs     0.59  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 4, 'd')
-         541±5μs          318±2μs     0.59  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 2, 'd')
-         538±5μs          314±4μs     0.58  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'd')
-         652±4μs          378±6μs     0.58  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 4, 'd')
-         544±4μs          314±6μs     0.58  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 4, 'd')
-       134±0.2μs         76.2±1μs     0.57  bench_lib.Nan.time_nanmin(200000, 0)
-       138±0.2μs       78.0±0.6μs     0.57  bench_lib.Nan.time_nanmin(200000, 0.1)
-        572±30μs         315±10μs     0.55  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 4, 'd')
-         523±3μs          285±1μs     0.54  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 4, 'd')
-        527±10μs        286±0.9μs     0.54  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 1, 'd')
-        537±10μs          285±2μs     0.53  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 1, 'd')
-         533±1μs          282±4μs     0.53  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 4, 'd')
-     15.2±0.02μs       7.90±0.1μs     0.52  bench_reduce.MinMax.time_max(<class 'numpy.uint64'>)
-        592±30μs         306±10μs     0.52  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 4, 'd')
-         504±2μs          257±3μs     0.51  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 4, 'd')
-         637±5μs          316±4μs     0.50  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 2, 'd')
-         632±5μs          314±3μs     0.50  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 4, 'd')
-         637±6μs          315±5μs     0.49  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 4, 'd')
-         514±3μs          253±5μs     0.49  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 4, 'd')
-         515±3μs          253±2μs     0.49  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 2, 'd')
-         522±2μs          254±1μs     0.49  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 2, 'd')
-         657±4μs          316±1μs     0.48  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 2, 'd')
-         525±3μs          250±2μs     0.48  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 2, 'd')
-         657±5μs        313±0.8μs     0.48  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 4, 'd')
-         660±4μs          313±3μs     0.47  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 4, 'd')
-         533±4μs          252±3μs     0.47  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 2, 'd')
-      70.7±0.2μs         32.0±1μs     0.45  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'I')
-       172±0.1μs       77.7±0.2μs     0.45  bench_lib.Nan.time_nanmin(200000, 2.0)
-        540±30μs         244±10μs     0.45  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 4, 'd')
-        535±30μs         241±10μs     0.45  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 4, 'd')
-     15.2±0.01μs       6.83±0.1μs     0.45  bench_reduce.MinMax.time_min(<class 'numpy.int64'>)
-      70.7±0.4μs       31.7±0.3μs     0.45  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'i')
-      70.9±0.4μs       31.8±0.7μs     0.45  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'i')
-         629±2μs        280±0.8μs     0.45  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 4, 'd')
-     15.2±0.03μs       6.77±0.1μs     0.44  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
-        639±10μs          283±3μs     0.44  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 1, 'd')
-        546±30μs         239±10μs     0.44  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 4, 'd')
-     13.7±0.02μs      5.99±0.06μs     0.44  bench_reduce.FMinMax.time_min(<class 'numpy.float64'>)
-        693±40μs         303±10μs     0.44  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 4, 'd')
-         505±5μs        220±0.4μs     0.44  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 1, 'd')
-         506±2μs        220±0.8μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 2, 'd')
-         511±2μs          221±1μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 1, 'd')
-         507±1μs        219±0.7μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 2, 'd')
-        547±20μs         236±10μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 4, 'd')
-         652±2μs        281±0.9μs     0.43  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 4, 'd')
-         659±6μs          283±2μs     0.43  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 1, 'd')
-        721±40μs         309±10μs     0.43  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 4, 'd')
-       515±0.5μs          218±3μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 2, 'd')
-         517±4μs          219±2μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 1, 'd')
-         517±3μs          219±2μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 2, 'd')
-         522±2μs          219±1μs     0.42  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 1, 'd')
-         606±1μs          251±3μs     0.41  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 4, 'd')
-      78.2±0.3μs       31.7±0.3μs     0.40  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'I')
-         624±3μs          249±3μs     0.40  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 2, 'd')
-       196±0.3μs       78.2±0.8μs     0.40  bench_lib.Nan.time_nanmax(200000, 0.1)
-         634±2μs          252±3μs     0.40  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 4, 'd')
-         629±2μs          250±2μs     0.40  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 2, 'd')
-       193±0.4μs       76.6±0.6μs     0.40  bench_lib.Nan.time_nanmax(200000, 0)
-        517±20μs          201±7μs     0.39  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 4, 'd')
-         649±2μs          250±2μs     0.39  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 2, 'd')
-       495±0.5μs          190±1μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 4, 'f')
-         654±1μs        250±0.9μs     0.38  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 2, 'd')
-         494±1μs          188±1μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 4, 'f')
-       492±0.5μs        187±0.4μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 2, 'd')
-       498±0.9μs          188±2μs     0.38  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 4, 'f')
-     21.1±0.02μs       7.94±0.1μs     0.38  bench_reduce.MinMax.time_min(<class 'numpy.uint64'>)
-        528±20μs          199±7μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 4, 'd')
-       500±0.8μs        188±0.3μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 1, 'd')
-         502±1μs        187±0.4μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 1, 'd')
-       502±0.5μs        186±0.9μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 2, 'd')
-         509±2μs          187±1μs     0.37  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 1, 'd')
-       511±0.5μs          186±2μs     0.36  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 1, 'd')
-        651±30μs         237±10μs     0.36  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 4, 'd')
-        654±30μs          238±9μs     0.36  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 4, 'd')
-         614±4μs          218±1μs     0.36  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 1, 'd')
-       612±0.6μs        218±0.6μs     0.36  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 2, 'd')
-         614±3μs          218±2μs     0.36  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 2, 'd')
-       617±0.9μs          218±1μs     0.35  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 1, 'd')
-       224±0.2μs       78.3±0.2μs     0.35  bench_lib.Nan.time_nanmax(200000, 2.0)
-        682±40μs         236±10μs     0.35  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 4, 'd')
-        684±30μs          235±6μs     0.34  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 4, 'd')
-         638±3μs          218±1μs     0.34  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 2, 'd')
-         640±4μs        219±0.8μs     0.34  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 1, 'd')
-         637±1μs          218±1μs     0.34  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 2, 'd')
-       644±0.9μs          219±2μs     0.34  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 1, 'd')
-     15.2±0.02μs       4.93±0.1μs     0.32  bench_reduce.MinMax.time_max(<class 'numpy.int32'>)
-       488±0.7μs        157±0.6μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 4, 'f')
-       489±0.7μs        158±0.7μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 4, 'f')
-       490±0.6μs          158±2μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 4, 'f')
-         486±1μs        156±0.6μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 2, 'd')
-       493±0.6μs        158±0.7μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 2, 'f')
-     15.3±0.01μs       4.89±0.1μs     0.32  bench_reduce.MinMax.time_max(<class 'numpy.uint32'>)
-     15.2±0.03μs       4.88±0.2μs     0.32  bench_reduce.MinMax.time_min(<class 'numpy.int32'>)
-         489±1μs          156±1μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 2, 'd')
-         493±1μs        158±0.3μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 1, 'd')
-         493±1μs          157±2μs     0.32  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 4, 'f')
-         590±1μs          188±2μs     0.32  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 4, 'f')
-       489±0.4μs          156±1μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'f')
-       492±0.7μs          157±1μs     0.32  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 4, 'f')
-       494±0.8μs        157±0.6μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 2, 'f')
-        626±30μs          198±6μs     0.32  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 4, 'd')
-       497±0.9μs        157±0.8μs     0.32  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 2, 'f')
-         496±1μs          156±1μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 2, 'd')
-         496±1μs        156±0.8μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 2, 'd')
-       594±0.7μs          186±1μs     0.31  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 2, 'd')
-       501±0.3μs          156±2μs     0.31  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 1, 'd')
-         606±8μs        186±0.7μs     0.31  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 1, 'd')
-         608±1μs          186±1μs     0.31  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 1, 'd')
-     19.6±0.02μs      5.99±0.07μs     0.31  bench_reduce.FMinMax.time_max(<class 'numpy.float64'>)
-        660±30μs          200±5μs     0.30  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 4, 'd')
-         622±2μs        186±0.8μs     0.30  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 2, 'd')
-         635±3μs        187±0.4μs     0.30  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 1, 'd')
-         633±2μs        186±0.6μs     0.29  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 1, 'd')
-         487±1μs        142±0.5μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 4, 'f')
-         489±2μs          143±1μs     0.29  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 4, 'f')
-         487±1μs          142±2μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 4, 'f')
-       488±0.5μs        142±0.3μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 4, 'f')
-       490±0.7μs        142±0.8μs     0.29  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 4, 'f')
-       487±0.7μs          141±1μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 4, 'f')
-     13.7±0.04μs      3.95±0.06μs     0.29  bench_reduce.FMinMax.time_min(<class 'numpy.float32'>)
-       493±0.9μs        141±0.3μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 1, 'f')
-     13.7±0.03μs      3.90±0.05μs     0.29  bench_reduce.FMinMax.time_max(<class 'numpy.float32'>)
-       492±0.5μs          140±1μs     0.29  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 1, 'f')
-         496±1μs          140±1μs     0.28  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 1, 'f')
-     15.2±0.01μs       4.08±0.1μs     0.27  bench_reduce.MinMax.time_min(<class 'numpy.int16'>)
-     15.2±0.02μs       4.08±0.1μs     0.27  bench_reduce.MinMax.time_max(<class 'numpy.uint16'>)
-     15.2±0.02μs       4.05±0.1μs     0.27  bench_reduce.MinMax.time_max(<class 'numpy.int16'>)
-         585±2μs          157±1μs     0.27  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 4, 'f')
-       586±0.9μs        156±0.6μs     0.27  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 4, 'f')
-         590±1μs        156±0.5μs     0.26  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 2, 'f')
-       587±0.9μs        155±0.5μs     0.26  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 2, 'd')
-         589±1μs        155±0.6μs     0.26  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 2, 'd')
-       487±0.6μs        128±0.7μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 2, 'f')
-       484±0.2μs        126±0.4μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 4, 'f')
-       594±0.9μs          155±1μs     0.26  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 1, 'd')
-         488±1μs        127±0.3μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 2, 'f')
-       483±0.5μs          126±2μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 2, 'd')
-         488±1μs          127±1μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 2, 'f')
-         492±2μs          128±1μs     0.26  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 2, 'f')
-       485±0.6μs          126±1μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 4, 'f')
-         489±2μs        126±0.9μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 2, 'f')
-         487±1μs        126±0.6μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 1, 'd')
-       486±0.3μs        125±0.5μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 1, 'd')
-       487±0.5μs        125±0.2μs     0.26  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 4, 'f')
-         491±1μs        126±0.7μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 2, 'd')
-         490±1μs          126±1μs     0.26  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 2, 'f')
-       495±0.6μs          125±1μs     0.25  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 1, 'd')
-       496±0.3μs          125±1μs     0.25  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 1, 'd')
-       617±0.8μs        155±0.5μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 2, 'd')
-         617±1μs        155±0.3μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 2, 'd')
-         621±1μs        155±0.2μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 1, 'd')
-     15.2±0.01μs       3.74±0.1μs     0.25  bench_reduce.MinMax.time_max(<class 'numpy.uint8'>)
-     15.2±0.01μs       3.74±0.1μs     0.25  bench_reduce.MinMax.time_max(<class 'numpy.int8'>)
-         582±2μs        142±0.3μs     0.24  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 4, 'f')
-     15.2±0.02μs       3.70±0.1μs     0.24  bench_reduce.MinMax.time_min(<class 'numpy.int8'>)
-       585±0.9μs        142±0.6μs     0.24  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 4, 'f')
-         484±3μs        117±0.6μs     0.24  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 4, 'f')
-       484±0.7μs          115±1μs     0.24  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 4, 'f')
-         589±1μs          140±1μs     0.24  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 1, 'f')
-       487±0.8μs        112±0.8μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 2, 'f')
-     21.1±0.01μs       4.87±0.2μs     0.23  bench_reduce.MinMax.time_min(<class 'numpy.uint32'>)
-       482±0.4μs        111±0.4μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 4, 'f')
-       486±0.5μs        112±0.8μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 2, 'f')
-       482±0.7μs        111±0.9μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 4, 'f')
-         482±1μs        111±0.8μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 4, 'f')
-       481±0.7μs        111±0.7μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 4, 'f')
-       488±0.7μs        112±0.6μs     0.23  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 2, 'f')
-       485±0.9μs        111±0.6μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 2, 'f')
-       489±0.5μs        112±0.5μs     0.23  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 2, 'f')
-       485±0.7μs          111±1μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 2, 'f')
-       488±0.3μs        111±0.4μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 1, 'f')
-         488±1μs        110±0.4μs     0.23  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 1, 'f')
-       489±0.6μs          110±1μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 1, 'f')
-         489±2μs        110±0.7μs     0.22  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 1, 'f')
-       491±0.7μs        110±0.3μs     0.22  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 1, 'f')
-         491±2μs        109±0.4μs     0.22  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 1, 'f')
-         585±2μs          126±1μs     0.22  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 2, 'f')
-       581±0.7μs        125±0.3μs     0.22  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 4, 'f')
-         586±1μs        126±0.3μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 2, 'f')
-         585±3μs        125±0.6μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 2, 'd')
-       588±0.4μs        125±0.7μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 1, 'd')
-        498±10μs          106±3μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 4, 'f')
-       589±0.3μs          125±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 1, 'd')
-         613±2μs        125±0.8μs     0.20  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 2, 'd')
-       616±0.9μs        125±0.3μs     0.20  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 1, 'd')
-       617±0.3μs        125±0.4μs     0.20  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 1, 'd')
-       578±0.6μs        116±0.5μs     0.20  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 4, 'f')
-         579±1μs          116±1μs     0.20  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 4, 'f')
-       483±0.6μs         96.3±1μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 2, 'f')
-       483±0.5μs         96.2±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 2, 'f')
-        497±10μs         98.4±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 4, 'f')
-       481±0.4μs       94.9±0.4μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 1, 'd')
-       486±0.1μs         95.5±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 2, 'f')
-        497±10μs         97.6±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 4, 'f')
-       485±0.8μs       94.5±0.4μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 1, 'f')
-       486±0.7μs       94.6±0.3μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 1, 'f')
-       485±0.8μs       94.1±0.7μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 1, 'f')
-     21.1±0.02μs       4.09±0.1μs     0.19  bench_reduce.MinMax.time_min(<class 'numpy.uint16'>)
-       486±0.8μs       94.0±0.7μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 1, 'f')
-       490±0.6μs       94.4±0.6μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 1, 'd')
-         583±1μs        112±0.4μs     0.19  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 2, 'f')
-         490±1μs      94.0±0.05μs     0.19  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 1, 'f')
-       489±0.8μs       93.8±0.6μs     0.19  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 1, 'f')
-         584±1μs        112±0.6μs     0.19  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 2, 'f')
-       585±0.8μs        110±0.6μs     0.19  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 1, 'f')
-         585±1μs        109±0.4μs     0.19  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 1, 'f')
-     21.1±0.02μs       3.77±0.1μs     0.18  bench_reduce.MinMax.time_min(<class 'numpy.uint8'>)
-       483±0.4μs       85.4±0.5μs     0.18  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 2, 'f')
-        598±10μs          106±3μs     0.18  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 4, 'f')
-       483±0.4μs       84.3±0.8μs     0.17  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 2, 'f')
-       481±0.4μs         83.0±1μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 2, 'f')
-       481±0.4μs         81.7±1μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 2, 'f')
-         480±1μs       81.2±0.5μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 2, 'f')
-         482±1μs         80.9±1μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 2, 'f')
-       484±0.4μs       80.6±0.2μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 1, 'f')
-       483±0.4μs       80.1±0.4μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 1, 'f')
-       486±0.3μs       80.0±0.4μs     0.16  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 1, 'f')
-         580±1μs         95.3±1μs     0.16  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 2, 'f')
-         583±2μs      94.6±0.05μs     0.16  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 1, 'f')
-       583±0.9μs       94.1±0.1μs     0.16  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 1, 'd')
-       583±0.7μs       93.8±0.2μs     0.16  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 1, 'f')
-       481±0.6μs         76.8±1μs     0.16  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 2, 'f')
-       611±0.7μs       94.1±0.1μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 1, 'd')
-         513±1μs       78.4±0.1μs     0.15  bench_lib.Nan.time_nanmin(200000, 90.0)
-       509±0.5μs       77.8±0.4μs     0.15  bench_lib.Nan.time_nanmax(200000, 90.0)
-     68.1±0.08μs       10.2±0.5μs     0.15  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'h')
-       577±0.6μs       85.2±0.4μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 2, 'f')
-       578±0.6μs       84.6±0.9μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 2, 'f')
-       580±0.7μs       79.9±0.4μs     0.14  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 1, 'f')
-      75.7±0.2μs       10.2±0.4μs     0.14  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'H')
-       575±0.6μs       76.6±0.9μs     0.13  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 2, 'f')
-       478±0.6μs       63.4±0.6μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 2, 'f')
-       479±0.4μs       63.3±0.6μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 2, 'f')
-       480±0.9μs       61.6±0.4μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 1, 'f')
-         479±2μs         60.9±1μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 1, 'f')
-       480±0.5μs       60.9±0.8μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 1, 'f')
-       483±0.5μs       61.3±0.3μs     0.13  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 1, 'f')
-         481±1μs       60.4±0.8μs     0.13  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 1, 'f')
-       481±0.7μs       60.3±0.5μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 1, 'f')
-      90.2±0.1μs       10.2±0.2μs     0.11  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'H')
-      90.3±0.1μs       10.1±0.1μs     0.11  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'h')
-       577±0.8μs         60.6±1μs     0.11  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 1, 'f')
-       577±0.9μs       60.4±0.4μs     0.10  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 1, 'f')
-         977±2μs       78.6±0.6μs     0.08  bench_lib.Nan.time_nanmax(200000, 50.0)
-         975±2μs       78.3±0.7μs     0.08  bench_lib.Nan.time_nanmin(200000, 50.0)
-     75.3±0.03μs      5.34±0.02μs     0.07  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'B')
-     75.4±0.04μs      5.32±0.03μs     0.07  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'b')
-     75.4±0.03μs      5.31±0.02μs     0.07  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'b')
-       478±0.8μs       33.6±0.5μs     0.07  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 1, 'f')
-       478±0.7μs       33.4±0.5μs     0.07  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 1, 'f')
-         480±2μs       33.0±0.5μs     0.07  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 1, 'f')
-     82.7±0.04μs      5.31±0.02μs     0.06  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'B')
-         575±1μs       33.8±0.3μs     0.06  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 1, 'f')
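
For anyone reading the table above: the ratio column is after/before, so values below 1 (rows marked `-`) are speedups and values above 1 (rows marked `+`) are regressions. Below is a minimal sketch for recomputing the ratio and the equivalent speedup factor from a pasted row. `to_seconds` and `parse_row` are hypothetical helper names, not part of asv or NumPy, and the parsing assumes the row layout shown in this comment (marker, before, after, ratio, benchmark name). asv computes its ratio from unrounded timings, so a value recomputed from the printed numbers may differ in the last digit.

```python
import re

# Multipliers to convert the printed units to seconds.
UNITS = {"ns": 1e-9, "μs": 1e-6, "ms": 1e-3, "s": 1.0}

# Matches e.g. "33.8±0.3μs"; accepts both Greek mu and the micro sign.
TIMING = re.compile(r"([\d.]+)±[\d.]+(ns|[μµ]s|ms|s)")


def to_seconds(text):
    """Convert one 'value±error<unit>' cell to seconds (error is ignored)."""
    match = TIMING.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised timing: {text!r}")
    unit = match.group(2).replace("µ", "μ")
    return float(match.group(1)) * UNITS[unit]


def parse_row(row):
    """Split one comparison row into its five columns."""
    marker, before, after, ratio, name = row.split(None, 4)
    return {
        "marker": marker,            # '-' faster, '+' slower
        "before": to_seconds(before),
        "after": to_seconds(after),
        "ratio": float(ratio),       # as printed by the tool
        "name": name,
    }


row = ("-         575±1μs       33.8±0.3μs     0.06  "
       "bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 1, 'f')")
parsed = parse_row(row)
recomputed = parsed["after"] / parsed["before"]   # same definition as the ratio column
speedup = parsed["before"] / parsed["after"]      # inverse, easier to quote
print(f"ratio={recomputed:.2f} speedup={speedup:.1f}x  {parsed['name']}")
```

For the last row of the table this gives a speedup of about 17x for contiguous float32 `maximum`.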

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS
Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

baseline(VSX2)
python runtests.py -n --bench-compare parent/main "max|min" -- --sort ratio
       before           after         ratio
     [1684a933]       [fd5a2601]
+       125±0.3μs       154±0.07μs     1.24  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'q')
+         698±2μs          816±1μs     1.17  bench_ufunc.UFunc.time_ufunc_types('fmax')
+         140±2μs          154±3μs     1.10  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'L')
+       136±0.6μs          144±1μs     1.06  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'l')
-         149±1μs        141±0.6μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'l')
-       695±0.6μs          660±4μs     0.95  bench_ufunc.UFunc.time_ufunc_types('maximum')
-         146±2μs          137±1μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'Q')
-         145±1μs          136±1μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'q')
-         146±1μs          136±1μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'L')
-         147±2μs          136±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'l')
-         142±2μs          129±2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'Q')
-     40.7±0.06μs       36.8±0.2μs     0.90  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
-         142±2μs          127±2μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'L')
-         143±2μs          127±1μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'Q')
-         143±2μs          127±2μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'l')
-         144±1μs          127±1μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'q')
-         143±1μs          126±1μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'l')
-         143±2μs        126±0.8μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'L')
-       139±0.9μs        121±0.6μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'Q')
-         140±1μs        121±0.7μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'q')
-       141±0.6μs          121±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'Q')
-         140±1μs        120±0.8μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'q')
-       139±0.7μs        120±0.8μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'L')
-         140±2μs        119±0.7μs     0.85  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'l')
-         140±1μs        118±0.6μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'L')
-       141±0.9μs        118±0.8μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'l')
-         128±1μs        104±0.2μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'Q')
-       125±0.8μs        102±0.4μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'Q')
-       125±0.5μs        101±0.9μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'q')
-       131±0.3μs        106±0.1μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'q')
-       128±0.5μs        104±0.3μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'q')
-       131±0.4μs        106±0.1μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'Q')
-       129±0.4μs        104±0.4μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'Q')
-       128±0.1μs        103±0.3μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'q')
-       132±0.7μs        106±0.5μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'Q')
-       129±0.6μs        103±0.4μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'q')
-       129±0.4μs          103±1μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'q')
-       128±0.3μs        103±0.1μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'Q')
-       131±0.6μs        105±0.7μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'q')
-       129±0.5μs        102±0.4μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'Q')
-       141±0.2μs          111±1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'B')
-       142±0.1μs        111±0.1μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'B')
-      139±0.03μs        109±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'H')
-       141±0.2μs          110±2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'b')
-       139±0.3μs        109±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'H')
-       139±0.4μs        108±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'H')
-       140±0.5μs        109±0.1μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'B')
-       140±0.5μs        109±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'B')
-      141±0.09μs        109±0.5μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'H')
-         140±1μs          109±1μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'H')
-       140±0.1μs        109±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'H')
-       139±0.5μs       108±0.07μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'B')
-       141±0.1μs        109±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'B')
-       140±0.4μs       109±0.09μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'H')
-       141±0.1μs       109±0.07μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'B')
-      140±0.03μs        109±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'H')
-       140±0.1μs        109±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'H')
-       140±0.1μs        109±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'H')
-      140±0.06μs       109±0.04μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'B')
-       140±0.5μs        109±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'B')
-         140±1μs        108±0.6μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'B')
-       140±0.6μs        109±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'H')
-       140±0.2μs        108±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'B')
-       140±0.5μs       108±0.03μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'B')
-      140±0.05μs        109±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'B')
-       141±0.2μs        109±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'H')
-       140±0.1μs        108±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'H')
-       141±0.2μs        109±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'H')
-       141±0.1μs        109±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'B')
-       141±0.2μs       109±0.08μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'B')
-      141±0.05μs        109±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'B')
-       141±0.3μs        109±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'H')
-       140±0.7μs        109±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'B')
-       141±0.1μs       109±0.09μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'B')
-      141±0.05μs        109±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'H')
-       141±0.2μs        109±0.3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'H')
-      140±0.08μs       108±0.07μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'H')
-       141±0.1μs        109±0.5μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'B')
-      141±0.08μs        109±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'H')
-       140±0.2μs       109±0.04μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'H')
-       141±0.7μs       109±0.08μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'H')
-       140±0.2μs       109±0.07μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'B')
-       142±0.2μs       109±0.03μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'b')
-         139±1μs          107±3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'B')
-         140±1μs          108±1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'H')
-       141±0.1μs        109±0.4μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'H')
-      139±0.09μs        107±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'B')
-       139±0.3μs          108±2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'B')
-      141±0.08μs       109±0.06μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'H')
-      141±0.05μs       109±0.05μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'B')
-       141±0.1μs        109±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'H')
-       141±0.1μs       109±0.03μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'H')
-         140±1μs          108±3μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'B')
-         141±1μs          108±1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'B')
-         132±1μs        101±0.6μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'i')
-       140±0.4μs        107±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'b')
-      141±0.08μs        108±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'b')
-       141±0.3μs        108±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'b')
-       139±0.7μs        106±0.1μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'b')
-      140±0.06μs       107±0.04μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'b')
-       141±0.2μs        108±0.6μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'h')
-         138±1μs          106±3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'b')
-         140±1μs          107±1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'h')
-       141±0.5μs        108±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'b')
-       141±0.4μs        108±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'b')
-       139±0.7μs        107±0.8μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'b')
-       131±0.2μs        100±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'l')
-       131±0.2μs        100±0.4μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'L')
-       140±0.5μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'h')
-       140±0.9μs          107±2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'b')
-       140±0.5μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'b')
-       141±0.2μs        107±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'b')
-      141±0.06μs        107±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'b')
-      139±0.09μs        106±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'b')
-       140±0.4μs        106±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'h')
-       141±0.2μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'b')
-         133±1μs        101±0.6μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'I')
-       141±0.6μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'h')
-       139±0.3μs        106±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'h')
-       140±0.3μs        106±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'b')
-       141±0.2μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'h')
-      140±0.06μs        106±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'b')
-       141±0.2μs        107±0.6μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'b')
-       141±0.6μs        107±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'b')
-       141±0.2μs        107±0.4μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'h')
-       140±0.6μs        106±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'h')
-       141±0.2μs       107±0.04μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'b')
-       140±0.5μs       107±0.06μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'b')
-       140±0.1μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'h')
-       141±0.1μs        107±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'b')
-       141±0.6μs        107±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'h')
-       141±0.7μs        107±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'h')
-       141±0.1μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'h')
-       140±0.2μs        106±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'h')
-       141±0.2μs        107±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'h')
-         140±1μs          106±3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'b')
-       141±0.3μs       107±0.07μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'h')
-       140±0.3μs          106±2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'b')
-         141±1μs        107±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'h')
-       141±0.3μs        107±0.3μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'h')
-       140±0.7μs        106±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'h')
-       141±0.7μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'h')
-       141±0.5μs        107±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'h')
-       141±0.4μs       107±0.03μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'h')
-       141±0.8μs        107±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'h')
-       142±0.9μs        107±0.3μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'h')
-       142±0.1μs        107±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'h')
-       131±0.7μs       98.6±0.6μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'i')
-         141±1μs          106±1μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'h')
-       131±0.7μs       97.5±0.9μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'I')
-       131±0.6μs       97.1±0.7μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'I')
-       130±0.4μs       96.6±0.6μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'i')
-       128±0.3μs       94.7±0.5μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'I')
-       128±0.4μs       94.2±0.4μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'i')
-       131±0.3μs       96.3±0.8μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'l')
-       129±0.6μs       94.8±0.7μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'i')
-       130±0.6μs       94.8±0.5μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'I')
-       129±0.1μs       94.1±0.5μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'I')
-       131±0.3μs       95.8±0.8μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'L')
-       125±0.3μs       91.1±0.2μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'I')
-       125±0.2μs       90.8±0.2μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'i')
-       129±0.1μs       93.6±0.5μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'i')
-       128±0.2μs       92.8±0.4μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'i')
-       125±0.5μs       90.4±0.4μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'i')
-       129±0.5μs       93.6±0.4μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'i')
-       129±0.6μs       93.7±0.3μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'I')
-       126±0.5μs       91.2±0.3μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'i')
-       125±0.3μs       90.3±0.6μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'I')
-       128±0.3μs       93.1±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'I')
-       124±0.1μs       90.2±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'I')
-       130±0.4μs       94.1±0.9μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'I')
-       125±0.2μs       90.3±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'i')
-       129±0.4μs       93.5±0.5μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'I')
-       130±0.7μs       94.0±0.9μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'i')
-       128±0.3μs       92.5±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'I')
-       125±0.1μs       90.6±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'i')
-       129±0.9μs       93.4±0.5μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'i')
-       129±0.6μs       92.8±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'L')
-       126±0.4μs       90.6±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'I')
-       128±0.3μs       92.4±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'I')
-       129±0.7μs       92.8±0.3μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'i')
-       128±0.2μs       92.3±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'i')
-      125±0.06μs       90.4±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'I')
-       129±0.2μs       92.8±0.5μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'i')
-       129±0.7μs       93.0±0.6μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'L')
-       129±0.4μs       92.8±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'I')
-       129±0.3μs       92.4±0.5μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'I')
-       129±0.5μs       92.7±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'l')
-       128±0.2μs       92.1±0.5μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'i')
-         129±1μs       92.5±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'l')
-       125±0.5μs       89.8±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'I')
-       128±0.6μs       92.0±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'i')
-       125±0.6μs       89.8±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'i')
-       127±0.2μs       90.7±0.2μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'I')
-       125±0.7μs       89.2±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'l')
-       128±0.2μs       91.5±0.4μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'I')
-       125±0.3μs       89.2±0.5μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'L')
-       126±0.7μs       89.6±0.3μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'i')
-      127±0.08μs       90.1±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'I')
-       127±0.2μs       90.1±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'i')
-       129±0.8μs       91.3±0.4μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'l')
-       126±0.2μs       89.5±0.4μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'i')
-      36.6±0.1μs      25.9±0.08μs     0.71  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
-     36.7±0.06μs      25.9±0.06μs     0.71  bench_reduce.MinMax.time_max(<class 'numpy.uint64'>)
-       127±0.2μs       89.4±0.3μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'I')
-       125±0.9μs       88.5±0.6μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'i')
-       125±0.2μs       88.5±0.7μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'I')
-       126±0.5μs       88.6±0.4μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'i')
-       128±0.2μs       90.1±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'L')
-       126±0.4μs       88.6±0.4μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'I')
-       130±0.5μs       90.6±0.3μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'L')
-       129±0.5μs       89.8±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'l')
-     44.9±0.08μs      26.5±0.08μs     0.59  bench_reduce.MinMax.time_max(<class 'numpy.float32'>)
-      37.2±0.1μs      20.8±0.03μs     0.56  bench_reduce.MinMax.time_max(<class 'numpy.int32'>)
-      37.3±0.1μs      20.8±0.08μs     0.56  bench_reduce.MinMax.time_max(<class 'numpy.uint32'>)
-      33.0±0.1μs      18.2±0.05μs     0.55  bench_reduce.FMinMax.time_max(<class 'numpy.float64'>)
-         286±1μs        136±0.2μs     0.48  bench_lib.Nan.time_nanmax(200000, 0)
-     38.3±0.06μs      18.1±0.02μs     0.47  bench_reduce.MinMax.time_max(<class 'numpy.uint16'>)
-      38.3±0.3μs      18.1±0.07μs     0.47  bench_reduce.MinMax.time_max(<class 'numpy.int16'>)
-       294±0.7μs        136±0.2μs     0.46  bench_lib.Nan.time_nanmax(200000, 0.1)
-      38.1±0.2μs      16.8±0.06μs     0.44  bench_reduce.MinMax.time_max(<class 'numpy.uint8'>)
-     38.2±0.07μs      16.8±0.01μs     0.44  bench_reduce.MinMax.time_max(<class 'numpy.int8'>)
-     37.6±0.08μs      13.4±0.08μs     0.36  bench_reduce.FMinMax.time_max(<class 'numpy.float32'>)
-         408±3μs        136±0.2μs     0.33  bench_lib.Nan.time_nanmax(200000, 2.0)
-       125±0.2μs      41.7±0.07μs     0.33  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'i')
-       125±0.1μs      41.7±0.04μs     0.33  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'I')
-         890±2μs         276±10μs     0.31  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 4, 'd')
-         888±3μs          249±4μs     0.28  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 2, 'd')
-         895±5μs         242±10μs     0.27  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 4, 'd')
-         892±3μs          240±1μs     0.27  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 4, 'd')
-       895±0.6μs          240±2μs     0.27  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 4, 'd')
-         892±3μs          239±3μs     0.27  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 2, 'd')
-         892±2μs          237±2μs     0.27  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 2, 'd')
-        881±10μs          229±1μs     0.26  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 4, 'd')
-         885±3μs          225±1μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 1, 'd')
-         877±2μs        223±0.8μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 2, 'd')
-         957±1μs          243±2μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 4, 'f')
-         957±3μs        243±0.3μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 2, 'f')
-         954±4μs        242±0.8μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 4, 'f')
-         957±7μs          243±1μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 2, 'f')
-         956±4μs        243±0.8μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 4, 'f')
-         952±3μs        241±0.7μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 2, 'f')
-       960±0.6μs          243±1μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 4, 'f')
-         961±2μs          242±2μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 2, 'f')
-         884±3μs          219±3μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 2, 'd')
-         884±3μs          219±3μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 2, 'd')
-        896±10μs          220±1μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 4, 'd')
-         888±3μs          218±3μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 4, 'd')
-         895±2μs          220±2μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 1, 'd')
-         891±2μs          219±1μs     0.25  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 1, 'd')
-         886±6μs          210±2μs     0.24  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 4, 'd')
-         892±7μs          211±5μs     0.24  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 4, 'd')
-         879±4μs          205±1μs     0.23  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 1, 'd')
-         880±4μs        202±0.4μs     0.23  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 2, 'd')
-         882±3μs          201±2μs     0.23  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 2, 'd')
-         883±3μs          194±1μs     0.22  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 1, 'd')
-         885±3μs        193±0.8μs     0.22  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 1, 'd')
-         880±3μs          186±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 1, 'd')
-         882±8μs          184±2μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 4, 'd')
-         958±2μs          199±3μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 4, 'f')
-         958±7μs          199±4μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 2, 'f')
-         888±4μs          185±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 2, 'd')
-         896±2μs          186±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 4, 'd')
-         895±2μs          185±3μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'd')
-         883±3μs          182±2μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 1, 'd')
-        878±10μs          181±2μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 2, 'd')
-         960±1μs          196±3μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 4, 'f')
-         955±1μs          194±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 4, 'f')
-         960±1μs        194±0.2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'f')
-         961±2μs        193±0.7μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 2, 'f')
-         894±2μs          180±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 2, 'd')
-         894±2μs          180±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 2, 'd')
-         955±2μs          191±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 2, 'f')
-       961±0.9μs        191±0.5μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 2, 'f')
-        882±10μs          173±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 4, 'd')
-         886±2μs          173±2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 1, 'd')
-       141±0.9μs       27.0±0.2μs     0.19  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'H')
-       141±0.9μs      26.9±0.03μs     0.19  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'h')
-         879±2μs        164±0.4μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 2, 'd')
-         889±1μs          165±4μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 4, 'd')
-        895±10μs          166±1μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 4, 'd')
-         894±3μs          161±3μs     0.18  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 1, 'd')
-         887±1μs          158±2μs     0.18  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 2, 'd')
-         888±2μs        159±0.9μs     0.18  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 2, 'd')
-         895±2μs          159±1μs     0.18  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 1, 'd')
-         898±8μs          160±2μs     0.18  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 4, 'd')
-         775±3μs        136±0.2μs     0.18  bench_lib.Nan.time_nanmax(200000, 90.0)
-         894±7μs          157±2μs     0.18  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 4, 'd')
-         957±4μs        167±0.8μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 4, 'f')
-         960±2μs        168±0.9μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 4, 'f')
-         959±1μs        167±0.8μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 2, 'f')
-         954±3μs        166±0.3μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 4, 'f')
-         961±2μs        167±0.2μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 2, 'f')
-         954±3μs        166±0.2μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 2, 'f')
-         958±1μs        165±0.6μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 2, 'f')
-         957±3μs        164±0.7μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 4, 'f')
-         881±2μs          148±1μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 1, 'd')
-       884±0.4μs        143±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 2, 'd')
-         887±8μs          144±3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 4, 'd')
-         884±1μs        143±0.6μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 2, 'd')
-         957±3μs        149±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 1, 'f')
-       973±0.9μs        151±0.6μs     0.16  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 1, 'f')
-         958±3μs        148±0.3μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 1, 'f')
-         954±7μs        147±0.7μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 1, 'f')
-         888±2μs        137±0.9μs     0.15  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 1, 'd')
-         887±1μs          136±1μs     0.15  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 1, 'd')
-         876±4μs       133±0.05μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 1, 'd')
-       884±0.9μs          126±1μs     0.14  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 1, 'd')
-       883±0.8μs          125±1μs     0.14  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 1, 'd')
-        883±10μs        124±0.5μs     0.14  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 2, 'd')
-         961±5μs          128±1μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 4, 'f')
-         961±2μs        128±0.5μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 4, 'f')
-         959±1μs        127±0.4μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 2, 'f')
-         961±1μs        127±0.7μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 2, 'f')
-         959±2μs        126±0.6μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 4, 'f')
-       958±0.6μs        125±0.5μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 2, 'f')
-       958±0.9μs       125±0.08μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 2, 'f')
-       959±0.7μs        125±0.4μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 4, 'f')
-         957±4μs        121±0.4μs     0.13  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 1, 'f')
-         959±3μs          122±1μs     0.13  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 4, 'f')
-         955±3μs        119±0.5μs     0.12  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 1, 'f')
-         954±6μs          119±1μs     0.12  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 2, 'f')
-         955±2μs       119±0.07μs     0.12  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 1, 'f')
-         956±4μs       117±0.03μs     0.12  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 1, 'f')
-       973±0.6μs        113±0.3μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 1, 'f')
-       961±0.7μs        110±0.4μs     0.11  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 1, 'f')
-         961±1μs        110±0.3μs     0.11  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 1, 'f')
-         955±8μs        109±0.2μs     0.11  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 1, 'f')
-         137±1μs       15.5±0.1μs     0.11  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'B')
-         137±1μs      15.4±0.07μs     0.11  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'b')
-         958±3μs       91.5±0.7μs     0.10  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 4, 'f')
-         954±1μs       90.1±0.6μs     0.09  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 2, 'f')
-         959±2μs       85.5±0.6μs     0.09  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 1, 'f')
-         960±1μs       85.2±0.5μs     0.09  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 1, 'f')
-         958±2μs       82.5±0.2μs     0.09  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 1, 'f')
-       958±0.8μs       82.3±0.2μs     0.09  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 1, 'f')
-       876±0.7μs      74.7±0.08μs     0.09  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 1, 'd')
-         953±1μs      72.0±0.05μs     0.08  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 1, 'f')
-        1.94±0ms        136±0.3μs     0.07  bench_lib.Nan.time_nanmax(200000, 50.0)
-       953±0.7μs      41.9±0.02μs     0.04  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 1, 'f')
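
For readers skimming the table above: the `ratio` column is the `after` time divided by the `before` time, so rows marked `-` (ratio below 1) are improvements and rows marked `+` are regressions. A minimal sketch of that arithmetic follows. The helper name `speedup` is hypothetical and not part of asv or NumPy; the inputs are mean timings in microseconds copied from rows of the table.

```python
def speedup(before_us, after_us):
    """Return (ratio, speedup_factor) for one benchmark row.

    ratio          = after / before, as printed in the table's ratio column
    speedup_factor = before / after, i.e. how many times faster 'after' is
    """
    ratio = after_us / before_us
    return round(ratio, 2), round(before_us / after_us, 1)


# Last row above: Binary.time_ufunc('fmax', 1, 1, 1, 'f'), 953 us -> 41.9 us
print(speedup(953, 41.9))

# A typical strided integer row: BinaryInt 'maximum', 141 us -> 109 us
print(speedup(141, 109))
```

The reciprocal of the ratio gives the speedup factor, so the contiguous `float32` `fmax` case at ratio 0.04 corresponds to roughly a 23x improvement, while the strided integer rows near 0.77 correspond to roughly 1.3x.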

AArch64

CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        2 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

OS
Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

baseline(ASIMD)
python runtests.py --bench-compare parent/main "max|min" -- --sort ratio
       before           after         ratio
     [f224ca3c]       [34d15c3d]
     <as_min_max^2>       <as_min_max>
+         166±1μs          246±1μs     1.48  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
+         735±2μs          904±5μs     1.23  bench_ufunc.UFunc.time_ufunc_types('maximum')
+         727±5μs          892±3μs     1.23  bench_ufunc.UFunc.time_ufunc_types('minimum')
-        1.76±0ms         1.67±0ms     0.95  bench_lib.Nan.time_nanargmax(200000, 50.0)
-        1.75±0ms         1.67±0ms     0.95  bench_lib.Nan.time_nanargmin(200000, 50.0)
-         143±2μs        136±0.7μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'i')
-        1.12±0ms         1.05±0ms     0.94  bench_lib.Nan.time_nanargmin(200000, 90.0)
-         201±2μs          189±2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'L')
-         200±2μs          188±2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'L')
-        1.12±0ms         1.05±0ms     0.94  bench_lib.Nan.time_nanargmax(200000, 90.0)
-         200±1μs          188±3μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'q')
-         201±2μs          189±2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'l')
-         201±2μs          189±2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'Q')
-         201±2μs          189±2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'Q')
-         202±2μs          189±2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'q')
-         201±2μs          189±2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'l')
-       203±0.9μs          190±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'q')
-         162±3μs          151±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'q')
-         162±2μs          151±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'l')
-         163±1μs          151±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'L')
-         162±1μs          150±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'Q')
-         203±1μs          189±3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'l')
-         162±2μs          151±1μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'l')
-      22.0±0.1μs      20.5±0.03μs     0.93  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-         234±2μs          217±3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'l')
-         203±1μs          189±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'L')
-         202±2μs          187±1μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'Q')
-         163±2μs          151±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'q')
-         203±1μs        188±0.7μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'l')
-         162±2μs          150±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'Q')
-         162±3μs          150±2μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'L')
-         234±2μs          217±3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'L')
-         202±2μs        187±0.8μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'q')
-         202±1μs        187±0.4μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'L')
-         235±2μs          217±3μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'Q')
-         204±1μs          189±2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'Q')
-         234±2μs          216±2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'Q')
-       235±0.7μs          217±3μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'q')
-         301±2μs          277±4μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'Q')
-       238±0.8μs          219±5μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'L')
-         235±2μs          216±2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'L')
-         236±1μs          217±2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'q')
-       238±0.7μs          219±3μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'l')
-         298±2μs          273±3μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'L')
-       238±0.5μs          218±2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'Q')
-         295±4μs          271±2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'q')
-         296±3μs          271±2μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'l')
-         159±2μs        145±0.8μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'i')
-       238±0.4μs          217±4μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'q')
-         299±3μs          273±3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'l')
-         237±2μs          216±3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'l')
-         236±1μs          215±1μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'L')
-         236±2μs        215±0.8μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'l')
-         236±2μs          215±1μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'Q')
-         296±3μs          269±2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'L')
-         236±2μs        214±0.8μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'q')
-       160±0.4μs          145±2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'i')
-         301±2μs          273±4μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'q')
-         159±1μs        144±0.7μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'I')
-         299±3μs          268±2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'Q')
-         161±1μs          144±2μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'I')
-       131±0.4μs        116±0.3μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'i')
-       131±0.6μs        116±0.5μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'I')
-       131±0.5μs        116±0.8μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'i')
-       131±0.6μs        116±0.4μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'I')
-       132±0.5μs          116±2μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'i')
-       131±0.3μs        116±0.6μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'L')
-       132±0.5μs          116±1μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'I')
-       132±0.4μs          116±1μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'I')
-         132±1μs        116±0.5μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'l')
-       132±0.7μs        116±0.3μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'l')
-       133±0.8μs        116±0.5μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'q')
-       132±0.5μs        116±0.7μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'Q')
-       133±0.8μs          116±1μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'Q')
-       132±0.6μs        116±0.7μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'q')
-       132±0.4μs          116±1μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'i')
-       133±0.2μs        116±0.8μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'L')
-       133±0.6μs          116±1μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'L')
-       132±0.7μs        115±0.9μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'Q')
-       133±0.6μs        116±0.9μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'l')
-       133±0.7μs        116±0.5μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'q')
-       133±0.6μs        115±0.8μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'q')
-       136±0.3μs          118±1μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'Q')
-       136±0.5μs        118±0.8μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'L')
-       133±0.3μs          115±1μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'L')
-       133±0.6μs          115±2μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'Q')
-       133±0.7μs        116±0.8μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'l')
-       136±0.1μs          117±1μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'q')
-       136±0.2μs        118±0.9μs     0.87  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'Q')
-       136±0.5μs          118±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'Q')
-       137±0.4μs          118±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'L')
-         734±3μs          635±1μs     0.86  bench_lib.Nan.time_nanargmin(200000, 2.0)
-         735±1μs          635±1μs     0.86  bench_lib.Nan.time_nanargmax(200000, 2.0)
-       136±0.9μs        118±0.5μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'l')
-       136±0.2μs          118±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'Q')
-       136±0.5μs          117±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'L')
-       137±0.7μs          118±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'l')
-       136±0.4μs        118±0.8μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'q')
-       136±0.2μs          117±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'q')
-       136±0.5μs          117±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'l')
-       137±0.3μs          118±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'L')
-       137±0.4μs          118±1μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'q')
-       137±0.6μs          117±2μs     0.86  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'l')
-       132±0.6μs        112±0.8μs     0.85  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'i')
-         644±2μs          546±2μs     0.85  bench_lib.Nan.time_nanargmax(200000, 0.1)
-         132±1μs        112±0.7μs     0.85  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'I')
-       641±0.8μs        543±0.9μs     0.85  bench_lib.Nan.time_nanargmax(200000, 0)
-         644±1μs          546±1μs     0.85  bench_lib.Nan.time_nanargmin(200000, 0.1)
-         640±1μs          542±2μs     0.85  bench_lib.Nan.time_nanargmin(200000, 0)
-       136±0.6μs          115±1μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'i')
-       135±0.4μs          114±1μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'I')
-       135±0.6μs          114±1μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'i')
-       132±0.4μs        111±0.4μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'i')
-       132±0.4μs        111±0.3μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'I')
-       136±0.2μs          114±1μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'I')
-       136±0.5μs          114±1μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'i')
-       136±0.6μs          114±1μs     0.84  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'I')
-       130±0.9μs        107±0.9μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'i')
-       137±0.3μs        112±0.6μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'I')
-       130±0.5μs        107±0.6μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'I')
-       137±0.6μs        112±0.4μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'i')
-         131±1μs          107±1μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'i')
-       131±0.5μs        107±0.9μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'i')
-         131±2μs          107±1μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'i')
-       131±0.4μs        107±0.4μs     0.82  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'I')
-         132±2μs        107±0.5μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'I')
-       132±0.7μs        106±0.7μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'I')
-       125±0.2μs        101±0.7μs     0.81  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'h')
-       124±0.2μs       99.7±0.3μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'h')
-       126±0.7μs        101±0.6μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'h')
-       124±0.3μs       99.2±0.1μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'h')
-       125±0.4μs       99.9±0.4μs     0.80  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'h')
-       125±0.7μs       99.4±0.2μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'h')
-      124±0.08μs       97.7±0.1μs     0.79  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'h')
-       124±0.6μs       97.3±0.2μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'h')
-       124±0.2μs       96.7±0.3μs     0.78  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'h')
-       125±0.3μs       96.7±0.5μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'h')
-       124±0.2μs       94.8±0.4μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'h')
-       124±0.4μs       94.9±0.4μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'h')
-      123±0.09μs       94.5±0.2μs     0.77  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'h')
-       124±0.1μs       94.7±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'h')
-       124±0.2μs       94.6±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'h')
-       127±0.3μs      96.6±0.09μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'h')
-       124±0.2μs       94.8±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'h')
-       124±0.4μs       94.7±0.1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'h')
-       127±0.1μs       96.7±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'h')
-       128±0.5μs         97.5±1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'Q')
-       129±0.5μs         97.6±1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'Q')
-       129±0.2μs       97.6±0.9μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'L')
-       128±0.3μs         97.3±1μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'q')
-       125±0.4μs       94.4±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'h')
-       128±0.2μs       97.3±0.5μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'i')
-       129±0.1μs       97.3±0.5μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'I')
-       124±0.2μs       93.5±0.2μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'h')
-       129±0.3μs       97.2±0.4μs     0.76  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'I')
-       124±0.2μs       93.5±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'h')
-       125±0.1μs       94.6±0.3μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'h')
-       129±0.6μs         97.2±1μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'l')
-       123±0.1μs       93.0±0.5μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'h')
-       125±0.2μs      94.0±0.02μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'h')
-       129±0.3μs       97.0±0.9μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'l')
-       125±0.1μs       94.4±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'h')
-       123±0.2μs       92.9±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'h')
-       129±0.6μs       96.9±0.6μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'q')
-      125±0.04μs       94.3±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'h')
-       125±0.1μs       94.1±0.1μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'h')
-       124±0.2μs       92.9±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'h')
-       125±0.2μs       94.1±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'h')
-       129±0.4μs       96.9±0.6μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'i')
-       124±0.5μs       93.5±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'h')
-       125±0.2μs      94.0±0.05μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'h')
-       124±0.3μs       93.2±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'h')
-       124±0.2μs       92.7±0.4μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'h')
-       129±0.7μs         97.0±2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'L')
-       124±0.5μs       93.3±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'h')
-       126±0.4μs      94.2±0.08μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'h')
-       123±0.1μs       92.1±0.4μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'h')
-       124±0.3μs      92.9±0.09μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'h')
-       123±0.1μs      91.9±0.06μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'h')
-       123±0.1μs       91.9±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'h')
-       123±0.3μs       91.9±0.2μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'h')
-       124±0.2μs       92.2±0.1μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'h')
-       123±0.4μs       91.9±0.1μs     0.75  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'h')
-       123±0.2μs       91.4±0.3μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'h')
-       123±0.2μs      91.2±0.04μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'h')
-       123±0.1μs       91.0±0.2μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'h')
-       133±0.1μs       98.8±0.9μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'I')
-       123±0.1μs      90.9±0.07μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'h')
-       123±0.2μs       91.0±0.3μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'h')
-       123±0.3μs       91.2±0.3μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'h')
-       123±0.1μs       91.0±0.1μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'h')
-       123±0.1μs       91.0±0.1μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'h')
-       133±0.1μs         98.5±1μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'i')
-      133±0.04μs       98.4±0.4μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'I')
-       133±0.1μs       98.3±0.9μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'i')
-       133±0.3μs         98.5±1μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'i')
-       133±0.1μs       98.2±0.8μs     0.74  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'I')
-       134±0.2μs       98.0±0.4μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'I')
-       128±0.8μs         93.4±1μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'i')
-       128±0.5μs       93.6±0.4μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'I')
-       129±0.4μs       93.9±0.8μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'i')
-       128±0.2μs       93.2±0.5μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'I')
-       128±0.1μs       93.2±0.3μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'I')
-         134±1μs       97.6±0.7μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'i')
-       128±0.2μs       92.7±0.5μs     0.73  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'i')
-       128±0.3μs       92.9±0.4μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'i')
-       125±0.4μs         90.2±5μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'b')
-       125±0.3μs         90.2±5μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'b')
-       128±0.2μs       92.1±0.7μs     0.72  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'I')
-       123±0.2μs      87.9±0.06μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'b')
-       123±0.2μs       87.9±0.5μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'b')
-       123±0.2μs      87.8±0.08μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'b')
-       123±0.2μs       87.7±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'b')
-       123±0.1μs       87.7±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'b')
-       123±0.1μs       87.7±0.3μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'b')
-       124±0.3μs         88.4±3μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'b')
-       124±0.4μs         88.1±3μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'b')
-       123±0.1μs       87.3±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'b')
-       123±0.1μs       87.1±0.2μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'b')
-       123±0.1μs       87.1±0.4μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'b')
-      123±0.08μs       87.2±0.3μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'b')
-      123±0.07μs       87.2±0.2μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'b')
-       123±0.1μs       87.2±0.2μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'b')
-       123±0.1μs       86.7±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'b')
-      123±0.06μs       86.5±0.1μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'b')
-      123±0.07μs       86.7±0.2μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'b')
-      123±0.06μs      86.6±0.09μs     0.71  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'b')
-       122±0.2μs       86.0±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'H')
-       123±0.1μs      86.5±0.06μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'b')
-      123±0.07μs      86.4±0.08μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'b')
-       123±0.2μs       86.6±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'b')
-      123±0.04μs       86.3±0.3μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'b')
-       123±0.2μs       86.4±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'b')
-      123±0.09μs       86.4±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'b')
-       123±0.1μs       86.4±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'b')
-       123±0.1μs       86.4±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'b')
-       123±0.1μs      86.4±0.09μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'b')
-      123±0.09μs       86.3±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'b')
-       123±0.1μs       86.2±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'b')
-      123±0.05μs       86.2±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'b')
-       123±0.2μs       86.3±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'b')
-      123±0.06μs       86.2±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'b')
-      123±0.06μs       86.2±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'b')
-       123±0.1μs       86.2±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'b')
-         127±1μs       89.4±0.9μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'I')
-       123±0.2μs       86.1±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'b')
-       123±0.1μs       86.3±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'b')
-       123±0.1μs       86.3±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'b')
-      122±0.07μs       85.9±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'b')
-       123±0.1μs      86.1±0.07μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'b')
-       129±0.5μs       90.5±0.6μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'i')
-      122±0.05μs       85.8±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'b')
-      123±0.03μs       85.9±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'b')
-       123±0.3μs      86.1±0.09μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'b')
-       123±0.2μs       85.8±0.8μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'b')
-       123±0.1μs       86.1±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'b')
-       123±0.2μs       86.0±0.4μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'b')
-       129±0.5μs       90.3±0.2μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'I')
-       129±0.4μs       90.2±0.5μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'i')
-       122±0.2μs       85.6±0.7μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'b')
-       130±0.1μs       91.1±0.5μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'i')
-      122±0.06μs      85.4±0.05μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'b')
-      122±0.03μs       85.2±0.1μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'b')
-       124±0.3μs       86.5±0.5μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'H')
-       123±0.1μs      85.3±0.06μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'b')
-       129±0.3μs       90.1±0.3μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'I')
-      122±0.08μs      85.1±0.08μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'b')
-         128±2μs       88.8±0.5μs     0.70  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'i')
-       122±0.2μs       84.5±0.1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'H')
-       122±0.3μs       84.6±0.3μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'H')
-       130±0.6μs       90.5±0.6μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'I')
-         129±2μs         89.3±1μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'i')
-         129±2μs       89.3±0.9μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'I')
-       131±0.4μs       90.2±0.4μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'I')
-       121±0.1μs       83.4±0.4μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'H')
-       121±0.2μs       83.4±0.3μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'H')
-       124±0.2μs       85.0±0.3μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'H')
-       131±0.5μs       89.9±0.5μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'i')
-       121±0.2μs       82.6±0.3μs     0.69  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'H')
-       124±0.3μs       84.8±0.5μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'H')
-       124±0.2μs       84.5±0.3μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'H')
-         248±2μs          169±2μs     0.68  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)
-      120±0.07μs       81.9±0.3μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'H')
-       121±0.2μs       81.9±0.4μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'H')
-       126±0.2μs       85.6±0.5μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'H')
-       121±0.1μs       82.4±0.2μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'H')
-       124±0.2μs       83.6±0.3μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'H')
-       122±0.1μs       82.7±0.2μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'H')
-       120±0.2μs       80.9±0.4μs     0.68  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'H')
-       122±0.2μs       82.0±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'H')
-       122±0.2μs       82.0±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'H')
-       123±0.3μs       82.6±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'H')
-       120±0.3μs       80.5±0.4μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'H')
-       124±0.3μs       83.3±0.4μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'H')
-       121±0.3μs       81.3±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'H')
-       123±0.5μs       82.7±0.6μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'H')
-       121±0.3μs       81.3±0.1μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'H')
-       121±0.2μs       80.7±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'H')
-       119±0.3μs       79.6±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'H')
-       120±0.1μs       79.9±0.1μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'H')
-       124±0.2μs       82.5±0.3μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'H')
-       121±0.1μs       80.6±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'H')
-       120±0.1μs       79.8±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'H')
-       123±0.3μs       82.2±0.2μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'H')
-       119±0.2μs      79.2±0.09μs     0.67  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'H')
-       119±0.1μs       78.9±0.3μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'H')
-       123±0.1μs       81.8±0.3μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'H')
-       119±0.2μs       79.0±0.3μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'H')
-       123±0.2μs       81.6±0.1μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'H')
-       125±0.2μs       82.9±0.2μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'H')
-       125±0.1μs       82.6±0.2μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'H')
-       123±0.3μs       81.4±0.2μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'H')
-       125±0.2μs       82.1±0.1μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'H')
-      123±0.04μs       80.6±0.2μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'H')
-       125±0.3μs       82.1±0.3μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'H')
-       123±0.1μs       80.7±0.4μs     0.66  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'H')
-       123±0.1μs       80.7±0.1μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'H')
-       128±0.2μs       83.7±0.3μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'i')
-       123±0.2μs       80.6±0.2μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'H')
-       128±0.1μs       83.5±0.3μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'i')
-       128±0.2μs       83.5±0.1μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'I')
-       123±0.2μs       79.9±0.2μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'H')
-       123±0.2μs       79.6±0.1μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'H')
-      128±0.08μs       83.3±0.4μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'I')
-      123±0.07μs       79.6±0.3μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'H')
-       129±0.2μs       83.0±0.4μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'i')
-       129±0.2μs       83.0±0.4μs     0.65  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'I')
-      123±0.09μs       79.0±0.1μs     0.64  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'H')
-       129±0.3μs       82.8±0.2μs     0.64  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'I')
-       123±0.2μs       79.0±0.2μs     0.64  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'H')
-       128±0.4μs       82.5±0.3μs     0.64  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'i')
-       127±0.2μs       81.6±0.4μs     0.64  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'I')
-       123±0.2μs       78.9±0.3μs     0.64  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'H')
-       128±0.5μs       81.3±0.7μs     0.64  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'i')
-       128±0.1μs       81.4±0.4μs     0.64  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'i')
-       128±0.2μs       81.1±0.7μs     0.63  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'I')
-       125±0.7μs       79.1±0.3μs     0.63  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'i')
-       125±0.5μs       78.7±0.5μs     0.63  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'I')
-       126±0.6μs       78.8±0.4μs     0.63  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'I')
-       126±0.5μs       78.8±0.3μs     0.63  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'i')
-       119±0.3μs       73.9±0.1μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 1, 'B')
-       120±0.6μs       74.4±0.3μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 4, 'B')
-       124±0.2μs       77.0±0.3μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'i')
-       124±0.2μs       76.9±0.2μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'I')
-       125±0.1μs       77.3±0.4μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'i')
-         119±2μs         73.4±1μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 4, 'B')
-       125±0.2μs       77.1±0.4μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'i')
-      120±0.08μs       73.9±0.1μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 4, 'B')
-       125±0.1μs       77.1±0.3μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'I')
-       125±0.4μs       77.3±0.4μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'i')
-       119±0.1μs      73.4±0.08μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 1, 'B')
-       119±0.2μs       73.3±0.2μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'B')
-       119±0.2μs       73.5±0.2μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 1, 'B')
-       125±0.4μs       77.0±0.3μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'I')
-         118±1μs       72.7±0.5μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'B')
-       118±0.3μs      72.9±0.05μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 2, 'B')
-       120±0.1μs       73.6±0.2μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'B')
-       119±0.2μs       73.2±0.1μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'B')
-       125±0.3μs       77.1±0.2μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'I')
-       119±0.2μs      73.3±0.08μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'B')
-       119±0.1μs       73.2±0.1μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 4, 'B')
-       119±0.2μs      73.1±0.09μs     0.62  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 4, 'B')
-       118±0.2μs       72.8±0.1μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'B')
-       119±0.1μs       73.5±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 4, 'B')
-       125±0.4μs       77.0±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'I')
-      119±0.05μs      73.2±0.09μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'B')
-       120±0.2μs       73.6±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 4, 'B')
-       120±0.2μs       73.4±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 4, 'B')
-       125±0.1μs       77.0±0.3μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'i')
-      119±0.08μs       73.3±0.1μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'B')
-       119±0.1μs       73.1±0.1μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 2, 'B')
-      118±0.06μs      72.5±0.05μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'B')
-      118±0.04μs      72.5±0.04μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'B')
-      118±0.06μs      72.6±0.04μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'B')
-       120±0.1μs       73.3±0.1μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'B')
-       125±0.2μs       76.7±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'i')
-         119±1μs       72.6±0.1μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 2, 1, 'B')
-       122±0.3μs       74.9±0.1μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'i')
-       122±0.2μs       74.9±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'I')
-       125±0.1μs       76.6±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'I')
-       126±0.6μs       76.8±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'i')
-       126±0.3μs       76.7±0.2μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'i')
-       126±0.3μs       76.8±0.4μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'I')
-       126±0.3μs       76.5±0.1μs     0.61  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'I')
-      123±0.06μs       74.2±0.1μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 1, 'B')
-       124±0.2μs       74.9±0.1μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'I')
-       123±0.1μs       74.3±0.3μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 4, 'B')
-       124±0.2μs       74.8±0.2μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'i')
-       123±0.2μs       73.8±0.4μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'B')
-       123±0.2μs       73.8±0.1μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 4, 'B')
-       123±0.2μs       73.9±0.1μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 4, 2, 'B')
-       123±0.2μs       73.8±0.2μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 4, 'B')
-       123±0.2μs      73.5±0.04μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'B')
-       123±0.1μs      73.6±0.07μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 4, 'B')
-       123±0.1μs       73.5±0.2μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 1, 'B')
-       123±0.1μs       73.5±0.1μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 4, 'B')
-       123±0.1μs      73.3±0.08μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'B')
-       123±0.2μs       73.3±0.2μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'B')
-       123±0.2μs       73.3±0.2μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'B')
-      123±0.09μs       73.2±0.1μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 2, 'B')
-       123±0.1μs       73.2±0.2μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'B')
-      123±0.08μs      73.1±0.03μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 4, 'B')
-       123±0.3μs         73.2±1μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 4, 'B')
-      123±0.07μs      73.1±0.09μs     0.60  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'B')
-      122±0.06μs       72.9±0.1μs     0.59  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'B')
-      123±0.04μs      72.9±0.06μs     0.59  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'B')
-       123±0.1μs       72.8±0.1μs     0.59  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'B')
-      122±0.02μs       72.7±0.6μs     0.59  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 2, 'B')
-      123±0.08μs       72.7±0.1μs     0.59  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'B')
-      122±0.06μs       72.6±0.1μs     0.59  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'B')
-      122±0.05μs      72.5±0.07μs     0.59  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 1, 'B')
-       123±0.1μs      72.7±0.04μs     0.59  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 2, 'B')
-       127±0.3μs       72.7±0.6μs     0.57  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'q')
-       127±0.2μs       72.6±0.4μs     0.57  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'l')
-       128±0.2μs       72.8±0.5μs     0.57  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'Q')
-       128±0.6μs       72.6±0.6μs     0.57  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'L')
-       128±0.2μs       72.4±0.7μs     0.57  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'L')
-      128±0.05μs       72.4±0.5μs     0.57  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'q')
-       128±0.2μs       72.1±0.6μs     0.57  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'Q')
-       128±0.1μs       72.2±0.4μs     0.56  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'l')
-       579±0.9μs          292±3μs     0.50  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 4, 'd')
-         579±1μs          290±5μs     0.50  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 4, 'd')
-         578±2μs          289±3μs     0.50  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 4, 'd')
-         579±1μs          289±3μs     0.50  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 4, 'd')
-     22.0±0.05μs      10.6±0.09μs     0.48  bench_reduce.MinMax.time_max(<class 'numpy.uint64'>)
-     22.0±0.04μs      10.5±0.02μs     0.48  bench_reduce.MinMax.time_min(<class 'numpy.uint64'>)
-      22.0±0.1μs       10.6±0.1μs     0.48  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
-     22.0±0.03μs      10.5±0.03μs     0.48  bench_reduce.MinMax.time_min(<class 'numpy.int64'>)
-         572±3μs          262±5μs     0.46  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 1, 'd')
-         572±1μs          261±5μs     0.46  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 1, 'd')
-         572±3μs          260±2μs     0.46  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 1, 'd')
-         573±3μs          260±1μs     0.45  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 1, 'd')
-       571±0.6μs          249±2μs     0.44  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 4, 'd')
-       572±0.4μs          249±2μs     0.44  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 4, 'd')
-       572±0.9μs          249±2μs     0.44  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 4, 'd')
-       571±0.9μs          249±2μs     0.44  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'd')
-         573±1μs          249±1μs     0.43  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 4, 'd')
-       571±0.9μs          248±1μs     0.43  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 4, 'd')
-       573±0.5μs        248±0.5μs     0.43  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 4, 'd')
-       574±0.9μs        248±0.9μs     0.43  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 4, 'd')
-         568±1μs          230±2μs     0.40  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 4, 'd')
-         568±1μs          229±2μs     0.40  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 4, 'd')
-       568±0.4μs          229±1μs     0.40  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 4, 'd')
-       568±0.8μs        228±0.8μs     0.40  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 4, 'd')
-         569±2μs          228±1μs     0.40  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 2, 'd')
-        601±20μs          240±6μs     0.40  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 4, 'd')
-        600±20μs          239±7μs     0.40  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 4, 'd')
-         569±2μs        227±0.7μs     0.40  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 2, 'd')
-         569±2μs          227±2μs     0.40  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 2, 'd')
-        602±20μs          240±6μs     0.40  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 4, 'd')
-        599±20μs          239±5μs     0.40  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 4, 'd')
-         570±1μs          227±2μs     0.40  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 2, 'd')
-         564±1μs          213±2μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 4, 'd')
-       563±0.6μs          212±4μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 1, 'd')
-       565±0.5μs          213±2μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 4, 'd')
-       563±0.5μs          212±2μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 1, 'd')
-       563±0.9μs          212±2μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 1, 'd')
-         564±1μs          212±2μs     0.38  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 1, 'd')
-       564±0.7μs          211±3μs     0.37  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 4, 'd')
-         566±1μs          211±1μs     0.37  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 4, 'd')
-         565±2μs        210±0.8μs     0.37  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 1, 'd')
-         564±1μs        209±0.4μs     0.37  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 1, 'd')
-       564±0.5μs          210±2μs     0.37  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 1, 'd')
-       563±0.5μs        209±0.6μs     0.37  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 1, 'd')
-        594±20μs          199±3μs     0.34  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 4, 'd')
-     21.9±0.06μs       7.36±0.1μs     0.34  bench_reduce.MinMax.time_max(<class 'numpy.int32'>)
-     21.9±0.05μs       7.35±0.1μs     0.34  bench_reduce.MinMax.time_max(<class 'numpy.uint32'>)
-        595±20μs          199±4μs     0.33  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 4, 'd')
-        594±20μs          199±4μs     0.33  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 4, 'd')
-        597±20μs          199±5μs     0.33  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 4, 'd')
-        599±20μs          200±5μs     0.33  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 4, 'd')
-     21.8±0.04μs      7.29±0.04μs     0.33  bench_reduce.MinMax.time_min(<class 'numpy.int32'>)
-        598±20μs          199±6μs     0.33  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 4, 'd')
-        598±20μs          199±5μs     0.33  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 4, 'd')
-        597±20μs          199±4μs     0.33  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 4, 'd')
-         559±1μs          186±2μs     0.33  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 1, 'd')
-       559±0.6μs          185±2μs     0.33  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 1, 'd')
-     22.1±0.06μs      7.29±0.03μs     0.33  bench_reduce.MinMax.time_min(<class 'numpy.uint32'>)
-         562±1μs          185±2μs     0.33  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 1, 'd')
-       561±0.9μs          183±1μs     0.33  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 1, 'd')
-         560±1μs        183±0.9μs     0.33  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 1, 'd')
-       561±0.5μs          183±1μs     0.33  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 1, 'd')
-       559±0.4μs          182±1μs     0.33  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 1, 'd')
-       561±0.6μs          182±1μs     0.33  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 1, 'd')
-       564±0.7μs          183±1μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 2, 'd')
-     29.2±0.05μs       9.49±0.1μs     0.32  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
-         565±1μs          183±2μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 2, 'd')
-     29.2±0.05μs      9.46±0.06μs     0.32  bench_reduce.MinMax.time_min(<class 'numpy.float64'>)
-       565±0.5μs        182±0.9μs     0.32  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 2, 'd')
-       564±0.9μs          182±1μs     0.32  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 2, 'd')
-       567±0.7μs          182±1μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 2, 'd')
-       567±0.8μs        182±0.9μs     0.32  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 2, 'd')
-       567±0.5μs        181±0.8μs     0.32  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 2, 'd')
-       568±0.8μs        181±0.7μs     0.32  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 2, 'd')
-        586±20μs          175±4μs     0.30  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 4, 'd')
-        587±20μs          175±4μs     0.30  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 4, 'd')
-        586±20μs          174±3μs     0.30  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 4, 'd')
-        587±20μs          174±4μs     0.30  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 4, 'd')
-       561±0.8μs        160±0.5μs     0.28  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 2, 'd')
-       562±0.4μs          160±1μs     0.28  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 2, 'd')
-       562±0.9μs        159±0.6μs     0.28  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 2, 'd')
-       563±0.8μs        159±0.8μs     0.28  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 2, 'd')
-       561±0.8μs        159±0.8μs     0.28  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 2, 'd')
-         563±1μs        159±0.6μs     0.28  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 2, 'd')
-         563±1μs        159±0.2μs     0.28  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 2, 'd')
-       563±0.8μs        159±0.3μs     0.28  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 2, 'd')
-       555±0.8μs          150±2μs     0.27  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 1, 'd')
-       555±0.8μs          150±1μs     0.27  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 1, 'd')
-       555±0.5μs          148±1μs     0.27  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 1, 'd')
-       554±0.4μs        147±0.4μs     0.27  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 1, 'd')
-       555±0.8μs          145±3μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 1, 'f')
-       555±0.6μs          145±3μs     0.26  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 1, 'f')
-       554±0.8μs          144±1μs     0.26  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 1, 'f')
-       554±0.4μs          144±2μs     0.26  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 1, 'f')
-     26.7±0.06μs      6.88±0.05μs     0.26  bench_reduce.FMinMax.time_min(<class 'numpy.float64'>)
-     26.7±0.05μs      6.86±0.05μs     0.26  bench_reduce.FMinMax.time_max(<class 'numpy.float64'>)
-     29.2±0.05μs      7.36±0.08μs     0.25  bench_reduce.MinMax.time_max(<class 'numpy.float32'>)
-     29.2±0.03μs      7.36±0.01μs     0.25  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)
-       560±0.8μs        137±0.8μs     0.25  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 2, 'd')
-         558±2μs        137±0.6μs     0.24  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 2, 'd')
-         561±1μs          137±1μs     0.24  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 2, 'd')
-         559±1μs        136±0.7μs     0.24  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 2, 'd')
-         564±3μs          137±1μs     0.24  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 4, 'f')
-         563±2μs          136±1μs     0.24  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 4, 'f')
-         564±3μs        136±0.7μs     0.24  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 4, 'f')
-         563±3μs        136±0.2μs     0.24  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 4, 'f')
-       120±0.5μs       28.4±0.7μs     0.24  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'I')
-       120±0.3μs       28.2±0.5μs     0.23  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'i')
-       124±0.2μs       28.2±0.7μs     0.23  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'I')
-       123±0.2μs       28.0±0.3μs     0.23  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'i')
-       256±0.4μs       56.3±0.4μs     0.22  bench_lib.Nan.time_nanmax(200000, 0)
-       255±0.3μs       56.1±0.2μs     0.22  bench_lib.Nan.time_nanmin(200000, 0)
-       257±0.2μs       56.2±0.4μs     0.22  bench_lib.Nan.time_nanmax(200000, 0.1)
-       257±0.4μs       56.0±0.5μs     0.22  bench_lib.Nan.time_nanmin(200000, 0.1)
-     29.0±0.02μs       6.19±0.1μs     0.21  bench_reduce.MinMax.time_max(<class 'numpy.uint16'>)
-     29.0±0.03μs       6.18±0.1μs     0.21  bench_reduce.MinMax.time_max(<class 'numpy.int16'>)
-     29.0±0.01μs      6.15±0.08μs     0.21  bench_reduce.MinMax.time_min(<class 'numpy.int16'>)
-     29.0±0.02μs      6.16±0.04μs     0.21  bench_reduce.MinMax.time_min(<class 'numpy.uint16'>)
-       553±0.5μs          116±2μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 1, 'd')
-       553±0.9μs        115±0.6μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 1, 'd')
-       552±0.3μs          115±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 1, 'd')
-         558±1μs          116±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 4, 'f')
-         558±1μs          116±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'f')
-       553±0.5μs        115±0.7μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 1, 'd')
-       553±0.6μs        115±0.5μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 1, 'd')
-       553±0.5μs        115±0.6μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 1, 'd')
-         553±1μs        115±0.4μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 1, 'd')
-       559±0.9μs          116±2μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 4, 'f')
-         559±1μs        116±0.7μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 4, 'f')
-         559±1μs          116±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 2, 'd')
-         558±2μs        115±0.4μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 4, 'f')
-       552±0.5μs        114±0.3μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 1, 'd')
-         559±2μs          115±1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 4, 'f')
-       558±0.9μs        115±0.1μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 2, 'd')
-       558±0.8μs        115±0.9μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 2, 'd')
-         559±1μs        115±0.5μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 2, 'd')
-         558±1μs        115±0.6μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 4, 'f')
-         559±1μs        115±0.2μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 4, 'f')
-       558±0.8μs        115±0.6μs     0.21  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 2, 'd')
-         558±1μs        115±0.2μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 2, 'd')
-       559±0.8μs        115±0.4μs     0.21  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 2, 'd')
-         559±1μs        115±0.4μs     0.21  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 2, 'd')
-         553±1μs          112±1μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 1, 'f')
-         553±1μs          112±1μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 1, 'f')
-         553±1μs          112±1μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 1, 'f')
-       552±0.5μs        112±0.9μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 1, 'f')
-       552±0.6μs        111±0.3μs     0.20  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 1, 'f')
-         552±1μs        111±0.6μs     0.20  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 1, 'f')
-       558±0.7μs        111±0.6μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 4, 2, 'f')
-       554±0.7μs        110±0.6μs     0.20  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 1, 'f')
-       552±0.9μs        110±0.7μs     0.20  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 1, 'f')
-       558±0.6μs        111±0.8μs     0.20  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 4, 2, 'f')
-       558±0.8μs        111±0.5μs     0.20  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 4, 2, 'f')
-         558±1μs        111±0.2μs     0.20  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 4, 2, 'f')
-       285±0.3μs       56.2±0.7μs     0.20  bench_lib.Nan.time_nanmax(200000, 2.0)
-       285±0.3μs       55.7±0.2μs     0.20  bench_lib.Nan.time_nanmin(200000, 2.0)
-       559±0.6μs          108±1μs     0.19  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 4, 'f')
-         559±1μs          107±1μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 4, 'f')
-       557±0.9μs        107±0.7μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 4, 'f')
-         559±1μs          107±1μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 4, 'f')
-       558±0.9μs        106±0.3μs     0.19  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 4, 'f')
-       558±0.9μs        106±0.6μs     0.19  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 4, 'f')
-         558±1μs        106±0.4μs     0.19  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 4, 'f')
-       557±0.8μs        106±0.3μs     0.19  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 4, 'f')
-     28.9±0.04μs       5.44±0.1μs     0.19  bench_reduce.MinMax.time_max(<class 'numpy.uint8'>)
-     28.9±0.03μs       5.43±0.1μs     0.19  bench_reduce.MinMax.time_max(<class 'numpy.int8'>)
-     28.9±0.01μs      5.40±0.04μs     0.19  bench_reduce.MinMax.time_min(<class 'numpy.uint8'>)
-     28.9±0.02μs      5.39±0.05μs     0.19  bench_reduce.MinMax.time_min(<class 'numpy.int8'>)
-     26.6±0.06μs      4.73±0.04μs     0.18  bench_reduce.FMinMax.time_max(<class 'numpy.float32'>)
-     26.6±0.05μs      4.69±0.03μs     0.18  bench_reduce.FMinMax.time_min(<class 'numpy.float32'>)
-         556±1μs       97.1±0.9μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 2, 'd')
-       555±0.2μs       96.6±0.7μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 4, 'f')
-       551±0.9μs       95.9±0.7μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 1, 'f')
-       556±0.6μs       96.6±0.4μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 4, 'f')
-       555±0.6μs         95.9±1μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 2, 'd')
-       556±0.8μs       96.0±0.7μs     0.17  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 2, 'd')
-       556±0.5μs         95.8±2μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 2, 'd')
-       555±0.8μs       95.7±0.3μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 4, 'f')
-       551±0.5μs       94.9±0.8μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 1, 'f')
-       556±0.6μs       95.7±0.2μs     0.17  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 4, 'f')
-       553±0.6μs       94.9±0.7μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 1, 'f')
-         551±1μs       94.4±0.5μs     0.17  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 1, 'f')
-       552±0.5μs       94.4±0.4μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 1, 'f')
-       553±0.6μs       94.5±0.9μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 1, 'f')
-       552±0.7μs       94.4±0.3μs     0.17  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 1, 'f')
-       552±0.6μs       94.4±0.6μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 1, 'f')
-       555±0.7μs       92.9±0.7μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 4, 'f')
-         554±1μs       92.4±0.6μs     0.17  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 4, 'f')
-       555±0.8μs       92.0±0.9μs     0.17  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 4, 'f')
-       556±0.7μs       91.8±0.9μs     0.17  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 4, 'f')
-       554±0.4μs       91.1±0.4μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 4, 'f')
-       554±0.9μs       90.8±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 4, 'f')
-       555±0.6μs       90.6±0.4μs     0.16  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 4, 'f')
-       555±0.7μs       90.6±0.6μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 4, 'f')
-         556±2μs       89.6±0.4μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 2, 'f')
-       556±0.3μs       89.5±0.8μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 2, 2, 'f')
-         556±1μs       89.5±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 2, 2, 'f')
-         556±1μs       89.3±0.4μs     0.16  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 2, 2, 'f')
-         555±1μs       89.1±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 4, 2, 'f')
-       556±0.9μs       89.0±0.5μs     0.16  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 2, 2, 'f')
-         556±1μs       89.0±0.2μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 4, 2, 'f')
-         556±2μs       88.9±0.3μs     0.16  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 4, 2, 'f')
-         568±8μs       88.0±0.8μs     0.16  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 4, 'f')
-         568±8μs       87.0±0.9μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 4, 'f')
-         568±8μs         86.7±1μs     0.15  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 4, 'f')
-         569±9μs         86.7±1μs     0.15  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 4, 'f')
-       554±0.6μs       81.6±0.4μs     0.15  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 4, 2, 'f')
-       553±0.5μs       81.3±0.4μs     0.15  bench_ufunc_strides.Binary.time_ufunc('fmax', 4, 1, 2, 'f')
-       555±0.4μs       81.4±0.5μs     0.15  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 4, 2, 'f')
-       553±0.5μs       81.1±0.1μs     0.15  bench_ufunc_strides.Binary.time_ufunc('fmin', 4, 1, 2, 'f')
-       555±0.4μs       81.3±0.5μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 4, 2, 'f')
-       555±0.9μs       81.2±0.2μs     0.15  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 4, 2, 'f')
-         554±1μs       81.0±0.2μs     0.15  bench_ufunc_strides.Binary.time_ufunc('minimum', 4, 1, 2, 'f')
-       553±0.9μs       80.8±0.3μs     0.15  bench_ufunc_strides.Binary.time_ufunc('maximum', 4, 1, 2, 'f')
-       384±0.3μs       55.9±0.6μs     0.15  bench_lib.Nan.time_nanmax(200000, 90.0)
-       386±0.5μs       56.0±0.3μs     0.15  bench_lib.Nan.time_nanmin(200000, 90.0)
-       552±0.5μs       75.4±0.5μs     0.14  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 1, 'f')
-       551±0.8μs       75.2±0.3μs     0.14  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 1, 'f')
-         554±2μs       75.6±0.1μs     0.14  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 2, 2, 'f')
-       554±0.9μs       75.5±0.4μs     0.14  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 2, 2, 'f')
-         555±1μs       75.3±0.3μs     0.14  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 2, 'f')
-       551±0.9μs       74.8±0.3μs     0.14  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 1, 'f')
-         555±2μs       75.3±0.4μs     0.14  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 2, 2, 'f')
-       551±0.6μs       74.6±0.7μs     0.14  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 2, 1, 'f')
-         552±1μs       72.6±0.3μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 2, 'f')
-       553±0.6μs       72.7±0.6μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 2, 'f')
-       552±0.4μs       72.2±0.3μs     0.13  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 2, 'f')
-       552±0.7μs       72.1±0.4μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 2, 'f')
-         553±1μs       72.2±0.3μs     0.13  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 2, 'f')
-       553±0.5μs       72.0±0.3μs     0.13  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 2, 'f')
-       552±0.6μs       71.7±0.2μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 2, 'f')
-       552±0.6μs       71.5±0.1μs     0.13  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 2, 'f')
-       551±0.9μs       69.4±0.6μs     0.13  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 1, 'd')
-       551±0.5μs       69.4±0.5μs     0.13  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 1, 'd')
-       552±0.6μs       68.9±0.4μs     0.12  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 1, 'd')
-       550±0.4μs       68.5±0.2μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 1, 'd')
-         550±1μs       68.2±0.3μs     0.12  bench_ufunc_strides.Binary.time_ufunc('maximum', 2, 1, 1, 'f')
-       549±0.6μs       68.0±0.3μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 1, 'f')
-         550±1μs       68.0±0.2μs     0.12  bench_ufunc_strides.Binary.time_ufunc('minimum', 2, 1, 1, 'f')
-       549±0.7μs      67.9±0.07μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 2, 1, 'f')
-         550±2μs       67.9±0.1μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 1, 1, 'f')
-       550±0.5μs       67.8±0.2μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 2, 1, 'f')
-       550±0.9μs       67.9±0.1μs     0.12  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 2, 1, 'f')
-       550±0.2μs       67.6±0.2μs     0.12  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 2, 1, 'f')
-       551±0.8μs       67.2±0.3μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 2, 'f')
-       551±0.6μs       67.2±0.3μs     0.12  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 2, 'f')
-       552±0.9μs       67.0±0.2μs     0.12  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 2, 'f')
-       550±0.5μs       66.8±0.2μs     0.12  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 2, 'f')
-       118±0.2μs       12.0±0.7μs     0.10  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'H')
-      123±0.08μs       11.6±0.4μs     0.09  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'h')
-       122±0.2μs       11.4±0.4μs     0.09  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'H')
-       123±0.2μs       11.4±0.3μs     0.09  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'h')
-         993±1μs       55.8±0.3μs     0.06  bench_lib.Nan.time_nanmin(200000, 50.0)
-         994±1μs       55.8±0.3μs     0.06  bench_lib.Nan.time_nanmax(200000, 50.0)
-       118±0.2μs      6.48±0.05μs     0.06  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'B')
-      122±0.03μs      6.50±0.05μs     0.05  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'B')
-       124±0.6μs      6.50±0.04μs     0.05  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'b')
-       124±0.5μs      6.46±0.03μs     0.05  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'b')
-       549±0.4μs       28.1±0.5μs     0.05  bench_ufunc_strides.Binary.time_ufunc('maximum', 1, 1, 1, 'f')
-       550±0.9μs       28.0±0.2μs     0.05  bench_ufunc_strides.Binary.time_ufunc('minimum', 1, 1, 1, 'f')
-       549±0.4μs       27.7±0.3μs     0.05  bench_ufunc_strides.Binary.time_ufunc('fmax', 1, 1, 1, 'f')
-       549±0.8μs       27.6±0.4μs     0.05  bench_ufunc_strides.Binary.time_ufunc('fmin', 1, 1, 1, 'f')

@seiko2plus
Member

My latest push only adds npyv_cleanup() at the end of the inner functions, but the DEBUG CI runner crashed with:

2022-01-06T18:55:39.9852972Z Fatal Python error: Aborted
2022-01-06T18:55:39.9854461Z 
2022-01-06T18:55:39.9855828Z Current thread 0x00007fd960bc9740 (most recent call first):
2022-01-06T18:55:39.9858195Z   File "/home/runner/work/numpy/numpy/numpy/core/tests/test_ufunc.py", line 2158 in test_reducelike_byteorder_resolution
2022-01-06T18:55:39.9861865Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
2022-01-06T18:55:39.9880785Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
2022-01-06T18:55:39.9882503Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
2022-01-06T18:55:39.9884321Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
2022-01-06T18:55:39.9885893Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
2022-01-06T18:55:39.9887521Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
2022-01-06T18:55:39.9889161Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
2022-01-06T18:55:39.9890747Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
2022-01-06T18:55:39.9892266Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
2022-01-06T18:55:39.9893800Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
2022-01-06T18:55:39.9895359Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
2022-01-06T18:55:39.9896933Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
2022-01-06T18:55:39.9898549Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
2022-01-06T18:55:39.9900197Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
2022-01-06T18:55:39.9901914Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
2022-01-06T18:55:39.9903544Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
2022-01-06T18:55:39.9905119Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
2022-01-06T18:55:39.9906660Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
2022-01-06T18:55:39.9908240Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
2022-01-06T18:55:39.9909834Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
2022-01-06T18:55:39.9911675Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
2022-01-06T18:55:39.9913207Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
2022-01-06T18:55:39.9914689Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
2022-01-06T18:55:39.9923391Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
2022-01-06T18:55:39.9925014Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
2022-01-06T18:55:39.9926614Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
2022-01-06T18:55:39.9928195Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
2022-01-06T18:55:39.9929743Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
2022-01-06T18:55:39.9931282Z   File "/home/runner/work/numpy/numpy/builds/venv/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
2022-01-06T18:55:39.9932410Z   File "/home/runner/work/numpy/numpy/numpy/_pytesttester.py", line 204 in __call__
2022-01-06T18:55:39.9933201Z   File "../runtests.py", line 388 in main
2022-01-06T18:55:39.9933855Z   File "../runtests.py", line 701 in <module>
2022-01-06T18:55:55.2013682Z ./tools/travis-test.sh: line 79:  4561 Aborted                 (core dumped) $PYTHON ../runtests.py -n -v $DURATIONS_FLAG -- -rs

Probably related to this pull request; I'm going to investigate it.

@seberg
Member

seberg commented Jan 6, 2022

Please feel free to ignore that; this seems to have slipped in through another fix (not sure how it got past CI). I have not yet spent serious effort tracking it down, though.
(It is not related to this PR, and almost certainly related to the ufunc dispatching itself.)

@mattip
Member

mattip commented Jan 7, 2022

@seiko2plus would you like to get this in as-is and then open an issue for improving argmax + float32 ? @Developer-Ecosystem-Engineering thoughts?

@seiko2plus
Member

@seberg,

(It is not related to this PR, and almost certainly related to the ufunc dispatching itself.)

I'm going to ignore it then, thanks for the clarification.

@mattip,

would you like to get this in as-is and then open an issue for improving argmax + float32 ?

Yes, I would, but the issue is that I don't clearly understand how the new changes affected argmax/float32 negatively and argmax/float64 positively:

AVX512

+       122±0.2μs          218±2μs     1.79  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-       184±0.9μs        124±0.3μs     0.67  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)

AVX2

+       122±0.3μs          220±2μs     1.81  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-         184±1μs        124±0.2μs     0.67  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)

ASIMD

+         166±1μs          246±1μs     1.48  bench_reduce.ArgMax.time_argmax(<class 'numpy.float32'>)
-         248±2μs          169±2μs     0.68  bench_reduce.ArgMax.time_argmax(<class 'numpy.float64'>)

Are there any code paths for argmax other than the following?

static int
BOOL_argmax(npy_bool *ip, npy_intp n, npy_intp *max_ind,
            PyArrayObject *NPY_UNUSED(aip))
{
    npy_intp i = 0;
    /* memcmp like logical_and on i386 is maybe slower for small arrays */
#ifdef NPY_HAVE_SSE2_INTRINSICS
    const __m128i zero = _mm_setzero_si128();
    for (; i < n - (n % 32); i+=32) {
        __m128i d1 = _mm_loadu_si128((__m128i*)&ip[i]);
        __m128i d2 = _mm_loadu_si128((__m128i*)&ip[i + 16]);
        d1 = _mm_cmpeq_epi8(d1, zero);
        d2 = _mm_cmpeq_epi8(d2, zero);
        if (_mm_movemask_epi8(_mm_min_epu8(d1, d2)) != 0xFFFF) {
            break;
        }
    }
#else
    #if defined(__ARM_NEON__) || defined (__ARM_NEON)
        uint8x16_t zero = vdupq_n_u8(0);
        for(; i < n - (n % 32); i+=32) {
            uint8x16_t d1 = vld1q_u8((uint8_t *)&ip[i]);
            uint8x16_t d2 = vld1q_u8((uint8_t *)&ip[i + 16]);
            d1 = vceqq_u8(d1, zero);
            d2 = vceqq_u8(d2, zero);
            if(_mm_movemask_epi8_neon(vminq_u8(d1, d2)) != 0xFFFF) {
                break;
            }
        }
    #endif
#endif
    for (; i < n; i++) {
        if (ip[i]) {
            *max_ind = i;
            return 0;
        }
    }
    *max_ind = 0;
    return 0;
}
/**begin repeat
 *
 * #fname = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 *          LONG, ULONG, LONGLONG, ULONGLONG,
 *          HALF, FLOAT, DOUBLE, LONGDOUBLE,
 *          CFLOAT, CDOUBLE, CLONGDOUBLE,
 *          DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 *         npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *         npy_half, npy_float, npy_double, npy_longdouble,
 *         npy_float, npy_double, npy_longdouble,
 *         npy_datetime, npy_timedelta#
 * #isfloat = 0*10, 1*7, 0*2#
 * #isnan = nop*10, npy_half_isnan, npy_isnan*6, nop*2#
 * #le = _LESS_THAN_OR_EQUAL*10, npy_half_le, _LESS_THAN_OR_EQUAL*8#
 * #iscomplex = 0*14, 1*3, 0*2#
 * #incr = ip++*14, ip+=2*3, ip++*2#
 * #isdatetime = 0*17, 1*2#
 */
static int
@fname@_argmax(@type@ *ip, npy_intp n, npy_intp *max_ind,
               PyArrayObject *NPY_UNUSED(aip))
{
    npy_intp i;
    @type@ mp = *ip;
#if @iscomplex@
    @type@ mp_im = ip[1];
#endif
    *max_ind = 0;
#if @isfloat@
    if (@isnan@(mp)) {
        /* nan encountered; it's maximal */
        return 0;
    }
#endif
#if @iscomplex@
    if (@isnan@(mp_im)) {
        /* nan encountered; it's maximal */
        return 0;
    }
#endif
#if @isdatetime@
    if (mp == NPY_DATETIME_NAT) {
        /* NaT encountered, it's maximal */
        return 0;
    }
#endif
    for (i = 1; i < n; i++) {
        @incr@;
        /*
         * Propagate nans, similarly as max() and min()
         */
#if @iscomplex@
        /* Lexical order for complex numbers */
        if ((ip[0] > mp) || ((ip[0] == mp) && (ip[1] > mp_im))
                || @isnan@(ip[0]) || @isnan@(ip[1])) {
            mp = ip[0];
            mp_im = ip[1];
            *max_ind = i;
            if (@isnan@(mp) || @isnan@(mp_im)) {
                /* nan encountered, it's maximal */
                break;
            }
        }
#else
#if @isdatetime@
        if (*ip == NPY_DATETIME_NAT) {
            /* NaT encountered, it's maximal */
            *max_ind = i;
            break;
        }
#endif
        if (!@le@(*ip, mp)) {  /* negated, for correct nan handling */
            mp = *ip;
            *max_ind = i;
#if @isfloat@
            if (@isnan@(mp)) {
                /* nan encountered, it's maximal */
                break;
            }
#endif
        }
#endif
    }
    return 0;
}
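
For readers following along, the real-valued float branch of the templated loop above can be sketched in plain Python. This is only an illustration of the logic as quoted, not NumPy code: the function name argmax_nan_propagating is made up here, and the complex and datetime branches are left out.

```python
import math


def argmax_nan_propagating(values):
    """Index of the maximum; the first NaN counts as maximal and ends the scan."""
    if not values:
        raise ValueError("argmax of an empty sequence")
    best = values[0]
    best_index = 0
    if math.isnan(best):
        # NaN in the first slot: it is treated as maximal, nothing else to do.
        return 0
    for i in range(1, len(values)):
        v = values[i]
        # Negated "<=" on purpose: every comparison with NaN is False,
        # so "not (v <= best)" is True for a NaN and it gets selected.
        if not (v <= best):
            best = v
            best_index = i
            if math.isnan(best):
                break
    return best_index


print(argmax_nan_propagating([1.0, 3.0, 2.0]))
print(argmax_nan_propagating([1.0, float("nan"), 5.0]))
```

Ties keep the earliest index because the update only fires on a strictly greater value, which matches the negated less-than-or-equal test in the C loop.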

@seberg
Member

seberg commented Jan 7, 2022

If it is randomly affecting different dtypes in opposite directions, I am willing to blame the typical (~30%) fluctuations we often see :/. And for those, my best guess is compiler (e.g. optimization/code layout) differences due to unrelated code changes.

My best idea for mitigation is to try using PGO (profile guided optimization) on the benchmarks to stabilize them (Python does this).
But that had been a bit tricky to try out due to ICEs (internal compiler errors) in GCC, which should be gone in recent versions (coverage seems to work again on my local GCC, I think). So I have not tried it yet; it just has not bothered me enough since I got a GCC again that should make it painless.

@mattip
Member

mattip commented Jan 8, 2022

A couple of thoughts:

  • The implementation is in numpy/core/src/multiarray/arraytypes.c.src, which did not change in this PR.
  • I don't think there is a change in the compilation flags in this PR, right?
  • Does the change also manifest in int dtypes? So far the benchmark checks float32, float64, bool.
  • Does the benchmark result change if only that benchmark is run in isolation? If the order of dtypes is changed?

@Developer-Ecosystem-Engineering
Contributor Author

@seiko2plus would you like to get this in as-is and then open an issue for improving argmax + float32 ? @Developer-Ecosystem-Engineering thoughts?

Splitting up the work by functions makes sense

@mattip
Member

mattip commented Jan 11, 2022

Putting this in, I opened #20785 to track the argmax regression.

@mattip mattip merged commit 2d74972 into numpy:main Jan 11, 2022

@seiko2plus
Member

@seberg, both builds (before and after) were merged against the latest main and used the same compiler flags, so I don't think special profiling (PGO) is going to change the outcome.

@mattip,

The implementation is in numpy/core/src/multiarray/arraytypes.c.src, which did not change in this PR.

yes.

I don't think there is a change in the compilation flags in this PR, right?

right 100%.

Does the change also manifest in int dtypes? So far the benchmark checks float32, float64, bool.

In my latest commits, I covered integer types in the tests, and the performance improvement was pretty good.

Does the benchmark result change if only that benchmark is run in isolation? If the order of dtypes is changed?

Same, even when filtering to only argmax.

As a precaution, I have improved the performance of argmax/argmin in #20846 until I get some free time to investigate it.

@seberg
Member

seberg commented Jan 18, 2022

@seberg, both builds (before and after) were merged against the latest main and used the same compiler flags, so I don't think special profiling (PGO) is going to change the outcome.

I am basing that guess solely on this: https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html where during Python optimization dead code would cause differences and only PGO would eliminate it. I agree this feels too big of an effect, so probably there is something more/else... but I have no idea what :)

EDIT: Actually, the malloc behaviour from this discussion, which just popped up again and which Francesc dug deeper into, is likely a component: https://mail.python.org/archives/list/numpy-discussion@python.org/message/FKPBT24OMLC5BWHHRVG4CUX7QYKGWQKJ/

@seberg
Member

seberg commented Jan 20, 2022

xref gh-20863, I did not look closely at this, but it seems likely that this PR may have been the cause of the regression in min precision?

Sounds like an incorrectly typed temporary, the example converts 2147483647 to 2147483648 (off by one), which happens for this operation:

np.float64(np.float32(2147483647))

(The example also has a change for a float32 example, but it mixes that with int32, so the operation may end up using float64 somewhere.)
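
The off-by-one described above can be reproduced without NumPy, assuming only that Python floats are IEEE 754 binary64 and that the struct module's "f" format is binary32. The helper name round_trip_float32 is made up for this sketch. A binary32 value has a 24-bit significand, so 2**31 - 1 cannot be stored exactly and rounds to the nearest representable neighbour:

```python
import struct


def round_trip_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


value = 2147483647.0  # 2**31 - 1, exactly representable in binary64
as_float32 = round_trip_float32(value)

print(as_float32)               # value after passing through a float32 temporary
print(as_float32 == 2.0 ** 31)  # whether it landed on 2**31
```

A temporary of a narrower float type anywhere in a min/max path would produce the same kind of change, which is why the symptom points at a typing problem and not at the comparison itself.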

@Developer-Ecosystem-Engineering
Contributor Author

Taking a look

@Developer-Ecosystem-Engineering
Contributor Author

Developer-Ecosystem-Engineering commented Jan 20, 2022

This occurs in 1.21 prior to this commit


>>> import numpy as np
>>> np.version.version
'1.21.0'
>>> np.float64(np.float32(2147483647))
2147483648.0

@seberg
Member

seberg commented Jan 20, 2022

@Developer-Ecosystem-Engineering that was just me making a hypothesis about the reason. The problem is described in the linked issue, sorry. It is more something like np.min(np.float128([np.inf, 2147483647.])) returning 2147483648.0.

@seberg
Member

seberg commented Jan 20, 2022

Looking closer at it, it seems the issue is limited to longdouble (where I can reproduce it locally on my ryzen linux). The type table on the issue is probably just misleading, because it mainly sees precision loss during conversion (which is correct and fine).

@Developer-Ecosystem-Engineering
Contributor Author

@Developer-Ecosystem-Engineering that was just me making a hypothesis about the reason. The problem is described in the linked issue, sorry. It is more something like np.min([np.inf, 2147483647.]) returning 2147483648.0.

Ah, got it. We are able to reproduce, appears to be related to these changes. Looking further.

@Developer-Ecosystem-Engineering
Contributor Author

Should be resolved by #20872

Labels
00 - Bug component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Development

Successfully merging this pull request may close these issues.

Native code order of magnitude slower than translated code on Apple M1
8 participants