ENH: Speedup ufunc.at when casting is not needed #22889

Merged on Jan 3, 2023 (19 commits)

mattip commented Dec 26, 2022

Towards solving #5922. This refactors the looping code in ufunc_at into two functions: ufunc_at__slow_iter, which is the same as the current code, and ufunc_at__fast_iter, which avoids creating the NpyIter_AdvancedNew iterator that is only needed for casting. The fast path is taken only when all of these conditions hold:

  • the loop function is a generic, legacy loop
  • the data types are numeric
  • there is no casting needed
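From the Python side, the "no casting needed" condition can be pictured with the small sketch below. It is only an illustration: `np.can_cast(..., casting="no")` is used here as a rough proxy and is an assumption, not the check the C code performs, and the variable names are made up for the example. Both calls give the same result; only the second one needs the values cast to the target dtype.

```python
import numpy as np

idx = np.array([0, 1, 1, 3])

# Case A: target and values share a dtype, so no cast is needed.
res_same = np.zeros(4, dtype=np.float64)
vals_same = np.ones(4, dtype=np.float64)
np.add.at(res_same, idx, vals_same)

# Case B: int32 values have to be cast to float64 before the add.
res_cast = np.zeros(4, dtype=np.float64)
vals_cast = np.ones(4, dtype=np.int32)
np.add.at(res_cast, idx, vals_cast)

# Rough proxy (an assumption, not the PR's C-level check) for "no casting needed".
no_cast_a = np.can_cast(vals_same.dtype, res_same.dtype, casting="no")
no_cast_b = np.can_cast(vals_cast.dtype, res_cast.dtype, casting="no")

print(res_same, res_cast, no_cast_a, no_cast_b)
```

Which path a given call takes internally depends on the NumPy build, so this sketch says nothing about timing.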

This speeds up the new benchmark by about 3x:

       before           after         ratio
     [07c34877]       [7853cbc1]
-      53.9±0.2ms       18.2±0.1ms     0.34  bench_ufunc.At.time_sum_at
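For reading the table: the ratio column appears to be after / before, so the "about 3x" figure is its inverse. A quick check using the two timings from the row above (the variable names are just for this example):

```python
before_ms = 53.9  # time_sum_at before the change
after_ms = 18.2   # time_sum_at after the change

ratio = after_ms / before_ms    # matches the "ratio" column
speedup = before_ms / after_ms  # the "about 3x" quoted in the text

print(f"ratio = {ratio:.2f}, speedup = {speedup:.1f}x")
# ratio = 0.34, speedup = 3.0x
```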

The changes are a bit messy. I started off by moving code bits around in order to isolate the code that became the "slow" function, and then copy-pasted and removed code to make the "fast" function.

mattip commented Dec 26, 2022

Going back to the original issue and repeating the test code (IPython session):

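import numpy as np
# Sizes must be integers: recent NumPy releases reject float sizes such as 1e5.
test_idx = np.random.randint(0, 1000, 100_000)
test_vals = np.random.rand(100_000)*100-50
res = np.zeros(1000)
# Unbuffered in-place add: repeated indices are accumulated.
%timeit np.add.at(res, test_idx, test_vals)
#--> 100 loops, best of 3: 13 ms per loop
# Same accumulation via a weighted bin count.
%timeit res[:] += np.bincount(test_idx, weights=test_vals)
#--> 1000 loops, best of 3: 220 µs per loop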

The comparison, which was 13_000 / 220 ~= 60x in the original issue (times in µs), now gives me 1_820 / 143 ~= 13x, so there is still more that could be done. The fast path still calls the loop function once per data point (a single-element iteration with its strides). The loop functions are designed to be called rarely, each time with large buffers of data and their strides.
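To make the comparison concrete, here is a small sketch of why the two timed statements are interchangeable for this workload. It assumes float data and non-negative integer indices. `minlength=1000` is added so the bincount result always has the same length as the target, and the seed and variable names are arbitrary choices for the example.

```python
import numpy as np

# Buffered fancy-index assignment applies a repeated index only once...
buffered = np.zeros(3)
buffered[[0, 0, 1]] += 1.0

# ...whereas ufunc.at accumulates every occurrence.
unbuffered = np.zeros(3)
np.add.at(unbuffered, [0, 0, 1], 1.0)

# For add with integer indices, a weighted bincount computes the same sums.
rng = np.random.default_rng(0)
idx = rng.integers(0, 1000, size=100_000)
vals = rng.random(100_000) * 100 - 50

res_at = np.zeros(1000)
np.add.at(res_at, idx, vals)

res_bincount = np.zeros(1000)
res_bincount += np.bincount(idx, weights=vals, minlength=1000)

print(buffered, unbuffered, np.allclose(res_at, res_bincount))
```

bincount only covers the "add with integer indices" case, which is why it is a useful lower bound for the benchmark rather than a general replacement for ufunc.at.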

mattip commented Dec 27, 2022

I found two more optimizations; the speedup is now about 6.3x:

$ python runtests.py -j6 --bench-compare fe73a8498 -- -b bench_ufunc.At
...
       before           after         ratio
     [fe73a849]       [ef9f35a9]
     <asv-compare~1>       <speedup-ufunc.at-main>
-      54.0±0.2ms      8.42±0.02ms     0.16  bench_ufunc.At.time_sum_at

  • the function call was going through a wrapper. Moving the "unwrap" out of the hot loop got me to 4x
  • the inner loop functions have some overhead since for every call they go through a "can I use a SIMD optimization of 128 bit vectors? No, so maybe 64 bit vectors? No, so maybe ...". This works well when dealing with a contiguous chunk of memory once. For the ufunc_at case, the inner loop function is called with a single data point, so all that is wasted effort. Short-circuiting the checks if count<4 gives a nice speed increase. @seiko2plus thoughts?

@@ -537,7 +537,7 @@ NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@kind@)
 *((@type@ *)iop1) = io1;
 #endif
 }
-else if (!run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
+else if (dimensions[0] < 4 || !run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {

short circuit the SIMD optimizations for small counts
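
The idea behind the changed condition can be sketched in pure Python. This is an analogy only: `try_simd_add`, `add_loop` and the `calls` counter are hypothetical stand-ins, not NumPy functions. Because `or` short-circuits just like `||` in C, the dispatch check is never evaluated for small counts.

```python
calls = {"simd_checks": 0}

def try_simd_add(a, b, out):
    """Hypothetical stand-in for the SIMD dispatch; True means it did the work."""
    calls["simd_checks"] += 1
    if len(out) < 4:  # too short for any vector path
        return False
    for i in range(len(out)):
        out[i] = a[i] + b[i]
    return True

def add_loop(a, b, out):
    n = len(out)
    # For n < 4 the right-hand side is skipped, so no dispatch cost is paid.
    if n < 4 or not try_simd_add(a, b, out):
        for i in range(n):  # plain scalar fallback
            out[i] = a[i] + b[i]

out_small = [0.0]
add_loop([1.0], [2.0], out_small)
checks_after_small = calls["simd_checks"]

out_big = [0] * 8
add_loop(list(range(8)), list(range(8)), out_big)
checks_after_big = calls["simd_checks"]

print(out_small, checks_after_small, out_big, checks_after_big)
```

With one element per call, as in ufunc_at, every call falls into the skipped branch, which is where the saving comes from.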

mattip commented Dec 28, 2022

Here is a complete benchmark run. Is there a way to make asv runs more stable on Ubuntu with an AMD processor? Pyperf has some directions, but are they applicable to asv? There is also this project with some hints; has anyone tried them?

benchmark results
       before           after         ratio
     [fe73a849]       [ef9f35a9]
     <asv-compare~1>       <speedup-ufunc.at-main>
+     2.26±0.02μs      3.85±0.01μs     1.70  bench_io.Copy.time_cont_assign('float32')
+     45.6±0.04μs         68.6±3μs     1.50  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 100))
+       238±0.5μs         346±10μs     1.45  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '<=')
+      60.6±0.2μs         87.2±2μs     1.44  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '<=')
+       120±0.2μs         171±20μs     1.43  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '<=')
+       104±0.2μs        148±0.2μs     1.42  bench_function_base.Mean.time_mean(100000)
+      51.2±0.1μs         72.3±2μs     1.41  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '>=')
+       108±0.2μs        152±0.2μs     1.41  bench_function_base.Mean.time_mean_axis(100000)
+     9.70±0.02μs      13.4±0.02μs     1.39  bench_io.Copy.time_cont_assign('complex128')
+       201±0.6μs          276±2μs     1.37  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '>=')
+     5.42±0.02μs      7.42±0.01μs     1.37  bench_function_base.Sort.time_sort('merge', 'int32', ('ordered',))
+     5.41±0.01μs      7.40±0.03μs     1.37  bench_function_base.Sort.time_sort('merge', 'int32', ('uniform',))
+      57.1±0.2μs         77.4±9μs     1.36  bench_function_base.Sort.time_argsort('quick', 'int32', ('ordered',))
+     1.70±0.01μs      2.23±0.03μs     1.31  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '>=')
+      177±0.04μs       227±0.05μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 1, 'e')
+      177±0.06μs       227±0.04μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 2, 'e')
+       177±0.2μs        227±0.4μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 2, 'e')
+      177±0.08μs       227±0.08μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 1, 'e')
+       177±0.2μs        227±0.1μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 1, 'e')
+       177±0.4μs       227±0.07μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 2, 'e')
+      177±0.05μs       227±0.05μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 4, 'e')
+       177±0.7μs        227±0.9μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 4, 'e')
+      24.9±0.4μs         31.8±1μs     1.28  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 1000))
+       178±0.8μs        227±0.8μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 4, 'e')
+         107±2μs          136±2μs     1.28  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
+     18.5±0.07μs      23.6±0.06μs     1.27  bench_ufunc.CustomInplace.time_float_add
+        16.5±1μs         21.0±1μs     1.27  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '!=')
+     1.69±0.02μs       2.12±0.3μs     1.25  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '!=')
+       114±0.5μs          142±2μs     1.25  bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))
+      517±0.05μs          641±2μs     1.24  bench_ufunc.UFunc.time_ufunc_types('mod')
+     26.0±0.02μs      32.0±0.02μs     1.23  bench_function_base.Sort.time_argsort('heap', 'int16', ('uniform',))
+         520±8μs          639±6μs     1.23  bench_ufunc.UFunc.time_ufunc_types('remainder')
+      62.5±0.3μs         76.8±1μs     1.23  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))
+         142±3μs         175±10μs     1.23  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '!=')
+       122±0.7μs        150±0.2μs     1.22  bench_reduce.AddReduceSeparate.time_reduce(1, 'int64')
+        1.10±0μs      1.33±0.02μs     1.22  bench_itemselection.PutMask.time_sparse(True, 'int32')
+        34.9±2μs         42.2±5μs     1.21  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '>')
+     1.30±0.02μs      1.57±0.03μs     1.21  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '<=')
+     1.09±0.01μs         1.32±0μs     1.20  bench_itemselection.PutMask.time_sparse(True, 'complex256')
+     14.1±0.04μs      16.9±0.06μs     1.20  bench_io.Copy.time_strided_copy('complex128')
+      9.18±0.2μs       11.0±0.2μs     1.20  bench_linalg.Linalg.time_op('norm', 'int64')
+         481±9μs         574±50μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 1, 4, 'd')
+      32.7±0.6μs       39.0±0.9μs     1.19  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000))
+       528±0.9μs         629±50μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 2, 2, 'e')
+         926±4μs      1.10±0.01ms     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 4, 'e')
+        472±10μs         562±50μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 1, 2, 'd')
+     1.10±0.01μs      1.31±0.01μs     1.19  bench_itemselection.PutMask.time_sparse(True, 'float32')
+        468±10μs         556±50μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 1, 1, 'd')
+        470±10μs         558±50μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 2, 1, 'd')
+        472±10μs         561±50μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 2, 2, 'd')
+        3.89±0μs      4.62±0.01μs     1.19  bench_itemselection.PutMask.time_dense(True, 'complex256')
+     31.1±0.02μs         37.0±2μs     1.19  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '!=')
+        476±10μs         565±50μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 4, 2, 'd')
+         482±9μs         571±50μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 2, 4, 'd')
+        484±10μs         573±50μs     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 4, 4, 'd')
+        473±10μs         560±50μs     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 4, 1, 'd')
+      96.7±0.3μs        114±0.5μs     1.18  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 10))
+         926±2μs      1.10±0.02ms     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 1, 'e')
+         926±4μs      1.09±0.02ms     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 2, 'e')
+         926±1μs      1.09±0.02ms     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 4, 'e')
+         922±2μs      1.09±0.02ms     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 2, 'e')
+     16.1±0.01μs       18.9±0.8μs     1.17  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '==')
+         530±3μs         618±30μs     1.17  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 2, 'e')
+      61.7±0.1μs      71.9±0.05μs     1.17  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '!=')
+     31.2±0.08μs      36.3±0.06μs     1.16  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '<=')
+     61.7±0.05μs       71.8±0.1μs     1.16  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '==')
+         532±8μs          619±9μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 4, 'e')
+      61.9±0.2μs       71.9±0.1μs     1.16  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '!=')
+     31.2±0.04μs      36.3±0.06μs     1.16  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '==')
+      61.8±0.1μs      71.8±0.06μs     1.16  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '>')
+     61.8±0.06μs      71.8±0.06μs     1.16  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '<=')
+     31.2±0.03μs      36.3±0.05μs     1.16  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '!=')
+     31.2±0.06μs      36.3±0.05μs     1.16  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '>')
+         923±3μs      1.07±0.04ms     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 1, 'e')
+         928±3μs      1.08±0.02ms     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 2, 'e')
+     16.2±0.08μs      18.7±0.09μs     1.16  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '<=')
+     8.15±0.03μs      9.44±0.04μs     1.16  bench_function_base.Sort.time_argsort('merge', 'int32', ('uniform',))
+     16.2±0.03μs      18.7±0.03μs     1.16  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '!=')
+     8.19±0.05μs      9.48±0.03μs     1.16  bench_function_base.Sort.time_argsort('merge', 'int32', ('ordered',))
+     32.3±0.06μs      37.4±0.07μs     1.16  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '==')
+         529±3μs          612±9μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 1, 'e')
+     86.4±0.02μs         99.9±5μs     1.16  bench_ufunc_strides.AVX_ldexp.time_ufunc('e', 2)
+     16.2±0.05μs      18.7±0.02μs     1.16  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '>')
+         926±4μs      1.07±0.02ms     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 4, 'e')
+     32.3±0.03μs      37.3±0.06μs     1.15  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '!=')
+      65.6±0.2μs       75.7±0.1μs     1.15  bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))
+         529±2μs         609±20μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 1, 'e')
+      94.2±0.2μs        108±0.4μs     1.15  bench_function_base.Sort.time_argsort('quick', 'int32', ('reversed',))
+         923±2μs      1.06±0.02ms     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 1, 'e')
+         529±1μs          606±3μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 2, 'e')
+       528±0.8μs          605±3μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 2, 'e')
+        6.36±0ms         7.28±0ms     1.15  bench_reduce.AddReduceSeparate.time_reduce(0, 'float16')
+         531±2μs          608±4μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 4, 'e')
+       351±0.9μs        402±0.5μs     1.14  bench_linalg.Eindot.time_einsum_i_ij_j
+         532±3μs          609±5μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 4, 'e')
+         528±5μs          605±4μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 1, 4, 'e')
+       527±0.1μs        603±0.6μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 1, 1, 'e')
+       527±0.2μs          603±1μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 1, 2, 'e')
+       528±0.6μs        603±0.7μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 1, 'e')
+       527±0.2μs        602±0.2μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 1, 1, 'e')
+       528±0.5μs        603±0.6μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 1, 'e')
+         553±2μs          632±3μs     1.14  bench_lib.Nan.time_nanstd(200000, 0.1)
+         533±3μs          608±5μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'e')
+     18.7±0.02μs      21.3±0.05μs     1.14  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '<=')
+     18.7±0.03μs      21.3±0.04μs     1.14  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '>')
+     36.2±0.05μs      41.2±0.01μs     1.14  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '<=')
+         530±3μs        603±0.7μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 1, 4, 'e')
+         546±3μs          621±5μs     1.14  bench_lib.Nan.time_nanstd(200000, 0)
+     36.3±0.05μs      41.3±0.03μs     1.14  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '>')
+         531±3μs          604±2μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 2, 'e')
+     37.3±0.07μs      42.4±0.08μs     1.14  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '>')
+      72.0±0.1μs      81.8±0.03μs     1.14  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '>')
+     20.2±0.06μs      23.0±0.07μs     1.14  bench_ufunc.CustomInplace.time_double_add
+      72.1±0.1μs      81.8±0.07μs     1.13  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '<=')
+      99.3±0.1μs          113±2μs     1.13  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 10))
+         554±4μs          628±1μs     1.13  bench_lib.Nan.time_nanvar(200000, 0.1)
+     37.3±0.03μs      42.3±0.05μs     1.13  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '<=')
+         696±2μs          789±2μs     1.13  bench_lib.Nan.time_nanvar(200000, 2.0)
+       702±0.9μs          795±2μs     1.13  bench_lib.Nan.time_nanstd(200000, 2.0)
+         532±4μs        602±0.2μs     1.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 1, 2, 'e')
+         548±5μs          619±2μs     1.13  bench_lib.Nan.time_nanvar(200000, 0)
+         456±2μs         509±60μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 4, 'e')
+        2.30±0ms         2.56±0ms     1.11  bench_reduce.AddReduceSeparate.time_reduce(1, 'float16')
+       453±0.8μs          504±1μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 2, 'e')
+       453±0.3μs          503±1μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 1, 2, 'e')
+       453±0.2μs        503±0.4μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 1, 4, 'e')
+       452±0.6μs        502±0.4μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 1, 1, 'e')
+     10.8±0.01μs      11.9±0.03μs     1.11  bench_function_base.Sort.time_argsort('merge', 'int64', ('reversed',))
+       452±0.3μs        502±0.2μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 1, 'e')
+         454±2μs          504±2μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 4, 'e')
+       453±0.1μs        503±0.3μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 2, 'e')
+       453±0.5μs        503±0.7μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 1, 'e')
+     25.8±0.04μs       28.6±0.1μs     1.11  bench_ufunc.CustomInplace.time_double_add_temp
+       176±0.1ms        195±0.2ms     1.10  bench_function_base.Histogram2D.time_fine_binning
+        2.16±0μs      2.38±0.01μs     1.10  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'complex64')
+         554±3μs         609±50μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 4, 'e')
+         528±3μs          580±3μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'e')
+         528±2μs          580±3μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 2, 4, 'e')
+     1.28±0.01ms      1.40±0.03ms     1.10  bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float32'>)
+       528±0.3μs          579±3μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 1, 'e')
+         528±3μs          579±3μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 4, 'e')
+        1.08±0ms      1.18±0.09ms     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 1, 1, 'e')
+         533±7μs         585±40μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 4, 'e')
+     1.20±0.01μs         1.32±0μs     1.10  bench_itemselection.PutMask.time_sparse(False, 'complex128')
+         528±2μs          579±2μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 4, 'e')
+       527±0.2μs          578±1μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 2, 1, 'e')
+         529±1μs          580±3μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 2, 'e')
+       528±0.9μs        578±0.9μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 2, 'e')
+         527±2μs          578±2μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 2, 'e')
+       528±0.7μs        578±0.7μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 1, 'e')
+     2.17±0.01μs      2.38±0.01μs     1.10  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int64')
+       527±0.3μs        577±0.6μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 1, 'e')
+       527±0.5μs        577±0.2μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 1, 'e')
+       527±0.8μs        577±0.1μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 1, 'e')
+       527±0.1μs        577±0.7μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 1, 'e')
+       527±0.1μs        577±0.1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 1, 'e')
+       528±0.4μs       578±0.06μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 1, 'e')
+         529±2μs        579±0.7μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 2, 'e')
+       527±0.4μs        577±0.7μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 1, 1, 'e')
+     2.17±0.01μs      2.38±0.02μs     1.09  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float64')
+       527±0.2μs        577±0.4μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 1, 2, 'e')
+         528±2μs        578±0.1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 4, 'e')
+         527±1μs          577±1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 2, 'e')
+         528±2μs        578±0.1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 1, 4, 'e')
+       528±0.7μs          578±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 2, 'e')
+       528±0.6μs        578±0.7μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 1, 'e')
+         529±1μs          578±1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 2, 'e')
+         529±1μs          578±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 2, 'e')
+         529±2μs          579±1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 2, 'e')
+         528±3μs        578±0.4μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 1, 'e')
+         278±4μs         304±10μs     1.09  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '!=')
+       528±0.7μs        578±0.9μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 1, 'e')
+         528±2μs          578±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 4, 'e')
+         530±2μs          579±1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 2, 'e')
+        1.22±0μs         1.33±0μs     1.09  bench_itemselection.PutMask.time_dense(True, 'float32')
+         531±2μs          580±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 4, 'e')
+         530±1μs        578±0.8μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 1, 'e')
+         529±1μs        578±0.9μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 2, 'e')
+     1.21±0.02μs         1.32±0μs     1.09  bench_itemselection.PutMask.time_sparse(False, 'longfloat')
+        1.08±0ms      1.18±0.08ms     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 4, 1, 'e')
+       289±0.9ns         315±20ns     1.09  bench_ma.MAMethod0v.time_methods_0v('conjugate', 'small')
+         552±1μs        602±0.6μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 1, 'e')
+       551±0.9μs        600±0.9μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 1, 'e')
+         530±1μs        578±0.2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 2, 'e')
+        1.21±0μs      1.32±0.01μs     1.09  bench_itemselection.PutMask.time_sparse(False, 'int32')
+       551±0.8μs        601±0.9μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 1, 'e')
+         552±2μs          602±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 2, 'e')
+         552±1μs          602±1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 2, 'e')
+         552±2μs          601±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 1, 'e')
+         530±6μs        578±0.7μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 1, 'e')
+         531±3μs          579±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 4, 'e')
+         553±2μs          602±1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 4, 'e')
+         533±8μs          581±3μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 4, 'e')
+       552±0.3μs        601±0.8μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 1, 'e')
+         552±2μs          601±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 2, 'e')
+       553±0.5μs        602±0.9μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 4, 'e')
+         553±2μs          602±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 2, 'e')
+         552±2μs          601±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 1, 'e')
+         553±2μs          602±3μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 4, 'e')
+     1.21±0.01μs         1.31±0μs     1.09  bench_itemselection.PutMask.time_sparse(False, 'float32')
+      321±0.08μs        349±0.4μs     1.09  bench_reduce.AddReduceSeparate.time_reduce(1, 'int32')
+         553±1μs          601±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 2, 'e')
+         554±2μs          603±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 2, 'e')
+        1.19±0ms      1.30±0.01ms     1.09  bench_lib.Nan.time_nanvar(200000, 90.0)
+     1.19±0.01ms         1.30±0ms     1.09  bench_lib.Nan.time_nanstd(200000, 90.0)
+         532±3μs          579±2μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 4, 'e')
+         533±9μs          579±1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 2, 'e')
+         471±7μs          512±3μs     1.09  bench_function_base.Sort.time_argsort('heap', 'int16', ('reversed',))
+     1.23±0.01μs         1.33±0μs     1.09  bench_itemselection.PutMask.time_dense(True, 'int32')
+     5.82±0.01μs      6.32±0.01μs     1.09  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'complex256')
+       315±0.6μs        342±0.2μs     1.08  bench_reduce.AddReduceSeparate.time_reduce(1, 'int16')
+         555±3μs          602±3μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 4, 'e')
+         482±4μs          523±1μs     1.08  bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))
+        1.18±0ms      1.28±0.08ms     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsin'>, 1, 2, 'e')
+      27.5±0.3μs      29.8±0.06μs     1.08  bench_ufunc.CustomInplace.time_float_add_temp
+        1.18±0ms      1.28±0.08ms     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsin'>, 1, 1, 'e')
+     5.84±0.01μs      6.33±0.01μs     1.08  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'complex256')
+     27.2±0.01μs      29.5±0.02μs     1.08  bench_function_base.Sort.time_argsort('heap', 'int64', ('uniform',))
+         534±2μs          578±3μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 4, 'e')
+        1.08±0ms      1.17±0.08ms     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 2, 2, 'e')
+      7.72±0.2μs      8.35±0.05μs     1.08  bench_ma.MACreation.time_ma_creations(100, True)
+       575±0.8μs        623±0.4μs     1.08  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'bool'>)
+       865±0.6μs        935±0.9μs     1.08  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'bool'>)
+       291±0.3μs        314±0.1μs     1.08  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'bool'>)
+       862±0.5μs        932±0.6μs     1.08  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'bool'>)
+       578±0.5μs        625±0.4μs     1.08  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'bool'>)
+         557±4μs          602±2μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 4, 'e')
+        1.08±0ms      1.16±0.07ms     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 2, 1, 'e')
+         186±6μs          201±1μs     1.08  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'object'>)
+         290±1μs        313±0.3μs     1.08  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'bool'>)
+     3.29±0.02μs      3.54±0.01μs     1.08  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'complex128')
+       681±0.5μs        732±0.5μs     1.08  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int16'>)
+        1.08±0ms      1.16±0.08ms     1.07  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 4, 4, 'e')
+        1.03±0ms         1.11±0ms     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int16'>)
+     3.29±0.01μs      3.54±0.01μs     1.07  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'complex128')
+       519±0.9μs          557±1μs     1.07  bench_lib.Isin.time_isin(100000, 10)
+         459±5ns          492±1ns     1.07  bench_array_coercion.ArrayCoercionSmall.time_asanyarray_dtype(1)
+         522±2μs          560±2μs     1.07  bench_lib.Isin.time_isin(100000, 10000)
+     3.30±0.02μs      3.54±0.01μs     1.07  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'longfloat')
+       602±0.8μs          646±8μs     1.07  bench_function_base.Sort.time_argsort('heap', 'int16', ('sorted_block', 10))
+     7.15±0.02ms      7.66±0.02ms     1.07  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>)
+      341±0.07μs        365±0.5μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int16'>)
+     18.7±0.06μs      20.0±0.08μs     1.07  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '==')
+     36.2±0.07μs      38.8±0.03μs     1.07  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 4, 'F')
+     3.30±0.03μs         3.53±0μs     1.07  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'longfloat')
+        1.03±0ms         1.11±0ms     1.07  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int16'>)
+     36.2±0.04μs      38.7±0.02μs     1.07  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '==')
+      65.7±0.6μs       70.3±0.5μs     1.07  bench_function_base.Sort.time_sort('merge', 'float32', ('sorted_block', 100))
+       683±0.9μs          731±1μs     1.07  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int16'>)
+     36.1±0.01μs      38.7±0.03μs     1.07  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 2, 'F')
+     36.1±0.01μs      38.6±0.02μs     1.07  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 1, 'F')
+     3.31±0.01μs      3.54±0.01μs     1.07  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex256')
+        1.67±0μs      1.79±0.01μs     1.07  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float16')
+         526±2ns          563±6ns     1.07  bench_array_coercion.ArrayCoercionSmall.time_array([1])
+         627±1μs          670±1μs     1.07  bench_function_base.Sort.time_argsort('heap', 'int16', ('sorted_block', 100))
+      71.9±0.2μs      76.8±0.09μs     1.07  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '==')
+     2.55±0.05μs      2.72±0.02μs     1.07  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'object'>)
+        1.46±0μs      1.55±0.01μs     1.07  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float32')
+     1.47±0.01μs         1.57±0μs     1.07  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'complex64')
+        1.47±0μs      1.57±0.01μs     1.07  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float64')
+         572±2μs          611±5μs     1.07  bench_function_base.Sort.time_argsort('heap', 'int16', ('sorted_block', 1000))
+     37.3±0.09μs      39.8±0.06μs     1.07  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '==')
+       341±0.6μs        363±0.2μs     1.07  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int16'>)
+        651±10ns          694±4ns     1.07  bench_ufunc.UFuncSmall.time_ufunc_small_array('cos')
+         824±5ns          878±2ns     1.07  bench_io.Copy.time_cont_assign('int8')
+         851±2μs          908±5μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int32'>)
+        1.18±0ms       1.26±0.1ms     1.07  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsin'>, 2, 2, 'e')
+        1.08±0ms      1.15±0.05ms     1.07  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log10'>, 1, 2, 'e')
+         464±1μs        494±0.5μs     1.07  bench_function_base.Sort.time_argsort('merge', 'uint32', ('random',))
+       637±0.1μs        678±0.7μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 2, 1, 'e')
+     1.41±0.01ms         1.50±0ms     1.06  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int32'>)
+       636±0.1μs        677±0.2μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 1, 1, 'e')
+       637±0.3μs        678±0.1μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 4, 2, 'e')
+     5.02±0.09μs       5.34±0.1μs     1.06  bench_indexing.ScalarIndexing.time_index(0)
+         637±1μs          678±2μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 2, 2, 'e')
+       637±0.1μs        678±0.4μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 1, 2, 'e')
+      11.3±0.1ms       12.0±0.1ms     1.06  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100000)
+       637±0.3μs        678±0.6μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 4, 1, 'e')
+        1.41±0ms      1.50±0.01ms     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int32'>)
+         638±3μs          678±3μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 2, 4, 'e')
+       638±0.3μs        678±0.5μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 1, 4, 'e')
+        1.22±0ms       1.30±0.1ms     1.06  bench_io.Copy.time_memcpy_large_out_of_place('float32')
+     12.9±0.04μs      13.7±0.08μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'bool'>)
+     2.17±0.01μs      2.31±0.01μs     1.06  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float16')
+        1.46±0μs      1.55±0.01μs     1.06  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int32')
+       418±0.8μs          443±8μs     1.06  bench_indexing.Indexing.time_op('indexes_rand_', 'np.ix_(I, I)', '=1')
+         427±3μs          453±5μs     1.06  bench_function_base.Sort.time_argsort('heap', 'int16', ('ordered',))
+     2.18±0.02μs      2.31±0.01μs     1.06  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int16')
+      48.4±0.1μs       51.4±0.1μs     1.06  bench_function_base.Sort.time_argsort('quick', 'uint32', ('uniform',))
+         640±3μs          679±3μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 4, 4, 'e')
+      81.8±0.2μs       86.7±0.6μs     1.06  bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))
+         692±1μs          734±1μs     1.06  bench_function_base.Sort.time_argsort('heap', 'int16', ('random',))
+        1.18±0ms       1.25±0.1ms     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'arcsin'>, 2, 1, 'e')
+     13.0±0.05μs      13.7±0.08μs     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'bool'>)
+         854±2μs          904±2μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int32'>)
+         497±1ns          526±2ns     1.06  bench_array_coercion.ArrayCoercionSmall.time_array_invalid_kwarg(1)
+         852±3ns         901±10ns     1.06  bench_scalar.ScalarMath.time_add_int32_other('int64')
+         503±2μs          533±3μs     1.06  bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 10))
+     86.4±0.04μs      91.4±0.04μs     1.06  bench_ufunc_strides.AVX_ldexp.time_ufunc('e', 1)
+       421±0.3μs          445±1μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int32'>)
+     15.3±0.07μs      16.1±0.01μs     1.06  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int16'>)
+     1.47±0.01μs         1.56±0μs     1.06  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int64')
+     86.5±0.04μs      91.4±0.02μs     1.06  bench_ufunc_strides.AVX_ldexp.time_ufunc('e', 4)
+       422±0.6μs        446±0.8μs     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int32'>)
+         506±3μs          534±3μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 1, 2, 'e')
+         506±1μs          535±1μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 1, 'e')
+        1.13±0ms         1.19±0ms     1.06  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10000)
+         505±3μs          534±3μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 1, 1, 'e')
+       506±0.9μs          535±1μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 2, 'e')
+       504±0.7μs        533±0.7μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 1, 'e')
+         505±1μs        534±0.8μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 2, 'e')
+         506±4μs          535±3μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 1, 4, 'e')
+       506±0.5μs        534±0.6μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 4, 'e')
+         602±7ns         636±10ns     1.06  bench_ufunc.UFuncSmall.time_ufunc_small_array('sqrt')
+         507±1μs          535±1μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 4, 'e')
+       971±0.4μs         1.02±0ms     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int8'>)
+         404±1μs          427±4μs     1.05  bench_function_base.Sort.time_sort('heap', 'int32', ('reversed',))
+       973±0.6μs         1.03±0ms     1.05  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int8'>)
+         819±9ns         863±20ns     1.05  bench_strings.StringComparisons.time_compare_different(100, 'S', False, '==')
+       368±0.3μs        388±0.9μs     1.05  bench_function_base.Sort.time_sort('heap', 'uint32', ('ordered',))
+         484±1μs          510±2μs     1.05  bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 1000))
+    10.00±0.06μs       10.5±0.1μs     1.05  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'bool'>)
+      17.7±0.7μs      18.6±0.06μs     1.05  bench_ma.UFunc.time_1d(True, True, 100)
+       646±0.9μs        680±0.3μs     1.05  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int8'>)
+       670±0.5μs          705±2μs     1.05  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 1, 1, 'e')
+      98.1±0.2μs          103±3μs     1.05  bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 10))
+         152±1μs        160±0.7μs     1.05  bench_random.Randint_dtype.time_randint_fast('uint8')
+         766±3ns          806±8ns     1.05  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
+        990±20ns      1.04±0.01μs     1.05  bench_strings.StringComparisons.time_compare_different(100, 'S', True, '==')
+      16.9±0.2μs      17.7±0.03μs     1.05  bench_ma.UFunc.time_1d(True, True, 10)
+       649±0.9μs        682±0.4μs     1.05  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int8'>)
+      19.7±0.3μs       20.7±0.4μs     1.05  bench_ma.UFunc.time_1d(True, True, 1000)
+     10.1±0.05μs      10.6±0.02μs     1.05  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'bool'>)
-       564±0.5μs          536±1μs     0.95  bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 100))
-         497±3μs          473±2μs     0.95  bench_function_base.Sort.time_argsort('merge', 'int64', ('random',))
-         441±3ns          419±1ns     0.95  bench_scalar.ScalarMath.time_multiplication('int16')
-      6.15±0.2μs      5.84±0.01μs     0.95  bench_ma.Indexing.time_1d(True, 1, 1000)
-     5.28±0.04ms      5.02±0.02ms     0.95  bench_core.VarComplex.time_var(1000000)
-     48.9±0.07μs      46.5±0.02μs     0.95  bench_ufunc_strides.AVX_ldexp.time_ufunc('f', 4)
-     48.8±0.03μs      46.3±0.03μs     0.95  bench_ufunc_strides.AVX_ldexp.time_ufunc('f', 2)
-      82.3±0.4μs       78.1±0.5μs     0.95  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
-     48.8±0.02μs      46.3±0.01μs     0.95  bench_ufunc_strides.AVX_ldexp.time_ufunc('f', 1)
-     46.7±0.05μs      44.3±0.06μs     0.95  bench_ufunc_strides.AVX_ldexp.time_ufunc('d', 4)
-     25.1±0.01μs      23.8±0.04μs     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100)
-       233±0.6μs        221±0.4μs     0.95  bench_core.VarComplex.time_var(100000)
-         248±3μs          235±2μs     0.95  bench_lib.Nan.time_nanargmax(200000, 0)
-     46.5±0.03μs      44.0±0.09μs     0.95  bench_ufunc_strides.AVX_ldexp.time_ufunc('d', 2)
-         503±1μs          476±2μs     0.95  bench_function_base.Sort.time_argsort('merge', 'int32', ('random',))
-       278±0.5μs        263±0.5μs     0.95  bench_function_base.Sort.time_argsort('quick', 'int16', ('sorted_block', 1000))
-       617±0.5μs        584±0.7μs     0.95  bench_function_base.Sort.time_sort('heap', 'int64', ('random',))
-        4.30±0μs      4.06±0.02μs     0.95  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int16')
-     46.4±0.01μs      43.8±0.04μs     0.94  bench_ufunc_strides.AVX_ldexp.time_ufunc('d', 1)
-     2.07±0.01ms      1.95±0.01ms     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10000)
-       127±0.6μs       120±0.08μs     0.94  bench_ufunc.UFunc.time_ufunc_types('isfinite')
-     44.2±0.03μs      41.7±0.08μs     0.94  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 4, 'D')
-     4.30±0.01μs      4.06±0.01μs     0.94  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float16')
-         280±2μs          264±1μs     0.94  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))
-     1.05±0.01μs         986±10ns     0.94  bench_strings.StringComparisons.time_compare_identical(100, 'S', True, '<')
-         229±2μs        216±0.9μs     0.94  bench_lib.Nan.time_nansum(200000, 0.1)
-         373±1μs          352±2μs     0.94  bench_indexing.Indexing.time_op('indexes_', 'np.ix_(I, I)', '')
-      5.71±0.2μs      5.38±0.03μs     0.94  bench_ma.Indexing.time_1d(False, 2, 1000)
-      99.9±0.3μs       94.1±0.3μs     0.94  bench_function_base.Sort.time_argsort('quick', 'int16', ('reversed',))
-     43.8±0.03μs      41.3±0.05μs     0.94  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 2, 'D')
-       407±0.9μs          384±4μs     0.94  bench_function_base.Sort.time_sort('heap', 'int16', ('reversed',))
-     43.7±0.03μs      41.2±0.01μs     0.94  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 1, 'D')
-      6.30±0.2μs      5.93±0.04μs     0.94  bench_ma.Indexing.time_1d(True, 2, 10)
-         405±7μs          381±2μs     0.94  bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',))
-     1.59±0.07μs         1.49±0μs     0.94  bench_itemselection.PutMask.time_dense(False, 'complex128')
-      55.2±0.1μs      52.0±0.08μs     0.94  bench_function_base.Where.time_interleaved_zeros_x2
-       108±0.5μs       101±0.09μs     0.94  bench_function_base.Sort.time_argsort('quick', 'uint32', ('reversed',))
-         223±1μs          210±2μs     0.94  bench_lib.Nan.time_nansum(200000, 0)
-     39.9±0.04μs      37.4±0.05μs     0.94  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '>=')
-        3.79±0μs      3.56±0.01μs     0.94  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int32')
-        773±20ns          724±5ns     0.94  bench_scalar.ScalarMath.time_power_of_two('complex128')
-     19.9±0.02μs      18.7±0.04μs     0.94  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '>=')
-     38.7±0.04μs      36.2±0.03μs     0.94  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '>=')
-        86.8±4μs       81.2±0.3μs     0.94  bench_ufunc_strides.AVX_UFunc_log.time_log(1, 'e')
-       553±0.6μs          518±1μs     0.94  bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 10))
-      76.9±0.2μs       72.0±0.1μs     0.94  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '>=')
-     3.82±0.01μs      3.57±0.01μs     0.94  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'complex128')
-     1.28±0.01μs      1.19±0.01μs     0.94  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float32')
-         426±2μs          398±4μs     0.93  bench_function_base.Sort.time_sort('heap', 'int64', ('reversed',))
-       414±0.3μs        386±0.6μs     0.93  bench_function_base.Sort.time_sort('merge', 'uint32', ('random',))
-         243±2μs          227±2μs     0.93  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))
-     3.80±0.01μs      3.54±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'complex64')
-     3.81±0.01μs      3.55±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'longfloat')
-     3.81±0.01μs      3.55±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex256')
-       524±0.8μs          489±4μs     0.93  bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 1000))
-     17.4±0.04μs      16.2±0.02μs     0.93  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '<')
-     1.28±0.01μs      1.19±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'float16')
-     13.3±0.09μs      12.4±0.04μs     0.93  bench_indexing.ScalarIndexing.time_assign_cast(0)
-     1.29±0.02μs      1.20±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 2), 'wrap', 'int16')
-     52.1±0.08μs      48.4±0.08μs     0.93  bench_function_base.Sort.time_argsort('quick', 'int32', ('uniform',))
-     3.81±0.01μs      3.54±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'float64')
-     33.7±0.03μs      31.3±0.05μs     0.93  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '>=')
-     33.7±0.01μs      31.3±0.03μs     0.93  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '<')
-     2.59±0.01ms      2.40±0.01ms     0.93  bench_io.Copy.time_memcpy_large_out_of_place('complex64')
-     34.9±0.08μs      32.4±0.07μs     0.93  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '>=')
-     34.9±0.04μs      32.3±0.07μs     0.93  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '<')
-      20.1±0.2ms       18.7±0.2ms     0.93  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100000)
-      26.8±0.2μs      24.8±0.04μs     0.93  bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 1000))
-     1.30±0.01μs      1.20±0.01μs     0.93  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'int32')
-     66.8±0.02μs      61.8±0.07μs     0.93  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '>=')
-     3.83±0.02μs      3.54±0.01μs     0.92  bench_itemselection.Take.time_contiguous((1000, 3), 'raise', 'int64')
-     1.31±0.02μs      1.21±0.01μs     0.92  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '<')
-      58.8±0.3μs       54.2±0.6μs     0.92  bench_core.PackBits.time_packbits_little(<class 'numpy.uint64'>)
-     59.6±0.02μs      54.7±0.01μs     0.92  bench_linalg.Linalg.time_op('norm', 'float16')
-     54.9±0.02μs      50.3±0.09μs     0.92  bench_function_base.Where.time_interleaved_zeros_x4
-     1.89±0.01μs      1.72±0.01μs     0.91  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '<')
-     1.81±0.01μs         1.65±0μs     0.91  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int32')
-      98.9±0.2μs       90.3±0.2μs     0.91  bench_function_base.Sort.time_argsort('quick', 'int64', ('reversed',))
-     5.96±0.02ms         5.43±0ms     0.91  bench_ufunc.UFunc.time_ufunc_types('tanh')
-      39.5±0.1μs      35.9±0.04μs     0.91  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))
-     1.33±0.01μs      1.21±0.01μs     0.90  bench_itemselection.PutMask.time_sparse(False, 'int64')
-     1.33±0.01μs      1.21±0.01μs     0.90  bench_itemselection.PutMask.time_sparse(False, 'int16')
-     54.7±0.04μs       49.5±0.2μs     0.90  bench_function_base.Where.time_interleaved_zeros_x8
-     1.34±0.01μs         1.21±0μs     0.90  bench_itemselection.PutMask.time_sparse(False, 'complex64')
-     1.33±0.01μs         1.20±0μs     0.90  bench_itemselection.PutMask.time_sparse(False, 'float64')
-     1.82±0.01μs         1.64±0μs     0.90  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float32')
-     53.8±0.04μs      48.5±0.04μs     0.90  bench_function_base.Sort.time_argsort('quick', 'int16', ('uniform',))
-        77.4±2μs       69.7±0.3μs     0.90  bench_function_base.Sort.time_sort('merge', 'float64', ('sorted_block', 100))
-     1.33±0.01μs      1.19±0.01μs     0.90  bench_itemselection.PutMask.time_sparse(False, 'float16')
-     20.0±0.02μs      17.9±0.02μs     0.89  bench_function_base.Sort.time_sort('heap', 'uint32', ('uniform',))
-     68.6±0.07ms       61.3±0.2ms     0.89  bench_function_base.Sort.time_sort_worst
-     20.0±0.04μs      17.8±0.03μs     0.89  bench_function_base.Sort.time_sort('heap', 'int32', ('uniform',))
-     12.0±0.01μs      10.7±0.03μs     0.89  bench_function_base.Sort.time_argsort('merge', 'uint32', ('reversed',))
-      87.0±0.3μs       77.1±0.1μs     0.89  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 100))
-     1.53±0.01μs         1.35±0μs     0.88  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
-        72.9±4μs       64.3±0.1μs     0.88  bench_function_base.Sort.time_argsort('quick', 'uint32', ('ordered',))
-        567±60ns          499±3ns     0.88  bench_scalar.ScalarMath.time_abs('int32')
-      51.8±0.1μs      45.4±0.05μs     0.88  bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))
-      51.7±0.2μs       45.2±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 4, 4, 'e')
-      51.7±0.2μs       45.1±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 2, 4, 'e')
-      85.7±0.4μs       74.8±0.3μs     0.87  bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))
-     51.7±0.04μs      45.1±0.02μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 4, 'e')
-      51.8±0.2μs       45.2±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 4, 'e')
-       112±0.8μs         97.4±1μs     0.87  bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 10))
-         152±3μs          132±5μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 1, 'f')
-     51.7±0.08μs      45.0±0.02μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 4, 'e')
-      51.8±0.2μs       45.1±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 4, 4, 'e')
-      51.8±0.1μs       45.1±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 2, 4, 'e')
-     51.7±0.04μs      45.0±0.02μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 4, 'e')
-      51.8±0.2μs       45.1±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 4, 'e')
-        560±70ns          486±3ns     0.87  bench_scalar.ScalarMath.time_abs('complex128')
-      4.71±0.4μs      4.06±0.02μs     0.86  bench_itemselection.Take.time_contiguous((1000, 3), 'wrap', 'float16')
-      51.6±0.1μs      44.4±0.09μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 4, 2, 'e')
-      51.6±0.1μs      44.4±0.06μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 4, 2, 'e')
-      51.6±0.1μs       44.4±0.1μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 2, 'e')
-      51.5±0.1μs      44.3±0.09μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 2, 'e')
-      114±0.07μs       97.8±0.5μs     0.86  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 10))
-     51.4±0.05μs      44.2±0.05μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 2, 'e')
-      51.6±0.1μs      44.3±0.07μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 2, 2, 'e')
-       146±0.3μs        125±0.3μs     0.86  bench_function_base.Sort.time_argsort('quick', 'float64', ('reversed',))
-     51.4±0.04μs      44.1±0.02μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 2, 'e')
-     51.6±0.08μs      44.3±0.07μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 2, 2, 'e')
-     51.5±0.02μs      44.1±0.04μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 2, 'e')
-         155±2μs          133±6μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 1, 'f')
-       152±0.7μs          130±4μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 1, 'f')
-      51.4±0.1μs       43.9±0.1μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 2, 1, 'e')
-     51.4±0.05μs       43.9±0.2μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 2, 1, 'e')
-     51.6±0.03μs      44.0±0.02μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 1, 'e')
-       153±0.6μs          130±3μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 2, 'f')
-     51.4±0.02μs      43.8±0.06μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 1, 'e')
-     51.3±0.01μs      43.7±0.08μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 1, 'e')
-     51.2±0.05μs      43.6±0.04μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 1, 'e')
-     51.6±0.03μs      43.9±0.03μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 4, 1, 'e')
-     51.6±0.04μs      43.9±0.07μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 4, 1, 'e')
-     51.3±0.04μs      43.7±0.01μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 1, 'e')
-       153±0.8μs          130±1μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 2, 'd')
-         152±3μs          129±2μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 1, 'f')
-     60.5±0.05μs       51.4±0.3μs     0.85  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '<')
-     14.5±0.05μs      12.3±0.04μs     0.85  bench_function_base.Sort.time_argsort('merge', 'float32', ('reversed',))
-         152±1μs        129±0.9μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 1, 'f')
-         154±4μs          130±3μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 4, 'd')
-       121±0.2μs        102±0.4μs     0.85  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '<')
-       152±0.9μs          129±1μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 1, 'f')
-      119±0.05μs        101±0.1μs     0.85  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '<')
-         154±5μs        130±0.4μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 2, 'd')
-       155±0.6μs          131±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 4, 'd')
-       238±0.6μs        201±0.7μs     0.84  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '<')
-         155±4μs          131±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 2, 'f')
-         154±1μs        130±0.4μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 2, 'd')
-       155±0.7μs          130±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 4, 'd')
-         157±3μs          132±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 4, 'f')
-         154±4μs          130±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 4, 'f')
-         159±3μs          134±3μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 4, 'd')
-         154±1μs        130±0.5μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 2, 'd')
-         153±1μs        128±0.8μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 1, 'd')
-         152±3μs          128±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 1, 'f')
-         159±4μs          134±3μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 4, 'd')
-         156±2μs          131±3μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 4, 'f')
-       155±0.5μs          130±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 4, 'd')
-       153±0.4μs        128±0.4μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 2, 'd')
-     2.15±0.01μs         1.81±0μs     0.84  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
-       154±0.9μs        129±0.6μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 1, 'd')
-         154±1μs        129±0.3μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 1, 'd')
-         154±3μs        130±0.4μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 2, 'd')
-         153±1μs        128±0.7μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 4, 'f')
-     1.60±0.02μs         1.35±0μs     0.84  bench_itemselection.PutMask.time_dense(False, 'int16')
-       154±0.6μs        129±0.7μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 1, 'd')
-         154±2μs          129±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 2, 'd')
-       153±0.8μs        128±0.3μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 2, 'd')
-         153±2μs          129±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 2, 'd')
-         158±4μs          132±3μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 4, 'f')
-         153±3μs          128±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 1, 'f')
-         156±3μs          130±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 4, 'd')
-         154±2μs          129±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 4, 'f')
-         155±2μs          130±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 2, 'd')
-     1.59±0.01μs      1.33±0.01μs     0.84  bench_itemselection.PutMask.time_dense(False, 'float16')
-         152±7μs          128±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 1, 'f')
-         155±3μs          130±3μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 4, 'd')
-         153±6μs        128±0.9μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 2, 'f')
-         160±4μs          134±4μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 4, 'd')
-         159±4μs          133±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 4, 'd')
-       152±0.6μs        127±0.6μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 1, 'd')
-       152±0.5μs        127±0.4μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 1, 'd')
-         153±1μs          128±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 2, 'f')
-         153±1μs          128±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 4, 'f')
-       152±0.6μs        127±0.8μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 1, 'd')
-         153±6μs        128±0.4μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 1, 'd')
-         154±2μs          128±2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 2, 'd')
-         154±2μs       129±0.09μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 1, 'd')
-         153±3μs        128±0.8μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 2, 'f')
-         156±2μs          130±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 4, 'f')
-         153±4μs        127±0.6μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 4, 'f')
-         154±2μs          128±1μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 2, 'd')
-         156±2μs          130±2μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 4, 'f')
-         155±2μs          130±2μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 4, 'f')
-         118±1μs         98.7±2μs     0.83  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 10))
-         154±2μs          129±2μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 2, 'd')
-         155±2μs          130±1μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 4, 'f')
-         153±6μs        127±0.6μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 1, 'f')
-         156±4μs          130±3μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 4, 'd')
-         153±7μs        128±0.8μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 1, 'd')
-         154±1μs          128±1μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 2, 'd')
-      4.65±0.2μs       3.86±0.3μs     0.83  bench_itemselection.PutMask.time_dense(False, 'complex256')
-         153±4μs          128±1μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 1, 'd')
-         155±1μs          129±2μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 4, 'f')
-         154±2μs        128±0.9μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 2, 'f')
-         153±3μs        127±0.9μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 2, 1, 'd')
-         157±2μs          130±1μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 4, 'f')
-         155±4μs        128±0.3μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 2, 'f')
-         160±5μs          133±4μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 4, 'd')
-         155±6μs          129±1μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 1, 'f')
-         154±3μs          128±1μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 2, 'f')
-         156±3μs          129±3μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 4, 'd')
-         154±2μs          127±2μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 2, 'f')
-         154±3μs        128±0.8μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 2, 'f')
-        73.4±2μs      60.6±0.05μs     0.83  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '!=')
-      66.1±0.3μs      54.6±0.06μs     0.82  bench_function_base.Sort.time_argsort('quick', 'int16', ('ordered',))
-         156±7μs          129±2μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 2, 'd')
-         155±7μs        127±0.8μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 2, 'f')
-         160±8μs          131±3μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 2, 4, 'd')
-         158±4μs          129±3μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 4, 'd')
-         156±6μs          128±2μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 1, 'f')
-      67.1±0.2μs       54.9±0.2μs     0.82  bench_function_base.Sort.time_argsort('quick', 'int64', ('ordered',))
-         156±3μs          128±1μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 2, 4, 'f')
-         157±6μs          128±1μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 2, 'f')
-         157±6μs          127±1μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 1, 'f')
-         159±8μs        129±0.6μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 2, 2, 'f')
-     10.7±0.01μs      8.67±0.01μs     0.81  bench_function_base.Sort.time_sort('merge', 'int64', ('reversed',))
-         157±4μs        127±0.6μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 2, 2, 'f')
-         160±6μs          129±2μs     0.80  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 1, 'f')
-     2.78±0.01μs      2.23±0.02μs     0.80  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
-     12.5±0.03μs      10.0±0.01μs     0.80  bench_function_base.Sort.time_sort('merge', 'float32', ('reversed',))
-      36.5±0.4μs         29.2±2μs     0.80  bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 1000))
-      1.53±0.2μs         1.21±0μs     0.79  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '>')
-        164±10μs          129±3μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 1, 1, 'f')
-         165±3μs        128±0.9μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 2, 'f')
-      77.6±0.7μs       59.7±0.6μs     0.77  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))
-     15.0±0.09μs      11.5±0.08μs     0.77  bench_io.Copy.time_strided_copy('complex64')
-         307±1μs        235±0.3μs     0.77  bench_random.Bounded.time_bounded('PCG64', [<class 'numpy.uint32'>, 1535])
-      2.94±0.1μs      2.14±0.07μs     0.73  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '==')
-        62.8±5μs       45.8±0.2μs     0.73  bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))
-     13.2±0.01μs      9.48±0.01μs     0.72  bench_io.Copy.time_cont_assign('complex64')
-        181±20μs        129±0.6μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 1, 'd')
-     1.82±0.03μs      1.30±0.01μs     0.71  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '==')
-     64.3±0.03μs      45.5±0.09μs     0.71  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
-       192±0.1μs        136±0.3μs     0.71  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
-       128±0.1μs      90.0±0.05μs     0.71  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
-        181±20μs        128±0.4μs     0.70  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 1, 'd')
-        72.1±2μs       50.4±0.6μs     0.70  bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 100))
-       167±0.3μs        115±0.4μs     0.69  bench_function_base.Sort.time_argsort('merge', 'float64', ('sorted_block', 10))
-     51.5±0.03μs       33.3±0.1μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 4, 'e')
-      51.6±0.2μs       33.3±0.1μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 4, 'e')
-      51.6±0.2μs       33.3±0.2μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 4, 'e')
-      51.3±0.1μs      33.0±0.06μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 2, 'e')
-     51.3±0.05μs      33.0±0.07μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 2, 'e')
-     51.3±0.01μs      33.0±0.03μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 2, 'e')
-     51.2±0.02μs      32.8±0.05μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 1, 'e')
-     51.2±0.04μs      32.8±0.06μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 1, 'e')
-     51.1±0.03μs      32.8±0.03μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 1, 'e')
-        199±20μs        127±0.4μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 1, 1, 'd')
-       328±0.2μs        201±0.4μs     0.61  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '>')
-     7.33±0.01μs         4.22±0μs     0.57  bench_io.CopyTo.time_copyto_dense
-         454±2μs        244±0.3μs     0.54  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '==')
-         230±8μs        123±0.2μs     0.54  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '==')
-         119±9μs       61.9±0.1μs     0.52  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '==')
-       52.5±30μs      26.2±0.04μs     0.50  bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 10))
-        257±30μs        122±0.1μs     0.47  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '==')
-      55.2±0.8ms       8.47±0.4ms     0.15  bench_ufunc.At.time_sum_at

Running $ asv run --bench bench_ufunc.At speedup-ufunc.at-main...fe73a8498~1 in the benchmarks directory, I see these results

| commit | time for single benchmark | note |
| --- | --- | --- |
| fe73a84 | 53.9±0.2ms | baseline |
| 7b31199 | 55.6±0.6ms | find the place to add a fast path for ufunc_at |
| c266953 | 53.9±0.04ms | refactor ufunc_at in preparation for breaking it into sub-functions |
| 07c3487 | 53.7±0.2ms | refactor part that does the iteration into function |
| 7853cbc | 18.3±0.09ms | add fast iter loop |
| 154c293 | 17.8±0.03ms | elide the generic_wrapped_legacy_loop wrapper out of the hot loop (does not seem to have significant impact) |
| cfa2b17 | 8.44±0.08ms | do not try SIMD arithmetic loops if count < 4 |
| ef9f35a <speedup-ufunc.at-main> | 8.40±0.02ms | HEAD (fix for non-generic_legacy loops) |

That is the good news. The bad news is that the slow-down in some benchmarks seems real:

$ python runtests.py -j6 --bench-compare fe73a8498 -- -b bench_io.Copy
...
       before           after         ratio
     [fe73a849]       [ef9f35a9]
     <asv-compare~1>       <speedup-ufunc.at-main>
+        2.29±0μs      3.84±0.05μs     1.67  bench_io.Copy.time_cont_assign('float32')
+     9.73±0.02μs      13.4±0.03μs     1.38  bench_io.Copy.time_cont_assign('complex128')
+     14.0±0.04μs      16.9±0.05μs     1.20  bench_io.Copy.time_strided_copy('complex128')
+        1.21±0ms         1.28±0ms     1.06  bench_io.Copy.time_memcpy_large_out_of_place('float32')
-     2.60±0.01ms         2.43±0ms     0.94  bench_io.Copy.time_memcpy_large_out_of_place('complex64')
-        54.3±6μs       47.2±0.1μs     0.87  bench_io.CopyTo.time_copyto_8_dense
-      15.0±0.1μs      11.5±0.07μs     0.77  bench_io.Copy.time_strided_copy('complex64')
-     13.3±0.02μs      9.44±0.01μs     0.71  bench_io.Copy.time_cont_assign('complex64')
-        7.33±0μs         4.22±0μs     0.58  bench_io.CopyTo.time_copyto_dense

@mattip
Member Author

mattip commented Dec 28, 2022

cfa2b17 seems significant and maybe worth exploring for all the loops with SIMD logic

@mattip
Member Author

mattip commented Dec 28, 2022

Lowering the count limit to 2 improved the situation for the float add/subtract/multiply/divide template:

$ python runtests.py -j6 --bench-compare fe73a8498 -- -b bench_io.Copy
...
       before           after         ratio
     [fe73a849]       [fb31a971]
     <asv-compare~1>       <speedup-ufunc.at-main>
-      47.8±0.9μs      44.5±0.03μs     0.93  bench_io.CopyTo.time_copyto_8_dense
-     2.62±0.01ms      2.43±0.01ms     0.93  bench_io.Copy.time_memcpy_large_out_of_place('complex64')
-     14.9±0.03μs      11.5±0.04μs     0.77  bench_io.Copy.time_strided_copy('complex64')
-     13.3±0.05μs      9.44±0.03μs     0.71  bench_io.Copy.time_cont_assign('complex64')
-     7.33±0.01μs         4.21±0μs     0.58  bench_io.CopyTo.time_copyto_dense
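
For readers following along, the guard behind the count limit can be pictured with a rough Python model. This is only a sketch under stated assumptions: `MIN_SIMD_COUNT`, `simd_add` and `add_inner_loop` are hypothetical names, and the vectorized NumPy call merely stands in for the real SIMD kernel.

```python
import numpy as np

MIN_SIMD_COUNT = 2  # hypothetical stand-in for the template's count threshold


def simd_add(a, b, out):
    """Stand-in for a SIMD kernel; reports whether it handled the data."""
    np.add(a, b, out=out)
    return True


def add_inner_loop(a, b, out):
    """Model of the guarded dispatch: skip the SIMD attempt for tiny counts."""
    n = len(a)
    if n < MIN_SIMD_COUNT or not simd_add(a, b, out):
        # Plain scalar loop: the path taken when ufunc.at passes one element.
        for i in range(n):
            out[i] = a[i] + b[i]
        return "scalar"
    return "simd"
```

With a single element the model takes the scalar branch without ever consulting the SIMD stand-in, which is the situation the fast path of ufunc.at creates on every call.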

@seiko2plus
Member

> Here is a complete benchmark run. Is there a way to make asv runs more stable on ubuntu using an AMD processor? Pyperf has some directions, but are they applicable to asv? There is also this project with some hints, has anyone tried them?

Turn off the boost and isolate at least one full core (due to hyper-threading); that is all you need. Follow the pyperf instructions, they should fit, but the last time I checked, pyperf system tune failed to turn off the turbo boost on AMD.

After isolating the CPU core at kernel load time, e.g. through the grub parameters rcu_nocbs=7,15 isolcpus=7,15 ('7,15' is the last core on my CPU; use lscpu -p to find the hyper-thread siblings), you can use the following commands:

sudo pyperf system tune
# disable AMD boost
echo "0" | sudo tee /sys/devices/system/cpu/cpufreq/boost

Then use asv test option --cpu-affinity:

python runtests.py -n --bench-compare parent/main bench_io  -- --cpu-affinity=7

Signed-off-by: mattip <matti.picus@gmail.com>
@seberg
Member

seberg commented Jan 2, 2023

I will have a closer look at it later, in general looks good, since special casing casting and no-casting does seem OK to me (although there may be a way to do it in one, see below).

Eliding the wrapper seems like a very moderate speed-up even after the SIMD change? I can't say I like it and would much prefer to not do it, since it means only old-style loops have the speed up.

There could be a way to do both casting and no-cast in the same loop (although it makes sense to do the second argument's casting in larger buffers rather than individually).
That is, we could create a "with cast" inner-loop, which wraps the casts and the actual inner-loop into a single one. This will also be the correct path to optimize the cast version a lot.

The one thing is that casting of the second argument should be done in chunks (a larger buffer). It may be that MapIter can actually do that already with more direct access. But digging into MapIter a bit more seems like a next, additional step.

@mattip
Member Author

mattip commented Jan 2, 2023

> Eliding the wrapper seems like a very moderate speed-up even after the SIMD change?

Backed out that change, which on my machine made the new benchmark time go from 8.4 to 9.8 msec (+16%). The fast-path is only activated for old-style loops. Adding more logic to the loop will slow it down, so I wouldn't want to merge a casting loop with the non-casting one.

@seberg
Member

seberg commented Jan 2, 2023

The idea of "merging" them is to move the casting into the strided loop function. So you would use the current "no cast" loop also for the casting path without modifying it at all. I do think that would be much better than what we have now, but I am happy to consider it a follow-up.

Why is the fast-path only activated with is_generic_wrapped_legacy_loop? It seems to me that you can simply remove that if.

@mattip
Member Author

mattip commented Jan 2, 2023

> The idea of "merging" them is to move the casting into the strided loop function

That will slow things down. The speed-ups come from

  • removing the third iterator that does buffering
  • skipping the checks for SIMD, etc.

If we add back casting, we will slow things down again. I still am not convinced I should remove the function eliding (commit 333d012) since it led to a 16% slowdown, which I think is significant.
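
As a point of reference for the discussion, the per-element behaviour of a binary ufunc.at can be modelled in pure Python. This is a minimal sketch of the semantics only, not of the C implementation; `at_model` is a hypothetical helper name.

```python
import numpy as np


def at_model(ufunc, a, indices, values):
    """Pure-Python model of a binary ufunc.at: one operation per index."""
    for i, v in zip(indices, values):
        a[i] = ufunc(a[i], v)  # unbuffered, so repeated indices accumulate


idx = np.array([0, 0, 1])
vals = np.array([1.0, 2.0, 3.0])

modelled = np.zeros(3)
at_model(np.add, modelled, idx, vals)

reference = np.zeros(3)
np.add.at(reference, idx, vals)

buffered = np.zeros(3)
buffered[idx] += vals  # buffered fancy indexing: index 0 does not accumulate
```

The model and `np.add.at` agree, while the buffered `+=` loses one of the two contributions to index 0. That unbuffered guarantee is why the loop function ends up being called once per data point.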

@seberg
Member

seberg commented Jan 2, 2023

What I mean would be very very roughly, something like:

int
strided_loop_with_cast(..., NpyAuxData *auxdata)
{
    /* the auxdata carries the casts and the wrapped original loop */
    my_auxdata *data = (my_auxdata *)auxdata;
    cast_to_buffers(...);

    original_function_loop(...);

    cast_from_buffer(...);
    return 0;
}


get_cast(..., &view_offset);

if (view_offset != 0) {
    auxdata = build_wrapping_auxdata(cast, original_loop, original_auxdata);
    strided_loop = &strided_loop_with_cast;
}
else {
    strided_loop = original_loop;
    auxdata = original_auxdata;
}

That is annoying to create and would be similar to here (although that is hopefully a lot more complex than required here).

I am fine with keeping the optimization, 20% is nice! I am just not a big fan of optimizing the old-style loops, when I would hope we can eventually start removing them all.
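
The wrapping idea above can be sketched in Python, purely as an illustration. All names here (`add_loop`, `select_loop`, `loop_with_cast`) are hypothetical, and a dtype comparison stands in for the `view_offset` check of the C pseudocode.

```python
import numpy as np


def add_loop(out, values):
    """Stand-in for the original inner loop; works in place on `out`."""
    np.add(out, values, out=out)


def select_loop(inner_loop, array_dtype, loop_dtype):
    """Return the inner loop unchanged, or wrap it with casts when needed."""
    if np.dtype(array_dtype) == np.dtype(loop_dtype):
        return inner_loop  # no cast needed: use the original loop directly

    def loop_with_cast(out, values):
        buf = out.astype(loop_dtype)                 # cast to buffer
        inner_loop(buf, values.astype(loop_dtype))   # run the original loop
        out[...] = buf.astype(array_dtype)           # cast back from buffer

    return loop_with_cast


ints = np.array([1, 2, 3], dtype=np.int32)
loop = select_loop(add_loop, ints.dtype, np.float64)
loop(ints, np.array([0.5, 1.5, 2.5]))
```

The caller only ever sees one loop signature, so the no-cast case pays nothing extra and the casting case stays a separate, wrapped function.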

@mattip
Member Author

mattip commented Jan 2, 2023

The 15% comes from realizing the auxdata holds a pointer to the real function, and looking it up only once. So I would really rather do something like this, with possibly even more specialization via a flag to indicate that the real_loop is a primitive operation like addition, subtraction, multiplication, negation.


/* only dereference this once */
real_loop = ((long_name*)auxdata)->loop;
/* possibly more setup for casting */

int
strided_loop_with_cast(..., real_loop)
{
    cast_to_buffers(...);

    real_loop(...);

    cast_from_buffer(...);
}


get_cast(..., &view_offset);

if (view_offset != 0) {
    strided_loop = &strided_loop_with_cast;
}
else if (is_primitive(ufunc)) {
    switch (operation) {
        case ADDITION:
            strided_loop = add_one_function[dtype];
            break;
        case MULTIPLICATION:
            strided_loop = mul_one_function[dtype];
            break;
        ....
    }
}
else {
    strided_loop = real_loop;
}

Member

@seberg seberg left a comment

Thanks Matti, generally looks good to me. I didn't quite parse all the paths for correct cleanup.

I would like to suggest not doing that "unpack the wrapped legacy loop" thing in this PR and then discuss it. There is still a lot that can be done, but right now I think the small cleanups I commented on and the 2-3 additional tests (and maybe one error check) would be good.

This code is a mess, and besides having two paths (that we need anyway) the PR seems cleaner than what we had, so I would like to move this on and then follow up with a bit more checks (e.g. for reference leaks).

int
is_generic_wrapped_legacy_loop(PyArrayMethod_StridedLoop *strided_loop) {
return strided_loop == generic_wrapped_legacy_loop;
}
Member

Unless you want to re-add the fast path, please remove this.

Comment on lines 6014 to 6016
if (res != 0 && err_msg) {
PyErr_SetString(PyExc_ValueError, err_msg);
}
Member

Suggested change
if (res != 0 && err_msg) {
PyErr_SetString(PyExc_ValueError, err_msg);
}

Member

(This is also in the other path, err_msg is always NULL, so this is dead code.)

int buffersize;
int errormask = 0;
int res = 0;
char * err_msg = NULL;
Member

Suggested change
char * err_msg = NULL;

Member

Seems that this was never used to begin with.

}
else {
Py_INCREF(PyArray_DESCR(op1_array));
array_operands[1] = new_array_op(op1_array, iter->dataptr);
array_operands[2] = NULL;
nop = 2;
}
Member

Could you add a test to cover this path? I am not sure that this should even happen or be allowed, but:

arr = np.ones(10, dtype=int)
np.log.at(arr, [1, 2, 0])

should do the trick (yes, it is a terrible example). I am not sure there are neat ones; it seems entirely plausible that we should just reject this path entirely (assuming we make sure the fast path is always taken when it should be).

Another (maybe saner) option would be:

unaligned_arr = np.zeros(1+8*4, dtype="b")[1:].view(np.int64)
unaligned_arr[...] = 1
np.negative.at(unaligned_arr, [0, 1, 2, 2])
# index 2 is negated twice and index 3 is never touched, so both end at 1
assert_array_equal(unaligned_arr, [-1, -1, 1, 1])

int res = 0;
int nop = 0;
NpyIter_IterNextFunc *iternext;
char * err_msg = NULL;
Member

Suggested change
char * err_msg = NULL;

fast_path = 0;
}
}
if (PyArray_DESCR(op1_array) != operation_descrs[0]) {
Member

This can be done more precisely, but it is a bit tedious and I think it should be done together with removing the "nditer for casting" hack from the casting path.

@@ -6309,29 +6462,15 @@ ufunc_at(PyUFuncObject *ufunc, PyObject *args)
* (e.g. `power` released the GIL but manually set an Exception).
*/
if (res != 0 || PyErr_Occurred()) {
Member

We shouldn't need the PyErr_Occurred() anymore, but we can follow up on that.

@@ -537,7 +538,7 @@ NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@kind@)
*((@type@ *)iop1) = io1;
#endif
}
- else if (!run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
+ else if (dimensions[0] < @count@ || !run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
Member

I am happy to add this, if @seiko2plus doesn't like it, we can follow up also.

@mattip
Member Author

mattip commented Jan 3, 2023

Thanks for the careful review. I made changes as suggested.

Member

@seberg seberg left a comment

Two more small things I noticed, but I am happy to make these follow-ups. It would be great to improve test coverage a bit more as well, but I am happy to follow up.

Did you have a check for output dtype matching? E.g.:

arr = np.arange(3)
np.equal.at(arr, [0, 0], [0, 1])
assert arr[0] == 1

which is nonsense, but... (this test was failing already)
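
Spelled out step by step, the snippet above behaves as follows. This is a small sketch that assumes a NumPy build in which the output of the boolean loop is cast back to the integer array.

```python
import numpy as np

arr = np.arange(3)
np.equal.at(arr, [0, 0], [0, 1])
# step 1: equal(arr[0]=0, 0) is True, stored back into the int array as 1
# step 2: equal(arr[0]=1, 1) is True, stored back as 1 again
result = arr.tolist()
```

The loop produces booleans while the array holds integers, so the result has to go through a cast on the way back; only index 0 is touched and the other elements keep their values.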

The new tests look great, thanks! I think a few more (and maybe splitting it up a bit) would be awesome, but that also seems like better for a follow-up.

@mattip
Member Author

mattip commented Jan 3, 2023

For once the travis error looks legitimate:

# Test boolean indexing and boolean ufuncs
a = np.arange(10)
index = a % 2 == 0
np.equal.at(a, index, [0, 2, 4, 6, 8])
>       assert_equal(a, [1, 1, 1, 3, 1, 5, 1, 7, 1, 9])
a = array([72057594037927936,                 1, 72057594037927938,
                           3, 72057594037927940,                 5,
           72057594037927942,                 7, 72057594037927944,
                           9])

This is on s390x. Is the byte order different there, so that the output is being set to 1 in the wrong byte order?

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg
Member

seberg commented Jan 3, 2023

@mattip ohh! that was the test I was just asking about. The problem is probably just that we need to use the casting path here and fail to do so. We are missing the check that the output also doesn't need casts.
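
The odd values in the failure are at least consistent with a one-byte boolean result landing in the first byte of each 8-byte big-endian integer. The following sketch reproduces the numbers under that assumption; it is a byte-level model, not the actual code path.

```python
import numpy as np

a = np.arange(10, dtype=">i8")          # explicit big-endian int64
raw = a.view(np.uint8).reshape(10, 8)   # byte-level view of the same memory
raw[::2, 0] = 1                         # one-byte True at offset 0 of each even element
values = a.tolist()
```

On a big-endian layout the first byte is the most significant one, so the write adds 2**56 = 72057594037927936 to each even element. On little-endian machines the same write would hit the least significant byte, which would explain why the problem only showed up on this platform.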


# reset a
Member

Might be nice to follow up and split this test up. It now runs all of the rest twice with the same values, since it discards the parametrization.

@seberg
Member

seberg commented Jan 3, 2023

Let's just put this in and follow up after that. Thanks for finally starting to clean up (and speed up) these code paths! Some notes on what would be nice as follow-ups:

  • Splitting up the large test would be nice. ✔️
  • Would be awesome to improve code coverage for some paths even more. They are just moved from before and never had any...
  • The checks for whether casting is necessary are stricter than they need to be; it might be nice to make them precise. However, that is probably related to...
  • ... the casting branch might be nice to speed up a bit. Honestly, I am happy if this would mainly remove the "abuse NpyIter" for casting and not even speed things up much overall. At least casting the main array is super niche. Casting the other array could probably be done by using the private MapIter API.

For larger changes, there are basically two angles (the last two points): first, improving the casting in general, and second, making use of the private MapIter API to handle the additional array and possibly "unroll" the inner-loop.

@seberg seberg merged commit ba89ef9 into numpy:main Jan 3, 2023
@seberg
Member

seberg commented Jan 3, 2023

We also should do a valgrind/refcount leak check eventually to test the cleanup paths more carefully.

@nschloe
Contributor

nschloe commented Jan 3, 2023

I've updated the analysis at #5922 (comment). bincount is now "only" 8 times faster than np.add.at (as opposed to 30 to 40 times before).
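
For anyone who wants to rerun the comparison, here is a self-contained sketch of the two computations being timed. The sizes follow the test code quoted earlier in the thread; the seeded generator and the `minlength` argument are my own additions so that the snippet is deterministic and always produces 1000 bins.

```python
import numpy as np

rng = np.random.default_rng(0)
test_idx = rng.integers(0, 1000, size=100_000)
test_vals = rng.random(100_000) * 100 - 50

# Unbuffered accumulation through ufunc.at
res_at = np.zeros(1000)
np.add.at(res_at, test_idx, test_vals)

# Same accumulation through bincount
res_bincount = np.zeros(1000)
res_bincount += np.bincount(test_idx, weights=test_vals, minlength=1000)

same = np.allclose(res_at, res_bincount)
```

Both approaches give the same sums up to floating-point rounding, so wrapping each in `%timeit` compares like with like.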
