ENH: add inner loop functions for indexed loops and use them in ufunc_at #89

Closed
wants to merge 12 commits into from

Conversation

mattip (Owner) commented Jan 31, 2023

  • Create inner loop functions in which the second argument is an index into the first argument.
  • Add a slot for these loops to ArrayMethodObject.
  • Add code to parse a new kwarg "indexed" for the Ufunc class in numpy/core/code_generators/generate_umath.py, which will generate C code to add the loops. This code uses get_info_no_cast to find the correct ArrayMethodObject in the ufunc._loop info list.
  • Thread a new try_trivial_at_loop into the logic of ufunc_at to call the new loops.
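
To make the first bullet concrete, here is a minimal pure-Python sketch of what an indexed loop computes for an in-place add. It is only a model of the semantics, not the C inner loop from this PR; the name `add_indexed` and its list-based signature are hypothetical.

```python
def add_indexed(target, indices, values):
    """Model of an indexed add loop: target[indices[i]] += values[i].

    The second argument indexes into the first, and the update is applied
    once per (index, value) pair, so repeated indices accumulate.
    This is a hypothetical helper, not a NumPy function.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    for idx, val in zip(indices, values):
        target[idx] += val
    return target


acc = [0.0, 0.0, 0.0, 0.0]
# Index 1 appears twice, so both 2.0 and 3.0 are added to acc[1].
add_indexed(acc, [0, 1, 1, 3], [1.0, 2.0, 3.0, 4.0])
```

This is the behaviour `np.add.at(a, idx, vals)` provides: unlike `a[idx] += vals`, repeated indices are accumulated rather than applied once.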

The good news is that ufunc_at can be up to 6.5x faster (measured with the benchmark added in numpy#22889). The bad news is that other benchmarks got slower, and I am not sure why. Maybe the added field made the ufunc too big?

Benchmarks vs. main
       before           after         ratio
     [c662a712]       [eb21b250]
     <main>           <indexed-loopos>
+      68.0±0.4μs          126±7μs     1.85  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '<')
+      66.6±0.6μs        117±0.5μs     1.75  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '<=')
+       134±0.5μs        234±0.7μs     1.75  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '<=')
+        67.8±1μs        118±0.5μs     1.74  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '<=')
+      66.7±0.7μs          111±1μs     1.66  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '>')
+      33.5±0.4μs         54.1±1μs     1.61  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '<')
+       133±0.7μs          215±7μs     1.61  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '>')
+      34.1±0.2μs         54.6±1μs     1.60  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '>')
+         131±3μs        210±0.8μs     1.60  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '<')
+      66.2±0.3μs        106±0.1μs     1.59  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '<')
+      68.2±0.6μs          108±3μs     1.59  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '>')
+     1.18±0.01μs       1.71±0.1μs     1.45  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '<')
+         532±4μs         720±10μs     1.35  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 1, 1, 'e')
+         532±2μs         716±50μs     1.35  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 2, 'e')
+        1.96±0ms       2.64±0.6ms     1.35  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 2, 4, 'f')
+     4.58±0.03μs      6.16±0.01μs     1.34  bench_function_base.Sort.time_sort('merge', 'int32', ('ordered',))
+        26.1±2μs       35.0±0.8μs     1.34  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 1000))
+         516±1μs         689±50μs     1.34  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 4, 'e')
+     4.57±0.02μs      6.10±0.03μs     1.33  bench_function_base.Sort.time_sort('merge', 'int32', ('uniform',))
+         531±5μs         699±60μs     1.32  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 4, 'e')
+         680±2ns         891±20ns     1.31  bench_itemselection.PutMask.time_sparse(True, 'int64')
+     1.21±0.01μs      1.58±0.03μs     1.31  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '>')
+         530±2μs         692±40μs     1.31  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 4, 'e')
+        663±20ns          867±2ns     1.31  bench_itemselection.PutMask.time_sparse(True, 'float64')
+         530±2μs         690±10μs     1.30  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 4, 1, 'e')
+         531±3μs         685±90μs     1.29  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 2, 'e')
+         678±5ns          873±3ns     1.29  bench_itemselection.PutMask.time_sparse(True, 'complex64')
+       551±0.5μs         708±10μs     1.28  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 1, 1, 'e')
+         527±9μs         670±50μs     1.27  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 1, 2, 'e')
+         535±4μs         674±60μs     1.26  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 4, 'e')
+         537±4μs         670±10μs     1.25  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 1, 'e')
+         554±4μs         682±30μs     1.23  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'e')
+        856±20ns         1.04±0μs     1.22  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '<')
+         918±8ns         1.12±0μs     1.22  bench_itemselection.PutMask.time_dense(False, 'float16')
+         872±5ns      1.06±0.03μs     1.22  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '>')
+         660±1μs         800±20μs     1.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 4, 1, 'e')
+         660±2μs         800±40μs     1.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 2, 2, 'e')
+        91.6±2μs        111±0.1μs     1.21  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 10))
+         661±3μs         798±10μs     1.21  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 1, 2, 'e')
+        52.7±5μs       63.4±0.9μs     1.20  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 100))
+      76.5±0.2μs       92.0±0.6μs     1.20  bench_core.UnpackBits.time_unpackbits_axis1
+         917±3ns         1.10±0μs     1.20  bench_itemselection.PutMask.time_dense(False, 'int16')
+      39.4±0.7μs       47.3±0.1μs     1.20  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 1, 'e')
+         666±5μs         798±10μs     1.20  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 1, 4, 'e')
+      28.5±0.2μs         34.1±2μs     1.19  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '<')
+         660±1μs         788±30μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 4, 2, 'e')
+         662±3μs         788±30μs     1.19  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 4, 4, 'e')
+      40.3±0.1μs      47.6±0.07μs     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 4, 1, 'e')
+     4.41±0.01μs      5.21±0.05μs     1.18  bench_core.UnpackBits.time_unpackbits
+     10.7±0.07μs       12.6±0.3μs     1.18  bench_function_base.Sort.time_argsort('merge', 'float32', ('reversed',))
+        650±10μs         764±50μs     1.18  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 1, 1, 'e')
+     40.3±0.04μs       47.2±0.2μs     1.17  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 2, 'e')
+     3.04±0.02μs      3.55±0.01μs     1.17  bench_itemselection.PutMask.time_dense(True, 'complex256')
+      39.5±0.5μs       46.0±0.7μs     1.17  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 1, 'e')
+        2.62±0ms      3.03±0.06ms     1.16  bench_lib.Pad.time_pad((256, 128, 1), 8, 'linear_ramp')
+      41.3±0.2μs       47.7±0.2μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 4, 4, 'e')
+         660±4μs         762±40μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log2'>, 2, 4, 'e')
+      40.6±0.2μs         46.6±1μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 4, 2, 'e')
+      40.6±0.3μs       46.4±0.9μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 2, 'e')
+        50.1±1μs       57.2±0.2μs     1.14  bench_function_base.Sort.time_argsort('quick', 'int32', ('ordered',))
+      41.2±0.6μs       46.9±0.7μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 4, 'e')
+        73.7±2μs       83.3±0.1μs     1.13  bench_reduce.AddReduceSeparate.time_reduce(0, 'float32')
+      41.5±0.3μs       46.8±0.7μs     1.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 4, 'e')
+         797±6ns          896±9ns     1.12  bench_itemselection.PutMask.time_sparse(False, 'int16')
+      32.2±0.4μs       36.2±0.9μs     1.12  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '<')
+        82.7±2μs         92.8±2μs     1.12  bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 10))
+         208±4μs          233±2μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 1, 1, 'f')
+       210±0.5μs          235±4μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 2, 4, 'f')
+         263±9μs          293±5μs     1.12  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))
+     4.50±0.09μs       5.02±0.1μs     1.11  bench_indexing.ScalarIndexing.time_assign(0)
+       210±0.4μs        234±0.8μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 4, 1, 'f')
+         110±3μs          122±2μs     1.11  bench_function_base.Sort.time_argsort('merge', 'float32', ('sorted_block', 10))
+         214±2μs          237±2μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 4, 4, 'f')
+         210±2μs          233±3μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 2, 1, 'f')
+       212±0.5μs          235±1μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 2, 2, 'f')
+         797±3ns          881±2ns     1.11  bench_itemselection.PutMask.time_dense(True, 'int64')
+      23.5±0.1μs       26.0±0.6μs     1.11  bench_function_base.Sort.time_argsort('heap', 'int32', ('uniform',))
+         795±6ns          877±3ns     1.10  bench_itemselection.PutMask.time_sparse(False, 'float16')
+      16.8±0.1μs      18.5±0.06μs     1.10  bench_ufunc.CustomInplace.time_float_add
+      56.4±0.4μs       62.1±0.6μs     1.10  bench_function_base.Sort.time_argsort('quick', 
-         285±1μs          257±1μs     0.90  bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))
-      48.2±0.3μs       43.5±0.3μs     0.90  bench_function_base.Sort.time_argsort('quick', 'int16', ('uniform',))
-       339±0.8μs          306±1μs     0.90  bench_function_base.Sort.time_sort('quick', 'int16', ('random',))
-     3.21±0.01μs      2.90±0.07μs     0.90  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'longfloat')
-        706±40μs          637±3μs     0.90  bench_random.Random.time_rng('poisson 10')
-         388±6μs          350±3μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 1, 1, 'e')
-        76.5±1μs       69.0±0.7μs     0.90  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '<=')
-     20.2±0.08μs      18.2±0.06μs     0.90  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '<=')
-         512±2μs          462±9μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 2, 'e')
-         517±2μs          466±9μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 4, 'e')
-        31.7±1μs       28.6±0.3μs     0.90  bench_core.Core.time_array_float_l1000_dtype
-      9.44±0.6ms      8.49±0.07ms     0.90  bench_core.CorrConv.time_correlate(100000, 1000, 'full')
-      38.4±0.3μs       34.5±0.1μs     0.90  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '<=')
-         511±4μs          460±8μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 1, 4, 'e')
-        38.3±1μs       34.5±0.3μs     0.90  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '>=')
-      9.33±0.6ms      8.39±0.05ms     0.90  bench_core.CorrConv.time_correlate(100000, 1000, 'valid')
-     1.35±0.01μs      1.22±0.03μs     0.90  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'float64')
-         510±3μs         458±10μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 2, 'e')
-      57.2±0.2μs       51.4±0.2μs     0.90  bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))
-     1.35±0.01μs      1.21±0.01μs     0.90  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'float32')
-         511±3μs          458±8μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 1, 1, 'e')
-         445±2μs        400±0.8μs     0.90  bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))
-        416±20ns          373±2ns     0.90  bench_scalar.ScalarMath.time_abs('float32')
-      3.42±0.2μs      3.07±0.02μs     0.90  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'complex256')
-      69.2±0.7μs       62.0±0.7μs     0.90  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '<=')
-         370±6μs          331±8μs     0.90  bench_linalg.Eindot.time_einsum_i_ij_j
-        889±20ns          795±8ns     0.90  bench_itemselection.PutMask.time_sparse(False, 'float64')
-         371±3μs          332±2μs     0.90  bench_function_base.Sort.time_sort('merge', 'uint32', ('random',))
-         514±4μs          459±9μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 4, 'e')
-      18.5±0.1μs       16.5±0.2μs     0.89  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '<=')
-     1.13±0.01μs         1.01±0μs     0.89  bench_itemselection.Take.time_contiguous((1000, 1), 'wrap', 'float32')
-         905±1ns         808±10ns     0.89  bench_itemselection.PutMask.time_sparse(False, 'int64')
-      4.73±0.4ms      4.22±0.04ms     0.89  bench_ufunc.UFunc.time_ufunc_types('exp')
-      5.90±0.2μs      5.25±0.02μs     0.89  bench_linalg.Linalg.time_op('norm', 'int32')
-        99.4±2μs       88.3±0.4μs     0.89  bench_function_base.Bincount.time_bincount
-       186±0.8μs        165±0.7μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 4, 'e')
-     2.10±0.04μs         1.86±0μs     0.89  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'float64')
-       198±0.3ms        176±0.4ms     0.89  bench_function_base.Histogram2D.time_fine_binning
-     1.97±0.01μs      1.74±0.01μs     0.89  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'str'>)
-       111±0.4μs       97.9±0.3μs     0.89  bench_function_base.Sort.time_sort('merge', 'float64', ('sorted_block', 10))
-         183±3μs          162±3μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 4, 'e')
-        606±20μs          536±3μs     0.88  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int16'>)
-       395±0.6μs          349±2μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 1, 'e')
-        528±20ns          467±9ns     0.88  bench_scalar.ScalarMath.time_addition('float16')
-      24.5±0.9μs       21.6±0.3μs     0.88  bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 1000))
-      54.1±0.1μs       47.7±0.4μs     0.88  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'str'>)
-     1.37±0.02μs      1.21±0.03μs     0.88  bench_itemselection.Take.time_contiguous((1000, 1), 'clip', 'int64')
-     1.59±0.01μs      1.40±0.01μs     0.88  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'int32')
-     1.37±0.02μs      1.20±0.01μs     0.88  bench_itemselection.Take.time_contiguous((1000, 2), 'clip', 'int32')
-      46.8±0.7μs       41.1±0.7μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 2, 4, 'e')
-       186±0.8μs        163±0.5μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 2, 'e')
-     45.3±0.09μs       39.7±0.6μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 2, 1, 'e')
-       186±0.8μs          163±1μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 2, 'e')
-         904±3ns          791±4ns     0.87  bench_itemselection.PutMask.time_dense(True, 'int16')
-         904±3ns          790±9ns     0.87  bench_itemselection.PutMask.time_sparse(False, 'int32')
-       108±0.7μs       93.9±0.5μs     0.87  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'str'>)
-       185±0.9μs          161±3μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 4, 'e')
-        1.01±0ms         875±40μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 1, 4, 'e')
-     2.15±0.02μs      1.86±0.01μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'int64')
-        620±20μs         539±30μs     0.87  bench_reduce.AddReduceSeparate.time_reduce(0, 'complex128')
-     1.60±0.01μs      1.39±0.01μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'wrap', 'float32')
-        46.8±1μs       40.6±0.7μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 4, 4, 'e')
-         186±1μs          161±3μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 1, 'e')
-     2.14±0.01μs      1.85±0.01μs     0.87  bench_itemselection.Take.time_contiguous((2, 1000, 1), 'clip', 'complex64')
-         185±1μs          160±3μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 1, 'e')
-       385±0.9μs          333±1μs     0.87  bench_function_base.Sort.time_sort('merge', 'int64', ('random',))
-      91.2±0.3μs       78.9±0.6μs     0.87  bench_ufunc_strides.AVX_ldexp.time_ufunc('e', 2)
-      47.2±0.1μs       40.8±0.5μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 4, 'e')
-     1.15±0.01ms          993±6μs     0.87  bench_lib.Nan.time_nanvar(200000, 90.0)
-       185±0.3μs          160±2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 1, 'e')
-      47.4±0.3μs       40.9±0.2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 4, 2, 'e')
-         896±3ns          773±4ns     0.86  bench_itemselection.PutMask.time_sparse(False, 'float32')
-     47.0±0.07μs       40.6±0.6μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 4, 'e')
-      47.1±0.1μs       40.6±0.1μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 2, 2, 'e')
-       185±0.3μs          159±3μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 2, 'e')
-      34.6±0.6μs      29.7±0.08μs     0.86  bench_function_base.Sort.time_argsort('heap', 'float32', ('uniform',))
-         103±3μs         88.7±2μs     0.86  bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))
-      47.8±0.2μs       41.1±0.9μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 4, 'e')
-         901±5ns          774±5ns     0.86  bench_itemselection.PutMask.time_dense(True, 'float16')
-       104±0.3μs         89.0±3μs     0.86  bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))
-      46.7±0.4μs       40.0±0.9μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 2, 2, 'e')
-      47.7±0.2μs       40.9±0.4μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 2, 4, 'e')
-      47.1±0.2μs       40.3±0.5μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 2, 'e')
-       106±0.4μs         91.2±1μs     0.86  bench_function_base.Sort.time_argsort('quick', 'float64', ('reversed',))
-       100.0±1μs       85.5±0.1μs     0.86  bench_function_base.Sort.time_argsort('quick', 'int64', ('reversed',))
-      47.2±0.2μs       40.3±0.6μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 2, 'e')
-     1.16±0.01ms          987±3μs     0.85  bench_lib.Nan.time_nanstd(200000, 90.0)
-        1.02±0ms         862±10μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 1, 2, 'e')
-         907±5ns          770±4ns     0.85  bench_itemselection.PutMask.time_sparse(False, 'complex64')
-      47.4±0.3μs       40.2±0.4μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 2, 'e')
-      47.1±0.1μs       39.8±0.5μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 2, 1, 'e')
-      47.4±0.5μs       39.9±0.5μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 4, 1, 'e')
-      59.3±0.2μs       49.9±0.1μs     0.84  bench_function_base.Sort.time_argsort('quick', 'int16', ('ordered',))
-     26.8±0.08μs       22.5±0.2μs     0.84  bench_function_base.Sort.time_argsort('heap', 'uint32', ('uniform',))
-      47.0±0.2μs         39.5±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 1, 'e')
-     1.33±0.01ms      1.12±0.03ms     0.84  bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float32'>)
-      47.2±0.1μs         39.7±1μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 1, 'e')
-     63.4±0.08μs       53.3±0.1μs     0.84  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
-      47.0±0.3μs       39.4±0.6μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 1, 'e')
-     18.5±0.03μs       15.5±0.2μs     0.84  bench_ufunc.CustomInplace.time_double_add
-     8.24±0.03μs       6.90±0.2μs     0.84  bench_function_base.Sort.time_argsort('merge', 'uint32', ('uniform',))
-         116±2μs       96.4±0.3μs     0.83  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 10))
-         761±3μs         634±80μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 1, 1, 'e')
-      47.4±0.1μs       39.4±0.2μs     0.83  bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))
-         595±9μs          494±3μs     0.83  bench_lib.Nan.time_nanvar(200000, 2.0)
-       101±0.3μs       83.7±0.8μs     0.83  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 10))
-         758±1μs         626±80μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 1, 'e')
-        13.7±1μs       11.2±0.3μs     0.82  bench_function_base.Where.time_2_broadcast
-         609±2μs          498±6μs     0.82  bench_lib.Nan.time_nanstd(200000, 2.0)
-     1.05±0.02μs          856±9ns     0.81  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
-     8.27±0.01μs      6.72±0.02μs     0.81  bench_function_base.Sort.time_argsort('merge', 'uint32', ('ordered',))
-      37.1±0.4μs       30.1±0.2μs     0.81  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 1000))
-     1.16±0.01μs         933±20ns     0.81  bench_itemselection.PutMask.time_dense(False, 'int64')
-         489±3μs          393±5μs     0.80  bench_lib.Nan.time_nanstd(200000, 0.1)
-      94.7±0.9μs       76.1±0.3μs     0.80  bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 10))
-      13.3±0.1μs      10.7±0.06μs     0.80  bench_function_base.Where.time_2
-         783±4μs         627±80μs     0.80  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'radians'>, 4, 1, 'e')
-         468±4μs          373±2μs     0.80  bench_lib.Nan.time_nanvar(200000, 0)
-         481±1μs          380±7μs     0.79  bench_lib.Nan.time_nanstd(200000, 0)
-        1.15±0μs          906±6ns     0.79  bench_itemselection.PutMask.time_dense(False, 'complex64')
-        69.6±4μs         54.9±1μs     0.79  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 100))
-     8.76±0.04μs       6.89±0.2μs     0.79  bench_function_base.Sort.time_sort('merge', 'uint32', ('reversed',))
-        1.16±0μs          912±4ns     0.79  bench_itemselection.PutMask.time_dense(False, 'float64')
-      94.9±0.6μs         74.3±1μs     0.78  bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',))
-      11.9±0.2μs       9.31±0.2μs     0.78  bench_function_base.Sort.time_sort('merge', 'float64', ('reversed',))
-         488±1μs          380±3μs     0.78  bench_lib.Nan.time_nanvar(200000, 0.1)
-      9.64±0.2μs       7.50±0.1μs     0.78  bench_function_base.Sort.time_sort('merge', 'int64', ('reversed',))
-     1.11±0.01μs         865±10ns     0.78  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '!=')
-     9.43±0.09μs      7.26±0.03μs     0.77  bench_linalg.Linalg.time_op('norm', 'int64')
-        1.64±0μs         1.25±0μs     0.76  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
-        1.11±0μs          841±2ns     0.76  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '==')
-        53.9±5μs       40.8±0.8μs     0.76  bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))
-         120±1μs         89.9±2μs     0.75  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
-         782±6μs         584±30μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'deg2rad'>, 4, 4, 'e')
-         901±2ns          668±9ns     0.74  bench_itemselection.PutMask.time_sparse(True, 'int16')
-     1.64±0.01μs      1.21±0.01μs     0.74  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '!=')
-     2.25±0.03μs      1.65±0.01μs     0.73  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
-         902±7ns         659±10ns     0.73  bench_itemselection.PutMask.time_sparse(True, 'float16')
-      1.20±0.1μs         864±20ns     0.72  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '>=')
-     1.63±0.01μs         1.16±0μs     0.71  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '>=')
-     6.96±0.07μs      4.83±0.05μs     0.69  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 1, 1, 'e')
-      58.6±0.2μs       40.1±0.3μs     0.68  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
-       175±0.4μs        119±0.9μs     0.68  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
-        751±10μs        511±100μs     0.68  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 2, 'e')
-        1.71±0μs         1.16±0μs     0.68  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '==')
-         135±1μs         91.1±2μs     0.67  bench_function_base.Mean.time_mean_axis(100000)
-       118±0.8μs       79.4±0.5μs     0.67  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
-       131±0.3μs       86.8±0.6μs     0.66  bench_function_base.Mean.time_mean(100000)
-      46.1±0.8μs      30.2±0.09μs     0.66  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 2, 'e')
-       371±0.5μs          243±6μs     0.65  bench_core.PackBits.time_packbits_axis0(<class 'bool'>)
-      46.9±0.2μs       30.3±0.5μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 4, 'e')
-        46.0±1μs       29.6±0.6μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 1, 'e')
-      47.0±0.3μs       30.2±0.5μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 2, 'e')
-        46.1±1μs       29.7±0.5μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 2, 'e')
-      46.9±0.1μs      30.0±0.07μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 1, 'e')
-      47.1±0.2μs       30.0±0.5μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 4, 'e')
-      47.4±0.2μs       30.2±0.6μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 4, 4, 'e')
-      46.9±0.2μs       29.5±0.6μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 2, 1, 'e')
-       111±0.2μs         66.7±1μs     0.60  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '!=')
-         108±2μs       65.0±0.1μs     0.60  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '>=')
-      55.6±0.1μs       32.9±0.2μs     0.59  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '!=')
-         217±1μs        128±0.8μs     0.59  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '!=')
-       109±0.4μs      64.3±0.07μs     0.59  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '!=')
-        57.4±3μs       33.6±0.8μs     0.59  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '>=')
-         113±6μs       64.2±0.4μs     0.57  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '>=')
-        227±10μs        127±0.3μs     0.56  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '>=')
-     3.41±0.01μs      1.90±0.02μs     0.56  bench_io.Copy.time_cont_assign('float32')
-      61.9±0.2μs       33.6±0.5μs     0.54  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '==')
-         123±1μs       65.8±0.3μs     0.54  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '==')
-       244±0.8μs          130±3μs     0.53  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '==')
-       123±0.4μs       64.1±0.2μs     0.52  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '==')
-     8.93±0.04ms      1.17±0.01ms     0.13  bench_ufunc.At.time_sum_at

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
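
As a reading aid for the table above, the sketch below assumes the ratio column is the "after" time divided by the "before" time, which is consistent with the rows shown. The helper names `asv_ratio` and `speedup` are hypothetical and not part of asv.

```python
def asv_ratio(before, after):
    """Ratio as shown in the table: values above 1 mean the branch is slower."""
    return after / before


def speedup(before, after):
    """How many times faster the branch is: the inverse of the ratio."""
    return before / after


# bench_ufunc.At.time_sum_at: 8.93 ms before, 1.17 ms after
at_ratio = asv_ratio(8.93, 1.17)
at_speedup = speedup(8.93, 1.17)

# First StringComparisons row: 68.0 us before, 126 us after
str_ratio = asv_ratio(68.0, 126.0)
```

Rounded to two decimals, these reproduce the 0.13 and 1.85 entries in the ratio column.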

mattip (Owner, Author) commented Feb 6, 2023

Closing in favor of the PR opened on numpy/numpy

@mattip mattip closed this Feb 6, 2023
mattip pushed a commit that referenced this pull request Nov 15, 2023
Merge in ~STEPAN.SINDELAR_ORACLE.COM/numpy-hpy from mq/GR-40889 to labs-hpy-port

* commit '23216016a53c3788a1f4b7284ca1fa030dd1587b':
  address comments
  address comments
  hacky bug fix (revert me)
  bug fix
  bug fixes and hpy abort clean up
  port PyArray_GenericReduceFunction to HPy
  bug fix
  use free instead of PyMem_Free
  clean up
  port PyArray_EnsureXXArray to HPy
  missing PyArray_PyXXXAbstractDType initializations
  HPyArray_MaskedStridedUnaryOp require hpy ctx