MAINT: Optimize numpy.count_nonzero for int types using SIMD operations #18183

touqir14 · 2021-01-18T12:23:15Z

As pointed out in this issue, numpy.count_nonzero, numpy.nonzero, and numpy.flatnonzero are rather slow and could use some optimization. This PR optimizes numpy.count_nonzero for signed and unsigned 8-, 16-, 32-, and 64-bit integers using SIMD operations. This in turn speeds up numpy.flatnonzero, numpy.nonzero, and several other functions that depend on numpy.count_nonzero.
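A minimal sketch of the relationship described above, assuming only the public NumPy API: the number of indices returned by numpy.nonzero and numpy.flatnonzero equals the value returned by numpy.count_nonzero. The array `x` and the variable names are illustrative and not taken from the PR.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10**5, dtype=np.int32)

# The nonzero count determines how many indices nonzero/flatnonzero return.
count = int(np.count_nonzero(x))
flat_idx = np.flatnonzero(x)
(nz_idx,) = np.nonzero(x)  # a 1D input yields a one-element tuple

print(count, flat_idx.size, nz_idx.size)
```

All three printed numbers are the same, which is why a faster count can also benefit the functions that use it to size their output.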

The benchmarks below show the speed improvements for the integer types with AVX2.

import numpy as np

np.random.seed(0)
x = np.random.randint(0,2, size=10**6, dtype=np.int8)
%timeit -n 1000 -r 30 np.count_nonzero(x)
# With SIMD optimization: 34.9 µs ± 1.11 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)
# Without SIMD optimization: 1.96 ms ± 86 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)

import numpy as np
np.random.seed(0)
x = np.random.randint(0,2, size=10**6, dtype=np.int16)
%timeit -n 1000 -r 30 np.count_nonzero(x)
# With SIMD optimization: 55.9 µs ± 1.81 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)
# Without SIMD optimization: 1.94 ms ± 64.7 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)

import numpy as np
np.random.seed(0)
x = np.random.randint(0,2, size=10**6, dtype=np.int32)
%timeit -n 1000 -r 30 np.count_nonzero(x)
# With SIMD optimization: 138 µs ± 4.75 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)
# Without SIMD optimization: 1.93 ms ± 31.3 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)

import numpy as np
np.random.seed(0)
x = np.random.randint(0,2, size=10**6, dtype=np.int64)
%timeit -n 1000 -r 30 np.count_nonzero(x)
# With SIMD optimization: 387 µs ± 5.76 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)
# Without SIMD optimization: 2.02 ms ± 18.8 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)

I have added a few test cases for each of the integer types. Let me know if more are required.
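The tests added in the PR are not shown in this thread. As a hedged illustration of what such a test could look like, the sketch below compares numpy.count_nonzero with a plain-Python reference for every signed and unsigned integer width. The helper name `check_count_nonzero`, the sizes, and the value mix are assumptions chosen for illustration.

```python
import numpy as np

INT_TYPES = (np.int8, np.int16, np.int32, np.int64,
             np.uint8, np.uint16, np.uint32, np.uint64)

def check_count_nonzero(dtype, size, seed=0):
    """Return (numpy result, pure-Python reference) for one dtype and size."""
    rng = np.random.default_rng(seed)
    info = np.iinfo(dtype)
    # Mix zeros with the extreme values of the dtype and a small value.
    values = np.array([0, 0, info.min, info.max, 1], dtype=dtype)
    x = rng.choice(values, size=size)
    expected = sum(1 for v in x.tolist() if v != 0)
    return int(np.count_nonzero(x)), expected

# Sizes that are not multiples of a typical vector width exercise tail handling.
for dt in INT_TYPES:
    for size in (1, 17, 1000):
        got, expected = check_count_nonzero(dt, size)
        assert got == expected, (dt, size)
```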

numpy/core/src/multiarray/item_selection.c

numpy/core/tests/test_numeric.py

touqir14 · 2021-01-19T10:28:59Z

@Qiyu8, I have replaced the MIN/MAX macros, moved the NPY_SIMD guard to the proper place, merged the count_nonzero_int16/32/64 functions into a single function, and added benchmarks for the 4 int types.

Qiyu8 · 2021-01-21T03:05:50Z

numpy/core/src/multiarray/item_selection.c

+        vsum64 = npyv_add_u64(vsum64, vt);
+    }
+
+    npy_uint64 sums[npyv_nlanes_u64];


You can use the new acceleration intrinsics once #18200 is merged.

touqir14 · 2021-02-05

@Qiyu8, I have replaced the manual sums with horizontal SIMD sums.

Qiyu8 · 2021-02-07

Well done. The replaced part looks good to me. Now you need to focus on fixing the CI failures and providing ASV benchmark results.
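For readers unfamiliar with the pattern discussed in this review thread, here is a conceptual NumPy emulation of lane-wise accumulation followed by a single horizontal sum. It is not the C code from the PR; the function name `count_nonzero_lanes` and the lane count of 4 are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def count_nonzero_lanes(x, nlanes=4):
    """Emulate per-lane SIMD counting followed by one horizontal sum."""
    x = np.ascontiguousarray(x).ravel()
    body_len = (x.size // nlanes) * nlanes
    # Each row plays the role of one vector load of `nlanes` elements.
    vectors = x[:body_len].reshape(-1, nlanes)
    # Per-lane accumulators: one running count per lane.
    lane_sums = (vectors != 0).sum(axis=0, dtype=np.uint64)
    # Horizontal sum: reduce all lanes to a single scalar.
    total = int(lane_sums.sum())
    # Scalar loop over the leftover tail that does not fill a whole vector.
    total += sum(1 for v in x[body_len:].tolist() if v != 0)
    return total

x = np.array([0, 1, 2, 0, 3, 0, 0, 4, 5, 0, 6], dtype=np.int64)
print(count_nonzero_lanes(x), np.count_nonzero(x))
```

The horizontal sum replaces the step of storing the vector into a temporary array (such as `sums[npyv_nlanes_u64]` in the diff above) and adding its elements one by one.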

gnool · 2021-01-22T05:53:04Z

Sorry to hijack this thread, but on the related topic of nonzero(): is there a reason why calling nonzero on a 1D array is orders of magnitude faster than calling it on a multi-dimensional array?

For example, calling it on a Boolean array of shape (1000000,) takes ~40 µs, while it takes 1400 µs for an array of shape (1000,1000). Both arrays hold identical values and differ only in shape.

Any idea what causes this significant overhead?

touqir14 · 2021-01-22T06:37:49Z

@gnool, without further investigation, I speculate that the overhead comes from the use of an iterator and the calls to the get_multi_index function in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/item_selection.c#L2555-L2600 whenever a multidimensional array is passed. In contrast, when a one-dimensional array is passed, no iterator is used and the indices are computed with the npy_memchr function, which is quite fast: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/item_selection.c#L2485-L2520
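The comparison raised in this exchange can be reproduced with a short script. This is a sketch that assumes a random Boolean array stands in for the data described; the absolute timings depend on the machine and the NumPy version, so no expected numbers are given.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
flat = rng.random(10**6) > 0.5       # Boolean array of shape (1000000,)
square = flat.reshape(1000, 1000)    # same values, shape (1000, 1000)

idx_1d = np.nonzero(flat)
idx_2d = np.nonzero(square)
# Both calls locate the same elements: flattening the 2D indices gives the 1D ones.
same = np.array_equal(np.ravel_multi_index(idx_2d, square.shape), idx_1d[0])

repeats = 5
t_1d = timeit.timeit(lambda: np.nonzero(flat), number=repeats) / repeats
t_2d = timeit.timeit(lambda: np.nonzero(square), number=repeats) / repeats
print(f"same elements: {same}")
print(f"1D: {t_1d * 1e6:.0f} µs per call, 2D: {t_2d * 1e6:.0f} µs per call")
```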

tylerjereddy · 2021-01-29T00:18:25Z

I make heavy use of numpy.nonzero() on computational geometry work--will be great to see some potential speedups there.

touqir14 · 2021-02-05T11:00:35Z

@seiko2plus , @Qiyu8 I have pushed updates.

touqir14 · 2021-02-05T23:03:16Z

@tylerjereddy, can you point me to some of your use cases for nonzero? I think I might be able to speed it up for certain cases.

tylerjereddy · 2021-02-06T00:43:24Z

@touqir14 For example, finding the indices where an extremely large array of bools is True.

Maybe something like:

# test_array is a huge 1D array of np.float64
relevant_indices = np.nonzero(test_array > 0.5)

touqir14 · 2021-02-06T01:08:54Z

Yes, a speedup in this case is possible. I will push a commit implementing the optimization for special cases later today or tomorrow. @tylerjereddy

numpy/core/src/multiarray/item_selection.c

touqir14 · 2021-02-07T05:06:09Z

@Qiyu8, @seiko2plus, I also want to add optimizations to PyArray_Nonzero for some special input cases (the ndarray is aligned and C-contiguous). Should I just keep sending commits in this PR?

seiko2plus · 2021-02-07T14:52:03Z

@touqir14,

> Should I just keep sending commits in this PR?

Let's keep this pull request only for count_nonzero.

touqir14 · 2021-02-08T00:43:06Z

@seiko2plus, the overflow possibilities have been taken care of; please see my last commit to verify. Distinguishing dtypes by kind and elsize appears to have resolved the build (testing) issues on win32/64 and 32-bit Linux. One test still fails because of a NumPy versioning issue, which is unrelated to my commits.
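The two points in this comment can be illustrated in Python. The sketch below is an emulation under stated assumptions, not the PR's C code: `is_simd_candidate` mirrors the idea of dispatching on dtype kind and element size (`itemsize` at the Python level), and `count_nonzero_blocked` shows why narrow per-lane counters have to be flushed into a wider accumulator before they wrap around. Both function names, the lane count, and the block size are hypothetical.

```python
import numpy as np

def is_simd_candidate(dtype):
    """Dispatch on kind and element size rather than on a specific C type."""
    dtype = np.dtype(dtype)
    return dtype.kind in "iu" and dtype.itemsize in (1, 2, 4, 8)

def count_nonzero_blocked(x, nlanes=16, max_block=255):
    """Count nonzeros with uint8 lane counters, flushed before they overflow."""
    x = np.ascontiguousarray(x).ravel()
    body_len = (x.size // nlanes) * nlanes
    vectors = x[:body_len].reshape(-1, nlanes) != 0
    total = 0
    for start in range(0, vectors.shape[0], max_block):
        block = vectors[start:start + max_block]
        # With at most 255 rows per block, a uint8 lane counter cannot wrap.
        lane_counts = block.sum(axis=0, dtype=np.uint8)
        # Flush the narrow counters into a wide accumulator.
        total += int(lane_counts.sum(dtype=np.uint64))
    # Handle the tail that does not fill a whole vector.
    total += int(np.count_nonzero(x[body_len:]))
    return total

ones = np.ones(16 * 1000 + 3, dtype=np.int8)
print(count_nonzero_blocked(ones) == ones.size)                 # safe block size
print(count_nonzero_blocked(ones, max_block=256) == ones.size)  # counters wrap
```

With a block of 256 rows, each uint8 lane counter wraps back to zero on an all-ones input, so the result is wrong. Checking the kind and item size also treats C types that share a width on a given platform identically, which is one plausible reason the platform-specific build problems went away.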

touqir14 · 2021-02-13T05:52:55Z

@seiko2plus , are we all good now? If so, please merge this PR.

seiko2plus · 2021-02-13T05:59:55Z

@touqir14, I made some changes to improve readability and reduce the amount of code; they should not affect performance.

> are we all good now? If so, please merge this PR.

Yes, I think it's good. I would prefer to wait one more day to give others a chance to look at the code.

mattip · 2021-02-14T07:43:07Z

It would be nice to see a report of the benchmark changes before and after this PR, to make sure we have not accidentally slowed down any cases (non-contiguous? F-order?).
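A quick correctness check across the memory layouts mentioned here could look like the following sketch. The layout names and the array `base` are illustrative assumptions; this verifies results only and is not a substitute for the ASV timing comparison requested.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.integers(0, 3, size=(300, 400), dtype=np.int16)

layouts = {
    "C-contiguous": base,
    "F-order": np.asfortranarray(base),
    "every 2nd column": base[:, ::2],   # non-contiguous view
    "transposed view": base.T,
    "reversed view": base[::-1, ::-1],  # negative strides
}

results = {}
for name, arr in layouts.items():
    expected = int((arr != 0).sum())    # reference that does not use count_nonzero
    results[name] = int(np.count_nonzero(arr))
    assert results[name] == expected, name
    print(f"{name:18s} C={arr.flags.c_contiguous!s:5s} "
          f"F={arr.flags.f_contiguous!s:5s} count={results[name]}")
```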

seiko2plus · 2021-02-15T08:28:51Z

@mattip,

> It would be nice to see a report of the benchmark changes before/after this PR

Performance has improved on all supported architectures; see the following benchmarks:

Power9/GCC 9.2.1(baseline VSX2)

python runtests.py -j8 --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5

   before           after        ratio                                            (numaxes   size  dtype)
 [7a18e4ac]       [85e2ce98]
 <master>         <count_nonzero>
   2.85±0μs         2.35±0μs     0.82  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int16'>)
2.76±0.01μs         2.38±0μs     0.86  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int32'>)
   2.77±0μs      2.40±0.01μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int64'>)
2.84±0.01μs      2.36±0.02μs     0.83  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int8'>)
 78.4±0.2μs         4.53±0μs     0.06  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
71.8±0.02μs      5.79±0.01μs     0.08  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
 72.7±0.1μs      8.63±0.01μs     0.12  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
78.3±0.02μs      3.67±0.03μs     0.05  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
   7.58±0ms        174±0.2μs     0.02  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
6.93±0.01ms        292±0.3μs     0.04  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
   7.00±0ms        585±0.3μs     0.08  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
 7.95±0.2ms      90.6±0.05μs     0.01  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
   3.60±0μs         2.37±0μs     0.66  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
3.47±0.01μs         2.41±0μs     0.69  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
   3.50±0μs         2.48±0μs     0.71  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.60±0.01μs      2.38±0.01μs     0.66  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
  154±0.1μs         6.26±0μs     0.04  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
 141±0.02μs      8.66±0.03μs     0.06  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
 142±0.04μs      14.4±0.01μs     0.10  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
 154±0.05μs      4.55±0.01μs     0.03  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
15.2±0.03ms        343±0.2μs     0.02  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
13.9±0.02ms        580±0.4μs     0.04  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
14.0±0.01ms      1.27±0.01ms     0.09  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
 15.3±0.1ms        177±0.1μs     0.01  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
4.38±0.01μs      2.39±0.01μs     0.54  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
4.18±0.02μs      2.46±0.02μs     0.59  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
4.21±0.01μs      2.57±0.01μs     0.61  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
   4.37±0μs         2.37±0μs     0.54  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
  230±0.2μs      7.99±0.01μs     0.03  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
 210±0.08μs       11.6±0.1μs     0.06  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
 212±0.02μs      20.3±0.05μs     0.10  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
  230±0.3μs      5.45±0.01μs     0.02  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
 22.9±0.5ms        513±0.4μs     0.02  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
20.8±0.01ms          909±4μs     0.04  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
21.0±0.01ms      2.09±0.03ms     0.10  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
22.8±0.03ms        263±0.1μs     0.01  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)

i7-8550U[low-power]/GCC 8.4.0(baseline AVX2)

python runtests.py -j8 --cpu-baseline="avx2" --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5

   before           after        ratio                                            (numaxes   size  dtype)
 [7a18e4ac]       [85e2ce98]
 <master>         <count_nonzero>
 31.7±0.01μs      4.90±0.04μs     0.15  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
 31.8±0.06μs       6.04±0.1μs     0.19  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
  32.0±0.2μs      8.26±0.09μs     0.26  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
  31.7±0.1μs      4.41±0.03μs     0.14  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
 2.86±0.01ms          121±1μs     0.04  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
 2.89±0.01ms          231±2μs     0.08  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
 3.04±0.02ms         565±10μs     0.19  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
 2.82±0.01ms      81.3±0.08μs     0.03  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
 3.55±0.01μs      3.22±0.01μs     0.91  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
 3.58±0.05μs      3.22±0.01μs     0.90  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
 3.55±0.01μs      3.29±0.01μs     0.93  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
 3.58±0.03μs      3.19±0.02μs     0.89  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
 59.9±0.04μs      6.01±0.02μs     0.10  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
 60.0±0.07μs       8.29±0.2μs     0.14  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
 60.8±0.03μs      13.0±0.06μs     0.21  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
 59.6±0.03μs      5.26±0.02μs     0.09  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
 5.73±0.01ms          236±5μs     0.04  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
 5.83±0.02ms          555±6μs     0.10  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
 6.13±0.01ms      1.39±0.02ms     0.23  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
 5.68±0.01ms          156±3μs     0.03  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
 3.83±0.03μs      3.23±0.02μs     0.84  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
 3.82±0.01μs      3.25±0.01μs     0.85  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
    3.84±0μs      3.32±0.01μs     0.86  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
 3.82±0.02μs      3.23±0.02μs     0.85  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
 87.8±0.04μs      7.29±0.03μs     0.08  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
 88.3±0.04μs      10.6±0.08μs     0.12  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
  89.5±0.3μs       18.0±0.1μs     0.20  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
 87.7±0.03μs      6.08±0.02μs     0.07  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
 8.59±0.02ms         374±10μs     0.04  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
 8.80±0.02ms      1.02±0.03ms     0.12  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
 9.19±0.03ms      2.05±0.01ms     0.22  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
 8.49±0.03ms          229±2μs     0.03  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)

i7-8550U[low-power]/GCC 8.4.0(baseline SSE3)

python runtests.py -j8 --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5

   before           after         ratio                                             (numaxes   size  dtype)
 [7a18e4ac]       [85e2ce98]
 <master>         <count_nonzero>
 7.04±0.05μs      6.54±0.01μs     0.93  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'object'>)
 37.2±0.01μs       5.63±0.1μs     0.15  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
 37.5±0.06μs       8.50±0.2μs     0.23  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
  37.8±0.1μs      12.9±0.03μs     0.34  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
 37.1±0.02μs       5.00±0.2μs     0.13  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
   408±0.9μs          357±3μs     0.88  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'object'>)
 3.41±0.01ms          197±8μs     0.06  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
 3.47±0.03ms          393±8μs     0.11  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
 3.62±0.01ms      1.07±0.04ms     0.29  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
    3.37±0ms        128±0.9μs     0.04  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
 40.7±0.08ms      35.3±0.05ms     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'object'>)
 3.66±0.02μs      3.32±0.04μs     0.91  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
 3.68±0.05μs      3.44±0.03μs     0.94  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
 3.66±0.03μs      3.26±0.02μs     0.89  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
 11.1±0.02μs      10.1±0.06μs     0.91  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'object'>)
 71.1±0.04μs       8.01±0.2μs     0.11  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
 71.2±0.05μs       11.6±0.6μs     0.16  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
  71.9±0.1μs       22.4±0.1μs     0.31  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
  70.9±0.3μs      6.30±0.09μs     0.09  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
     813±3μs          710±8μs     0.87  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'object'>)
 6.83±0.01ms          376±1μs     0.06  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
 6.97±0.01ms         889±40μs     0.13  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
 7.23±0.01ms      2.32±0.01ms     0.32  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
 6.80±0.01ms         259±20μs     0.04  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
  81.4±0.2ms      70.7±0.09ms     0.87  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'object'>)
 4.02±0.07μs      3.39±0.02μs     0.84  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
 4.02±0.02μs      3.31±0.02μs     0.82  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
 4.00±0.02μs      3.51±0.01μs     0.88  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
 3.99±0.01μs      3.27±0.02μs     0.82  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
  15.2±0.1μs      13.6±0.05μs     0.90  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'object'>)
   105±0.2μs      9.56±0.02μs     0.09  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
  105±0.02μs       15.6±0.6μs     0.15  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
   106±0.1μs       32.0±0.2μs     0.30  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
   104±0.5μs      7.68±0.07μs     0.07  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
    1.22±0ms      1.06±0.01ms     0.87  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'object'>)
 10.3±0.03ms          595±7μs     0.06  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
  10.4±0.1ms      1.37±0.02ms     0.13  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
 10.8±0.02ms      3.43±0.08ms     0.32  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
 10.1±0.02ms         390±10μs     0.04  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
   122±0.3ms        106±0.4ms     0.87  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'object'>)

Cortex-A53/GCC 9.3.0(baseline ASIMD)

python runtests.py -j8 --bench-compare master CountNonzero -- --sort name

   before           after         ratio                                             (numaxes   size  dtype)
 [7a18e4ac]       [85e2ce98]
 <master>         <count_nonzero>
 2.55±0.1μs      2.35±0.01μs     0.92  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int32'>)
   10.5±9μs      4.35±0.02μs     0.41  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'str'>)
   49.5±1μs      5.58±0.04μs     0.11  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
 33.4±0.4μs      8.12±0.04μs     0.24  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
 33.9±0.3μs      12.5±0.07μs     0.37  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
   48.5±2μs      4.28±0.02μs     0.09  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
4.69±0.02ms          258±2μs     0.06  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
3.06±0.01ms          501±4μs     0.16  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
3.08±0.01ms          970±9μs     0.32  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
4.67±0.02ms          135±3μs     0.03  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
   5.15±3μs      2.38±0.02μs     0.46  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'bool'>)
   6.44±4μs      2.41±0.01μs     0.37  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
2.85±0.07μs      2.54±0.02μs     0.89  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.12±0.03μs      2.39±0.03μs     0.76  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
96.1±0.03μs      8.15±0.02μs     0.08  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
 63.9±0.4μs      12.5±0.03μs     0.20  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
64.1±0.09μs      21.0±0.04μs     0.33  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
95.7±0.05μs      5.85±0.04μs     0.06  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
9.34±0.05ms          499±2μs     0.05  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
6.10±0.05ms         972±10μs     0.16  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
6.12±0.02ms      1.89±0.01ms     0.31  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
9.32±0.03ms          281±2μs     0.03  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
3.68±0.07μs         2.38±0μs     0.65  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
3.18±0.06μs      2.48±0.01μs     0.78  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
3.23±0.07μs      2.68±0.02μs     0.83  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
3.60±0.06μs      2.40±0.03μs     0.67  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
  143±0.2μs       10.4±0.1μs     0.07  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
   94.2±2μs       16.8±0.1μs     0.18  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
 94.4±0.4μs      29.6±0.08μs     0.31  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
  142±0.2μs      7.33±0.04μs     0.05  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
   14.0±2ms          735±5μs     0.05  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
9.10±0.02ms      1.42±0.01ms     0.16  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
9.13±0.06ms      2.81±0.02ms     0.31  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
14.0±0.02ms          408±6μs     0.03  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
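The benchmark identifiers in the tables above have the form `bench_core.CountNonzero.time_count_nonzero(numaxes, size, dtype)`, and the ratio column is the "after" time divided by the "before" time, so values below 1 are speedups. The class below is a standalone sketch of an ASV-style parameterized benchmark with that shape. Its `setup` body, the reduced parameter lists, and the class name `CountNonzeroSketch` are assumptions for illustration and are not copied from the NumPy benchmark suite.

```python
import timeit
import numpy as np

class CountNonzeroSketch:
    # ASV-style parameter grid: one benchmark run per combination.
    param_names = ["numaxes", "size", "dtype"]
    params = [[1, 2, 3], [100, 10000],
              [np.int8, np.int16, np.int32, np.int64]]

    def setup(self, numaxes, size, dtype):
        # Assumed input: a (numaxes, size) array mixing zeros and nonzeros.
        x = np.arange(numaxes * size).reshape(numaxes, size)
        self.x = (x % 3).astype(dtype)

    def time_count_nonzero(self, numaxes, size, dtype):
        np.count_nonzero(self.x)

# Run one parameter combination by hand, outside of ASV.
bench = CountNonzeroSketch()
bench.setup(2, 10000, np.int16)
elapsed = timeit.timeit(
    lambda: bench.time_count_nonzero(2, 10000, np.int16), number=100)
print(f"{elapsed / 100 * 1e6:.2f} µs per call")
```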

mattip · 2021-02-15T09:17:31Z

Thanks @touqir14

touqir14 added 2 commits January 18, 2021 17:59

Added support for SIMD operations for int types in numpy.count_nonzero function

d2e7768

Added tests for i1,i2,i4,i8 types for numpy.count_nonzero function

c716a12

touqir14 changed the title ~~Optimizing numpy.count_nonzero for int types using SIMD operations~~ SIMD : Optimizing numpy.count_nonzero for int types using SIMD operations Jan 18, 2021

touqir14 changed the title ~~SIMD : Optimizing numpy.count_nonzero for int types using SIMD operations~~ Optimizing numpy.count_nonzero for int types using SIMD operations Jan 18, 2021

Illviljan mentioned this pull request Jan 18, 2021

ENH: Improve performance of tril_indices and triu_indices #18176

Merged

Qiyu8 requested changes Jan 19, 2021

touqir14 added 2 commits January 19, 2021 16:19

Merged count_nonzero_int16/int32/int64 into count_nonzero_int and added benchmarks

15cf37d

Removed commented out code from PyArray_CountNonzero

2b41cbf

touqir14 requested a review from Qiyu8 January 19, 2021 10:38

seiko2plus self-requested a review January 20, 2021 22:55

Qiyu8 mentioned this pull request Jan 21, 2021

ENH: Add new intrinsics sum_u8/u16/u64. #18200

Merged

Qiyu8 reviewed Jan 21, 2021

charris changed the title ~~Optimizing numpy.count_nonzero for int types using SIMD operations~~ MAINT: Optimize numpy.count_nonzero for int types using SIMD operations Jan 22, 2021

github-actions bot added the 03 - Maintenance label Jan 22, 2021

charris added component: numpy._core component: benchmarks labels Jan 22, 2021

touqir14 added 2 commits February 5, 2021 16:49

Merge remote-tracking branch 'upstream/master'

ed3d080

Replaced manual sums with horizontal simd sums for count_nonzero_16/64

87c5d51

touqir14 requested a review from Qiyu8 February 5, 2021 10:59

seiko2plus requested changes Feb 7, 2021

fixed CI errors and optimized further simd_16 and simd_32

65892ef

touqir14 added 3 commits February 7, 2021 16:43

some fixes for the build problems

022cc66

another attempt to fix build issues

6895bab

removed the target variable and changed the loop as suggested by Sayed Adel

89d6e55

touqir14 added 2 commits February 8, 2021 04:13

Modified PyArray_CountNonzero to discriminate between types based on elsize

534132e

Ensured overflow does not happen for 16 and 32 bit ints

1eb91a3

touqir14 mentioned this pull request Feb 8, 2021

MAINT: Speed up numpy.nonzero. #18368

Closed

seiko2plus self-assigned this Feb 13, 2021

seiko2plus added 2 commits February 13, 2021 05:46

cleanup

d208702

fix up

85e2ce9

seiko2plus approved these changes Feb 13, 2021

mattip merged commit 17e3ef9 into numpy:master Feb 15, 2021

charris mentioned this pull request Aug 2, 2021

numpy/core/tests/test_numeric.py::TestNonzero::test_nonzero_onedim regression on sparc (misaligned access / bus error) #19592

Closed

B4dM4n added a commit to B4dM4n/numpy that referenced this pull request Oct 1, 2021

Revert "Merge pull request numpy#18183 from touqir14/master"

e3955a6

This reverts commit 17e3ef9, reversing changes made to 7a18e4a.
