MAINT: Optimize numpy.count_nonzero for int types using SIMD operations #18183

Merged (14 commits) on Feb 15, 2021

touqir14
Contributor

As pointed out in this issue, numpy.count_nonzero, numpy.nonzero, and numpy.flatnonzero are rather slow and could use some optimization. This PR optimizes numpy.count_nonzero for signed and unsigned 8-bit, 16-bit, 32-bit, and 64-bit integers using SIMD operations. This in turn speeds up numpy.flatnonzero, numpy.nonzero, and several other functions that depend on numpy.count_nonzero.

Below, I have given benchmarks to showcase the speed improvements for the integer types with AVX2.

import numpy as np

np.random.seed(0)
x = np.random.randint(0,2, size=10**6, dtype=np.int8)
%timeit -n 1000 -r 30 np.count_nonzero(x)
# With SIMD optimization: 34.9 µs ± 1.11 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)
# Without SIMD optimization: 1.96 ms ± 86 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)

import numpy as np
np.random.seed(0)
x = np.random.randint(0,2, size=10**6, dtype=np.int16)
%timeit -n 1000 -r 30 np.count_nonzero(x)
# With SIMD optimization: 55.9 µs ± 1.81 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)
# Without SIMD optimization: 1.94 ms ± 64.7 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)

import numpy as np
np.random.seed(0)
x = np.random.randint(0,2, size=10**6, dtype=np.int32)
%timeit -n 1000 -r 30 np.count_nonzero(x)
# With SIMD optimization: 138 µs ± 4.75 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)
# Without SIMD optimization: 1.93 ms ± 31.3 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)

import numpy as np
np.random.seed(0)
x = np.random.randint(0,2, size=10**6, dtype=np.int64)
%timeit -n 1000 -r 30 np.count_nonzero(x)
# With SIMD optimization: 387 µs ± 5.76 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)
# Without SIMD optimization: 2.02 ms ± 18.8 µs per loop (mean ± std. dev. of 30 runs, 1000 loops each)

I have added a few test cases for each of the integer types. Let me know if more are required.
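A check of this kind could look roughly like the following. This is a minimal sketch, not the tests added in the PR: the helper names `reference_count_nonzero` and `check_all_int_dtypes` and the chosen array sizes are assumptions made for illustration.

```python
import numpy as np

# Signed and unsigned integer types covered by the optimization.
INT_DTYPES = [np.int8, np.int16, np.int32, np.int64,
              np.uint8, np.uint16, np.uint32, np.uint64]


def reference_count_nonzero(arr):
    """Count nonzero elements with a plain Python loop (slow but obvious)."""
    return sum(1 for value in arr.ravel().tolist() if value != 0)


def check_all_int_dtypes(sizes=(0, 1, 7, 64, 1000), seed=0):
    """Compare np.count_nonzero against the reference for each integer dtype."""
    rng = np.random.RandomState(seed)
    for dtype in INT_DTYPES:
        for size in sizes:
            # Sizes that are not multiples of a vector width exercise any scalar tail.
            x = rng.randint(0, 2, size=size).astype(dtype)
            assert np.count_nonzero(x) == reference_count_nonzero(x), (dtype, size)
    return True
```

Calling `check_all_int_dtypes()` returns `True` when every dtype and size agrees with the reference loop, and raises an `AssertionError` naming the offending dtype and size otherwise.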

@touqir14 touqir14 changed the title Optimizing numpy.count_nonzero for int types using SIMD operations SIMD : Optimizing numpy.count_nonzero for int types using SIMD operations Jan 18, 2021
@touqir14 touqir14 changed the title SIMD : Optimizing numpy.count_nonzero for int types using SIMD operations Optimizing numpy.count_nonzero for int types using SIMD operations Jan 18, 2021
@touqir14
Contributor Author

@Qiyu8, I have replaced the MIN/MAX macros, moved the NPY_SIMD guard check to the proper place, merged the count_nonzero_int16/32/64 functions into a single function, and added benchmarks for the 4 int types.

vsum64 = npyv_add_u64(vsum64, vt);
}

npy_uint64 sums[npyv_nlanes_u64];
Member

You can use the new acceleration intrinsics once #18200 is merged.

Contributor Author

@Qiyu8, I have replaced the manual sums with horizontal SIMD sums.
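The idea of accumulating per-lane counts and then collapsing them with one horizontal sum can be emulated in NumPy. This is only an illustrative sketch, not the C code from the PR: the function name `count_nonzero_lanes` and the lane count of 4 are assumptions, and each row of the reshaped array merely stands in for one vector register.

```python
import numpy as np


def count_nonzero_lanes(x, nlanes=4):
    """Emulate lane-wise SIMD counting followed by one horizontal sum."""
    x = np.asarray(x).ravel()
    n_full = (x.size // nlanes) * nlanes
    # Each row plays the role of one vector register holding `nlanes` elements.
    vectors = x[:n_full].reshape(n_full // nlanes, nlanes)
    # Per-lane accumulator: lane i counts the nonzeros seen in column i.
    lane_sums = (vectors != 0).sum(axis=0, dtype=np.uint64)
    # Horizontal sum: collapse the lane accumulators into a single scalar.
    total = int(lane_sums.sum())
    # Scalar tail for the elements that do not fill a whole vector.
    total += int(np.count_nonzero(x[n_full:]))
    return total
```

The horizontal step happens once, after the loop over vectors, which is what makes it cheap compared with storing the lanes to memory and adding them by hand.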

Member

Well done. The replaced part looks good to me. Now you need to focus on fixing the CI failures and providing ASV benchmark results.

@charris charris changed the title Optimizing numpy.count_nonzero for int types using SIMD operations MAINT: Optimize numpy.count_nonzero for int types using SIMD operations Jan 22, 2021
@gnool

gnool commented Jan 22, 2021

Sorry to hijack this thread, but on the related topic of nonzero(): is there a reason why calling nonzero on a 1D array is orders of magnitude faster than calling it on a multi-dimensional array?

For example, calling it on a Boolean array of shape (1000000,) takes ~40 µs, while it takes 1400 µs for an array of shape (1000,1000). Both arrays are identical in values and differ only in shape.

Any idea what causes the significant overhead here?

@touqir14
Contributor Author

@gnool, without further investigation, I speculate that the overhead comes from the use of an iterator and the calls to the get_multi_index function in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/item_selection.c#L2555-L2600 whenever a multidimensional array is passed. In contrast, when a one-dimensional array is passed, no iterator is used and the indices are computed with the npy_memchr function, which is quite fast: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/item_selection.c#L2485-L2520
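The 1-D versus 2-D difference described here can be measured with a short script. This is a sketch under stated assumptions: the helper names `nonzero_via_flat` and `time_call` are made up for illustration, and whether the flattened route is actually faster depends on the NumPy version and build, so no timings are claimed.

```python
import timeit

import numpy as np

rng = np.random.RandomState(0)
flat = rng.randint(0, 2, size=10**6).astype(bool)  # shape (1000000,)
square = flat.reshape(1000, 1000)                  # same values, 2-D shape


def nonzero_via_flat(a):
    """Find nonzero positions on the flattened array, then map them back to N-D indices."""
    return np.unravel_index(np.flatnonzero(a), a.shape)


def time_call(func, arg, number=20):
    """Average seconds per call over `number` calls."""
    return timeit.timeit(lambda: func(arg), number=number) / number


# Both routes must agree on the indices they return.
direct = np.nonzero(square)
indirect = nonzero_via_flat(square)
same = all(np.array_equal(d, i) for d, i in zip(direct, indirect))

if __name__ == "__main__":
    print("indices agree:      ", same)
    print("1-D nonzero:        ", time_call(np.nonzero, flat))
    print("2-D nonzero:        ", time_call(np.nonzero, square))
    print("2-D via flatnonzero:", time_call(nonzero_via_flat, square))
```

`np.flatnonzero` and `np.unravel_index` are existing NumPy functions; combining them only sidesteps the multi-index path for C-contiguous input and is not something the PR itself proposes.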

@tylerjereddy
Contributor

I make heavy use of numpy.nonzero() in computational geometry work, so it will be great to see some potential speedups there.

@touqir14 touqir14 requested a review from Qiyu8 February 5, 2021 10:59
@touqir14
Contributor Author

touqir14 commented Feb 5, 2021

@seiko2plus, @Qiyu8, I have pushed updates.

@touqir14
Contributor Author

touqir14 commented Feb 5, 2021

@tylerjereddy, can you point me to some of your use cases of nonzero? I think I might be able to speed it up for certain cases.

@tylerjereddy
Contributor

@touqir14 Finding the indices where an extremely large array of bools is True, for example.

Maybe something like:

# test_array is a huge 1D array of np.float64
relevant_indices = np.nonzero(test_array > 0.5)

@touqir14
Contributor Author

touqir14 commented Feb 6, 2021

Yes, a speedup in this case is possible. I will push a commit implementing the optimization for special cases later today or tomorrow. @tylerjereddy

@touqir14
Contributor Author

touqir14 commented Feb 7, 2021

@Qiyu8, @seiko2plus, I also want to add optimizations to PyArray_Nonzero for some special input cases (the ndarray is aligned and C-contiguous). Should I just keep sending commits in this PR?

@seiko2plus
Member

@touqir14,

> Should I just keep sending commits in this PR?

Let's keep this pull request only for count_nonzero.

@touqir14
Contributor Author

touqir14 commented Feb 8, 2021

@seiko2plus, the overflow possibilities have been taken care of; please see my last commit to verify. It looks like distinguishing dtypes by kind and elsize has resolved the build (testing) issues on win32/64 and 32-bit Linux. There is still one failing test related to NumPy versioning, but it has nothing to do with my commits.
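The thread does not spell out how overflow is avoided, so the following is only a sketch of one common approach, not a description of the PR's code: keep narrow per-lane counters, but flush them into a wider total before they can wrap. The function name `count_nonzero_blocked`, the lane count of 16, and the block length of 255 are all assumptions for illustration.

```python
import numpy as np


def count_nonzero_blocked(x, nlanes=16, max_block=255):
    """Count nonzeros with uint8 lane counters, flushing before they can overflow."""
    x = np.asarray(x).ravel()
    n_full = (x.size // nlanes) * nlanes
    # One row per emulated vector; each entry is 1 where the element is nonzero.
    vectors = (x[:n_full] != 0).reshape(n_full // nlanes, nlanes).astype(np.uint8)
    total = 0
    for start in range(0, vectors.shape[0], max_block):
        block = vectors[start:start + max_block]
        # At most 255 ones are added per lane, so a uint8 counter cannot wrap.
        lane_counts = block.sum(axis=0, dtype=np.uint8)
        # Widen before the horizontal sum so the running total cannot wrap either.
        total += int(lane_counts.astype(np.uint64).sum())
    # Scalar tail for the elements that do not fill a whole vector.
    total += int(np.count_nonzero(x[n_full:]))
    return total
```

With `max_block` above 255 the uint8 lane counters could wrap around on dense input, which is the failure mode the blocking is meant to rule out.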

@seiko2plus seiko2plus self-assigned this Feb 13, 2021
@touqir14
Contributor Author

@seiko2plus, are we all good now? If so, please merge this PR.

@seiko2plus
Member

seiko2plus commented Feb 13, 2021

@touqir14, I made some changes to improve readability and reduce the amount of code; they shouldn't affect performance.

> are we all good now? If so, please merge this PR.

Yes, I think it's good. I'd prefer to wait one more day to give others a chance to look at the code.

@mattip
Member

mattip commented Feb 14, 2021

It would be nice to see a report of the benchmark changes before/after this PR to make sure we have not accidentally slowed down any cases (non-contiguous? F-order?).
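The layouts mentioned here (non-contiguous, F-order) can at least be checked for correctness with a small script before timing them. This is a hedged sketch: the helper names `make_layout_cases` and `check_layouts` and the particular set of layouts are assumptions, not part of NumPy's benchmark suite.

```python
import numpy as np


def make_layout_cases(size=1000, dtype=np.int32, seed=0):
    """Build arrays with the same kind of data but different memory layouts."""
    rng = np.random.RandomState(seed)
    base = rng.randint(0, 2, size=(size, size)).astype(dtype)
    return {
        "C-contiguous": base,
        "F-order": np.asfortranarray(base),
        "strided (every 2nd column)": base[:, ::2],
        "transposed view": base.T,
        "reversed": base[::-1, ::-1],
    }


def check_layouts(cases):
    """Return {name: (counts_match, is_c_contiguous, is_f_contiguous)}."""
    results = {}
    for name, arr in cases.items():
        # Reference count computed through an elementwise comparison.
        expected = int((arr != 0).sum())
        got = int(np.count_nonzero(arr))
        results[name] = (got == expected, arr.flags.c_contiguous, arr.flags.f_contiguous)
    return results
```

The same dictionary of cases can be fed to `timeit` on builds before and after the change to see whether any layout regressed.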

@seiko2plus
Member

@mattip,

> It would be nice to see a report of the benchmark changes before/after this PR

Performance has improved on all supported architectures; see the following benchmarks:

Power9/GCC 9.2.1(baseline VSX2)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
   before           after        ratio                                            (numaxes   size  dtype)
 [7a18e4ac]       [85e2ce98]
 <master>         <count_nonzero>
   2.85±0μs         2.35±0μs     0.82  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int16'>)
2.76±0.01μs         2.38±0μs     0.86  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int32'>)
   2.77±0μs      2.40±0.01μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int64'>)
2.84±0.01μs      2.36±0.02μs     0.83  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int8'>)
 78.4±0.2μs         4.53±0μs     0.06  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
71.8±0.02μs      5.79±0.01μs     0.08  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
 72.7±0.1μs      8.63±0.01μs     0.12  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
78.3±0.02μs      3.67±0.03μs     0.05  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
   7.58±0ms        174±0.2μs     0.02  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
6.93±0.01ms        292±0.3μs     0.04  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
   7.00±0ms        585±0.3μs     0.08  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
 7.95±0.2ms      90.6±0.05μs     0.01  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
   3.60±0μs         2.37±0μs     0.66  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
3.47±0.01μs         2.41±0μs     0.69  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
   3.50±0μs         2.48±0μs     0.71  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.60±0.01μs      2.38±0.01μs     0.66  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
  154±0.1μs         6.26±0μs     0.04  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
 141±0.02μs      8.66±0.03μs     0.06  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
 142±0.04μs      14.4±0.01μs     0.10  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
 154±0.05μs      4.55±0.01μs     0.03  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
15.2±0.03ms        343±0.2μs     0.02  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
13.9±0.02ms        580±0.4μs     0.04  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
14.0±0.01ms      1.27±0.01ms     0.09  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
 15.3±0.1ms        177±0.1μs     0.01  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
4.38±0.01μs      2.39±0.01μs     0.54  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
4.18±0.02μs      2.46±0.02μs     0.59  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
4.21±0.01μs      2.57±0.01μs     0.61  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
   4.37±0μs         2.37±0μs     0.54  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
  230±0.2μs      7.99±0.01μs     0.03  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
 210±0.08μs       11.6±0.1μs     0.06  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
 212±0.02μs      20.3±0.05μs     0.10  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
  230±0.3μs      5.45±0.01μs     0.02  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
 22.9±0.5ms        513±0.4μs     0.02  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
20.8±0.01ms          909±4μs     0.04  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
21.0±0.01ms      2.09±0.03ms     0.10  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
22.8±0.03ms        263±0.1μs     0.01  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
i7-8550U[low-power]/GCC 8.4.0(baseline AVX2)
python runtests.py -j8 --cpu-baseline="avx2" --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
   before           after        ratio                                            (numaxes   size  dtype)
 [7a18e4ac]       [85e2ce98]
 <master>         <count_nonzero>
 31.7±0.01μs      4.90±0.04μs     0.15  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
 31.8±0.06μs       6.04±0.1μs     0.19  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
  32.0±0.2μs      8.26±0.09μs     0.26  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
  31.7±0.1μs      4.41±0.03μs     0.14  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
 2.86±0.01ms          121±1μs     0.04  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
 2.89±0.01ms          231±2μs     0.08  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
 3.04±0.02ms         565±10μs     0.19  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
 2.82±0.01ms      81.3±0.08μs     0.03  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
 3.55±0.01μs      3.22±0.01μs     0.91  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
 3.58±0.05μs      3.22±0.01μs     0.90  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
 3.55±0.01μs      3.29±0.01μs     0.93  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
 3.58±0.03μs      3.19±0.02μs     0.89  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
 59.9±0.04μs      6.01±0.02μs     0.10  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
 60.0±0.07μs       8.29±0.2μs     0.14  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
 60.8±0.03μs      13.0±0.06μs     0.21  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
 59.6±0.03μs      5.26±0.02μs     0.09  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
 5.73±0.01ms          236±5μs     0.04  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
 5.83±0.02ms          555±6μs     0.10  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
 6.13±0.01ms      1.39±0.02ms     0.23  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
 5.68±0.01ms          156±3μs     0.03  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
 3.83±0.03μs      3.23±0.02μs     0.84  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
 3.82±0.01μs      3.25±0.01μs     0.85  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
    3.84±0μs      3.32±0.01μs     0.86  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
 3.82±0.02μs      3.23±0.02μs     0.85  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
 87.8±0.04μs      7.29±0.03μs     0.08  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
 88.3±0.04μs      10.6±0.08μs     0.12  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
  89.5±0.3μs       18.0±0.1μs     0.20  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
 87.7±0.03μs      6.08±0.02μs     0.07  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
 8.59±0.02ms         374±10μs     0.04  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
 8.80±0.02ms      1.02±0.03ms     0.12  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
 9.19±0.03ms      2.05±0.01ms     0.22  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
 8.49±0.03ms          229±2μs     0.03  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
i7-8550U[low-power]/GCC 8.4.0(baseline SSE3)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
   before           after         ratio                                             (numaxes   size  dtype)
 [7a18e4ac]       [85e2ce98]
 <master>         <count_nonzero>
 7.04±0.05μs      6.54±0.01μs     0.93  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'object'>)
 37.2±0.01μs       5.63±0.1μs     0.15  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
 37.5±0.06μs       8.50±0.2μs     0.23  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
  37.8±0.1μs      12.9±0.03μs     0.34  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
 37.1±0.02μs       5.00±0.2μs     0.13  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
   408±0.9μs          357±3μs     0.88  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'object'>)
 3.41±0.01ms          197±8μs     0.06  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
 3.47±0.03ms          393±8μs     0.11  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
 3.62±0.01ms      1.07±0.04ms     0.29  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
    3.37±0ms        128±0.9μs     0.04  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
 40.7±0.08ms      35.3±0.05ms     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'object'>)
 3.66±0.02μs      3.32±0.04μs     0.91  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
 3.68±0.05μs      3.44±0.03μs     0.94  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
 3.66±0.03μs      3.26±0.02μs     0.89  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
 11.1±0.02μs      10.1±0.06μs     0.91  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'object'>)
 71.1±0.04μs       8.01±0.2μs     0.11  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
 71.2±0.05μs       11.6±0.6μs     0.16  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
  71.9±0.1μs       22.4±0.1μs     0.31  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
  70.9±0.3μs      6.30±0.09μs     0.09  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
     813±3μs          710±8μs     0.87  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'object'>)
 6.83±0.01ms          376±1μs     0.06  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
 6.97±0.01ms         889±40μs     0.13  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
 7.23±0.01ms      2.32±0.01ms     0.32  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
 6.80±0.01ms         259±20μs     0.04  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
  81.4±0.2ms      70.7±0.09ms     0.87  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'object'>)
 4.02±0.07μs      3.39±0.02μs     0.84  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
 4.02±0.02μs      3.31±0.02μs     0.82  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
 4.00±0.02μs      3.51±0.01μs     0.88  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
 3.99±0.01μs      3.27±0.02μs     0.82  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
  15.2±0.1μs      13.6±0.05μs     0.90  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'object'>)
   105±0.2μs      9.56±0.02μs     0.09  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
  105±0.02μs       15.6±0.6μs     0.15  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
   106±0.1μs       32.0±0.2μs     0.30  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
   104±0.5μs      7.68±0.07μs     0.07  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
    1.22±0ms      1.06±0.01ms     0.87  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'object'>)
 10.3±0.03ms          595±7μs     0.06  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
  10.4±0.1ms      1.37±0.02ms     0.13  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
 10.8±0.02ms      3.43±0.08ms     0.32  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
 10.1±0.02ms         390±10μs     0.04  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
   122±0.3ms        106±0.4ms     0.87  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'object'>)
Cortex-A53/GCC 9.3.0(baseline ASIMD)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name
   before           after         ratio                                             (numaxes   size  dtype)
 [7a18e4ac]       [85e2ce98]
 <master>         <count_nonzero>
 2.55±0.1μs      2.35±0.01μs     0.92  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int32'>)
   10.5±9μs      4.35±0.02μs     0.41  bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'str'>)
   49.5±1μs      5.58±0.04μs     0.11  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
 33.4±0.4μs      8.12±0.04μs     0.24  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
 33.9±0.3μs      12.5±0.07μs     0.37  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
   48.5±2μs      4.28±0.02μs     0.09  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
4.69±0.02ms          258±2μs     0.06  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
3.06±0.01ms          501±4μs     0.16  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
3.08±0.01ms          970±9μs     0.32  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
4.67±0.02ms          135±3μs     0.03  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
   5.15±3μs      2.38±0.02μs     0.46  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'bool'>)
   6.44±4μs      2.41±0.01μs     0.37  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
2.85±0.07μs      2.54±0.02μs     0.89  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.12±0.03μs      2.39±0.03μs     0.76  bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
96.1±0.03μs      8.15±0.02μs     0.08  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
 63.9±0.4μs      12.5±0.03μs     0.20  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
64.1±0.09μs      21.0±0.04μs     0.33  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
95.7±0.05μs      5.85±0.04μs     0.06  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
9.34±0.05ms          499±2μs     0.05  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
6.10±0.05ms         972±10μs     0.16  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
6.12±0.02ms      1.89±0.01ms     0.31  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
9.32±0.03ms          281±2μs     0.03  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
3.68±0.07μs         2.38±0μs     0.65  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
3.18±0.06μs      2.48±0.01μs     0.78  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
3.23±0.07μs      2.68±0.02μs     0.83  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
3.60±0.06μs      2.40±0.03μs     0.67  bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
  143±0.2μs       10.4±0.1μs     0.07  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
   94.2±2μs       16.8±0.1μs     0.18  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
 94.4±0.4μs      29.6±0.08μs     0.31  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
  142±0.2μs      7.33±0.04μs     0.05  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
   14.0±2ms          735±5μs     0.05  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
9.10±0.02ms      1.42±0.01ms     0.16  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
9.13±0.06ms      2.81±0.02ms     0.31  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
14.0±0.02ms          408±6μs     0.03  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
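
The rows above are keyed by (numaxes, size, dtype), and the ratio column is the "after" time divided by the "before" time, so a ratio of 0.02 corresponds to roughly a 50x speedup. A parameterized benchmark producing rows of that shape might be sketched as follows. This is a hypothetical stand-in written as a plain class, not the actual bench_core.CountNonzero: the class name `CountNonzeroSketch` and the way the input array is built are assumptions.

```python
import numpy as np


class CountNonzeroSketch:
    """Hypothetical ASV-style benchmark; not the actual bench_core.CountNonzero."""

    param_names = ["numaxes", "size", "dtype"]
    params = [
        [1, 2, 3],
        [100, 10000, 1000000],
        [np.int8, np.int16, np.int32, np.int64],
    ]

    def setup(self, numaxes, size, dtype):
        # Assumed layout: `numaxes` rows of `size` elements mixing zeros and nonzeros.
        values = np.arange(numaxes * size).reshape(numaxes, size) % 3
        self.x = values.astype(dtype)

    def time_count_nonzero(self, numaxes, size, dtype):
        # ASV times the body of methods whose names start with `time_`.
        np.count_nonzero(self.x)
```

Building the input in `setup` keeps array construction out of the timed region, so only the call to `np.count_nonzero` is measured.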

@mattip mattip merged commit 17e3ef9 into numpy:master Feb 15, 2021
@mattip
Member

mattip commented Feb 15, 2021

Thanks @touqir14
