MAINT: Optimize numpy.count_nonzero for int types using SIMD operations #18183
@Qiyu8, I have replaced the MIN/MAX macros, placed the NPY_SIMD checking guard at the proper place, merged the count_nonzero_int16/32/64 functions into a single function, and added benchmarks for the 4 int types.
vsum64 = npyv_add_u64(vsum64, vt);
}

npy_uint64 sums[npyv_nlanes_u64];
You can use the new acceleration intrinsics once #18200 is merged.
@Qiyu8 , I have replaced the manual sums with horizontal SIMD sums.
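For readers unfamiliar with the term, a horizontal sum collapses the lanes of one vector register into a single scalar, which replaces storing the lanes to a temporary array and adding them by hand. The following is a pure-Python model of that idea, not the NumPy C code; `NLANES`, `lanewise_add`, `horizontal_sum`, and `count_nonzero_model` are hypothetical names chosen for illustration, and the lane count of 4 is an assumption.

```python
NLANES = 4              # assumed lane count; real hardware varies
MASK64 = (1 << 64) - 1  # models unsigned 64-bit wraparound in each lane


def lanewise_add(a, b):
    """Add two 'vectors' lane by lane, as a vertical SIMD add would."""
    return [(x + y) & MASK64 for x, y in zip(a, b)]


def horizontal_sum(v):
    """Reduce all lanes of one vector to a single scalar."""
    total = 0
    for lane in v:
        total = (total + lane) & MASK64
    return total


def count_nonzero_model(data):
    """Count nonzero items: accumulate per lane, then reduce once at the end."""
    vsum = [0] * NLANES
    full = len(data) - len(data) % NLANES
    for i in range(0, full, NLANES):
        flags = [1 if x != 0 else 0 for x in data[i:i + NLANES]]
        vsum = lanewise_add(vsum, flags)
    # Scalar tail for the items that do not fill a whole vector.
    tail = sum(1 for x in data[full:] if x != 0)
    return horizontal_sum(vsum) + tail


data = [0, 3, 0, 7, 1, 0, 0, 2, 9, 0]
print(count_nonzero_model(data))  # 5 nonzero items
```

The point of the model is that the reduction across lanes happens once, after the loop, rather than on every iteration.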
Well done, the replaced part looks good to me. Now you need to focus on fixing the CI failures and providing ASV benchmark results.
Sorry to hijack this thread, but on the related topic of nonzero(): is there a reason why calling nonzero on a 1D array is orders of magnitude faster than calling it on a multi-dimensional array? For example, calling it on a Boolean array of shape (1000000,) takes ~40 µs, while it takes 1400 µs for an array of shape (1000,1000). Both arrays hold identical values and differ only in shape. Any idea where the significant overhead comes from?
@gnool , without further investigation, I am speculating the overhead is coming from the use of an iterator and calls to |
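The comparison described above can be reproduced with a short script. This is a minimal sketch, assuming a random Boolean array with roughly half of its entries set; the seed, repeat counts, and variable names are illustrative choices, and the absolute timings will differ by machine and NumPy version.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)          # fixed seed so runs are comparable
flat = rng.random(1_000_000) > 0.5      # Boolean array, shape (1000000,)
square = flat.reshape(1000, 1000)       # same values, shape (1000, 1000)

# Both calls find the same elements; only the index layout differs.
idx_flat = np.nonzero(flat)[0]
idx_square = np.ravel_multi_index(np.nonzero(square), square.shape)
same = np.array_equal(idx_flat, idx_square)

# Best-of-three timing, reported per call.
t_flat = min(timeit.repeat(lambda: np.nonzero(flat), number=5, repeat=3)) / 5
t_square = min(timeit.repeat(lambda: np.nonzero(square), number=5, repeat=3)) / 5
print(f"1D: {t_flat * 1e6:.0f} us, 2D: {t_square * 1e6:.0f} us, same elements: {same}")
```

Note that `np.nonzero` returns one index array per dimension, so the 2D call has to produce two output arrays instead of one; that may account for part of the gap in addition to any iterator overhead.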
I make heavy use of |
@seiko2plus, @Qiyu8, I have pushed updates.
@tylerjereddy , can you point me to some of your use cases of |
@touqir14 Finding the indices where an extremely large array of bools is Maybe something like:
# test_array is a huge 1D array of np.float64
relevant_indices = np.nonzero(test_array > 0.5)
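A runnable version of that use case might look like the sketch below. The random `test_array` is a stand-in (an assumption) for the real data, and the `[0]` unpacks the tuple that `np.nonzero` returns, which the one-liner above leaves as is.

```python
import numpy as np

rng = np.random.default_rng(42)
test_array = rng.random(1_000_000)      # stand-in for the huge 1D float64 array

mask = test_array > 0.5                 # Boolean array of the same shape
relevant_indices = np.nonzero(mask)[0]  # np.nonzero returns a tuple, one array per axis

# For 1D input, flatnonzero gives the same indices without the tuple.
assert np.array_equal(relevant_indices, np.flatnonzero(mask))
# The number of indices equals the number of True entries.
assert relevant_indices.size == np.count_nonzero(mask)
print(relevant_indices[:5])
```

Because the number of indices equals `np.count_nonzero(mask)`, a faster count can help size the output of this kind of call.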
Yes, a speedup in this case is possible. I will push a commit implementing the optimization for special cases later today or tomorrow. @tylerjereddy
@Qiyu8 , @seiko2plus , I also want to add optimizations to |
let's keep this pull-request only for |
@seiko2plus , the overflow possibilities have been taken care of. Please see my last commit to verify. Looks like distinguishing dtypes using |
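The kind of overflow at issue can be illustrated with a small pure-Python model. This is a sketch under assumptions, not the code in the commit: it assumes narrow 8-bit per-lane counters that are drained into a wide total before they can wrap, and `NLANES`, `count_naive`, and `count_flushed` are hypothetical names.

```python
NLANES = 16    # assumed number of 8-bit lanes in one vector
MAX_U8 = 255   # largest value an unsigned 8-bit lane can hold


def count_naive(data):
    """Per-lane 8-bit counters that are never flushed: wraps on long inputs."""
    lanes = [0] * NLANES
    for i, x in enumerate(data):
        if x != 0:
            lanes[i % NLANES] = (lanes[i % NLANES] + 1) & 0xFF  # u8 wraparound
    return sum(lanes)


def count_flushed(data):
    """Same counters, but drained into a wide total before they can wrap."""
    total = 0
    block = NLANES * MAX_U8  # each lane sees at most 255 items per block
    for start in range(0, len(data), block):
        lanes = [0] * NLANES  # fresh narrow counters for this block
        for i, x in enumerate(data[start:start + block]):
            if x != 0:
                lanes[i % NLANES] += 1
        total += sum(lanes)   # widen and accumulate
    return total


data = [1] * (NLANES * 256)   # each lane would receive 256 increments
print(count_naive(data))      # 0: every lane wrapped around
print(count_flushed(data))    # 4096: the exact count
```

The block size is the design choice that matters here: it bounds how many increments a narrow lane can receive between flushes.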
@seiko2plus, are we all good now? If so, please merge this PR.
@touqir14, I made some changes to increase readability and reduce the amount of code; they should not affect performance.
Yes, I think it's good. I would prefer to wait one more day to give others a chance to look at the code.
It would be nice to see a report of the benchmark changes before and after this PR, to make sure we have not slowed down any cases by mistake (non-contiguous? F-order?).
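A quick way to probe the layouts mentioned above, separately from the asv suite, is sketched below. The array shape, value range, and layout names are assumptions made for illustration; the script checks that `np.count_nonzero` agrees with an independent count on each layout and prints a rough per-call time to compare across builds.

```python
import timeit

import numpy as np

rng = np.random.default_rng(1)
base = rng.integers(-2, 3, size=(200, 300)).astype(np.int32)

layouts = {
    "C-contiguous": base,
    "F-order": np.asfortranarray(base),
    "strided (non-contiguous)": base[::2, ::3],
    "transposed": base.T,
}

results = {}
for name, arr in layouts.items():
    expected = int((arr != 0).sum())   # independent reference count
    got = int(np.count_nonzero(arr))
    # Best-of-three timing, reported per call.
    t = min(timeit.repeat(lambda: np.count_nonzero(arr), number=20, repeat=3)) / 20
    results[name] = (got, expected)
    print(f"{name:26s} c_contiguous={arr.flags.c_contiguous!s:5s} "
          f"count={got} time={t * 1e6:.1f} us")
```

Running the same script on master and on the PR branch gives a rough before/after view for the non-contiguous and F-order cases.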
Performance has increased on all supported architectures; see the following benchmarks:
Power9/GCC 9.2.1 (baseline VSX2)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
before after ratio (numaxes size dtype)
[7a18e4ac] [85e2ce98]
<master> <count_nonzero>
2.85±0μs 2.35±0μs 0.82 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int16'>)
2.76±0.01μs 2.38±0μs 0.86 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int32'>)
2.77±0μs 2.40±0.01μs 0.87 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int64'>)
2.84±0.01μs 2.36±0.02μs 0.83 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int8'>)
78.4±0.2μs 4.53±0μs 0.06 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
71.8±0.02μs 5.79±0.01μs 0.08 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
72.7±0.1μs 8.63±0.01μs 0.12 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
78.3±0.02μs 3.67±0.03μs 0.05 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
7.58±0ms 174±0.2μs 0.02 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
6.93±0.01ms 292±0.3μs 0.04 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
7.00±0ms 585±0.3μs 0.08 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
7.95±0.2ms 90.6±0.05μs 0.01 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
3.60±0μs 2.37±0μs 0.66 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
3.47±0.01μs 2.41±0μs 0.69 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
3.50±0μs 2.48±0μs 0.71 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.60±0.01μs 2.38±0.01μs 0.66 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
154±0.1μs 6.26±0μs 0.04 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
141±0.02μs 8.66±0.03μs 0.06 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
142±0.04μs 14.4±0.01μs 0.10 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
154±0.05μs 4.55±0.01μs 0.03 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
15.2±0.03ms 343±0.2μs 0.02 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
13.9±0.02ms 580±0.4μs 0.04 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
14.0±0.01ms 1.27±0.01ms 0.09 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
15.3±0.1ms 177±0.1μs 0.01 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
4.38±0.01μs 2.39±0.01μs 0.54 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
4.18±0.02μs 2.46±0.02μs 0.59 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
4.21±0.01μs 2.57±0.01μs 0.61 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
4.37±0μs 2.37±0μs 0.54 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
230±0.2μs 7.99±0.01μs 0.03 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
210±0.08μs 11.6±0.1μs 0.06 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
212±0.02μs 20.3±0.05μs 0.10 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
230±0.3μs 5.45±0.01μs 0.02 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
22.9±0.5ms 513±0.4μs 0.02 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
20.8±0.01ms 909±4μs 0.04 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
21.0±0.01ms 2.09±0.03ms 0.10 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
22.8±0.03ms 263±0.1μs 0.01 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
i7-8550U[low-power]/GCC 8.4.0 (baseline AVX2)
python runtests.py -j8 --cpu-baseline="avx2" --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
before after ratio (numaxes size dtype)
[7a18e4ac] [85e2ce98]
<master> <count_nonzero>
31.7±0.01μs 4.90±0.04μs 0.15 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
31.8±0.06μs 6.04±0.1μs 0.19 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
32.0±0.2μs 8.26±0.09μs 0.26 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
31.7±0.1μs 4.41±0.03μs 0.14 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
2.86±0.01ms 121±1μs 0.04 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
2.89±0.01ms 231±2μs 0.08 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
3.04±0.02ms 565±10μs 0.19 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
2.82±0.01ms 81.3±0.08μs 0.03 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
3.55±0.01μs 3.22±0.01μs 0.91 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
3.58±0.05μs 3.22±0.01μs 0.90 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
3.55±0.01μs 3.29±0.01μs 0.93 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.58±0.03μs 3.19±0.02μs 0.89 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
59.9±0.04μs 6.01±0.02μs 0.10 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
60.0±0.07μs 8.29±0.2μs 0.14 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
60.8±0.03μs 13.0±0.06μs 0.21 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
59.6±0.03μs 5.26±0.02μs 0.09 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
5.73±0.01ms 236±5μs 0.04 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
5.83±0.02ms 555±6μs 0.10 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
6.13±0.01ms 1.39±0.02ms 0.23 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
5.68±0.01ms 156±3μs 0.03 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
3.83±0.03μs 3.23±0.02μs 0.84 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
3.82±0.01μs 3.25±0.01μs 0.85 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
3.84±0μs 3.32±0.01μs 0.86 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
3.82±0.02μs 3.23±0.02μs 0.85 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
87.8±0.04μs 7.29±0.03μs 0.08 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
88.3±0.04μs 10.6±0.08μs 0.12 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
89.5±0.3μs 18.0±0.1μs 0.20 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
87.7±0.03μs 6.08±0.02μs 0.07 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
8.59±0.02ms 374±10μs 0.04 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
8.80±0.02ms 1.02±0.03ms 0.12 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
9.19±0.03ms 2.05±0.01ms 0.22 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
8.49±0.03ms 229±2μs 0.03 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
i7-8550U[low-power]/GCC 8.4.0 (baseline SSE3)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
before after ratio (numaxes size dtype)
[7a18e4ac] [85e2ce98]
<master> <count_nonzero>
7.04±0.05μs 6.54±0.01μs 0.93 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'object'>)
37.2±0.01μs 5.63±0.1μs 0.15 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
37.5±0.06μs 8.50±0.2μs 0.23 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
37.8±0.1μs 12.9±0.03μs 0.34 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
37.1±0.02μs 5.00±0.2μs 0.13 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
408±0.9μs 357±3μs 0.88 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'object'>)
3.41±0.01ms 197±8μs 0.06 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
3.47±0.03ms 393±8μs 0.11 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
3.62±0.01ms 1.07±0.04ms 0.29 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
3.37±0ms 128±0.9μs 0.04 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
40.7±0.08ms 35.3±0.05ms 0.87 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'object'>)
3.66±0.02μs 3.32±0.04μs 0.91 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
3.68±0.05μs 3.44±0.03μs 0.94 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.66±0.03μs 3.26±0.02μs 0.89 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
11.1±0.02μs 10.1±0.06μs 0.91 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'object'>)
71.1±0.04μs 8.01±0.2μs 0.11 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
71.2±0.05μs 11.6±0.6μs 0.16 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
71.9±0.1μs 22.4±0.1μs 0.31 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
70.9±0.3μs 6.30±0.09μs 0.09 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
813±3μs 710±8μs 0.87 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'object'>)
6.83±0.01ms 376±1μs 0.06 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
6.97±0.01ms 889±40μs 0.13 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
7.23±0.01ms 2.32±0.01ms 0.32 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
6.80±0.01ms 259±20μs 0.04 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
81.4±0.2ms 70.7±0.09ms 0.87 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'object'>)
4.02±0.07μs 3.39±0.02μs 0.84 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
4.02±0.02μs 3.31±0.02μs 0.82 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
4.00±0.02μs 3.51±0.01μs 0.88 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
3.99±0.01μs 3.27±0.02μs 0.82 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
15.2±0.1μs 13.6±0.05μs 0.90 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'object'>)
105±0.2μs 9.56±0.02μs 0.09 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
105±0.02μs 15.6±0.6μs 0.15 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
106±0.1μs 32.0±0.2μs 0.30 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
104±0.5μs 7.68±0.07μs 0.07 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
1.22±0ms 1.06±0.01ms 0.87 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'object'>)
10.3±0.03ms 595±7μs 0.06 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
10.4±0.1ms 1.37±0.02ms 0.13 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
10.8±0.02ms 3.43±0.08ms 0.32 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
10.1±0.02ms 390±10μs 0.04 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
122±0.3ms 106±0.4ms 0.87 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'object'>)
Cortex-A53/GCC 9.3.0 (baseline ASIMD)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name
before after ratio (numaxes size dtype)
[7a18e4ac] [85e2ce98]
<master> <count_nonzero>
2.55±0.1μs 2.35±0.01μs 0.92 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int32'>)
10.5±9μs 4.35±0.02μs 0.41 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'str'>)
49.5±1μs 5.58±0.04μs 0.11 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
33.4±0.4μs 8.12±0.04μs 0.24 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
33.9±0.3μs 12.5±0.07μs 0.37 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
48.5±2μs 4.28±0.02μs 0.09 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
4.69±0.02ms 258±2μs 0.06 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
3.06±0.01ms 501±4μs 0.16 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
3.08±0.01ms 970±9μs 0.32 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
4.67±0.02ms 135±3μs 0.03 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
5.15±3μs 2.38±0.02μs 0.46 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'bool'>)
6.44±4μs 2.41±0.01μs 0.37 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
2.85±0.07μs 2.54±0.02μs 0.89 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.12±0.03μs 2.39±0.03μs 0.76 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
96.1±0.03μs 8.15±0.02μs 0.08 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
63.9±0.4μs 12.5±0.03μs 0.20 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
64.1±0.09μs 21.0±0.04μs 0.33 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
95.7±0.05μs 5.85±0.04μs 0.06 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
9.34±0.05ms 499±2μs 0.05 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
6.10±0.05ms 972±10μs 0.16 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
6.12±0.02ms 1.89±0.01ms 0.31 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
9.32±0.03ms 281±2μs 0.03 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
3.68±0.07μs 2.38±0μs 0.65 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
3.18±0.06μs 2.48±0.01μs 0.78 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
3.23±0.07μs 2.68±0.02μs 0.83 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
3.60±0.06μs 2.40±0.03μs 0.67 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
143±0.2μs 10.4±0.1μs 0.07 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
94.2±2μs 16.8±0.1μs 0.18 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
94.4±0.4μs 29.6±0.08μs 0.31 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
142±0.2μs 7.33±0.04μs 0.05 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
14.0±2ms 735±5μs 0.05 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
9.10±0.02ms 1.42±0.01ms 0.16 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
9.13±0.06ms 2.81±0.02ms 0.31 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
14.0±0.02ms 408±6μs 0.03 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
Thanks @touqir14
As pointed out in this issue, numpy.count_nonzero, numpy.nonzero, and numpy.flatnonzero are rather slow and could use some optimization. This PR optimizes numpy.count_nonzero for signed and unsigned 8-bit, 16-bit, 32-bit, and 64-bit integers using SIMD operations. This in turn speeds up numpy.flatnonzero, numpy.nonzero, and several other functions that depend on numpy.count_nonzero.
Below, I have given benchmarks to showcase the speed improvements for the integer types with AVX2.
I have added a few test cases for each of the integer types. Let me know if more are required.
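As an illustration of the kind of per-dtype check meant here, the sketch below compares `np.count_nonzero` with a plain Python count across the eight integer types. It is not the test code added in this PR; the sizes and the use of extreme values are assumptions chosen to cover empty input, very short input, and lengths that are unlikely to be a multiple of a vector width.

```python
import numpy as np

INT_DTYPES = [np.int8, np.int16, np.int32, np.int64,
              np.uint8, np.uint16, np.uint32, np.uint64]

# Sizes chosen (as an assumption) to hit empty input, inputs shorter than
# a vector, and inputs with a scalar tail after the vectorized part.
SIZES = [0, 1, 7, 64, 1000, 4097]

rng = np.random.default_rng(7)
checked = 0
for dtype in INT_DTYPES:
    info = np.iinfo(dtype)
    for n in SIZES:
        arr = rng.integers(0, 3, size=n).astype(dtype)
        if n:
            arr[0] = info.max    # extreme values must count as nonzero
            arr[-1] = info.min   # 0 for unsigned types, most negative for signed
        expected = sum(1 for x in arr.tolist() if x != 0)
        assert np.count_nonzero(arr) == expected, (dtype, n)
        checked += 1
print(f"{checked} dtype/size combinations agree")
```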