
MAINT: Optimize the performance of count_nonzero by using universal intrinsics #17958

Merged
merged 29 commits into from
Dec 23, 2020

Conversation

Qiyu8
Member

@Qiyu8 Qiyu8 commented Dec 8, 2020

Introduction

np.count_nonzero is a common operation in database, information-retrieval, cryptographic, and machine-learning applications. The equivalent OpenCV function, which uses universal intrinsics, is reported to be nearly 25x faster than NumPy's implementation. After some investigation, the algorithm proved easy to migrate into the current USIMD framework. Performance increased by 35% with AVX2 instructions.
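For readers unfamiliar with the operation, here is a minimal NumPy-level illustration of the path this PR optimizes (counting the set elements of a contiguous boolean array); the array contents are arbitrary:

```python
import numpy as np

# count_nonzero on a large contiguous bool array hits the fast
# byte-counting path that this patch vectorizes with universal intrinsics.
a = np.zeros(1_000_000, dtype=bool)
a[::3] = True  # every third element is nonzero

print(np.count_nonzero(a))  # -> 333334
```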

Benchmark

Here is the ASV benchmark result.

AVX2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building 4337b254  for virtualenv-py3.7-Cython
·· Installing 4337b254  into virtualenv-py3.7-Cython
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For numpy commit 7a505741  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  8.33%] ··· Running (bench_core.CountNonzero.time_count_nonzero--).
[ 16.67%] ··· Running (bench_core.CountNonzero.time_count_nonzero_axis--).
[ 25.00%] ··· Running (bench_core.CountNonzero.time_count_nonzero_multi_axis--).
[ 25.00%] · For numpy commit 4337b254  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 33.33%] ··· Running (bench_core.CountNonzero.time_count_nonzero--).
[ 41.67%] ··· Running (bench_core.CountNonzero.time_count_nonzero_axis--).
[ 50.00%] ··· Running (bench_core.CountNonzero.time_count_nonzero_multi_axis--).
[ 50.00%] · For numpy commit 4337b254  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 58.33%] ··· bench_core.CountNonzero.time_count_nonzero                                                                            ok
[ 58.33%] ··· ========= ========= ============= ============= ============= =============
              --                                           dtype
              ------------------- -------------------------------------------------------
               numaxes     size        bool          int           str          object
              ========= ========= ============= ============= ============= =============
                  1        100      1.36±0.1µs   1.50±0.01µs   2.12±0.05µs   2.02±0.05µs
                  1       10000     1.99±0.3µs    32.1±0.6µs     85.9±4µs     77.2±0.6µs
                  1      1000000     41.2±2µs    2.99±0.01ms    9.55±0.3ms    8.15±0.1ms
                  2        100     1.28±0.01µs   1.74±0.03µs   2.93±0.06µs    2.75±0.1µs
                  2       10000    2.34±0.09µs    61.7±0.5µs     174±9µs       164±10µs
                  2      1000000     94.9±3µs    6.01±0.07ms     19.2±1ms     15.8±0.2ms
                  3        100     1.29±0.06µs    2.22±0.1µs   3.93±0.08µs   3.65±0.06µs
                  3       10000    2.81±0.07µs     91.6±1µs      257±8µs       259±8µs
                  3      1000000     141±5µs      9.14±0.1ms    28.5±0.4ms    23.6±0.6ms
              ========= ========= ============= ============= ============= =============

[ 66.67%] ··· bench_core.CountNonzero.time_count_nonzero_axis                                                                       ok
[ 66.67%] ··· ========= ========= ============= ============= ============ ============
              --                                          dtype
              ------------------- -----------------------------------------------------
               numaxes     size        bool          int          str         object
              ========= ========= ============= ============= ============ ============
                  1        100      8.57±0.3µs     11.5±1µs    11.0±0.1µs   10.7±0.5µs
                  1       10000      23.9±1µs      30.7±1µs     118±3µs      151±3µs
                  1      1000000   1.37±0.02ms   2.61±0.09ms    12.7±1ms    14.3±0.7ms
                  2        100      8.59±0.4µs     9.98±1µs    12.1±0.7µs    14.0±2µs
                  2       10000      39.0±2µs     49.5±0.7µs    246±5µs      290±10µs
                  2      1000000   2.76±0.04ms   4.87±0.09ms   24.2±0.7ms    28.0±1ms
                  3        100      8.58±0.5µs     10.2±1µs    13.6±0.5µs   13.5±0.4µs
                  3       10000      52.7±4µs      69.1±2µs     365±9µs      428±10µs
                  3      1000000   4.17±0.07ms    7.52±0.4ms   36.2±0.5ms   41.8±0.4ms
              ========= ========= ============= ============= ============ ============

[ 75.00%] ··· bench_core.CountNonzero.time_count_nonzero_multi_axis                                                                 ok
[ 75.00%] ··· ========= ========= ============= ============= ============ ============
              --                                          dtype
              ------------------- -----------------------------------------------------
               numaxes     size        bool          int          str         object
              ========= ========= ============= ============= ============ ============
                  1        100      8.79±0.7µs    9.58±0.3µs   11.9±0.9µs   11.0±0.3µs
                  1       10000     23.4±0.8µs    30.6±0.3µs    125±4µs      150±4µs
                  1      1000000   1.38±0.01ms   2.42±0.04ms   12.2±0.2ms   14.1±0.7ms
                  2        100      8.61±0.5µs    9.81±0.5µs   12.6±0.4µs    12.6±2µs
                  2       10000      38.1±1µs      57.0±4µs     245±20µs     294±10µs
                  2      1000000   2.80±0.06ms    4.84±0.1ms   25.1±0.3ms   27.7±0.2ms
                  3        100      8.75±0.3µs    9.95±0.3µs   13.5±0.6µs   13.8±0.4µs
                  3       10000     49.6±0.7µs     70.8±2µs     357±4µs      424±20µs
                  3      1000000    4.15±0.2ms    7.79±0.7ms   35.5±0.8ms   42.4±0.5ms
              ========= ========= ============= ============= ============ ============

[ 75.00%] · For numpy commit 7a505741 (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 83.33%] ··· bench_core.CountNonzero.time_count_nonzero                                                                            ok
[ 83.33%] ··· ========= ========= ============= ============= ============= =============
              --                                           dtype
              ------------------- -------------------------------------------------------
               numaxes     size        bool          int           str          object
              ========= ========= ============= ============= ============= =============
                  1        100     1.27±0.03µs   1.51±0.05µs    2.36±0.2µs   1.94±0.01µs
                  1       10000    2.13±0.04µs    32.5±0.7µs     85.3±2µs     79.1±1µs
                  1      1000000    63.2±0.8µs   2.99±0.03ms    9.28±0.2ms   7.93±0.3ms
                  2        100     1.27±0.02µs   1.79±0.06µs   2.94±0.07µs    2.84±0.1µs
                  2       10000    2.82±0.05µs    62.7±0.9µs     182±9µs      158±5µs
                  2      1000000     128±2µs     6.16±0.07ms    19.7±0.8ms   15.9±0.3ms
                  3        100     1.28±0.04µs   2.17±0.07µs   3.90±0.06µs   3.63±0.02µs
                  3       10000    3.42±0.06µs     91.2±2µs      263±10µs     236±3µs
                  3      1000000     197±8µs     9.05±0.06ms    28.0±0.2ms   23.7±0.2ms
              ========= ========= ============= ============= ============= =============

[ 91.67%] ··· bench_core.CountNonzero.time_count_nonzero_axis                                                                       ok
[ 91.67%] ··· ========= ========= ============= ============= ============= ============
              --                                           dtype
              ------------------- ------------------------------------------------------
               numaxes     size        bool          int           str         object
              ========= ========= ============= ============= ============= ============
                  1        100      8.26±0.2µs    9.51±0.8µs   10.6±0.07µs   11.1±0.6µs
                  1       10000     22.4±0.5µs     30.6±1µs      120±3µs      156±10µs
                  1      1000000   1.39±0.02ms   2.37±0.04ms    12.6±0.3ms   14.3±0.3ms
                  2        100      8.50±0.4µs    9.58±0.3µs    12.0±0.4µs   11.7±0.2µs
                  2       10000     38.3±0.8µs     50.9±3µs      234±4µs      295±20µs
                  2      1000000   2.75±0.02ms    4.97±0.2ms    24.7±0.6ms   28.2±0.4ms
                  3        100      8.66±0.4µs    9.55±0.1µs    13.6±0.5µs   13.5±0.3µs
                  3       10000      53.7±7µs      70.1±2µs      358±40µs     416±7µs
                  3      1000000    4.11±0.1ms    7.27±0.2ms     36.2±1ms     43.2±1ms
              ========= ========= ============= ============= ============= ============

[100.00%] ··· bench_core.CountNonzero.time_count_nonzero_multi_axis                                                                 ok
[100.00%] ··· ========= ========= ============= ============= ============ ============
              --                                          dtype
              ------------------- -----------------------------------------------------
               numaxes     size        bool          int          str         object
              ========= ========= ============= ============= ============ ============
                  1        100      8.64±0.4µs    9.57±0.3µs   11.4±0.3µs   10.8±0.3µs
                  1       10000      24.8±1µs      31.4±1µs     125±6µs      152±2µs
                  1      1000000    1.40±0.2ms   2.44±0.05ms   12.4±0.2ms   14.1±0.2ms
                  2        100      8.79±0.2µs    10.2±0.5µs   12.2±0.4µs    12.4±1µs
                  2       10000      36.8±1µs      52.8±4µs     247±10µs     292±20µs
                  2      1000000   2.80±0.04ms   5.00±0.07ms   24.7±0.7ms   28.0±0.7ms
                  3        100      9.24±0.3µs    10.1±0.3µs   13.9±0.4µs   14.0±0.4µs
                  3       10000     50.7±0.8µs     74.7±4µs     345±20µs     419±9µs
                  3      1000000    4.15±0.2ms    7.89±0.4ms    36.6±1ms     43.0±2ms
              ========= ========= ============= ============= ============ ============

       before           after         ratio
     [7a505741]       [4337b254]
     <master>         <countnz>
-      2.82±0.05µs      2.34±0.09µs     0.83  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'bool'>)
-      3.42±0.06µs      2.81±0.07µs     0.82  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'bool'>)
-          128±2µs         94.9±3µs     0.74  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'bool'>)
-          197±8µs          141±5µs     0.71  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'bool'>)
-       63.2±0.8µs         41.2±2µs     0.65  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

SSE2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building d2f477c3  for virtualenv-py3.7-Cython
·· Installing d2f477c3  into virtualenv-py3.7-Cython
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For numpy commit 7a505741  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  8.33%] ··· Running (bench_core.CountNonzero.time_count_nonzero--).
[ 16.67%] ··· Running (bench_core.CountNonzero.time_count_nonzero_axis--).
[ 25.00%] ··· Running (bench_core.CountNonzero.time_count_nonzero_multi_axis--).
[ 25.00%] · For numpy commit d2f477c3  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 33.33%] ··· Running (bench_core.CountNonzero.time_count_nonzero--).
[ 41.67%] ··· Running (bench_core.CountNonzero.time_count_nonzero_axis--).
[ 50.00%] ··· Running (bench_core.CountNonzero.time_count_nonzero_multi_axis--).
[ 50.00%] · For numpy commit d2f477c3  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 58.33%] ··· bench_core.CountNonzero.time_count_nonzero                  ok
[ 58.33%] ··· ========= ========= ======== =============
               numaxes     size    dtype
              --------- --------- -------- -------------
                  1        100      bool    1.35±0.07μs
                  1        100      int     1.57±0.04μs
                  1        100      str     2.20±0.08μs
                  1        100     object    2.06±0.2μs
                  1       10000     bool     2.28±0.2μs
                  1       10000     int       32.2±1μs
                  1       10000     str      86.9±0.9μs
                  1       10000    object     77.7±1μs
                  1      1000000    bool      72.7±1μs
                  1      1000000    int     3.05±0.04ms
                  1      1000000    str      9.61±0.2ms
                  1      1000000   object    7.97±0.2ms
                  2        100      bool    1.36±0.03μs
                  2        100      int     1.82±0.03μs
                  2        100      str     2.97±0.09μs
                  2        100     object   2.83±0.08μs
                  2       10000     bool     3.12±0.2μs
                  2       10000     int       62.0±1μs
                  2       10000     str       173±10μs
                  2       10000    object     156±2μs
                  2      1000000    bool      146±5μs
                  2      1000000    int     6.10±0.07ms
                  2      1000000    str      19.0±0.6ms
                  2      1000000   object    16.0±0.4ms
                  3        100      bool    1.30±0.02μs
                  3        100      int      2.17±0.2μs
                  3        100      str     3.92±0.09μs
                  3        100     object    3.61±0.1μs
                  3       10000     bool     4.03±0.3μs
                  3       10000     int       95.5±2μs
                  3       10000     str       260±9μs
                  3       10000    object     234±5μs
                  3      1000000    bool      224±8μs
                  3      1000000    int      9.27±0.2ms
                  3      1000000    str      29.2±0.8ms
                  3      1000000   object    24.2±0.2ms
              ========= ========= ======== =============

[ 66.67%] ··· ...core.CountNonzero.time_count_nonzero_axis                  ok
[ 66.67%] ··· ========= ========= ======== =============
               numaxes     size    dtype
              --------- --------- -------- -------------
                  1        100      bool     8.77±0.6μs
                  1        100      int      9.40±0.3μs
                  1        100      str      10.8±0.3μs
                  1        100     object    10.9±0.6μs
                  1       10000     bool     22.8±0.7μs
                  1       10000     int      31.7±0.9μs
                  1       10000     str       127±3μs
                  1       10000    object     148±3μs
                  1      1000000    bool    1.45±0.04ms
                  1      1000000    int      2.46±0.1ms
                  1      1000000    str      12.4±0.3ms
                  1      1000000   object    14.2±0.3ms
                  2        100      bool     8.67±0.2μs
                  2        100      int      10.0±0.5μs
                  2        100      str       13.5±2μs
                  2        100     object    13.2±0.6μs
                  2       10000     bool      41.4±2μs
                  2       10000     int       51.9±2μs
                  2       10000     str       242±10μs
                  2       10000    object     293±7μs
                  2      1000000    bool     2.83±0.1ms
                  2      1000000    int      5.09±0.1ms
                  2      1000000    str       25.5±1ms
                  2      1000000   object    27.9±0.4ms
                  3        100      bool     8.76±0.3μs
                  3        100      int      9.92±0.7μs
                  3        100      str      13.4±0.6μs
                  3        100     object     13.9±2μs
                  3       10000     bool      52.5±2μs
                  3       10000     int       71.3±1μs
                  3       10000     str       380±10μs
                  3       10000    object     432±20μs
                  3      1000000    bool     4.30±0.1ms
                  3      1000000    int      7.49±0.1ms
                  3      1000000    str       37.9±1ms
                  3      1000000   object    42.3±0.2ms
              ========= ========= ======== =============

[ 75.00%] ··· ...ountNonzero.time_count_nonzero_multi_axis                  ok
[ 75.00%] ··· ========= ========= ======== =============
               numaxes     size    dtype
              --------- --------- -------- -------------
                  1        100      bool     8.93±0.3μs
                  1        100      int      9.64±0.3μs
                  1        100      str      11.3±0.4μs
                  1        100     object    11.1±0.3μs
                  1       10000     bool      23.6±1μs
                  1       10000     int       31.9±1μs
                  1       10000     str       130±10μs
                  1       10000    object     158±10μs
                  1      1000000    bool    1.50±0.07ms
                  1      1000000    int      2.50±0.1ms
                  1      1000000    str      12.3±0.3ms
                  1      1000000   object    14.4±0.7ms
                  2        100      bool     9.35±0.6μs
                  2        100      int      11.4±0.6μs
                  2        100      str       12.5±1μs
                  2        100     object    12.8±0.4μs
                  2       10000     bool      37.4±2μs
                  2       10000     int       53.3±2μs
                  2       10000     str       249±6μs
                  2       10000    object     298±8μs
                  2      1000000    bool     3.22±0.2ms
                  2      1000000    int      4.93±0.2ms
                  2      1000000    str      24.7±0.8ms
                  2      1000000   object    28.1±0.6ms
                  3        100      bool     9.28±0.2μs
                  3        100      int      10.4±0.4μs
                  3        100      str      13.6±0.7μs
                  3        100     object    15.6±0.7μs
                  3       10000     bool      51.9±1μs
                  3       10000     int       76.9±3μs
                  3       10000     str       386±30μs
                  3       10000    object     433±10μs
                  3      1000000    bool     4.85±0.3ms
                  3      1000000    int      7.55±0.3ms
                  3      1000000    str      36.9±0.8ms
                  3      1000000   object     43.5±1ms
              ========= ========= ======== =============

[ 75.00%] · For numpy commit 7a505741 (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 83.33%] ··· bench_core.CountNonzero.time_count_nonzero                    ok
[ 83.33%] ··· ========= ========= ======== =============
               numaxes     size    dtype
              --------- --------- -------- -------------
                  1        100      bool    1.34±0.04μs
                  1        100      int     1.58±0.09μs
                  1        100      str     2.10±0.04μs
                  1        100     object   2.08±0.04μs
                  1       10000     bool     2.19±0.1μs
                  1       10000     int      31.8±0.2μs
                  1       10000     str       85.3±4μs
                  1       10000    object    78.7±1μs
                  1      1000000    bool     65.3±4μs
                  1      1000000    int     2.99±0.07ms
                  1      1000000    str      9.47±0.3ms
                  1      1000000   object    8.19±0.8ms
                  2        100      bool    1.28±0.03μs
                  2        100      int     1.84±0.04μs
                  2        100      str      3.00±0.1μs
                  2        100     object    3.01±0.1μs
                  2       10000     bool     2.87±0.1μs
                  2       10000     int       64.4±2μs
                  2       10000     str       175±7μs
                  2       10000    object     160±5μs
                  2      1000000    bool      129±4μs
                  2      1000000    int      6.31±0.2ms
                  2      1000000    str      18.6±0.4ms
                  2      1000000   object     16.7±2ms
                  3        100      bool    1.30±0.06μs
                  3        100      int      2.17±0.2μs
                  3        100      str      3.91±0.1μs
                  3        100     object    3.82±0.2μs
                  3       10000     bool     3.41±0.1μs
                  3       10000     int       92.7±3μs
                  3       10000     str       263±20μs
                  3       10000    object     232±4μs
                  3      1000000    bool      207±7μs
                  3      1000000    int      9.09±0.2ms
                  3      1000000    str       29.5±1ms
                  3      1000000   object    24.0±0.7ms
              ========= ========= ======== =============

[ 91.67%] ··· ...core.CountNonzero.time_count_nonzero_axis                  ok
[ 91.67%] ··· ========= ========= ======== =============
               numaxes     size    dtype
              --------- --------- -------- -------------
                  1        100      bool     8.28±0.3μs
                  1        100      int      9.83±0.5μs
                  1        100      str      11.1±0.2μs
                  1        100     object    11.1±0.2μs
                  1       10000     bool     24.8±0.8μs
                  1       10000     int       31.9±1μs
                  1       10000     str       128±3μs
                  1       10000    object     159±3μs
                  1      1000000    bool    1.48±0.06ms
                  1      1000000    int     2.58±0.08ms
                  1      1000000    str      12.3±0.1ms
                  1      1000000   object    14.4±0.5ms
                  2        100      bool     8.87±0.5μs
                  2        100      int      9.80±0.4μs
                  2        100      str      12.2±0.3μs
                  2        100     object    12.7±0.5μs
                  2       10000     bool      38.1±1μs
                  2       10000     int       54.1±2μs
                  2       10000     str       237±8μs
                  2       10000    object     295±10μs
                  2      1000000    bool    2.87±0.02ms
                  2      1000000    int      5.08±0.1ms
                  2      1000000    str      25.9±0.9ms
                  2      1000000   object    28.7±0.8ms
                  3        100      bool     8.94±0.6μs
                  3        100      int      10.3±0.9μs
                  3        100      str      13.8±0.4μs
                  3        100     object    13.8±0.4μs
                  3       10000     bool     51.4±0.4μs
                  3       10000     int       72.3±3μs
                  3       10000     str       355±20μs
                  3       10000    object     451±50μs
                  3      1000000    bool     4.19±0.1ms
                  3      1000000    int      7.49±0.2ms
                  3      1000000    str       37.0±1ms
                  3      1000000   object     43.0±2ms
              ========= ========= ======== =============

[100.00%] ··· ...ountNonzero.time_count_nonzero_multi_axis                  ok
[100.00%] ··· ========= ========= ======== =============
               numaxes     size    dtype
              --------- --------- -------- -------------
                  1        100      bool     9.10±0.4μs
                  1        100      int      9.54±0.3μs
                  1        100      str      11.5±0.5μs
                  1        100     object    10.9±0.3μs
                  1       10000     bool      24.5±1μs
                  1       10000     int       31.4±1μs
                  1       10000     str       130±8μs
                  1       10000    object     151±3μs
                  1      1000000    bool    1.41±0.02ms
                  1      1000000    int      2.56±0.1ms
                  1      1000000    str      12.0±0.2ms
                  1      1000000   object    14.4±0.5ms
                  2        100      bool     8.76±0.1μs
                  2        100      int      10.2±0.3μs
                  2        100      str      12.6±0.4μs
                  2        100     object    13.1±0.4μs
                  2       10000     bool      39.0±1μs
                  2       10000     int      52.2±0.8μs
                  2       10000     str       242±6μs
                  2       10000    object     332±20μs
                  2      1000000    bool     2.98±0.2ms
                  2      1000000    int      5.04±0.2ms
                  2      1000000    str      24.2±0.4ms
                  2      1000000   object    28.9±0.9ms
                  3        100      bool     8.77±0.6μs
                  3        100      int      10.4±0.3μs
                  3        100      str      13.5±0.2μs
                  3        100     object    14.3±0.6μs
                  3       10000     bool      54.8±2μs
                  3       10000     int       72.8±2μs
                  3       10000     str       365±10μs
                  3       10000    object     463±9μs
                  3      1000000    bool    4.23±0.09ms
                  3      1000000    int      7.66±0.3ms
                  3      1000000    str       36.9±1ms
                  3      1000000   object    42.7±0.8ms
              ========= ========= ======== =============

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

NEON enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython.
·· Building 5da4a8e1  for virtualenv-py3.7-Cython.....................................
·· Installing 5da4a8e1  into virtualenv-py3.7-Cython.
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For numpy commit 5da4a8e1  (round 1/2):
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  8.33%] ··· Running (bench_core.CountNonzero.time_count_nonzero--).
[ 16.67%] ··· Running (bench_core.CountNonzero.time_count_nonzero_axis--).
[ 25.00%] ··· Running (bench_core.CountNonzero.time_count_nonzero_multi_axis--).
[ 25.00%] · For numpy commit 5da4a8e1  (round 1/2):
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 33.33%] ··· Running (bench_core.CountNonzero.time_count_nonzero--).
[ 41.67%] ··· Running (bench_core.CountNonzero.time_count_nonzero_axis--).
[ 50.00%] ··· Running (bench_core.CountNonzero.time_count_nonzero_multi_axis--).
[ 50.00%] · For numpy commit 5da4a8e1  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 58.33%] ··· bench_core.CountNonzero.time_count_nonzero                                                                               ok
[ 58.33%] ··· ========= ========= ============= ============= ============= =============
              --                                           dtype                         
              ------------------- -------------------------------------------------------
               numaxes     size        bool          int           str          object   
              ========= ========= ============= ============= ============= =============
                  1        100     2.43±0.02μs   2.62±0.01μs   3.69±0.01μs   3.67±0.01μs 
                  1       10000    3.27±0.02μs   33.6±0.02μs    137±0.1μs     138±0.07μs 
                  1      1000000    80.2±0.3μs   3.19±0.01ms   17.9±0.02ms   14.7±0.04ms 
                  2        100     2.42±0.01μs   2.92±0.01μs   5.03±0.01μs   5.04±0.01μs 
                  2       10000    3.84±0.02μs   64.7±0.09μs     278±3μs      274±0.2μs  
                  2      1000000     166±1μs      6.78±0.1ms    35.5±0.3ms     31.9±1ms  
                  3        100     2.45±0.01μs     3.24±0μs    6.45±0.01μs   6.40±0.01μs 
                  3       10000    4.38±0.01μs   95.9±0.08μs     438±30μs     411±0.7μs  
                  3      1000000     239±10μs     10.3±0.2ms    53.5±0.2ms     50.9±2ms  
              ========= ========= ============= ============= ============= =============

[ 66.67%] ··· bench_core.CountNonzero.time_count_nonzero_axis                                                                        ok
[ 66.67%] ··· ========= ========= ============= ============= ============= =============
              --                                           dtype
              ------------------- -------------------------------------------------------
               numaxes     size        bool          int           str          object
              ========= ========= ============= ============= ============= =============
                  1        100     17.4±0.02μs   19.4±0.05μs    21.7±0.1μs    21.7±0.2μs
                  1       10000     31.0±0.2μs   40.8±0.08μs     156±6μs      240±0.7μs
                  1      1000000     1.23±0ms    2.16±0.01ms   14.4±0.06ms   22.8±0.7ms
                  2        100      17.9±0.1μs   20.0±0.07μs    23.3±0.1μs   24.3±0.08μs
                  2       10000     42.9±0.3μs   63.3±0.07μs     271±2μs       464±3μs
                  2      1000000   2.43±0.01ms   4.53±0.04ms   28.7±0.05ms    45.7±2ms
                  3        100     18.1±0.06μs   20.2±0.02μs    24.6±0.1μs   26.6±0.03μs
                  3       10000     55.4±0.3μs    85.9±0.3μs     398±10μs      688±4μs
                  3      1000000   3.64±0.04ms    7.08±0.1ms    41.9±0.4ms    69.8±5ms
              ========= ========= ============= ============= ============= =============

[ 75.00%] ··· bench_core.CountNonzero.time_count_nonzero_multi_axis                                                                  ok
[ 75.00%] ··· ========= ========= ============= ============= ============= =============
              --                                           dtype
              ------------------- -------------------------------------------------------
               numaxes     size        bool          int           str          object
              ========= ========= ============= ============= ============= =============
                  1        100     18.1±0.02μs   20.0±0.05μs   22.5±0.06μs   22.3±0.03μs
                  1       10000     31.2±0.1μs    42.1±0.1μs     151±1μs       242±2μs
                  1      1000000     1.22±0ms    2.16±0.03ms   14.3±0.05ms    23.1±1ms
                  2        100     18.5±0.09μs   20.4±0.02μs    23.8±0.2μs   24.8±0.07μs
                  2       10000     43.3±0.2μs    63.8±0.4μs     287±10μs      461±2μs
                  2      1000000     2.43±0ms     4.58±0.1ms   28.7±0.08ms    45.9±2ms
                  3        100     18.6±0.04μs   20.6±0.08μs    24.8±0.1μs   27.0±0.05μs
                  3       10000     55.2±0.2μs    84.7±0.2μs     393±10μs      690±4μs
                  3      1000000   3.62±0.03ms   7.03±0.08ms    42.1±0.2ms    69.3±2ms
              ========= ========= ============= ============= ============= =============

[ 75.00%] · For numpy commit 5da4a8e1 (round 2/2):
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 83.33%] ··· bench_core.CountNonzero.time_count_nonzero                                                                             ok
[ 83.33%] ··· ========= ========= ============= ============= ============= =============
              --                                           dtype
              ------------------- -------------------------------------------------------
               numaxes     size        bool          int           str          object
              ========= ========= ============= ============= ============= =============
                  1        100       2.43±0μs    2.62±0.01μs   3.72±0.01μs     3.68±0μs
                  1       10000      3.25±0μs      33.6±0μs     141±0.5μs     138±0.1μs
                  1      1000000    79.8±0.9μs   3.27±0.01ms    17.8±0.2ms   13.6±0.03ms
                  2        100       2.44±0μs      2.92±0μs    5.05±0.01μs   5.02±0.01μs
                  2       10000    3.80±0.01μs   64.8±0.06μs    272±0.5μs     273±0.2μs
                  2      1000000      168±1μs    6.71±0.02ms   35.6±0.09ms   27.2±0.01ms
                  3        100       2.44±0μs      3.23±0μs    6.41±0.01μs   6.38±0.01μs
                  3       10000      4.31±0μs    95.7±0.08μs     414±2μs      409±0.2μs
                  3      1000000     225±0.3μs    10.4±0.1ms    53.5±0.1ms   41.0±0.05ms
              ========= ========= ============= ============= ============= =============

[ 91.67%] ··· bench_core.CountNonzero.time_count_nonzero_axis                                                                        ok
[ 91.67%] ··· ========= ========= ============= ============= ============= =============
              --                                           dtype
              ------------------- -------------------------------------------------------
               numaxes     size        bool          int           str          object
              ========= ========= ============= ============= ============= =============
                  1        100     17.2±0.01μs   19.2±0.01μs   21.7±0.01μs   21.5±0.01μs
                  1       10000    30.9±0.03μs    42.2±0.1μs    151±0.7μs     242±0.2μs
                  1      1000000     1.22±0ms    2.15±0.01ms   14.3±0.04ms   22.3±0.02ms
                  2        100     17.9±0.02μs   20.0±0.01μs   22.8±0.03μs   24.5±0.01μs
                  2       10000    43.4±0.08μs    65.5±0.1μs     288±4μs      470±0.9μs
                  2      1000000     2.43±0ms    4.52±0.02ms    28.5±0.1ms   43.7±0.1ms
                  3        100     18.1±0.02μs   20.3±0.01μs   24.4±0.03μs   26.4±0.04μs
                  3       10000    56.1±0.03μs   88.9±0.07μs     413±9μs       693±1μs
                  3      1000000     3.62±0ms    7.01±0.07ms    42.8±0.1ms   66.8±0.3ms
              ========= ========= ============= ============= ============= =============

[100.00%] ··· bench_core.CountNonzero.time_count_nonzero_multi_axis                                                                  ok
[100.00%] ··· ========= ========= ============= ============= ============= =============
              --                                           dtype
              ------------------- -------------------------------------------------------
               numaxes     size        bool          int           str          object
              ========= ========= ============= ============= ============= =============
                  1        100     17.9±0.01μs   19.9±0.02μs   22.5±0.07μs   22.4±0.01μs
                  1       10000    31.7±0.06μs    42.7±0.1μs    152±0.3μs     245±0.6μs
                  1      1000000     1.22±0ms    2.06±0.02ms   14.3±0.04ms   22.2±0.06ms
                  2        100     18.3±0.05μs   20.4±0.04μs   23.3±0.04μs   24.6±0.05μs
                  2       10000    42.9±0.03μs   62.5±0.06μs     274±1μs       468±1μs
                  2      1000000     2.42±0ms    4.54±0.01ms   28.5±0.02ms   44.6±0.2ms
                  3        100     18.6±0.08μs   20.6±0.07μs   24.7±0.04μs   26.9±0.07μs
                  3       10000    55.3±0.02μs    90.3±0.3μs     406±4μs      693±0.7μs
                  3      1000000   3.62±0.01ms   7.13±0.09ms   42.6±0.04ms   66.5±0.07ms
              ========= ========= ============= ============= ============= =============

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

System Info

              Arm                                                            x86
Hardware      KunPeng
Processor     ARMv8, 2.6GHz, 8 processors                                    Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz
OS            Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64  Windows Server 2008 R2 Enterprise
Compiler      gcc (GCC) 7.3.0                                                MSVC 14.06

@Qiyu8
Member Author

Qiyu8 commented Dec 8, 2020

@seiko2plus When I do npyv_u8 vt = npyv_cmpeq_u8(a, b);, the POWER machine reports the error incompatible types when assigning to type ‘npyv_u8 {aka __vector(16) unsigned char}’ from type ‘__vector __bool char {aka __vector(16) __bool char}’. Do you have any idea why this happens?

@seiko2plus
Member

@Qiyu8, because all comparison operations return a boolean vector data type; NPYV relies on the SIMD extensions
for type safety. So it should be npyv_b8 bt = npyv_cmpeq_u8(a, b); otherwise it will break the build on VSX and AVX512.

@eric-wieser
Member

eric-wieser commented Dec 8, 2020

I think something like this would avoid overflow issues:

/* Count the zero bytes between `*d` and `end`, updating `*d` to point to where to keep counting from. */
static NPY_INLINE npyv_u8
count_zero_bytes_u8(const npy_uint8 **d, const npy_uint8 *end)
{
    const npyv_u8 vone = npyv_setall_u8(1);
    const npyv_u8 vzero = npyv_setall_u8(0);

    npy_intp n = 0;
    npyv_u8 vsum8 = npyv_zero_u8();
    while (*d < end && n <= 0xFE) {
        /* comparisons return a boolean vector (npyv_b8); convert it before masking */
        npyv_b8 vt = npyv_cmpeq_u8(npyv_load_u8(*d), vzero);
        npyv_u8 vmask = npyv_and_u8(npyv_cvt_u8_b8(vt), vone);
        vsum8 = npyv_add_u8(vsum8, vmask);
        *d += npyv_nlanes_u8;
        n++;
    }
    return vsum8;
}

static NPY_INLINE npyv_u16
count_zero_bytes_u16(const npy_uint8 **d, const npy_uint8 *end)
{
    npyv_u16 vsum16 = npyv_zero_u16();
    npy_intp n = 0;
    while (*d < end && n <= 0xFF00) {
        npyv_u8 vsum8 = count_zero_bytes_u8(d, end);
        npyv_u16 part1, part2;
        npyv_expand_u8_u16(vsum8, &part1, &part2);
        vsum16 = npyv_add_u16(vsum16, npyv_add_u16(part1, part2));
        n += 0xFF;
    }
    return vsum16;
}

static NPY_INLINE npyv_u32
count_zero_bytes_u32(const npy_uint8 **d, const npy_uint8 *end)
{
    npyv_u32 vsum32 = npyv_zero_u32();
    npy_intp n = 0;
    while (*d < end && n <= 0xFFFF0000) {
        npyv_u16 vsum16 = count_zero_bytes_u16(d, end);
        npyv_u32 part1, part2;
        npyv_expand_u16_u32(vsum16, &part1, &part2);
        vsum32 = npyv_add_u32(vsum32, npyv_add_u32(part1, part2));
        n += 0xFFFF;
    }
    return vsum32;
}


static NPY_INLINE npy_intp
count_nonzero_bytes(const npy_uint8 *d, npy_uintp unrollx)
{
    npy_intp zero_count = 0;
    const npy_uint8 *end = d + unrollx;
    while (d < end) {
        npyv_u32 vsum32 = count_zero_bytes_u32(&d, end);
        zero_count += npyv_sum_u32(vsum32);
    }
    return unrollx - zero_count;
}
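A scalar Python model may help to see the intent of the widening-accumulation scheme above. This is a sketch of the control flow only, not the NPYV code, and the helper name is illustrative: zero flags are summed in a narrow counter for a bounded number of steps, then folded into a wide accumulator, so no counter can ever overflow.

```python
import numpy as np

def count_zero_bytes(data):
    """Model of the u8 -> wider accumulation scheme (illustrative helper)."""
    buf = np.frombuffer(data, dtype=np.uint8)
    total = 0
    pos = 0
    while pos < len(buf):
        # "u8 stage": a uint8 counter absorbs at most 255 additions of 0/1
        # before it could wrap, so each inner run is capped at 255 bytes.
        chunk = buf[pos:pos + 255]
        vsum8 = np.uint8((chunk == 0).sum())  # guaranteed <= 255, no overflow
        total += int(vsum8)                   # fold into the wide accumulator
        pos += len(chunk)
    return total

print(count_zero_bytes(b"\x00\x01" * 5000))  # -> 5000 zero bytes
```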

numpy/core/src/common/simd/avx2/conversion.h
numpy/core/src/common/simd/avx512/conversion.h
numpy/core/src/common/simd/neon/arithmetic.h
numpy/core/src/multiarray/item_selection.c
@charris charris changed the title Optimize the performance of count_nonzero by using universal intrinsics MAINT: Optimize the performance of count_nonzero by using universal intrinsics Dec 8, 2020
@Qiyu8
Member Author

Qiyu8 commented Dec 9, 2020

@eric-wieser The overflow-preventing code you presented looks more elegant; I will try it and post a benchmark result. Thanks.

@Qiyu8
Member Author

Qiyu8 commented Dec 9, 2020

@eric-wieser Here is the benchmark result of the modular code.

       before           after         ratio
     [7a505741]       [dfbb3f76]
     <master>         <countnz>
-         133±3μs         89.9±5μs     0.67  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'bool'>)
-         201±8μs          135±3μs     0.67  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'bool'>)
-        64.5±2μs         43.1±1μs     0.67  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

The performance remains good, so I will adopt that code.

@Qiyu8
Member Author

Qiyu8 commented Dec 9, 2020

_mm512_cvtepu8_epi16 is only available with AVX512BW. @seiko2plus, any suggestions for implementing the expansion on other targets, such as plain AVX512F?

@Qiyu8
Member Author

Qiyu8 commented Dec 14, 2020

The CI failure (same as in #17102) seems to be "You are in 'detached HEAD' state."; maybe syncing with master will solve this problem.

numpy/core/src/common/simd/avx2/arithmetic.h
numpy/core/tests/test_simd.py
@Qiyu8
Member Author

Qiyu8 commented Dec 15, 2020

A test case for the new intrinsics has been added, so the codecov/patch result is inaccurate here.

@seiko2plus
Member

@Qiyu8, codecov builders sometimes run on x86 machines with no avx512 support

@eric-wieser eric-wieser left a comment

I think the loop bounds become clearer if you write the full subtraction now.

    npy_intp lane_max = 0;
    npyv_u8 vsum8 = npyv_zero_u8();
    while (*d < end && lane_max <= 0xFE) {
Member

Suggested change
while (*d < end && lane_max <= 0xFE) {
while (*d < end && lane_max <= 0xFF - 1) {

Member

Why not < 0xFF? Is <= faster?

Member

Because the intuition is that the counter should reach at most 0xFF, and is incremented by one each loop. Writing it this way makes it generalize well to the u16 and u32 cases. The compiler shouldn't care, it's for the reader.
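A quick scalar check of that bound (illustrative only): a uint8 counter incremented once per iteration is safe for exactly 0xFF iterations, which is what the `<= 0xFF - 1` condition guarantees.

```python
import numpy as np

# The loop body runs while n <= 0xFF - 1, i.e. for n = 0 .. 254:
# 255 increments total, ending exactly at 255 with no wraparound.
lane = np.uint8(0)
n = 0
while n <= 0xFF - 1:
    lane = np.uint8(int(lane) + 1)  # never exceeds 255, so no overflow
    n += 1
print(int(lane))  # -> 255
```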

numpy/core/src/multiarray/item_selection.c Outdated Show resolved Hide resolved
numpy/core/src/multiarray/item_selection.c Outdated Show resolved Hide resolved
Comment on lines 2153 to 2165
static NPY_INLINE NPY_GCC_OPT_3 npyv_u16
count_zero_bytes_u16(const npy_uint8 **d, const npy_uint8 *end, npy_uint16 max_count)
{
    npyv_u16 vsum16 = npyv_zero_u16();
    npy_intp lane_max = 0;
    while (*d < end && lane_max <= max_count - 2*NPY_MAX_UINT8) {
        npyv_u8 vsum8 = count_zero_bytes_u8(d, end, NPY_MAX_UINT8);
        npyv_u16x2 part = npyv_expand_u16_u8(vsum8);
        vsum16 = npyv_add_u16(vsum16, npyv_add_u16(part.val[0], part.val[1]));
        lane_max += 2*NPY_MAX_UINT8;
    }
    return vsum16;
}
Member

Suggested change
static NPY_INLINE NPY_GCC_OPT_3 npyv_u16
count_zero_bytes_u16(const npy_uint8 **d, const npy_uint8 *end, npy_uint16 max_count)
{
    npyv_u16 vsum16 = npyv_zero_u16();
    npy_intp lane_max = 0;
    while (*d < end && lane_max <= max_count - 2*NPY_MAX_UINT8) {
        npyv_u8 vsum8 = count_zero_bytes_u8(d, end, NPY_MAX_UINT8);
        npyv_u16x2 part = npyv_expand_u16_u8(vsum8);
        vsum16 = npyv_add_u16(vsum16, npyv_add_u16(part.val[0], part.val[1]));
        lane_max += 2*NPY_MAX_UINT8;
    }
    return vsum16;
}
static NPY_INLINE NPY_GCC_OPT_3 npyv_u16x2
count_zero_bytes_u16(const npy_uint8 **d, const npy_uint8 *end, npy_uint16 max_count)
{
    npyv_u16x2 vsum16;
    vsum16.val[0] = vsum16.val[1] = npyv_zero_u16();
    npy_intp lane_max = 0;
    while (*d < end && lane_max <= max_count - NPY_MAX_UINT8) {
        npyv_u8 vsum8 = count_zero_bytes_u8(d, end, NPY_MAX_UINT8);
        npyv_u16x2 part = npyv_expand_u16_u8(vsum8);
        vsum16.val[0] = npyv_add_u16(vsum16.val[0], part.val[0]);
        vsum16.val[1] = npyv_add_u16(vsum16.val[1], part.val[1]);
        lane_max += NPY_MAX_UINT8;
    }
    return vsum16;
}

This doubles the number of iterations.

Comment on lines 2167 to 2179
static NPY_INLINE NPY_GCC_OPT_3 npyv_u32
count_zero_bytes_u32(const npy_uint8 **d, const npy_uint8 *end, npy_uint32 max_count)
{
npyv_u32 vsum32 = npyv_zero_u32();
npy_intp lane_max = 0;
while (*d < end && lane_max <= max_count - 2*NPY_MAX_UINT16) {
npyv_u16 vsum16 = count_zero_bytes_u16(d, end, NPY_MAX_UINT16);
npyv_u32x2 part = npyv_expand_u32_u16(vsum16);
vsum32 = npyv_add_u32(vsum32, npyv_add_u32(part.val[0], part.val[1]));
lane_max += 2*NPY_MAX_UINT16;
}
return vsum32;
}
Member
Suggested change: remove this function entirely.

I think there's no need for an extra block level; two are enough, and the previous suggestion already doubled the iterations at the u16 level.

Comment on lines 2201 to 2202
npyv_u32 vsum32 = count_zero_bytes_u32(&d, end, NPY_MAX_UINT32 / npyv_nlanes_u32);
zero_count += npyv_sum_u32(vsum32);
Member
Suggested change (replace the two lines above with):

    npyv_u16x2 vsum16 = count_zero_bytes_u16(&d, end, NPY_MAX_UINT16);
    npyv_u32x2 sum_32_0 = npyv_expand_u32_u16(vsum16.val[0]);
    npyv_u32x2 sum_32_1 = npyv_expand_u32_u16(vsum16.val[1]);
    zero_count += npyv_sum_u32(npyv_add_u32(
        npyv_add_u32(sum_32_0.val[0], sum_32_0.val[1]),
        npyv_add_u32(sum_32_1.val[0], sum_32_1.val[1])
    ));

EDIT: only one sum is needed

Member Author (@Qiyu8, Dec 21, 2020)
(screenshot of code omitted)

Please correct me if I am wrong: the new solution only handles (2**16-1)*16 elements in one loop, while the previous one handled 2**32-1 elements. The x2 iterations are good, but I think count_zero_bytes_u32 should not be removed, from an efficiency perspective. Furthermore, the npyv_expand_u32_u16 operation is now used just to prevent summation overflow, whereas in the previous solution expansion was used to count more elements, which is the key task here.

Member
but I think that count_zero_bytes_u32 should not be removed from the perspective of efficiency.

It wouldn't affect performance; the idea behind the u16 level is to reduce the operations of summing the vector.
If you want to improve performance further, try unrolling the u8 level by x4.

the npyv_expand_u32_u16 operation is used just to prevent summation overflow

Sorry, I wrote the suggestion in a hurry; I've edited it. Only one npyv_sum_u32() is used.

Member
but in previous solution expansion is used to count more elements, which is the key task here.

Which one do you refer to? I'm getting confused.

Member Author
I prefer to keep count_zero_bytes_u32 as implemented. (screenshot of code omitted)

Member
I think there's a misunderstanding here. What I understand from the code:

  • count_nonzero_bytes_u8() to reduce calling npyv_expand_u16_u8() and npyv_sum_u32()
  • count_nonzero_bytes_u16() to reduce calling npyv_expand_u32_u16(), npyv_sum_u32()
  • count_nonzero_bytes_u32() is another level to reduce calls of npyv_sum_u32(), not because it "needs to avoid overflow"

Here is an example of the SIMD loop without any block level, so you can see what I understand from your code:

static NPY_INLINE NPY_GCC_OPT_3 npy_intp
count_nonzero_bytes(const npy_uint8 *d, npy_uintp unrollx)
{
    npy_intp zero_count = 0;
    for (; unrollx > 0; unrollx -= npyv_nlanes_u8, d += npyv_nlanes_u8) {
        // compare against zero, then mask the result down to 0/1 per lane
        npyv_u8 cmp = npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(d), npyv_zero_u8()));
        npyv_u8 one = npyv_and_u8(cmp, npyv_setall_u8(1));

        npyv_u16x2 one_u16 = npyv_expand_u16_u8(one);
        npyv_u32x2 one_u32 = npyv_expand_u32_u16(npyv_add_u16(one_u16.val[0], one_u16.val[1]));
        zero_count += npyv_sum_u32(npyv_add_u32(one_u32.val[0], one_u32.val[1]));
    }
    return zero_count;
}

Now, am I missing something?

Member Author (@Qiyu8, Dec 21, 2020)
The non-block-level code is fine. Basically there are four loops:

  1. count_zero_bytes_u8: get the maximum count that u8 type can hold.
  2. count_zero_bytes_u16: get the maximum count that u16 type can hold.
  3. count_zero_bytes_u32: get the maximum count of a vector whose sum does not exceed a u32.
  4. count_zero_bytes: get the maximum count that u64 type(AKA npy_uintp) can hold.

What I don't fully understand is why your suggested code removes step 3.
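The nested-widening idea behind those four loops can be sketched in scalar C (hypothetical helper names; one plain counter stands in for each vector lane): each level drains the narrower counter into a wider one just before the narrow one could overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the nested-widening scheme (not the actual SIMD code).
 * An 8-bit counter is drained into a 16-bit one every 0xFF increments,
 * and the 16-bit one is handed back to the caller before it can wrap. */
static uint16_t drain_u8_into_u16(const uint8_t **d, const uint8_t *end)
{
    uint16_t sum16 = 0;
    uint8_t  sum8  = 0;
    while (*d < end) {
        sum8 += (*(*d)++ == 0);
        if (sum8 == 0xFF) {                  /* u8 about to saturate: widen */
            sum16 += sum8;
            sum8 = 0;
            if (sum16 > 0xFFFF - 0xFF)       /* u16 nearly full: hand back */
                break;
        }
    }
    return (uint16_t)(sum16 + sum8);
}

/* Outer loop: the u16 totals accumulate into a full-width counter. */
static size_t count_zero_bytes_model(const uint8_t *d, size_t len)
{
    const uint8_t *end = d + len;
    size_t total = 0;
    while (d < end)
        total += drain_u8_into_u16(&d, end);
    return total;
}
```

In this model the wide additions (the analogue of expand plus reduce-sum) happen once per 0xFF bytes instead of once per byte, which is the whole point of the inner levels.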

Member
@Qiyu8, the step 3 you describe isn't quite what the (correct) code you've implemented does - the u32 version fills a vector with values whose sum does not exceed a u32.

Member
The non-block-level code is fine, basically there has four loops:

In the previous example, I was trying to explain the motive behind creating nested loops.
The original OpenCV code was trying to reduce the calls of expensive intrinsics
as much as possible and increase the cheaper ones.

cheap intrinsics:

  • comparison. NOTE: "equal" is used here since almost all archs lack native support for "not equal".
  • integer addition.
  • bitwise

Most archs can execute multiple instructions per clock cycle for the
above operations, but on the other hand
expand and reduce-sum (the worst of them) take more latency and throughput.

The downside of nested loops is integer overflow, which forces
an iteration limit on the inner loops. But loops involve jumps,
and jumps may flush the pipeline, so you
should make sure the inner loop saves more cycles than the flushing costs.

what I don't fully understand is that your suggested code removes step 3.

Because I think there's no performance gain from it; count_zero_bytes_u16 and count_zero_bytes_u8 already reduce the calls of "reduce sum" enough.
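The cmpeq-over-cmpneq trade-off discussed above reduces to a simple pattern, sketched here in scalar C (a model, not the actual SIMD kernel): count zeros with an equality test, then recover the nonzero count by a single subtraction at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Why count zeros with "equal" instead of nonzeros with "not equal":
 * most archs only provide a native cmpeq, so the hot loop compares
 * against zero and the nonzero count is recovered by subtraction. */
static size_t count_nonzero_model(const uint8_t *d, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++)
        zeros += (d[i] == 0);       /* cheap cmpeq-style test per byte */
    return len - zeros;             /* nonzeros recovered once, at the end */
}
```

The subtraction costs one operation total, while a per-element "not equal" would cost extra work on every iteration on archs without a native cmpneq.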

Member Author
The performance looks fine after removing the inner loop.

       before           after         ratio
     [d7a75e8e]       [c5daaf06]
     <master>         <countnz>
-      6.47±0.1ms       5.48±0.2ms     0.85  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'int'>)
-     3.27±0.08ms      2.69±0.08ms     0.82  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'int'>)
-         101±3μs         82.6±2μs     0.82  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'int'>)
-      10.1±0.3ms       8.07±0.1ms     0.80  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'int'>)
-         127±3μs         92.8±4μs     0.73  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'bool'>)
-        215±10μs          142±3μs     0.66  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'bool'>)
-        64.6±1μs         41.7±2μs     0.65  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

npy_intp lane_max = 0;
npyv_u8 vsum8 = npyv_zero_u8();
while (*d < end && lane_max <= max_count - 1) {
npyv_u8 vt = npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d), vzero));
Member
Suggested change (add a comment):

+    // we count zeros because `cmpeq` is cheaper than `cmpneq` on most archs
     npyv_u8 vt = npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d), vzero));

Member Author
comment added.

@seiko2plus (Member) left a comment
LGTM, Thank you

@mattip (Member) left a comment
Just a few nits about docstrings to tie the tests back to the macros, to help future reviewers. Can be done in a follow-up PR if needed.

@@ -663,6 +663,21 @@ def test_conversion_boolean(self):
true_vsfx = from_boolean(true_vb)
assert false_vsfx != true_vsfx

def test_conversion_expand(self):
Member
Suggested change (add a docstring):

     def test_conversion_expand(self):
+        """Test npyv_expand_u16_u8, npyv_expand_u32_u16"""

Member Author
docstring added.

@@ -707,7 +722,7 @@ def test_arithmetic_div(self):
assert div == data_div

def test_arithmetic_reduce_sum(self):
if not self._is_fp():
Member
Could you add an appropriate docstring to indicate which npyv_* macros this tests?

Member Author
docstring added.

@mattip mattip merged commit 85df388 into numpy:master Dec 23, 2020
mattip commented Dec 23, 2020

Thanks @Qiyu8
