MAINT: Optimize the performance of count_nonzero by using universal intrinsics #17958
Conversation
@seiko2plus when I do the …
@Qiyu8, because all comparison operations return the boolean vector data types, NPYV counts on SIMD extensions …
I think something like this would avoid overflow issues:

/* Count the zero bytes between `*d` and `end`, updating `*d` to point to where to keep counting from. */
static NPY_INLINE npyv_u8
count_zero_bytes_u8(const npy_uint8 **d, const npy_uint8 *end)
{
    const npyv_u8 vone  = npyv_setall_u8(1);
    const npyv_u8 vzero = npyv_setall_u8(0);

    npy_intp n = 0;
    npyv_u8 vsum8 = npyv_zero_u8();
    while (*d < end && n <= 0xFE) {
        // convert the boolean mask to u8 lanes, then mask down to 0/1
        npyv_u8 vt = npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d), vzero));
        vt = npyv_and_u8(vt, vone);
        vsum8 = npyv_add_u8(vsum8, vt);
        *d += npyv_nlanes_u8;
        n++;
    }
    return vsum8;
}

static NPY_INLINE npyv_u16
count_zero_bytes_u16(const npy_uint8 **d, const npy_uint8 *end)
{
    npyv_u16 vsum16 = npyv_zero_u16();
    npy_intp n = 0;
    while (*d < end && n <= 0xFF00) {
        npyv_u8 vsum8 = count_zero_bytes_u8(d, end);
        npyv_u16 part1, part2;
        npyv_expand_u8_u16(vsum8, &part1, &part2);
        vsum16 = npyv_add_u16(vsum16, npyv_add_u16(part1, part2));
        n += 0xFF;
    }
    return vsum16;
}

static NPY_INLINE npyv_u32
count_zero_bytes_u32(const npy_uint8 **d, const npy_uint8 *end)
{
    npyv_u32 vsum32 = npyv_zero_u32();
    npy_intp n = 0;
    while (*d < end && n <= 0xFFFF0000) {
        npyv_u16 vsum16 = count_zero_bytes_u16(d, end);
        npyv_u32 part1, part2;
        npyv_expand_u16_u32(vsum16, &part1, &part2);
        vsum32 = npyv_add_u32(vsum32, npyv_add_u32(part1, part2));
        n += 0xFFFF;
    }
    return vsum32;
}

static NPY_INLINE npy_intp
count_nonzero_bytes(const npy_uint8 *d, npy_uintp unrollx)
{
    npy_intp zero_count = 0;
    const npy_uint8 *end = d + unrollx;
    while (d < end) {
        npyv_u32 vsum32 = count_zero_bytes_u32(&d, end);
        zero_count += npyv_sum_u32(vsum32);
    }
    return unrollx - zero_count;
}
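For context, here is one way such a helper could be driven from a scalar entry point, with a scalar tail loop for the remainder. This is a minimal sketch under my own assumptions; the wrapper name and structure are not from this PR:

static npy_intp
count_boolean_nonzero(const npy_uint8 *data, npy_uintp len)
{
    /* SIMD-able prefix: the largest multiple of the vector width */
    npy_uintp unrollx = len - (len % npyv_nlanes_u8);
    npy_intp count = count_nonzero_bytes(data, unrollx);
    /* scalar tail loop for the remaining 0..npyv_nlanes_u8-1 bytes */
    for (npy_uintp i = unrollx; i < len; i++) {
        count += (data[i] != 0);
    }
    return count;
}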
@eric-wieser The overflow-preventing code you presented looks more elegant; I will try it and post a benchmark result. Thanks.
@eric-wieser Here is the benchmark result of the modular code.
The performance remains good, so I will take that code.
The CI failure (same as in #17102) seems to be "You are in 'detached HEAD' state."; maybe syncing with master will solve this problem.
The test cases for the new intrinsics have been added, so the …
@Qiyu8, codecov builders sometimes run on x86 machines with no AVX512 support.
I think the loop bounds become clearer if you write the full subtraction now.
    npy_intp lane_max = 0;
    npyv_u8 vsum8 = npyv_zero_u8();
    while (*d < end && lane_max <= 0xFE) {
Suggested change:
-    while (*d < end && lane_max <= 0xFE) {
+    while (*d < end && lane_max <= 0xFF - 1) {
Why not < 0xFF? Is <= faster?
Because the intuition is that the counter should reach at most 0xFF, and it is incremented by one each loop. Writing it this way makes it generalize well to the u16 and u32 cases. The compiler shouldn't care; it's for the reader.
static NPY_INLINE NPY_GCC_OPT_3 npyv_u16
count_zero_bytes_u16(const npy_uint8 **d, const npy_uint8 *end, npy_uint16 max_count)
{
    npyv_u16 vsum16 = npyv_zero_u16();
    npy_intp lane_max = 0;
    while (*d < end && lane_max <= max_count - 2*NPY_MAX_UINT8) {
        npyv_u8 vsum8 = count_zero_bytes_u8(d, end, NPY_MAX_UINT8);
        npyv_u16x2 part = npyv_expand_u16_u8(vsum8);
        vsum16 = npyv_add_u16(vsum16, npyv_add_u16(part.val[0], part.val[1]));
        lane_max += 2*NPY_MAX_UINT8;
    }
    return vsum16;
}
Suggested change:
-static NPY_INLINE NPY_GCC_OPT_3 npyv_u16
-count_zero_bytes_u16(const npy_uint8 **d, const npy_uint8 *end, npy_uint16 max_count)
-{
-    npyv_u16 vsum16 = npyv_zero_u16();
-    npy_intp lane_max = 0;
-    while (*d < end && lane_max <= max_count - 2*NPY_MAX_UINT8) {
-        npyv_u8 vsum8 = count_zero_bytes_u8(d, end, NPY_MAX_UINT8);
-        npyv_u16x2 part = npyv_expand_u16_u8(vsum8);
-        vsum16 = npyv_add_u16(vsum16, npyv_add_u16(part.val[0], part.val[1]));
-        lane_max += 2*NPY_MAX_UINT8;
-    }
-    return vsum16;
-}
+static NPY_INLINE NPY_GCC_OPT_3 npyv_u16x2
+count_zero_bytes_u16(const npy_uint8 **d, const npy_uint8 *end, npy_uint16 max_count)
+{
+    npyv_u16x2 vsum16;
+    vsum16.val[0] = vsum16.val[1] = npyv_zero_u16();
+    npy_intp lane_max = 0;
+    while (*d < end && lane_max <= max_count - NPY_MAX_UINT8) {
+        npyv_u8 vsum8 = count_zero_bytes_u8(d, end, NPY_MAX_UINT8);
+        npyv_u16x2 part = npyv_expand_u16_u8(vsum8);
+        vsum16.val[0] = npyv_add_u16(vsum16.val[0], part.val[0]);
+        vsum16.val[1] = npyv_add_u16(vsum16.val[1], part.val[1]);
+        lane_max += NPY_MAX_UINT8;
+    }
+    return vsum16;
+}
This also increases the allowed inner iterations by x2, since the two u16 halves are accumulated separately (note the loop bound drops from 2*NPY_MAX_UINT8 to NPY_MAX_UINT8 per step).
static NPY_INLINE NPY_GCC_OPT_3 npyv_u32
count_zero_bytes_u32(const npy_uint8 **d, const npy_uint8 *end, npy_uint32 max_count)
{
    npyv_u32 vsum32 = npyv_zero_u32();
    npy_intp lane_max = 0;
    while (*d < end && lane_max <= max_count - 2*NPY_MAX_UINT16) {
        npyv_u16 vsum16 = count_zero_bytes_u16(d, end, NPY_MAX_UINT16);
        npyv_u32x2 part = npyv_expand_u32_u16(vsum16);
        vsum32 = npyv_add_u32(vsum32, npyv_add_u32(part.val[0], part.val[1]));
        lane_max += 2*NPY_MAX_UINT16;
    }
    return vsum32;
}
Suggested change (remove this level entirely):
-static NPY_INLINE NPY_GCC_OPT_3 npyv_u32
-count_zero_bytes_u32(const npy_uint8 **d, const npy_uint8 *end, npy_uint32 max_count)
-{
-    npyv_u32 vsum32 = npyv_zero_u32();
-    npy_intp lane_max = 0;
-    while (*d < end && lane_max <= max_count - 2*NPY_MAX_UINT16) {
-        npyv_u16 vsum16 = count_zero_bytes_u16(d, end, NPY_MAX_UINT16);
-        npyv_u32x2 part = npyv_expand_u32_u16(vsum16);
-        vsum32 = npyv_add_u32(vsum32, npyv_add_u32(part.val[0], part.val[1]));
-        lane_max += 2*NPY_MAX_UINT16;
-    }
-    return vsum32;
-}
I think there's no need for an extra block level; two are enough, plus the previous suggestion increased the iterations at the u16 level.
    npyv_u32 vsum32 = count_zero_bytes_u32(&d, end, NPY_MAX_UINT32 / npyv_nlanes_u32);
    zero_count += npyv_sum_u32(vsum32);
Suggested change:
-    npyv_u32 vsum32 = count_zero_bytes_u32(&d, end, NPY_MAX_UINT32 / npyv_nlanes_u32);
-    zero_count += npyv_sum_u32(vsum32);
+    npyv_u16x2 vsum16 = count_zero_bytes_u16(&d, end, NPY_MAX_UINT16);
+    npyv_u32x2 sum_32_0 = npyv_expand_u32_u16(vsum16.val[0]);
+    npyv_u32x2 sum_32_1 = npyv_expand_u32_u16(vsum16.val[1]);
+    zero_count += npyv_sum_u32(npyv_add_u32(
+        npyv_add_u32(sum_32_0.val[0], sum_32_0.val[1]),
+        npyv_add_u32(sum_32_1.val[0], sum_32_1.val[1])
+    ));
EDIT: only one sum is needed
Please correct me if I am wrong: the new solution only handles (2**16-1)*16 elements in one loop, while the previous one handles 2**32-1 elements. The iterations x2 change is good, but I think that count_zero_bytes_u32 should not be removed from the perspective of efficiency. Furthermore, the npyv_expand_u32_u16 operation is used just to prevent summation overflow, but in the previous solution expansion is used to count more elements, which is the key task here.
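For concreteness, a quick back-of-the-envelope check of those two capacities. This is my own arithmetic sketch, assuming 128-bit vectors (16 u8 lanes); it is not PR code:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t nlanes_u8 = 16;  /* assumed vector width: 128 bits */
    /* u16 scheme: 16 widened lanes, each counting up to 2**16 - 1 zeros
     * before a reduce-sum is forced: (2**16 - 1) * 16 bytes per flush. */
    const uint64_t cap_u16 = (uint64_t)0xFFFF * nlanes_u8;  /* 1,048,560 */
    /* u32 scheme: the whole vector's sum may approach 2**32 - 1 bytes
     * before a reduce-sum is forced. */
    const uint64_t cap_u32 = 0xFFFFFFFFull;                 /* ~4.29e9 */
    printf("u16 scheme: %llu bytes per reduce-sum\n",
           (unsigned long long)cap_u16);
    printf("u32 scheme: %llu bytes per reduce-sum\n",
           (unsigned long long)cap_u32);
    return 0;
}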
> but I think that count_zero_bytes_u32 should not be removed from the perspective of efficiency.

It wouldn't affect performance; the idea behind the u16 level is to reduce the operations of summing the vector. If you want to improve the performance more, try to unroll the u8 level by x4.

> the npyv_expand_u32_u16 operation is used just to prevent summation overflow

Sorry, I wrote the suggestion in a hurry; I edited it. Only one npyv_sum_u32() is used.
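For reference, here is a rough sketch of what an x4 unroll of the u8 level could look like. It is illustrative only: the function name, parameter type, and bounds are my assumptions, reusing the npyv_* helpers from this PR, and it assumes the buffer holds at least four vectors:

// Hypothetical x4 unroll of the u8 loop -- a sketch, not code from this PR.
// Assumes the caller guarantees `end - *d` is a multiple of npyv_nlanes_u8.
static NPY_INLINE npyv_u8
count_zero_bytes_u8_x4(const npy_uint8 **d, const npy_uint8 *end, npy_intp max_count)
{
    const npyv_u8 vone  = npyv_setall_u8(1);
    const npyv_u8 vzero = npyv_zero_u8();
    npy_intp lane_max = 0;
    npyv_u8 vsum8 = npyv_zero_u8();
    while (*d <= end - 4*npyv_nlanes_u8 && lane_max <= max_count - 4) {
        // four independent compare/mask/add chains per iteration:
        // all cheap ops, so superscalar cores can overlap them
        npyv_u8 t0 = npyv_and_u8(npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d + 0*npyv_nlanes_u8), vzero)), vone);
        npyv_u8 t1 = npyv_and_u8(npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d + 1*npyv_nlanes_u8), vzero)), vone);
        npyv_u8 t2 = npyv_and_u8(npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d + 2*npyv_nlanes_u8), vzero)), vone);
        npyv_u8 t3 = npyv_and_u8(npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d + 3*npyv_nlanes_u8), vzero)), vone);
        vsum8 = npyv_add_u8(vsum8, npyv_add_u8(npyv_add_u8(t0, t1), npyv_add_u8(t2, t3)));
        *d += 4*npyv_nlanes_u8;
        lane_max += 4;  // each lane gained at most 4 this iteration
    }
    return vsum8;
}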
> but in previous solution expansion is used to count more elements, which is the key task here.

Which one do you refer to? I'm getting confused.
I think there's a misunderstanding here. What I can understand from the code:

- count_nonzero_bytes_u8() to reduce calling npyv_expand_u16_u8() and npyv_sum_u32()
- count_nonzero_bytes_u16() to reduce calling npyv_expand_u32_u16() and npyv_sum_u32()
- count_nonzero_bytes_u32() is another level to reduce calling npyv_sum_u32()

not because it "needs to avoid overflow".
Here is an example of the SIMD loop without any block level, so you can see what I understand from your code:

static NPY_INLINE NPY_GCC_OPT_3 npy_intp
count_nonzero_bytes(const npy_uint8 *d, npy_uintp unrollx)
{
    npy_intp zero_count = 0;
    for (; unrollx > 0; unrollx -= npyv_nlanes_u8, d += npyv_nlanes_u8) {
        npyv_u8 cmp = npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(d), npyv_zero_u8()));
        npyv_u8 one = npyv_and_u8(cmp, npyv_setall_u8(1));
        npyv_u16x2 one_u16 = npyv_expand_u16_u8(one);
        npyv_u32x2 one_u32 = npyv_expand_u32_u16(npyv_add_u16(one_u16.val[0], one_u16.val[1]));
        zero_count += npyv_sum_u32(npyv_add_u32(one_u32.val[0], one_u32.val[1]));
    }
    return zero_count;
}

Now, am I missing something?
The non-block-level code is fine; basically there are four loops:

1. count_zero_bytes_u8: get the maximum count that the u8 type can hold.
2. count_zero_bytes_u16: get the maximum count that the u16 type can hold.
3. count_zero_bytes_u32: get the maximum count of a vector whose sum does not exceed a u32.
4. count_zero_bytes: get the maximum count that the u64 type (AKA npy_uintp) can hold.

What I don't fully understand is that your suggested code removes step 3.
@Qiyu8, the step 3 you describe there isn't quite the (correct) code you've implemented - the u32 version fills a vector with values whose sum does not exceed a u32.
> The non-block-level code is fine; basically there are four loops:

In the previous example, I was trying to explain the motive behind creating nested loops. The original OpenCV code was trying to reduce the calls of expensive intrinsics as much as possible and increase the cheaper ones.

Cheap intrinsics:

- comparison. NOTE: "equal" is used here since almost all archs don't have native support for "not equal".
- integer addition.
- bitwise operations.

Most archs can execute multiple instructions per clock cycle for the above operations; on the other hand, expand and reduce-sum (the worst) take more latency and throughput.

The negative side of nested loops is "integer overflow", which forces an iteration limit on the inner loops. But loops involve jmps, and jmps may lead to flushing the pipeline, so you should be aware that the inner loop should save more cycles than what the flushing can cost.

> what I don't fully understand is that your suggested code removes step 3.

Because I think there's no performance gain from it; count_zero_bytes_u16 and count_zero_bytes_u8 already reduce enough calls of "reduce sum".
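To make that overflow limit concrete, here is a minimal scalar model of one u8 lane (my illustration, not PR code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Worst case: every byte is zero, so the lane gains 1 per iteration
     * and would wrap after 255 iterations -- hence the 0xFF trip-count
     * cap on the u8 level before the partial sums are widened to u16. */
    uint8_t lane = 0;
    int iterations;
    for (iterations = 0; iterations < 0xFF; iterations++) {
        lane += 1;
    }
    printf("lane = %u after %d iterations (one more would wrap to 0)\n",
           lane, iterations);
    return 0;
}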
The performance looks fine after removing the inner loop.
before after ratio
[d7a75e8e] [c5daaf06]
<master> <countnz>
- 6.47±0.1ms 5.48±0.2ms 0.85 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'int'>)
- 3.27±0.08ms 2.69±0.08ms 0.82 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'int'>)
- 101±3μs 82.6±2μs 0.82 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'int'>)
- 10.1±0.3ms 8.07±0.1ms 0.80 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'int'>)
- 127±3μs 92.8±4μs 0.73 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'bool'>)
- 215±10μs 142±3μs 0.66 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'bool'>)
- 64.6±1μs 41.7±2μs 0.65 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
    npy_intp lane_max = 0;
    npyv_u8 vsum8 = npyv_zero_u8();
    while (*d < end && lane_max <= max_count - 1) {
        npyv_u8 vt = npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d), vzero));
Suggested change:
-    npyv_u8 vt = npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d), vzero));
+    // we count zeros because `cmpeq` is cheaper than `cmpneq` for most archs
+    npyv_u8 vt = npyv_cvt_u8_b8(npyv_cmpeq_u8(npyv_load_u8(*d), vzero));
comment added.
LGTM, Thank you
Just a few nits about docstrings to tie the tests back to the macros, to help future reviewers. Can be done in a follow-up PR if needed.
@@ -663,6 +663,21 @@ def test_conversion_boolean(self):
        true_vsfx = from_boolean(true_vb)
        assert false_vsfx != true_vsfx

    def test_conversion_expand(self):
Suggested change:
     def test_conversion_expand(self):
+        """Test npyv_expand_u16_u8, npyv_expand_u32_u16"""
docstring added.
@@ -707,7 +722,7 @@ def test_arithmetic_div(self):
        assert div == data_div

    def test_arithmetic_reduce_sum(self):
        if not self._is_fp():
Could you add an appropriate docstring to indicate which npyv_* macros this tests?
docstring added.
Thanks @Qiyu8
Introduction
np.count_nonzero is a common operation in database, information-retrieval, cryptographic, and machine-learning applications. It's reported that the equivalent OpenCV function, which uses the universal intrinsics technique, is nearly 25x faster than NumPy's. The algorithm there is easy to migrate into the current USIMD framework after some investigation. Performance increased by 35% with the AVX2 instruction set.
Benchmark
Here is the ASV benchmark result.
AVX2 enabled
SSE2 enabled
NEON enabled
System Info