SIMD: Optimize the performance of einsum's submodule multiply by using universal intrinsics #17782

Qiyu8 · 2020-11-16T03:16:13Z

This is the third part of #17049. It has no impact on the x86 platform and improves performance by about 9% to 16% on ARM. Here are the benchmark results:

X86(SSE2 enabled) BENCHMARKS NOT SIGNIFICANTLY CHANGED.


· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building 97ba579b  for virtualenv-py3.7-Cython
·· Installing 97ba579b  into virtualenv-py3.7-Cython
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 360ba057  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit 97ba579b  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit 97ba579b  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig               ok
[ 51.79%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   151±9μs
               numpy.float64   249±6μs
              =============== =========
[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0           ok
[ 53.57%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   248±5μs
               numpy.float64   455±4μs
              =============== =========
[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul                         ok
[ 55.36%] ··· =============== =============
                   dtype
              --------------- -------------
               numpy.float32   1.38±0.03ms
               numpy.float64    2.85±0.1ms
              =============== =============
[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply                    ok
[ 57.14%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   75.0±4μs
               numpy.float64   84.3±3μs
              =============== ==========
[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig        ok
[ 58.93%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   44.1±1μs
               numpy.float64   49.1±2μs
              =============== ==========
[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0    ok
[ 60.71%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32    35.0±1μs
               numpy.float64   35.4±0.7μs
              =============== ============
[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul                  ok
[ 62.50%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   42.3±0.8μs
               numpy.float64    43.6±1μs
              =============== ============
[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply             ok
[ 64.29%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   70.2±2μs
               numpy.float64   95.9±2μs
              =============== ==========
[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer                ok
[ 66.07%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   11.3±0.2ms
               numpy.float64   22.5±0.3ms
              =============== ============
[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul              ok
[ 67.86%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   70.7±2μs
               numpy.float64   81.3±3μs
              =============== ==========
[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2             ok
[ 69.64%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   67.7±2μs
               numpy.float64   86.3±4μs
              =============== ==========
[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer                       ok
[ 71.43%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   25.7±0.5ms
               numpy.float64   50.4±0.2ms
              =============== ============
[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul                     ok
[ 73.21%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   60.2±1μs
               numpy.float64   66.3±3μs
              =============== ==========
[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2                    ok
[ 75.00%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   61.7±6μs
               numpy.float64   68.0±3μs
              =============== ==========
[ 75.00%] · For numpy commit 360ba057  (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig               ok
[ 76.79%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   150±20μs
               numpy.float64   249±4μs
              =============== ==========
[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0           ok
[ 78.57%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   262±8μs
               numpy.float64   461±20μs
              =============== ==========
[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul                         ok
[ 80.36%] ··· =============== =============
                   dtype
              --------------- -------------
               numpy.float32   1.33±0.01ms
               numpy.float64   2.72±0.07ms
              =============== =============
[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply                    ok
[ 82.14%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   73.0±2μs
               numpy.float64   83.1±4μs
              =============== ==========
[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig        ok
[ 83.93%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32    45.4±4μs
               numpy.float64   47.5±0.8μs
              =============== ============
[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0    ok
[ 85.71%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   35.2±2μs
               numpy.float64   38.2±2μs
              =============== ==========
[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul                  ok
[ 87.50%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   41.8±0.9μs
               numpy.float64    45.0±3μs
              =============== ============
[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply             ok
[ 89.29%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   72.8±2μs
               numpy.float64   79.0±2μs
              =============== ==========
[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer                ok
[ 91.07%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   10.9±0.2ms
               numpy.float64   22.1±0.3ms
              =============== ============
[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul              ok
[ 92.86%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   67.4±1μs
               numpy.float64   78.8±2μs
              =============== ==========
[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2             ok
[ 94.64%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   66.8±3μs
               numpy.float64   83.5±3μs
              =============== ==========
[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer                       ok
[ 96.43%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   25.7±0.8ms
               numpy.float64   50.1±0.4ms
              =============== ============
[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul                     ok
[ 98.21%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   60.3±2μs
               numpy.float64   66.7±2μs
              =============== ==========
[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2                    ok
[100.00%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   71.5±5μs
               numpy.float64   72.8±4μs
              =============== ==========
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

ARM(NEON enabled) PERFORMANCE INCREASED


· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython.
·· Building 94681fbf  for virtualenv-py3.7-Cython.....................................
·· Installing 94681fbf  into virtualenv-py3.7-Cython.
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 360ba057  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython.......................................
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit 94681fbf  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython..
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit 94681fbf  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig               ok
[ 51.79%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   554±1μs
               numpy.float64   509±2μs
              =============== =========
[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0           ok
[ 53.57%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   1.02±0ms
               numpy.float64   865±7μs
              =============== ==========
[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul                         ok
[ 55.36%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   780±10μs
               numpy.float64   976±20μs
              =============== ==========
[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply                    ok
[ 57.14%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   116±0.6μs
               numpy.float64    143±2μs
              =============== ===========
[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig        ok
[ 58.93%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   74.9±0.9μs
               numpy.float64    75.6±1μs
              =============== ============
[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0    ok
[ 60.71%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   63.2±0.3μs
               numpy.float64   63.1±0.5μs
              =============== ============
[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul                  ok
[ 62.50%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   72.3±0.2μs
               numpy.float64   73.2±0.4μs
              =============== ============
[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply             ok
[ 64.29%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   115±0.6μs
               numpy.float64    140±1μs
              =============== ===========
[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer                ok
[ 66.07%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   5.36±0.3ms
               numpy.float64   7.02±0.3ms
              =============== ============
[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul              ok
[ 67.86%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   141±0.5μs
               numpy.float64   134±0.9μs
              =============== ===========
[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2             ok
[ 69.64%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   140±0.5μs
               numpy.float64   135±0.5μs
              =============== ===========
[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer                       ok
[ 71.43%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   17.6±0.9ms
               numpy.float64    23.1±2ms
              =============== ============
[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul                     ok
[ 73.21%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   112±0.4μs
               numpy.float64   111±0.6μs
              =============== ===========
[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2                    ok
[ 75.00%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   112±0.5μs
               numpy.float64   111±0.8μs
              =============== ===========
[ 75.00%] · For numpy commit 360ba057  (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython..
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig               ok
[ 76.79%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   552±2μs
               numpy.float64   502±4μs
              =============== =========
[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0           ok
[ 78.57%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   1.02±0ms
               numpy.float64   874±4μs
              =============== ==========
[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul                         ok
[ 80.36%] ··· =============== =============
                   dtype
              --------------- -------------
               numpy.float32     796±9μs
               numpy.float64   1.00±0.03ms
              =============== =============
[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply                    ok
[ 82.14%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   137±0.7μs
               numpy.float64    158±2μs
              =============== ===========
[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig        ok
[ 83.93%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   76.1±0.3μs
               numpy.float64   75.3±0.2μs
              =============== ============
[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0    ok
[ 85.71%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   63.9±0.2μs
               numpy.float64   63.1±0.5μs
              =============== ============
[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul                  ok
[ 87.50%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   72.5±0.3μs
               numpy.float64   74.0±0.2μs
              =============== ============
[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply             ok
[ 89.29%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   136±0.3μs
               numpy.float64   155±0.6μs
              =============== ===========
[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer                ok
[ 91.07%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   5.20±0.1ms
               numpy.float64   7.15±0.2ms
              =============== ============
[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul              ok
[ 92.86%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   141±0.8μs
               numpy.float64   135±0.6μs
              =============== ===========
[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2             ok
[ 94.64%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   142±0.5μs
               numpy.float64   135±0.5μs
              =============== ===========
[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer                       ok
[ 96.43%] ··· =============== ============
                   dtype
              --------------- ------------
               numpy.float32   17.8±0.7ms
               numpy.float64   23.7±0.9ms
              =============== ============
[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul                     ok
[ 98.21%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   113±0.5μs
               numpy.float64   111±0.8μs
              =============== ===========
[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2                    ok
[100.00%] ··· =============== ===========
                   dtype
              --------------- -----------
               numpy.float32   113±0.7μs
               numpy.float64   110±0.5μs
              =============== ===========
   before           after         ratio
 [360ba057]       [94681fbf]
 <master>         <einsum-muladd>
  155±0.6μs          140±1μs     0.91  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float64'>)
    158±2μs          143±2μs     0.91  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
  136±0.3μs        115±0.6μs     0.85  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float32'>)
  137±0.7μs        116±0.6μs     0.84  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

PERFORMANCE INCREASED.
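
As a quick sanity check on the 9% to 16% figure, the ratio column can be converted into a percentage reduction in run time. This is a minimal sketch in plain Python; the helper name `time_reduction_percent` is made up for illustration, and the ratios are copied from the ARM comparison table above (asv reports the ratio as after/before).

```python
def time_reduction_percent(ratio):
    """Convert an after/before time ratio into a rounded percentage reduction."""
    return round((1.0 - ratio) * 100)


# Ratios reported for the ARM (NEON) run in the comparison table.
arm_ratios = {
    "time_einsum_noncon_multiply(float64)": 0.91,
    "time_einsum_multiply(float64)": 0.91,
    "time_einsum_noncon_multiply(float32)": 0.85,
    "time_einsum_multiply(float32)": 0.84,
}

reductions = {name: time_reduction_percent(r) for name, r in arm_ratios.items()}

for name, pct in reductions.items():
    print(f"{name}: about {pct}% less time")
```

The float64 cases come out at roughly 9% and the float32 cases at roughly 15% to 16%, which is consistent with the range quoted in the description.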

System Info

             Arm                                                             x86
Hardware     KunPeng
Processor    ARMv8 2.6 GHz, 8 processors                                     Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz
OS           Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64   Windows Server 2008 R2 Enterprise
Compiler     gcc (GCC) 7.3.0                                                 MSVC14.06

numpy/core/src/multiarray/einsum_sumprod.c.src

seiko2plus

Since #17340 is merged, could you use NPYV partial load and store to handle the remaining scalars?

It should perform better on AVX2/AVX512F without losing performance on other
architectures, since we already unroll by four vectors.
It also guarantees equal accuracy for all array elements when fused instructions are enabled.
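
The suggestion can be illustrated with a small model of the loop structure. The sketch below is plain Python, not the NPYV C API: `VLANES`, `load_tillz`, `store_till`, and `muladd` are hypothetical names chosen for illustration, and the `out[i] += a[i] * b[i]` kernel is an assumption about the shape of the multiply loop. The four-vector unrolling is left out for brevity. It shows the tail being handled by one zero-filled partial load and one partial store instead of a scalar loop, so every element goes through the same vector-style arithmetic.

```python
VLANES = 4  # assumed lane count, e.g. four float32 values in a 128-bit register


def load_tillz(src, start, nlane):
    """Model of a partial load: read nlane items, zero-fill the remaining lanes."""
    lanes = [0.0] * VLANES
    for i in range(min(nlane, VLANES)):
        lanes[i] = src[start + i]
    return lanes


def store_till(dst, start, nlane, vec):
    """Model of a partial store: write back only the first nlane lanes."""
    for i in range(min(nlane, VLANES)):
        dst[start + i] = vec[i]


def muladd(a, b, out):
    """Compute out[i] += a[i] * b[i] one vector at a time, tail included."""
    n = len(out)
    i = 0
    # Main loop: full vectors only.
    while n - i >= VLANES:
        va = load_tillz(a, i, VLANES)
        vb = load_tillz(b, i, VLANES)
        vo = load_tillz(out, i, VLANES)
        store_till(out, i, VLANES, [x * y + z for x, y, z in zip(va, vb, vo)])
        i += VLANES
    # Tail: one partial load/store replaces the scalar remainder loop.
    rem = n - i
    if rem > 0:
        va = load_tillz(a, i, rem)
        vb = load_tillz(b, i, rem)
        vo = load_tillz(out, i, rem)
        store_till(out, i, rem, [x * y + z for x, y, z in zip(va, vb, vo)])
    return out


a = [float(x) for x in range(1, 8)]  # 7 elements: one full vector plus a tail of 3
b = [2.0] * 7
result = muladd(a, b, [1.0] * 7)
print(result)
```

Because the unused lanes are zero-filled on load and skipped on store, the tail never reads or writes past the end of the arrays.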

numpy/core/src/multiarray/einsum_sumprod.c.src

seiko2plus

LGTM. Thank you, Chunlin!

Qiyu8 · 2020-12-11T02:09:12Z

@mattip @eric-wieser Development of version 1.21 has started. Is this the right time to accept SIMD PRs?

mattip · 2020-12-11T06:35:00Z

Thanks @Qiyu8

Optimize the performance of multiply

97ba579

eric-wieser reviewed Nov 16, 2020

numpy/core/src/multiarray/einsum_sumprod.c.src

fix misleading comment

594dd5d

seiko2plus requested changes Nov 18, 2020

optimize the remaining elements using npyv_load_tillz

95d6052

seiko2plus reviewed Nov 19, 2020

numpy/core/src/multiarray/einsum_sumprod.c.src

add guard #ifndef NPY_DISABLE_OPTIMIZATION

f921f0d

seiko2plus approved these changes Nov 24, 2020

Qiyu8 mentioned this pull request Dec 7, 2020

SIMD: [Doubt] Doubts on dispatch and usage of npy functions #17925

Closed

mattip merged commit 9e26d1d into numpy:master Dec 11, 2020
