SIMD: Optimize the performance of einsum's submodule multiply by using universal intrinsics #17782

Merged on Dec 11, 2020 (4 commits)

@Qiyu8 Qiyu8 commented Nov 16, 2020

This is the third part of #17049. There is no significant impact on the x86 platform and about a 9%~16% performance increase on ARM. Here are the benchmark results:

X86(SSE2 enabled) BENCHMARKS NOT SIGNIFICANTLY CHANGED.

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building 97ba579b  for virtualenv-py3.7-Cython
·· Installing 97ba579b  into virtualenv-py3.7-Cython
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 360ba057  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit 97ba579b  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit 97ba579b  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                                                                                     ok
[ 51.79%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   151±9μs
               numpy.float64   249±6μs
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 248±5μs
numpy.float64 455±4μs
=============== =========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.38±0.03ms
numpy.float64 2.85±0.1ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 75.0±4μs
numpy.float64 84.3±3μs
=============== ==========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 44.1±1μs
numpy.float64 49.1±2μs
=============== ==========

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 35.0±1μs
numpy.float64 35.4±0.7μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 42.3±0.8μs
numpy.float64 43.6±1μs
=============== ============

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.2±2μs
numpy.float64 95.9±2μs
=============== ==========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 11.3±0.2ms
numpy.float64 22.5±0.3ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.7±2μs
numpy.float64 81.3±3μs
=============== ==========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 67.7±2μs
numpy.float64 86.3±4μs
=============== ==========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 25.7±0.5ms
numpy.float64 50.4±0.2ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 60.2±1μs
numpy.float64 66.3±3μs
=============== ==========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 61.7±6μs
numpy.float64 68.0±3μs
=============== ==========

[ 75.00%] · For numpy commit 360ba057 (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 150±20μs
numpy.float64 249±4μs
=============== ==========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 262±8μs
numpy.float64 461±20μs
=============== ==========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.33±0.01ms
numpy.float64 2.72±0.07ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 73.0±2μs
numpy.float64 83.1±4μs
=============== ==========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 45.4±4μs
numpy.float64 47.5±0.8μs
=============== ============

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 35.2±2μs
numpy.float64 38.2±2μs
=============== ==========

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 41.8±0.9μs
numpy.float64 45.0±3μs
=============== ============

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 72.8±2μs
numpy.float64 79.0±2μs
=============== ==========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 10.9±0.2ms
numpy.float64 22.1±0.3ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 67.4±1μs
numpy.float64 78.8±2μs
=============== ==========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 66.8±3μs
numpy.float64 83.5±3μs
=============== ==========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 25.7±0.8ms
numpy.float64 50.1±0.4ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 60.3±2μs
numpy.float64 66.7±2μs
=============== ==========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 71.5±5μs
numpy.float64 72.8±4μs
=============== ==========

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

ARM(NEON enabled) PERFORMANCE INCREASED

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython.
·· Building 94681fbf  for virtualenv-py3.7-Cython.....................................
·· Installing 94681fbf  into virtualenv-py3.7-Cython.
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 360ba057  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython.......................................
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit 94681fbf  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython..
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit 94681fbf  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                ok
[ 51.79%] ··· =============== =========
                   dtype               
              --------------- ---------
               numpy.float32   554±1μs 
               numpy.float64   509±2μs 
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 1.02±0ms
numpy.float64 865±7μs
=============== ==========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 780±10μs
numpy.float64 976±20μs
=============== ==========

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 116±0.6μs
numpy.float64 143±2μs
=============== ===========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 74.9±0.9μs
numpy.float64 75.6±1μs
=============== ============

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 63.2±0.3μs
numpy.float64 63.1±0.5μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 72.3±0.2μs
numpy.float64 73.2±0.4μs
=============== ============

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 115±0.6μs
numpy.float64 140±1μs
=============== ===========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 5.36±0.3ms
numpy.float64 7.02±0.3ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 141±0.5μs
numpy.float64 134±0.9μs
=============== ===========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 140±0.5μs
numpy.float64 135±0.5μs
=============== ===========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 17.6±0.9ms
numpy.float64 23.1±2ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 112±0.4μs
numpy.float64 111±0.6μs
=============== ===========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 112±0.5μs
numpy.float64 111±0.8μs
=============== ===========

[ 75.00%] · For numpy commit 360ba057 (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython..
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 552±2μs
numpy.float64 502±4μs
=============== =========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 1.02±0ms
numpy.float64 874±4μs
=============== ==========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 796±9μs
numpy.float64 1.00±0.03ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 137±0.7μs
numpy.float64 158±2μs
=============== ===========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 76.1±0.3μs
numpy.float64 75.3±0.2μs
=============== ============

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 63.9±0.2μs
numpy.float64 63.1±0.5μs
=============== ============

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 72.5±0.3μs
numpy.float64 74.0±0.2μs
=============== ============

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 136±0.3μs
numpy.float64 155±0.6μs
=============== ===========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 5.20±0.1ms
numpy.float64 7.15±0.2ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 141±0.8μs
numpy.float64 135±0.6μs
=============== ===========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 142±0.5μs
numpy.float64 135±0.5μs
=============== ===========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 17.8±0.7ms
numpy.float64 23.7±0.9ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 113±0.5μs
numpy.float64 111±0.8μs
=============== ===========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 113±0.7μs
numpy.float64 110±0.5μs
=============== ===========

       before           after         ratio
     [360ba057]       [94681fbf]
     <master>         <einsum-muladd>
-       155±0.6μs          140±1μs     0.91  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float64'>)
-         158±2μs          143±2μs     0.91  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
-       136±0.3μs        115±0.6μs     0.85  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float32'>)
-       137±0.7μs        116±0.6μs     0.84  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
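
The 9%~16% figure quoted in the description can be read off the ratio column of the ARM comparison above. A minimal sketch of that arithmetic, assuming the ratio is after/before as asv reports it; the helper name `percent_less_time` is hypothetical and only used for illustration:

```python
# Ratios copied from the ARM comparison table (asv reports after / before).
ratios = {
    "time_einsum_noncon_multiply(float64)": 0.91,
    "time_einsum_multiply(float64)": 0.91,
    "time_einsum_noncon_multiply(float32)": 0.85,
    "time_einsum_multiply(float32)": 0.84,
}


def percent_less_time(ratio):
    """Run-time reduction in percent for an after/before ratio (hypothetical helper)."""
    return round((1.0 - ratio) * 100)


reductions = {name: percent_less_time(r) for name, r in ratios.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct}% less time")

print(f"range: {min(reductions.values())}% to {max(reductions.values())}%")
```

Read this way, the four changed benchmarks take between 9% and 16% less time on ARM, with the larger gains on float32.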

System Info

| | Arm | x86 |
|---|---|---|
| Hardware | KunPeng | |
| Processor | ARMv8 2.6GHz, 8 processors | Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz |
| OS | Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64 | Windows Server 2008 R2 Enterprise |
| Compiler | gcc (GCC) 7.3.0 | MSVC14.06 |

@seiko2plus seiko2plus left a comment

Since #17340 is merged, could you use the NPYV partial load and store intrinsics to handle the remaining scalars?

  • It should perform better on AVX2/AVX512F without losing performance on other
    architectures, since we already unroll by four vectors.
  • It guarantees equal accuracy for all array elements when fused intrinsic instructions are enabled.
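
The suggestion above can be pictured with a small emulation. The sketch below is plain Python, not NumPy's universal intrinsics and not the code in this PR: the lane count, the helper names (`load_tillz`, `store_till`, `vec_mul`, `multiply`) and the list-based "vectors" are all assumptions made for illustration. It shows a main loop unrolled by four vectors, followed by a tail that uses a zero-filled partial load and a partial store instead of a separate scalar loop.

```python
NLANES = 4  # assumed lane count, e.g. a 128-bit vector of float32
UNROLL = 4  # the review comment mentions unrolling by four vectors


def load(src, pos):
    """Full load: NLANES consecutive elements starting at pos."""
    return src[pos:pos + NLANES]


def load_tillz(src, pos, n):
    """Partial load: the first n lanes come from src, the rest are zero."""
    return src[pos:pos + n] + [0.0] * (NLANES - n)


def store(dst, pos, vec):
    """Full store: write all NLANES lanes."""
    dst[pos:pos + NLANES] = vec


def store_till(dst, pos, n, vec):
    """Partial store: write only the first n lanes."""
    dst[pos:pos + n] = vec[:n]


def vec_mul(a, b):
    """Lane-wise multiply of two emulated vectors."""
    return [x * y for x, y in zip(a, b)]


def multiply(a, b):
    """Element-wise product of two equal-length lists of floats."""
    count = len(a)
    out = [0.0] * count
    pos = 0
    step = NLANES * UNROLL

    # Main loop: four full vectors per iteration.
    while count - pos >= step:
        for k in range(UNROLL):
            off = pos + k * NLANES
            store(out, off, vec_mul(load(a, off), load(b, off)))
        pos += step

    # Tail: same vector operation, but with partial load and store,
    # so no separate scalar loop is needed for the leftover elements.
    while pos < count:
        n = min(NLANES, count - pos)
        va = load_tillz(a, pos, n)
        vb = load_tillz(b, pos, n)
        store_till(out, pos, n, vec_mul(va, vb))
        pos += n

    return out


a = [float(i) for i in range(23)]
b = [2.0] * 23
print(multiply(a, b)[-3:])
```

Because the tail goes through the same `vec_mul` path as the main loop, every element is produced by the same operation, which is the property the second bullet asks for when fused instructions are in play.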

numpy/core/src/multiarray/einsum_sumprod.c.src (three review threads, outdated and resolved)

@seiko2plus seiko2plus left a comment

LGTM. Thank you, Chunlin!

Qiyu8 commented Dec 11, 2020

@mattip @eric-wieser Development of version 1.21 has started. Is this the right time to accept SIMD PRs?

@mattip mattip merged commit 9e26d1d into numpy:master Dec 11, 2020

mattip commented Dec 11, 2020

Thanks @Qiyu8
