SIMD: Optimize the performance of einsum's submodule sum. #18012

Merged: 5 commits into numpy:master, Dec 23, 2020

@Qiyu8 Qiyu8 commented Dec 17, 2020

Introduction

This is the sixth part of #17049. The sum operation is extracted because three sub-functions depend on it. The optimized code reduces the amount of code by 85%, and performance improves by 45%~50% on x86 and by about 14%~77% on ARM.
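
For readers unfamiliar with the idea, the shape of a vectorized contiguous sum can be sketched in portable C. This is only an illustration under stated assumptions, not the code in this PR: `sum_contig_f32` and `LANES` are hypothetical names, and the inner loop over `lane` stands in for a single SIMD add that the real code would issue through SIMD intrinsics.

```c
#include <stddef.h>

/* Hypothetical lane count standing in for the SIMD register width
 * (for example 4 float32 lanes with SSE2 or NEON, 8 with AVX2). */
#define LANES 4

/* Sum `count` contiguous floats.
 * The inner loop over `lane` mimics one vector add per iteration;
 * the loop after it mimics the horizontal reduction of the vector. */
static float sum_contig_f32(const float *data, size_t count)
{
    float lanes[LANES] = {0};
    size_t i = 0;

    /* Main loop: consume LANES elements per iteration. */
    for (; i + LANES <= count; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            lanes[lane] += data[i + lane];
        }
    }

    /* Horizontal add of the per-lane partial sums. */
    float accum = 0.0f;
    for (size_t lane = 0; lane < LANES; ++lane) {
        accum += lanes[lane];
    }

    /* Scalar tail for the remaining count % LANES elements. */
    for (; i < count; ++i) {
        accum += data[i];
    }
    return accum;
}
```

Called on the ten values 1 through 10, the main loop handles the first eight elements and the tail loop the last two, giving 55. Because the per-lane partial sums change the order of additions, results for general floating-point input can differ in the last bits from a strictly sequential sum.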

Benchmark

Here are the ASV benchmark results.

SSE2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building b80411d2  for virtualenv-py3.7-Cython
·· Installing b80411d2  into virtualenv-py3.7-Cython
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit d7a75e8e  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit b80411d2  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit b80411d2  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                                                                                            ok
[ 51.79%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   142±10μs
               numpy.float64   269±30μs
              =============== ==========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 147±2μs
numpy.float64 248±5μs
=============== =========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.38±0.04ms
numpy.float64 2.74±0.05ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 74.1±5μs
numpy.float64 76.7±1μs
=============== ==========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 43.5±1μs
numpy.float64 45.4±2μs
=============== ==========

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 33.9±0.6μs
numpy.float64 35.2±0.9μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 43.5±1μs
numpy.float64 43.0±3μs
=============== ==========

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 69.9±2μs
numpy.float64 83.6±8μs
=============== ==========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 11.1±0.3ms
numpy.float64 22.6±0.1ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 64.9±6μs
numpy.float64 66.4±3μs
=============== ==========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 63.6±3μs
numpy.float64 71.5±3μs
=============== ==========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 25.5±0.4ms
numpy.float64 50.5±0.6ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 60.6±2μs
numpy.float64 58.1±2μs
=============== ==========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 59.9±4μs
numpy.float64 64.5±2μs
=============== ==========

[ 75.00%] · For numpy commit d7a75e8e (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 142±6μs
numpy.float64 258±8μs
=============== =========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 271±20μs
numpy.float64 452±8μs
=============== ==========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.39±0.03ms
numpy.float64 2.73±0.1ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 68.7±2μs
numpy.float64 92.8±9μs
=============== ==========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 46.6±3μs
numpy.float64 46.9±2μs
=============== ==========

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 36.9±3μs
numpy.float64 35.4±1μs
=============== ==========

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 41.3±2μs
numpy.float64 44.4±3μs
=============== ==========

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.3±4μs
numpy.float64 78.8±3μs
=============== ==========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 11.1±0.3ms
numpy.float64 22.7±0.2ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 66.9±1μs
numpy.float64 76.1±1μs
=============== ==========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 68.2±4μs
numpy.float64 79.6±4μs
=============== ==========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 26.0±0.4ms
numpy.float64 51.2±0.6ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 68.7±6μs
numpy.float64 65.1±2μs
=============== ==========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 67.6±1μs
numpy.float64 64.4±1μs
=============== ==========

   before           after         ratio
 [d7a75e8e]       [b80411d2]
 <master>         <einsum-sum>
  •     452±8μs          248±5μs     0.55  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
    
  •    271±20μs          147±2μs     0.54  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
    

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
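
To relate the `ratio` column to the percentages quoted in the introduction, the arithmetic can be sketched as below. The ratio is the "after" time divided by the "before" time, so values below 1.0 mean the new commit is faster. The helper names `asv_ratio` and `time_saved_percent` are hypothetical and exist only for this illustration.

```c
/* Ratio as shown in the comparison table: after / before.
 * A value below 1.0 means the new commit runs faster. */
static double asv_ratio(double before_us, double after_us)
{
    return after_us / before_us;
}

/* Share of the original run time that was saved, in percent. */
static double time_saved_percent(double before_us, double after_us)
{
    return 100.0 * (1.0 - after_us / before_us);
}
```

For the float64 row, going from 452μs to 248μs gives a ratio of about 0.55, which is roughly 45% less run time. For the float32 row, 271μs to 147μs gives about 0.54, or roughly 46% less.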

AVX2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building f3608c32  for virtualenv-py3.7-Cython
·· Installing f3608c32  into virtualenv-py3.7-Cython
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit d7a75e8e  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit f3608c32  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit f3608c32  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                                                                                            ok
[ 51.79%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   107±3μs
               numpy.float64   158±5μs
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 148±2μs
numpy.float64 245±7μs
=============== =========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.35±0.02ms
numpy.float64 2.73±0.1ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 83.4±2μs
numpy.float64 96.4±7μs
=============== ==========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 44.3±2μs
numpy.float64 46.6±4μs
=============== ==========

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 45.6±1μs
numpy.float64 41.6±2μs
=============== ==========

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 47.7±3μs
numpy.float64 43.8±1μs
=============== ==========

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 85.8±2μs
numpy.float64 94.3±5μs
=============== ==========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 11.1±0.07ms
numpy.float64 22.8±0.4ms
=============== =============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 74.5±2μs
numpy.float64 76.5±3μs
=============== ==========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 76.4±3μs
numpy.float64 79.9±5μs
=============== ==========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 25.4±0.1ms
numpy.float64 52.0±0.7ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 83.8±5μs
numpy.float64 70.1±2μs
=============== ==========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 71.8±4μs
numpy.float64 72.1±3μs
=============== ==========

[ 75.00%] · For numpy commit d7a75e8e (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 105±8μs
numpy.float64 160±6μs
=============== =========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 300±20μs
numpy.float64 442±10μs
=============== ==========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.36±0.03ms
numpy.float64 2.95±0.2ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 87.1±4μs
numpy.float64 94.2±4μs
=============== ==========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 46.0±2μs
numpy.float64 45.6±4μs
=============== ==========

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 35.7±1μs
numpy.float64 35.1±1μs
=============== ==========

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 45.6±3μs
numpy.float64 42.4±2μs
=============== ==========

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 86.4±1μs
numpy.float64 95.2±5μs
=============== ==========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 11.1±0.2ms
numpy.float64 22.9±0.2ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.9±3μs
numpy.float64 81.7±5μs
=============== ==========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.1±4μs
numpy.float64 79.7±5μs
=============== ==========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 25.4±0.2ms
numpy.float64 51.6±0.9ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 60.8±5μs
numpy.float64 65.5±2μs
=============== ==========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 61.3±2μs
numpy.float64 65.9±2μs
=============== ==========

   before           after         ratio
 [d7a75e8e]       [f3608c32]
 <master>         <einsum-sum>
  •    442±10μs          245±7μs     0.55  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
    
  •    300±20μs          148±2μs     0.50  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
    

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

NEON enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython.
·· Installing f3608c32  into virtualenv-py3.7-Cython.
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit d7a75e8e  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython..
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit f3608c32  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython..
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit f3608c32  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                 ok
[ 51.79%] ··· =============== =========
                   dtype               
              --------------- ---------
               numpy.float32   205±1μs 
               numpy.float64   365±4μs 
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 231±3μs
numpy.float64 357±8μs
=============== =========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 783±20μs
numpy.float64 1.04±0.01ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 114±0.5μs
numpy.float64 141±0.7μs
=============== ===========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 73.1±0.6μs
numpy.float64 74.2±0.9μs
=============== ============

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 59.4±0.4μs
numpy.float64 59.9±0.2μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 71.4±0.3μs
numpy.float64 73.3±0.3μs
=============== ============

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 115±0.4μs
numpy.float64 141±0.9μs
=============== ===========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 5.27±0.1ms
numpy.float64 7.16±0.3ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 99.8±0.6μs
numpy.float64 109±0.3μs
=============== ============

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 101±0.7μs
numpy.float64 109±0.4μs
=============== ===========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 16.9±1ms
numpy.float64 22.4±0.4ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 95.3±0.2μs
numpy.float64 99.7±0.4μs
=============== ============

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 96.3±0.3μs
numpy.float64 100.0±0.5μs
=============== =============

[ 75.00%] · For numpy commit d7a75e8e (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython..
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 202±5μs
numpy.float64 367±4μs
=============== =========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 1.02±0ms
numpy.float64 860±4μs
=============== ==========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 780±20μs
numpy.float64 1.02±0.03ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 115±0.4μs
numpy.float64 143±1μs
=============== ===========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 74.0±0.4μs
numpy.float64 75.0±0.5μs
=============== ============

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 62.7±0.4μs
numpy.float64 62.2±0.4μs
=============== ============

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 71.1±0.2μs
numpy.float64 73.1±0.3μs
=============== ============

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 116±2μs
numpy.float64 140±2μs
=============== =========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 5.13±0.08ms
numpy.float64 6.94±0.3ms
=============== =============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 141±0.5μs
numpy.float64 133±1μs
=============== ===========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 140±0.3μs
numpy.float64 133±0.8μs
=============== ===========
[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 16.6±0.9ms
numpy.float64 25.6±1ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 112±0.4μs
numpy.float64 110±0.4μs
=============== ===========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 112±0.08μs
numpy.float64 109±0.7μs
=============== ============

   before           after         ratio
 [d7a75e8e]       [f3608c32]
 <master>         <einsum-sum>
  •  62.7±0.4μs       59.4±0.4μs     0.95  bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float32'>)
    
  •   109±0.7μs      100.0±0.5μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float64'>)
    
  •   110±0.4μs       99.7±0.4μs     0.91  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float64'>)
    
  •  112±0.08μs       96.3±0.3μs     0.86  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>)
    
  •   112±0.4μs       95.3±0.2μs     0.85  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float32'>)
    
  •   133±0.8μs        109±0.4μs     0.82  bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float64'>)
    
  •     133±1μs        109±0.3μs     0.81  bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float64'>)
    
  •   140±0.3μs        101±0.7μs     0.72  bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float32'>)
    
  •   141±0.5μs       99.8±0.6μs     0.71  bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float32'>)
    
  •     860±4μs          357±8μs     0.42  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
    
  •    1.02±0ms          231±3μs     0.23  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
    

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

System Info

| | Arm | x86 |
|---|---|---|
| Hardware | KunPeng | |
| Processor | ARMv8 2.6GHz, 8 processors | Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz |
| OS | Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64 | Windows Server 2008 R2 Enterprise |
| Compiler | gcc (GCC) 7.3.0 | MSVC14.06 |

Comment on lines 136 to 137
const @temptype@ a01 = @from@(*data) + @from@(*(data + 1));
const @temptype@ a23 = @from@(*(data + 2)) + @from@(*(data + 3));

Suggested change
- const @temptype@ a01 = @from@(*data) + @from@(*(data + 1));
- const @temptype@ a23 = @from@(*(data + 2)) + @from@(*(data + 3));
+ const @temptype@ a01 = @from@(*data) + @from@(data[1]);
+ const @temptype@ a23 = @from@(data[2]) + @from@(data[3]);

make it simpler?

accum += a01 + a23;
}
#endif // !NPY_DISABLE_OPTIMIZATION
for (; count > 0; --count, data += 1) {

Suggested change
- for (; count > 0; --count, data += 1) {
+ for (; count > 0; --count, ++data) {
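
With both review suggestions applied, the scalar fallback might look like the sketch below. It is a hand-expanded illustration, not the templated source: `@temptype@` is replaced by `float`, `@from@` by an identity conversion macro `FROM`, and the function name `sum_scalar_fallback` is hypothetical. The header of the unrolled loop is also an assumption, since the review only quotes its body.

```c
#include <stddef.h>

/* Stand-ins for the template placeholders (assumptions for this sketch):
 * @temptype@ -> float, @from@ -> identity conversion. */
typedef float temptype;
#define FROM(x) ((temptype)(x))

static temptype sum_scalar_fallback(const float *data, ptrdiff_t count)
{
    temptype accum = 0;

    /* Unrolled by four. The loop header is assumed; only the body
     * appears in the review comments. */
    for (; count >= 4; count -= 4, data += 4) {
        const temptype a01 = FROM(*data) + FROM(data[1]);
        const temptype a23 = FROM(data[2]) + FROM(data[3]);
        accum += a01 + a23;
    }

    /* Tail loop for the remaining elements, using ++data as suggested. */
    for (; count > 0; --count, ++data) {
        accum += FROM(*data);
    }
    return accum;
}
```

Indexing with `data[1]` is equivalent to `*(data + 1)`, and `++data` is equivalent to `data += 1`, so both suggestions change readability only and not behavior.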

@seiko2plus seiko2plus left a comment

Well done, thank you!

@mattip mattip merged commit 073b9b9 into numpy:master Dec 23, 2020

mattip commented Dec 23, 2020

Thanks @Qiyu8
