SIMD: Optimize the performance of einsum's submodule dot. #17994

Merged: 4 commits, Dec 14, 2020

@Qiyu8 Qiyu8 commented Dec 14, 2020

Introduction

This is the fourth part of #17049. There is no impact on SSE2 platforms, performance improves by about 10%~64% on ARM (NEON), and it improves by 27% with AVX2 instructions.
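For readers unfamiliar with how a SIMD dot product is usually structured, the following is a minimal pure-Python sketch, not the PR's C code. It assumes the common pattern of a vector main loop with one partial sum per lane, a horizontal reduction, and a scalar tail. The function name `dot_lanes` and the `lanes` parameter are hypothetical and exist only for this illustration.

```python
def dot_lanes(a, b, lanes=4):
    """Dot product accumulated in `lanes` independent partial sums.

    Illustration only: `lanes` stands in for the number of elements
    that fit in one SIMD register (an assumption, not taken from the PR).
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")

    acc = [0.0] * lanes                  # one partial sum per simulated lane
    n_vec = len(a) - len(a) % lanes      # largest multiple of `lanes`

    # Main loop: each step consumes `lanes` contiguous elements.
    for i in range(0, n_vec, lanes):
        for k in range(lanes):
            acc[k] += a[i + k] * b[i + k]

    total = sum(acc)                     # horizontal reduction of the lanes

    # Scalar tail: elements that do not fill a whole vector.
    for i in range(n_vec, len(a)):
        total += a[i] * b[i]
    return total


print(dot_lanes([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0]))
```

Because the lanes accumulate independently, the floating-point summation order differs from a plain left-to-right loop, so results can differ in the last bits for non-trivial inputs.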

Benchmark

Here is the ASV benchmark result.

SSE2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building 350670fd  for virtualenv-py3.7-Cython
·· Installing 350670fd  into virtualenv-py3.7-Cython
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 9e26d1d2  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit 350670fd  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit 350670fd  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                                                       ok
[ 51.79%] ··· =============== ==========
                   dtype
              --------------- ----------
               numpy.float32   146±8μs
               numpy.float64   247±10μs
              =============== ==========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 256±30μs
numpy.float64 469±20μs
=============== ==========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.41±0.04ms
numpy.float64 2.79±0.1ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 67.4±10μs
numpy.float64 81.8±1μs
=============== ===========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 46.7±2μs
numpy.float64 46.4±1μs
=============== ==========

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 35.8±1μs
numpy.float64 35.9±2μs
=============== ==========

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 41.6±2μs
numpy.float64 42.8±0.9μs
=============== ============

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.4±3μs
numpy.float64 78.6±2μs
=============== ==========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 11.2±0.5ms
numpy.float64 22.6±0.2ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 67.8±2μs
numpy.float64 76.6±2μs
=============== ==========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 71.3±5μs
numpy.float64 77.3±1μs
=============== ==========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 25.9±1ms
numpy.float64 51.2±1ms
=============== ==========

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 62.5±1μs
numpy.float64 68.7±3μs
=============== ==========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 64.8±2μs
numpy.float64 67.2±2μs
=============== ==========

[ 75.00%] · For numpy commit 9e26d1d2 (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 160±4μs
numpy.float64 266±20μs
=============== ==========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 266±9μs
numpy.float64 466±20μs
=============== ==========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.35±0.04ms
numpy.float64 2.80±0.07ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 73.0±5μs
numpy.float64 83.1±1μs
=============== ==========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 48.2±3μs
numpy.float64 47.4±5μs
=============== ==========

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 35.1±0.7μs
numpy.float64 37.6±1μs
=============== ============

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 41.2±1μs
numpy.float64 42.4±2μs
=============== ==========

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 69.3±1μs
numpy.float64 84.7±8μs
=============== ==========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 11.4±0.1ms
numpy.float64 22.9±0.4ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 70.5±3μs
numpy.float64 78.7±10μs
=============== ===========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.5±4μs
numpy.float64 81.8±5μs
=============== ==========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 26.5±0.9ms
numpy.float64 50.9±1ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 60.9±2μs
numpy.float64 68.1±3μs
=============== ==========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 63.5±3μs
numpy.float64 65.5±2μs
=============== ==========

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

AVX2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Installing c32f60e3  into virtualenv-py3.7-Cython
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 9e26d1d2  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit c32f60e3  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit c32f60e3  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                                                       ok
[ 51.79%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   136±6μs
               numpy.float64   191±6μs
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 252±7μs
numpy.float64 446±8μs
=============== =========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.41±0.05ms
numpy.float64 2.95±0.04ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 86.9±4μs
numpy.float64 96.2±2μs
=============== ==========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 49.8±2μs
numpy.float64 58.6±4μs
=============== ==========

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 37.4±2μs
numpy.float64 36.8±0.8μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 43.2±2μs
numpy.float64 43.6±6μs
=============== ==========

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 86.5±5μs
numpy.float64 101±5μs
=============== ==========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 11.4±0.3ms
numpy.float64 23.0±0.2ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 68.0±2μs
numpy.float64 75.6±3μs
=============== ==========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 69.9±2μs
numpy.float64 77.5±2μs
=============== ==========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 25.8±0.7ms
numpy.float64 53.1±0.4ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 62.5±1μs
numpy.float64 67.0±2μs
=============== ==========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 62.4±2μs
numpy.float64 68.0±3μs
=============== ==========

[ 75.00%] · For numpy commit 9e26d1d2 (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 147±3μs
numpy.float64 262±10μs
=============== ==========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 267±20μs
numpy.float64 458±8μs
=============== ==========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.43±0.04ms
numpy.float64 2.92±0.07ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.3±2μs
numpy.float64 85.9±5μs
=============== ==========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 46.7±3μs
numpy.float64 48.7±5μs
=============== ==========

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 33.8±1μs
numpy.float64 37.0±2μs
=============== ==========

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 45.3±0.9μs
numpy.float64 48.2±4μs
=============== ============

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 74.1±5μs
numpy.float64 80.5±2μs
=============== ==========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 11.7±0.5ms
numpy.float64 23.0±0.4ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 67.9±1μs
numpy.float64 80.0±0.7μs
=============== ============

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 69.3±3μs
numpy.float64 77.2±2μs
=============== ==========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 25.4±0.7ms
numpy.float64 52.4±1ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 68.0±6μs
numpy.float64 67.1±2μs
=============== ==========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 64.1±4μs
numpy.float64 67.4±5μs
=============== ==========

   before           after         ratio
 [9e26d1d2]       [c32f60e3]
 <master>         <einsum-dot>
  •    262±10μs          191±6μs     0.73  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
    

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

NEON enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython.
·· Building 350670fd  for virtualenv-py3.7-Cython...........................................
·· Installing 350670fd  into virtualenv-py3.7-Cython.
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 5da4a8e1  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython..
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit 350670fd  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython..
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit 350670fd  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                             ok
[ 51.79%] ··· =============== =========
                   dtype               
              --------------- ---------
               numpy.float32   200±2μs 
               numpy.float64   358±4μs 
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 1.02±0ms
numpy.float64 865±5μs
=============== ==========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 776±30μs
numpy.float64 1.00±0.05ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 115±0.8μs
numpy.float64 141±2μs
=============== ===========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 72.3±0.3μs
numpy.float64 73.2±0.5μs
=============== ============

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 63.2±0.5μs
numpy.float64 62.0±0.2μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 71.0±0.2μs
numpy.float64 73.4±1μs
=============== ============

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 116±2μs
numpy.float64 147±1μs
=============== =========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 5.02±0.1ms
numpy.float64 6.81±0.08ms
=============== =============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 141±0.3μs
numpy.float64 135±1μs
=============== ===========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 143±1μs
numpy.float64 133±0.2μs
=============== ===========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 17.0±1ms
numpy.float64 24.3±1ms
=============== ==========

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 112±0.5μs
numpy.float64 110±1μs
=============== ===========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 114±1μs
numpy.float64 110±1μs
=============== =========

[ 75.00%] · For numpy commit 5da4a8e1 (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython..
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 559±3μs
numpy.float64 510±2μs
=============== =========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.03±0.01ms
numpy.float64 875±10μs
=============== =============

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 795±20μs
numpy.float64 1.07±0.02ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 137±1μs
numpy.float64 157±2μs
=============== =========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 74.9±1μs
numpy.float64 75.0±1μs
=============== ==========

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 63.5±0.4μs
numpy.float64 63.3±0.3μs
=============== ============

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 72.3±0.3μs
numpy.float64 73.9±0.2μs
=============== ============

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 138±0.3μs
numpy.float64 156±1μs
=============== ===========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 5.40±0.1ms
numpy.float64 7.49±0.3ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 141±0.4μs
numpy.float64 134±0.5μs
=============== ===========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 141±0.4μs
numpy.float64 134±0.6μs
=============== ===========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 17.6±0.1ms
numpy.float64 24.6±0.8ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 112±0.4μs
numpy.float64 110±0.4μs
=============== ===========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 112±0.2μs
numpy.float64 109±0.6μs
=============== ===========

   before           after         ratio
 [5da4a8e1]       [350670fd]
 <master>         <einsum-dot>
  •     156±1μs          147±1μs     0.94  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float64'>)
    
  •     157±2μs          141±2μs     0.90  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
    
  •   138±0.3μs          116±2μs     0.84  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float32'>)
    
  •     137±1μs        115±0.8μs     0.84  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>)
    
  •     510±2μs          358±4μs     0.70  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
    
  •     559±3μs          200±2μs     0.36  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
    

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
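The `ratio` column in the AVX2 and NEON summaries is the "after" time divided by the "before" time, so a ratio of 0.73 means the run takes about 27% less time. The sketch below recomputes the ratios from the mean timings in those summary rows, ignoring the ± spread. The helper names `ratio` and `time_saved_percent` are hypothetical, and the short benchmark labels are abbreviations introduced here.

```python
# (before_us, after_us, label) copied from the AVX2 and NEON summary rows above.
ROWS = [
    (262, 191, "AVX2 contig_contig float64"),
    (156, 147, "NEON noncon_multiply float64"),
    (157, 141, "NEON multiply float64"),
    (138, 116, "NEON noncon_multiply float32"),
    (137, 115, "NEON multiply float32"),
    (510, 358, "NEON contig_contig float64"),
    (559, 200, "NEON contig_contig float32"),
]


def ratio(before, after):
    """ASV-style ratio: values below 1.0 mean the new commit is faster."""
    return after / before


def time_saved_percent(before, after):
    """Percentage of the original run time that was removed."""
    return (1.0 - after / before) * 100.0


for before, after, label in ROWS:
    print(f"{label:30s} ratio={ratio(before, after):.2f} "
          f"saved={time_saved_percent(before, after):.0f}%")
```

Note that "time saved" is not the same as "speedup": the float32 `contig_contig` case on NEON saves about 64% of the run time, which corresponds to running roughly 2.8 times faster.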

System Info

| | Arm | x86 |
|---|---|---|
| Hardware | KunPeng | |
| Processor | ARMv8 2.6 GHz, 8 processors | Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz |
| OS | Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64 | Windows Server 2008 R2 Enterprise |
| Compiler | gcc (GCC) 7.3.0 | MSVC14.06 |

mattip commented Dec 14, 2020

Nice, removed ~100 lines of code and increased performance. @seiko2plus could you take a look?

#elif EINSUM_USE_SSE2 && @float64@
__m128d a, accum_sse = _mm_setzero_pd();
#endif

NPY_EINSUM_DBG_PRINT1("@name@_sum_of_products_contig_contig_outstride0_two (%d)\n",

Could you enable optimization level 3 via the GCC attribute NPY_GCC_OPT_3?

@seiko2plus seiko2plus left a comment

LGTM, Thank you!

@mattip mattip merged commit 12d99b5 into numpy:master Dec 14, 2020

mattip commented Dec 14, 2020

Thanks @Qiyu8
