
8262498: More than 50% performance degradation of pow operator due to call with svml intrinsic after JDK-8261267 #42

Closed
DamonFool wants to merge 4 commits into vectorIntrinsics from JDK-8262498

Conversation

@DamonFool (Member) commented Feb 27, 2021

Hi all,

The performance of the Vector API's pow operator has regressed by more than 50% in micro-benchmarks such as the following (a minimal sketch of the measured kernel follows the list):

  Double128Vector.POW
  Double256Vector.POW
  DoubleMaxVector.POW
  DoubleScalar.POW
  Float128Vector.POW
  Float256Vector.POW
  FloatMaxVector.POW
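
For context, each of these benchmarks essentially measures an element-wise pow over arrays through the Vector API. The following is a minimal, hypothetical sketch of such a kernel, not the actual benchmark source; the class name, species choice, and array parameters are illustrative (run with --add-modules jdk.incubator.vector):

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class PowKernelSketch {
    // Illustrative species; the affected benchmarks cover various vector shapes
    // (64/128/256/512-bit and Max).
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_128;

    // c[i] = pow(a[i], b[i]) using the Vector API POW operator, which C2 may
    // compile down to an SVML vector call on x86.
    static void pow(double[] a, double[] b, double[] c) {
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
            va.lanewise(VectorOperators.POW, vb).intoArray(c, i);
        }
        for (; i < a.length; i++) {   // scalar tail
            c[i] = Math.pow(a[i], b[i]);
        }
    }
}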

Experiments show that SVML's pow intrinsics are slow except for the 512-bit variants, so SVML intrinsification of pow should be enabled only for 512-bit vectors and disabled for the other sizes.

Here is the effect of this fix.

                                           Before                    |                                              After
------------------------------------------------------------------------------------------------------------------------------------------
Benchmark            (size)   Mode  Cnt    Score     Error   Units   |   Benchmark            (size)   Mode  Cnt    Score    Error   Units
Double128Vector.POW    1024  thrpt    5   14.895 ±   0.070  ops/ms   |   Double128Vector.POW    1024  thrpt    5   31.897 ±  0.203  ops/ms
Double256Vector.POW    1024  thrpt    5   15.650 ±   1.274  ops/ms   |   Double256Vector.POW    1024  thrpt    5   36.690 ±  2.848  ops/ms
Double512Vector.POW    1024  thrpt    5  263.472 ±   0.062  ops/ms   |   Double512Vector.POW    1024  thrpt    5  261.681 ± 13.817  ops/ms
Double64Vector.POW     1024  thrpt    5   17.881 ±   0.244  ops/ms   |   Double64Vector.POW     1024  thrpt    5   17.734 ±  0.184  ops/ms
DoubleMaxVector.POW    1024  thrpt    5  263.613 ±   0.132  ops/ms   |   DoubleMaxVector.POW    1024  thrpt    5  263.085 ±  0.167  ops/ms
DoubleScalar.POW       1024  thrpt    5   45.268 ±   0.043  ops/ms   |   DoubleScalar.POW       1024  thrpt    5   45.220 ±  0.013  ops/ms
Float128Vector.POW     1024  thrpt    5   13.761 ±   0.092  ops/ms   |   Float128Vector.POW     1024  thrpt    5   28.578 ±  0.213  ops/ms
Float256Vector.POW     1024  thrpt    5   13.131 ±   0.101  ops/ms   |   Float256Vector.POW     1024  thrpt    5   29.414 ±  0.370  ops/ms
Float512Vector.POW     1024  thrpt    5  624.449 ± 267.160  ops/ms   |   Float512Vector.POW     1024  thrpt    5  649.519 ±  2.295  ops/ms
Float64Vector.POW      1024  thrpt    5   10.888 ±   0.069  ops/ms   |   Float64Vector.POW      1024  thrpt    5   26.376 ±  0.601  ops/ms
FloatMaxVector.POW     1024  thrpt    5  658.723 ±   2.445  ops/ms   |   FloatMaxVector.POW     1024  thrpt    5  663.723 ±  2.852  ops/ms
FloatScalar.POW        1024  thrpt    5   30.682 ±   0.095  ops/ms   |   FloatScalar.POW        1024  thrpt    5   30.678 ±  0.074  ops/ms

Thanks.
Best regards,
Jie


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Issue

  • JDK-8262498: More than 50% performance degradation of pow operator due to call with svml intrinsic after JDK-8261267

Reviewers

  • sviswa7

Download

$ git fetch https://git.openjdk.java.net/panama-vector pull/42/head:pull/42
$ git checkout pull/42

@DamonFool (Member, Author) commented Feb 27, 2021

/test


@openjdk bot commented Feb 27, 2021

@DamonFool you need to get approval to run the tests in tier1 for commits up until 429f214


@openjdk bot added the test-request label Feb 27, 2021
@bridgekeeper bot commented Feb 27, 2021

👋 Welcome back jiefu! A progress list of the required criteria for merging this PR into vectorIntrinsics will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


@openjdk bot added the rfr label Feb 27, 2021
@mlbridge bot commented Feb 27, 2021

Webrevs


@DamonFool (Member, Author) commented Feb 27, 2021

/test


@openjdk bot commented Feb 27, 2021

@DamonFool you need to get approval to run the tests in tier1 for commits up until 4ba408c


@sviswa7 (Collaborator) commented Mar 4, 2021

@DamonFool It would be good to fix this in the platform-specific part, as follows:
diff --git a/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp b/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
index bd2e58b..cc776bd 100644
--- a/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
+++ b/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
@@ -7352,12 +7352,6 @@ address generate_avx_ghash_processBlocks() {
       StubRoutines::_vector_atan_double64   = CAST_FROM_FN_PTR(address, __svml_atan1_ha_l9);
       StubRoutines::_vector_atan_double128  = CAST_FROM_FN_PTR(address, __svml_atan2_ha_l9);
       StubRoutines::_vector_atan_double256  = CAST_FROM_FN_PTR(address, __svml_atan4_ha_l9);
-      StubRoutines::_vector_pow_float64     = CAST_FROM_FN_PTR(address, __svml_powf4_ha_l9);
-      StubRoutines::_vector_pow_float128    = CAST_FROM_FN_PTR(address, __svml_powf4_ha_l9);
-      StubRoutines::_vector_pow_float256    = CAST_FROM_FN_PTR(address, __svml_powf8_ha_l9);
-      StubRoutines::_vector_pow_double64    = CAST_FROM_FN_PTR(address, __svml_pow1_ha_l9);
-      StubRoutines::_vector_pow_double128   = CAST_FROM_FN_PTR(address, __svml_pow2_ha_l9);
-      StubRoutines::_vector_pow_double256   = CAST_FROM_FN_PTR(address, __svml_pow4_ha_l9);
       StubRoutines::_vector_hypot_float64   = CAST_FROM_FN_PTR(address, __svml_hypotf4_ha_l9);
       StubRoutines::_vector_hypot_float128  = CAST_FROM_FN_PTR(address, __svml_hypotf4_ha_l9);
       StubRoutines::_vector_hypot_float256  = CAST_FROM_FN_PTR(address, __svml_hypotf8_ha_l9);
@@ -7461,12 +7455,6 @@ address generate_avx_ghash_processBlocks() {
       StubRoutines::_vector_atan_double64   = CAST_FROM_FN_PTR(address, __svml_atan1_ha_e9);
       StubRoutines::_vector_atan_double128  = CAST_FROM_FN_PTR(address, __svml_atan2_ha_e9);
       StubRoutines::_vector_atan_double256  = CAST_FROM_FN_PTR(address, __svml_atan4_ha_e9);
-      StubRoutines::_vector_pow_float64     = CAST_FROM_FN_PTR(address, __svml_powf4_ha_e9);
-      StubRoutines::_vector_pow_float128    = CAST_FROM_FN_PTR(address, __svml_powf4_ha_e9);
-      StubRoutines::_vector_pow_float256    = CAST_FROM_FN_PTR(address, __svml_powf8_ha_e9);
-      StubRoutines::_vector_pow_double64    = CAST_FROM_FN_PTR(address, __svml_pow1_ha_e9);
-      StubRoutines::_vector_pow_double128   = CAST_FROM_FN_PTR(address, __svml_pow2_ha_e9);
-      StubRoutines::_vector_pow_double256   = CAST_FROM_FN_PTR(address, __svml_pow4_ha_e9);
       StubRoutines::_vector_hypot_float64   = CAST_FROM_FN_PTR(address, __svml_hypotf4_ha_e9);
       StubRoutines::_vector_hypot_float128  = CAST_FROM_FN_PTR(address, __svml_hypotf4_ha_e9);
       StubRoutines::_vector_hypot_float256  = CAST_FROM_FN_PTR(address, __svml_hypotf8_ha_e9);
@@ -7551,10 +7539,6 @@ address generate_avx_ghash_processBlocks() {
       StubRoutines::_vector_hypot_float128  = CAST_FROM_FN_PTR(address, __svml_hypotf4_ha_ex);
       StubRoutines::_vector_hypot_double64  = CAST_FROM_FN_PTR(address, __svml_hypot1_ha_ex);
       StubRoutines::_vector_hypot_double128 = CAST_FROM_FN_PTR(address, __svml_hypot2_ha_ex);
-      StubRoutines::_vector_pow_float64     = CAST_FROM_FN_PTR(address, __svml_powf4_ha_ex);
-      StubRoutines::_vector_pow_float128    = CAST_FROM_FN_PTR(address, __svml_powf4_ha_ex);
-      StubRoutines::_vector_pow_double64    = CAST_FROM_FN_PTR(address, __svml_pow1_ha_ex);
-      StubRoutines::_vector_pow_double128   = CAST_FROM_FN_PTR(address, __svml_pow2_ha_ex);
       StubRoutines::_vector_cbrt_float64    = CAST_FROM_FN_PTR(address, __svml_cbrtf4_ha_ex);
       StubRoutines::_vector_cbrt_float128   = CAST_FROM_FN_PTR(address, __svml_cbrtf4_ha_ex);
       StubRoutines::_vector_cbrt_double64   = CAST_FROM_FN_PTR(address, __svml_cbrt1_ha_ex);

@DamonFool (Member, Author) commented Mar 4, 2021

@DamonFool It would be good to fix this in the platform-specific part, as follows: […]
Thanks @sviswa7 for your review.

Good suggestion.
I'll update the patch and do more testing.
Thanks.


@DamonFool (Member, Author) commented Mar 4, 2021

@DamonFool It would be good to fix this in the platform-specific part, as follows: […]

Hi @sviswa7 ,

Updated, and the performance test passed.
Thanks for your help.


@sviswa7 approved these changes Mar 4, 2021
@openjdk bot commented Mar 4, 2021

@DamonFool This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8262498: More than 50% performance degradation of pow operator due to call with svml intrinsic after JDK-8261267

Reviewed-by: sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 2 new commits pushed to the vectorIntrinsics branch:

  • 97fef5c: 8262492: Add cast nodes between float types implementation for Arm SVE
  • 721b5f5: 8261108: Add cast nodes from integer types to float types implementation for Arm SVE

Please see this link for an up-to-date comparison between the source branch of this pull request and the vectorIntrinsics branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@sviswa7) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).


@openjdk bot added the ready label Mar 4, 2021
@DamonFool (Member, Author) commented Mar 4, 2021

/integrate


@openjdk bot added the sponsor label Mar 4, 2021
@openjdk bot commented Mar 4, 2021

@DamonFool Your change (at version 0d3dbeb) is now ready to be sponsored by a Committer.


@DamonFool (Member, Author) commented Mar 4, 2021

Hi @sviswa7, could you please sponsor it?
Thanks.


@sviswa7 (Collaborator) commented Mar 4, 2021

/sponsor


@openjdk bot closed this Mar 4, 2021
@openjdk bot commented Mar 4, 2021

@sviswa7 @DamonFool Since your change was applied there have been 2 commits pushed to the vectorIntrinsics branch:

  • 97fef5c: 8262492: Add cast nodes between float types implementation for Arm SVE
  • 721b5f5: 8261108: Add cast nodes from integer types to float types implementation for Arm SVE

Your commit was automatically rebased without conflicts.

Pushed as commit 0523be6.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.


@DamonFool deleted the JDK-8262498 branch Mar 4, 2021