@jatin-bhateja
Member

@jatin-bhateja jatin-bhateja commented Aug 4, 2022

Hi All,

This patch extends the conversion optimizations added with JDK-8287835 to optimize the following floating-point to integral conversions for X86 AVX2 targets:

  • D2I, D2S, D2B, F2I, F2S, F2B

In addition, it optimizes the following wide-vector (64-byte) double to integer and sub-word type conversions for AVX512 targets that do not support the AVX512DQ feature:

  • D2I, D2S, D2B
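For reference, the scalar behaviour of these conversions can be sketched in plain Java. This is only an illustration, assuming the vector casts follow the same narrowing rules as scalar Java casts (NaN becomes 0, out-of-range values saturate to the int range, and sub-word results are then truncated from the int). The class and method names are hypothetical and are not part of the patch.

```java
// Scalar reference for D2I, D2S, D2B, F2I, F2S, F2B using ordinary Java casts.
// Names are illustrative only; nothing here is taken from the patch itself.
public class FpToIntCastSemantics {
    static int d2i(double d) { return (int) d; }     // NaN -> 0, saturates to the int range
    static short d2s(double d) { return (short) d; } // double -> int, then low 16 bits
    static byte d2b(double d) { return (byte) d; }   // double -> int, then low 8 bits
    static int f2i(float f) { return (int) f; }
    static short f2s(float f) { return (short) f; }
    static byte f2b(float f) { return (byte) f; }

    public static void main(String[] args) {
        double[] inputs = {1.9, -1.9, 300.7, 1e10, -1e10, Double.NaN};
        for (double d : inputs) {
            System.out.printf("%s -> int %d, short %d, byte %d%n",
                    d, d2i(d), d2s(d), d2b(d));
        }
    }
}
```

The special cases (NaN and out-of-range inputs) are the ones a vectorized sequence has to fix up explicitly when the hardware conversion instruction returns a different sentinel value.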

The following are the JMH microbenchmark results with and without the patch.

System configuration: 40C 2S Icelake server (Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz)

| BENCHMARK | SIZE | BASELINE (ops/ms) | WITHOPT (ops/ms) | PERF GAIN FACTOR |
| --- | --- | --- | --- | --- |
| VectorFPtoIntCastOperations.microDouble128ToByte128 | 1024 | 90.603 | 92.797 | 1.024215534 |
| VectorFPtoIntCastOperations.microDouble128ToByte256 | 1024 | 81.909 | 82.3 | 1.00477359 |
| VectorFPtoIntCastOperations.microDouble128ToByte512 | 1024 | 26.181 | 26.244 | 1.002406325 |
| VectorFPtoIntCastOperations.microDouble128ToInteger128 | 1024 | 90.74 | 2537.958 | 27.96956138 |
| VectorFPtoIntCastOperations.microDouble128ToInteger256 | 1024 | 81.586 | 2429.599 | 29.7796068 |
| VectorFPtoIntCastOperations.microDouble128ToInteger512 | 1024 | 19.406 | 19.61 | 1.010512213 |
| VectorFPtoIntCastOperations.microDouble128ToLong128 | 1024 | 91.723 | 90.754 | 0.989435583 |
| VectorFPtoIntCastOperations.microDouble128ToShort128 | 1024 | 91.766 | 1984.577 | 21.62649565 |
| VectorFPtoIntCastOperations.microDouble128ToShort256 | 1024 | 81.949 | 1940.599 | 23.68056962 |
| VectorFPtoIntCastOperations.microDouble128ToShort512 | 1024 | 16.468 | 16.56 | 1.005586592 |
| VectorFPtoIntCastOperations.microDouble256ToByte128 | 1024 | 163.331 | 3018.351 | 18.479964 |
| VectorFPtoIntCastOperations.microDouble256ToByte256 | 1024 | 148.878 | 3082.034 | 20.70174237 |
| VectorFPtoIntCastOperations.microDouble256ToByte512 | 1024 | 50.108 | 51.629 | 1.030354434 |
| VectorFPtoIntCastOperations.microDouble256ToInteger128 | 1024 | 159.805 | 4619.421 | 28.90661118 |
| VectorFPtoIntCastOperations.microDouble256ToInteger256 | 1024 | 143.876 | 4649.642 | 32.31700909 |
| VectorFPtoIntCastOperations.microDouble256ToInteger512 | 1024 | 38.127 | 38.188 | 1.001599916 |
| VectorFPtoIntCastOperations.microDouble256ToLong128 | 1024 | 160.322 | 162.442 | 1.013223388 |
| VectorFPtoIntCastOperations.microDouble256ToLong256 | 1024 | 141.252 | 143.01 | 1.012445841 |
| VectorFPtoIntCastOperations.microDouble256ToShort128 | 1024 | 157.717 | 3757.471 | 23.82413437 |
| VectorFPtoIntCastOperations.microDouble256ToShort256 | 1024 | 143.876 | 3830.971 | 26.62689399 |
| VectorFPtoIntCastOperations.microDouble256ToShort512 | 1024 | 32.061 | 32.911 | 1.026511962 |
| VectorFPtoIntCastOperations.microFloat128ToByte128 | 1024 | 146.599 | 4002.967 | 27.30555461 |
| VectorFPtoIntCastOperations.microFloat128ToByte256 | 1024 | 136.99 | 3938.799 | 28.75245638 |
| VectorFPtoIntCastOperations.microFloat128ToByte512 | 1024 | 51.561 | 50.284 | 0.975233219 |
| VectorFPtoIntCastOperations.microFloat128ToInteger128 | 1024 | 5933.565 | 5361.472 | 0.903583596 |
| VectorFPtoIntCastOperations.microFloat128ToInteger256 | 1024 | 5079.564 | 5062.046 | 0.996551279 |
| VectorFPtoIntCastOperations.microFloat128ToInteger512 | 1024 | 37.101 | 38.419 | 1.035524649 |
| VectorFPtoIntCastOperations.microFloat128ToLong128 | 1024 | 145.863 | 145.362 | 0.99656527 |
| VectorFPtoIntCastOperations.microFloat128ToLong256 | 1024 | 131.159 | 133.154 | 1.015210546 |
| VectorFPtoIntCastOperations.microFloat128ToShort128 | 1024 | 145.966 | 4150.039 | 28.4315457 |
| VectorFPtoIntCastOperations.microFloat128ToShort256 | 1024 | 134.703 | 4566.589 | 33.90116775 |
| VectorFPtoIntCastOperations.microFloat128ToShort512 | 1024 | 31.878 | 30.867 | 0.968285338 |
| VectorFPtoIntCastOperations.microFloat256ToByte128 | 1024 | 237.841 | 6292.051 | 26.4548627 |
| VectorFPtoIntCastOperations.microFloat256ToByte256 | 1024 | 222.041 | 6292.748 | 28.34047766 |
| VectorFPtoIntCastOperations.microFloat256ToByte512 | 1024 | 92.073 | 88.981 | 0.966417951 |
| VectorFPtoIntCastOperations.microFloat256ToInteger128 | 1024 | 11471.121 | 10269.636 | 0.895260019 |
| VectorFPtoIntCastOperations.microFloat256ToInteger256 | 1024 | 10729.816 | 10105.92 | 0.941853989 |
| VectorFPtoIntCastOperations.microFloat256ToInteger512 | 1024 | 68.328 | 70.005 | 1.024543379 |
| VectorFPtoIntCastOperations.microFloat256ToLong128 | 1024 | 247.101 | 248.571 | 1.005948984 |
| VectorFPtoIntCastOperations.microFloat256ToLong256 | 1024 | 225.74 | 223.987 | 0.992234429 |
| VectorFPtoIntCastOperations.microFloat256ToLong512 | 1024 | 76.39 | 76.187 | 0.997342584 |
| VectorFPtoIntCastOperations.microFloat256ToShort128 | 1024 | 233.196 | 8202.179 | 35.17289748 |
| VectorFPtoIntCastOperations.microFloat256ToShort256 | 1024 | 220.75 | 7781.073 | 35.24834881 |
| VectorFPtoIntCastOperations.microFloat256ToShort512 | 1024 | 58.143 | 55.633 | 0.956830573 |
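
The gain factor in the last column appears to be the WITHOPT throughput divided by the BASELINE throughput, so values above 1 are speedups and values below 1 are slowdowns. A small sketch of that arithmetic, with a hypothetical class and method name, using the microDouble128ToInteger128 row as input:

```java
// Recomputes the "PERF GAIN FACTOR" column as WITHOPT / BASELINE.
// The class and method names are illustrative; the numbers come from the results above.
public class PerfGain {
    static double gainFactor(double baselineOpsPerMs, double withOptOpsPerMs) {
        return withOptOpsPerMs / baselineOpsPerMs;
    }

    public static void main(String[] args) {
        // microDouble128ToInteger128: baseline 90.74 ops/ms, with patch 2537.958 ops/ms
        System.out.println(gainFactor(90.74, 2537.958));
    }
}
```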

Kindly review and share your feedback.

Best Regards,
Jatin


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9748/head:pull/9748
$ git checkout pull/9748

Update a local copy of the PR:
$ git checkout pull/9748
$ git pull https://git.openjdk.org/jdk pull/9748/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9748

View PR using the GUI difftool:
$ git pr show -t 9748

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9748.diff

@jatin-bhateja
Member Author

/label add hotspot-compiler-dev

@bridgekeeper

bridgekeeper bot commented Aug 4, 2022

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Aug 4, 2022
@openjdk

openjdk bot commented Aug 4, 2022

@jatin-bhateja
The hotspot-compiler label was successfully added.

@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 5, 2022
@mlbridge

mlbridge bot commented Aug 5, 2022

@TobiHartmann
Member

I can run some testing in our system once you have resolved the merge conflicts.

@TobiHartmann
Member

Testing in our system did not show any failures, but I see SIGILL failures in the pre-submit testing.

@sviswa7

sviswa7 commented Sep 17, 2022

Could you please enable compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms?
Currently it is only run on AVX512DQ platforms.

@jatin-bhateja
Member Author

> Could you please enable the compiler/vectorapi/VectorFPtoIntCastTest.java for AVX2 platforms? Currently they are only run for AVX512DQ platforms.

I have added the missing casting cases for AVX/AVX2 and AVX512 targets to the existing comprehensive casting test.

      return 0;
    case Op_VectorCastF2X: // fall through
    case Op_VectorCastD2X: {
      return is_subword_type(ety) ? 75 : 70;
Contributor

Add comment here and in other cases explaining numbers. Is it size of instructions or elements or something?

Member Author

@jatin-bhateja jatin-bhateja Sep 20, 2022

The value now matches the one for RoundV[FD] IR nodes. Currently, it is a rudimentary heuristic, based on the emitted code size of complex IR nodes, that is used by the unroll policy. The idea is to constrain the unrolling factor and prevent generating bloated loop bodies.
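
To illustrate the idea, here is a toy model of how a per-node size estimate can limit unrolling. It is not HotSpot code: the class, the method names, the limit of 60 and the cap of 16 are all hypothetical, and only the estimate of 70 is taken from the snippet under review.

```java
// Toy model of a size-based unroll policy; every name and limit here is hypothetical.
public class UnrollEstimateSketch {
    static final int MAX_FACTOR = 16; // arbitrary cap, for the sketch only

    // Simple nodes count as 1 each; each complex cast node adds its size estimate.
    static int estimatedBodySize(int simpleNodes, int castNodes, int castEstimate) {
        return simpleNodes + castNodes * castEstimate;
    }

    // Keep doubling the unroll factor while the unrolled body still fits the limit.
    static int maxUnrollFactor(int bodySize, int limit) {
        int factor = 1;
        while (factor < MAX_FACTOR && (long) bodySize * factor * 2 <= limit) {
            factor *= 2;
        }
        return factor;
    }

    public static void main(String[] args) {
        int limit = 60; // hypothetical unroll limit
        System.out.println(maxUnrollFactor(estimatedBodySize(10, 0, 70), limit));
        System.out.println(maxUnrollFactor(estimatedBodySize(10, 1, 70), limit));
    }
}
```

In this model a loop body with ten simple nodes can be unrolled four times, while adding a single cast node with an estimate of 70 pushes the body over the limit and prevents unrolling altogether.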

Contributor

Please add this information/clarification to this method's comment at line 186.

Contributor

@vnkozlov vnkozlov left a comment

Good. I will test it.

You need a second review.

Contributor

@vnkozlov vnkozlov left a comment

My testing passed.

@openjdk

openjdk bot commented Sep 21, 2022

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8288043: Optimize FP to word/sub-word integral type conversion on X86 AVX2 platforms

Reviewed-by: kvn, sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 19 new commits pushed to the master branch:

  • 3a980b9: 8295168: Remove superfluous period in @throws tag description
  • 9bb932c: 8295154: Documentation for RemoteExecutionControl.invoke(Method) inherits non-existent documentation
  • 945950d: 8295069: [PPC64] Performance regression after JDK-8290025
  • d362e16: 8294689: The SA transported_core.html file needs quite a bit of work
  • 07946aa: 8289552: Make intrinsic conversions between bit representations of half precision values and floats
  • 2586b1a: 8295155: Incorrect javadoc of java.base module
  • e1a77cf: 8295163: Remove old hsdis Makefile
  • 3c7ae12: 8294821: Class load improvement for AES crypto engine
  • 619cd82: 8294702: BufferedInputStream uses undefined value range for markpos
  • 9d0009e: 6777156: GTK L&F: JFileChooser can jump beyond root directory in combobox and selection textarea.
  • ... and 9 more: https://git.openjdk.org/jdk/compare/9d116ec147a3182a9c831ffdce02c98da8c5031d...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 21, 2022

@sviswa7 sviswa7 left a comment

I am still going through the c2_MacroAssembler_x86.cpp changes. Hopefully I will finish the review early next week.

      return 0;
    case Op_VectorCastF2X: // fall through
    case Op_VectorCastD2X: {
      return is_subword_type(ety) ? 35 : 30;

This needs to be more selective. Not all F2X and D2X cases need a lot of instructions; e.g., F2D and D2F are a single instruction.
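
One possible shape for a more selective estimate, sketched in Java rather than in the HotSpot C++ sources: return no extra cost when the destination element type is itself floating point (F2D, D2F), and keep the larger estimates only for integral destinations. The enum, class and method names are hypothetical; only the values 35 and 30 are taken from the snippet above, and this is not necessarily how the patch resolved the comment.

```java
// Sketch of a destination-type-aware size estimate for FP source casts.
// All names are hypothetical; 35 and 30 are the values from the reviewed snippet.
public class SelectiveCastEstimate {
    enum ElemType { BYTE, SHORT, INT, LONG, FLOAT, DOUBLE }

    static boolean isFloatingPoint(ElemType t) {
        return t == ElemType.FLOAT || t == ElemType.DOUBLE;
    }

    static boolean isSubword(ElemType t) {
        return t == ElemType.BYTE || t == ElemType.SHORT;
    }

    // Extra size estimate for a cast from float/double to the given destination type.
    static int castFromFpEstimate(ElemType dst) {
        if (isFloatingPoint(dst)) {
            return 0; // F2D and D2F are single instructions, so no extra cost
        }
        return isSubword(dst) ? 35 : 30;
    }

    public static void main(String[] args) {
        for (ElemType t : ElemType.values()) {
            System.out.println(t + " -> " + castFromFpEstimate(t));
        }
    }
}
```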


void C2_MacroAssembler::vector_castF2L_evex(XMMRegister dst, XMMRegister src, XMMRegister xtmp1, XMMRegister xtmp2,
                                            KRegister ktmp1, KRegister ktmp2, AddressLiteral double_sign_flip,
                                            Register rscratch, int vec_enc) {

Need an assert here:
assert(rscratch != noreg || always_reachable(double_sign_flip), "missing");

Member Author

Hi @sviswa7, the assertions are part of the leaf-level macro assembly routine, which is vector_cast_float_to_long_special_cases_evex in this case.

@sviswa7

sviswa7 commented Oct 7, 2022

@jatin-bhateja The rest of the changes look good to me. Mainly, vector_op_pre_select_sz_estimate() needs to be corrected.

@vnkozlov
Contributor

@jatin-bhateja, please merge the latest JDK and I will start re-testing.

@jatin-bhateja
Member Author

> @jatin-bhateja, please merge latest JDK and I will start re-testing.

Hi @kvn, kindly run the regression tests on the changes.

@vnkozlov
Contributor

I started new testing

Contributor

@vnkozlov vnkozlov left a comment

My testing passed.

@jatin-bhateja
Member Author

Thanks @sviswa7 and @vnkozlov.

@jatin-bhateja
Member Author

/integrate

@openjdk

openjdk bot commented Oct 12, 2022

Going to push as commit 2ceb80c.
Since your change was applied there have been 21 commits pushed to the master branch:

  • 703a6ef: 8283699: Improve the peephole mechanism of hotspot
  • 94a9b04: 8295013: OopStorage should derive from CHeapObjBase
  • 3a980b9: 8295168: Remove superfluous period in @throws tag description
  • 9bb932c: 8295154: Documentation for RemoteExecutionControl.invoke(Method) inherits non-existent documentation
  • 945950d: 8295069: [PPC64] Performance regression after JDK-8290025
  • d362e16: 8294689: The SA transported_core.html file needs quite a bit of work
  • 07946aa: 8289552: Make intrinsic conversions between bit representations of half precision values and floats
  • 2586b1a: 8295155: Incorrect javadoc of java.base module
  • e1a77cf: 8295163: Remove old hsdis Makefile
  • 3c7ae12: 8294821: Class load improvement for AES crypto engine
  • ... and 11 more: https://git.openjdk.org/jdk/compare/9d116ec147a3182a9c831ffdce02c98da8c5031d...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 12, 2022
@openjdk openjdk bot closed this Oct 12, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 12, 2022
@openjdk

openjdk bot commented Oct 12, 2022

@jatin-bhateja Pushed as commit 2ceb80c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@jatin-bhateja jatin-bhateja deleted the JDK-8288043 branch January 20, 2023 21:26