8297359: RISC-V: improve performance of floating Max Min intrinsics #11276
Please review this change.
It improves the performance of the Math.min/max intrinsics for floats and doubles.
The main issue in these intrinsics is the requirement to return NaN if either argument is NaN. On RISC-V, fmin/fmax return NaN only if both source registers are NaN (quiet NaN).
That requires additional logic to handle the case where only one of the sources is NaN.
Here the post-check with flt (floating-point less-than comparison) and flag analysis is replaced with a pre-check. The pre-check is done by fadd-ing the sources into dst and checking dst for NaN (with fclass).
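To make the semantic gap concrete, here is a minimal Java sketch. The class and method names are hypothetical and not part of this change. `riscvFmin` is only a software model of the fmin behaviour described above, and the pre-check tests each operand directly with `Double.isNaN`, which is a simplification rather than a model of the fadd/fclass sequence.

```java
class FpMinSketch {
    // Hypothetical software model of RISC-V fmin.d:
    // the result is NaN only when both inputs are NaN.
    static double riscvFmin(double a, double b) {
        if (Double.isNaN(a)) return b;   // b may itself be NaN
        if (Double.isNaN(b)) return a;
        return Math.min(a, b);           // neither input is NaN here
    }

    // Java Math.min semantics built from the model plus a NaN pre-check.
    static double javaMinViaPrecheck(double a, double b) {
        if (Double.isNaN(a) || Double.isNaN(b)) {
            return a + b;                // addition propagates the NaN
        }
        return riscvFmin(a, b);
    }

    public static void main(String[] args) {
        System.out.println(riscvFmin(1.0, Double.NaN));          // prints 1.0
        System.out.println(javaMinViaPrecheck(1.0, Double.NaN)); // prints NaN
        System.out.println(Math.min(1.0, Double.NaN));           // prints NaN
        System.out.println(javaMinViaPrecheck(1.0, 2.0));        // prints 1.0
    }
}
```

The point of doing the check first is that the common case (no NaN operand) falls straight through to a single min/max operation.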
Results on the T-Head C910:
before
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 54023.827 ± 268.645 ns/op
FpMinMaxIntrinsics.dMin avgt 25 54309.850 ± 323.551 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 42192.140 ± 12.114 ns/op
FpMinMaxIntrinsics.fMax avgt 25 53797.657 ± 15.816 ns/op
FpMinMaxIntrinsics.fMin avgt 25 54135.710 ± 313.185 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 42196.156 ± 13.424 ns/op
MaxMinOptimizeTest.dAdd avgt 25 650.810 ± 169.998 us/op
MaxMinOptimizeTest.dMax avgt 25 4561.967 ± 40.367 us/op
MaxMinOptimizeTest.dMin avgt 25 4589.100 ± 75.854 us/op
MaxMinOptimizeTest.dMul avgt 25 759.821 ± 240.092 us/op
MaxMinOptimizeTest.fAdd avgt 25 300.137 ± 13.495 us/op
MaxMinOptimizeTest.fMax avgt 25 4348.885 ± 20.061 us/op
MaxMinOptimizeTest.fMin avgt 25 4372.799 ± 27.296 us/op
MaxMinOptimizeTest.fMul avgt 25 304.024 ± 12.120 us/op
after
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 10545.196 ± 140.137 ns/op
FpMinMaxIntrinsics.dMin avgt 25 10454.525 ± 9.972 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 3104.703 ± 0.892 ns/op
FpMinMaxIntrinsics.fMax avgt 25 10449.709 ± 7.284 ns/op
FpMinMaxIntrinsics.fMin avgt 25 10445.261 ± 7.206 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 3104.769 ± 0.951 ns/op
MaxMinOptimizeTest.dAdd avgt 25 487.769 ± 170.711 us/op
MaxMinOptimizeTest.dMax avgt 25 929.394 ± 158.697 us/op
MaxMinOptimizeTest.dMin avgt 25 864.230 ± 284.794 us/op
MaxMinOptimizeTest.dMul avgt 25 894.116 ± 342.550 us/op
MaxMinOptimizeTest.fAdd avgt 25 284.664 ± 1.446 us/op
MaxMinOptimizeTest.fMax avgt 25 384.388 ± 15.004 us/op
MaxMinOptimizeTest.fMin avgt 25 371.952 ± 15.295 us/op
MaxMinOptimizeTest.fMul avgt 25 305.226 ± 12.467 us/op
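This is a significant improvement: roughly 5x on the FpMinMaxIntrinsics dMax/dMin/fMax/fMin benchmarks and about 13.6x on dMinReduce/fMinReduce.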
On the HiFive Unmatched (U74) the improvement is less significant:
before
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 30219.666 ± 12.878 ns/op
FpMinMaxIntrinsics.dMin avgt 25 30242.249 ± 31.374 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 15394.622 ± 2.803 ns/op
FpMinMaxIntrinsics.fMax avgt 25 30150.114 ± 22.421 ns/op
FpMinMaxIntrinsics.fMin avgt 25 30149.752 ± 20.813 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 15396.402 ± 4.251 ns/op
MaxMinOptimizeTest.dAdd avgt 25 1143.582 ± 4.444 us/op
MaxMinOptimizeTest.dMax avgt 25 2556.317 ± 3.795 us/op
MaxMinOptimizeTest.dMin avgt 25 2556.569 ± 2.274 us/op
MaxMinOptimizeTest.dMul avgt 25 1142.769 ± 1.593 us/op
MaxMinOptimizeTest.fAdd avgt 25 748.688 ± 7.342 us/op
MaxMinOptimizeTest.fMax avgt 25 2280.381 ± 1.535 us/op
MaxMinOptimizeTest.fMin avgt 25 2280.760 ± 1.532 us/op
MaxMinOptimizeTest.fMul avgt 25 748.991 ± 7.261 us/op
after:
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 27723.791 ± 22.784 ns/op
FpMinMaxIntrinsics.dMin avgt 25 27760.799 ± 45.411 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 12875.949 ± 2.829 ns/op
FpMinMaxIntrinsics.fMax avgt 25 25992.753 ± 23.788 ns/op
FpMinMaxIntrinsics.fMin avgt 25 25994.554 ± 32.060 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 11200.737 ± 2.169 ns/op
MaxMinOptimizeTest.dAdd avgt 25 1144.128 ± 4.371 us/op
MaxMinOptimizeTest.dMax avgt 25 1968.145 ± 2.346 us/op
MaxMinOptimizeTest.dMin avgt 25 1970.249 ± 4.712 us/op
MaxMinOptimizeTest.dMul avgt 25 1143.356 ± 2.203 us/op
MaxMinOptimizeTest.fAdd avgt 25 748.634 ± 7.229 us/op
MaxMinOptimizeTest.fMax avgt 25 1523.719 ± 0.570 us/op
MaxMinOptimizeTest.fMin avgt 25 1524.534 ± 1.109 us/op
MaxMinOptimizeTest.fMul avgt 25 748.643 ± 7.291 us/op
fAdd/dAdd and fMul/dMul are unaffected, likely because the benchmark bodies
private double dAddBench(double a, double b) {
    return Math.max(a, b) + Math.min(a, b);
}
private double dMulBench(double a, double b) {
    return Math.max(a, b) * Math.min(a, b);
}
may get reduced to just a + b and a * b respectively, without actually using min/max.
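The value-level equivalence behind that reduction can be checked with a short sketch. The class name, the `sameAsPlainOps` helper, and the sample values are assumptions made for illustration. The check only shows that the results agree for these inputs; it does not show whether the JIT actually performs the fold.

```java
class MinMaxFoldSketch {
    // Same expressions as the benchmark bodies, made static for this sketch.
    static double dAddBench(double a, double b) {
        return Math.max(a, b) + Math.min(a, b);
    }

    static double dMulBench(double a, double b) {
        return Math.max(a, b) * Math.min(a, b);
    }

    // Double.compare treats all NaNs as equal and distinguishes -0.0 from 0.0.
    static boolean sameAsPlainOps(double a, double b) {
        return Double.compare(dAddBench(a, b), a + b) == 0
            && Double.compare(dMulBench(a, b), a * b) == 0;
    }

    public static void main(String[] args) {
        double[] samples = {
            0.0, -0.0, 1.5, -2.25, Double.MAX_VALUE, Double.MIN_VALUE,
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN
        };
        boolean allSame = true;
        for (double a : samples) {
            for (double b : samples) {
                allSame &= sameAsPlainOps(a, b);
            }
        }
        System.out.println("all equivalent: " + allSame);
    }
}
```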
Testing: tier1/tier2 in progress; I will update this as soon as it finishes.
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/11276/head:pull/11276
$ git checkout pull/11276
Update a local copy of the PR:
$ git checkout pull/11276
$ git pull https://git.openjdk.org/jdk pull/11276/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 11276
View PR using the GUI difftool:
$ git pr show -t 11276
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/11276.diff