8376891: [VectorAlgorithms] add more if-conversion benchmarks and tests #29522

eme64 wants to merge 20 commits into openjdk:master
Conversation
👋 Welcome back epeter! A progress list of the required criteria for merging this PR into `master` will be added to the body of your pull request.
@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 24 new commits pushed to the `master` branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the `master` branch, type /integrate in a new comment.
Result on AVX512 laptop:

Interesting: the scalar implementation can beat the vectorized one, if the branch probability is extreme enough. And when speculating on one branch being all-true, we even get:

If the branch probability is high enough, we get the combined benefit of branch prediction and vectorization!
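The "speculating on one branch all-true" idea can be modeled in plain scalar Java, with no Vector API required. This is only a sketch: `WIDTH` and all names here are hypothetical stand-ins, not the benchmark's actual code.

```java
public class SpeculateAllTrue {
    static final int WIDTH = 4; // hypothetical stand-in for a vector species length

    // Filter a into r; for each block, first check whether all WIDTH elements
    // pass (the common case when branch probability is high), and bulk-copy if so.
    static int filter(int[] a, int threshold, int[] r) {
        int j = 0, i = 0;
        int bound = a.length - (a.length % WIDTH);
        for (; i < bound; i += WIDTH) {
            boolean allTrue = true;
            for (int k = 0; k < WIDTH; k++) {
                if (a[i + k] < threshold) { allTrue = false; break; }
            }
            if (allTrue) {
                // fast path: whole block passes, no per-element branch
                System.arraycopy(a, i, r, j, WIDTH);
                j += WIDTH;
            } else {
                for (int k = 0; k < WIDTH; k++) {
                    if (a[i + k] >= threshold) { r[j++] = a[i + k]; }
                }
            }
        }
        for (; i < a.length; i++) { // scalar tail
            if (a[i] >= threshold) { r[j++] = a[i]; }
        }
        return j;
    }

    public static void main(String[] args) {
        int[] a = {5, 6, 7, 8, 1, 9, 2, 3, 7};
        int[] r = new int[a.length];
        System.out.println(filter(a, 4, r)); // 6
    }
}
```

When most blocks pass entirely, the bulk-copy fast path removes the per-element branch from the hot path, which is the same effect the vectorized `mask.allTrue()` speculation exploits.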
```java
int filterI_range = 1000_000;
aI_filterI = new int[size];
Arrays.setAll(aI, i -> random.nextInt(filterI_range));
```
```diff
- Arrays.setAll(aI, i -> random.nextInt(filterI_range));
+ Arrays.setAll(aI, i -> random.nextInt(aI_filterI));
```
Why are you suggesting this?
I think it is correct as is. The goal is to filter "in/out" an element with probability branchProbability. So we need to use the same range filterI_range for picking the elements and for eI_filterI.
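That coupling can be sanity-checked with a small scalar sketch (the names `branchProbability`, `filterI_range` and the helper are hypothetical; the real benchmark fields may differ): drawing elements uniformly from [0, filterI_range) and filtering against a threshold of branchProbability * filterI_range yields a pass rate of approximately branchProbability.

```java
import java.util.Random;

public class BranchProbabilitySketch {
    // Fraction of elements passing the filter, for a given probability knob.
    static double observedPassRate(double branchProbability, int range, int size, long seed) {
        // Threshold chosen so that a uniform value from [0, range)
        // falls below it with probability branchProbability.
        int threshold = (int) (branchProbability * range);
        Random random = new Random(seed);
        int hits = 0;
        for (int i = 0; i < size; i++) {
            // Same range for picking the elements and for the threshold.
            if (random.nextInt(range) < threshold) { hits++; }
        }
        return (double) hits / size;
    }

    public static void main(String[] args) {
        System.out.println(observedPassRate(0.3, 1_000_000, 1_000_000, 42));
    }
}
```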
I'll have to run it again later to get less noisy results. But it looks promising (AVX512 laptop): Thanks @PaulSandoz and @rgiulietti for bringing this one up offline :)
Here are some benchmark results, run with a command like this:

filterI

On my AVX512 laptop, the results are a bit noisy. And on 2 other x64 servers, the results are much cleaner. Comment:

Comment:

lowerCaseB

On an x64 machine (the other two machines look very similar): Comment:

pieceWise2FunctionF

On my 3 x64 machines it looks like this: Comment:

conditionalSumB

Comment:

Some More Comments / Observations
Webrevs
@XiaohongGong @PaulSandoz @iwanowww @jatin-bhateja This is a continuation of #28639, would any of you be up to reviewing this here as well?
```java
        }
        v.intoArray(r, i);
    }
    for (; i < a.length; i++) {
```
Can this piece of scalar processing be refactored into a method, so that it does not duplicate the one that processes the whole array above?
We could. But I don't really want to. I'd like to demonstrate that we need two loops here. It is a trade-off :)
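For illustration, the two-loop shape being defended here can be sketched in plain Java, with a hypothetical block width standing in for `SPECIES.length()` (this is not the benchmark's code):

```java
public class TwoLoopShape {
    static final int WIDTH = 4; // hypothetical stand-in for SPECIES.length()

    // Process a in whole blocks of WIDTH, then a scalar tail loop: the
    // structure a vectorized implementation needs, hence the two loops.
    static long sum(int[] a) {
        long s = 0;
        int i = 0;
        int bound = a.length - (a.length % WIDTH); // stand-in for loopBound
        for (; i < bound; i += WIDTH) {
            // main "vector" loop: whole blocks of WIDTH elements
            for (int k = 0; k < WIDTH; k++) { s += a[i + k]; }
        }
        for (; i < a.length; i++) {
            // scalar post-loop for the remaining tail elements
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[]{1, 2, 3, 4, 5, 6, 7})); // 28
    }
}
```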
I think the implementation in the test and the micro are pretty similar; is it possible to have a common place that both can call?
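One possible shared-helper refactoring, sketched here with hypothetical names (not the PR's actual code): a range-based scalar method that the pure-scalar implementation calls for the whole array, and that a vectorized implementation could call for just the tail.

```java
public class FilterShared {
    // Scalar filter over [from, to): copies elements >= threshold into r
    // starting at index j, and returns the new fill count.
    static int filterRange(int[] a, int from, int to, int threshold, int[] r, int j) {
        for (int i = from; i < to; i++) {
            if (a[i] >= threshold) { r[j++] = a[i]; }
        }
        return j;
    }

    // Pure scalar version: just the helper over the whole array.
    // A vectorized version would call filterRange(a, i, a.length, ...) for the tail.
    static int filterScalar(int[] a, int threshold, int[] r) {
        return filterRange(a, 0, a.length, threshold, r, 0);
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 9, 3, 7};
        int[] r = new int[a.length];
        System.out.println(filterScalar(a, 4, r)); // 3 (keeps 5, 9, 7)
    }
}
```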
```java
int filterI_range = 1000_000;
aI_filterI = new int[size];
```
This is not used, aI is used instead.
Oh wow, that is a great catch! I might have to rerun the experiments, since this could possibly affect the results :/
```java
public float[] bF;

// Input for piece-wise functions.
// Uniform [0..1[ with probability p and Uniform [1..2[ with probability (1-p)
```
Is this a typo: [0..1[ ? Should it be [0..1] instead?
No, [0..1[ is the interval that includes 0 but not 1 (a half-closed interval).
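A small generator sketch makes the notation concrete (the names here are hypothetical; what is certain is that `java.util.Random.nextFloat()` itself returns values in [0..1[, i.e. 0 inclusive, 1 exclusive):

```java
import java.util.Random;

public class PieceWiseInput {
    // With probability p draw from [0..1[, otherwise from [1..2[,
    // mirroring the comment in the benchmark above.
    static float[] generate(int size, float p, long seed) {
        Random random = new Random(seed);
        float[] a = new float[size];
        for (int i = 0; i < size; i++) {
            float u = random.nextFloat();          // uniform in [0..1[
            a[i] = (random.nextFloat() < p) ? u    // [0..1[ with probability p
                                            : u + 1.0f; // [1..2[ otherwise
        }
        return a;
    }

    public static void main(String[] args) {
        float[] a = generate(1_000, 0.5f, 42);
        for (float v : a) {
            // Every value lies in the half-open interval [0..2[.
            if (v < 0.0f || v >= 2.0f) { throw new AssertionError(); }
        }
        System.out.println("all in [0..2[");
    }
}
```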
```java
} else if (mask.anyTrue()) {
    int v0 = v.lane(0);
    int v1 = v.lane(1);
    if (v0 >= threshold) { r[j++] = v0; }
    if (v1 >= threshold) { r[j++] = v1; }
```
Can we just use mask.laneIsSet(0) here?
Good idea, it lets us simplify to a line like this:

```java
if (mask.laneIsSet(0)) { r[j++] = v.lane(0); }
```
```java
int i = 0;
for (; i < SPECIES_I256.loopBound(a.length); i += SPECIES_I256.length()) {
    IntVector v = IntVector.fromArray(SPECIES_I256, a, i);
    var mask = v.compare(VectorOperators.GE, thresholds);
```
We can use the compare-with-immediate API directly here.
```diff
- var mask = v.compare(VectorOperators.GE, thresholds);
+ var mask = v.compare(VectorOperators.GE, threshold);
```
```java
var vI1 = vB.castShape(SPECIES_I, 1);
var vI2 = vB.castShape(SPECIES_I, 2);
var vI3 = vB.castShape(SPECIES_I, 3);
accI = accI.add(vI0.add(vI1).add(vI2).add(vI3));
```
Will the following change get better parallelization performance?
```diff
- accI = accI.add(vI0.add(vI1).add(vI2).add(vI3));
+ accI = accI.add(vI0.add(vI1).add(vI2.add(vI3)));
```
It does not matter, because the dependency chain is at accI; as long as we add everything else together before adding to accI, it will be the same.
Yes, exactly. The critical dependency chain is accI. But feel free to investigate the performance difference in a follow-up RFE, and propose yet another implementation if it turns out to be better :)
Not a blocker for me. My point is that the dependence between vI0.add(vI1) and vI2.add(vI3) can be broken, and hence it may get better parallelization. Although the critical dependency chain is accI, performance might be better if its input can be calculated earlier.
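The two groupings can be compared in plain scalar Java; for int addition they are provably equal, and the only difference is the depth of the dependency tree (this is a sketch of the point being discussed, not the benchmark code):

```java
public class AddGrouping {
    // Left-leaning chain: depth 3, each add waits for the previous one.
    static int chained(int v0, int v1, int v2, int v3) {
        return v0 + (v1 + (v2 + v3));
    }

    // Balanced tree: (v0+v1) and (v2+v3) are independent, depth 2,
    // so a CPU can compute the two pairs in parallel.
    static int paired(int v0, int v1, int v2, int v3) {
        return (v0 + v1) + (v2 + v3);
    }

    public static void main(String[] args) {
        // Integer addition is associative, so the results are identical.
        System.out.println(chained(11, 22, 33, 44) == paired(11, 22, 33, 44)); // true
    }
}
```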
```java
float s2 = (float)Math.sqrt(ai);
float s4 = (float)Math.sqrt(s2);
float s8 = (float)Math.sqrt(s4);
```
```diff
- float s2 = (float)Math.sqrt(ai);
- float s4 = (float)Math.sqrt(s2);
- float s8 = (float)Math.sqrt(s4);
+ float s2 = (float) Math.sqrt(ai);
+ float s4 = (float) Math.sqrt(s2);
+ float s8 = (float) Math.sqrt(s4);
```
I don't see the point in this; is there some style guide that suggests this? Maybe it is just a matter of taste ;)
Yeah, it's not a blocker from me. Maybe just personal style. Sorry for the noise, please ignore.
@PaulSandoz @XiaohongGong Thanks for your suggestions. Feel free to keep reviewing in the meantime ;)
```java
for (; i < SPECIES_I128.loopBound(a.length); i += SPECIES_I128.length()) {
    IntVector v = IntVector.fromArray(SPECIES_I128, a, i);
    var mask = v.compare(VectorOperators.GE, threshold);
    if (mask.allTrue()) {
```
What you have here is similar to what we have in the fallback implementation of Vector.compress:
https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java#L528

If the idea here is only to make use of the Vector API to implement a scalar algorithm, without caring about the algorithm itself, then it's fine. Otherwise, I think it would be better to have a version (version 3, maybe in a follow-up PR) which uses a shuffle lookup table, where the index into the lookup table is computed using mask.toLong(). For I2, I4 and I8, the size of the lookup table will be less than or equal to 16 rows.

The shuffle lookup table contains indexes to rearrange the vector lanes corresponding to set mask bits; a single rearrange can then pack the lanes contiguously into the result 'r'.
@jatin-bhateja Good point. But we may at some point change how we deal with fall-back, so I think I want to keep the example as is.
About the shuffle lookup: I think that would be an interesting addition. We can do that in a later RFE, feel free to implement that if you like :)
The issue with lookup table: it can increase the memory pressure, and that can hurt some programs. I've heard that often lookup tables are fast for micro benchmarks where memory pressure is low, but can hurt real programs where memory pressure is already high. But I have not done that kind of experiment myself yet, would be interesting to do :)
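For reference, the lookup-table compress idea can be sketched in scalar Java (names here are hypothetical; in the Vector API, the table rows would become precomputed VectorShuffles applied via rearrange, with the row selected by mask.toLong()):

```java
public class CompressLookup {
    // For each 4-bit mask, the lane indices whose mask bit is set, in order.
    static final int[][] COMPRESS_TABLE = new int[16][];
    static {
        for (int mask = 0; mask < 16; mask++) {
            int[] idx = new int[Integer.bitCount(mask)];
            for (int lane = 0, k = 0; lane < 4; lane++) {
                if ((mask & (1 << lane)) != 0) { idx[k++] = lane; }
            }
            COMPRESS_TABLE[mask] = idx;
        }
    }

    // Pack the lanes of v whose mask bit is set into r starting at j;
    // a single table lookup gives the packing permutation. Returns new j.
    static int compress4(int[] v, int mask, int[] r, int j) {
        for (int lane : COMPRESS_TABLE[mask]) {
            r[j++] = v[lane];
        }
        return j;
    }

    public static void main(String[] args) {
        int[] v = {10, 20, 30, 40};
        int[] r = new int[4];
        int n = compress4(v, 0b1010, r, 0); // lanes 1 and 3 set
        System.out.println(n + " " + r[0] + " " + r[1]); // 2 20 40
    }
}
```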
@eme64 this pull request can not be integrated into `master` due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in a local clone of your personal fork:

```shell
git checkout JDK-8376891-VectorAPI-if-conversion-benchmarks-and-tests
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
```
@PaulSandoz @merykitty Ok, I think I fixed all the issues. The benchmark results are still the same. If anybody else has suggestions, let them come ;)
@merykitty @PaulSandoz @XiaohongGong @jatin-bhateja Thank you very much for all the suggestions, catching bugs, and for the approvals :)

/integrate
Going to push as commit b2728d0.
Your commit was automatically rebased without conflicts.










Changes:

- BRANCH_PROBABILITY, so we can adjust the branch probability of benchmarks with branches that are sensitive to branch prediction.
- filterI is sensitive to branch prediction: give it data that depends on BRANCH_PROBABILITY.
- filterI: add some alternative implementations that speculate on all-true/all-false paths.
- lowerCaseB: adjust the percentage of upper/lower case characters based on BRANCH_PROBABILITY.
- pieceWise2FunctionF: piece-wise function, shows branching vs vector vs vector with branching.
- conditionalSumB: shows branching vs vector performance.

Builds on #28639
Please have a look at the results and discussion in a comment further down: #29522 (comment)

The filterI_VectorAPI_v2_l2 benchmark performs poorly on x64, so I filed this RFE: JDK-8378589 C2 VectorAPI x64: implement 2-element vector masks

We also see that some benchmarks are very slow, because we have not yet implemented "graceful degradation".
See also: https://bugs.openjdk.org/browse/JDK-8378373
Credits: the all-true/all-false path implementations (dynamic uniformity) for filterI are inspired by this paper:

"Combining Run-Time Checks and Compile-Time Analysis to Improve Control Flow Auto-Vectorization."
Bangtian Liu, Avery Laird, Wai Hung Tsang, Bardia Mahjour, and Maryam Mehri Dehnavi.
In PACT, 2022 [PDF]
Progress
Issue
Reviewers
Reviewing

Using git

Checkout this PR locally:

```shell
$ git fetch https://git.openjdk.org/jdk.git pull/29522/head:pull/29522
$ git checkout pull/29522
```

Update a local copy of the PR:

```shell
$ git checkout pull/29522
$ git pull https://git.openjdk.org/jdk.git pull/29522/head
```

Using Skara CLI tools

Checkout this PR locally:

```shell
$ git pr checkout 29522
```

View PR using the GUI difftool:

```shell
$ git pr show -t 29522
```

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/29522.diff

Using Webrev

Link to Webrev Comment