
8376891: [VectorAlgorithms] add more if-conversion benchmarks and tests#29522

Closed
eme64 wants to merge 20 commits into openjdk:master from eme64:JDK-8376891-VectorAPI-if-conversion-benchmarks-and-tests

Conversation

@eme64
Contributor

@eme64 eme64 commented Feb 2, 2026

Changes:

  • Introduce BRANCH_PROBABILITY, so we can adjust the branch probability of benchmarks with branches that are sensitive to branch prediction.
  • filterI is sensitive to branch prediction: give it data that depends on BRANCH_PROBABILITY.
  • filterI: add some alternative implementations that speculate on all-true/all-false paths.
  • lowerCaseB: adjust the percentage of upper/lower case characters based on BRANCH_PROBABILITY.
  • pieceWise2FunctionF: a piece-wise function; shows branching vs vector vs vector-with-branching.
  • conditionalSumB: shows branching vs vector performance.
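For orientation, the shape of the filterI kernel can be sketched in scalar form. This is a hypothetical sketch (names and threshold logic are illustrative, not the benchmark's exact code); the point is the data-dependent branch that BRANCH_PROBABILITY steers:

```java
// Hypothetical scalar filter kernel: copy elements >= threshold into r,
// return the count of kept elements. The data-dependent branch is what
// makes this kind of benchmark sensitive to branch prediction.
public class FilterSketch {
    public static int filter(int[] a, int[] r, int threshold) {
        int j = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] >= threshold) {   // taken with probability ~BRANCH_PROBABILITY
                r[j++] = a[i];
            }
        }
        return j;
    }
}
```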

Builds on #28639

Please have a look at the results and discussion in a comment further down: #29522 (comment)

The filterI_VectorAPI_v2_l2 benchmark performs poorly on x64, so I filed this RFE:
JDK-8378589 C2 VectorAPI x64: implement 2-element vector masks

We also see that some benchmarks are very slow, because we have not yet implemented "graceful degradation".
See also: https://bugs.openjdk.org/browse/JDK-8378373

Credits:
the all-true/all-false path implementations (dynamic uniformity) for filterI are inspired by this paper:
Combining Run-Time Checks and Compile-Time Analysis to Improve Control Flow Auto-Vectorization.
Bangtian Liu, Avery Laird, Wai Hung Tsang, Bardia Mahjour, and Maryam Mehri Dehnavi.
In PACT, 2022 [PDF]
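The dynamic-uniformity pattern credited above can be emulated in plain scalar Java: per block, first test whether the condition is uniform (all-true or all-false) and take a cheap specialized path, falling back to per-lane handling only for mixed blocks. A hedged sketch (block length and names hypothetical; the actual benchmarks use the Vector API's mask.allTrue()/anyTrue()):

```java
public class UniformitySketch {
    static final int LEN = 4; // hypothetical block length (stands in for vector lanes)

    // Filter a into r: all-true blocks are copied wholesale, all-false blocks
    // are skipped, and only mixed blocks fall back to per-lane processing.
    public static int filterBlocked(int[] a, int[] r, int threshold) {
        int j = 0;
        int i = 0;
        for (; i + LEN <= a.length; i += LEN) {
            boolean allTrue = true, allFalse = true;
            for (int k = 0; k < LEN; k++) {
                if (a[i + k] >= threshold) { allFalse = false; } else { allTrue = false; }
            }
            if (allTrue) {                      // uniform fast path: unconditional copy
                System.arraycopy(a, i, r, j, LEN);
                j += LEN;
            } else if (!allFalse) {             // mixed block: per-lane fallback
                for (int k = 0; k < LEN; k++) {
                    if (a[i + k] >= threshold) { r[j++] = a[i + k]; }
                }
            }                                   // all-false block: nothing to do
        }
        for (; i < a.length; i++) {             // scalar tail
            if (a[i] >= threshold) { r[j++] = a[i]; }
        }
        return j;
    }
}
```

When the data is highly uniform (BRANCH_PROBABILITY near 0 or 1), almost every block takes one of the two cheap paths, which is where the speedups below come from.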


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8376891: [VectorAlgorithms] add more if-conversion benchmarks and tests (Sub-task - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/29522/head:pull/29522
$ git checkout pull/29522

Update a local copy of the PR:
$ git checkout pull/29522
$ git pull https://git.openjdk.org/jdk.git pull/29522/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 29522

View PR using the GUI difftool:
$ git pr show -t 29522

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/29522.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Feb 2, 2026

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Feb 2, 2026

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8376891: [VectorAlgorithms] add more if-conversion benchmarks and tests

Reviewed-by: qamai, psandoz, xgong, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 24 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title JDK-8376891 8376891: [VectorAlgorithms] add more if-conversion benchmarks and tests Feb 2, 2026
@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Feb 2, 2026
@openjdk

openjdk bot commented Feb 2, 2026

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@eme64
Contributor Author

eme64 commented Feb 20, 2026

Result on AVX512 laptop:

Benchmark                                       (BRANCH_PROBABILITY)  (NUM_X_OBJECTS)  (SEED)  (SIZE)  Mode  Cnt      Score      Error  Units
VectorAlgorithms.pieceWise2FunctionF_VectorAPI                  0.01            10000       0   10000  avgt    3   9054.094 ±  218.989  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI                  0.05            10000       0   10000  avgt    3   9056.102 ±  226.557  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI                   0.3            10000       0   10000  avgt    3   9047.736 ±   47.267  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI                   0.5            10000       0   10000  avgt    3   9052.642 ±  131.057  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI                   0.7            10000       0   10000  avgt    3   9044.788 ±   97.747  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI                  0.95            10000       0   10000  avgt    3   9110.475 ± 1893.316  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI                  0.99            10000       0   10000  avgt    3   9048.477 ±  224.952  ns/op
VectorAlgorithms.pieceWise2FunctionF_loop                       0.01            10000       0   10000  avgt    3  35782.844 ±   40.692  ns/op
VectorAlgorithms.pieceWise2FunctionF_loop                       0.05            10000       0   10000  avgt    3  34319.087 ±  263.681  ns/op
VectorAlgorithms.pieceWise2FunctionF_loop                        0.3            10000       0   10000  avgt    3  25162.299 ±   80.084  ns/op
VectorAlgorithms.pieceWise2FunctionF_loop                        0.5            10000       0   10000  avgt    3  18010.296 ±  423.631  ns/op
VectorAlgorithms.pieceWise2FunctionF_loop                        0.7            10000       0   10000  avgt    3  11045.497 ±  479.314  ns/op
VectorAlgorithms.pieceWise2FunctionF_loop                       0.95            10000       0   10000  avgt    3   8120.907 ± 1622.572  ns/op
VectorAlgorithms.pieceWise2FunctionF_loop                       0.99            10000       0   10000  avgt    3   8128.132 ±  386.989  ns/op

Interesting: the scalar implementation can beat the vectorized one if the branch probability is extreme enough.

And with speculating on one branch all-true, we even get:

Benchmark                                          (BRANCH_PROBABILITY)  (NUM_X_OBJECTS)  (SEED)  (SIZE)  Mode  Cnt     Score      Error  Units
VectorAlgorithms.pieceWise2FunctionF_VectorAPI_v2                  0.01            10000       0   10000  avgt    3  9039.454 ±   29.497  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI_v2                  0.05            10000       0   10000  avgt    3  9045.280 ±  159.739  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI_v2                   0.3            10000       0   10000  avgt    3  9042.699 ±   14.615  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI_v2                   0.5            10000       0   10000  avgt    3  9040.755 ±   42.132  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI_v2                   0.7            10000       0   10000  avgt    3  9068.483 ± 1364.741  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI_v2                  0.95            10000       0   10000  avgt    3  5239.270 ±   18.841  ns/op
VectorAlgorithms.pieceWise2FunctionF_VectorAPI_v2                  0.99            10000       0   10000  avgt    3  1660.073 ±   11.396  ns/op

If the branch probability is high enough, we get the combined benefit of branch prediction and vectorization!
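For reference, the shape of the piece-wise function can be sketched in scalar form. This is a hypothetical reconstruction from the discussion (one cheap multiply branch, one expensive sqrt branch; the exact function and threshold in the benchmark may differ):

```java
public class PieceWiseSketch {
    // Hypothetical: cheap branch (mul) for x below the cut-over point,
    // expensive branch (sqrt) otherwise. If-conversion forces vector code
    // to compute BOTH sides and blend, paying for the sqrt every time.
    public static float f(float x) {
        return (x < 1.0f) ? x * x : (float) Math.sqrt(x);
    }
}
```

Speculating on the all-true (mul) side per block skips the vectorized sqrt when a block is uniform, which is what the v2 numbers at 0.95 and 0.99 show.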


int filterI_range = 1000_000;
aI_filterI = new int[size];
Arrays.setAll(aI, i -> random.nextInt(filterI_range));
Contributor

Suggested change
Arrays.setAll(aI, i -> random.nextInt(filterI_range));
Arrays.setAll(aI, i -> random.nextInt(aI_filterI));

Contributor Author

Why are you suggesting this?
I think it is correct as is. The goal is to filter "in/out" an element with probability branchProbability. So we need to use the same range filterI_range for picking the elements and for eI_filterI.

@eme64
Contributor Author

eme64 commented Feb 23, 2026

I'll have to run it again later to get less noisy results. But it looks promising (AVX512 laptop):

Benchmark                                      (BRANCH_PROBABILITY)  (NUM_X_OBJECTS)  (SEED)  (SIZE)  Mode  Cnt      Score       Error  Units
VectorAlgorithms.conditionalSumB_VectorAPI_v1                  0.01            10000       0   10000  avgt    3   1814.407 ±  7117.180  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v1                  0.05            10000       0   10000  avgt    3   1310.797 ±   723.511  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v1                   0.3            10000       0   10000  avgt    3   1264.866 ±    15.574  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v1                   0.5            10000       0   10000  avgt    3   1276.427 ±   387.683  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v1                   0.7            10000       0   10000  avgt    3   1262.752 ±    10.911  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v1                  0.95            10000       0   10000  avgt    3   1661.397 ±  6421.994  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v1                  0.99            10000       0   10000  avgt    3   1257.687 ±     3.951  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v2                  0.01            10000       0   10000  avgt    3   1028.745 ±  3909.545  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v2                  0.05            10000       0   10000  avgt    3    895.966 ±   101.069  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v2                   0.3            10000       0   10000  avgt    3    909.189 ±   202.897  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v2                   0.5            10000       0   10000  avgt    3    917.714 ±   138.545  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v2                   0.7            10000       0   10000  avgt    3   1000.344 ±  2572.412  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v2                  0.95            10000       0   10000  avgt    3   1439.590 ±   138.446  ns/op
VectorAlgorithms.conditionalSumB_VectorAPI_v2                  0.99            10000       0   10000  avgt    3    903.180 ±    14.482  ns/op
VectorAlgorithms.conditionalSumB_loop                          0.01            10000       0   10000  avgt    3   5211.599 ±  1341.394  ns/op
VectorAlgorithms.conditionalSumB_loop                          0.05            10000       0   10000  avgt    3   5368.796 ±  1293.927  ns/op
VectorAlgorithms.conditionalSumB_loop                           0.3            10000       0   10000  avgt    3  11524.422 ±  2455.560  ns/op
VectorAlgorithms.conditionalSumB_loop                           0.5            10000       0   10000  avgt    3  13321.561 ±  9368.497  ns/op
VectorAlgorithms.conditionalSumB_loop                           0.7            10000       0   10000  avgt    3   8116.188 ±    23.995  ns/op
VectorAlgorithms.conditionalSumB_loop                          0.95            10000       0   10000  avgt    3   6188.119 ± 29088.909  ns/op
VectorAlgorithms.conditionalSumB_loop                          0.99            10000       0   10000  avgt    3   8450.095 ±  1771.220  ns/op

Thanks @PaulSandoz and @rgiulietti for bringing this one up offline :)

@eme64
Contributor Author

eme64 commented Feb 26, 2026

Here are some benchmark results, run with a command like this:

make test TEST="micro:vm.compiler.VectorAlgorithms.filterI" CONF=linux-x64 TEST_VM_OPTS="" MICRO="OPTIONS=-p SIZE=10000 -p BRANCH_PROBABILITY=0.001,0.002,0.004,0.008,0.016,0.031,0.063,0.125,0.250,0.500,0.750,0.875,0.938,0.969,0.984,0.992,0.996,0.998,0.999" | tee filterI.log
make test TEST="micro:vm.compiler.VectorAlgorithms.lowerCa" CONF=linux-x64 TEST_VM_OPTS="" MICRO="OPTIONS=-p SIZE=10000 -p BRANCH_PROBABILITY=0.001,0.002,0.004,0.008,0.016,0.031,0.063,0.125,0.250,0.500,0.750,0.875,0.938,0.969,0.984,0.992,0.996,0.998,0.999" | tee lowerCase.log
make test TEST="micro:vm.compiler.VectorAlgorithms.pieceWi" CONF=linux-x64 TEST_VM_OPTS="" MICRO="OPTIONS=-p SIZE=10000 -p BRANCH_PROBABILITY=0.001,0.002,0.004,0.008,0.016,0.031,0.063,0.125,0.250,0.500,0.750,0.875,0.938,0.969,0.984,0.992,0.996,0.998,0.999" | tee piece.log
make test TEST="micro:vm.compiler.VectorAlgorithms.conditi" CONF=linux-x64 TEST_VM_OPTS="" MICRO="OPTIONS=-p SIZE=10000 -p BRANCH_PROBABILITY=0.001,0.002,0.004,0.008,0.016,0.031,0.063,0.125,0.250,0.500,0.750,0.875,0.938,0.969,0.984,0.992,0.996,0.998,0.999" | tee conditionalSum.log

filterI

On my AVX512 laptop, the results are a bit noisy:
image

And on 2 other x64 servers, the results are much cleaner:
image
image

Comment:

  • Clear branch-prediction shape in most implementations. High probability means more writing, so it takes the same or more time than the low-probability cases.
  • v2_l2: 2-element masks are not yet supported on x64, so no intrinsification -> horrible performance. Still, we see the branch-prediction pattern peek through!
  • v1 (compress) generally gives the best performance, except that v2_l8 has a slight speedup at extreme probabilities.

And on NEON (N1) aarch64:
image

Comment:

  • v1 (compress) not implemented -> horrible performance
  • v2_l8 (vector too long) -> horrible performance
  • Clear branch-prediction shape
  • loop (scalar) performance is better for low and middle probabilities, but worse for high probabilities, probably as vectorized stores become more profitable than scalar stores.

lowerCaseB

On an x64 machine (the other two machines look very similar):
image

On the NEON machines:
image

Comment:

  • loop (scalar) has some branch misprediction penalty at middle probabilities.
  • v1 and v2 have very similar performance: no difference on x64, but on NEON it is slightly better to only have a single comparison (v2) instead of two (v1).
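One plausible way a single comparison can replace two for an ASCII upper-case range check (a hypothetical sketch, not necessarily the benchmark's v1/v2 code):

```java
public class LowerCaseSketch {
    // Two comparisons: the straightforward range test.
    public static byte toLower2cmp(byte c) {
        return (c >= 'A' && c <= 'Z') ? (byte) (c + 32) : c;
    }

    // One comparison: shift the range to start at 0 and do a single
    // unsigned compare (values below 'A' wrap to large unsigned values).
    public static byte toLower1cmp(byte c) {
        return ((c - 'A') & 0xFF) < 26 ? (byte) (c + 32) : c;
    }
}
```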

pieceWise2FunctionF

On my 3 x64 machines it looks like this:
image

And on NEON:
image

Comment:

  • loop (scalar):
    • low probability: mostly sqrt, so slower
    • high probability: mostly mul, so a bit faster
    • middle probability: we see branch misprediction penalty clearly for NEON, but I think also a very slight bump for x64.
  • v1 has constant performance, using only masked vectors.
  • v2 has the same performance as v1 from low up to middle probabilities, but above the middle probability it goes even faster, as we can take the uniform path towards only the vectorized mul, and can avoid the vectorized sqrt.

conditionalSumB

On my 3 x64 machines:
image

And on NEON:
image

Comment:

  • x64:
    • loop (scalar): slowest, and branch misprediction penalty in the middle
    • v2 is a bit faster than v1, because v2 loads full ByteVectors and casts them to 4 IntVectors. v1 only loads a quarter of a ByteVector and casts it to a single IntVector, which seems to be a little slower (probably a memory-instruction bottleneck).
  • NEON:
    • loop (scalar): apparently no branch misprediction penalty, maybe even a slight boost? Strange!
    • v1 cannot (currently) have a ByteVector that is a quarter of an IntVector: a ByteVector must be at least 64 bits, and so the IntVector would be 256 bits, over the 128-bit limit of NEON. No intrinsification -> horrible performance.
    • v2 vectorizes well.
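The scalar shape of conditionalSumB can be sketched as follows (hypothetical names; the condition and threshold are illustrative, and the Vector API versions replace the branch with a mask and a blended add):

```java
public class ConditionalSumSketch {
    // Hypothetical scalar kernel: sum the byte elements that pass the test.
    // The branch mispredicts most around 50% probability, matching the
    // hump in the conditionalSumB_loop results.
    public static int conditionalSum(byte[] a, byte threshold) {
        int sum = 0;
        for (byte b : a) {
            if (b >= threshold) {
                sum += b;
            }
        }
        return sum;
    }
}
```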

Some More Comments / Observations

  • Auto Vectorization does not yet support control flow, so the loop implementations so far all get us scalar performance, almost always with the classic branch misprediction penalties.
    • Auto-vectorization of control flow (using if-conversion) is not trivial: there are cases where the basic approach can lead to regressions. See pieceWise2FunctionF, where one branch is much cheaper than the other: if the branch probability leans even moderately high (0.875), branch prediction might be good enough, while if-converted vector code has to execute both branches and hence suffers from the vectorization of the slow branch (here the vectorized sqrt). Thus, we need a good cost model that takes branch probability into account.
  • The VectorAPI can be used to simulate control flow, and it mostly works quite well. We can even use it to experiment with advanced optimizations using dynamic uniformity (filterI v2, pieceWise2FunctionF v2), that combine vectorization with all-true/all-false checks.
  • The VectorAPI still has some issues:
    • If intrinsification fails, we get horrible performance. The JEP promises "graceful degradation" ("On platforms without vectors, graceful degradation will yield code competitive with manually-unrolled loops"), but we are far from even getting scalar performance. There are multiple causes for missing intrinsification: vector length, specific instructions, implementation limitations (mask length). We will probably need a way to lower unsupported vector shapes inside C2 to a reasonable scalar implementation. The difficulty: doing this per vector instruction may not be the most efficient; rather, one might want to detect a whole connected graph of vectors and find a good translation strategy (example: filterI with compress and masked store), but that might be too much work.
    • Restrictive set of vector shapes: Especially on NEON, casts quickly lead us outside the supported vector length. The hardware would support vectors smaller than 64bits, but they cannot be expressed in the VectorAPI. We are stuck in the 2x range from 64bits to 128bits, which is too narrow. A larger range of vector shapes (below 64bits and above 512bits), together with automatic splitting of too-large vectors could help a lot here.
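The cost-model point above can be made concrete with a first-order sketch (all constants and the mispredict-rate formula are illustrative assumptions, not C2's actual model):

```java
public class IfConversionCostSketch {
    // Per-element scalar cost: probability-weighted branch bodies plus a
    // mispredict term that peaks at p = 0.5 for a randomly-taken branch.
    public static double scalarCost(double p, double cThen, double cElse, double cMiss) {
        double mispredictRate = 2.0 * p * (1.0 - p); // crude predictor model
        return p * cThen + (1.0 - p) * cElse + mispredictRate * cMiss;
    }

    // Per-element if-converted vector cost: both branch bodies plus a blend,
    // amortized over the vector lanes -- independent of p.
    public static double vectorCost(double cThen, double cElse, double cBlend, int lanes) {
        return (cThen + cElse + cBlend) / lanes;
    }
}
```

With a cheap then-branch, an expensive else-branch, and 8 lanes, the scalar cost dips below the flat vector cost only at extreme p, which is the crossover the pieceWise2FunctionF results show.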

@eme64 eme64 marked this pull request as ready for review February 26, 2026 09:40
@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 26, 2026
@mlbridge

mlbridge bot commented Feb 26, 2026

Webrevs

@eme64
Contributor Author

eme64 commented Mar 3, 2026

@XiaohongGong @PaulSandoz @iwanowww @jatin-bhateja This is a continuation of #28639, would any of you be up to reviewing this here as well?

Member

@merykitty merykitty left a comment


LGTM otherwise

}
v.intoArray(r, i);
}
for (; i < a.length; i++) {
Member

Can this piece of scalar processing be refactored into a method, so that it does not duplicate the one that processes the whole array above?

Contributor Author

We could. But I don't really want to. I'd like to demonstrate that we need two loops here. It is a trade-off :)

Member

Understood then.

Member

I think the implementation in the test and the micro are pretty similar, is it possible to have a common place that both can call?

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 3, 2026
}

int filterI_range = 1000_000;
aI_filterI = new int[size];
Member

This is not used, aI is used instead.

Contributor Author

Oh wow, that is a great catch! I might have to rerun the experiments, since this could possibly affect the results :/

public float[] bF;

// Input for piece-wise functions.
// Uniform [0..1[ with probability p and Uniform [1..2[ with probability (1-p)


Is this a typo issue [0..1[ ? Should be [0..1] instead?

Member

No, [0..1[ is the interval that includes 0 but not 1 (a half-closed interval).

Comment on lines +732 to +736
} else if (mask.anyTrue()) {
int v0 = v.lane(0);
int v1 = v.lane(1);
if (v0 >= threshold) { r[j++] = v0; }
if (v1 >= threshold) { r[j++] = v1; }


Can we just use mask.laneIsSet(0) here?

Contributor Author

Good idea, it lets us simplify to a line like this:
if (mask.laneIsSet(0)) { r[j++] = v.lane(0); }


Correct.

int i = 0;
for (; i < SPECIES_I256.loopBound(a.length); i += SPECIES_I256.length()) {
IntVector v = IntVector.fromArray(SPECIES_I256, a, i);
var mask = v.compare(VectorOperators.GE, thresholds);


We can use the compare with immediate API directly here.

Suggested change
var mask = v.compare(VectorOperators.GE, thresholds);
var mask = v.compare(VectorOperators.GE, threshold);

Contributor Author

Sure, can do :)

var vI1 = vB.castShape(SPECIES_I, 1);
var vI2 = vB.castShape(SPECIES_I, 2);
var vI3 = vB.castShape(SPECIES_I, 3);
accI = accI.add(vI0.add(vI1).add(vI2).add(vI3));


Will the following change get better parallelization performance?

Suggested change
accI = accI.add(vI0.add(vI1).add(vI2).add(vI3));
accI = accI.add(vI0.add(vI1).add(vI2.add(vI3)));

Member

It does not matter, because the dependency chain is at accI: as long as we add everything else together before adding to accI, it will be the same.

Contributor Author

Yes, exactly. The critical dependency chain is accI. But feel free to investigate the performance difference in a follow-up RFE, and propose yet another implementation if it proves to be better :)


Not a blocker for me. My point is that the dependence between vI0.add(vI1) and vI2.add(vI3) can be broken, and hence it may get better parallelism. Although the critical dependency chain is accI, the performance might be better if its input can be calculated earlier.

Comment on lines +1021 to +1023
float s2 = (float)Math.sqrt(ai);
float s4 = (float)Math.sqrt(s2);
float s8 = (float)Math.sqrt(s4);


Suggested change
float s2 = (float)Math.sqrt(ai);
float s4 = (float)Math.sqrt(s2);
float s8 = (float)Math.sqrt(s4);
float s2 = (float) Math.sqrt(ai);
float s4 = (float) Math.sqrt(s2);
float s8 = (float) Math.sqrt(s4);

Contributor Author

I don't see the point in this - is there some style guide that suggests this? Maybe it is just a matter of taste ;)


Yeah, it's not a blocker from me. Maybe it's just personal style. Sorry for the noise, please ignore.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Mar 4, 2026
@eme64
Contributor Author

eme64 commented Mar 4, 2026

@PaulSandoz @XiaohongGong Thanks for your suggestions around filterI, very helpful! I'll re-test and run the benchmarks again for filterI :)

Feel free to keep reviewing in the meantime ;)

for (; i < SPECIES_I128.loopBound(a.length); i += SPECIES_I128.length()) {
IntVector v = IntVector.fromArray(SPECIES_I128, a, i);
var mask = v.compare(VectorOperators.GE, threshold);
if (mask.allTrue()) {
Member

@jatin-bhateja jatin-bhateja Mar 4, 2026


What you have here is similar to what we have in fallback implementation of Vector.compress
https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java#L528

If the idea here is only to use the Vector API to implement the algorithm, without caring about optimality, then it's fine. Otherwise, I think it would be better to have a version (version 3, maybe in a follow-up PR) that uses a shuffle lookup table, where the index into the lookup table is computed using mask.toLong(). For I2, I4 and I8, the size of the lookup table will be less than or equal to 16 rows.

So the shuffle lookup table contains indexes to rearrange the vector lanes corresponding to set mask bits; a single rearrange can then pack the lanes contiguously into the result 'r'.
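The shuffle-lookup idea can be illustrated in scalar Java (a hypothetical sketch; a real version would precompute VectorShuffles and use a single rearrange per block):

```java
public class CompressTableSketch {
    static final int LEN = 4; // lanes per block (e.g. an I4 shape)

    // TABLE[mask] lists the lane indices of the set bits of 'mask', in order;
    // indexing by the mask yields the permutation that packs kept lanes
    // contiguously, with no per-lane branching at compress time.
    static final int[][] TABLE = buildTable();

    static int[][] buildTable() {
        int[][] t = new int[1 << LEN][];
        for (int m = 0; m < (1 << LEN); m++) {
            int[] idx = new int[Integer.bitCount(m)];
            for (int k = 0, j = 0; k < LEN; k++) {
                if (((m >> k) & 1) != 0) { idx[j++] = k; }
            }
            t[m] = idx;
        }
        return t;
    }

    // Compress one block of LEN lanes into r starting at j; returns the new j.
    public static int compress(int[] lanes, int mask, int[] r, int j) {
        for (int k : TABLE[mask]) { r[j++] = lanes[k]; }
        return j;
    }
}
```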

Contributor Author

@jatin-bhateja Good point. But we may at some point change how we deal with fall-back, so I think I want to keep the example as is.

About the shuffle lookup: I think that would be an interesting addition. We can do that in a later RFE; feel free to implement it if you like :)
The issue with lookup tables: they can increase memory pressure, and that can hurt some programs. I've heard that lookup tables are often fast in micro benchmarks where memory pressure is low, but can hurt real programs where memory pressure is already high. But I have not done that kind of experiment myself yet; it would be interesting to do :)

Member

Ok @eme64.

@openjdk

openjdk bot commented Mar 5, 2026

@eme64 this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in the local repository for your personal fork:

git checkout JDK-8376891-VectorAPI-if-conversion-benchmarks-and-tests
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Mar 5, 2026
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Mar 5, 2026
@eme64
Contributor Author

eme64 commented Mar 5, 2026

@PaulSandoz @merykitty Ok, I think I fixed all the issues. The benchmark results are still the same.
Can you re-approve?

If anybody else has suggestions, let them come ;)

Member

@merykitty merykitty left a comment


Still good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 5, 2026

@XiaohongGong XiaohongGong left a comment


LGTM!

@eme64
Contributor Author

eme64 commented Mar 9, 2026

@merykitty @PaulSandoz @XiaohongGong @jatin-bhateja Thank you very much for all the suggestions, catching bugs and for the approvals :)

/integrate

@openjdk

openjdk bot commented Mar 9, 2026

Going to push as commit b2728d0.
Since your change was applied there have been 28 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 9, 2026
@openjdk openjdk bot closed this Mar 9, 2026
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 9, 2026
@openjdk

openjdk bot commented Mar 9, 2026

@eme64 Pushed as commit b2728d0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
