
Conversation

@XiaohongGong

@XiaohongGong XiaohongGong commented Jul 10, 2025

This is a follow-up patch to [1], which aims to implement the subword gather load APIs for the AArch64 SVE platform.

Background

Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather-load instructions for byte/short types that take an int vector of indices. The vector size of a gather-load instruction is therefore determined by the index vector (i.e. the int elements): the total size is 32 * elem_num bits, where elem_num is the number of elements loaded into the vector register.
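
For reference, this is the kind of Java-level call that gets intrinsified here; a minimal sketch using the incubating jdk.incubator.vector API (the species choice and names are only illustrative):

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

class SubwordGatherExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    // Gather SPECIES.length() bytes, picking each element at a[offset + indexMap[mapOffset + i]].
    static ByteVector gather(byte[] a, int offset, int[] indexMap, int mapOffset) {
        return ByteVector.fromArray(SPECIES, a, offset, indexMap, mapOffset);
    }
}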

Implementation

Challenges

Due to size differences between int indices (32-bit) and byte/short data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.

For a 512-bit SVE machine, loading a byte vector with different vector species requires different approaches:

  • SPECIES_64: Single operation with mask (8 elements, 256-bit)
  • SPECIES_128: Single operation, full register (16 elements, 512-bit)
  • SPECIES_256: Two operations + merge (32 elements, 1024-bit)
  • SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
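
As a rough sketch of the arithmetic behind this list (assuming one 32-bit index lane per loaded element, as described above; the helper name is illustrative), the number of native gather-loads can be computed as:

// Number of native SVE gather-loads needed for 'laneCount' subword elements on a machine
// with 'sveBits'-wide vector registers; each loaded element consumes one 32-bit index lane.
static int nativeGatherOps(int laneCount, int sveBits) {
    int indexBits = laneCount * 32;
    return Math.max(1, indexBits / sveBits);   // e.g. 64 byte lanes on 512-bit SVE -> 4
}

For the 512-bit example above this yields 1, 1, 2 and 4 operations for SPECIES_64/128/256/512 respectively.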

Use ByteVector.SPECIES_512 as an example:

  • It contains 64 elements, so the index vector size should be 64 * 32 bits, which is 4 times the SVE vector register size.
  • It requires 4 vector gather-load operations to finish the whole load.
byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
int[] idx = [0, 1, 2, 3, ..., 63, ...]

4 gather-load:
idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
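
A scalar model of this split-and-merge (illustrative only, not the VM implementation): each of the four native gather-loads handles 16 of the 64 indices, and the merge places each partial result into its 16-lane slot of the final byte vector.

static byte[] gatherBytes512(byte[] arr, int[] idx) {
    final int lanes = 64, perGather = 16;           // 64 byte lanes, 16 int indices per native gather
    byte[] v = new byte[lanes];
    for (int g = 0; g < lanes / perGather; g++) {   // 4 gather-load operations
        for (int i = 0; i < perGather; i++) {
            v[g * perGather + i] = arr[idx[g * perGather + i]];
        }
    }
    return v;
}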

Solution

The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.

Here are the main changes:

  • Enhanced IR generation with architecture-specific patterns based on gather_scatter_needs_vector_index() matcher.
  • Added VectorSliceNode for result merging.
  • Added VectorMaskWidenNode for mask splitting and type conversion for masked gather-loads.
  • Implemented SVE match rules for subword gather operations.
  • Added comprehensive IR tests for verification.

Testing:

  • Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
  • No regressions found

Performance:

The corresponding JMH benchmarks improve by 3-11x on an NVIDIA Grace CPU, which is a 128-bit SVE2 architecture. The following is the performance data:

Benchmark                                                 SIZE Mode   Cnt Unit   Before      After   Gain
GatherOperationsBenchmark.microByteGather128              64   thrpt  30  ops/ms 13500.891 46721.307 3.46
GatherOperationsBenchmark.microByteGather128              256  thrpt  30  ops/ms  3378.186 12321.847 3.64
GatherOperationsBenchmark.microByteGather128              1024 thrpt  30  ops/ms   844.871  3144.217 3.72
GatherOperationsBenchmark.microByteGather128              4096 thrpt  30  ops/ms   211.386   783.337 3.70
GatherOperationsBenchmark.microByteGather128_MASK         64   thrpt  30  ops/ms 10605.664 46124.957 4.34
GatherOperationsBenchmark.microByteGather128_MASK         256  thrpt  30  ops/ms  2668.531 12292.350 4.60
GatherOperationsBenchmark.microByteGather128_MASK         1024 thrpt  30  ops/ms   676.218  3074.224 4.54
GatherOperationsBenchmark.microByteGather128_MASK         4096 thrpt  30  ops/ms   169.402   817.227 4.82
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  64   thrpt  30  ops/ms 10615.723 46122.380 4.34
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  256  thrpt  30  ops/ms  2671.931 12222.473 4.57
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  1024 thrpt  30  ops/ms   678.437  3091.970 4.55
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  4096 thrpt  30  ops/ms   170.310   813.967 4.77
GatherOperationsBenchmark.microByteGather128_NZ_OFF       64   thrpt  30  ops/ms 13524.671 47223.082 3.49
GatherOperationsBenchmark.microByteGather128_NZ_OFF       256  thrpt  30  ops/ms  3411.813 12343.308 3.61
GatherOperationsBenchmark.microByteGather128_NZ_OFF       1024 thrpt  30  ops/ms   847.919  3129.065 3.69
GatherOperationsBenchmark.microByteGather128_NZ_OFF       4096 thrpt  30  ops/ms   212.790   787.953 3.70
GatherOperationsBenchmark.microByteGather64               64   thrpt  30  ops/ms  8717.294 48176.937 5.52
GatherOperationsBenchmark.microByteGather64               256  thrpt  30  ops/ms  2184.345 12347.113 5.65
GatherOperationsBenchmark.microByteGather64               1024 thrpt  30  ops/ms   546.093  3070.851 5.62
GatherOperationsBenchmark.microByteGather64               4096 thrpt  30  ops/ms   136.724   767.656 5.61
GatherOperationsBenchmark.microByteGather64_MASK          64   thrpt  30  ops/ms  6576.504 48588.806 7.38
GatherOperationsBenchmark.microByteGather64_MASK          256  thrpt  30  ops/ms  1653.073 12341.291 7.46
GatherOperationsBenchmark.microByteGather64_MASK          1024 thrpt  30  ops/ms   416.590  3070.680 7.37
GatherOperationsBenchmark.microByteGather64_MASK          4096 thrpt  30  ops/ms   105.743   767.790 7.26
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   64   thrpt  30  ops/ms  6628.974 48628.463 7.33
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   256  thrpt  30  ops/ms  1676.767 12338.116 7.35
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   1024 thrpt  30  ops/ms   422.612  3070.987 7.26
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   4096 thrpt  30  ops/ms   105.033   767.563 7.30
GatherOperationsBenchmark.microByteGather64_NZ_OFF        64   thrpt  30  ops/ms  8754.635 48525.395 5.54
GatherOperationsBenchmark.microByteGather64_NZ_OFF        256  thrpt  30  ops/ms  2182.044 12338.096 5.65
GatherOperationsBenchmark.microByteGather64_NZ_OFF        1024 thrpt  30  ops/ms   547.353  3071.666 5.61
GatherOperationsBenchmark.microByteGather64_NZ_OFF        4096 thrpt  30  ops/ms   137.853   767.745 5.56
GatherOperationsBenchmark.microShortGather128             64   thrpt  30  ops/ms  8713.480 37696.121 4.32
GatherOperationsBenchmark.microShortGather128             256  thrpt  30  ops/ms  2189.636  9479.710 4.32
GatherOperationsBenchmark.microShortGather128             1024 thrpt  30  ops/ms   545.435  2378.492 4.36
GatherOperationsBenchmark.microShortGather128             4096 thrpt  30  ops/ms   136.213   595.504 4.37
GatherOperationsBenchmark.microShortGather128_MASK        64   thrpt  30  ops/ms  6665.844 37765.315 5.66
GatherOperationsBenchmark.microShortGather128_MASK        256  thrpt  30  ops/ms  1673.950  9482.207 5.66
GatherOperationsBenchmark.microShortGather128_MASK        1024 thrpt  30  ops/ms   420.628  2378.813 5.65
GatherOperationsBenchmark.microShortGather128_MASK        4096 thrpt  30  ops/ms   105.128   595.412 5.66
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64   thrpt  30  ops/ms  6699.594 37698.398 5.62
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256  thrpt  30  ops/ms  1682.128  9480.355 5.63
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt  30  ops/ms   421.942  2380.449 5.64
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt  30  ops/ms   106.587   595.560 5.58
GatherOperationsBenchmark.microShortGather128_NZ_OFF      64   thrpt  30  ops/ms  8788.830 37709.493 4.29
GatherOperationsBenchmark.microShortGather128_NZ_OFF      256  thrpt  30  ops/ms  2199.706  9485.769 4.31
GatherOperationsBenchmark.microShortGather128_NZ_OFF      1024 thrpt  30  ops/ms   548.309  2380.494 4.34
GatherOperationsBenchmark.microShortGather128_NZ_OFF      4096 thrpt  30  ops/ms   137.434   595.448 4.33
GatherOperationsBenchmark.microShortGather64              64   thrpt  30  ops/ms  5296.860 37797.813 7.13
GatherOperationsBenchmark.microShortGather64              256  thrpt  30  ops/ms  1321.738  9602.510 7.26
GatherOperationsBenchmark.microShortGather64              1024 thrpt  30  ops/ms   330.520  2404.013 7.27
GatherOperationsBenchmark.microShortGather64              4096 thrpt  30  ops/ms    82.149   602.956 7.33
GatherOperationsBenchmark.microShortGather64_MASK         64   thrpt  30  ops/ms  3458.968 37851.452 10.94
GatherOperationsBenchmark.microShortGather64_MASK         256  thrpt  30  ops/ms   879.143  9616.554 10.93
GatherOperationsBenchmark.microShortGather64_MASK         1024 thrpt  30  ops/ms   220.256  2408.851 10.93
GatherOperationsBenchmark.microShortGather64_MASK         4096 thrpt  30  ops/ms    54.947   603.251 10.97
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  64   thrpt  30  ops/ms  3521.856 37736.119 10.71
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  256  thrpt  30  ops/ms   881.456  9602.649 10.89
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  1024 thrpt  30  ops/ms   220.122  2409.030 10.94
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  4096 thrpt  30  ops/ms    55.845   603.126 10.79
GatherOperationsBenchmark.microShortGather64_NZ_OFF       64   thrpt  30  ops/ms  5279.815 37698.023 7.14
GatherOperationsBenchmark.microShortGather64_NZ_OFF       256  thrpt  30  ops/ms  1307.935  9601.551 7.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF       1024 thrpt  30  ops/ms   329.707  2409.962 7.30
GatherOperationsBenchmark.microShortGather64_NZ_OFF       4096 thrpt  30  ops/ms    82.092   603.380 7.35

[1] https://bugs.openjdk.org/browse/JDK-8355563
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector-Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8351623: VectorAPI: Add SVE implementation of subword gather load operation (Enhancement - P4)

Reviewers

Reviewers without OpenJDK IDs

  • @erifan (no known openjdk.org user name / role) 🔄 Re-review required (review applies to 534ed7fd)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236
$ git checkout pull/26236

Update a local copy of the PR:
$ git checkout pull/26236
$ git pull https://git.openjdk.org/jdk.git pull/26236/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26236

View PR using the GUI difftool:
$ git pr show -t 26236

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26236.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 10, 2025

👋 Welcome back xgong! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 10, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 10, 2025
@openjdk

openjdk bot commented Jul 10, 2025

@XiaohongGong The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jul 10, 2025
@mlbridge

mlbridge bot commented Jul 10, 2025

Webrevs

@XiaohongGong
Author

Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on an SVE machine with a larger vector size (e.g. 512-bit)? Thanks a lot in advance!

@Bhavana-Kilambi
Contributor

Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon.

@XiaohongGong
Author

Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon.

Testing on 256-bit SVE machines is fine with me. Thanks so much for your help!

@fg1417

fg1417 commented Jul 15, 2025

@XiaohongGong thanks for your work! Tier1 - tier3 passed on a 256-bit SVE machine without new failures.

@fg1417

fg1417 commented Jul 15, 2025

@XiaohongGong Please correct me if I’m missing something or got anything wrong.

Taking short on a 512-bit machine as an example, these instructions would be generated:

// vgather
sve_dup vtmp, 0
sve_load_0 =>  [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a]
sve_uzp1 with vtmp =>  [00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa]

// vgather1
sve_dup vtmp, 0
sve_load_1 =>  [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b]
sve_uzp1 with vtmp =>  [00 00 00 00 00 00 00 00 bb bb bb bb bb bb bb bb]

// Slice vgather1, vgather1
ext =>  [bb bb bb bb bb bb bb bb 00 00 00 00 00 00 00 00]

// Or vgather, vslice
sve_orr =>  [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa]

Actually, we can get the target result directly by uzp1 the output from sve_load_0 and sve_load_1, like

[0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a]
[0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b]
uzp1 => 
[bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa]

If so, the current design of LoadVectorGather may not be sufficiently low-level to suit AArch64. WDYT?

@XiaohongGong
Author

@XiaohongGong thanks for your work! Tier1 - tier3 passed on a 256-bit SVE machine without new failures.

Good! Thanks so much for your help!

@XiaohongGong
Author

XiaohongGong commented Jul 16, 2025

@XiaohongGong Please correct me if I’m missing something or got anything wrong.

Taking short on a 512-bit machine as an example, these instructions would be generated:

// vgather
sve_dup vtmp, 0
sve_load_0 =>  [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a]
sve_uzp1 with vtmp =>  [00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa]

// vgather1
sve_dup vtmp, 0
sve_load_1 =>  [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b]
sve_uzp1 with vtmp =>  [00 00 00 00 00 00 00 00 bb bb bb bb bb bb bb bb]

// Slice vgather1, vgather1
ext =>  [bb bb bb bb bb bb bb bb 00 00 00 00 00 00 00 00]

// Or vgather, vslice
sve_orr =>  [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa]

Actually, we can get the target result directly by uzp1 the output from sve_load_0 and sve_load_1, like

[0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a]
[0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b]
uzp1 => 
[bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa]

If so, the current design of LoadVectorGather may not be sufficiently low-level to suit AArch64. WDYT?

Yes, you are right! This works for truncating and merging two gather-load results. But we have to consider the other scenarios together: 1) no merging, and 2) four gather-loads plus merging. Additionally, we have to keep the semantics of LoadVectorGatherNode common across all scenarios and different architectures.

To keep the IR itself simple and unify the inputs for all types across architectures, I chose to pass one index vector to it for now, and define that one LoadVectorGatherNode finishes a single gather-load with that index. The element type of the result should be the subword type, so a subsequent type truncation is needed anyway. I think this makes sense for a single gather-load operation on subword types, right?

For cases that need more than one gather, I chose to generate multiple LoadVectorGatherNodes and merge the results at the end. I agree this may make the code less efficient than implementing all the different scenarios with one LoadVectorGatherNode; writing backend assembly for every scenario can be more efficient, but it makes the backend implementation more complex. In addition to the four normal gather cases, we would have to consider the corresponding masked versions and partial cases. Besides, the number of index vectors passed to LoadVectorGatherNode would differ (e.g. 1, 2, 4), which makes the IR itself hard to maintain.

Regarding the refinement based on your suggestion:

  • case-1: no merging
    • It's not an issue (current version is fine)
  • case-2: 2 times of gather and merge
    • Can be refined. But the LoadVectorGatherNode should be changed to accept 2 index vectors.
  • case-3: 4 times of gather and merge (only for byte)
    • Can be refined. We can implement it just like:
      step-1: v1 = gather1 + gather2 + 2 * uzp1 // merging the first and second gather-loads
      step-2: v2 = gather3 + gather4 + 2 * uzp1 // merging the third and fourth gather-loads
      step-3: v3 = slice (v2, v2), v = or(v1, v3) // do the final merging
      We have to change LoadVectorGatherNode as well. At least making it accept 2 index vectors.

In summary, LoadVectorGatherNode will be more complex than before, but the good thing is that giving it one more index input is OK. I'm not sure whether this is applicable to other architectures such as RVV, but I can try this change. Do you have a better idea? Thanks!

@fg1417

fg1417 commented Jul 16, 2025

  • case-2: 2 times of gather and merge

    • Can be refined. But the LoadVectorGatherNode should be changed to accept 2 index vectors.
  • case-3: 4 times of gather and merge (only for byte)

    • Can be refined. We can implement it just like:
      step-1: v1 = gather1 + gather2 + 2 * uzp1 // merging the first and second gather-loads
      step-2: v2 = gather3 + gather4 + 2 * uzp1 // merging the third and fourth gather-loads
      step-3: v3 = slice (v2, v2), v = or(v1, v3) // do the final merging
      We have to change LoadVectorGatherNode as well. At least making it accept 2 index vectors.

In summary, LoadVectorGatherNode will be more complex than before, but the good thing is that giving it one more index input is OK. I'm not sure whether this is applicable to other architectures such as RVV, but I can try this change. Do you have a better idea? Thanks!

@XiaohongGong thanks for your reply.

This idea generally looks good to me.

For case-2, we have

gather1 + gather2 + uzp1:
[0a 0a 0a 0a 0a 0a 0a 0a]
[0b 0b 0b 0b 0b 0b 0b 0b]
uzp1.H  => 
[bb bb bb bb aa aa aa aa]

Can we improve case-3 by following the pattern of case-2?

step-1:  v1 = gather1 + gather2 + uzp1 
[000a 000a 000a 000a 000a 000a 000a 000a]
[000b 000b 000b 000b 000b 000b 000b 000b]
uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]

step-2:  v2 = gather3 + gather4 + uzp1 
[000c 000c 000c 000c 000c 000c 000c 000c]
[000d 000d 000d 000d 000d 000d 000d 000d]
uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]

step-3:  v3 = uzp1 (v1, v2)
[0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]
[0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]
uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa]

Then we can also consistently define the semantics of LoadVectorGatherNode as gather1 + gather2 + uzp1.H , which would make backend much cleaner. WDYT?
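
A scalar sketch of this two-level narrowing (illustrative only, assuming each gather result holds zero-extended byte values in int lanes): the first level narrows int lanes to shorts and concatenates pairs of gather results, the second level narrows shorts to bytes and concatenates the two intermediate vectors.

static byte[] mergeFourGathers(int[] g1, int[] g2, int[] g3, int[] g4) {
    int n = g1.length;                           // lanes per gather result
    short[] v1 = new short[2 * n], v2 = new short[2 * n];
    for (int i = 0; i < n; i++) {                // uzp1.H-like step: narrow and concatenate pairs
        v1[i] = (short) g1[i];  v1[n + i] = (short) g2[i];
        v2[i] = (short) g3[i];  v2[n + i] = (short) g4[i];
    }
    byte[] v = new byte[4 * n];
    for (int i = 0; i < 2 * n; i++) {            // uzp1.B-like step on the two intermediate vectors
        v[i] = (byte) v1[i];    v[2 * n + i] = (byte) v2[i];
    }
    return v;
}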

@XiaohongGong
Author

XiaohongGong commented Jul 17, 2025

  • case-2: 2 times of gather and merge

    • Can be refined. But the LoadVectorGatherNode should be changed to accept 2 index vectors.
  • case-3: 4 times of gather and merge (only for byte)

    • Can be refined. We can implement it just like:
      step-1: v1 = gather1 + gather2 + 2 * uzp1 // merging the first and second gather-loads
      step-2: v2 = gather3 + gather4 + 2 * uzp1 // merging the third and fourth gather-loads
      step-3: v3 = slice (v2, v2), v = or(v1, v3) // do the final merging
      We have to change LoadVectorGatherNode as well. At least making it accept 2 index vectors.

In summary, LoadVectorGatherNode will be more complex than before, but the good thing is that giving it one more index input is OK. I'm not sure whether this is applicable to other architectures such as RVV, but I can try this change. Do you have a better idea? Thanks!

@XiaohongGong thanks for your reply.

This idea generally looks good to me.

For case-2, we have

gather1 + gather2 + uzp1:
[0a 0a 0a 0a 0a 0a 0a 0a]
[0b 0b 0b 0b 0b 0b 0b 0b]
uzp1.H  => 
[bb bb bb bb aa aa aa aa]

Can we improve case-3 by following the pattern of case-2?

step-1:  v1 = gather1 + gather2 + uzp1 
[000a 000a 000a 000a 000a 000a 000a 000a]
[000b 000b 000b 000b 000b 000b 000b 000b]
uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]

step-2:  v2 = gather3 + gather4 + uzp1 
[000c 000c 000c 000c 000c 000c 000c 000c]
[000d 000d 000d 000d 000d 000d 000d 000d]
uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]

step-3:  v3 = uzp1 (v1, v2)
[0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]
[0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]
uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa]

Then we can also consistently define the semantics of LoadVectorGatherNode as gather1 + gather2 + uzp1.H , which would make backend much cleaner. WDYT?

Thanks! Regarding the definition of LoadVectorGatherNode, we'd better keep the vector type as it is for byte and short vectors. The SVE vector gather-load instruction needs the type information. Additionally, the vector layout of the result should match the vector type, right? We can handle this easily with a pure backend implementation, but it seems not so easy at the mid-end IR level. BTW, uzp1 is an SVE-specific instruction, so we'd better define a common IR node for it, which would also be useful for other platforms that want to support the subword gather API, right? I'm not sure whether this makes sense. I will take this suggestion into consideration.

@XiaohongGong
Author

XiaohongGong commented Jul 17, 2025

Can we improve case-3 by following the pattern of case-2?

step-1:  v1 = gather1 + gather2 + uzp1 
[000a 000a 000a 000a 000a 000a 000a 000a]
[000b 000b 000b 000b 000b 000b 000b 000b]
uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]

step-2:  v2 = gather3 + gather4 + uzp1 
[000c 000c 000c 000c 000c 000c 000c 000c]
[000d 000d 000d 000d 000d 000d 000d 000d]
uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]

step-3:  v3 = uzp1 (v1, v2)
[0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]
[0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]
uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa]

Then we can also consistently define the semantics of LoadVectorGatherNode as gather1 + gather2 + uzp1.H , which would make backend much cleaner. WDYT?

Thanks! Regarding the definition of LoadVectorGatherNode, we'd better keep the vector type as it is for byte and short vectors. The SVE vector gather-load instruction needs the type information. Additionally, the vector layout of the result should match the vector type, right? We can handle this easily with a pure backend implementation, but it seems not so easy at the mid-end IR level. BTW, uzp1 is an SVE-specific instruction, so we'd better define a common IR node for it, which would also be useful for other platforms that want to support the subword gather API, right? I'm not sure whether this makes sense. I will take this suggestion into consideration.

Maybe I can define the vector type of LoadVectorGatherNode as the int vector type for subword types. An additional flag is necessary to denote whether it is a byte or short load. The node then only finishes the gather operation (without any truncation). We could also define an IR node like VectorConcateNode to merge the gather results. For cases where only one gather is needed, we can just append a type-cast node like VectorCastI2X. This seems to make the IR more common and the code cleaner.

The implementation would look like:

  • case-1 one gather:
    • gather (bt: int) + cast (bt: byte|short)
  • case-2 two gathers:
    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: short)
    • step-2: cast (bt: byte) // just for byte vectors
  • case-3 four gathers:
    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: short)
    • step-2: gather3 (bt: int) + gather4 (bt: int) + concate(gather3, gather4) (bt: short)
    • step-3: concate (bt: byte)

Or more commonly:

  • case-1 one gather:
    • gather (bt: int) + cast (bt: byte|short)
  • case-2 two gathers:
    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: byte|short)
  • case-3 four gathers:
    • step-1: gather1 (bt: int) + gather2 (bt: int) + gather3 (bt: int) + gather4 (bt: int)
    • step-2: concate(gather1, gather2, gather3, gather4) (bt: byte|short)

From the IR level, which one do you think is better?

@fg1417

fg1417 commented Jul 17, 2025

Thanks! Regarding the definition of LoadVectorGatherNode, we'd better keep the vector type as it is for byte and short vectors. The SVE vector gather-load instruction needs the type information. Additionally, the vector layout of the result should match the vector type, right? We can handle this easily with a pure backend implementation, but it seems not so easy at the mid-end IR level. BTW, uzp1 is an SVE-specific instruction, so we'd better define a common IR node for it, which would also be useful for other platforms that want to support the subword gather API, right?

That makes sense to me. Thanks for your explanation!

Maybe I can define the vector type of LoadVectorGatherNode as the int vector type for subword types. An additional flag is necessary to denote whether it is a byte or short load. The node then only finishes the gather operation (without any truncation). We could also define an IR node like VectorConcateNode to merge the gather results. For cases where only one gather is needed, we can just append a type-cast node like VectorCastI2X. This seems to make the IR more common and the code cleaner.

The implementation would look like:

  • case-1 one gather:

    • gather (bt: int) + cast (bt: byte|short)
  • case-2 two gathers:

    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: short)
    • step-2: cast (bt: byte) // just for byte vectors
  • case-3 four gathers:

    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: short)
    • step-2: gather3 (bt: int) + gather4 (bt: int) + concate(gather3, gather4) (bt: short)
    • step-3: concate (bt: byte)

Or more commonly:

  • case-1 one gather:

    • gather (bt: int) + cast (bt: byte|short)
  • case-2 two gathers:

    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: byte|short)
  • case-3 four gathers:

    • step-1: gather1 (bt: int) + gather2 (bt: int) + gather3 (bt: int) + gather4 (bt: int)
    • step-2: concate(gather1, gather2, gather3, gather4) (bt: byte|short)

From the IR level, which one do you think is better?

I like this idea! The first one looks better, in which concate would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios.

@XiaohongGong
Author

I like this idea! The first one looks better, in which concate would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios.

Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion!

@fg1417

fg1417 commented Jul 17, 2025

Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion!

Thanks! I’d suggest also highlighting aarch64 in the JBS title, so others who are interested won’t miss it.

@XiaohongGong
Author

Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion!

Thanks! I’d suggest also highlighting aarch64 in the JBS title, so others who are interested won’t miss it.

Thanks for your point~
I'm not sure, since this is not a pure AArch64 backend patch as far as I can see. Actually, the backend rules are quite simple, and the mid-end IR change is relatively more complex. I'm not sure whether this patch would then be missed by others who are not familiar with AArch64 if it is highlighted.

@XiaohongGong
Author

Hi @fg1417 , the latest commit refactors the IR patterns and the LoadVectorGather[Masked] IR based on the above discussions. Could you please help take another look? Thanks~

Main changes

  • The type of LoadVectorGather[Masked] is changed from the original subword vector type to the int vector type. Additionally, a _mem_bt member is added to denote the load type.
    • backend rules become cleaner
    • mask generation for partial cases becomes cleaner
  • Define VectorConcatenateNode and remove VectorSliceNode.
    • VectorConcatenateNode has the same function as SVE/NEON's uzp1. It narrows the elements of each input to half size and concatenates the narrowed results from src1 and src2 into dst (src1 goes to the lower part and src2 to the upper part of dst); see the scalar sketch after this list.
  • The matcher helper function vector_idea_reg_size() is no longer needed and has been removed. It was originally used by VectorSlice.
  • More IR tests are added for various vector species.
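
A scalar sketch of the VectorConcatenateNode semantics (illustrative only, assuming both sources have the same lane count), here for int sources and a short destination:

static short[] vectorConcatenate(int[] src1, int[] src2) {
    short[] dst = new short[src1.length + src2.length];
    for (int i = 0; i < src1.length; i++) {
        dst[i] = (short) src1[i];                    // narrowed src1 -> lower half
        dst[src1.length + i] = (short) src2[i];      // narrowed src2 -> upper half
    }
    return dst;
}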

IR implementation

  • It needs one gather-load
    • LoadVectorGather (bt: int) + VectorCastI2X (bt: byte|short)
  • It needs two gather-loads and merge
    • step-1: v1 = LoadVectorGather (bt: int), v2 = LoadVectorGather (bt: int)
    • step-2: merge = VectorConcatenate(v1, v2) (bt: short)
    • step-3: (only byte) v = VectorCastS2X(merge) (bt: byte)
  • It needs four gather-loads and merge - (only byte vector)
    • step-1: v1 = LoadVectorGather (bt: int), v2 = LoadVectorGather (bt: int)
    • step-2: merge1 = VectorConcatenate(v1, v2) (bt: short)
    • step-3: v3 = LoadVectorGather (bt: int), v4 = LoadVectorGather (bt: int)
    • step-4: merge2 = VectorConcatenate(v3, v4) (bt: short)
    • step-5: v = VectorConcatenate(merge1, merge2) (bt: byte)

Performance change

Uplifts of about 4% ~ 9% can be observed on some micro benchmarks, and no significant regressions are observed.
The following is the performance change on NVIDIA Grace with the latest commit:

Benchmark                        (SIZE)   Mode   Units      Before     After   Gain
microByteGather128                   64  thrpt   ops/ms  48405.283  48668.502  1.005
microByteGather128                  256  thrpt   ops/ms  12821.924  12662.342  0.987
microByteGather128                 1024  thrpt   ops/ms   3253.778   3198.608  0.983
microByteGather128                 4096  thrpt   ops/ms    817.604    801.250  0.979
microByteGather128_MASK              64  thrpt   ops/ms  46124.722  48334.916  1.047
microByteGather128_MASK             256  thrpt   ops/ms  12152.575  12652.821  1.041
microByteGather128_MASK            1024  thrpt   ops/ms   3075.066   3193.787  1.038
microByteGather128_MASK            4096  thrpt   ops/ms    812.738    803.017  0.988
microByteGather128_MASK_NZ_OFF       64  thrpt   ops/ms  46130.244  48384.633  1.048
microByteGather128_MASK_NZ_OFF      256  thrpt   ops/ms  12139.800  12624.298  1.039
microByteGather128_MASK_NZ_OFF     1024  thrpt   ops/ms   3078.040   3203.049  1.040
microByteGather128_MASK_NZ_OFF     4096  thrpt   ops/ms    812.716    802.712  0.987
microByteGather128_NZ_OFF            64  thrpt   ops/ms  48369.524  48643.937  1.005
microByteGather128_NZ_OFF           256  thrpt   ops/ms  12814.552  12672.757  0.988
microByteGather128_NZ_OFF          1024  thrpt   ops/ms   3253.294   3202.016  0.984
microByteGather128_NZ_OFF          4096  thrpt   ops/ms    818.389    805.488  0.984
microByteGather64                    64  thrpt   ops/ms  48491.633  50615.848  1.043
microByteGather64                   256  thrpt   ops/ms  12340.778  13156.762  1.066
microByteGather64                  1024  thrpt   ops/ms   3067.592   3322.777  1.083
microByteGather64                  4096  thrpt   ops/ms    767.111    832.409  1.085
microByteGather64_MASK               64  thrpt   ops/ms  48526.894  50730.468  1.045
microByteGather64_MASK              256  thrpt   ops/ms  12340.398  13159.723  1.066
microByteGather64_MASK             1024  thrpt   ops/ms   3066.227   3327.964  1.085
microByteGather64_MASK             4096  thrpt   ops/ms    767.390    833.327  1.085
microByteGather64_MASK_NZ_OFF        64  thrpt   ops/ms  48472.912  51287.634  1.058
microByteGather64_MASK_NZ_OFF       256  thrpt   ops/ms  12331.578  13258.954  1.075
microByteGather64_MASK_NZ_OFF      1024  thrpt   ops/ms   3070.319   3345.911  1.089
microByteGather64_MASK_NZ_OFF      4096  thrpt   ops/ms    767.097    838.008  1.092
microByteGather64_NZ_OFF             64  thrpt   ops/ms  48492.984  51224.743  1.056
microByteGather64_NZ_OFF            256  thrpt   ops/ms  12334.944  13240.494  1.073
microByteGather64_NZ_OFF           1024  thrpt   ops/ms   3067.754   3343.387  1.089
microByteGather64_NZ_OFF           4096  thrpt   ops/ms    767.123    837.642  1.091
microShortGather128                  64  thrpt   ops/ms  37717.835  37041.162  0.982
microShortGather128                 256  thrpt   ops/ms   9467.160   9890.109  1.044
microShortGather128                1024  thrpt   ops/ms   2376.520   2481.753  1.044
microShortGather128                4096  thrpt   ops/ms    595.030    621.274  1.044
microShortGather128_MASK             64  thrpt   ops/ms  37655.017  37036.887  0.983
microShortGather128_MASK            256  thrpt   ops/ms   9471.324   9859.461  1.040
microShortGather128_MASK           1024  thrpt   ops/ms   2376.811   2477.106  1.042
microShortGather128_MASK           4096  thrpt   ops/ms    595.049    620.082  1.042
microShortGather128_MASK_NZ_OFF      64  thrpt   ops/ms  37636.229  37029.468  0.983
microShortGather128_MASK_NZ_OFF     256  thrpt   ops/ms   9483.674   9867.427  1.040
microShortGather128_MASK_NZ_OFF    1024  thrpt   ops/ms   2379.877   2478.608  1.041
microShortGather128_MASK_NZ_OFF    4096  thrpt   ops/ms    594.710    620.455  1.043
microShortGather128_NZ_OFF           64  thrpt   ops/ms  37706.896  37044.505  0.982
microShortGather128_NZ_OFF          256  thrpt   ops/ms   9487.006   9882.079  1.041
microShortGather128_NZ_OFF         1024  thrpt   ops/ms   2379.571   2482.341  1.043
microShortGather128_NZ_OFF         4096  thrpt   ops/ms    595.099    621.392  1.044
microShortGather64                   64  thrpt   ops/ms  37773.485  37502.698  0.992
microShortGather64                  256  thrpt   ops/ms   9591.046   9640.225  1.005
microShortGather64                 1024  thrpt   ops/ms   2406.013   2420.376  1.005
microShortGather64                 4096  thrpt   ops/ms    603.270    606.541  1.005
microShortGather64_MASK              64  thrpt   ops/ms  37781.860  37479.295  0.991
microShortGather64_MASK             256  thrpt   ops/ms   9608.015   9657.010  1.005
microShortGather64_MASK            1024  thrpt   ops/ms   2406.828   2422.170  1.006
microShortGather64_MASK            4096  thrpt   ops/ms    602.965    606.283  1.005
microShortGather64_MASK_NZ_OFF       64  thrpt   ops/ms  37740.577  37487.740  0.993
microShortGather64_MASK_NZ_OFF      256  thrpt   ops/ms   9593.611   9663.041  1.007
microShortGather64_MASK_NZ_OFF     1024  thrpt   ops/ms   2404.846   2423.493  1.007
microShortGather64_MASK_NZ_OFF     4096  thrpt   ops/ms    602.691    605.911  1.005
microShortGather64_NZ_OFF            64  thrpt   ops/ms  37723.586  37507.899  0.994
microShortGather64_NZ_OFF           256  thrpt   ops/ms   9589.985   9630.033  1.004
microShortGather64_NZ_OFF          1024  thrpt   ops/ms   2405.774   2423.655  1.007
microShortGather64_NZ_OFF          4096  thrpt   ops/ms    602.778    606.151  1.005

@fg1417 fg1417 left a comment

Thanks for updating it. Looks good on my end.
It might be helpful to have Reviewers take a look.

@XiaohongGong
Author

Thanks for updating it. Looks good on my end. It might be helpful to have Reviewers take a look.

Thanks a lot for your review and test!

@XiaohongGong
Author

Hi, could anyone please help take a look at this PR? Thanks so much!

Hi @RealFYang , I'm not sure whether there is any plan to support the subword gather-load for RVV; it would be much appreciated if we could get feedback from another architecture's side. Would you mind taking a look at this PR? Thanks a lot in advance!

@XiaohongGong
Author

ping~

@XiaohongGong
Author

Hi, could anyone please help take a look at this PR? Thanks a lot in advance!

@XiaohongGong
Author

Hi @eme64 , could you please help take a look at this PR? Thanks a lot in advance!

Contributor

@shqking shqking left a comment

LGTM

Contributor

@erifan erifan left a comment

LGTM

Contributor

@eme64 eme64 left a comment

Looks very interesting. I have a first series of questions / comments :)

There is definitely a tradeoff between complexity in the backend and in the C2 IR, so I'm still trying to wrap my head around that decision. I'm just afraid that adding more very specific C2 IR nodes makes it more complicated to do optimizations in the C2 IR.

// Unpack elements from the lower or upper half of the source
// predicate and place in elements of twice their size within
// the destination predicate.

Contributor

Suggested change

unnecessary empty line

Author

This empty line is auto-generated by the m4 file. I tried several ways to clean it up, but they all failed, so I have to keep it as it is.

Comment on lines 1762 to 1769
// Concatenate elements from two source vectors by narrowing the elements to half size. Put
// the narrowed elements from the first source vector to the lower half of the destination
// vector, and the narrowed elements from the second source vector to the upper half.
//
// e.g. vec1 = [0d 0c 0b 0a], vec2 = [0h 0g 0f 0e]
// dst = [h g f e d c b a]
//
class VectorConcatenateNode : public VectorNode {
Contributor

That semantic is not quite what I would expect from Concatenate. Maybe we can call it something else?
VectorConcatenateAndNarrowNode?

Contributor

Have you considered using 2x Cast + Concatenate instead, and just matching that in the backend? I don't remember how to do the mere Concat, but it should be possible via the unslice or some other operation that concatenates two vectors.

Author

That semantic is not quite what I would expect from Concatenate. Maybe we can call it something else? VectorConcatenateAndNarrowNode?

Yeah, VectorConcatenateAndNarrowNode would be a much better match. I just thought the name would be too long. I will change it in the next commit. Thanks for your suggestion!

Author

@XiaohongGong XiaohongGong Sep 8, 2025

Have you considered using 2x Cast + Concatenate instead, and just matching that in the backend? I don't remember how to do the mere Concat, but it should be possible via the unslice or some other operation that concatenates two vectors.

Would using 2x Cast + Concatenate make the IRs and match rules more complex? A mere concatenate would be something like vector slice in the Vector API: it concatenates two vectors into one, with an index denoting the merging position, and it requires the vector types of the two input vectors and the dst vector to be the same. Hence, if we want to split this operation into cast and concatenate, the IRs would be (assuming the original type of v1/v2 is 4-int and the result type should be 8-short):

  1. Narrow two input vectors:
    v1 = VectorCast(v1) (4-short); v2 = VectorCast(v2) (4-short).
    The vector length (element count) is not changed while the element size is halved, hence the vector size in bytes is halved as well.
  2. Resize v1 and v2 to double vector length. The higher bits are cleared:
    v1 = VectorReinterpret(v1) (8-short); v2 = VectorReinterpret(v2) (8-short).
  3. Concatenate v1 and v2 like slice. The position is the middle of the vector length.
    v = VectorSlice(v1, v2, 4) (8-short).

If we want to merge these IRs in the backend, would the match rules be more complex? I will take this into consideration.

Contributor

I'm not saying I know that this alternative would be better. I'm just worried about having extra IR nodes, and then optimizations are more complex / just don't work because we don't handle all nodes.

Contributor

@iwanowww iwanowww Sep 24, 2025

I started looking at the PR and it looks appealing to simplify VM intrinsics and lift more code into Java. In other words, subword gather operation can be coded as a composition of operations on int vectors. Have you considered that?

It doesn't solve the problem of how to reliably match a complex graph into a single instruction though. The Matcher favors tree representation, but there are multiple ways to work around it. Personally, I'd prefer to address it separately.

For now, a dedicated node to concatenate vectors looks appropriate (please note there's an existing PackNode et al.).
It can be either exposed through VM intrinsic or substituted for a well-known complex IR shape during IGVN (like the one you depicted). The nice thing is it'll uniformly cover all usages irrespective of whether they come from Vector API implementation itself or from user code.

In the context of Vector API, the plan was to expose generic element rearranges/shuffles through API, but then enable various strength-reductions to optimize well-known/popular shapes. Packing multiple vectors perfectly fits that effort.

Author

I started looking at the PR and it looks appealing to simplify VM intrinsics and lift more code into Java. In other words, subword gather operation can be coded as a composition of operations on int vectors. Have you considered that?

Thanks so much for looking at this PR! Yes, personally I think we can move the op generation to the Java level for the subword gather operation, and I also considered this when I started working on this task. However, this may break the current backend implementation for other architectures like x86. I'm not sure whether moving to Java will also be friendly to non-SVE arches; per my understanding, subword gather depends much more on the backend solution.

For now, a dedicated node to concatenate vectors looks appropriate (please note there's an existing PackNode et al.).
It can be either exposed through VM intrinsic or substituted for a well-known complex IR shape during IGVN (like the one you depicted). The nice thing is it'll uniformly cover all usages irrespective of whether they come from Vector API implementation itself or from user code.

In the context of Vector API, the plan was to expose generic element rearranges/shuffles through API, but then enable various strength-reductions to optimize well-known/popular shapes. Packing multiple vectors perfectly fits that effort.

Thanks for your inputs on the IR choice. I agree with you about adding such a vector concatenate node in C2. And if we decide to move the complex implementation to Java-level, we'd better also add such an API for vector concatenate, right?

Contributor

if we decide to move the complex implementation to Java-level, we'd better also add such an API for vector concatenate, right?

There's already a generic shuffle operation present (rearrange). But there are precedents where more specific operations became part of the API for convenience reasons (e.g., slice/unslice). So, a dedicated operation for vector concatenation may be well-justified.

Contributor

However, this may break the current backend implementation for other architectures like x86. I'm not sure whether moving to Java will also be friendly to non-SVE arches; per my understanding, subword gather depends much more on the backend solution.

IMO that's a clear sign that current abstraction is way too ad-hoc and platform-specific. x86 ISA lacks native support, so the operation is emulated with hand-written assembly. If there's a less performant implementation, but which relies on a uniform cross-platform VM interface, it'll be a clear winner.

The PR, as it is now, introduces a new IR representation which complicates things even more. Instead, I'd encourage you to work on a uniform API even if x86 won't be immediately migrated.

Author

I see, and it makes sense to me. Thanks for your suggestion. I will try moving the complex operations to the API level next.

Comment on lines +1840 to +1841
// Unpack the elements to twice size.
class VectorMaskWidenNode : public VectorNode {
Contributor

Can you add a visual example like above for VectorConcatenateNode, please?

Contributor

Did you consider the alternative of Extract + Cast? Not sure if that would be better, you know more about the code complexity. It would just allow us to have one fewer nodes.

Author

C2 only has the Extract node, which extracts a single element from a vector, right? Extracting the lowest part can easily be implemented with VectorReinterpret, but how about the higher parts? Maybe this could also be implemented with operations like slice, but it seems this would also make the IR more complex. For Cast, we have VectorCastMask now, but it assumes the vector length is the same for input and output, so a VectorReinterpret or a VectorExtract is still needed.

I can try splitting the IR, but I guess an additional new node is still necessary.

It would just allow us to have one fewer nodes.

This is also what I expect really.

Contributor

It would just be nice to build on "simple" building blocks and not have too many complex nodes, that have very special semantics (widen + split into two). It just means that the IR optimizations have to take care of more special cases, rather than following simple rules/optimizations because every IR node does a relatively simple thing.

Maybe you find out that we really need a complex node, and can provide good arguments. Looking forward to what you find :)

Author

Hi @iwanowww , regarding the operation that widens the upper half of a vector mask to elements of twice the size, do you have any better idea? To split the gather operation for a subword type, we usually need to split the input mask as well, especially for SVE, where the vector mask needs the same element type as the data. I need to extract part of the original vector mask and widen it to the int type. For the Vector API, I think we can either use a slice-like operation for a mask, or a vector extract API. WDYT?

Note that SVE has the native PUNPKHI [1] instruction for this.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/PUNPKHI--PUNPKLO--Unpack-and-widen-half-of-predicate-
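
For illustration only, one way to express "take the part-th int-sized piece of a subword mask" at the Java level is to go through the mask's bit representation; VectorMask.toLong()/fromLong() are existing Vector API methods, while the helper itself is just a sketch of the idea, not a proposed API:

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;

class MaskPartSketch {
    // Int-species mask covering lanes [part * intLanes, (part + 1) * intLanes) of the byte mask.
    static VectorMask<Integer> intPartOfByteMask(VectorMask<Byte> m, int part) {
        int intLanes = IntVector.SPECIES_PREFERRED.length();
        long bits = m.toLong() >>> (part * intLanes);       // one bit per byte lane
        return VectorMask.fromLong(IntVector.SPECIES_PREFERRED, bits);
    }
}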

@XiaohongGong
Author

Hi @eme64 , I just pushed a commit which adds more comments and assertions in the code. This is just a simple fix for part of your comments. Regarding the IR refinement, I need more time to take a look. So could you please take another look at the changes related to the method rename/comments/assertions? Thanks a lot in advance!

//
// SVE requires vector indices for gather-load/scatter-store operations on all
// data types.
static bool gather_scatter_requires_index_in_address(BasicType bt) {
Contributor

I know I agreed to this naming, but I looked at the signature of Gather again:
LoadVectorGatherNode(Node* c, Node* mem, Node* adr, const TypePtr* at, const TypeVect* vt, Node* indices)

I'm a little confused now about which address your name references. Is it the adr? I think not, because that is the base address, right? Can you clarify a little more? Maybe add to the documentation of the gather and scatter nodes as well, if you think that helps?

Contributor

Actually, you already did add documentation to the gather / scatter nodes now. And based on your explanation there, I suggest you rename the method here to:
gather_scatter_requires_indices_from_array
This would say that the indices come from an array, rather than a vector register.

The current name we had agreed on confuses me because it suggests that the index may already be in the address adr, but that does not make much sense.

Author

Ok, gather_scatter_requires_indices_from_array sounds better to me. I will change it soon.

I'm a little confused now what is the address that your name references. Is it the adr? I think not, because that is the base address, right? Can you clarify a little more? Maybe add to the documentation of the gather and scatter node as well, if you think that helps?

It means that the indices input is an address that holds the indexes if this method returns true; otherwise, indices is a vector register. You are right that it has no relationship with the adr input, which is the memory base address.

// Load Vector from memory via index map. The index map is usually a vector of indices
// that has the same vector type as the node's bottom type. For non-subword types, it must
// be. However, for subword types, the basic type of index is int. Hence, the index map
// can be either a vector with int elements or an address which saves the int indices.
Contributor

Very nice, that helps!

Contributor

@eme64 eme64 left a comment

@XiaohongGong I'm going to be away on vacation for about 3 weeks now. So I won't be able to continue with the review until I'm back.

Maybe @vnkozlov or @iwanowww can review instead. Maybe @PaulSandoz or @jatin-bhateja would like to look at it too. If they do, I would want them to consider if the approach with the special vector nodes VectorConcatenateAndNarrow and VectorMaskWiden are really desirable. The complexity needs to go somewhere, but I'm not sure if it is better in the C2 IR or in the backend.

In this PR, there are already threads here and here.

@PaulSandoz
Member

I would want them to consider if the approach with the special vector nodes VectorConcatenateAndNarrow and VectorMaskWiden are really desirable. The complexity needs to go somewhere, but I'm not sure if it is better in the C2 IR or in the backend.

It would just be nice to build on "simple" building blocks and not have too many complex nodes, that have very special semantics (widen + split into two)

Intuitively this seems like the right way to think about it, although I don't have a proposed solution, i am really just agreeing with the above sentiment - a compositional solution, if possible, with the right primitive building blocks will likely be superior.

@XiaohongGong
Author

I would want them to consider if the approach with the special vector nodes VectorConcatenateAndNarrow and VectorMaskWiden are really desirable. The complexity needs to go somewhere, but I'm not sure if it is better in the C2 IR or in the backend.

It would just be nice to build on "simple" building blocks and not have too many complex nodes, that have very special semantics (widen + split into two)

Intuitively this seems like the right way to think about it, although I don't have a proposed solution, i am really just agreeing with the above sentiment - a compositional solution, if possible, with the right primitive building blocks will likely be superior.

Thanks for your input @PaulSandoz ! And I agree with making the IR simple enough. I'm now working on finding a better way for these two complex operations. Hope I can fix it soon. Thanks!

@XiaohongGong
Author

Hi @iwanowww , @PaulSandoz , @eme64 ,

Hope you’re doing well!

I’ve created a prototype that moves the implementation to the Java API level, as suggested (see: XiaohongGong#8). This refactoring has resulted in significantly cleaner and more maintainable code. Thanks for your insightful feedback @iwanowww !

However, it also introduces new issues that we have to consider. The codegen might not be optimal. If we want to generate the optimal instruction sequence, we need more effort.

Following is the details:

  1. We need a new API to cross-lane shift the lanes for a vector mask, which is used to extract different piece of a vector mask if the whole gather operation needs to be split. Consider it has a Vector.slice() API which can implement such a function, I added a similar one for VectorMask.

    There are two new issues that I need to address for this API:

    • SVE lacks a native instruction for such a mask operation. I have to convert it to a vector, call the Vector.slice(), and then convert back to a mask. Please note that the whole progress is not SVE friendly. The performance of such an API will have large gap on SVE compared with other arches.
    • To generate a SVE optimal instruction, I have to do further IR transformation and optimize the pattern with match rule. I'm not sure whether the optimization will be common enough to be accepted in future.

    Do you have a better idea on the new added API? I'd like to avoid adding such a performance not friendly API, and the API might not be frequently used in real world.

  2. To make the interface uniform across-platforms, each API is defined as the same vector type of the target result, although we need to do separation and merging. However, as the SVE gather-load instruction works with int vector type, we need special handling in compiler IR-level.

    I'd like to extend LoadVectorGather{,Masked} with mem_bt to handle subword loads, adjust the mask with a cast/resize before, and append a vector cast/reinterpret after. Splitting into simple IRs makes further IR-level optimization possible. This might make the compiler IRs differ across platforms, as they do in the current PR, so the compiler change might not be so clean. Does this make sense to you?

  3. Further compiler optimization is necessary to eliminate inefficient instructions. This needs a combination of IR transformations and match rules. I think this might be more complex, and the result is not guaranteed yet; it needs further implementation work.
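
A minimal Java sketch of the mask-slice workaround described in item 1, written against the public Vector API only; the helper name sliceMask is an illustrative assumption, not part of the actual patch:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;

class MaskSliceSketch {
    // Illustrative only: slice a mask by round-tripping through a vector,
    // since SVE has no native cross-lane shift for predicate registers.
    static VectorMask<Byte> sliceMask(VectorMask<Byte> m, int origin) {
        ByteVector asVector = (ByteVector) m.toVector();     // true -> -1, false -> 0
        ByteVector sliced = asVector.slice(origin);           // shift lanes down, zero-fill the tail
        return sliced.compare(VectorOperators.LT, (byte) 0);  // negative lanes were 'true'
    }
}
```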

In summary, the Java-level implementation of this API itself is clean, but it introduces more overhead, especially for SVE. It's not easy for me to conclude whether the Java change wins or not. Any suggestions on this?

Thanks,
Xiaohong

@PaulSandoz
Member

@XiaohongGong would it help if loadWithMap accepted a part number, identifying what part of the mask to use and identifying the part where the returned elements will be located, such that the returned vectors can be easily composed with logical or (as if merging)?

@XiaohongGong
Author

XiaohongGong commented Oct 15, 2025

@XiaohongGong would it help if loadWithMap accepted a part number, identifying what part of the mask to use and identifying the part where the returned elements will be located, such that the returned vectors can be easily composed with logical or (as if merging)?

Thanks for your input @PaulSandoz ! Yes, I think passing a part number to HotSpot would be helpful. But that would move the cross-lane shift operation for a vector/mask into the VM intrinsic part, which is more convenient for compiler optimization. It seems this will be a co-design of the Java code and the VM intrinsic, which makes sense to me. But will this make the loadWithMap interface more complex? WDYT @iwanowww ?

@PaulSandoz
Member

I suspect it's likely more complex overall to add a slice operation to the mask that is really only needed for a specific case. (A more general operation would be compress/expand of the mask bits, but I don't believe there are hardware instructions for such operations on mask registers.)

In my view adding a part parameter is a compromise and seems less complex than requiring N index vectors, and it fits with a general pattern we have around parts of the vector. It moves the specialized operation requirements on the mask into the area where it is needed rather than trying to generalize in a manner that I don't think is appropriate in the mask API.

@XiaohongGong
Author

I suspect it's likely more complex overall to add a slice operation to the mask that is really only needed for a specific case. (A more general operation would be compress/expand of the mask bits, but I don't believe there are hardware instructions for such operations on mask registers.)

Yes, I agree with you. Personally, I’d prefer not to introduce such APIs for a vector mask.

In my view adding a part parameter is a compromise and seems less complex than requiring N index vectors, and it fits with a general pattern we have around parts of the vector. It moves the specialized operation requirements on the mask into the area where it is needed rather than trying to generalize in a manner that I don't think is appropriate in the mask API.

Yeah, it sounds reasonable that an API can finish a simple task and then move the results to a different part of a vector based on an offset. Since loadWithMap is used as a VM interface, we have to add checks for the passed origin against the vector length. Besides, we have to support the same cross-lane shift for other vector types like int/long/double.
I will prepare a prototype for this. Thanks for your inputs @PaulSandoz .

@XiaohongGong
Author

XiaohongGong commented Oct 31, 2025

Hi @iwanowww , @PaulSandoz , and @eme64 :

I’ve recently completed a prototype that moves the implementation into the Java API level:
Refactor subword gather API in Java.

Do you think it would be a good time to open a draft PR for easier review?

Below is a brief summary of the changes compared with the previous version.

Main idea

  • Invoke VectorSupport.loadWithMap() multiple times in Java when needed, where each call handles a single vector gather load.
  • In the compiler, the gathered result is represented as an int vector and then cast to the original subword vector species. Cross-lane shifting aligns the elements correctly.
  • The partial results are merged in Java using the Vector.or() API.
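
A rough, conceptual Java sketch of this split-and-merge scheme. The gatherPart helper is a hypothetical stand-in for the intrinsified VectorSupport.loadWithMap call (its real signature differs); its scalar body only illustrates the contract that lanes outside the current part stay zero, so the partial results can be merged with or():

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

class SubwordGatherSketch {
    // Hypothetical stand-in for the per-part intrinsic call: only the lanes
    // belonging to 'part' are filled; all other lanes stay zero.
    static ByteVector gatherPart(VectorSpecies<Byte> species, byte[] base,
                                 int[] indexMap, int mapOffset,
                                 int part, int lanesPerPart) {
        byte[] lanes = new byte[species.length()];
        int start = part * lanesPerPart;
        int end = Math.min(start + lanesPerPart, species.length());
        for (int i = start; i < end; i++) {
            lanes[i] = base[indexMap[mapOffset + i]];
        }
        return ByteVector.fromArray(species, lanes, 0);
    }

    // Split-and-merge gather as described in the "Main idea" above.
    static ByteVector gatherBytes(VectorSpecies<Byte> species, byte[] base,
                                  int[] indexMap, int mapOffset, int parts) {
        int lanesPerPart = species.length() / parts;
        ByteVector result = ByteVector.zero(species);
        for (int part = 0; part < parts; part++) {
            result = result.or(gatherPart(species, base, indexMap, mapOffset,
                                          part, lanesPerPart));
        }
        return result;
    }
}
```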

Advantages

  • No need to pass all vector indices to HotSpot.
  • The design is platform agnostic.

Limitations

  • The Java implementation is less clean because it must accommodate compiler optimizations.
  • Compiler changes remain nontrivial due to the required vector/mask casting, resizing, and slicing.
  • Additional IR Ideal transformations and match rules are needed for optimal SVE code generation.
  • The API's performance will degrade significantly (about 30% ~ 50%) on platforms that do not support compiler intrinsification. Since what used to be a single API call is now split into multiple calls that cannot be intrinsified, the overhead of creating multiple vector objects in pure Java can be substantial. Does this impact matter?

I plan to rebase and update the compiler-change PR using the same node and match rules as well, so we can clearly compare both approaches.

Any thoughts or feedback would be much appreciated. Thanks so much!

Best Regards,
Xiaohong

@iwanowww
Contributor

iwanowww commented Nov 8, 2025

Nice work, @XiaohongGong! I haven't closely looked at the patch yet, but I very much like the general direction. I don't consider performance regression in default Java implementation a big deal. In the future, we can rethink how default implementations are handled for operations which lack hardware/VM intrinsic support.

@XiaohongGong
Author

Nice work, @XiaohongGong! I haven't closely looked at the patch yet, but I very much like the general direction. I don't consider performance regression in default Java implementation a big deal. In the future, we can rethink how default implementations are handled for operations which lack hardware/VM intrinsic support.

Thank you very much for your input so far—it’s been extremely helpful.

I have an additional concern regarding the slice operation for both masks and vectors. While Vector.slice exists and works well for vector merging, there's currently no equivalent operation for vector masks when using the masked gather API and splitting is required. Both adding and not adding such an API come with trade-offs:

  1. Adding a slice API for masks:
    This would likely make the compiler code cleaner. However, it would also increase patch complexity and could present performance issues on certain architectures such as SVE. Optimizing codegen for this path might require significant additional compiler work.

  2. Not adding a slice API for masks:
    As @PaulSandoz suggested, we could move mask slicing to the compiler by passing an origin to the intrinsic, and similarly move vector slice operations to the compiler. This approach, however, introduces another issue: VectorSlice with a constant index is not universally supported across all architectures and vector species—see the X86 limitation (VectorSlice details). While it can be implemented with rearrange/blend as alternatives, they would complicate the compiler code. If left unsupported, gather API intrinsification would fail and fall back to the Java implementation, which I suspect would cause a significant performance regression.

I am unsure of the best approach for implementing the slice operations cleanly and efficiently, so I would greatly appreciate any additional feedback or suggestions on this topic. Thank you again for your help!

@PaulSandoz
Member

PaulSandoz commented Nov 10, 2025

and similarly move vector slice operations to the compiler

Yes, you have to slice the mask, whether it be represented as a mask/predicate register or as a vector. There's no way around that, and we have to deal with the current limitations in hardware. As a further compromise we can, in Java, convert the mask to a vector and rearrange it, then pass the vector representation of the mask to the scatter/gather intrinsic. Then the intrinsic can, if it chooses, convert it back to a mask/predicate register if that is the best form.

IIUC we have agreed for non-masked subword scatter/gather to compose by parts using the intrinsic. That seems good, and it looks like we can do the same for masked subword scatter/gather, as above, but it may not be the most efficient for the platform.

Do you have any use cases for masked subword scatter/gather? Given the lack of underlying hardware support, it seems that focusing on getting the non-masked version working well, and the masked version working OK, is a pragmatic way forward.

@XiaohongGong
Author

and similarly move vector slice operations to the compiler

Yes, you have to slice the mask, whether it be represented as a mask/predicate register or as a vector. There's no way around that, and we have to deal with the current limitations in hardware. As a further compromise we can, in Java, convert the mask to a vector and rearrange it, then pass the vector representation of the mask to the scatter/gather intrinsic. Then the intrinsic can, if it chooses, convert it back to a mask/predicate register if that is the best form.

Yes, converting the mask to a vector will be the way to resolve this. Do you think it would be better to define a private VectorMask function for the slice operation? The function could be implemented with the corresponding vector slice APIs. Although this function is not friendly to SVE performance, it wins on unifying the implementation.

IIUC we have agreed for non-masked subword scatter/gather to compose by parts using the intrinsic. That seems good, and it looks like we can do the same for masked subword scatter/gather, as above, but it may not be the most efficient for the platform.

Do you have any use cases for masked subword scatter/gather? Given the lack of underlying hardware support, it seems that focusing on getting the non-masked version working well, and the masked version working OK, is a pragmatic way forward.

Currently, I do not have specific use cases for masked subword gather or scatter operations. However, I would like to ensure support for these APIs on SVE in case they become relevant for future Java workloads. And compared to having no intrinsic support at all, using intrinsified APIs—even if not fully optimized—can still significantly improve performance, right?
BTW, I agree that focusing only on the non-masked version would certainly simplify the implementation a lot.

@PaulSandoz
Member

Yes, converting the mask to a vector will be the way to resolve this. Do you think it would be better to define a private VectorMask function for the slice operation? The function could be implemented with the corresponding vector slice APIs. Although this function is not friendly to SVE performance, it wins on unifying the implementation.

If it helps, just add a utility method that does the slice/rearrange mask<->vector conversion, but given your use case I expect it to be used in only one location, so perhaps keep it close to there. It may be that you don't need full slice functionality, since you only care about the part of the mask elements that was rearranged to the start of the vector and therefore don't need to zero out the remaining parts that are not relevant. (The same happens for conversion by parts.) Since we don't yet have any slice intrinsic, I think that would be OK and we could revisit later. Ideally we should be able to optimize rearrange of vectors using constant shuffles with recognizable patterns.
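
One possible shape for such a utility, sketched against the public API; unlike the mask round trip shown earlier in the thread, the mask stays in vector form so the intrinsic can consume it directly. The helper name maskPartAsVector and the lanesPerPart parameter are assumptions for illustration:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;

class MaskPartSketch {
    // Illustrative only: expose one part of a mask as a vector whose leading
    // lanes hold -1 (true) or 0 (false), ready to be handed to the gather
    // intrinsic. slice() happens to zero-fill the tail, which is harmless if
    // the intrinsic only reads the first lanesPerPart lanes for this part.
    static ByteVector maskPartAsVector(VectorMask<Byte> mask, int part, int lanesPerPart) {
        ByteVector asVector = (ByteVector) mask.toVector();
        return asVector.slice(part * lanesPerPart);
    }
}
```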

@XiaohongGong
Author

Yes, converting the mask to a vector will be the way to resolve this. Do you think it would be better to define a private VectorMask function for the slice operation? The function could be implemented with the corresponding vector slice APIs. Although this function is not friendly to SVE performance, it wins on unifying the implementation.

If it helps, just add a utility method that does the slice/rearrange mask<->vector conversion, but given your use case I expect it to be used in only one location, so perhaps keep it close to there. It may be that you don't need full slice functionality, since you only care about the part of the mask elements that was rearranged to the start of the vector and therefore don't need to zero out the remaining parts that are not relevant. (The same happens for conversion by parts.) Since we don't yet have any slice intrinsic, I think that would be OK and we could revisit later. Ideally we should be able to optimize rearrange of vectors using constant shuffles with recognizable patterns.

Makes sense to me. Thanks for all your inputs! I will create a PR for the Java-level refactor and X86 modifications first. We can have more discussion then.

