Conversation

@eme64
Contributor

@eme64 eme64 commented Sep 17, 2024

I'm adding some proper JMH benchmarks for vectorized reductions. There are already some other benchmarks, but they are either not comprehensive or not JMH-based.

I also wanted to do a performance investigation, hopefully leading to some improvements. See Future Work below.
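
For illustration, here is a minimal sketch of the kind of kernel such a reduction benchmark measures. The class name, setup, and parameters below are made up for this sketch, not the actual benchmark code in this PR:

  import org.openjdk.jmh.annotations.*;

  import java.util.Random;
  import java.util.concurrent.TimeUnit;

  // Hypothetical sketch: a scalar reduction loop that C2 SuperWord may auto-vectorize.
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  @State(Scope.Thread)
  public class ReductionSketch {
      @Param({"2048"})
      int SIZE;

      int[] a;

      @Setup
      public void setup() {
          a = new Random(42).ints(SIZE).toArray();
      }

      // "Simple" style: only the reduction itself, no additional element-wise work.
      @Benchmark
      public int intAddSimple() {
          int sum = 0;
          for (int i = 0; i < SIZE; i++) {
              sum += a[i];
          }
          return sum;
      }
  }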

How I run my benchmarks

All benchmarks:
make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64

A specific benchmark, with a profiler that shows which code snippet is hottest:
make test TEST="micro:vm.compiler.VectorReduction2.*doubleMinDotProduct" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"

JMH logs

Run on my AVX512 laptop, with master:
run_avx512_master.txt

Run on a remote asimd (aarch64, NEON) machine:
run_asimd_master.txt

Results

I ran it on 2 machines so far: on the left my AVX512 machine, on the right an ASIMD/NEON/aarch64 machine.

Here are the interesting int / long / float / double results; discussion further below:
[image: int / long / float / double benchmark results]

And here are the less spectacular byte / char / short results. These cases do not vectorize, but there seems to be an issue with over-unrolling on my AVX512 machine: one case I looked at only unrolls 4x without SuperWord, but 16x with it, and that seems to be unfavourable.

[image: byte / char / short benchmark results]

Here is the PDF:
benchmark_results.pdf

Why are all the ...Simple benchmarks not vectorizing, i.e. "not profitable"?

Apparently, there must be sufficient "work" vectors to outweigh the "reduction" vectors.
The idea used to be that there should be at least 2 work vectors, which tend to be profitable, to outweigh the cost of a single reduction vector.

  // Check if reductions are connected
  if (is_marked_reduction(p0)) {
    Node* second_in = p0->in(2);
    Node_List* second_pk = get_pack(second_in);
    if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
      // No parent pack or not enough work
      // to cover reduction expansion overhead
      return false;
    } else if (second_pk->size() != p->size()) {
      return false;
    }
  }
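
To make this concrete, here is a hedged Java sketch of the two kinds of kernels involved (method names are mine, not the benchmark's): in the first, the only arithmetic in the loop is the reduction chain itself; in the second, the element-wise multiply provides additional "work" that can amortize the reduction overhead.

  // Hedged illustration only; method names are not from the benchmark.

  // "Simple": the loop body is just the reduction chain itself, with no
  // other vectorizable arithmetic to offset the reduction overhead.
  static int addSimple(int[] a) {
      int sum = 0;
      for (int i = 0; i < a.length; i++) {
          sum += a[i];
      }
      return sum;
  }

  // Dot-product style: the element-wise multiply is extra "work" that can be
  // vectorized profitably alongside the reduction.
  static int addDotProduct(int[] a, int[] b) {
      int sum = 0;
      for (int i = 0; i < a.length; i++) {
          sum += a[i] * b[i];
      }
      return sum;
  }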

But when I disable this code, I see the following on the aarch64/ASIMD machine:

VectorReduction2.NoSuperword.intAddSimple      2048       0  avgt    3  751.453 ± 1603.353  ns/op
VectorReduction2.WithSuperword.intAddSimple    2048       0  avgt    3  344.263 ±    2.888  ns/op

Hence, this assumption no longer holds. I think it is because we are actually able to move the reductions out of the loop now, and that was not the case when this code was added.

2-Element Reductions for INT / LONG

Apparently, all 2-element int and long reductions are currently deemed not profitable, see change: a880f3d

This means that the long reductions do not vectorize on the ASIMD / aarch64 machine with MaxVectorSize=16: a 16-byte vector holds only two 8-byte longs, so every long reduction there is a 2-element reduction.

      // Length 2 reductions of INT/LONG do not offer performance benefits
      if (((arith_type->basic_type() == T_INT) || (arith_type->basic_type() == T_LONG)) && (size == 2)) {
        retValue = false;
      } else {
        retValue = ReductionNode::implemented(opc, size, arith_type->basic_type());
      }

AARCH64 / NEON / ASIMD

This is why the float / double add / mul reductions (yellow MATCHER) do not vectorize. We might be able to tackle this with an appropriate alternative implementation.

Also, all of the long cases fail: for one thing, they would only be 2-element reductions (because MaxVectorSize=16), but also because MulVL is apparently not allowed, see below.

  bool Matcher::match_rule_supported_auto_vectorization(int opcode, int vlen, BasicType bt) {
    if (UseSVE == 0) {
      // These operations are not profitable to be vectorized on NEON, because no direct
      // NEON instructions support them. But the match rule support for them is profitable for
      // Vector API intrinsics.
      if ((opcode == Op_VectorCastD2X && bt == T_INT) ||
          (opcode == Op_VectorCastL2X && bt == T_FLOAT) ||
          (opcode == Op_CountLeadingZerosV && bt == T_LONG) ||
          (opcode == Op_CountTrailingZerosV && bt == T_LONG) ||
          // The implementations of Op_AddReductionVD/F in Neon are for the Vector API only.
          // They are not suitable for auto-vectorization because the result would not conform
          // to the JLS, Section Evaluation Order.
          opcode == Op_AddReductionVD || opcode == Op_AddReductionVF ||
          opcode == Op_MulReductionVD || opcode == Op_MulReductionVF ||
          opcode == Op_MulVL) {
        return false;
      }
    }
    return match_rule_supported_vector(opcode, vlen, bt);
  }

Float / Double with Add and Mul

On aarch64 NEON these cases do not vectorize; see the previous section.

It turns out that many of these cases actually do vectorize (on x64), but the code is just as fast as the scalar code.
This is because the reduction order is strict, to maintain correct rounding.
Interestingly, the code runs at about the same speed whether it is vectorized or not: it seems that the latency of the reduction chain is simply the determining factor, regardless of whether it is vectorized or scalar.
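
The strictness matters because float addition is not associative; a small, self-contained example:

  // Float addition is not associative, so reassociating the reduction
  // (as a lane-parallel vector sum does) can change the rounded result.
  public class FloatOrder {
      public static void main(String[] args) {
          float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
          float strict  = (a + b) + c;  // 1.0f -- the order the Java source prescribes
          float relaxed = a + (b + c);  // 0.0f -- c is absorbed when added to b first
          System.out.println(strict + " vs " + relaxed);
      }
  }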

Running this, for example, shows that the loop bodies are quite different:

make test TEST="micro:vm.compiler.VectorReduction2.*floatAddBig" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"

Scalar loop body; note that only scalar xmm registers are used:

            0x00007fcee43ffedb:   vaddss %xmm3,%xmm6,%xmm6
            0x00007fcee43ffedf:   vmulss %xmm2,%xmm4,%xmm4
   6.69%    0x00007fcee43ffee3:   vmulss %xmm2,%xmm1,%xmm1
            0x00007fcee43ffee7:   vaddss %xmm4,%xmm5,%xmm3
            0x00007fcee43ffeeb:   vmulss %xmm18,%xmm13,%xmm2
            0x00007fcee43ffef1:   vaddss %xmm1,%xmm3,%xmm3
            0x00007fcee43ffef5:   vmulss %xmm12,%xmm10,%xmm1
            0x00007fcee43ffefa:   vmulss %xmm11,%xmm13,%xmm5
            0x00007fcee43ffeff:   vaddss %xmm17,%xmm1,%xmm1
            0x00007fcee43fff05:   vaddss %xmm5,%xmm2,%xmm4            ;   {no_reloc}
   5.81%    0x00007fcee43fff09:   vaddss %xmm14,%xmm1,%xmm1
            0x00007fcee43fff0e:   vaddss %xmm15,%xmm4,%xmm4
            0x00007fcee43fff13:   vaddss %xmm9,%xmm4,%xmm2
            0x00007fcee43fff18:   vaddss %xmm2,%xmm3,%xmm3
  24.14%    0x00007fcee43fff1c:   vaddss %xmm3,%xmm6,%xmm2
  24.50%    0x00007fcee43fff20:   vaddss %xmm2,%xmm1,%xmm9

The vector loop body uses zmm, ymm and xmm registers, and not just mul and add but also vpshufd and vextractf128 instructions to shuffle the values around in the reduction.

            0x00007f4578400c36:   vmulps %zmm3,%zmm2,%zmm12
            0x00007f4578400c3c:   vmulps %zmm9,%zmm11,%zmm13
   0.33%    0x00007f4578400c42:   vmulps %zmm10,%zmm11,%zmm11
            0x00007f4578400c48:   vmulps %zmm2,%zmm4,%zmm2
            0x00007f4578400c4e:   vaddps %zmm13,%zmm11,%zmm11
            0x00007f4578400c54:   vmulps %zmm3,%zmm4,%zmm3
            0x00007f4578400c5a:   vmulps %zmm5,%zmm15,%zmm13
            0x00007f4578400c60:   vaddps %zmm2,%zmm3,%zmm2
            0x00007f4578400c66:   vmulps %zmm10,%zmm9,%zmm3
            0x00007f4578400c6c:   vaddps %zmm12,%zmm2,%zmm2
   0.39%    0x00007f4578400c72:   vaddps %zmm3,%zmm11,%zmm3           ;   {no_reloc}
            0x00007f4578400c78:   vmulps %zmm7,%zmm8,%zmm4
            0x00007f4578400c7e:   vmulps %zmm6,%zmm8,%zmm8
   0.03%    0x00007f4578400c84:   vmulps %zmm7,%zmm6,%zmm6
            0x00007f4578400c8a:   vaddps %zmm8,%zmm4,%zmm4
            0x00007f4578400c90:   vmulps %zmm5,%zmm16,%zmm5
            0x00007f4578400c96:   vaddps %zmm6,%zmm4,%zmm4
            0x00007f4578400c9c:   vmulps %zmm15,%zmm16,%zmm6
   0.36%    0x00007f4578400ca2:   vaddps %zmm6,%zmm5,%zmm5
            0x00007f4578400ca8:   vaddps %zmm13,%zmm5,%zmm5
            0x00007f4578400cae:   vaddss %xmm3,%xmm1,%xmm1
            0x00007f4578400cb2:   vpshufd $0x1,%xmm3,%xmm13
   0.23%    0x00007f4578400cb7:   vaddss %xmm13,%xmm1,%xmm1
   0.85%    0x00007f4578400cbc:   vpshufd $0x2,%xmm3,%xmm13
            0x00007f4578400cc1:   vaddss %xmm13,%xmm1,%xmm1
   1.31%    0x00007f4578400cc6:   vpshufd $0x3,%xmm3,%xmm13
            0x00007f4578400ccb:   vaddss %xmm13,%xmm1,%xmm1
   1.61%    0x00007f4578400cd0:   vextractf128 $0x1,%ymm3,%xmm13
            0x00007f4578400cd6:   vaddss %xmm13,%xmm1,%xmm1
   1.08%    0x00007f4578400cdb:   vpshufd $0x1,%xmm13,%xmm12
            0x00007f4578400ce1:   vaddss %xmm12,%xmm1,%xmm1
   1.48%    0x00007f4578400ce6:   vpshufd $0x2,%xmm13,%xmm12
            0x00007f4578400cec:   vaddss %xmm12,%xmm1,%xmm1
   1.48%    0x00007f4578400cf1:   vpshufd $0x3,%xmm13,%xmm12
            0x00007f4578400cf7:   vaddss %xmm12,%xmm1,%xmm1
   1.15%    0x00007f4578400cfc:   vextracti64x4 $0x1,%zmm3,%ymm12
            0x00007f4578400d03:   vaddss %xmm12,%xmm1,%xmm1
   1.64%    0x00007f4578400d08:   vpshufd $0x1,%xmm12,%xmm13
            0x00007f4578400d0e:   vaddss %xmm13,%xmm1,%xmm1
   1.12%    0x00007f4578400d13:   vpshufd $0x2,%xmm12,%xmm13
            0x00007f4578400d19:   vaddss %xmm13,%xmm1,%xmm1
   1.34%    0x00007f4578400d1e:   vpshufd $0x3,%xmm12,%xmm13
            0x00007f4578400d24:   vaddss %xmm13,%xmm1,%xmm1
   1.41%    0x00007f4578400d29:   vextractf128 $0x1,%ymm12,%xmm13
            0x00007f4578400d2f:   vaddss %xmm13,%xmm1,%xmm1
   1.54%    0x00007f4578400d34:   vpshufd $0x1,%xmm13,%xmm12
            0x00007f4578400d3a:   vaddss %xmm12,%xmm1,%xmm1
   1.41%    0x00007f4578400d3f:   vpshufd $0x2,%xmm13,%xmm12
            0x00007f4578400d45:   vaddss %xmm12,%xmm1,%xmm1
   1.28%    0x00007f4578400d4a:   vpshufd $0x3,%xmm13,%xmm12
            0x00007f4578400d50:   vaddss %xmm12,%xmm1,%xmm1
   1.57%    0x00007f4578400d55:   vaddss %xmm4,%xmm1,%xmm1
   1.31%    0x00007f4578400d59:   vpshufd $0x1,%xmm4,%xmm6
            0x00007f4578400d5e:   vaddss %xmm6,%xmm1,%xmm1
   1.28%    0x00007f4578400d62:   vpshufd $0x2,%xmm4,%xmm6
            0x00007f4578400d67:   vaddss %xmm6,%xmm1,%xmm1
   1.34%    0x00007f4578400d6b:   vpshufd $0x3,%xmm4,%xmm6
            0x00007f4578400d70:   vaddss %xmm6,%xmm1,%xmm1
   1.05%    0x00007f4578400d74:   vextractf128 $0x1,%ymm4,%xmm6       ;   {no_reloc}
            0x00007f4578400d7a:   vaddss %xmm6,%xmm1,%xmm1
   1.67%    0x00007f4578400d7e:   vpshufd $0x1,%xmm6,%xmm11
            0x00007f4578400d83:   vaddss %xmm11,%xmm1,%xmm1
   1.48%    0x00007f4578400d88:   vpshufd $0x2,%xmm6,%xmm11
            0x00007f4578400d8d:   vaddss %xmm11,%xmm1,%xmm1
   1.25%    0x00007f4578400d92:   vpshufd $0x3,%xmm6,%xmm11
            0x00007f4578400d97:   vaddss %xmm11,%xmm1,%xmm1
   1.21%    0x00007f4578400d9c:   vextracti64x4 $0x1,%zmm4,%ymm11
            0x00007f4578400da3:   vaddss %xmm11,%xmm1,%xmm1
   1.94%    0x00007f4578400da8:   vpshufd $0x1,%xmm11,%xmm6
            0x00007f4578400dae:   vaddss %xmm6,%xmm1,%xmm1
   1.21%    0x00007f4578400db2:   vpshufd $0x2,%xmm11,%xmm6
            0x00007f4578400db8:   vaddss %xmm6,%xmm1,%xmm1
   1.77%    0x00007f4578400dbc:   vpshufd $0x3,%xmm11,%xmm6
            0x00007f4578400dc2:   vaddss %xmm6,%xmm1,%xmm1
   1.57%    0x00007f4578400dc6:   vextractf128 $0x1,%ymm11,%xmm6
            0x00007f4578400dcc:   vaddss %xmm6,%xmm1,%xmm1
   1.02%    0x00007f4578400dd0:   vpshufd $0x1,%xmm6,%xmm11
            0x00007f4578400dd5:   vaddss %xmm11,%xmm1,%xmm1
   1.48%    0x00007f4578400dda:   vpshufd $0x2,%xmm6,%xmm11
            0x00007f4578400ddf:   vaddss %xmm11,%xmm1,%xmm1
   1.31%    0x00007f4578400de4:   vpshufd $0x3,%xmm6,%xmm11
            0x00007f4578400de9:   vaddss %xmm11,%xmm1,%xmm1
   1.54%    0x00007f4578400dee:   vaddss %xmm5,%xmm1,%xmm1
   1.61%    0x00007f4578400df2:   vpshufd $0x1,%xmm5,%xmm9
            0x00007f4578400df7:   vaddss %xmm9,%xmm1,%xmm1
   1.51%    0x00007f4578400dfc:   vpshufd $0x2,%xmm5,%xmm9
            0x00007f4578400e01:   vaddss %xmm9,%xmm1,%xmm1
   1.61%    0x00007f4578400e06:   vpshufd $0x3,%xmm5,%xmm9
            0x00007f4578400e0b:   vaddss %xmm9,%xmm1,%xmm1
   1.34%    0x00007f4578400e10:   vextractf128 $0x1,%ymm5,%xmm9
            0x00007f4578400e16:   vaddss %xmm9,%xmm1,%xmm1
   1.25%    0x00007f4578400e1b:   vpshufd $0x1,%xmm9,%xmm8
            0x00007f4578400e21:   vaddss %xmm8,%xmm1,%xmm1
   2.16%    0x00007f4578400e26:   vpshufd $0x2,%xmm9,%xmm8
            0x00007f4578400e2c:   vaddss %xmm8,%xmm1,%xmm1
   1.44%    0x00007f4578400e31:   vpshufd $0x3,%xmm9,%xmm8
            0x00007f4578400e37:   vaddss %xmm8,%xmm1,%xmm1
   1.38%    0x00007f4578400e3c:   vextracti64x4 $0x1,%zmm5,%ymm8
            0x00007f4578400e43:   vaddss %xmm8,%xmm1,%xmm1
   1.51%    0x00007f4578400e48:   vpshufd $0x1,%xmm8,%xmm9
            0x00007f4578400e4e:   vaddss %xmm9,%xmm1,%xmm1
   1.74%    0x00007f4578400e53:   vpshufd $0x2,%xmm8,%xmm9
            0x00007f4578400e59:   vaddss %xmm9,%xmm1,%xmm1
   1.87%    0x00007f4578400e5e:   vpshufd $0x3,%xmm8,%xmm9
            0x00007f4578400e64:   vaddss %xmm9,%xmm1,%xmm1
   1.28%    0x00007f4578400e69:   vextractf128 $0x1,%ymm8,%xmm9
            0x00007f4578400e6f:   vaddss %xmm9,%xmm1,%xmm1
   1.67%    0x00007f4578400e74:   vpshufd $0x1,%xmm9,%xmm8            ;   {no_reloc}
            0x00007f4578400e7a:   vaddss %xmm8,%xmm1,%xmm1
   1.57%    0x00007f4578400e7f:   vpshufd $0x2,%xmm9,%xmm8
            0x00007f4578400e85:   vaddss %xmm8,%xmm1,%xmm1
   1.48%    0x00007f4578400e8a:   vpshufd $0x3,%xmm9,%xmm8
            0x00007f4578400e90:   vaddss %xmm8,%xmm1,%xmm1
   1.57%    0x00007f4578400e95:   vaddss %xmm2,%xmm1,%xmm1
   1.57%    0x00007f4578400e99:   vpshufd $0x1,%xmm2,%xmm7
            0x00007f4578400e9e:   vaddss %xmm7,%xmm1,%xmm1
   1.61%    0x00007f4578400ea2:   vpshufd $0x2,%xmm2,%xmm7
            0x00007f4578400ea7:   vaddss %xmm7,%xmm1,%xmm1
   1.48%    0x00007f4578400eab:   vpshufd $0x3,%xmm2,%xmm7
            0x00007f4578400eb0:   vaddss %xmm7,%xmm1,%xmm1
   1.44%    0x00007f4578400eb4:   vextractf128 $0x1,%ymm2,%xmm7
            0x00007f4578400eba:   vaddss %xmm7,%xmm1,%xmm1
   1.02%    0x00007f4578400ebe:   vpshufd $0x1,%xmm7,%xmm10
            0x00007f4578400ec3:   vaddss %xmm10,%xmm1,%xmm1
   1.54%    0x00007f4578400ec8:   vpshufd $0x2,%xmm7,%xmm10
            0x00007f4578400ecd:   vaddss %xmm10,%xmm1,%xmm1
   1.31%    0x00007f4578400ed2:   vpshufd $0x3,%xmm7,%xmm10
            0x00007f4578400ed7:   vaddss %xmm10,%xmm1,%xmm1
   1.28%    0x00007f4578400edc:   vextracti64x4 $0x1,%zmm2,%ymm10
            0x00007f4578400ee3:   vaddss %xmm10,%xmm1,%xmm1
   1.28%    0x00007f4578400ee8:   vpshufd $0x1,%xmm10,%xmm7
            0x00007f4578400eee:   vaddss %xmm7,%xmm1,%xmm1
   2.03%    0x00007f4578400ef2:   vpshufd $0x2,%xmm10,%xmm7
            0x00007f4578400ef8:   vaddss %xmm7,%xmm1,%xmm1
   1.44%    0x00007f4578400efc:   vpshufd $0x3,%xmm10,%xmm7
            0x00007f4578400f02:   vaddss %xmm7,%xmm1,%xmm1
   1.48%    0x00007f4578400f06:   vextractf128 $0x1,%ymm10,%xmm7
            0x00007f4578400f0c:   vaddss %xmm7,%xmm1,%xmm1
   1.31%    0x00007f4578400f10:   vpshufd $0x1,%xmm7,%xmm10
            0x00007f4578400f15:   vaddss %xmm10,%xmm1,%xmm1
   1.08%    0x00007f4578400f1a:   vpshufd $0x2,%xmm7,%xmm10
            0x00007f4578400f1f:   vaddss %xmm10,%xmm1,%xmm1
   1.64%    0x00007f4578400f24:   vpshufd $0x3,%xmm7,%xmm10
            0x00007f4578400f29:   vaddss %xmm10,%xmm1,%xmm1

No vectorization for longMulBig?

Locally, I can get vectorization in another setting, but somehow not in the JMH benchmark. This is strange, and I will have to keep investigating.

No speedup with doubleMinDotProduct

Strangely, on my AVX512 machine this benchmark did vectorize, but it did not show any speedup. I'm quite confused about that, especially because the parallel benchmark doubleMaxDotProduct vectorizes just fine. More investigation is needed.


Future Work

  • JDK-8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long)
  • The Simple benchmarks should be profitable -> allow vectorization!
  • Investigate the results for byte / char / short: is it all due to over-unrolling?
  • float / double with add / mul:
    • the main issue is the strict order of operations. Maybe we can add some Float.addAssociative so that we can do a non-strict reduction? That could be a cool feature for those who care about performance and are willing to give up some rounding precision (see the Vector API sketch after this list for how such a reassociated sum can already be expressed explicitly).
    • The aarch64 backend simply refuses to vectorize these, because there is no strict order implementation. That could be changed. Though we would need a benchmark where it is worth it...
  • Investigate the strange results with longMulBig and doubleMinDotProduct.
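
For reference on the Float.addAssociative idea: with the incubating Vector API one can already write such a reassociated float sum explicitly. Here is a hedged sketch (class and method names are mine), which keeps per-lane partial sums and does a single cross-lane reduction at the end, trading the JLS-prescribed rounding order for vector throughput:

  // Hedged sketch using the incubating Vector API (jdk.incubator.vector);
  // requires --add-modules jdk.incubator.vector. The per-lane partial sums are
  // reassociated, so rounding can differ from a strict left-to-right scalar sum.
  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  public class RelaxedFloatSum {
      static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

      static float sum(float[] a) {
          FloatVector acc = FloatVector.zero(SPECIES);
          int i = 0;
          int bound = SPECIES.loopBound(a.length);
          for (; i < bound; i += SPECIES.length()) {
              acc = acc.add(FloatVector.fromArray(SPECIES, a, i));  // per-lane partial sums
          }
          float sum = acc.reduceLanes(VectorOperators.ADD);         // one cross-lane reduction
          for (; i < a.length; i++) {                               // scalar tail
              sum += a[i];
          }
          return sum;
      }
  }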

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8340272: C2 SuperWord: JMH benchmark for Reduction vectorization (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21032/head:pull/21032
$ git checkout pull/21032

Update a local copy of the PR:
$ git checkout pull/21032
$ git pull https://git.openjdk.org/jdk.git pull/21032/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21032

View PR using the GUI difftool:
$ git pr show -t 21032

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21032.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Sep 17, 2024

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Sep 17, 2024

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8340272: C2 SuperWord: JMH benchmark for Reduction vectorization

Reviewed-by: kvn, jkarthikeyan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 82 new commits pushed to the master branch:

  • 202fd42: 8340213: jcmd VM.events ignores max argument
  • dfc9093: 8340132: Remove internal CpException for reading malformed utf8
  • f0ae90f: 8340210: Add positionTestUI() to PassFailJFrame.Builder
  • eabfc6e: 8337563: NMT: rename MEMFLAGS to MemTag
  • d588182: 8338686: App classpath mismatch if a jar from the Class-Path attribute is on the classpath
  • 5dc9723: 8340323: Test jdk/classfile/OptionsTest.java fails after JDK-8340200
  • 90e92f9: 8339790: Support Intel APX setzucc instruction
  • 28d009c: 8339934: Simplify Math.scalb(double) method
  • 3e14fb9: 8340200: Misspelled constant AttributesProcessingOption.DROP_UNSTABLE_ATRIBUTES
  • 64e3a9e: 8339574: Behavior of File.is{Directory,File,Hidden} is not documented with respect to symlinks
  • ... and 72 more: https://git.openjdk.org/jdk/compare/597788850042e7272a23714c05ba546ee6080856...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title 8340272 8340272: C2 SuperWord: JMH benchmark for Reduction vectorization Sep 17, 2024
@openjdk

openjdk bot commented Sep 17, 2024

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Sep 17, 2024
@eme64 eme64 marked this pull request as ready for review September 17, 2024 11:56
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 17, 2024
@eme64
Contributor Author

eme64 commented Sep 17, 2024

@galderz You can use this JMH benchmark for your work in #20098 if you want.

@mlbridge

mlbridge bot commented Sep 17, 2024

Webrevs

Contributor

@vnkozlov vnkozlov left a comment

Good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 17, 2024
Member

@jaskarth jaskarth left a comment

Looks nice, the benchmark is very thorough! I was interested to see how it performed on my Zen 3 (AVX2) machine; I've attached the results here in case they're interesting/useful: perf_results.txt

@eme64
Contributor Author

eme64 commented Sep 18, 2024

@jaskarth thanks for the benchmark!

I included it in these results now:
benchmark_results.pdf

The results are quite comparable to the AVX512 results. Some comments:

  • byte / char / short: there is also some variation here, but it seems slightly different. We might want to investigate that anyway, especially the regressions in the 15-25% range. It is also possible that we could invest more to vectorize even these cases, but the IR is more complicated because of all the "cast to byte/char/short" operations, i.e. the right and left shifting required to remove the upper bits (see the sketch after this list). Pattern matching those cases is difficult with the current SuperWord structure, as far as I can see. I'm open to ideas/suggestions here ;)
  • int / float / double performance is as expected, parallel to ASIMD and AVX512. Good.
  • long:
    • MulVL is not implemented for AVX2 (the hardware does not have such an instruction, as far as I know), so those benchmark results are as expected.
    • Your results around the long min/max are a bit unexpected, especially because the vectorization is not supposed to work as far as I know. Could be interesting to investigate more there.
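
For reference, a hedged sketch (not the benchmark's code) of what such a subword reduction looks like at the Java level; the narrowing back to byte on every iteration is where the extra cast/shift nodes in the IR come from:

  // Hedged sketch: a byte reduction. The narrowing back to byte on every iteration
  // shows up in the C2 IR as left/right shifts that strip the upper bits, which is
  // what makes these subword patterns hard to match in the current SuperWord structure.
  static byte byteAddSimple(byte[] a) {
      byte sum = 0;
      for (int i = 0; i < a.length; i++) {
          sum = (byte) (sum + a[i]);  // int add, then narrow back to byte
      }
      return sum;
  }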

[images: updated benchmark result charts]

@eme64
Contributor Author

eme64 commented Sep 18, 2024

@jaskarth @vnkozlov thanks for the review!
/integrate

@openjdk

openjdk bot commented Sep 18, 2024

Going to push as commit aeba1ea.
Since your change was applied there have been 89 commits pushed to the master branch:

  • 19b2cee: 8340113: Remove JULONG as a Diagnostic Command argument type (jcmd JFR.view)
  • 45e438f: 8339845: Update color.org and wapforum.org links to use HTTPS instead of HTTP
  • 4ff17c1: 8319873: Add windows implementation for jcmd System.map and System.dump_map
  • 3895b8f: 8340230: Tests crash: assert(is_in_encoding_range || k->is_interface() || k->is_abstract()) failed: sanity
  • 5381f55: 8333258: C2: high memory usage in PhaseCFG::insert_anti_dependences()
  • d23c59e: 8340280: Avoid calling MT.invokerType() when creating LambdaForms
  • 147e300: 8340015: Open source several AWT focus tests - series 7
  • 202fd42: 8340213: jcmd VM.events ignores max argument
  • dfc9093: 8340132: Remove internal CpException for reading malformed utf8
  • f0ae90f: 8340210: Add positionTestUI() to PassFailJFrame.Builder
  • ... and 79 more: https://git.openjdk.org/jdk/compare/597788850042e7272a23714c05ba546ee6080856...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 18, 2024
@openjdk openjdk bot closed this Sep 18, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 18, 2024
@openjdk

openjdk bot commented Sep 18, 2024

@eme64 Pushed as commit aeba1ea.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@jaskarth
Member

The subword results seem quite tricky, especially since there are some things that performed well for me (like char) but ended up causing regressions for your machine. The long results are also quite strange, but it may just be random noise. I'll definitely make sure to investigate further. Thanks a lot for the analysis!

@galderz
Contributor

galderz commented Oct 17, 2024

@galderz You can use this JMH benchmark for your work in #20098 if you want.

@eme64 Thanks for building this. I ended up creating a min/max specific benchmark in #20098. The main reason I created something different was to be able to control the data in the arrays such that the branching factors could be pre-determined; the results can vary depending on that. I then used the opportunity to add both reduction and non-reduction vector benchmarks.
