[llvm-mca] Unexpected Rthroughput for vfmadd* instructions in znver3 #59325

Closed
bubblepipe opened this issue Dec 3, 2022 · 15 comments

Comments

@bubblepipe

llvm-mca reports that, for zen3, the Rthroughput (column [3]) of vfmadd* instructions is 1:

# CHECK:      [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
# CHECK-NEXT:  1      4     1.00                        vfmadd132pd	%xmm0, %xmm1, %xmm2
# CHECK-NEXT:  1      11    1.00    *                   vfmadd132pd	(%rax), %xmm1, %xmm2

But for zen2 it is 0.5.

# CHECK:      [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
# CHECK-NEXT:  1      5     0.50                        vfmadd132pd	%xmm0, %xmm1, %xmm2
# CHECK-NEXT:  1      12    0.50    *                   vfmadd132pd	(%rax), %xmm1, %xmm2

Should they both be 0.5? Or is there any micro-architecture change that makes zen3 slower than zen2 on vfmadd*?

@LebedevRI
Member

LebedevRI commented Dec 3, 2022

LLVM's znver1/znver2 instruction schedules leave, well, a lot to be desired. They may be rather wrong.
The znver3 one I hand-crafted, and I'm mostly sure its result is correct.

---
mode:            inverse_throughput
key:
  instructions:
    - 'VFMADD132PDr XMM0 XMM0 XMM11 XMM2'
    - 'VFMADD132PDr XMM1 XMM1 XMM4 XMM8'
    - 'VFMADD132PDr XMM2 XMM2 XMM5 XMM10'
    - 'VFMADD132PDr XMM3 XMM3 XMM7 XMM14'
    - 'VFMADD132PDr XMM4 XMM4 XMM1 XMM9'
    - 'VFMADD132PDr XMM5 XMM5 XMM4 XMM15'
    - 'VFMADD132PDr XMM6 XMM6 XMM9 XMM12'
    - 'VFMADD132PDr XMM7 XMM7 XMM15 XMM1'
    - 'VFMADD132PDr XMM8 XMM8 XMM9 XMM0'
    - 'VFMADD132PDr XMM9 XMM9 XMM6 XMM5'
    - 'VFMADD132PDr XMM10 XMM10 XMM15 XMM12'
    - 'VFMADD132PDr XMM11 XMM11 XMM2 XMM2'
    - 'VFMADD132PDr XMM12 XMM12 XMM6 XMM1'
    - 'VFMADD132PDr XMM13 XMM13 XMM4 XMM11'
    - 'VFMADD132PDr XMM14 XMM14 XMM0 XMM5'
    - 'VFMADD132PDr XMM15 XMM15 XMM4 XMM1'
  config:          ''
  register_initial_values:
    - 'XMM0=0x0'
    - 'XMM11=0x0'
    - 'XMM2=0x0'
    - 'MXCSR=0x0'
    - 'XMM1=0x0'
    - 'XMM4=0x0'
    - 'XMM8=0x0'
    - 'XMM5=0x0'
    - 'XMM10=0x0'
    - 'XMM3=0x0'
    - 'XMM7=0x0'
    - 'XMM14=0x0'
    - 'XMM9=0x0'
    - 'XMM15=0x0'
    - 'XMM6=0x0'
    - 'XMM12=0x0'
    - 'XMM13=0x0'
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: inverse_throughput, value: 0.7951, per_snippet_value: 12.7216 }
error:           ''
info:            instruction has tied variables, using static renaming.
assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C4104883EC04C70424801F0000C5F8AE14244883C4044883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F2C244883C410C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C
2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C34883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C4104883EC04C70424801F0000C5F8AE14244883C4044883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F2C244883C41049B80200000000000000C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F94983C0FF0F8556FFFFFFC3
...

So it's not really 0.5 on znver3.
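
A quick way to read the two measurement fields in the report above. This is a minimal sketch that assumes `per_snippet_value` is the cost of one pass over the whole snippet and `value` is that cost divided by the number of instructions in the snippet; that relation is inferred from the numbers shown here, not taken from the exegesis sources.

```python
# Figures copied from the report above; the snippet lists 16 instructions.
num_instructions = 16
per_snippet_value = 12.7216
reported_value = 0.7951

# Assumption: value == per_snippet_value / number of instructions in the snippet.
per_instruction = per_snippet_value / num_instructions
print(f"{per_instruction:.4f} cycles per instruction")

# The derived figure agrees with the reported one to four decimal places.
assert abs(per_instruction - reported_value) < 1e-4
```

Under that reading, the snippet as a whole takes roughly 12.7 cycles per pass, well above the 8 cycles that two fully used FMA pipes would need for 16 instructions.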

@LebedevRI
Member

@RKSimon probably yet another znver2/znver1 bug?

@RKSimon RKSimon self-assigned this Dec 3, 2022
@RKSimon
Collaborator

RKSimon commented Dec 3, 2022

I'll take a look

@llvmbot
Collaborator

llvmbot commented Dec 4, 2022

@llvm/issue-subscribers-backend-x86

@adibiagio
Collaborator

Zen processors can issue up to 2 FMA uOPs per cycle. The FPU implements 2 FMA units. Reciprocal throughput for vfmadd*/vfmsub* is therefore expected to be 0.5 and not 1.

I have run a quick perf test on my zen1, and I've got these numbers:

       358,642,029      cycles:u                                                      ( +-  0.01% )
       717,107,114      instructions:u            #    2.00  insn per cycle           ( +-  0.00% )
       717,126,328      r0C1:u                                                        ( +-  0.00% )

           0.10022 +- 0.00255 seconds time elapsed  ( +-  2.54% )
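
The arithmetic behind those counters can be sketched as follows. The cycle and instruction counts are copied from the perf output above; the variable names and the `num_fma_units` constant are illustrative, and the run presumably includes some loop overhead, so the match with the ideal figure is only approximate.

```python
# Counter values copied from the perf output above.
cycles = 358_642_029
instructions = 717_107_114

# Instructions per cycle, and its inverse (cycles per instruction).
ipc = instructions / cycles
measured_rthroughput = cycles / instructions

# Ideal figure when every FMA is a single uOP and 2 FMA units are available.
num_fma_units = 2
ideal_rthroughput = 1 / num_fma_units

print(f"IPC: {ipc:.2f}")
print(f"measured rthroughput: {measured_rthroughput:.4f}")
print(f"ideal rthroughput:    {ideal_rthroughput:.4f}")
```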

NOTE: on Zen1 only, wider YMM fmadd* are decoded into 2 uOPs. That is because zen1 only natively supports up to 128-bit data types. So, rthroughput of wide YMM fmadd* on zen1 is expected to be 1.

If my understanding is correct, the code snippet used by exegesis to do throughput analysis is not great because it introduces unwanted data dependencies.

Example:

<snip>
    - 'VFMADD132PDr XMM4 XMM4 XMM1 XMM9'
    - 'VFMADD132PDr XMM5 XMM5 XMM4 XMM15'    <== A RAW on XMM4  ?

I am not an exegesis expert, but there may be an issue with the code it uses to test throughput.
The following code snippet should give you an ideal rthroughput of 0.5 (at least, it does on my zen1).

vfmadd132pd %xmm0, %xmm1, %xmm2
vfmadd132pd %xmm0, %xmm1, %xmm3
vfmadd132pd %xmm0, %xmm1, %xmm4
vfmadd132pd %xmm0, %xmm1, %xmm5
vfmadd132pd %xmm0, %xmm1, %xmm6
vfmadd132pd %xmm0, %xmm1, %xmm7
vfmadd132pd %xmm0, %xmm1, %xmm8
vfmadd132pd %xmm0, %xmm1, %xmm9
vfmadd132pd %xmm0, %xmm1, %xmm10
vfmadd132pd %xmm0, %xmm1, %xmm11
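
To make the dependency argument concrete, here is a small sketch that scans a snippet written in the exegesis `key: instructions:` notation and reports read-after-write pairs within one pass. It assumes the first operand after the opcode is the def and all remaining operands are uses (the tied operand included); `find_raw_hazards` is a hypothetical helper, not part of llvm-exegesis, and the second snippet is a hand transcription of the independent sequence above into that notation.

```python
def find_raw_hazards(snippet):
    """Return (writer_index, reader_index, register) for each in-snippet RAW."""
    last_writer = {}  # register -> index of the latest instruction that defined it
    hazards = []
    for index, line in enumerate(snippet):
        # Assumed layout: opcode, def, then the use operands.
        _opcode, definition, *uses = line.split()
        for reg in dict.fromkeys(uses):  # drop duplicate uses, keep order
            if reg in last_writer:
                hazards.append((last_writer[reg], index, reg))
        last_writer[definition] = index
    return hazards


# The two instructions quoted from the exegesis snippet.
exegesis_pair = [
    "VFMADD132PDr XMM4 XMM4 XMM1 XMM9",
    "VFMADD132PDr XMM5 XMM5 XMM4 XMM15",
]

# Ten instructions that share their inputs and each write a distinct register.
independent = [f"VFMADD132PDr XMM{n} XMM{n} XMM1 XMM0" for n in range(2, 12)]

print(find_raw_hazards(exegesis_pair))
print(find_raw_hazards(independent))
```

The first call flags the XMM4 dependency between instructions 0 and 1; the second finds none, because no instruction reads a register that an earlier one in the same pass has written.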

@RKSimon
Collaborator

RKSimon commented Dec 4, 2022

Agreed - I'm noticing a lot of cases where exegesis is making really poor register dependency decisions.

@adibiagio
Collaborator

Rather than simply randomising input regs, exegesis should try to generate a code sequence with no RAW dependencies between instructions (like in my example). RAW dependencies would still arise between instructions from different iterations. However, if regalloc is done properly, those dependencies would happen so late that throughput won’t be affected in practice.
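
One way such a sequence could be generated is sketched below. This is an illustration of the idea only, not how llvm-exegesis does it: `independent_snippet` and its parameters are hypothetical names, and the operand layout (def, tied use, then the remaining uses) is assumed from the snippets quoted earlier in this thread.

```python
import random


def independent_snippet(opcode, registers, num_uses=2, seed=0):
    """Build a snippet in which no instruction reads another instruction's def."""
    rng = random.Random(seed)
    pool = list(registers)
    rng.shuffle(pool)
    # Reserve a few registers that are only ever read ...
    uses = pool[:num_uses]
    # ... and give every instruction its own register to write.
    defs = pool[num_uses:]
    # Assumed layout: opcode, def, tied use (same as def), remaining uses.
    return [f"{opcode} {d} {d} {' '.join(uses)}" for d in defs]


xmm = [f"XMM{i}" for i in range(16)]
for line in independent_snippet("VFMADD132PDr", xmm):
    print(line)
```

With 16 XMM registers and 2 shared inputs this yields 14 instructions. The only remaining RAW dependency is each def register being read by the same instruction in the next iteration, 14 instructions later, which is the late dependency described above.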

@LebedevRI
Member

Hm, i'd like to take a look if no one is already looking into this?

@RKSimon RKSimon removed their assignment Dec 4, 2022
@RKSimon
Collaborator

RKSimon commented Dec 4, 2022

> Hm, i'd like to take a look if no one is already looking into this?

Sure - please go for it - there's 2 tasks to do afaict:
1 - confirm/fix znver3 model WriteFMA throughputs/resources
2 - investigate better register selection in llvm-exegesis codegen throughput tests (new ticket?)

@LebedevRI
Copy link
Member

Right. The other way around though.

@llvmbot
Collaborator

llvmbot commented Dec 4, 2022

@llvm/issue-subscribers-tools-llvm-mca

@llvmbot
Collaborator

llvmbot commented Dec 4, 2022

@llvm/issue-subscribers-tools-llvm-exegesis

@LebedevRI
Member

exegesis part: https://reviews.llvm.org/D139283

LebedevRI added a commit that referenced this issue Dec 6, 2022
…tfail for instrs w/ tied variables

As it is being discussed in #59325,
at least for the instructions with tied variables,
when trying to parallelize the instructions,
register selection is rather bad, and may either
use (read) a register which we have already used for a def,
or vice versa.

That introduces serialization, and leads to
overly pessimistic inverse throughput measurement.

The new implementation avoids that.

New result:
```
$ ninja llvm-exegesis && ./bin/llvm-exegesis --mode=inverse_throughput --opcode-name=VFMADD132PDr --max-configs-per-opcode=9182
ninja: no work to do.
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4af034.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VFMADD132PDr XMM3 XMM3 XMM4 XMM8'
    - 'VFMADD132PDr XMM5 XMM5 XMM14 XMM7'
    - 'VFMADD132PDr XMM10 XMM10 XMM11 XMM15'
    - 'VFMADD132PDr XMM13 XMM13 XMM15 XMM15'
    - 'VFMADD132PDr XMM12 XMM12 XMM11 XMM1'
    - 'VFMADD132PDr XMM0 XMM0 XMM6 XMM9'
    - 'VFMADD132PDr XMM2 XMM2 XMM15 XMM11'
  config:          ''
  register_initial_values:
    - 'XMM3=0x0'
    - 'XMM4=0x0'
    - 'XMM8=0x0'
    - 'MXCSR=0x0'
    - 'XMM5=0x0'
    - 'XMM14=0x0'
    - 'XMM7=0x0'
    - 'XMM10=0x0'
    - 'XMM11=0x0'
    - 'XMM15=0x0'
    - 'XMM13=0x0'
    - 'XMM12=0x0'
    - 'XMM1=0x0'
    - 'XMM0=0x0'
    - 'XMM6=0x0'
    - 'XMM9=0x0'
    - 'XMM2=0x0'
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: inverse_throughput, value: 0.6403, per_snippet_value: 4.4821 }
error:           ''
info:            instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, randomizing registers for uses
assembled_snippet
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-f05c2f.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VFMADD132PDr XMM15 XMM15 XMM11 XMM2'
    - 'VFMADD132PDr XMM5 XMM5 XMM11 XMM2'
    - 'VFMADD132PDr XMM14 XMM14 XMM11 XMM2'
    - 'VFMADD132PDr XMM4 XMM4 XMM11 XMM2'
    - 'VFMADD132PDr XMM8 XMM8 XMM11 XMM2'
    - 'VFMADD132PDr XMM3 XMM3 XMM11 XMM2'
    - 'VFMADD132PDr XMM10 XMM10 XMM11 XMM2'
    - 'VFMADD132PDr XMM7 XMM7 XMM11 XMM2'
    - 'VFMADD132PDr XMM13 XMM13 XMM11 XMM2'
    - 'VFMADD132PDr XMM9 XMM9 XMM11 XMM2'
    - 'VFMADD132PDr XMM1 XMM1 XMM11 XMM2'
    - 'VFMADD132PDr XMM6 XMM6 XMM11 XMM2'
    - 'VFMADD132PDr XMM0 XMM0 XMM11 XMM2'
    - 'VFMADD132PDr XMM12 XMM12 XMM11 XMM2'
  config:          ''
  register_initial_values:
    - 'XMM15=0x0'
    - 'XMM11=0x0'
    - 'XMM2=0x0'
    - 'MXCSR=0x0'
    - 'XMM5=0x0'
    - 'XMM14=0x0'
    - 'XMM4=0x0'
    - 'XMM8=0x0'
    - 'XMM3=0x0'
    - 'XMM10=0x0'
    - 'XMM7=0x0'
    - 'XMM13=0x0'
    - 'XMM9=0x0'
    - 'XMM1=0x0'
    - 'XMM6=0x0'
    - 'XMM0=0x0'
    - 'XMM12=0x0'
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: inverse_throughput, value: 0.5312, per_snippet_value: 7.4368 }
error:           ''
info:            instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, one unique register for each use position
assembled_snippet
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c32060.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VFMADD132PDr XMM10 XMM10 XMM6 XMM6'
    - 'VFMADD132PDr XMM8 XMM8 XMM6 XMM6'
    - 'VFMADD132PDr XMM12 XMM12 XMM6 XMM6'
    - 'VFMADD132PDr XMM9 XMM9 XMM6 XMM6'
    - 'VFMADD132PDr XMM7 XMM7 XMM6 XMM6'
    - 'VFMADD132PDr XMM1 XMM1 XMM6 XMM6'
    - 'VFMADD132PDr XMM0 XMM0 XMM6 XMM6'
    - 'VFMADD132PDr XMM5 XMM5 XMM6 XMM6'
    - 'VFMADD132PDr XMM11 XMM11 XMM6 XMM6'
    - 'VFMADD132PDr XMM2 XMM2 XMM6 XMM6'
    - 'VFMADD132PDr XMM15 XMM15 XMM6 XMM6'
    - 'VFMADD132PDr XMM3 XMM3 XMM6 XMM6'
    - 'VFMADD132PDr XMM14 XMM14 XMM6 XMM6'
    - 'VFMADD132PDr XMM4 XMM4 XMM6 XMM6'
    - 'VFMADD132PDr XMM13 XMM13 XMM6 XMM6'
  config:          ''
  register_initial_values:
    - 'XMM10=0x0'
    - 'XMM6=0x0'
    - 'MXCSR=0x0'
    - 'XMM8=0x0'
    - 'XMM12=0x0'
    - 'XMM9=0x0'
    - 'XMM7=0x0'
    - 'XMM1=0x0'
    - 'XMM0=0x0'
    - 'XMM5=0x0'
    - 'XMM11=0x0'
    - 'XMM2=0x0'
    - 'XMM15=0x0'
    - 'XMM3=0x0'
    - 'XMM14=0x0'
    - 'XMM4=0x0'
    - 'XMM13=0x0'
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: inverse_throughput, value: 0.5311, per_snippet_value: 7.9665 }
error:           ''
info:            instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, reusing the same register for all uses
assembled_snippet
...
```

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D139283
@LebedevRI LebedevRI self-assigned this Dec 6, 2022
@RKSimon
Collaborator

RKSimon commented Dec 11, 2022

@LebedevRI Are we now able to confirm that znver3 WriteFMA rthroughput is 0.5?

@LebedevRI
Member

> @LebedevRI Are we now able to confirm that znver3 WriteFMA rthroughput is 0.5?

I'm currently trying to do just that :)
