New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[llvm-mca] Unexpected Rthroughput for vfmadd* instructions in znver3 #59325
Comments
LLVM's znver1/znver2 instruction schedules are, well, lot a lot to be desired. They may be rather wrong. ---
mode: inverse_throughput
key:
instructions:
- 'VFMADD132PDr XMM0 XMM0 XMM11 XMM2'
- 'VFMADD132PDr XMM1 XMM1 XMM4 XMM8'
- 'VFMADD132PDr XMM2 XMM2 XMM5 XMM10'
- 'VFMADD132PDr XMM3 XMM3 XMM7 XMM14'
- 'VFMADD132PDr XMM4 XMM4 XMM1 XMM9'
- 'VFMADD132PDr XMM5 XMM5 XMM4 XMM15'
- 'VFMADD132PDr XMM6 XMM6 XMM9 XMM12'
- 'VFMADD132PDr XMM7 XMM7 XMM15 XMM1'
- 'VFMADD132PDr XMM8 XMM8 XMM9 XMM0'
- 'VFMADD132PDr XMM9 XMM9 XMM6 XMM5'
- 'VFMADD132PDr XMM10 XMM10 XMM15 XMM12'
- 'VFMADD132PDr XMM11 XMM11 XMM2 XMM2'
- 'VFMADD132PDr XMM12 XMM12 XMM6 XMM1'
- 'VFMADD132PDr XMM13 XMM13 XMM4 XMM11'
- 'VFMADD132PDr XMM14 XMM14 XMM0 XMM5'
- 'VFMADD132PDr XMM15 XMM15 XMM4 XMM1'
config: ''
register_initial_values:
- 'XMM0=0x0'
- 'XMM11=0x0'
- 'XMM2=0x0'
- 'MXCSR=0x0'
- 'XMM1=0x0'
- 'XMM4=0x0'
- 'XMM8=0x0'
- 'XMM5=0x0'
- 'XMM10=0x0'
- 'XMM3=0x0'
- 'XMM7=0x0'
- 'XMM14=0x0'
- 'XMM9=0x0'
- 'XMM15=0x0'
- 'XMM6=0x0'
- 'XMM12=0x0'
- 'XMM13=0x0'
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: inverse_throughput, value: 0.7951, per_snippet_value: 12.7216 }
error: ''
info: instruction has tied variables, using static renaming.
assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C4104883EC04C70424801F0000C5F8AE14244883C4044883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F2C244883C410C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C34883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C4104883EC04C70424801F0000C5F8AE14244883C4044883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F2C244883C41049B80200000000000000C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F9C4E2A198C2C4C2D998C8C4C2D198D2C4C2C198DEC4C2F198E1C4C2D998EFC4C2B198F4C4E28198F9C462B198C0C462C998CDC4428198D4C462E998DAC462C998E1C442D998EBC462F998F5C462D998F94983C0FF0F8556FFFFFFC3
... So it's not really |
@RKSimon probably yet another znver2/znver1 bug? |
I'll take a look |
@llvm/issue-subscribers-backend-x86 |
Zen processors can issue up to 2 FMA uOPs per cycle. The FPU implement 2 FMA units. Reciprocal throughput for vfmadd*/vfmsub* is expected to be 0.5 and not 1. I have run a quick perf test on my zen1, and I've got these numbers:
NOTE: on Zen1 only, wider YMM fmadd* are decoded into 2 uOPs. That is because zen1 only natively supports up to 128-bit data types. So, rthroughput of wide YMM fmadd* on zen1 is expected to be 1. If my understanding is correct, the code snippet used by exegesis to do throughput analysis is not great because it introduces unwanted data dependencies. Example:
I am not an expert of exegesis. However, there may be an issue with the code used by it to test throughput.
|
Agreed - I'm noticing a lot of cases where exegesis is making really poor register dependency decisions. |
Rather than simply randomising input regs, exegesis should try to generate a code sequence with no RAW dependencies between instructions (like in my example). RAW dependencies would still arise between instructions from different iterations. However, if regalloc is done properly, those dependencies would happen so late that throughput won’t be affected in practice. |
Hm, i'd like to take a look if no one is already looking into this? |
Sure - please go for it - there's 2 tasks to do afaict: |
Right. The other way around though. |
@llvm/issue-subscribers-tools-llvm-mca |
@llvm/issue-subscribers-tools-llvm-exegesis |
exegesis part: https://reviews.llvm.org/D139283 |
…tfail for instrs w/ tied variables As it is being discussed in #59325, at least for the instructions with tied variables, when trying to parallelize the instructions, register selection is rather bad, and may either use a register which we have used for def, or vice versa. That introduces serialization, and leads to overly pessimistic inverse throughput measurement. The new implementation avoids that, New result: ``` $ ninja llvm-exegesis && ./bin/llvm-exegesis --mode=inverse_throughput --opcode-name=VFMADD132PDr --max-configs-per-opcode=9182 ninja: no work to do. Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4af034.o --- mode: inverse_throughput key: instructions: - 'VFMADD132PDr XMM3 XMM3 XMM4 XMM8' - 'VFMADD132PDr XMM5 XMM5 XMM14 XMM7' - 'VFMADD132PDr XMM10 XMM10 XMM11 XMM15' - 'VFMADD132PDr XMM13 XMM13 XMM15 XMM15' - 'VFMADD132PDr XMM12 XMM12 XMM11 XMM1' - 'VFMADD132PDr XMM0 XMM0 XMM6 XMM9' - 'VFMADD132PDr XMM2 XMM2 XMM15 XMM11' config: '' register_initial_values: - 'XMM3=0x0' - 'XMM4=0x0' - 'XMM8=0x0' - 'MXCSR=0x0' - 'XMM5=0x0' - 'XMM14=0x0' - 'XMM7=0x0' - 'XMM10=0x0' - 'XMM11=0x0' - 'XMM15=0x0' - 'XMM13=0x0' - 'XMM12=0x0' - 'XMM1=0x0' - 'XMM0=0x0' - 'XMM6=0x0' - 'XMM9=0x0' - 'XMM2=0x0' cpu_name: znver3 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 10000 measurements: - { key: inverse_throughput, value: 0.6403, per_snippet_value: 4.4821 } error: '' info: instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, randomizing registers for uses assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C4104883EC04C70424801F0000C5F8AE14244883C4044883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C410C4C2D998D8C4E28998EFC442A198D7C4428198EFC462A198E1C4C2C998C1C4C28198D3C4C2D998D8C4E28998EFC442A198D7C4428198EFC462A198E1C4C2C998C1C4C28198D3C4C2D998D8C4E28998EFC442A198D7C4428198EFC462A198E1C4C2C998C1C4C28198D3C4C2D998D8C4E28998EFC442A198D7C4428198EFC462A198E1C4C2C998C1C4C28198D3C3 ... Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-f05c2f.o --- mode: inverse_throughput key: instructions: - 'VFMADD132PDr XMM15 XMM15 XMM11 XMM2' - 'VFMADD132PDr XMM5 XMM5 XMM11 XMM2' - 'VFMADD132PDr XMM14 XMM14 XMM11 XMM2' - 'VFMADD132PDr XMM4 XMM4 XMM11 XMM2' - 'VFMADD132PDr XMM8 XMM8 XMM11 XMM2' - 'VFMADD132PDr XMM3 XMM3 XMM11 XMM2' - 'VFMADD132PDr XMM10 XMM10 XMM11 XMM2' - 'VFMADD132PDr XMM7 XMM7 XMM11 XMM2' - 'VFMADD132PDr XMM13 XMM13 XMM11 XMM2' - 'VFMADD132PDr XMM9 XMM9 XMM11 XMM2' - 'VFMADD132PDr XMM1 XMM1 XMM11 XMM2' - 'VFMADD132PDr XMM6 XMM6 XMM11 XMM2' - 'VFMADD132PDr XMM0 XMM0 XMM11 XMM2' - 'VFMADD132PDr XMM12 XMM12 XMM11 XMM2' config: '' register_initial_values: - 'XMM15=0x0' - 'XMM11=0x0' - 'XMM2=0x0' - 'MXCSR=0x0' - 'XMM5=0x0' - 'XMM14=0x0' - 'XMM4=0x0' - 'XMM8=0x0' - 'XMM3=0x0' - 'XMM10=0x0' - 'XMM7=0x0' - 'XMM13=0x0' - 'XMM9=0x0' - 'XMM1=0x0' - 'XMM6=0x0' - 'XMM0=0x0' - 'XMM12=0x0' cpu_name: znver3 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 10000 measurements: - { key: inverse_throughput, value: 0.5312, per_snippet_value: 7.4368 } error: '' info: instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, one unique register for each use position assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C4104883EC04C70424801F0000C5F8AE14244883C4044883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F24244883C410C462A198FAC4E2A198EAC462A198F2C4E2A198E2C462A198C2C4E2A198DAC462A198D2C4E2A198FAC462A198EAC462A198CAC4E2A198CAC4E2A198F2C4E2A198C2C462A198E2C462A198FAC4E2A198EAC462A198F2C4E2A198E2C462A198C2C4E2A198DAC462A198D2C4E2A198FAC462A198EAC462A198CAC4E2A198CAC4E2A198F2C4E2A198C2C462A198E2C462A198FAC4E2A198EAC462A198F2C4E2A198E2C462A198C2C4E2A198DAC462A198D2C4E2A198FAC462A198EAC462A198CAC4E2A198CAC4E2A198F2C4E2A198C2C462A198E2C462A198FAC4E2A198EAC462A198F2C4E2A198E2C462A198C2C4E2A198DAC462A198D2C4E2A198FAC462A198EAC462A198CAC4E2A198CAC4E2A198F2C4E2A198C2C462A198E2C3 ... Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c32060.o --- mode: inverse_throughput key: instructions: - 'VFMADD132PDr XMM10 XMM10 XMM6 XMM6' - 'VFMADD132PDr XMM8 XMM8 XMM6 XMM6' - 'VFMADD132PDr XMM12 XMM12 XMM6 XMM6' - 'VFMADD132PDr XMM9 XMM9 XMM6 XMM6' - 'VFMADD132PDr XMM7 XMM7 XMM6 XMM6' - 'VFMADD132PDr XMM1 XMM1 XMM6 XMM6' - 'VFMADD132PDr XMM0 XMM0 XMM6 XMM6' - 'VFMADD132PDr XMM5 XMM5 XMM6 XMM6' - 'VFMADD132PDr XMM11 XMM11 XMM6 XMM6' - 'VFMADD132PDr XMM2 XMM2 XMM6 XMM6' - 'VFMADD132PDr XMM15 XMM15 XMM6 XMM6' - 'VFMADD132PDr XMM3 XMM3 XMM6 XMM6' - 'VFMADD132PDr XMM14 XMM14 XMM6 XMM6' - 'VFMADD132PDr XMM4 XMM4 XMM6 XMM6' - 'VFMADD132PDr XMM13 XMM13 XMM6 XMM6' config: '' register_initial_values: - 'XMM10=0x0' - 'XMM6=0x0' - 'MXCSR=0x0' - 'XMM8=0x0' - 'XMM12=0x0' - 'XMM9=0x0' - 'XMM7=0x0' - 'XMM1=0x0' - 'XMM0=0x0' - 'XMM5=0x0' - 'XMM11=0x0' - 'XMM2=0x0' - 'XMM15=0x0' - 'XMM3=0x0' - 'XMM14=0x0' - 'XMM4=0x0' - 'XMM13=0x0' cpu_name: znver3 llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 10000 measurements: - { key: inverse_throughput, value: 0.5311, per_snippet_value: 7.9665 } error: '' info: instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, reusing the same register for all uses assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C4104883EC04C70424801F0000C5F8AE14244883C4044883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F2C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F1C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F34244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F24244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F2C244883C410C462C998D6C462C998C6C462C998E6C462C998CEC4E2C998FEC4E2C998CEC4E2C998C6C4E2C998EEC462C998DEC4E2C998D6C462C998FEC4E2C998DEC462C998F6C4E2C998E6C462C998EEC462C998D6C462C998C6C462C998E6C462C998CEC4E2C998FEC4E2C998CEC4E2C998C6C4E2C998EEC462C998DEC4E2C998D6C462C998FEC4E2C998DEC462C998F6C4E2C998E6C462C998EEC462C998D6C462C998C6C462C998E6C462C998CEC4E2C998FEC4E2C998CEC4E2C998C6C4E2C998EEC462C998DEC4E2C998D6C462C998FEC4E2C998DEC462C998F6C4E2C998E6C462C998EEC462C998D6C462C998C6C462C998E6C462C998CEC4E2C998FEC4E2C998CEC4E2C998C6C4E2C998EEC462C998DEC4E2C998D6C462C998FEC4E2C998DEC462C998F6C4E2C998E6C462C998EEC3 ... ``` Reviewed By: courbet Differential Revision: https://reviews.llvm.org/D139283
@LebedevRI Are we now able confirm that znver3 WriteFMA rthroughput is 0.5? |
I'm currently trying to do just that :) |
llvm-mca reports for zen3 the Rthroughput of
vfmadd*
instructions is 1But for zen2 it is 0.5.
Should they both be 0.5? Or is there any micro-architecture change that makes zen3 slower than zen2 on
vfmadd*
?The text was updated successfully, but these errors were encountered: