Description
uops.info appears to show that on some CPUs, popcnt has a false dependency on the destination register only for the 16-bit reg/mem operand size. (Note that uops.info lists popcnt under the SSE category, not under the base, other, or BMI categories.)
Clicking the latency entries for popcnt r/m32 and popcnt r/m64 shows 0 latency for operand 1 -> operand 1, where operand 1 is the destination register.
uops.info shows this behavior for the following CPUs:
- Cannon Lake
- Ice Lake
- Tiger Lake
- Rocket Lake
- Alder Lake-P
- Alder Lake-E
- Goldmont Plus
- Tremont
- Zen+ and Zen 2-4 (though I don't think any of the Zen CPU tunings in X86.td currently include TuningPOPCNTFalseDeps, so I'm not sure how relevant that is)
The main reason for this issue is that the Alder Lake tuning in X86.td inherits the TuningPOPCNTFalseDeps tuning feature from SKLTuning on this line:
llvm-project/llvm/lib/Target/X86/X86.td, line 1294 in 190778a:
list<SubtargetFeature> ADLTuning = !listconcat(SKLTuning, ADLAdditionalTuning);
This results in unneeded xor instructions before some popcnt r/m64 and popcnt r/m32 instructions.
I double-checked with llvm-exegesis that only popcnt r/m16 has a false dependency on my Alder Lake CPU (12th Gen Intel(R) Core(TM) i7-12700H):
64bit -- 1 cycle:
printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'popcnt %rax, %rcx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
---
mode: latency
key:
instructions:
- 'POPCNT64rr RCX RAX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 1.0212, per_snippet_value: 1.0212, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B90200000000000000F3480FB8C8F3480FB8C8F3480FB8C8F3480FB8C8C3
...
64bit -- still 1 cycle when xoring dest register rcx:
printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'xor %ecx, %ecx' 'popcnt %rax, %rcx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
---
mode: latency
key:
instructions:
- 'XOR32rr ECX ECX ECX'
- 'POPCNT64rr RCX RAX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 0.5202, per_snippet_value: 1.0404, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B9020000000000000031C9F3480FB8C831C9F3480FB8C831C9F3480FB8C831C9F3480FB8C8C3
...
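The `value` and `per_snippet_value` fields diverge once the snippet contains more than one instruction. The figures in this issue are consistent with `value` being `per_snippet_value` divided by the number of instructions in the snippet. That reading is inferred from the numbers here, not taken from the llvm-exegesis documentation. A minimal Python check of the arithmetic, using a hypothetical helper name:

```python
def per_instruction(per_snippet_value, num_instructions):
    """Average cycles per instruction, ASSUMING value = per_snippet_value / count."""
    return per_snippet_value / num_instructions


# (per_snippet_value, instructions in snippet, reported value),
# copied from the two 64-bit runs in this issue.
RUNS_64BIT = [
    (1.0212, 1, 1.0212),  # popcnt %rax, %rcx
    (1.0404, 2, 0.5202),  # xor %ecx, %ecx ; popcnt %rax, %rcx
]

for per_snippet, count, reported in RUNS_64BIT:
    # Each reported value should match the division within rounding error.
    assert abs(per_instruction(per_snippet, count) - reported) < 1e-4
```

So when comparing a run with the xor against a run without it, `per_snippet_value` is the field to compare, since both snippets contain exactly one popcnt.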
32bit -- 1 cycle:
printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'popcnt %eax, %ecx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
---
mode: latency
key:
instructions:
- 'POPCNT32rr ECX EAX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 1.0208, per_snippet_value: 1.0208, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B90200000000000000F30FB8C8F30FB8C8F30FB8C8F30FB8C8C3
...
32bit -- still 1 cycle when xoring dest register rcx:
printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'xor %ecx, %ecx' 'popcnt %eax, %ecx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
---
mode: latency
key:
instructions:
- 'XOR32rr ECX ECX ECX'
- 'POPCNT32rr ECX EAX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 0.5227, per_snippet_value: 1.0454, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B9020000000000000031C9F30FB8C831C9F30FB8C831C9F30FB8C831C9F30FB8C8C3
...
16 bit -- 3 cycles due to false dependency:
printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'popcnt %ax, %cx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
---
mode: latency
key:
instructions:
- 'POPCNT16rr CX AX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 3.017, per_snippet_value: 3.017, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B9020000000000000066F30FB8C866F30FB8C866F30FB8C866F30FB8C8C3
...
16 bit -- back to 1 cycle when xoring destination register rcx:
printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'xor %ecx, %ecx' 'popcnt %ax, %cx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
---
mode: latency
key:
instructions:
- 'XOR32rr ECX ECX ECX'
- 'POPCNT16rr CX AX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 0.5194, per_snippet_value: 1.0388, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B9020000000000000031C966F30FB8C831C966F30FB8C831C966F30FB8C831C966F30FB8C8C3
...
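The six runs above can be summarized with a small Python sketch that compares the per-snippet latency of popcnt alone against xor + popcnt for each operand width. The function names and the 1.5x threshold are illustrative assumptions, not part of llvm-exegesis or LLVM. The latencies are copied from the output above.

```python
# width in bits -> (per_snippet_value for popcnt alone, for xor + popcnt)
P_CORE_RUNS = {
    64: (1.0212, 1.0404),
    32: (1.0208, 1.0454),
    16: (3.017, 1.0388),
}


def has_false_dependency(alone, with_xor, ratio=1.5):
    """Heuristic: flag a false dependency when zeroing the destination first
    makes the snippet at least `ratio` times faster.
    The default of 1.5 is an arbitrary illustrative threshold."""
    return alone / with_xor >= ratio


def classify(runs):
    """Map each operand width to True if the xor noticeably helped."""
    return {width: has_false_dependency(*pair) for width, pair in runs.items()}


if __name__ == "__main__":
    for width, flagged in classify(P_CORE_RUNS).items():
        print(f"popcnt r/m{width}: false dependency on dest = {flagged}")
```

With these numbers only the 16-bit form is flagged, which matches the uops.info listing described at the top.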
I also double-checked that the Alder Lake E cores behave the same way as the P cores, using systemd-run to rerun just the 32-bit and 16-bit popcnt tests on only my E cores (CPUs 12-19):
32bit -- 1 cycle:
sudo systemd-run -p AllowedCPUs=12-19 --send-sighup --pty -- printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'popcnt %eax, %ecx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
Running as unit: run-p694886-i694887.service
Press ^] three times within 1s to disconnect TTY.
---
mode: latency
key:
instructions:
- 'POPCNT32rr ECX EAX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 1.0214, per_snippet_value: 1.0214, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B90200000000000000F30FB8C8F30FB8C8F30FB8C8F30FB8C8C3
...
32bit -- still 1 cycle when xoring dest register rcx:
sudo systemd-run -p AllowedCPUs=12-19 --send-sighup --pty -- printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'xor %ecx, %ecx' 'popcnt %eax, %ecx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
Running as unit: run-p695328-i695329.service; invocation ID: 94e816d26d6c4a33a8a5efa0899be086
Press ^] three times within 1s to disconnect TTY.
---
mode: latency
key:
instructions:
- 'XOR32rr ECX ECX ECX'
- 'POPCNT32rr ECX EAX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 0.5205, per_snippet_value: 1.041, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B9020000000000000031C9F30FB8C831C9F30FB8C831C9F30FB8C831C9F30FB8C8C3
...
16 bit -- 3 cycles due to false dependency:
sudo systemd-run -p AllowedCPUs=12-19 --send-sighup --pty -- printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'popcnt %ax, %cx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
Running as unit: run-p695858-i695859.service
Press ^] three times within 1s to disconnect TTY.
---
mode: latency
key:
instructions:
- 'POPCNT16rr CX AX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 3.0179, per_snippet_value: 3.0179, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B9020000000000000066F30FB8C866F30FB8C866F30FB8C866F30FB8C8C3
...
16 bit -- back to 1 cycle when xoring destination register rcx:
sudo systemd-run -p AllowedCPUs=12-19 --send-sighup --pty -- printf '%s\n' '#LLVM-EXEGESIS-DEFREG RAX 2' '#LLVM-EXEGESIS-DEFREG RCX 2' 'xor %ecx, %ecx' 'popcnt %ax, %cx' | llvm-exegesis --mode=latency --repetition-mode=duplicate --snippets-file=-
Running as unit: run-p696642-i696643.service; invocation ID: 9bf18c1f47a645fcbc340206cb3657f3
Press ^] three times within 1s to disconnect TTY.
---
mode: latency
key:
instructions:
- 'XOR32rr ECX ECX ECX'
- 'POPCNT16rr CX AX'
config: ''
register_initial_values:
- 'RAX=0x2'
- 'RCX=0x2'
cpu_name: alderlake
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 0.521, per_snippet_value: 1.042, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 48B8020000000000000048B9020000000000000031C966F30FB8C831C966F30FB8C831C966F30FB8C831C966F30FB8C8C3
...
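One way to read the 3-cycle versus 1-cycle results in this issue is as a simple steady-state dependency-chain model. The sketch below is a deliberate simplification with a hypothetical function name. The latency of 3 cycles and reciprocal throughput of 1 cycle are assumptions chosen to match the measurements above, not values taken from a vendor manual.

```python
def steady_state_cycles(latency, recip_throughput, chained):
    """Approximate cycles per instruction for a long run of one repeated instruction.

    If each instruction must wait for the previous result (a true or a false
    dependency on the destination), the chain advances once per `latency`
    cycles. Otherwise the execution port is the limit and one instruction
    completes every `recip_throughput` cycles.
    """
    return max(latency, recip_throughput) if chained else recip_throughput


# ASSUMED parameters, picked to match the ~3.02 and ~1.02 cycle measurements.
POPCNT_LATENCY = 3.0
POPCNT_RECIP_THROUGHPUT = 1.0

# 16-bit form: the write to cx merges into rcx, so each popcnt waits on the last one.
chained = steady_state_cycles(POPCNT_LATENCY, POPCNT_RECIP_THROUGHPUT, chained=True)
# 32/64-bit forms, or any form after the xor: consecutive popcnts are independent.
independent = steady_state_cycles(POPCNT_LATENCY, POPCNT_RECIP_THROUGHPUT, chained=False)

print(chained, independent)
```

Under this model the xor only pays off when the destination actually carries a dependency, which in these measurements is the 16-bit case alone.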