Conversation

arsenm (Contributor) commented Oct 2, 2025

This appears to be a forgotten switch flip from 2015. It
seems to do a nicer job with subregister copies. Most of the
test changes are improvements or neutral, and only a few are
slight regressions. The worst AMDGPU regressions are for true16
in the atomic tests, but I think those are due to existing
true16 issues.
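
Since the flipped option is an ordinary boolean cl::opt, the old
behavior stays reachable from the command line. A minimal sketch for
comparing the two modes (the input file name is hypothetical):

    # New default: terminal rule applied during register coalescing
    llc -O2 -o - input.ll
    # Restore the pre-patch behavior
    llc -O2 -terminal-rule=0 -o - input.ll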

I also had to hack many Hexagon tests to disable the rule. I have
no idea how to update these tests. They appear to test specific
scheduling and packet formation in later machine passes, so any change
in the incoming MIR is likely to hide whatever was originally intended.
I'll open an issue to fix up these tests once this lands.
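
The Hexagon hack is a one-line RUN-line tweak per test; a hypothetical
example of the kind of +1/-1 change involved (the exact RUN lines vary
per test):

    ; Before:
    ; RUN: llc -march=hexagon < %s | FileCheck %s
    ; After: keep the old coalescer behavior until the checks are regenerated
    ; RUN: llc -march=hexagon -terminal-rule=0 < %s | FileCheck %s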

arsenm (Contributor, Author) commented Oct 2, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot (Member) commented Oct 2, 2025

@llvm/pr-subscribers-backend-systemz
@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-globalisel

Author: Matt Arsenault (arsenm)

Changes

This appears to be a forgotten switch flip from 2015. It
seems to do a nicer job with subregister copies. Most of the
test changes are improvements or neutral, and only a few are
slight regressions. The worst AMDGPU regressions are for true16
in the atomic tests, but I think those are due to existing
true16 issues.

I also had to hack many Hexagon tests to disable the rule. I have
no idea how to update these tests. They appear to test specific
scheduling and packet formation in later machine passes, so any change
in the incoming MIR is likely to hide whatever was originally intended.
I'll open an issue to fix up these tests once this lands.


Patch is 1.65 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/161621.diff

152 Files Affected:

  • (modified) llvm/lib/CodeGen/RegisterCoalescer.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/build-vector-two-dup.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll (+15-24)
  • (modified) llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll (+34-51)
  • (modified) llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-vector.ll (+20-19)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-reshuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll (+36-36)
  • (modified) llvm/test/CodeGen/AArch64/sve-ptest-removal-sink.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+46-46)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+180-194)
  • (modified) llvm/test/CodeGen/AMDGPU/and.ll (+47-53)
  • (modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/bfi_nested.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bfm.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/bitreverse.ll (+19-23)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-cond-add-sub.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-sext-inreg.ll (+26-30)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+39-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin_legacy.ll (+11-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+40-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+21-27)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+9-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+20-22)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll (+14-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+3-5)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+1038-1012)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+1224-1182)
  • (modified) llvm/test/CodeGen/AMDGPU/lshr.v2i16.ll (+12-13)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/max.ll (+40-46)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+204-204)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_int24.ll (+60-69)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_uint24-amdgcn.ll (+50-56)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+12-14)
  • (modified) llvm/test/CodeGen/AMDGPU/set-inactive-wwm-overwrite.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.v2i16.ll (+30-36)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v3f32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v3i32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v3p3.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/sign_extend.ll (+42-48)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+31-32)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (+18-23)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv.ll (+30-32)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+12-16)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/zext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/BPF/objdump_cond_op_2.ll (+2-2)
  • (modified) llvm/test/CodeGen/Hexagon/late_instr.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-conv3x3-nested.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi11.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi12.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-kernel-phi1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-matmul-bitext.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-copies.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-deps7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-reuse-phi-6.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-b128.ll (+75-75)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm70.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm90.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/ctrloop-fp128.ll (+3-3)
  • (modified) llvm/test/CodeGen/PowerPC/licm-xxsplti.ll (+27-27)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-form-prepare.ll (+3-5)
  • (modified) llvm/test/CodeGen/PowerPC/perfect-shuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-1.ll (+3-2)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-2.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/branch-on-zero.ll (+6-10)
  • (modified) llvm/test/CodeGen/RISCV/machine-pipeliner.ll (+23-23)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr95865.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vandn-sdnode.ll (+33-33)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vcpop-shl-zext-opt.ll (+14-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll (+12-12)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fadd-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fsub-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll (+9-9)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll (+24-26)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll (+45-46)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float32regloops.ll (+102-109)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-increment.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll (+45-45)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pipelineloops.ll (+26-26)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shuffle.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vst4.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/pacbti-m-vla.ll (+1-1)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-shift-in-loop.ll (+6-8)
  • (modified) llvm/test/CodeGen/X86/3addr-16bit.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test.ll (+13-9)
  • (modified) llvm/test/CodeGen/X86/atomicrmw-fadd-fp-vector.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/bitcast-vector-bool.ll (+16-16)
  • (modified) llvm/test/CodeGen/X86/coalescer-dead-flag-verifier-error.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/dag-update-nodetomatch.ll (+52-48)
  • (modified) llvm/test/CodeGen/X86/fold-loop-of-urem.ll (+38-43)
  • (modified) llvm/test/CodeGen/X86/freeze-binary.ll (+14-12)
  • (modified) llvm/test/CodeGen/X86/i128-mul.ll (+87-91)
  • (modified) llvm/test/CodeGen/X86/icmp-abs-C.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/masked_gather_scatter.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/midpoint-int.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i16.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i32.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i8.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/optimize-max-0.ll (+107-104)
  • (modified) llvm/test/CodeGen/X86/parity.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/rotate-extract.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/smul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/sshl_sat.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/sshl_sat_vec.ll (+56-57)
  • (modified) llvm/test/CodeGen/X86/stackmap.ll (+6-3)
  • (modified) llvm/test/CodeGen/X86/subvectorwise-store-of-vector-splat.ll (+105-105)
  • (modified) llvm/test/CodeGen/X86/twoaddr-lea.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/umul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/ushl_sat.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/ushl_sat_vec.ll (+55-56)
  • (modified) llvm/test/CodeGen/X86/vector-mulfix-legalize.ll (+17-17)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-xor-bool.ll (+80-80)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-by-byte-multiple-legalization.ll (+3023-3058)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll (+668-676)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll (+165-163)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll (+49-46)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrapping.ll (+9-9)
  • (modified) llvm/test/CodeGen/X86/xor.ll (+66-66)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll (+11-10)
diff --git a/llvm/lib/CodeGen/RegisterCoalescer.cpp b/llvm/lib/CodeGen/RegisterCoalescer.cpp
index 7ac1aef83777a..5bd38a916fe4d 100644
--- a/llvm/lib/CodeGen/RegisterCoalescer.cpp
+++ b/llvm/lib/CodeGen/RegisterCoalescer.cpp
@@ -81,7 +81,7 @@ static cl::opt<bool> EnableJoining("join-liveintervals",
 
 static cl::opt<bool> UseTerminalRule("terminal-rule",
                                      cl::desc("Apply the terminal rule"),
-                                     cl::init(false), cl::Hidden);
+                                     cl::init(true), cl::Hidden);
 
 /// Temporary flag to test critical edge unsplitting.
 static cl::opt<bool> EnableJoinSplits(
diff --git a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
index dbbfbea9176f6..f725c19081deb 100644
--- a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
+++ b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
@@ -188,11 +188,11 @@ entry:
 define <8 x i8> @test11(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b) {
 ; CHECK-LABEL: test11:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    ld1r { v1.8b }, [x0]
-; CHECK-NEXT:    ld1r { v2.8b }, [x1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    mov v0.h[2], v2.h[0]
-; CHECK-NEXT:    mov v0.h[3], v1.h[0]
+; CHECK-NEXT:    ld1r { v0.8b }, [x0]
+; CHECK-NEXT:    ld1r { v1.8b }, [x1]
+; CHECK-NEXT:    fmov d2, d0
+; CHECK-NEXT:    mov v0.h[2], v1.h[0]
+; CHECK-NEXT:    mov v0.h[3], v2.h[0]
 ; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
index 3230c9e946da7..b3a7ec961b736 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
@@ -20,20 +20,17 @@ define i32 @sink_load_and_copy(i32 %n) {
 ; CHECK-NEXT:    b.lt .LBB0_3
 ; CHECK-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-NEXT:    adrp x8, A
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:    ldr w21, [x8, :lo12:A]
+; CHECK-NEXT:    mov w21, w19
+; CHECK-NEXT:    ldr w20, [x8, :lo12:A]
 ; CHECK-NEXT:  .LBB0_2: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    mov w0, w21
+; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w20, w20, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB0_2
-; CHECK-NEXT:    b .LBB0_4
-; CHECK-NEXT:  .LBB0_3:
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:  .LBB0_4: // %for.cond.cleanup
-; CHECK-NEXT:    mov w0, w20
+; CHECK-NEXT:  .LBB0_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
@@ -82,15 +79,12 @@ define i32 @cant_sink_successive_call(i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB1_2
-; CHECK-NEXT:    b .LBB1_4
-; CHECK-NEXT:  .LBB1_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB1_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB1_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
@@ -139,15 +133,12 @@ define i32 @cant_sink_successive_store(ptr nocapture readnone %store, i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB2_2
-; CHECK-NEXT:    b .LBB2_4
-; CHECK-NEXT:  .LBB2_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB2_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB2_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
index e7e109170d6a1..338084295fc7f 100644
--- a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
+++ b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
@@ -16,13 +16,12 @@ define i32 @test(ptr %ptr) {
 ; CHECK-NEXT:    mov w9, wzr
 ; CHECK-NEXT:  LBB0_1: ; %.thread
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    lsr w11, w9, #1
 ; CHECK-NEXT:    sub w10, w9, #1
-; CHECK-NEXT:    mov w9, w11
+; CHECK-NEXT:    lsr w9, w9, #1
 ; CHECK-NEXT:    tbnz w10, #0, LBB0_1
 ; CHECK-NEXT:  ; %bb.2: ; %bb343
 ; CHECK-NEXT:    and w9, w10, #0x1
-; CHECK-NEXT:    mov w0, #-1
+; CHECK-NEXT:    mov w0, #-1 ; =0xffffffff
 ; CHECK-NEXT:    str w9, [x8]
 ; CHECK-NEXT:    ret
 bb:
diff --git a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
index b947c943ba448..72f6646930624 100644
--- a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
+++ b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
@@ -151,12 +151,11 @@ define void @dont_coalesce_arg_f16(half %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_f16
@@ -190,12 +189,11 @@ define void @dont_coalesce_arg_f32(float %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $s0 killed $s0 killed $z0
 ; CHECK-NEXT:    str s0, [sp, #12] // 4-byte Folded Spill
+; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr s0, [sp, #12] // 4-byte Folded Reload
 ; CHECK-NEXT:    bl use_f32
@@ -229,12 +227,11 @@ define void @dont_coalesce_arg_f64(double %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_f64
@@ -273,12 +270,11 @@ define void @dont_coalesce_arg_v1i8(<1 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -313,12 +309,11 @@ define void @dont_coalesce_arg_v1i16(<1 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -353,12 +348,11 @@ define void @dont_coalesce_arg_v1i32(<1 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -393,12 +387,11 @@ define void @dont_coalesce_arg_v1i64(<1 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -433,12 +426,11 @@ define void @dont_coalesce_arg_v1f16(<1 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -513,12 +505,11 @@ define void @dont_coalesce_arg_v1f64(<1 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
@@ -557,12 +548,11 @@ define void @dont_coalesce_arg_v16i8(<16 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -596,12 +586,11 @@ define void @dont_coalesce_arg_v8i16(<8 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -635,12 +624,11 @@ define void @dont_coalesce_arg_v4i32(<4 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -674,12 +662,11 @@ define void @dont_coalesce_arg_v2i64(<2 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -713,12 +700,11 @@ define void @dont_coalesce_arg_v8f16(<8 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -752,12 +738,11 @@ define void @dont_coalesce_arg_v8bf16(<8 x bfloat> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8bf16
@@ -791,12 +776,11 @@ define void @dont_coalesce_arg_v4f32(<4 x float> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4f32
@@ -830,12 +814,11 @@ define void @dont_coalesce_arg_v2f64(<2 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
index f2163ad15bafc..df88f37195ed6 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
@@ -129,12 +129,11 @@ define <2 x double> @streaming_compatible_with_neon_vectors(<2 x double> %arg) "
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    mrs x19, SVCR
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    mrs x19, SVCR
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
-; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
 ; CHECK-NEXT:    tbz w19, #0, .LBB4_2
 ; CHECK-NEXT:  // %bb.1:
 ; CHECK-NEXT:    smstop sm
diff --git a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
index 6c6a691760af3..52a77cb396909 100644
--- a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
@@ -147,15 +147,15 @@ define <2 x float> @extract_v2f32_nxv16f32_2(<vscale x 16 x float> %arg) {
 define <4 x i1> @extract_v4i1_nxv32i1_0(<vscale x 32 x i1> %arg) {
 ; CHECK-LABEL: extract_v4i1_nxv32i1_0:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov z1.b, p0/z, #1 // =0x1
-; CHECK-NEXT:    umov w8, v1.b[1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    umov w9, v1.b[2]
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    umov w8, v0.b[1]
+; CHECK-NEXT:    mov v1.16b, v0.16b
 ; CHECK-NEXT:    mov v0.h[1], w8
+; CHECK-NEXT:    umov w8, v1.b[2]
+; CHECK-NEXT:    mov v0.h[2], w8
 ; CHECK-NEXT:    umov w8, v1.b[3]
-; CHECK-NEXT:    mov v0.h[2], w9
 ; CHECK-NEXT:    mov v0.h[3], w8
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    ret
   %ext = call <4 x i1> @llvm.vector.extract.v4i1.n...
[truncated]

llvmbot (Member) commented Oct 2, 2025

@llvm/pr-subscribers-backend-aarch64

-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -752,12 +738,11 @@ define void @dont_coalesce_arg_v8bf16(<8 x bfloat> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8bf16
@@ -791,12 +776,11 @@ define void @dont_coalesce_arg_v4f32(<4 x float> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4f32
@@ -830,12 +814,11 @@ define void @dont_coalesce_arg_v2f64(<2 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
index f2163ad15bafc..df88f37195ed6 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
@@ -129,12 +129,11 @@ define <2 x double> @streaming_compatible_with_neon_vectors(<2 x double> %arg) "
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    mrs x19, SVCR
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    mrs x19, SVCR
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
-; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
 ; CHECK-NEXT:    tbz w19, #0, .LBB4_2
 ; CHECK-NEXT:  // %bb.1:
 ; CHECK-NEXT:    smstop sm
diff --git a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
index 6c6a691760af3..52a77cb396909 100644
--- a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
@@ -147,15 +147,15 @@ define <2 x float> @extract_v2f32_nxv16f32_2(<vscale x 16 x float> %arg) {
 define <4 x i1> @extract_v4i1_nxv32i1_0(<vscale x 32 x i1> %arg) {
 ; CHECK-LABEL: extract_v4i1_nxv32i1_0:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov z1.b, p0/z, #1 // =0x1
-; CHECK-NEXT:    umov w8, v1.b[1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    umov w9, v1.b[2]
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    umov w8, v0.b[1]
+; CHECK-NEXT:    mov v1.16b, v0.16b
 ; CHECK-NEXT:    mov v0.h[1], w8
+; CHECK-NEXT:    umov w8, v1.b[2]
+; CHECK-NEXT:    mov v0.h[2], w8
 ; CHECK-NEXT:    umov w8, v1.b[3]
-; CHECK-NEXT:    mov v0.h[2], w9
 ; CHECK-NEXT:    mov v0.h[3], w8
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    ret
   %ext = call <4 x i1> @llvm.vector.extract.v4i1.n...
[truncated]
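For anyone wanting the pre-patch behavior, the option flipped here is an ordinary boolean cl::opt, so it can still be disabled per invocation. A minimal sketch of a lit RUN line (the Hexagon triple and default FileCheck prefix are illustrative assumptions, not lines taken from the patch):

; RUN: llc -mtriple=hexagon -terminal-rule=0 < %s | FileCheck %s

Because -terminal-rule is a boolean option, -terminal-rule=0 and -terminal-rule=false are equivalent spellings.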

@llvmbot
Member

llvmbot commented Oct 2, 2025

@llvm/pr-subscribers-llvm-regalloc


@llvmbot
Member

llvmbot commented Oct 2, 2025

@llvm/pr-subscribers-backend-amdgpu

@@ -313,12 +309,11 @@ define void @dont_coalesce_arg_v1i16(<1 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -353,12 +348,11 @@ define void @dont_coalesce_arg_v1i32(<1 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -393,12 +387,11 @@ define void @dont_coalesce_arg_v1i64(<1 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -433,12 +426,11 @@ define void @dont_coalesce_arg_v1f16(<1 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -513,12 +505,11 @@ define void @dont_coalesce_arg_v1f64(<1 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
@@ -557,12 +548,11 @@ define void @dont_coalesce_arg_v16i8(<16 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -596,12 +586,11 @@ define void @dont_coalesce_arg_v8i16(<8 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -635,12 +624,11 @@ define void @dont_coalesce_arg_v4i32(<4 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -674,12 +662,11 @@ define void @dont_coalesce_arg_v2i64(<2 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -713,12 +700,11 @@ define void @dont_coalesce_arg_v8f16(<8 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -752,12 +738,11 @@ define void @dont_coalesce_arg_v8bf16(<8 x bfloat> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8bf16
@@ -791,12 +776,11 @@ define void @dont_coalesce_arg_v4f32(<4 x float> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4f32
@@ -830,12 +814,11 @@ define void @dont_coalesce_arg_v2f64(<2 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
index f2163ad15bafc..df88f37195ed6 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
@@ -129,12 +129,11 @@ define <2 x double> @streaming_compatible_with_neon_vectors(<2 x double> %arg) "
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    mrs x19, SVCR
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    mrs x19, SVCR
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
-; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
 ; CHECK-NEXT:    tbz w19, #0, .LBB4_2
 ; CHECK-NEXT:  // %bb.1:
 ; CHECK-NEXT:    smstop sm
diff --git a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
index 6c6a691760af3..52a77cb396909 100644
--- a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
@@ -147,15 +147,15 @@ define <2 x float> @extract_v2f32_nxv16f32_2(<vscale x 16 x float> %arg) {
 define <4 x i1> @extract_v4i1_nxv32i1_0(<vscale x 32 x i1> %arg) {
 ; CHECK-LABEL: extract_v4i1_nxv32i1_0:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov z1.b, p0/z, #1 // =0x1
-; CHECK-NEXT:    umov w8, v1.b[1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    umov w9, v1.b[2]
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    umov w8, v0.b[1]
+; CHECK-NEXT:    mov v1.16b, v0.16b
 ; CHECK-NEXT:    mov v0.h[1], w8
+; CHECK-NEXT:    umov w8, v1.b[2]
+; CHECK-NEXT:    mov v0.h[2], w8
 ; CHECK-NEXT:    umov w8, v1.b[3]
-; CHECK-NEXT:    mov v0.h[2], w9
 ; CHECK-NEXT:    mov v0.h[3], w8
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    ret
   %ext = call <4 x i1> @llvm.vector.extract.v4i1.n...
[truncated]

Copy link
Collaborator

@qcolombet left a comment


To give an opportunity for targets to adopt the terminal rule on their own, I think it makes sense to have a target hook for it, like what we do with join-globalcopies and ::enableJoinGlobalCopies.

I don't remember why I didn't flip the switch when I introduced this rule, so I want to make sure we are friendly with downstream users.
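
A minimal sketch of the kind of hook being proposed, modeled on the existing enableJoinGlobalCopies() opt-in; the class below is a simplified stand-in rather than the real llvm/CodeGen/TargetSubtargetInfo.h, and the name enableTerminalRule() is an assumption for illustration, not part of this patch:

```cpp
// Simplified stand-in for TargetSubtargetInfo; not the real LLVM header.
struct TargetSubtargetInfoSketch {
  virtual ~TargetSubtargetInfoSketch() = default;

  // Mirrors the shape of the existing enableJoinGlobalCopies() opt-in,
  // which lets each target decide whether the coalescer joins global copies.
  virtual bool enableJoinGlobalCopies() const { return false; }

  // Hypothetical analogue for the terminal rule; name and default value
  // are assumptions, shown only to illustrate the proposed pattern.
  virtual bool enableTerminalRule() const { return false; }
};
```

Under this pattern the coalescer would consult the hook when deciding whether to apply the rule, so each backend could flip it independently of the global default.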

@arsenm
Copy link
Contributor Author

arsenm commented Oct 2, 2025

To give an opportunity for targets to adopt the terminal rule on their own, I think it makes sense to have a target hook for it, like what we do with join-globalcopies and ::enableJoinGlobalCopies.

I specifically think we should not have a control for this, and we should have a policy against introducing this style of target hook. I think these are lazy shortcuts to avoid touching the targets that whoever adds the feature doesn't care about, and the project is worse off for it. The reality of target maintenance is that this will never be implemented by any target. If we wait several years, a few might discover it and turn it on.

@preames
Copy link
Collaborator

preames commented Oct 2, 2025

I specifically think we should not have a control for this, and we should have a policy against introducing this style of target hook. I think these are lazy shortcuts to avoid touching the targets that whoever adds the feature doesn't care about, and the project is worse off for it. The reality of target maintenance is that this will never be implemented by any target. If we wait several years, a few might discover it and turn it on.

@arsenm I generally agree with you here, but I think this is a case where adding a hook would make review much easier. I'd propose adding the hook, enabling it for AMDGPU, then doing individual changes for the other backends, and then (if desired) removing the hook.

LGTM from the RISC-V perspective, and the change generally seems reasonable.
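
A sketch of the staged rollout described above, reusing the hypothetical hook from the earlier stand-in; AMDGPUSubtargetSketch is illustrative, not the real AMDGPU subtarget class:

```cpp
// Same simplified stand-in as before; not the real LLVM headers.
struct TargetSubtargetInfoSketch {
  virtual ~TargetSubtargetInfoSketch() = default;
  virtual bool enableTerminalRule() const { return false; }
};

// One backend opts in per patch; once every target returns true, the hook
// and these overrides can be deleted and the default flipped globally.
struct AMDGPUSubtargetSketch : TargetSubtargetInfoSketch {
  bool enableTerminalRule() const override { return true; }
};
```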
