[InstCombine] Limit canonicalization of extractelement(cast) to constant index or same basic block. #166227

azwolski · 2025-11-03T20:15:35Z

The current canonicalization of extractelement(cast) requires that the CastInst has only one use. However, a single use inside a loop also satisfies this condition, even though the cast result is then consumed once per iteration rather than once overall.

} else if (auto *CI = dyn_cast<CastInst>(I)) {
  // Canonicalize extractelement(cast) -> cast(extractelement).
  // Bitcasts can change the number of vector elements, and they cost
  // nothing.
  if (CI->hasOneUse() && (CI->getOpcode() != Instruction::BitCast)){

Before

%34 = fptosi <4 x float> %33 to <4 x i32>
;/loop{
%40 = extractelement <4 x i32> %34, i32 %36

After

;/loop{
%37 = extractelement <4 x float> %30, i32 %32
%38 = fptosi float %37 to i32

After canonicalization, this particular example no longer casts the entire vector with a single instruction. Instead, it casts one element per loop iteration, which is slower.
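
To make the cost difference concrete, here is a minimal C++ sketch that counts conversions for the two shapes. It is a toy model, not LLVM code: `ConversionCount`, `castThenExtract`, `extractThenCast`, and `checkShapesAgree` are hypothetical names, and it assumes the whole-vector conversion maps to a single instruction, which depends on the target.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy counters for conversions executed at run time (hypothetical names).
struct ConversionCount {
  std::size_t vectorCasts = 0; // whole-vector conversions (fptosi on <4 x float>)
  std::size_t scalarCasts = 0; // per-element conversions (fptosi on float)
};

using Float4 = std::array<float, 4>;
using Int4 = std::array<int, 4>;

// "Before" shape: convert the whole vector once, then extract in the loop.
Int4 castThenExtract(const Float4 &in, ConversionCount &count) {
  Int4 converted{};
  for (std::size_t lane = 0; lane < in.size(); ++lane)
    converted[lane] = static_cast<int>(in[lane]);
  ++count.vectorCasts; // assumption: modeled as one vector instruction

  Int4 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = converted[i]; // extractelement with a variable index
  return out;
}

// "After" shape: extract the float in the loop, then convert that scalar.
Int4 extractThenCast(const Float4 &in, ConversionCount &count) {
  Int4 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    float elem = in[i];              // extractelement on the float vector
    out[i] = static_cast<int>(elem); // scalar conversion on every iteration
    ++count.scalarCasts;
  }
  return out;
}

// Both shapes must produce the same values; only the counts differ.
void checkShapesAgree(const Float4 &in) {
  ConversionCount before, after;
  assert(castThenExtract(in, before) == extractThenCast(in, after));
  assert(before.vectorCasts == 1 && after.scalarCasts == in.size());
}
```

Calling `checkShapesAgree` on any four in-range floats confirms that the results match while the second shape performs one scalar conversion per iteration. Whether one vector conversion is cheaper than four scalar ones is target dependent; the sketch only counts operations.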

Ideally, we would check that the cast instruction has one use and that this use is not inside a loop. However, InstCombine/InstCombineVectorOps.cpp does not have access to utilities such as LoopInfo for that check. It might be possible to approximate it by analyzing basic block successors or by building a dominator tree, but that may be costly.

One way to prevent this canonicalization in the loop case is to apply it only if the index is a constant or the use is in the same basic block as the cast instruction:

if (CI->hasOneUse() && (CI->getOpcode() != Instruction::BitCast)) {
    Instruction *U = cast<Instruction>(*CI->user_begin());
    if (U->getParent() == CI->getParent() || isa<ConstantInt>(Index)){
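
The decision made by this condition can be sketched without LLVM as a small predicate. The struct below is a hypothetical summary of the facts the real code reads from the IR; the field names and integer block ids are assumptions made for illustration only.

```cpp
#include <cassert>

// Hypothetical summary of the facts the real check queries from the IR.
struct ExtractOfCast {
  bool castHasOneUse;   // stands in for CI->hasOneUse()
  bool castIsBitCast;   // stands in for CI->getOpcode() == Instruction::BitCast
  int castBlock;        // stands in for CI->getParent()
  int userBlock;        // stands in for U->getParent()
  bool indexIsConstant; // stands in for isa<ConstantInt>(Index)
};

// Mirrors the proposed condition: canonicalize only for a single-use,
// non-bitcast cast whose user is in the same block or whose index is constant.
bool shouldCanonicalize(const ExtractOfCast &p) {
  if (!p.castHasOneUse || p.castIsBitCast)
    return false;
  return p.userBlock == p.castBlock || p.indexIsConstant;
}

// The motivating loop case: cast in block 0, extract in block 2, variable index.
void checkLoopCaseIsSkipped() {
  assert(!shouldCanonicalize({true, false, 0, 2, false}));
}
```

The same-block test is only a cheap stand-in for "not in a loop": it also rejects cross-block uses that are not in any loop, and it cannot see a loop that consists of a single basic block.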

Fix: #165793

github-actions · 2025-11-03T20:17:25Z

✅ With the latest revision this PR passed the C/C++ code formatter.

azwolski · 2025-11-03T20:32:10Z

In @test_poison_branch, the cast and extractelement are placed in different basic blocks. However, the canonicalization still occurs because tryToSinkInstruction in InstCombine/InstructionCombining.cpp sinks the cast into the false block:

BB after sinking:

false:                                            ; preds = %entry
  %vi = fptosi <4 x float> %in to <4 x i32>
  %elem = extractelement <4 x i32> %vi, i32 %i
  call void @use(i32 %elem)
  br label %done
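
A rough way to see why the branch case still canonicalizes while the loop case does not is to model sinking as a step that runs before the same-block check. The sketch below is not the logic of tryToSinkInstruction, which has more legality conditions; it assumes a simplified rule in which the cast is sunk only when the user's block has the cast's block as its unique predecessor. `Block`, `blockAfterSinking`, and `sameBlockAfterSinking` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical block description: an id plus its unique predecessor,
// or -1 when the block has zero or several predecessors.
struct Block {
  int id;
  int uniquePred;
};

// Simplified sinking rule (assumption): move the single-use cast into the
// user's block only if that block's unique predecessor is the cast's block.
int blockAfterSinking(int castBlock, const Block &userBlock) {
  if (userBlock.id != castBlock && userBlock.uniquePred == castBlock)
    return userBlock.id;
  return castBlock;
}

// After sinking, does the same-basic-block condition hold?
bool sameBlockAfterSinking(int castBlock, const Block &userBlock) {
  return blockAfterSinking(castBlock, userBlock) == userBlock.id;
}

void checkBranchAndLoop() {
  // Branch shape: entry (0) branches to false (2), so false has unique pred 0.
  assert(sameBlockAfterSinking(0, Block{2, 0}));
  // Loop shape: the body (2) is entered from the loop header (1), not entry (0).
  assert(!sameBlockAfterSinking(0, Block{2, 1}));
}
```

Under this simplified rule, the cast in @test_poison_branch ends up next to its extractelement, so the same-block condition holds, while a cast placed before a loop stays where it is.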

llvmbot · 2025-11-03T20:38:18Z

@llvm/pr-subscribers-llvm-transforms

Author: None (azwolski)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/166227.diff

2 Files Affected:

(modified) llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp (+5-2)
(modified) llvm/test/Transforms/InstCombine/vec_extract_var_elt.ll (+35-5)

diff --git a/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp b/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
index 18a45c6799bac..44c3863dd97b5 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
@@ -589,8 +589,11 @@ Instruction *InstCombinerImpl::visitExtractElementInst(ExtractElementInst &EI) {
       // Bitcasts can change the number of vector elements, and they cost
       // nothing.
       if (CI->hasOneUse() && (CI->getOpcode() != Instruction::BitCast)) {
-        Value *EE = Builder.CreateExtractElement(CI->getOperand(0), Index);
-        return CastInst::Create(CI->getOpcode(), EE, EI.getType());
+        Instruction *U = cast<Instruction>(*CI->user_begin());
+        if (U->getParent() == CI->getParent() || isa<ConstantInt>(Index)) {
+          Value *EE = Builder.CreateExtractElement(CI->getOperand(0), Index);
+          return CastInst::Create(CI->getOpcode(), EE, EI.getType());
+        }
       }
     }
   }
diff --git a/llvm/test/Transforms/InstCombine/vec_extract_var_elt.ll b/llvm/test/Transforms/InstCombine/vec_extract_var_elt.ll
index 205b4b88c473a..f96b7070f9f2a 100644
--- a/llvm/test/Transforms/InstCombine/vec_extract_var_elt.ll
+++ b/llvm/test/Transforms/InstCombine/vec_extract_var_elt.ll
@@ -40,19 +40,50 @@ define i32 @test_bitcast(i32 %i) {
 
 declare void @use(i32)
 
+define void @test_poison_branch(<4 x float> %in, i32 %a, i1 %cond) {
+; CHECK-LABEL: define void @test_poison_branch(
+; CHECK-SAME: <4 x float> [[IN:%.*]], i32 [[A:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[I:%.*]] = add i32 [[A]], -2
+; CHECK-NEXT:    br i1 [[COND]], label %[[TRUE:.*]], label %[[FALSE:.*]]
+; CHECK:       [[TRUE]]:
+; CHECK-NEXT:    call void @use(i32 [[I]])
+; CHECK-NEXT:    br label %[[DONE:.*]]
+; CHECK:       [[FALSE]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = extractelement <4 x float> [[IN]], i32 [[I]]
+; CHECK-NEXT:    [[ELEM:%.*]] = fptosi float [[TMP0]] to i32
+; CHECK-NEXT:    call void @use(i32 [[ELEM]])
+; CHECK-NEXT:    br label %[[DONE]]
+; CHECK:       [[DONE]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %vi = fptosi <4 x float> %in to <4 x i32>
+  %i = add i32 %a, -2
+  br i1 %cond, label %true, label %false
+true:
+  call void @use(i32 %i)
+  br label %done
+false:
+  %elem = extractelement <4 x i32> %vi, i32 %i
+  call void @use(i32 %elem)
+  br label %done
+done:
+  ret void
+}
+
 define void @test_loop(<4 x float> %in) {
 ; CHECK-LABEL: define void @test_loop(
 ; CHECK-SAME: <4 x float> [[IN:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
-; CHECK-NEXT:    [[R:%.*]] = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> [[IN]], i32 9)
+; CHECK-NEXT:    [[VI:%.*]] = fptosi <4 x float> [[IN]] to <4 x i32>
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[NEXT:%.*]], %[[LATCH:.*]] ]
 ; CHECK-NEXT:    [[COND:%.*]] = icmp samesign ult i32 [[I]], 4
 ; CHECK-NEXT:    br i1 [[COND]], label %[[BODY:.*]], label %[[DONE:.*]]
 ; CHECK:       [[BODY]]:
-; CHECK-NEXT:    [[TMP0:%.*]] = extractelement <4 x float> [[R]], i32 [[I]]
-; CHECK-NEXT:    [[ELEM:%.*]] = fptosi float [[TMP0]] to i32
+; CHECK-NEXT:    [[ELEM:%.*]] = extractelement <4 x i32> [[VI]], i32 [[I]]
 ; CHECK-NEXT:    call void @use(i32 [[ELEM]])
 ; CHECK-NEXT:    br label %[[LATCH]]
 ; CHECK:       [[LATCH]]:
@@ -62,8 +93,7 @@ define void @test_loop(<4 x float> %in) {
 ; CHECK-NEXT:    ret void
 ;
 entry:
-  %r = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %in, i32 9)
-  %vi = fptosi <4 x float> %r to <4 x i32>
+  %vi = fptosi <4 x float> %in to <4 x i32>
   br label %loop
 loop:
   %i = phi i32 [ 0, %entry ], [ %next, %latch ]

azwolski · 2025-11-13T10:38:45Z

@nikic

github-actions · 2025-11-19T14:49:08Z

🐧 Linux x64 Test Results

186387 tests passed
4859 tests skipped

azwolski · 2025-11-19T15:20:35Z

@RKSimon ping

RKSimon

LGTM - but I think we'd be better off looking at moving this to VectorCombine at some point soon to allow it to be cost driven. Also, FTR bitcasts aren't always free - fp<->int bitcasts (fp bit twiddling etc.) in particular can be a major headache.

azwolski added 5 commits November 3, 2025 11:53

[InstCombine] Limit canonicalization of extractelement(cast) to const…

aa49b2b

…ant index

[InstCombine] Update vec_extract_var_elt.ll test checks

298613c

[InstCombine] Refactor canonicalization of extractelement(cast) to co…

91e0c1b

…nstant index or same basic block.

[InstCombine] Add test_poison_branch test and update vec_extract_var_…

742d97e

…elt.ll test checks

[InstCombine] Remove unused declaration of @use_vi in vec_extract_var…

4d8d8fd

…_elt.ll

[InstCombine] Fix formatting in visitExtractElementInst

4a1267a

azwolski marked this pull request as ready for review November 3, 2025 20:37

azwolski requested a review from nikic as a code owner November 3, 2025 20:37

llvmbot added llvm:instcombine Covers the InstCombine, InstSimplify and AggressiveInstCombine passes llvm:transforms labels Nov 3, 2025

aneshlya mentioned this pull request Nov 6, 2025

Sub-par code generation when using extract ispc/ispc#3081

Open

dtcxzyw requested a review from RKSimon November 8, 2025 12:24

azwolski mentioned this pull request Nov 17, 2025

Limit canonicalization and add lit test ispc/ispc#3637

Open

5 tasks

[InstCombine] Simplify conditional checks for extractelement(cast) ca…

99da128

…nonicalization

RKSimon approved these changes Nov 20, 2025
