Skip to content

Conversation

@folkertdev
Copy link
Contributor

@folkertdev folkertdev commented Aug 28, 2025

cc @lukel97 @badumbatish #145108

I've been learning a bit about LLVM, trying to make progress on some of these issues. The code below is based on #145108 (comment), by implementing shouldExpandReduction.

The implementation works for the test cases I added, but (obviously) fails for any existing cases. ISD::VECREDUCE_AND and ISD::VECREDUCE_OR are now marked as legal, which is required for the Pats to fire, but when they don't that causes a selection failure.

So, I'm wondering, what is the right approach here. Should I mark these intrinsics as Custom instead and manually perform the transformation in C++? Or is there some trick to still get the default lowering (a series of loads and scalar bitwise operations) when the patterns don't fire?

@badumbatish
Copy link
Contributor

badumbatish commented Aug 29, 2025

hi @folkertdev, I didn't quite spend much time on the PR but it seems like you're failing at isel stage because it cant find the pattern of (i32( reduce_and (v8i16 V128))). Here's the more detailed error in one of the test cases

LLVM ERROR: Cannot select: t8: i32 = vecreduce_or t2
  t2: v8i16 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
In function: test_any_v8i1

So if you add these to the tablegen file it shouldn't fail anymore

def : Pat<(i32 (vecreduce_or(v8i16 V128:$vec))), (ANYTRUE V128:$vec)>;
def : Pat<(i32 (vecreduce_or(v16i8 V128:$vec))), (ANYTRUE V128:$vec)>;
def : Pat<(i32 (vecreduce_or(v4i32 V128:$vec))), (ANYTRUE V128:$vec)>;
def : Pat<(i32 (vecreduce_or(v2i64 V128:$vec))), (ANYTRUE V128:$vec)>;

def : Pat<(i32 (vecreduce_and(v8i16 V128:$vec))), (ALLTRUE_I8x16 V128:$vec)>;
def : Pat<(i32 (vecreduce_and(v16i8 V128:$vec))), (ALLTRUE_I16x8 V128:$vec)>;
def : Pat<(i32 (vecreduce_and(v4i32 V128:$vec))), (ALLTRUE_I32x4 V128:$vec)>;
def : Pat<(i32 (vecreduce_and(v2i64 V128:$vec))), (ALLTRUE_I64x2 V128:$vec)>;

However you should be careful about these works, idk where in the codebase this happens (which file specifically) yet (my LLDB got stuck for 5min trying to load the program) but if you run following command to reproduce, you'll find that t3, t4, t5 is being turned into t8 uncaringly between optimized lowered and type legalized stage

build/bin/llc -verify-machineinstrs -disable-wasm-fallthrough-return-opt -wasm-disable-explicit-locals -wasm-keep-registers -mattr=+simd128 llvm/test/CodeGen/WebAssembly/simd-vecreduce-bool.ll -debug-only=dagcomb
ine,isel -o - &> test.txt

Initial selection DAG: %bb.0 'test_any_v8i1:'
SelectionDAG has 7 nodes:
    t0: ch,glue = EntryToken
          t2: v8i16 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
        t3: v8i1 = truncate t2
      t4: i1 = vecreduce_or t3
    t5: i32 = any_extend t4
  t6: ch = WebAssemblyISD::RETURN t0, t5



Combining: t6: ch = WebAssemblyISD::RETURN t0, t5

Combining: t5: i32 = any_extend t4

Combining: t4: i1 = vecreduce_or t3

Combining: t3: v8i1 = truncate t2

Combining: t2: v8i16 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>

Combining: t1: i32 = TargetConstant<0>

Combining: t0: ch,glue = EntryToken

Optimized lowered selection DAG: %bb.0 'test_any_v8i1:'
SelectionDAG has 7 nodes:
    t0: ch,glue = EntryToken
          t2: v8i16 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
        t3: v8i1 = truncate t2
      t4: i1 = vecreduce_or t3
    t5: i32 = any_extend t4
  t6: ch = WebAssemblyISD::RETURN t0, t5



Type-legalized selection DAG: %bb.0 'test_any_v8i1:'
SelectionDAG has 5 nodes:
    t0: ch,glue = EntryToken
      t2: v8i16 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
    t8: i32 = vecreduce_or t2
  t6: ch = WebAssemblyISD::RETURN t0, t8

we haven't proved that the other bits besides LSB bit in t2 is 0 yet before doing so, so if these transform were to go through while some bits other than LSB is 1 and LSB is 0, this will result in the wrong result.

I'm still a bit new so not sure if that transformation i pointed out is correct by design or not but it seems that seems like a bigger issue if you were to set the VEC_REDUCE to legal

@folkertdev
Copy link
Contributor Author

I wasn't sure that that would be accurate, based on:

https://llvm.org/docs/LangRef.html#llvm-vector-reduce-and-intrinsic

The llvm.vector.reduce.and.* intrinsics do a bitwise AND reduction of a vector, returning the result as a scalar. The return type matches the element-type of the vector input.

While the wasm spec says that all_true specifically produces a boolean, so just 0 or 1:

https://www.w3.org/TR/2025/CRD-wasm-core-2-20250616/#-hrefsyntax-shapemathitshapemathsfhrefsyntax-instr-vecmathsfall_true

Maybe I'm missing something though?


I'm open to other suggestions, of course, if marking those operations as legal is not a good approach. Marking them as Custom instead seems like it would work, it just requires a bunch of C++ code. The patterns are much more elegant, but maybe there is no other way.

@folkertdev
Copy link
Contributor Author

I've now implemented a manual combine between setcc and vecreduce_{and, or}. That works really well in most cases, but leads to some changes in tests.

This seems acceptable to me:

 ; SIMD128-LABEL: pairwise_or_v2i64:
 ; SIMD128:         .functype pairwise_or_v2i64 (v128) -> (i64)
 ; SIMD128-NEXT:  # %bb.0:
-; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7
-; SIMD128-NEXT:    v128.or $push1=, $0, $pop0
-; SIMD128-NEXT:    i64x2.extract_lane $push2=, $pop1, 0
+; SIMD128-NEXT:    i64x2.extract_lane $push1=, $0, 0
+; SIMD128-NEXT:    i64x2.extract_lane $push0=, $0, 1
+; SIMD128-NEXT:    i64.or $push2=, $pop1, $pop0
 ; SIMD128-NEXT:    return $pop2
   %res = tail call i64 @llvm.vector.reduce.or.v2i64(<2 x i64> %arg)
   ret i64 %res

But in some cases it looks like the previous implementation was a lot smarter about the vector reduction than the default lowering.

 ; SIMD128-LABEL: pairwise_or_v8i16:
 ; SIMD128:         .functype pairwise_or_v8i16 (v128) -> (i32)
 ; SIMD128-NEXT:  # %bb.0:
-; SIMD128-NEXT:    i8x16.shuffle $push0=, $0, $0, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    v128.or $push8=, $0, $pop0
-; SIMD128-NEXT:    local.tee $push7=, $0=, $pop8
-; SIMD128-NEXT:    i8x16.shuffle $push1=, $0, $0, 4, 5, 6, 7, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    v128.or $push6=, $pop7, $pop1
-; SIMD128-NEXT:    local.tee $push5=, $0=, $pop6
-; SIMD128-NEXT:    i8x16.shuffle $push2=, $0, $0, 2, 3, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-; SIMD128-NEXT:    v128.or $push3=, $pop5, $pop2
-; SIMD128-NEXT:    i16x8.extract_lane_u $push4=, $pop3, 0
-; SIMD128-NEXT:    return $pop4
+; SIMD128-NEXT:    i16x8.extract_lane_u $push1=, $0, 0
+; SIMD128-NEXT:    i16x8.extract_lane_u $push0=, $0, 1
+; SIMD128-NEXT:    i32.or $push2=, $pop1, $pop0
+; SIMD128-NEXT:    i16x8.extract_lane_u $push3=, $0, 2
+; SIMD128-NEXT:    i32.or $push4=, $pop2, $pop3
+; SIMD128-NEXT:    i16x8.extract_lane_u $push5=, $0, 3
+; SIMD128-NEXT:    i32.or $push6=, $pop4, $pop5
+; SIMD128-NEXT:    i16x8.extract_lane_u $push7=, $0, 4
+; SIMD128-NEXT:    i32.or $push8=, $pop6, $pop7
+; SIMD128-NEXT:    i16x8.extract_lane_u $push9=, $0, 5
+; SIMD128-NEXT:    i32.or $push10=, $pop8, $pop9
+; SIMD128-NEXT:    i16x8.extract_lane_u $push11=, $0, 6
+; SIMD128-NEXT:    i32.or $push12=, $pop10, $pop11
+; SIMD128-NEXT:    i16x8.extract_lane_u $push13=, $0, 7
+; SIMD128-NEXT:    i32.or $push14=, $pop12, $pop13
+; SIMD128-NEXT:    return $pop14
   %res = tail call i16 @llvm.vector.reduce.or.v8i16(<8 x i16> %arg)
   ret i16 %res
 }

What is the best way forward here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants