Description
@david-arm @ppenzin @hiraditya @mgudim Apologies if I missed anyone or tagged the wrong people. Opening this issue as a continuation of the discussion from the last vectorizer meeting, since more information was requested.
We don't have good data on how often this pattern occurs in other benchmarks, but if that is of interest I can try to gather it. For x264, I only have instruction count data, but based on that I would guess this could give a 0.5-2% improvement on targets with strided loads. Since we aren't able to specialize all calls to the pixel_avg loop, not every instance would benefit from improved vectorization.
The case brought up in the meeting is the following loop from x264. The loop bounds aren't known at compile time, but they are exposed at link time for several of the call sites, so the bounds become constant after function specialization. Note: this involves a few local patches (we have begun the upstreaming process, see #163891), so it is not yet reproducible upstream.
void pixel_avg(uint8_t * restrict dst,  int i_dst_stride,
               uint8_t * restrict src1, int i_src1_stride,
               uint8_t * restrict src2, int i_src2_stride,
               int i_width, int i_height) {
  for (int y = 0; y < i_height; y++) {
    for (int x = 0; x < i_width; x++) {
      dst[x] = (src1[x] + src2[x] + 1) >> 1;
    }
    dst  += i_dst_stride;
    src1 += i_src1_stride;
    src2 += i_src2_stride;
  }
}
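
To make the link-time picture concrete, a hypothetical caller after specialization might look like the following. The 16x8 block size and the helper's name are assumptions for illustration, not values quoted in this issue; the point is only that i_width and i_height become compile-time constants in the specialized clone.

/* Hypothetical post-specialization caller: the 16x8 block size and the
   name avg_16x8 are assumed for illustration only. The constant
   i_width/i_height propagate into the specialized clone of pixel_avg,
   so both trip counts become known there. */
static void avg_16x8(uint8_t *dst, int dst_stride,
                     uint8_t *src1, int src1_stride,
                     uint8_t *src2, int src2_stride) {
  pixel_avg(dst, dst_stride, src1, src1_stride, src2, src2_stride,
            /*i_width=*/16, /*i_height=*/8);
}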
At compile time the loop bounds are not constant, so the loops aren't unrolled; instead the LoopVectorizer vectorizes the inner loop with scalable vectors (VP intrinsics), producing the following IR:
87:                                               ; preds = %87, %81
  %88 = phi i64 [ %103, %87 ], [ 0, %81 ]
  %89 = phi i64 [ %104, %87 ], [ %49, %81 ]
  %90 = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %89, i32 16, i1 true)
  %91 = getelementptr inbounds nuw i8, ptr %84, i64 %88
  %92 = tail call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr align 1 %91, <vscale x 16 x i1> splat (i1 true), i32 %90), !alias.scope !69
  %93 = zext <vscale x 16 x i8> %92 to <vscale x 16 x i16>
  %94 = getelementptr inbounds nuw i8, ptr %85, i64 %88
  %95 = tail call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr align 1 %94, <vscale x 16 x i1> splat (i1 true), i32 %90), !alias.scope !72
  %96 = zext <vscale x 16 x i8> %95 to <vscale x 16 x i16>
  %97 = add nuw nsw <vscale x 16 x i16> %93, splat (i16 1)
  %98 = add nuw nsw <vscale x 16 x i16> %97, %96
  %99 = lshr <vscale x 16 x i16> %98, splat (i16 1)
  %100 = trunc nuw <vscale x 16 x i16> %99 to <vscale x 16 x i8>
  %101 = getelementptr inbounds nuw i8, ptr %83, i64 %88
  tail call void @llvm.vp.store.nxv16i8.p0(<vscale x 16 x i8> %100, ptr align 1 %101, <vscale x 16 x i1> splat (i1 true), i32 %90), !alias.scope !74, !noalias !76
  %102 = zext i32 %90 to i64
  %103 = add nuw i64 %88, %102
  %104 = sub nuw i64 %89, %102
  %105 = icmp eq i64 %104, 0
  br i1 %105, label %178, label %87, !llvm.loop !77
During LTO, function specialization exposes constant values for i_width and i_height, but the inner loop was already vectorized at compile time, so it isn't unrolled. As a result, SLP is unable to turn the accesses into strided loads/stores.
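
For illustration, here is a sketch of the scalar shape specialization could expose if the loop had not already been vectorized. The 4x8 block size and the clone's name are assumptions, not values from the issue. With both trip counts constant, both loops can be fully unrolled, and the remaining per-row groups of accesses differ only by the runtime strides, which is exactly the pattern strided vector loads/stores can express and which SLP cannot recover from the already loop-vectorized form.

#include <stdint.h>

/* Sketch only: assumes the specialized constants are i_width == 4 and
   i_height == 8; both are illustrative, not taken from the issue. */
static void pixel_avg_w4_h8(uint8_t * restrict dst,  int i_dst_stride,
                            uint8_t * restrict src1, int i_src1_stride,
                            uint8_t * restrict src2, int i_src2_stride) {
  for (int y = 0; y < 8; y++) {
    /* Inner loop fully unrolled: four contiguous byte accesses per row. */
    dst[0] = (src1[0] + src2[0] + 1) >> 1;
    dst[1] = (src1[1] + src2[1] + 1) >> 1;
    dst[2] = (src1[2] + src2[2] + 1) >> 1;
    dst[3] = (src1[3] + src2[3] + 1) >> 1;
    dst  += i_dst_stride;
    src1 += i_src1_stride;
    src2 += i_src2_stride;
  }
  /* Unrolling the outer loop as well leaves 8 groups of 4 loads per
     source whose group bases differ by the (runtime) stride, i.e. an
     access pattern a strided vector load/store can cover directly. */
}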