Description
@david-arm @ppenzin @hiraditya @mgudim Apologies if I missed anyone or tagged the wrong people. Opening this issue as a continuation of the discussion from the last vectorizer meeting, since more information was requested.
We don't have good data on how often this pattern occurs in other benchmarks, but if that is of interest I can try to gather it. For x264, I only have instruction count data, but based on that I would guess this could give a 0.5-2% improvement on targets with strided loads. Since we aren't able to specialize all calls to the pixel_avg loop, not every instance would benefit from improved vectorization.
The case brought up in the meeting is the following loop from x264. The loop bounds aren't known at compile time, but they are exposed at link time for several of the call sites, so the bounds become constant after function specialization. Note: this involves a few local patches (we have begun the upstreaming process, see #163891), so it is not yet reproducible upstream.
void pixel_avg(uint8_t * restrict dst,  int i_dst_stride,
               uint8_t * restrict src1, int i_src1_stride,
               uint8_t * restrict src2, int i_src2_stride,
               int i_width, int i_height) {
  for (int y = 0; y < i_height; y++) {
    for (int x = 0; x < i_width; x++) {
      dst[x] = (src1[x] + src2[x] + 1) >> 1;
    }
    dst  += i_dst_stride;
    src1 += i_src1_stride;
    src2 += i_src2_stride;
  }
}
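
To make the link-time picture concrete, a hypothetical caller after specialization might look like the following. The 16x8 block size and the helper's name are assumptions for illustration, not values quoted in this issue; the point is only that i_width and i_height become compile-time constants in the specialized clone.

/* Hypothetical post-specialization caller: the 16x8 block size and the
   name avg_16x8 are assumed for illustration only. The constant
   i_width/i_height propagate into the specialized clone of pixel_avg,
   so both trip counts become known there. */
static void avg_16x8(uint8_t *dst, int dst_stride,
                     uint8_t *src1, int src1_stride,
                     uint8_t *src2, int src2_stride) {
  pixel_avg(dst, dst_stride, src1, src1_stride, src2, src2_stride,
            /*i_width=*/16, /*i_height=*/8);
}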
At compile time the loop bounds are not constant, so the loops aren't unrolled; instead the LoopVectorizer vectorizes the inner loop with scalable vectors (VP intrinsics), producing the following IR:
87:                                               ; preds = %87, %81
  %88 = phi i64 [ %103, %87 ], [ 0, %81 ]
  %89 = phi i64 [ %104, %87 ], [ %49, %81 ]
  %90 = tail call i32 @llvm.experimental.get.vector.length.i64(i64 %89, i32 16, i1 true)
  %91 = getelementptr inbounds nuw i8, ptr %84, i64 %88
  %92 = tail call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr align 1 %91, <vscale x 16 x i1> splat (i1 true), i32 %90), !alias.scope !69
  %93 = zext <vscale x 16 x i8> %92 to <vscale x 16 x i16>
  %94 = getelementptr inbounds nuw i8, ptr %85, i64 %88
  %95 = tail call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr align 1 %94, <vscale x 16 x i1> splat (i1 true), i32 %90), !alias.scope !72
  %96 = zext <vscale x 16 x i8> %95 to <vscale x 16 x i16>
  %97 = add nuw nsw <vscale x 16 x i16> %93, splat (i16 1)
  %98 = add nuw nsw <vscale x 16 x i16> %97, %96
  %99 = lshr <vscale x 16 x i16> %98, splat (i16 1)
  %100 = trunc nuw <vscale x 16 x i16> %99 to <vscale x 16 x i8>
  %101 = getelementptr inbounds nuw i8, ptr %83, i64 %88
  tail call void @llvm.vp.store.nxv16i8.p0(<vscale x 16 x i8> %100, ptr align 1 %101, <vscale x 16 x i1> splat (i1 true), i32 %90), !alias.scope !74, !noalias !76
  %102 = zext i32 %90 to i64
  %103 = add nuw i64 %88, %102
  %104 = sub nuw i64 %89, %102
  %105 = icmp eq i64 %104, 0
  br i1 %105, label %178, label %87, !llvm.loop !77
During LTO, function specialization exposes constant values for i_width and i_height, but the inner loop was already vectorized at compile time, so it isn't unrolled. As a result, SLP is unable to turn the accesses into strided loads/stores.
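
For illustration, here is a sketch of the scalar shape specialization could expose if the loop had not already been vectorized. The 4x8 block size and the clone's name are assumptions, not values from the issue. With both trip counts constant, both loops can be fully unrolled, and the remaining per-row groups of accesses differ only by the runtime strides, which is exactly the pattern strided vector loads/stores can express and which SLP cannot recover from the already loop-vectorized form.

#include <stdint.h>

/* Sketch only: assumes the specialized constants are i_width == 4 and
   i_height == 8; both are illustrative, not taken from the issue. */
static void pixel_avg_w4_h8(uint8_t * restrict dst,  int i_dst_stride,
                            uint8_t * restrict src1, int i_src1_stride,
                            uint8_t * restrict src2, int i_src2_stride) {
  for (int y = 0; y < 8; y++) {
    /* Inner loop fully unrolled: four contiguous byte accesses per row. */
    dst[0] = (src1[0] + src2[0] + 1) >> 1;
    dst[1] = (src1[1] + src2[1] + 1) >> 1;
    dst[2] = (src1[2] + src2[2] + 1) >> 1;
    dst[3] = (src1[3] + src2[3] + 1) >> 1;
    dst  += i_dst_stride;
    src1 += i_src1_stride;
    src2 += i_src2_stride;
  }
  /* Unrolling the outer loop as well leaves 8 groups of 4 loads per
     source whose group bases differ by the (runtime) stride, i.e. an
     access pattern a strided vector load/store can cover directly. */
}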