[Flang] performance regression on x86 after D126885

After the cost model change in https://reviews.llvm.org/D126885, SLP stopped producing vector stores from separate scalar stores. This causes about a 9% regression in CPU2000/168.wupwise@skylake with Flang.

Small reproducer:
```f90
subroutine sub()
  complex*16 I
  PARAMETER (I=(0.0D+0,1.0D+0))
  call bar(I)
end subroutine sub
```

[test.zip](https://github.com/llvm/llvm-project/files/9410301/test.zip) - this archive contains test.fir and test.ll produced by Flang.

Initialization of `I` is represented as:
```llvm
  %1 = alloca { double, double }, i64 1, align 8, !dbg !7
  store { double, double } { double 0.000000e+00, double 1.000000e+00 }, ptr %1, align 8, !dbg !7
```

Command to reproduce the optimization pipeline:
```shell
clang++ -O3 -march=skylake -c -o test.o test.ll
```

Both before and after D126885, `SROAPass` splits the structured store into two scalar stores:
```llvm
*** IR Dump After SROAPass on sub_ ***
define void @sub_() !dbg !3 {
  %1 = alloca { double, double }, i64 1, align 8, !dbg !7
  %.fca.0.gep = getelementptr inbounds { double, double }, ptr %1, i32 0, i32 0, !dbg !7
  store double 0.000000e+00, ptr %.fca.0.gep, align 8, !dbg !7
  %.fca.1.gep = getelementptr inbounds { double, double }, ptr %1, i32 0, i32 1, !dbg !7
  store double 1.000000e+00, ptr %.fca.1.gep, align 8, !dbg !7
```

Before D126885 these two stores were combined into a vector store; now they are left as scalar stores:

**Before:**

```llvm
*** IR Dump After SLPVectorizerPass on sub_ ***
define void @sub_() local_unnamed_addr !dbg !3 {
  %1 = alloca { double, double }, align 8, !dbg !7
  store <2 x double> <double 0.000000e+00, double 1.000000e+00>, ptr %1, align 8, !dbg !7
  call void @bar_(ptr nonnull %1), !dbg !7
  ret void, !dbg !9
}
```

**After:**

```llvm
*** IR Dump After SLPVectorizerPass on sub_ ***
define void @sub_() local_unnamed_addr !dbg !3 {
  %1 = alloca { double, double }, align 8, !dbg !7
  store double 0.000000e+00, ptr %1, align 8, !dbg !7
  %.fca.1.gep = getelementptr inbounds { double, double }, ptr %1, i64 0, i32 1, !dbg !7
  store double 1.000000e+00, ptr %.fca.1.gep, align 8, !dbg !7
  call void @bar_(ptr nonnull %1), !dbg !7
  ret void, !dbg !9
}
```

In `168.wupwise`, multiple complex arguments are passed to the `dcabs1` routine, which computes `abs` of the real and imaginary components:
```llvm
define double @dcabs1_(ptr %0) !dbg !3 {
  %2 = alloca double, i64 1, align 8, !dbg !7
  %3 = getelementptr [2 x double], ptr %0, i32 0, i64 0, !dbg !9
  %4 = load double, ptr %3, align 8, !dbg !9
  %5 = call double @llvm.fabs.f64(double %4), !dbg !9
  %6 = getelementptr [2 x double], ptr %0, i32 0, i64 1, !dbg !9
  %7 = load double, ptr %6, align 8, !dbg !9
  %8 = call double @llvm.fabs.f64(double %7), !dbg !9
  %9 = fadd double %5, %8, !dbg !9
```
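For readers less used to LLVM IR, here is a rough C++ rendering of what the excerpt above computes. It is only an illustrative sketch: the name `dcabs1` and the use of `std::complex<double>` are assumptions, not what Flang emits.

```cpp
#include <cmath>
#include <complex>

// Illustrative C++ equivalent of the IR excerpt (not Flang output).
// The real part lives at byte offset 0 and the imaginary part at offset 8,
// which corresponds to the two scalar loads in the IR.
double dcabs1(const std::complex<double>& z) {
    return std::fabs(z.real()) + std::fabs(z.imag());
}
```

The two component reads are adjacent in memory, which is what makes them a candidate for a single `<2 x double>` load.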

SLP then vectorizes this into:
```llvm
*** IR Dump After SLPVectorizerPass on dcabs1_ ***
; Function Attrs: argmemonly mustprogress nofree nosync nounwind readonly willreturn
define double @dcabs1_(ptr nocapture readonly %0) local_unnamed_addr #0 !dbg !3 {
  %2 = load <2 x double>, ptr %0, align 8, !dbg !7
  %3 = call <2 x double> @llvm.fabs.v2f64(<2 x double> %2), !dbg !7
  %4 = extractelement <2 x double> %3, i32 0, !dbg !7
  %5 = extractelement <2 x double> %3, i32 1, !dbg !7
  %6 = fadd double %4, %5, !dbg !7
```

So we have two separate 8-byte stores followed by a 16-byte load. Neither store covers the whole load, so store-to-load forwarding fails and the load stalls on some CPUs.
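The condition can be modelled with a small C++ sketch. It assumes a simplified rule, namely that a load is forwarded only when a single earlier store covers every byte it reads; real hardware rules vary by microarchitecture. The names `Access` and `canForward` are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Byte range [offset, offset + size) relative to a common base pointer.
struct Access {
    std::uint64_t offset;
    std::uint64_t size;
};

// Simplified assumption: forwarding succeeds only if one earlier store
// fully contains the bytes read by the load.
bool canForward(const std::vector<Access>& stores, const Access& load) {
    for (const Access& s : stores) {
        if (s.offset <= load.offset &&
            load.offset + load.size <= s.offset + s.size)
            return true;
    }
    return false;
}
```

Under this model, stores at offsets 0 and 8 of 8 bytes each cannot feed a 16-byte load at offset 0, while a single 16-byte store can.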

Given that SLP may vectorize the "use" side of the memory into a wider load, does it make sense to account for this in the cost model code that computes the cost of replacing two scalar stores with one vector store? For example, add an assumed store-forwarding cost to the scalar alternative when the address escapes or is already used by a vector load.
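To make the suggestion concrete, here is a minimal C++ sketch of such an adjustment. Everything in it is hypothetical: `StoreGroupInfo`, `shouldVectorizeStores` and the penalty value are made up for illustration and are not LLVM APIs or actual SLP cost model code.

```cpp
// Hypothetical description of a group of adjacent scalar stores.
struct StoreGroupInfo {
    int scalarCost;         // summed cost of the scalar stores
    int vectorCost;         // cost of the single vector store
    bool addressEscapes;    // pointer is passed to a call or otherwise escapes
    bool hasVectorLoadUser; // a known load reads the whole range at once
};

// Assumed, arbitrary penalty for a wide load that cannot be forwarded
// from narrower stores.
constexpr int kAssumedForwardingPenalty = 2;

// Returns true when the vector store is considered cheaper.
bool shouldVectorizeStores(const StoreGroupInfo& g) {
    int scalarCost = g.scalarCost;
    // Charge the scalar form if a wide load may follow the stores.
    if (g.addressEscapes || g.hasVectorLoadUser)
        scalarCost += kAssumedForwardingPenalty;
    return g.vectorCost < scalarCost;
}
```

In the reproducer the alloca is passed to `bar_`, so `addressEscapes` would be set and a tie between scalar and vector cost would tip towards the vector store.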

I suppose Flang could also do something here, e.g. lower `fir.complex` stores/loads into vector stores/loads rather than structured ones.
