[Flang] performance regression on x86 after D126885 #57322

@vzakhari

After the cost model change in https://reviews.llvm.org/D126885, SLP stopped producing vector stores from separate scalar stores. This causes about a 9% regression in CPU2000/168.wupwise@skylake with Flang.

Small reproducer:

subroutine sub()
  complex*16 I
  PARAMETER (I=(0.0D+0,1.0D+0))
  call bar(I)
end subroutine sub

test.zip - this archive contains test.fir and test.ll produced by Flang.

Initialization of I is represented as:

  %1 = alloca { double, double }, i64 1, align 8, !dbg !7
  store { double, double } { double 0.000000e+00, double 1.000000e+00 }, ptr %1, align 8, !dbg !7

Command to reproduce the optimization pipeline:

clang++ -O3 -march=skylake -c -o test.o test.ll

Both before and after D126885, SROAPass splits the structured store into two scalar stores:

*** IR Dump After SROAPass on sub_ ***
define void @sub_() !dbg !3 {
  %1 = alloca { double, double }, i64 1, align 8, !dbg !7
  %.fca.0.gep = getelementptr inbounds { double, double }, ptr %1, i32 0, i32 0, !dbg !7
  store double 0.000000e+00, ptr %.fca.0.gep, align 8, !dbg !7
  %.fca.1.gep = getelementptr inbounds { double, double }, ptr %1, i32 0, i32 1, !dbg !7
  store double 1.000000e+00, ptr %.fca.1.gep, align 8, !dbg !7

These two stores were combined into a vector store before D126885, but now they are not:
Before:

*** IR Dump After SLPVectorizerPass on sub_ ***
define void @sub_() local_unnamed_addr !dbg !3 {
  %1 = alloca { double, double }, align 8, !dbg !7
  store <2 x double> <double 0.000000e+00, double 1.000000e+00>, ptr %1, align 8, !dbg !7
  call void @bar_(ptr nonnull %1), !dbg !7
  ret void, !dbg !9
}

After:

*** IR Dump After SLPVectorizerPass on sub_ ***
define void @sub_() local_unnamed_addr !dbg !3 {
  %1 = alloca { double, double }, align 8, !dbg !7
  store double 0.000000e+00, ptr %1, align 8, !dbg !7
  %.fca.1.gep = getelementptr inbounds { double, double }, ptr %1, i64 0, i32 1, !dbg !7
  store double 1.000000e+00, ptr %.fca.1.gep, align 8, !dbg !7
  call void @bar_(ptr nonnull %1), !dbg !7
  ret void, !dbg !9
}

In 168.wupwise, multiple complex arguments are passed to the dcabs1 routine, which computes the sum of the absolute values of the real and imaginary components:

define double @dcabs1_(ptr %0) !dbg !3 {
  %2 = alloca double, i64 1, align 8, !dbg !7
  %3 = getelementptr [2 x double], ptr %0, i32 0, i64 0, !dbg !9
  %4 = load double, ptr %3, align 8, !dbg !9
  %5 = call double @llvm.fabs.f64(double %4), !dbg !9
  %6 = getelementptr [2 x double], ptr %0, i32 0, i64 1, !dbg !9
  %7 = load double, ptr %6, align 8, !dbg !9
  %8 = call double @llvm.fabs.f64(double %7), !dbg !9
  %9 = fadd double %5, %8, !dbg !9

This ends up being SLP-vectorized into:

*** IR Dump After SLPVectorizerPass on dcabs1_ ***
; Function Attrs: argmemonly mustprogress nofree nosync nounwind readonly willreturn
define double @dcabs1_(ptr nocapture readonly %0) local_unnamed_addr #0 !dbg !3 {
  %2 = load <2 x double>, ptr %0, align 8, !dbg !7
  %3 = call <2 x double> @llvm.fabs.v2f64(<2 x double> %2), !dbg !7
  %4 = extractelement <2 x double> %3, i32 0, !dbg !7
  %5 = extractelement <2 x double> %3, i32 1, !dbg !7
  %6 = fadd double %4, %5, !dbg !7

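So we have two separate 8-byte stores followed by a 16-byte load that covers both of them. This is a store-forwarding issue on some CPUs, because the wide load cannot be forwarded from either of the narrower stores.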
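For readers less familiar with the Fortran side, here is a rough C++ analogue of the pattern described above. It is only a sketch: `Complex16`, `dcabs1` and `caller` are hypothetical stand-ins for the Flang-generated code, and whether a given compiler emits two scalar stores plus one wide load for it depends on the optimization pipeline.

```cpp
#include <cmath>

// Hypothetical stand-in for Fortran's complex*16: two adjacent doubles.
struct Complex16 {
  double re;
  double im;
};

// Mirrors dcabs1_ from the issue: |re| + |im|.
// After SLP vectorization the two scalar loads may become one 16-byte load.
double dcabs1(const Complex16 *z) {
  return std::fabs(z->re) + std::fabs(z->im);
}

// Mirrors the caller side: the two fields are written separately.
// If these stay as two 8-byte stores, the 16-byte load in dcabs1
// reads data that was produced by two narrower stores.
double caller() {
  Complex16 i;
  i.re = 0.0;
  i.im = 1.0;
  return dcabs1(&i);
}
```

Calling `caller()` returns the same value regardless of how the stores are emitted; only the store/load widths, and therefore the forwarding behaviour, differ.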

Given that SLP may vectorize the "use" part of the memory into a wider load, does it make sense to account for this in the cost model when computing the cost of replacing two scalar stores with one vector store? For example, it could add an assumed store-forwarding cost when the address escapes or is already read by a vector load.
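To make the suggestion concrete, here is a minimal standalone sketch of such an adjustment. It does not use or reflect the real SLPVectorizer code: `StoreGroupInfo`, `shouldVectorizeStores` and the penalty value are all hypothetical, and the costs are plain integers rather than LLVM's cost types.

```cpp
// Hypothetical, simplified cost comparison; none of these names are LLVM APIs.
struct StoreGroupInfo {
  int scalarStoreCost;   // total cost of keeping the scalar stores
  int vectorStoreCost;   // cost of the single vector store, including building the vector
  bool addressEscapes;   // the stored-to pointer is passed to a call or otherwise escapes
  bool hasVectorLoadUse; // a known wider load already reads the same memory
};

// Assumed penalty for a wide load that cannot be forwarded from narrower
// stores. The value 2 is arbitrary and chosen only for illustration.
constexpr int kAssumedStoreForwardingPenalty = 2;

// Returns true if replacing the scalar stores with one vector store is
// considered profitable under this sketch.
bool shouldVectorizeStores(const StoreGroupInfo &info) {
  int scalarCost = info.scalarStoreCost;
  // Keeping scalar stores risks a failed store forward if a wider load
  // may read the same bytes, so charge the scalar side for that risk.
  if (info.addressEscapes || info.hasVectorLoadUse)
    scalarCost += kAssumedStoreForwardingPenalty;
  return info.vectorStoreCost < scalarCost;
}
```

With equal scalar and vector costs, this sketch keeps the scalar stores for a purely local address and vectorizes once the address escapes, which is the behaviour the reproducer would need.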

I suppose Flang could also do something here, e.g. lower fir.complex stores/loads into vector stores/loads rather than structured ones.
