It looks like the problem is in SROA. It isn't supposed to be creating crazy things like i1024 unless it has no better options. In this testcase, "regular" SROA -- splitting double[16] into 16 doubles -- seems to be a much better option.
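To make "regular" SROA concrete, here is a minimal source-level sketch of what splitting a local array into independent scalars amounts to. It uses a 2x2 product rather than the 4x4 testcase to stay short, and the names `mul2_buffer` and `mul2_scalars` are hypothetical; the real transformation happens on LLVM IR, not on C source.

```c
/* Buffer form: results pass through a local array, as in the testcase. */
static void mul2_buffer(double *out, const double a[2][2], const double b[2][2]) {
    double res[4];
    res[0] = a[0][0] * b[0][0] + a[0][1] * b[1][0];
    res[1] = a[0][0] * b[0][1] + a[0][1] * b[1][1];
    res[2] = a[1][0] * b[0][0] + a[1][1] * b[1][0];
    res[3] = a[1][0] * b[0][1] + a[1][1] * b[1][1];
    for (int n = 0; n < 4; ++n)
        out[n] = res[n];
}

/* Scalar-replaced form: each array slot becomes its own double, so every
   result can stay in a floating-point register until it is stored. */
static void mul2_scalars(double *out, const double a[2][2], const double b[2][2]) {
    double r0 = a[0][0] * b[0][0] + a[0][1] * b[1][0];
    double r1 = a[0][0] * b[0][1] + a[0][1] * b[1][1];
    double r2 = a[1][0] * b[0][0] + a[1][1] * b[1][0];
    double r3 = a[1][0] * b[0][1] + a[1][1] * b[1][1];
    out[0] = r0;
    out[1] = r1;
    out[2] = r2;
    out[3] = r3;
}
```

Both functions compute the same values; the second form simply never needs the aggregate, which is the outcome one would hope for from SROA here.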
This should be handled at the IR level. We shouldn't be making SelectionDAG more complicated to try to paper over this. SelectionDAG only sees things one basic block at a time, so it can't fix the general case.
The core issue here (SROA doing bad things to matmul) was fixed by the SROA rewrite.
Note that the rewrite in turn exposed some register allocation bugs that caused the benchmark to slow down in some rare cases. ;] Anyways, I think this bug is done.
llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 3, 2021
Extended Description
I added a test case to the nightly test suite: SingleSource/Benchmarks/Misc/matmul_f64_4x4.c
A function multiplies two matrices and puts the result in a local buffer before copying it out:
static void mul4(double *Out, const double A[4][4], const double B[4][4]) {
  double Res[16];
  Res[0] = ...
  ...
  Res[15] = ...
  for (n = 0; n < 16; ++n)
    Out[n] = Res[n];
}
SROA converts the double[16] buffer to an i1024:
%188 = fadd double %184, %187
%189 = bitcast double %188 to i64
%190 = zext i64 %189 to i1024
%191 = shl nuw nsw i1024 %190, 768
…
%ins = or i1024 %mask, %191
%222 = bitcast double* %c to i1024*
store i1024 %ins, i1024* %222, align 4
ret void
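As a rough illustration of what that IR is doing, the sketch below models the i1024 as sixteen 64-bit limbs in C. The `wide1024` type and both helper names are hypothetical, and the layout assumes a little-endian target where bit 0 of the wide integer lives in the first byte. Under that assumption, a shift of 768 bits lands in limb 768 / 64 = 12, which is the slot of `Res[12]`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LIMBS = 16 };  /* 16 x 64 bits = 1024 bits */

/* Hypothetical stand-in for an i1024 value: limb[0] holds bits 0..63. */
typedef struct { uint64_t limb[LIMBS]; } wide1024;

/* Mirrors the bitcast + zext + shl + or sequence: place the bit pattern of d
   at bit offset `shift` (assumed to be a multiple of 64 and below 1024). */
static void insert_double(wide1024 *w, double d, unsigned shift) {
    assert(shift % 64 == 0 && shift < 64 * LIMBS);
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* bitcast double -> i64 */
    /* Overwriting the limb is equivalent to masking out that 64-bit field
       and or-ing in the shifted value. */
    w->limb[shift / 64] = bits;
}

/* Mirrors the single wide store: copy all 128 bytes to the output buffer. */
static void store_wide(double *out, const wide1024 *w) {
    memcpy(out, w->limb, sizeof w->limb);
}
```

Calling `insert_double(&w, v, 768)` followed by `store_wide(out, &w)` leaves `v` in `out[12]`, so the whole construction is an indirect way of writing sixteen ordinary double stores.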
This completely confuses ARM codegen, which is unable to recover the original stores. Instead it copies the doubles to the GPRs and stores i32s:
We wanted a simple f64 store:
The problem seems to begin with LegalizeTypes. It breaks the i1024 store into i32 stores because i64 is not a legal ARM type.
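The difference between the two store shapes can be sketched in C as follows. This is only an analogy for the legalized code, not actual compiler output; the function names are hypothetical and the word order assumes a little-endian target.

```c
#include <stdint.h>
#include <string.h>

/* What was wanted: one 64-bit floating-point store. */
static void store_f64(double *dst, double v) {
    *dst = v;
}

/* Roughly what falls out when the value is treated as an integer and
   64-bit integers are not legal: the bits are split into two 32-bit
   halves that are stored separately. */
static void store_as_two_i32(double *dst, double v) {
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);    /* move the double into integer form */
    uint32_t half[2];
    half[0] = (uint32_t)bits;          /* low 32 bits */
    half[1] = (uint32_t)(bits >> 32);  /* high 32 bits */
    memcpy(dst, half, sizeof half);    /* low word first: little-endian assumed */
}
```

Both functions leave the same bytes in memory on a little-endian machine, but the second needs the value in integer registers first, which matches the copy to the GPRs described above.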