Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mlir] Lit tests fail after 2d2da25 #55992

Closed
phoebewang opened this issue Jun 12, 2022 · 11 comments
Closed

[mlir] Lit tests fail after 2d2da25 #55992

phoebewang opened this issue Jun 12, 2022 · 11 comments

Comments

@phoebewang
Copy link
Contributor

Hello, I just landed a patch to support FP16 emulation on X86: 2d2da25. Then I received a Buildbot notifaction from https://lab.llvm.org/buildbot#builders/61/builds/27616, where reported 2 failed tests:

Failed Tests (2):
  MLIR :: Integration/Dialect/SparseTensor/CPU/dense_output_f16.mlir
  MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_sum_f16.mlir

I don't have any experiences on mlir and my local run show "Unsupported" on both tests.
Could anyone tell me how can I reproduce it locally, or help to generate a LLVM IR for me? Thanks!

@llvmbot
Copy link

llvmbot commented Jun 12, 2022

@llvm/issue-subscribers-mlir-llvm

@phoebewang
Copy link
Contributor Author

Hello, with @joker-eph 's commends, I can reproduce the fail locally. I spent one day to investigate the problem, but I still didn't root cause it due to limited knowledge on mlir. Here are what I got for now:
Reproducer command:

mlir-opt mlir/test/Integration/Dialect/SparseTensor/CPU/dense_output_f16.mlir --sparse-compiler | mlir-cpu-runner   -e entry -entry-point-result=void -shared-libs=builds/lib/libmlir_c_runner_utils.so

Error message

.../mlir/lib/ExecutionEngine/SparseTensorUtils.cpp:1735: void *_mlir_ciface_addEltF16(void *, f16, StridedMemRefType<index_type, 1> *, StridedMemRefType<index_type, 1> *): Assertion `coo &&iref &&pref' failed.

Backtrace at the crash point:

    frame #0: 0x00007fffe9b9c93f libc.so.6`raise + 271
    frame #1: 0x00007fffe9b86c95 libc.so.6`abort + 295
    frame #2: 0x00007fffe9b86b69 libc.so.6`.annobin_errno.c.unlikely + 15
    frame #3: 0x00007fffe9b94df6 libc.so.6`__assert_fail + 70
  * frame #4: 0x00007fffcef06b5d libmlir_c_runner_utils.so`::_mlir_ciface_addEltF16(coo=0x000000000033a090, value=(bits = 35864), iref=0x00007fffffff8c40, pref=0x0000000000000000) at SparseTensorUtils.cpp:1735:1
    frame #5: 0x00007ffff7efd04c
    frame #6: 0x00007ffff7efe2bd
    frame #7: 0x00007ffff7eb5d67 libMLIRJitRunner.so.15git`compileAndExecuteVoidFunction(options=0x00007fffffff9a80, module=ModuleOp @ 0x00007fffffff95f0, entryPoint=(Data = "entry", Length = 5), config=CompileAndExecuteConfig @ 0x00007fffffff96c0)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig) at JitRunner.cpp:245:10
    frame #8: 0x00007ffff7eb48ae libMLIRJitRunner.so.15git`mlir::JitRunnerMain(argc=6, argv=0x00007fffffffa5f8, registry=0x00007fffffffa420, config=JitRunnerConfig @ 0x00007fffffffa3b0) at JitRunner.cpp:367:23
    frame #9: 0x0000000000204331 mlir-cpu-runner`main(argc=6, argv=0x00007fffffffa5f8) at mlir-cpu-runner.cpp:33:10
    frame #10: 0x00007fffe9b88813 libc.so.6`__libc_start_main + 243
    frame #11: 0x000000000020418e mlir-cpu-runner`_start + 46

I can find the problem is pref=0x0000000000000000, which causes the assertion fail. But I have problem to understand the parent frames #5 and #6 to do further investigation.

    0x7ffff7efe2b0: pushq  %rax
    0x7ffff7efe2b1: movabsq $0x7ffff7efdce0, %rax     ; imm = 0x7FFFF7EFDCE0
    0x7ffff7efe2bb: callq  *%rax
    0x7ffff7efe2bd: popq   %rax

I once traced from frame #7 to frame #6, then I get lost from the backtrace. I guess it's jitted code which doesn't preserve unwind information correctly.
Could anyone share the experience on debuging the jitted code? Any suggestions are appreciated, thanks!

Note, the patch has beed reverted, please apply 2d2da25 if you want to help with this. Thanks!

@joker-eph
Copy link
Collaborator

there shouldn’t be much that is MLIR specific here: MLIR will emit LLVM IT and JIT it before executing it.

@aartbik should be able to help to get the IR and maybe a reproducer that does not involve the JIT

@aartbik aartbik self-assigned this Jun 13, 2022
@aartbik
Copy link
Contributor

aartbik commented Jun 13, 2022

Does the patch change the ABI in any way?

We are trying to call

void *_mlir_ciface_addEltF16(void *, f16, ....

but perhaps the f16 has some "padding" applied, causing corruptions of all other parameters.
We observed a similar problem with complex on two f32. So rather than having one

void *_mlir_ciface_addEltC32(void *coo, c, complex32 c32, ...

we "fixed" this as

void *_mlir_ciface_addEltC32(void *coo, float r, float i,
                             StridedMemRefType<index_type, 1> *iref,
                             StridedMemRefType<index_type, 1> *pref) {
return _mlir_ciface_addEltC32ABI(coo, complex32(r, i), iref, pref);

See around

void *_mlir_ciface_addEltC32(void *coo, float r, float i,

@aartbik aartbik removed their assignment Jun 13, 2022
@phoebewang
Copy link
Contributor Author

Thanks @joker-eph @aartbik

Does the patch change the ABI in any way?

If you are asking the half type from LLVM IR level, yes, it does. See the difference in the test 2d2da25#diff-b377175c48a86b089bb448538c7849573d52ede1488dfd067f48da683c0110e7L16
Before this patch, half was passed by GPR registers, after it, it will be passed by FPR(XMM) registers.

but perhaps the f16 has some "padding" applied, causing corruptions of all other parameters.

Not padding but register class. IIUC, the problem is the jitted code passes the data in FPR while the ExecutionEngine expected the data is in GPR? If this is true, I think it's a problem of ExecutionEngine that has false ABI assumption on half type: https://github.com/llvm/llvm-project/blob/main/mlir/lib/ExecutionEngine/Float16bits.cpp#L33

Previously, Clang doesn't have ABI for half type and blocks arguments passing for it. The implementation above looks like a workaround to me. My patch is to enable the ABI and (will) allow pass half type by _Float16. So I think the ExecutionEngine may need improve to detect if _Float16 is supported from cmake and use _Float16 when possible.

In conclusion, I think I can land the patch and disable both tests temporaryly. I'll put comments to point to this issue, but I may not be able to fix it. WDYT?

@joker-eph
Copy link
Collaborator

It's not clear to me that we need to think in terms of register class here. I see it as an LLVM IR problem: we write our runtime in C/C++ and we are trying to interact with it when generating LLVM IR ourselves from a non-C source (MLIR).

The question to me is whether we can generate the right signatures for our runtime in LLVM IR that matches what clang will generate when compiling the runtime.
If the mapping is platform dependent, we better go through memory (like we do with memref descriptor).

@phoebewang
Copy link
Contributor Author

I understand the problem is the runtime uses a workaround that using uint16_t to represent FP16 type in C/C++ code, while MLIR is using half. So they are already not matching each other from the IR type's perspective. It just happened to work, and will break if half and i16 using different register class.

I think the right direction is to use _Float16 in C/C++ code to represent FP16, which is always mapping to half in LLVM IR. The problem is only a few targets has supported the type. So go through memory might be the feasible solution at present.

@aartbik
Copy link
Contributor

aartbik commented Jun 14, 2022

I am okay if you disable the two tests for now just so you can make progress. Please add this bug somewhere in the comment when you do so, so we can find a proper fix afterwards.

phoebewang added a commit that referenced this issue Jun 15, 2022
…e psABI"

Disabled 2 mlir tests due to the runtime doesn't support `_Float16`, see
the issue here #55992
@phoebewang
Copy link
Contributor Author

Thanks @aartbik ! Reland 6e02e27

@aartbik
Copy link
Contributor

aartbik commented Jun 17, 2022

Note that https://reviews.llvm.org/D128018 fixes the ABI issues hopefully once and for all (since we use memref's now for all non-trivially typed data). We should be able to re-enable the tests again.

aartbik added a commit that referenced this issue Jun 17, 2022
Sparse library ABI issues are fixed.

#55992

Reviewed By: bkramer

Differential Revision: https://reviews.llvm.org/D128086
@llvmbot
Copy link

llvmbot commented Jul 11, 2022

@llvm/issue-subscribers-mlir

aartbik added a commit that referenced this issue Aug 10, 2022
Supposedly our ABI issues were fixed, per issue:
#55992

However, with a recent changes to bf16, these tests
fail again; not sure why yet:
https://reviews.llvm.org/D130832
https://lab.llvm.org/buildbot/#/builders/61/builds/30600

So we disable the tests for now. Issue is tracked in:
#57042

Differential Revision: https://reviews.llvm.org/D131621
llvmbot pushed a commit to llvm/llvm-project-release-prs that referenced this issue Aug 11, 2022
Supposedly our ABI issues were fixed, per issue:
llvm/llvm-project#55992

However, with a recent changes to bf16, these tests
fail again; not sure why yet:
https://reviews.llvm.org/D130832
https://lab.llvm.org/buildbot/#/builders/61/builds/30600

So we disable the tests for now. Issue is tracked in:
llvm/llvm-project#57042

Differential Revision: https://reviews.llvm.org/D131621

(cherry picked from commit 6b74591)
llvmbot pushed a commit to llvm/llvm-project-release-prs that referenced this issue Aug 11, 2022
Supposedly our ABI issues were fixed, per issue:
llvm/llvm-project#55992

However, with a recent changes to bf16, these tests
fail again; not sure why yet:
https://reviews.llvm.org/D130832
https://lab.llvm.org/buildbot/#/builders/61/builds/30600

So we disable the tests for now. Issue is tracked in:
llvm/llvm-project#57042

Differential Revision: https://reviews.llvm.org/D131621

(cherry picked from commit 6b74591)
tru pushed a commit to llvm/llvm-project-release-prs that referenced this issue Aug 12, 2022
Supposedly our ABI issues were fixed, per issue:
llvm/llvm-project#55992

However, with a recent changes to bf16, these tests
fail again; not sure why yet:
https://reviews.llvm.org/D130832
https://lab.llvm.org/buildbot/#/builders/61/builds/30600

So we disable the tests for now. Issue is tracked in:
llvm/llvm-project#57042

Differential Revision: https://reviews.llvm.org/D131621

(cherry picked from commit 6b74591)
mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
…e psABI"

Disabled 2 mlir tests due to the runtime doesn't support `_Float16`, see
the issue here llvm/llvm-project#55992
mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
Supposedly our ABI issues were fixed, per issue:
llvm/llvm-project#55992

However, with a recent changes to bf16, these tests
fail again; not sure why yet:
https://reviews.llvm.org/D130832
https://lab.llvm.org/buildbot/#/builders/61/builds/30600

So we disable the tests for now. Issue is tracked in:
llvm/llvm-project#57042

Differential Revision: https://reviews.llvm.org/D131621

(cherry picked from commit 6b7459115f7b1c43f81ad8dfb5da9d206d3b1e06)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants