Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llvm_shutdown called by lld::exitLld(1) is racy #66974

Open
MaskRay opened this issue Sep 21, 2023 · 5 comments
Open

llvm_shutdown called by lld::exitLld(1) is racy #66974

MaskRay opened this issue Sep 21, 2023 · 5 comments

Comments

@MaskRay
Copy link
Member

MaskRay commented Sep 21, 2023

lld::fatal calls exitLld(1). lld::error when the error number reaches the limit calls exitLld(1) as well.

exitLld
  llvm_shutdown();
  llvm::sys::Process::Exit(val, /*NoCleanup=*/true);

If a worker thread created by llvm/lib/Support/Parallel.cpp calls exitLld(1), llvm_shutdown will destroy
the ManagedStatic instance in Executor::getDefaultExecutor.
If the main thread is blocked on L.sync() in TaskGroup::~TaskGroup, it is possible that the main thread finishes first and somehow exits with code 0 (not exactly clear how this happened given that errorCount() is non-zero), before the work thread calls _Exit(1).
When this happens, the exit code of LLD will be 0 instead of the expected 1 (therefore RUN: not ld.lld --eh-frame-hdr %t -o /dev/null 2>&1 fails)

To trigger test failures with the race condition, we need to ensure that a worker thread calls exit/fatal and the concurrency is at least 2. Not many tests satisfy the condition. I have measured flakiness for ELF/invalid-eh-frame5.s and ELF/invalid-eh-frame.s.
Sometimes the tests may exhibit 6 failures in 1000 runs, sometimes only 5 in 5000 runs.

If I commented out llvm_shutdown or add an _Exit(val) above llvm_shutdown, the flakiness will go away.

@nga888 https://reviews.llvm.org/D70447

@llvmbot
Copy link
Collaborator

llvmbot commented Sep 21, 2023

@llvm/issue-subscribers-lld-elf

`lld::fatal` calls `exitLld(1)`. `lld::error` when the error number reaches the limit calls `exitLld(1)` as well. ``` exitLld llvm_shutdown(); llvm::sys::Process::Exit(val, /*NoCleanup=*/true); ```

If a worker thread created by llvm/lib/Support/Parallel.cpp calls exitLld(1), llvm_shutdown will destroy
the ManagedStatic instance in Executor::getDefaultExecutor.
If the main thread is blocked on L.sync() in TaskGroup::~TaskGroup, it is possible that the main thread finishes first and exits with code 0, before the work thread calls _Exit(1).
WHen this happens, the exit code of LLD will be 0 instead of the expected 1.

I have measured flakiness for ELF/invalid-eh-frame5.s and ELF/invalid-eh-frame.s.
Sometimes the tests may exhibit 6 failures in 1000 runs, sometimes only 5 in 5000 runs.

@nga888

@nga888
Copy link
Collaborator

nga888 commented Oct 20, 2023

@MaskRay, sorry I am aware of this issue but I'm very busy with other work and do not have time to investigate in depth right now. TBH, the way LLD can exit via a worker thread has always been a concern but the patch you reference above that fixed issues that could cause crashing and deadlock on Windows was a far more prevalent issue at that time. Also the "exit" handling has become a bit more complex since that patch was committed.

The scenario you describe does sound strange because of the mismatch of the exit code which would also imply that the error state is not consistent between the threads. In any case, avoiding a race to "exit" would definitely be preferable.

@rnk
Copy link
Collaborator

rnk commented Oct 20, 2023

If we (or whoever takes on this flakiness issue) could find another way to sync the worker threads to prevent multiple attempts to exit concurrently, I'd be in favor of that. I worry that llvm_shutdown does too much stuff at exit. Our intention is to exit quickly and only run a set of enumerated, essential shutdown handlers, like flushing outs() or blocking to prevent concurrent attempts to exit. Relying on the LLVM ManagedStatic destructor registration scheme doesn't really align with that.

There's still an open question as to why the main thread tries to exit with a 0 exit status, we don't have an answer to that question.

@nga888
Copy link
Collaborator

nga888 commented Dec 13, 2023

Hi @MaskRay, I've at last found some time to have a look at this issue. So far, I have been unable to reproduce the issue with these tests on Ubuntu 22.04.3 with LLVM built from a26aa79, using both LLVM 17.0.6 and GCC 12.3.0. Do you have details of the build configuration that you were using when you saw these issues?

Looking at the current code, I don't think there should be a "race to exit" for these particular tests, although the possibility of a "race to exit" in LLD certainly does exist...

@MaskRay
Copy link
Member Author

MaskRay commented Jan 7, 2024

Sorry for my slow response.

I have replicated # RUN: not ld.lld --eh-frame-hdr %t -o /dev/null 2>&1 | FileCheck %s 100 times in lld/test/ELF/invalid-eh-frame5.s and invoked while /tmp/Rel/bin/llvm-lit -q invalid-eh-frame5.s; do done. I haven't reproduced a failure after a long time. Perhaps it's an issue specific to our internal lit runner (written in Go @sam-mccall).

Looking at the current code, I don't think there should be a "race to exit" for these particular tests, although the possibility of a "race to exit" in LLD certainly does exist...

I cannot find any either...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants