[llvm-exegesis] Kill processes that receive a signal #86069

boomanaiden154 · 2024-03-21T03:15:28Z

Before this patch, llvm-exegesis would leave processes lingering after they experienced signals like segmentation faults. They would end up in a signal-delivery-stop state under ptrace and never exit. This does not cause problems (or at least not many) in llvm-exegesis itself, as they are cleaned up after the main process exits, which usually happens quickly. However, in downstream use, when many blocks are executed within a single process (and many of them run into signals), these processes stay around and can easily exhaust the process limit on some systems.

This patch cleans them up by sending SIGKILL after information about the signal that was sent has been gathered.
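The behavior described above can be reproduced outside llvm-exegesis with a small Linux-only sketch. It is an illustration under assumptions, not the patch itself: the helper name `runFaultingChildAndReap` is hypothetical, and the child opts into tracing with `PTRACE_TRACEME` for brevity, whereas llvm-exegesis attaches to the child from the parent. The traced child raises SIGSEGV and parks in signal-delivery-stop; the parent reads the signal information, then sends SIGKILL and reaps the child so that nothing lingers.

```cpp
#include <csignal>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper, not part of llvm-exegesis. Forks a traced child that
// raises SIGSEGV, gathers the signal info, then kills and reaps the child.
// Returns the signal number reported by ptrace, or -1 on any failure.
int runFaultingChildAndReap() {
  pid_t Child = fork();
  if (Child == -1)
    return -1;

  if (Child == 0) {
    // Child: ask to be traced by the parent, then fault.
    if (ptrace(PTRACE_TRACEME, 0, nullptr, nullptr) == -1)
      _exit(1);
    raise(SIGSEGV);
    _exit(0); // Not expected to run: the parent kills the child first.
  }

  // Parent: under ptrace the child does not die from SIGSEGV. It stops in
  // signal-delivery-stop and stays there until the tracer acts.
  int Status = 0;
  if (waitpid(Child, &Status, 0) == -1 || !WIFSTOPPED(Status))
    return -1;

  // Gather information about the signal while the child is stopped.
  siginfo_t Info;
  if (ptrace(PTRACE_GETSIGINFO, Child, nullptr, &Info) == -1) {
    kill(Child, SIGKILL);
    waitpid(Child, nullptr, 0);
    return -1;
  }

  // SIGKILL is not held in signal-delivery-stop, so it terminates the child.
  if (kill(Child, SIGKILL) == -1)
    return -1;

  // Reap the child so that no zombie process is left behind.
  if (waitpid(Child, &Status, 0) == -1)
    return -1;
  if (!WIFSIGNALED(Status) || WTERMSIG(Status) != SIGKILL)
    return -1;

  return Info.si_signo;
}
```

Without the `kill` and the second `waitpid`, the child in this sketch would remain stopped until the parent exits, which is the accumulation problem the description refers to.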


boomanaiden154 · 2024-03-21T03:15:38Z

Fixes google/gematria#76.

llvmbot · 2024-03-21T03:16:00Z

@llvm/pr-subscribers-tools-llvm-exegesis

Author: Aiden Grossman (boomanaiden154)

Changes

Before this patch, llvm-exegesis would leave processes lingering after they experienced signals like segmentation faults. They would end up in a signal-delivery-stop state under ptrace and never exit. This does not cause problems (or at least not many) in llvm-exegesis itself, as they are cleaned up after the main process exits, which usually happens quickly. However, in downstream use, when many blocks are executed within a single process (and many of them run into signals), these processes stay around and can easily exhaust the process limit on some systems.

This patch cleans them up by sending SIGKILL after information about the signal that was sent has been gathered.

Full diff: https://github.com/llvm/llvm-project/pull/86069.diff

1 Files Affected:

(modified) llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp (+16-2)

diff --git a/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp b/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
index 5c9848f3c68885..f0452605eb24bf 100644
--- a/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
+++ b/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
@@ -342,7 +342,7 @@ class SubProcessFunctionExecutorImpl
       return make_error<Failure>("Failed to attach to the child process: " +
                                  Twine(strerror(errno)));
 
-    if (wait(NULL) == -1) {
+    if (waitpid(ParentOrChildPID, NULL, 0) == -1) {
       return make_error<Failure>(
           "Failed to wait for child process to stop after attaching: " +
           Twine(strerror(errno)));
@@ -361,7 +361,7 @@ class SubProcessFunctionExecutorImpl
       return SendError;
 
     int ChildStatus;
-    if (wait(&ChildStatus) == -1) {
+    if (waitpid(ParentOrChildPID, &ChildStatus, 0) == -1) {
       return make_error<Failure>(
           "Waiting for the child process to complete failed: " +
           Twine(strerror(errno)));
@@ -401,6 +401,20 @@ class SubProcessFunctionExecutorImpl
                                  Twine(strerror(errno)));
     }
 
+    // Send SIGKILL rather than SIGTERM as the child process has no SIGTERM
+    // handlers to run, and calling SIGTERM would mean that ptrace will force
+    // it to block in the signal-delivery-stop for the SIGSEGV/other signals,
+    // and upon exit.
+    if (kill(ParentOrChildPID, SIGKILL) == -1)
+      return make_error<Failure>("Failed to kill child benchmarking proces: " +
+                                 Twine(strerror(errno)));
+
+    // Wait for the process to exit so that there are no zombie processes left
+    // around.
+    if (waitpid(ParentOrChildPID, NULL, 0) == -1)
+      return make_error<Failure>("Failed to wait for process to die: " +
+                                 Twine(strerror(errno)));
+
     if (ChildSignalInfo.si_signo == SIGSEGV)
       return make_error<SnippetSegmentationFault>(
           reinterpret_cast<intptr_t>(ChildSignalInfo.si_addr));

legrosbuffle · 2024-03-21T12:02:45Z

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp

                                 Twine(strerror(errno)));

-    if (wait(NULL) == -1) {
+    if (waitpid(ParentOrChildPID, NULL, 0) == -1) {


At that point we are always in the parent process, so ParentOrChildPID holds the child's PID, but that's not immediately obvious (I had to scroll up). What about restructuring this function to have:

[[noreturn]] void runChildSubprocess(int ReadFD, int WriteFD) {
  // We are in the child process, close the write end of the pipe.
  close(WriteFD);
  // Unregister handlers, signal handling is now handled through ptrace in
  // the host process.
  sys::unregisterHandlers();
  prepareAndRunBenchmark(ReadFD, Key);
  llvm_unreachable("Child process didn't exit when expected.");
}

Error runParentSubprocess(pid_t PID, int ReadFD, int WriteFD) {
  const ExegesisTarget &ET = State.getExegesisTarget();
  ...
}

createSubProcessAndRunBenchmark() {
  ...
  if (ParentOrChildPID == -1) {
    ...
  }
  if (ParentOrChildPID == 0) {
    runChildSubprocess(PipeFiles[0], PipeFiles[1]);
    llvm_unreachable("Child process didn't exit when expected.");
  }
  return runParentSubprocess(ParentOrChildPID, PipeFiles[0], PipeFiles[1]);
}

That's a good point. The suggested refactoring would definitely make the code cleaner. I'm going to land this PR and then open up another PR with the suggested refactoring to try and keep the history cleanish. Thanks for the suggestion!

See #86232.
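The restructuring suggested in the review above can be sketched as a standalone program shape. This is a minimal illustration under assumptions, not the code from the follow-up PR: the names `runChild`, `runParent`, and `forkAndRun` are hypothetical, and the "benchmark" is replaced by the child echoing one byte from the pipe as its exit code. The point is only that, after the split, the parent-side function receives the child's PID as an explicit parameter, so there is no ambiguity about which side of the `fork` is running.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical child-side function: never returns to the caller.
[[noreturn]] void runChild(int ReadFD, int WriteFD) {
  // The child only reads, so close the write end of the pipe.
  close(WriteFD);
  unsigned char Value = 0;
  if (read(ReadFD, &Value, 1) != 1)
    _exit(255);
  // Stand-in for running the benchmark: report the byte as the exit code.
  _exit(Value);
}

// Hypothetical parent-side function: ChildPID is unambiguously the child.
int runParent(pid_t ChildPID, int ReadFD, int WriteFD, unsigned char Value) {
  // The parent only writes, so close the read end of the pipe.
  close(ReadFD);
  ssize_t Written = write(WriteFD, &Value, 1);
  close(WriteFD);

  // Wait for this specific child rather than for any child.
  int Status = 0;
  if (waitpid(ChildPID, &Status, 0) == -1)
    return -1;
  if (Written != 1 || !WIFEXITED(Status))
    return -1;
  return WEXITSTATUS(Status);
}

// Dispatcher: creates the pipe, forks, and hands off to one side or the other.
int forkAndRun(unsigned char Value) {
  int PipeFiles[2];
  if (pipe(PipeFiles) == -1)
    return -1;

  pid_t ParentOrChildPID = fork();
  if (ParentOrChildPID == -1) {
    close(PipeFiles[0]);
    close(PipeFiles[1]);
    return -1;
  }

  if (ParentOrChildPID == 0)
    runChild(PipeFiles[0], PipeFiles[1]); // Does not return.

  return runParent(ParentOrChildPID, PipeFiles[0], PipeFiles[1], Value);
}
```

Calling `forkAndRun(42)` in this sketch sends the byte to the child and returns the child's exit code, or -1 if any step fails.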


boomanaiden154 requested review from gchatelet and legrosbuffle March 21, 2024 03:15

llvmbot added the tools:llvm-exegesis label Mar 21, 2024

legrosbuffle approved these changes Mar 21, 2024


boomanaiden154 merged commit 718fbbe into llvm:main Mar 22, 2024

boomanaiden154 mentioned this pull request Mar 22, 2024

Bump LLVM version google/gematria#78

Merged
