Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[llvm-exegesis] Kill process that recieve a signal #86069

Merged

Conversation

boomanaiden154
Copy link
Contributor

Before this patch, llvm-exegesis would leave processes lingering that experienced signals like segmentation faults. They would up in a signal-delivery-stop state under the ptrace and never exit. This does not cause problems (or at least many) in llvm-exegesis as they are cleaned up after the main process exits, which usually happens quickly. However, in downstream use, when many blocks are being executed (many of which run into signals) within a single process, these processes stay around and can easily exhaust the process limit on some systems.

This patch cleans them up by sending SIGKILL after information about the signal that was sent has been gathered.

Before this patch, llvm-exegesis would leave processes lingering that
experienced signals like segmentation faults. They would up in a
signal-delivery-stop state under the ptrace and never exit. This does
not cause problems (or at least many) in llvm-exegesis as they are
cleaned up after the main process exits, which usually happens quickly.
However, in downstream use, when many blocks are being executed (many of
which run into signals) within a single process, these processes stay
around and can easily exhaust the process limit on some systems.

This patch cleans them up by sending SIGKILL after information about the
signal that was sent has been gathered.
@boomanaiden154
Copy link
Contributor Author

Fixes google/gematria#76.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 21, 2024

@llvm/pr-subscribers-tools-llvm-exegesis

Author: Aiden Grossman (boomanaiden154)

Changes

Before this patch, llvm-exegesis would leave processes lingering that experienced signals like segmentation faults. They would up in a signal-delivery-stop state under the ptrace and never exit. This does not cause problems (or at least many) in llvm-exegesis as they are cleaned up after the main process exits, which usually happens quickly. However, in downstream use, when many blocks are being executed (many of which run into signals) within a single process, these processes stay around and can easily exhaust the process limit on some systems.

This patch cleans them up by sending SIGKILL after information about the signal that was sent has been gathered.


Full diff: https://github.com/llvm/llvm-project/pull/86069.diff

1 Files Affected:

  • (modified) llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp (+16-2)
diff --git a/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp b/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
index 5c9848f3c68885..f0452605eb24bf 100644
--- a/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
+++ b/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
@@ -342,7 +342,7 @@ class SubProcessFunctionExecutorImpl
       return make_error<Failure>("Failed to attach to the child process: " +
                                  Twine(strerror(errno)));
 
-    if (wait(NULL) == -1) {
+    if (waitpid(ParentOrChildPID, NULL, 0) == -1) {
       return make_error<Failure>(
           "Failed to wait for child process to stop after attaching: " +
           Twine(strerror(errno)));
@@ -361,7 +361,7 @@ class SubProcessFunctionExecutorImpl
       return SendError;
 
     int ChildStatus;
-    if (wait(&ChildStatus) == -1) {
+    if (waitpid(ParentOrChildPID, &ChildStatus, 0) == -1) {
       return make_error<Failure>(
           "Waiting for the child process to complete failed: " +
           Twine(strerror(errno)));
@@ -401,6 +401,20 @@ class SubProcessFunctionExecutorImpl
                                  Twine(strerror(errno)));
     }
 
+    // Send SIGKILL rather than SIGTERM as the child process has no SIGTERM
+    // handlers to run, and calling SIGTERM would mean that ptrace will force
+    // it to block in the signal-delivery-stop for the SIGSEGV/other signals,
+    // and upon exit.
+    if (kill(ParentOrChildPID, SIGKILL) == -1)
+      return make_error<Failure>("Failed to kill child benchmarking proces: " +
+                                 Twine(strerror(errno)));
+
+    // Wait for the process to exit so that there are no zombie processes left
+    // around.
+    if (waitpid(ParentOrChildPID, NULL, 0) == -1)
+      return make_error<Failure>("Failed to wait for process to die: " +
+                                 Twine(strerror(errno)));
+
     if (ChildSignalInfo.si_signo == SIGSEGV)
       return make_error<SnippetSegmentationFault>(
           reinterpret_cast<intptr_t>(ChildSignalInfo.si_addr));

@@ -342,7 +342,7 @@ class SubProcessFunctionExecutorImpl
return make_error<Failure>("Failed to attach to the child process: " +
Twine(strerror(errno)));

if (wait(NULL) == -1) {
if (waitpid(ParentOrChildPID, NULL, 0) == -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At that point ParentOrChildPID is always the parent, but it's not immediately obvious (I had to scroll up). What about restructuring this function to have:


[[noreturn]] void runChildSubprocess(int ReadFD, int FriteFD) {
   // We are in the child process, close the write end of the pipe.
      close(PipeFiles[1]);
      // Unregister handlers, signal handling is now handled through ptrace in
      // the host process.
      sys::unregisterHandlers();
      prepareAndRunBenchmark(PipeFiles[0], Key);
      llvm_unreachable("Child process didn't exit when expected.");
}

Error runParentSubprocess(pid_t PID, int ReadFD, int FriteFD) {
    const ExegesisTarget &ET = State.getExegesisTarget();
    ...
}


createSubProcessAndRunBenchmark() {
  ...
  if (ParentOrChildPID == -1) {
    ...
  }
  
  if (ParentOrChildPID == 0) {
     runChildSubprocess(PipeFiles[0], Pipefiles[1]);
      llvm_unreachable("Child process didn't exit when expected.");
  }

  return runParentSubprocess(ParentOrChildPID, PipeFiles[0], Pipefiles[1]);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. The suggested refactoring would definitely make the code cleaner. I'm going to land this PR and then open up another PR with the suggested refactoring to try and keep the history cleanish. Thanks for the suggestion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #86232.

@boomanaiden154 boomanaiden154 merged commit 718fbbe into llvm:main Mar 22, 2024
6 checks passed
chencha3 pushed a commit to chencha3/llvm-project that referenced this pull request Mar 23, 2024
Before this patch, llvm-exegesis would leave processes lingering that
experienced signals like segmentation faults. They would up in a
signal-delivery-stop state under the ptrace and never exit. This does
not cause problems (or at least many) in llvm-exegesis as they are
cleaned up after the main process exits, which usually happens quickly.
However, in downstream use, when many blocks are being executed (many of
which run into signals) within a single process, these processes stay
around and can easily exhaust the process limit on some systems.

This patch cleans them up by sending SIGKILL after information about the
signal that was sent has been gathered.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants