Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New ThreadPlanSingleThreadTimeout to resolve potential deadlock in single thread stepping #90930

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

jeffreytan81
Copy link
Contributor

@jeffreytan81 jeffreytan81 commented May 3, 2024

This PR introduces a new ThreadPlanSingleThreadTimeout that will be used to address potential deadlock during single-thread stepping.

While debugging a target with a non-trivial number of threads (around 5000 threads in one example target), we noticed that a simple step over can take as long as 10 seconds. Enabling single-thread stepping mode significantly reduces the stepping time to around 3 seconds. However, this can introduce deadlock if we try to step over a method that depends on other threads to release a lock.

To address this issue, we introduce a new ThreadPlanSingleThreadTimeout that can be controlled by the target.process.thread.single-thread-plan-timeout setting during single-thread stepping mode. The concept involves counting the elapsed time since the last internal stop to detect overall stepping progress. Once a timeout occurs, we assume the target is not making progress due to a potential deadlock, as mentioned above. We then send a new async interrupt, resume all threads, and ThreadPlanSingleThreadTimeout completes its task.

To support this design, the major changes made in this PR are:

  1. ThreadPlanSingleThreadTimeout is popped during every internal stop and reset (re-pushed) to the top of the stack (as a leaf node) during resume. This is achieved by always returning true from ThreadPlanSingleThreadTimeout::DoPlanExplainsStop() and ThreadPlanSingleThreadTimeout::MischiefManaged().
  2. A new thread-specific async interrupt stop is introduced, which can be detected/consumed by ThreadPlanSingleThreadTimeout.
  3. The clearing of branch breakpoints in the range thread plan has been moved from DoPlanExplainsStop() to ShouldStop(), as it is not guaranteed that it will be called.

The detailed design is discussed in the RFC below:
https://discourse.llvm.org/t/improve-single-thread-stepping/74599

Copy link

github-actions bot commented May 6, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@jeffreytan81 jeffreytan81 marked this pull request as ready for review May 6, 2024 21:56
@llvmbot llvmbot added the lldb label May 6, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented May 6, 2024

@llvm/pr-subscribers-lldb

Author: None (jeffreytan81)

Changes

This PR introduces a new ThreadPlanSingleThreadTimeout that will be used to address potential deadlock during single-thread stepping.

While debugging a target with a non-trivial number of threads (around 5000 threads in one example target), we noticed that a simple step over can take as long as 10 seconds. Enabling single-thread stepping mode significantly reduces the stepping time to around 3 seconds. However, this can introduce deadlock if we try to step over a method that depends on other threads to release a lock.

To address this issue, we introduce a new ThreadPlanSingleThreadTimeout that can be controlled by the target.process.thread.single-thread-plan-timeout setting during single-thread stepping mode. The concept involves counting the elapsed time since the last internal stop to detect overall stepping progress. Once a timeout occurs, we assume the target is not making progress due to a potential deadlock, as mentioned above. We then send a new async interrupt, resume all threads, and ThreadPlanSingleThreadTimeout completes its task.

To support this design, the major changes made in this PR are:

  1. ThreadPlanSingleThreadTimeout is popped during every internal stop and reset (re-pushed) to the top of the stack (as a leaf node) during resume. This is achieved by always returning true from ThreadPlanSingleThreadTimeout::DoPlanExplainsStop() and ThreadPlanSingleThreadTimeout::MischiefManaged().
  2. A new thread-specific async interrupt stop is introduced, which can be detected/consumed by ThreadPlanSingleThreadTimeout.
  3. The clearing of branch breakpoints in the range thread plan has been moved from DoPlanExplainsStop() to ShouldStop(), as it is not guaranteed that it will be called.

The detailed design is discussed in the RFC below:
https://discourse.llvm.org/t/improve-single-thread-stepping/74599


Patch is 49.90 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/90930.diff

29 Files Affected:

  • (modified) lldb/include/lldb/Target/Process.h (+9-2)
  • (modified) lldb/include/lldb/Target/StopInfo.h (+4)
  • (modified) lldb/include/lldb/Target/Thread.h (+2)
  • (modified) lldb/include/lldb/Target/ThreadPlan.h (+5-3)
  • (added) lldb/include/lldb/Target/ThreadPlanSingleThreadTimeout.h (+93)
  • (modified) lldb/include/lldb/Target/ThreadPlanStepOut.h (+1)
  • (modified) lldb/include/lldb/Target/ThreadPlanStepOverRange.h (+6)
  • (modified) lldb/include/lldb/Target/ThreadPlanStepRange.h (+5)
  • (modified) lldb/include/lldb/lldb-enumerations.h (+1)
  • (modified) lldb/source/API/SBThread.cpp (+6)
  • (modified) lldb/source/Interpreter/CommandInterpreter.cpp (+1-1)
  • (modified) lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp (+2)
  • (modified) lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp (+50-11)
  • (modified) lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h (+6)
  • (modified) lldb/source/Target/CMakeLists.txt (+1)
  • (modified) lldb/source/Target/Process.cpp (+18-5)
  • (modified) lldb/source/Target/StopInfo.cpp (+37)
  • (modified) lldb/source/Target/TargetProperties.td (+4)
  • (modified) lldb/source/Target/Thread.cpp (+8-1)
  • (modified) lldb/source/Target/ThreadPlan.cpp (+1)
  • (added) lldb/source/Target/ThreadPlanSingleThreadTimeout.cpp (+234)
  • (modified) lldb/source/Target/ThreadPlanStepInRange.cpp (+1)
  • (modified) lldb/source/Target/ThreadPlanStepOverRange.cpp (+18-2)
  • (modified) lldb/source/Target/ThreadPlanStepRange.cpp (+45-26)
  • (added) lldb/test/API/functionalities/single-thread-step/Makefile (+4)
  • (added) lldb/test/API/functionalities/single-thread-step/TestSingleThreadStepTimeout.py (+123)
  • (added) lldb/test/API/functionalities/single-thread-step/main.cpp (+62)
  • (modified) lldb/tools/lldb-dap/JSONUtils.cpp (+3)
  • (modified) lldb/tools/lldb-dap/LLDBUtils.cpp (+1)
diff --git a/lldb/include/lldb/Target/Process.h b/lldb/include/lldb/Target/Process.h
index aac0cf51680a9e..7e758dbb9f6457 100644
--- a/lldb/include/lldb/Target/Process.h
+++ b/lldb/include/lldb/Target/Process.h
@@ -1312,11 +1312,13 @@ class Process : public std::enable_shared_from_this<Process>,
 
   size_t GetThreadStatus(Stream &ostrm, bool only_threads_with_stop_reason,
                          uint32_t start_frame, uint32_t num_frames,
-                         uint32_t num_frames_with_source,
-                         bool stop_format);
+                         uint32_t num_frames_with_source, bool stop_format);
 
   void SendAsyncInterrupt();
 
+  // Send an async interrupt and receive stop from a specific /p thread.
+  void SendAsyncInterrupt(Thread *thread);
+
   // Notify this process class that modules got loaded.
   //
   // If subclasses override this method, they must call this version before
@@ -3102,6 +3104,11 @@ void PruneThreadPlans();
                            // Resume will only request a resume, using this
                            // flag to check.
 
+  lldb::tid_t m_interrupt_tid; /// The tid of the thread that issued the async
+                               /// interrupt, used by thread plan timeout. It
+                               /// can be LLDB_INVALID_THREAD_ID to indicate
+                               /// user level async interrupt.
+
   /// This is set at the beginning of Process::Finalize() to stop functions
   /// from looking up or creating things during or after a finalize call.
   std::atomic<bool> m_finalizing;
diff --git a/lldb/include/lldb/Target/StopInfo.h b/lldb/include/lldb/Target/StopInfo.h
index d1848fcbbbdb19..fae90364deaf0a 100644
--- a/lldb/include/lldb/Target/StopInfo.h
+++ b/lldb/include/lldb/Target/StopInfo.h
@@ -123,6 +123,10 @@ class StopInfo : public std::enable_shared_from_this<StopInfo> {
                              const char *description = nullptr,
                              std::optional<int> code = std::nullopt);
 
+  static lldb::StopInfoSP
+  CreateStopReasonWithInterrupt(Thread &thread, int signo,
+                                const char *description);
+
   static lldb::StopInfoSP CreateStopReasonToTrace(Thread &thread);
 
   static lldb::StopInfoSP
diff --git a/lldb/include/lldb/Target/Thread.h b/lldb/include/lldb/Target/Thread.h
index c17bddf4d98b85..584093348b4c6a 100644
--- a/lldb/include/lldb/Target/Thread.h
+++ b/lldb/include/lldb/Target/Thread.h
@@ -57,6 +57,8 @@ class ThreadProperties : public Properties {
   bool GetStepOutAvoidsNoDebug() const;
 
   uint64_t GetMaxBacktraceDepth() const;
+
+  uint64_t GetSingleThreadPlanTimeout() const;
 };
 
 class Thread : public std::enable_shared_from_this<Thread>,
diff --git a/lldb/include/lldb/Target/ThreadPlan.h b/lldb/include/lldb/Target/ThreadPlan.h
index bf68a42e54d18f..640e997caf7b3d 100644
--- a/lldb/include/lldb/Target/ThreadPlan.h
+++ b/lldb/include/lldb/Target/ThreadPlan.h
@@ -302,7 +302,8 @@ class ThreadPlan : public std::enable_shared_from_this<ThreadPlan>,
     eKindStepInRange,
     eKindRunToAddress,
     eKindStepThrough,
-    eKindStepUntil
+    eKindStepUntil,
+    eKindSingleThreadTimeout,
   };
 
   virtual ~ThreadPlan();
@@ -483,6 +484,8 @@ class ThreadPlan : public std::enable_shared_from_this<ThreadPlan>,
     return m_takes_iteration_count;
   }
 
+  virtual lldb::StateType GetPlanRunState() = 0;
+
 protected:
   // Constructors and Destructors
   ThreadPlan(ThreadPlanKind kind, const char *name, Thread &thread,
@@ -522,8 +525,6 @@ class ThreadPlan : public std::enable_shared_from_this<ThreadPlan>,
     GetThread().SetStopInfo(stop_reason_sp);
   }
 
-  virtual lldb::StateType GetPlanRunState() = 0;
-
   bool IsUsuallyUnexplainedStopReason(lldb::StopReason);
 
   Status m_status;
@@ -590,6 +591,7 @@ class ThreadPlanNull : public ThreadPlan {
   const Status &GetStatus() { return m_status; }
 
 protected:
+  friend class ThreadPlanSingleThreadTimeout;
   bool DoPlanExplainsStop(Event *event_ptr) override;
 
   lldb::StateType GetPlanRunState() override;
diff --git a/lldb/include/lldb/Target/ThreadPlanSingleThreadTimeout.h b/lldb/include/lldb/Target/ThreadPlanSingleThreadTimeout.h
new file mode 100644
index 00000000000000..89db966b29bff3
--- /dev/null
+++ b/lldb/include/lldb/Target/ThreadPlanSingleThreadTimeout.h
@@ -0,0 +1,93 @@
+//===-- ThreadPlanSingleThreadTimeout.h -------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLDB_TARGET_THREADPLANSINGLETHREADTIMEOUT_H
+#define LLDB_TARGET_THREADPLANSINGLETHREADTIMEOUT_H
+
+#include "lldb/Target/Thread.h"
+#include "lldb/Target/ThreadPlan.h"
+#include "lldb/Utility/Event.h"
+#include "lldb/Utility/LLDBLog.h"
+#include "lldb/Utility/State.h"
+
+#include <thread>
+
+namespace lldb_private {
+
+//
+// Thread plan used by single thread execution to issue timeout. This is useful
+// to detect potential deadlock in single thread execution. The timeout measures
+// the elapsed time from the last internal stop and got reset by each internal
+// stops to ensure we are accurately detecting execution not moving forward.
+// This means this thread plan  can be created/destroyed multiple times by the
+// parent execution plan.
+//
+// When timeout happens, the thread plan resolves the potential deadlock by
+// issuing a thread specific async interrupt to enter stop state, then all
+// threads execution are resumed to resolve the potential deadlock.
+//
+class ThreadPlanSingleThreadTimeout : public ThreadPlan {
+public:
+  ~ThreadPlanSingleThreadTimeout() override;
+
+  // Create a new instance from fresh new state.
+  static void CreateNew(Thread &thread);
+  // Reset and create a new instance from the previous state.
+  static void ResetFromPrevState(Thread &thread);
+
+  void GetDescription(Stream *s, lldb::DescriptionLevel level) override;
+  bool ValidatePlan(Stream *error) override { return true; }
+  bool WillStop() override;
+  void DidPop() override;
+
+  bool DoPlanExplainsStop(Event *event_ptr) override;
+
+  lldb::StateType GetPlanRunState() override;
+  static void TimeoutThreadFunc(ThreadPlanSingleThreadTimeout *self);
+
+  bool MischiefManaged() override;
+
+  bool ShouldStop(Event *event_ptr) override;
+  void SetStopOthers(bool new_value) override;
+  bool StopOthers() override;
+
+private:
+  ThreadPlanSingleThreadTimeout(Thread &thread);
+
+  static bool IsAlive();
+
+  enum class State {
+    WaitTimeout,    // Waiting for timeout.
+    AsyncInterrupt, // Async interrupt has been issued.
+    Done,           // Finished resume all threads.
+  };
+
+  static std::mutex s_mutex;
+  static ThreadPlanSingleThreadTimeout *s_instance;
+  static State s_prev_state;
+
+  bool HandleEvent(Event *event_ptr);
+  void HandleTimeout();
+
+  static std::string StateToString(State state);
+
+  ThreadPlanSingleThreadTimeout(const ThreadPlanSingleThreadTimeout &) = delete;
+  const ThreadPlanSingleThreadTimeout &
+  operator=(const ThreadPlanSingleThreadTimeout &) = delete;
+
+  State m_state;
+  std::mutex m_mutex;
+  std::condition_variable m_wakeup_cv;
+  // Whether the timer thread should exit or not.
+  bool m_exit_flag;
+  std::thread m_timer_thread;
+};
+
+} // namespace lldb_private
+
+#endif // LLDB_TARGET_THREADPLANSINGLETHREADTIMEOUT_H
diff --git a/lldb/include/lldb/Target/ThreadPlanStepOut.h b/lldb/include/lldb/Target/ThreadPlanStepOut.h
index b1d8769f7c546e..013c675afc33d0 100644
--- a/lldb/include/lldb/Target/ThreadPlanStepOut.h
+++ b/lldb/include/lldb/Target/ThreadPlanStepOut.h
@@ -30,6 +30,7 @@ class ThreadPlanStepOut : public ThreadPlan, public ThreadPlanShouldStopHere {
   bool ValidatePlan(Stream *error) override;
   bool ShouldStop(Event *event_ptr) override;
   bool StopOthers() override;
+  void SetStopOthers(bool new_value) override { m_stop_others = new_value; }
   lldb::StateType GetPlanRunState() override;
   bool WillStop() override;
   bool MischiefManaged() override;
diff --git a/lldb/include/lldb/Target/ThreadPlanStepOverRange.h b/lldb/include/lldb/Target/ThreadPlanStepOverRange.h
index 8585ac62f09b35..faa0ab596a2836 100644
--- a/lldb/include/lldb/Target/ThreadPlanStepOverRange.h
+++ b/lldb/include/lldb/Target/ThreadPlanStepOverRange.h
@@ -27,7 +27,9 @@ class ThreadPlanStepOverRange : public ThreadPlanStepRange,
   ~ThreadPlanStepOverRange() override;
 
   void GetDescription(Stream *s, lldb::DescriptionLevel level) override;
+  void SetStopOthers(bool new_value) override;
   bool ShouldStop(Event *event_ptr) override;
+  void DidPush() override;
 
 protected:
   bool DoPlanExplainsStop(Event *event_ptr) override;
@@ -42,8 +44,12 @@ class ThreadPlanStepOverRange : public ThreadPlanStepRange,
 
   void SetupAvoidNoDebug(LazyBool step_out_avoids_code_without_debug_info);
   bool IsEquivalentContext(const SymbolContext &context);
+  // Clear and create a new ThreadPlanSingleThreadTimeout to detect if program
+  // is moving forward.
+  void ResetSingleThreadTimeout();
 
   bool m_first_resume;
+  lldb::RunMode m_run_mode;
 
   ThreadPlanStepOverRange(const ThreadPlanStepOverRange &) = delete;
   const ThreadPlanStepOverRange &
diff --git a/lldb/include/lldb/Target/ThreadPlanStepRange.h b/lldb/include/lldb/Target/ThreadPlanStepRange.h
index 2fe88527710006..ced84c03e83cf5 100644
--- a/lldb/include/lldb/Target/ThreadPlanStepRange.h
+++ b/lldb/include/lldb/Target/ThreadPlanStepRange.h
@@ -58,8 +58,13 @@ class ThreadPlanStepRange : public ThreadPlan {
   // run' plan, then just single step.
   bool SetNextBranchBreakpoint();
 
+  // Whether the input stop info is caused by the next branch breakpoint.
+  bool IsNextBranchBreakpointStop(lldb::StopInfoSP stop_info_sp);
+
   void ClearNextBranchBreakpoint();
 
+  void ClearNextBranchBreakpointExplainedStop();
+
   bool NextRangeBreakpointExplainsStop(lldb::StopInfoSP stop_info_sp);
 
   SymbolContext m_addr_context;
diff --git a/lldb/include/lldb/lldb-enumerations.h b/lldb/include/lldb/lldb-enumerations.h
index 15e45857186091..2000931913c81b 100644
--- a/lldb/include/lldb/lldb-enumerations.h
+++ b/lldb/include/lldb/lldb-enumerations.h
@@ -253,6 +253,7 @@ enum StopReason {
   eStopReasonFork,
   eStopReasonVFork,
   eStopReasonVForkDone,
+  eStopReasonInterrupt, ///< Thread requested interrupt
 };
 
 /// Command Return Status Types.
diff --git a/lldb/source/API/SBThread.cpp b/lldb/source/API/SBThread.cpp
index ac3e2cd25daa94..c6925096e3cb1a 100644
--- a/lldb/source/API/SBThread.cpp
+++ b/lldb/source/API/SBThread.cpp
@@ -192,6 +192,9 @@ size_t SBThread::GetStopReasonDataCount() {
         case eStopReasonSignal:
           return 1;
 
+        case eStopReasonInterrupt:
+          return 1;
+
         case eStopReasonException:
           return 1;
 
@@ -261,6 +264,9 @@ uint64_t SBThread::GetStopReasonDataAtIndex(uint32_t idx) {
         case eStopReasonSignal:
           return stop_info_sp->GetValue();
 
+        case eStopReasonInterrupt:
+          return stop_info_sp->GetValue();
+
         case eStopReasonException:
           return stop_info_sp->GetValue();
 
diff --git a/lldb/source/Interpreter/CommandInterpreter.cpp b/lldb/source/Interpreter/CommandInterpreter.cpp
index 4c58ecc3c1848f..d327822db7bd37 100644
--- a/lldb/source/Interpreter/CommandInterpreter.cpp
+++ b/lldb/source/Interpreter/CommandInterpreter.cpp
@@ -2500,7 +2500,7 @@ bool CommandInterpreter::DidProcessStopAbnormally() const {
     const StopReason reason = stop_info->GetStopReason();
     if (reason == eStopReasonException ||
         reason == eStopReasonInstrumentation ||
-        reason == eStopReasonProcessorTrace)
+        reason == eStopReasonProcessorTrace || reason == eStopReasonInterrupt)
       return true;
 
     if (reason == eStopReasonSignal) {
diff --git a/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp b/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp
index ae1a77e5be8321..b3c0359b7dcf7b 100644
--- a/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp
+++ b/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp
@@ -714,6 +714,8 @@ static const char *GetStopReasonString(StopReason stop_reason) {
     return "vfork";
   case eStopReasonVForkDone:
     return "vforkdone";
+  case eStopReasonInterrupt:
+    return "async interrupt";
   case eStopReasonInstrumentation:
   case eStopReasonInvalid:
   case eStopReasonPlanComplete:
diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
index 871683a605686f..b93aad8808f5bd 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
@@ -1730,14 +1730,24 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
       thread_sp = memory_thread_sp;
 
     if (exc_type != 0) {
-      const size_t exc_data_size = exc_data.size();
-
-      thread_sp->SetStopInfo(
-          StopInfoMachException::CreateStopReasonWithMachException(
-              *thread_sp, exc_type, exc_data_size,
-              exc_data_size >= 1 ? exc_data[0] : 0,
-              exc_data_size >= 2 ? exc_data[1] : 0,
-              exc_data_size >= 3 ? exc_data[2] : 0));
+      // For thread plan async interrupt, creating stop info on the
+      // original async interrupt request thread instead. If interrupt thread
+      // does not exist anymore we fallback to current signal receiving thread
+      // instead.
+      ThreadSP interrupt_thread;
+      if (m_interrupt_tid != LLDB_INVALID_THREAD_ID)
+        interrupt_thread = HandleThreadAsyncInterrupt(signo, description);
+      if (interrupt_thread)
+        thread_sp = interrupt_thread;
+      else {
+        const size_t exc_data_size = exc_data.size();
+        thread_sp->SetStopInfo(
+            StopInfoMachException::CreateStopReasonWithMachException(
+                *thread_sp, exc_type, exc_data_size,
+                exc_data_size >= 1 ? exc_data[0] : 0,
+                exc_data_size >= 2 ? exc_data[1] : 0,
+                exc_data_size >= 3 ? exc_data[2] : 0));
+      }
     } else {
       bool handled = false;
       bool did_exec = false;
@@ -1936,9 +1946,20 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
                   *thread_sp, signo, description.c_str()));
           }
         }
-        if (!handled)
-          thread_sp->SetStopInfo(StopInfo::CreateStopReasonWithSignal(
-              *thread_sp, signo, description.c_str()));
+        if (!handled) {
+          // For thread plan async interrupt, creating stop info on the
+          // original async interrupt request thread instead. If interrupt
+          // thread does not exist anymore we fallback to current signal
+          // receiving thread instead.
+          ThreadSP interrupt_thread;
+          if (m_interrupt_tid != LLDB_INVALID_THREAD_ID)
+            interrupt_thread = HandleThreadAsyncInterrupt(signo, description);
+          if (interrupt_thread)
+            thread_sp = interrupt_thread;
+          else
+            thread_sp->SetStopInfo(StopInfo::CreateStopReasonWithSignal(
+                *thread_sp, signo, description.c_str()));
+        }
       }
 
       if (!description.empty()) {
@@ -1957,6 +1978,24 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
   return thread_sp;
 }
 
+ThreadSP
+ProcessGDBRemote::HandleThreadAsyncInterrupt(uint8_t signo,
+                                             const std::string &description) {
+  ThreadSP thread_sp;
+  {
+    std::lock_guard<std::recursive_mutex> guard(m_thread_list_real.GetMutex());
+    thread_sp = m_thread_list_real.FindThreadByProtocolID(m_interrupt_tid,
+                                                          /*can_update=*/false);
+  }
+  if (thread_sp)
+    thread_sp->SetStopInfo(StopInfo::CreateStopReasonWithInterrupt(
+        *thread_sp, signo, description.c_str()));
+  // Clear m_interrupt_tid regardless we can find original interrupt thread or
+  // not.
+  m_interrupt_tid = LLDB_INVALID_THREAD_ID;
+  return thread_sp;
+}
+
 lldb::ThreadSP
 ProcessGDBRemote::SetThreadStopInfo(StructuredData::Dictionary *thread_dict) {
   static constexpr llvm::StringLiteral g_key_tid("tid");
diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
index 610a1ee0b34d2f..615831dd7e6f60 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
@@ -440,6 +440,12 @@ class ProcessGDBRemote : public Process,
   void HandleStopReply() override;
   void HandleAsyncStructuredDataPacket(llvm::StringRef data) override;
 
+  // Handle thread specific async interrupt and return the original thread
+  // that requested the async interrupt. It can be null if original thread
+  // has exited.
+  lldb::ThreadSP HandleThreadAsyncInterrupt(uint8_t signo,
+                                            const std::string &description);
+
   void SetThreadPc(const lldb::ThreadSP &thread_sp, uint64_t index);
   using ModuleCacheKey = std::pair<std::string, std::string>;
   // KeyInfo for the cached module spec DenseMap.
diff --git a/lldb/source/Target/CMakeLists.txt b/lldb/source/Target/CMakeLists.txt
index cf4818eae3eb8b..d1209d29245d54 100644
--- a/lldb/source/Target/CMakeLists.txt
+++ b/lldb/source/Target/CMakeLists.txt
@@ -59,6 +59,7 @@ add_lldb_library(lldbTarget
   ThreadPlanCallUserExpression.cpp
   ThreadPlanPython.cpp
   ThreadPlanRunToAddress.cpp
+  ThreadPlanSingleThreadTimeout.cpp
   ThreadPlanShouldStopHere.cpp
   ThreadPlanStepInRange.cpp
   ThreadPlanStepInstruction.cpp
diff --git a/lldb/source/Target/Process.cpp b/lldb/source/Target/Process.cpp
index 30c240b064b59c..3e5e4fe58b4b3b 100644
--- a/lldb/source/Target/Process.cpp
+++ b/lldb/source/Target/Process.cpp
@@ -446,7 +446,8 @@ Process::Process(lldb::TargetSP target_sp, ListenerSP listener_sp,
       m_memory_cache(*this), m_allocated_memory_cache(*this),
       m_should_detach(false), m_next_event_action_up(), m_public_run_lock(),
       m_private_run_lock(), m_currently_handling_do_on_removals(false),
-      m_resume_requested(false), m_finalizing(false), m_destructing(false),
+      m_resume_requested(false), m_interrupt_tid(LLDB_INVALID_THREAD_ID),
+      m_finalizing(false), m_destructing(false),
       m_clear_thread_plans_on_stop(false), m_force_next_event_delivery(false),
       m_last_broadcast_state(eStateInvalid), m_destroy_in_process(false),
       m_can_interpret_function_calls(false), m_run_thread_plan_lock(),
@@ -863,6 +864,7 @@ bool Process::HandleProcessStateChangedEvent(
             case eStopReasonThreadExiting:
             case eStopReasonInstrumentation:
             case eStopReasonProcessorTrace:
+            case eStopReasonInterrupt:
               if (!other_thread)
                 other_thread = thread;
               break;
@@ -3679,7 +3681,13 @@ void Process::ControlPrivateStateThread(uint32_t signal) {
   }
 }
 
-void Process::SendAsyncInterrupt() {
+void Process::SendAsyncInterrupt() { SendAsyncInterrupt(nullptr); }
+
+void Process::SendAsyncInterrupt(Thread *thread) {
+  if (thread != nullptr)
+    m_interrupt_tid = thread->GetProtocolID();
+  else
+    m_interrupt_tid = LLDB_INVALID_THREAD_ID;
   if (PrivateStateThreadIsValid())
     m_private_state_broadcaster.BroadcastEvent(Process::eBroadcastBitInterrupt,
                                                nullptr);
@@ -3905,9 +3913,14 @@ thread_result_t Process::RunPrivateStateThread(bool is_secondary_thread) {
 
       if (interrupt_requested) {
         if (StateIsStoppedState(internal_state, true)) {
-          // We requested the interrupt, so mark this as such in the stop event
-          // so clients can tell an interrupted process from a natural stop
-          ProcessEventData::SetInterruptedInEvent(event_sp.get(), true);
+          // Oly mark interrupt event if it is not thread specific async
+          // interrupt.
+          if (m_interrupt_tid == LLDB_INVALID_THREAD_ID) {
+            // We requested the interrupt, so mark this as such in the stop
+            // event so clients can tell an interrupted process from ...
[truncated]

@jeffreytan81
Copy link
Contributor Author

@jimingham, friendly ping. @clayborg mentioned that you are the sole domain expert on ThreadPlan. Can you help to review this change? Thanks

@jimingham
Copy link
Collaborator

Thanks for working up this patch. I should be able to get time to look at this next week.


void SendAsyncInterrupt();

// Send an async interrupt and receive stop from a specific /p thread.
void SendAsyncInterrupt(Thread *thread);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need two API's here? Why not have thread default to nullptr and document that that means we let the system choose which thread gets the interrupt? It's also worth noting here that the "receive stop from specific thread" is artificial, so you might say something like "and attribute the stop to a specific thread" instead to make it clear we don't require the ability to actually have that thread receive the signal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose overloading because I remember reading some guideline to prefer overloading than default parameter. Google finds one:
https://stackoverflow.com/questions/15876590/overload-a-method-or-use-default-values-c

But I do not care that much really, so changed and made the comment more clear.

@@ -590,6 +591,7 @@ class ThreadPlanNull : public ThreadPlan {
const Status &GetStatus() { return m_status; }

protected:
friend class ThreadPlanSingleThreadTimeout;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this needed for? It's a little odd having a derived class friended to the base class, since you could make whatever the derived class needed protected rather than private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I thought there is a reason I can't remember, apparently not. Removed. Thanks.

//
// Thread plan used by single thread execution to issue timeout. This is useful
// to detect potential deadlock in single thread execution. The timeout measures
// the elapsed time from the last internal stop and got reset by each internal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"gets reset by each internal stop"

// to detect potential deadlock in single thread execution. The timeout measures
// the elapsed time from the last internal stop and got reset by each internal
// stops to ensure we are accurately detecting execution not moving forward.
// This means this thread plan can be created/destroyed multiple times by the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"may" is better than "can" and there's one too many spaces.

Done, // Finished resume all threads.
};

static std::mutex s_mutex;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to see how you use this as I go along, but having statics here worries me. How are you going to support the case where lldb has two targets, each of which are running using this timeout feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I was storing some static states so that we can ensure:

  1. Singleton (only one ThreadPlanSingleThreadTimeout is alive/used anytime).
  2. If there is ever a race that, the ThreadPlanSingleThreadTimeout is in State::AsyncInterrupt but it got popped by another internal stop before async interrupt received. I want to remember the previous/last state, and the newly created ThreadPlanSingleThreadTimeout can be re-created from previous state.

Instead, I will look into storing these static states into parent controlling ThreadPlan (ThreadPlanStepOverRange in this case).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be better, but if you find you do need a singleton, you could also store that either in the Thread or the Process, depending on the scope at which you need them.

@jeffreytan81
Copy link
Contributor Author

@jimingham, I have revised the PR, let me know when you can review again. Thanks.

void SendAsyncInterrupt();
/// Send an async interrupt request.
///
/// If \a thread is specified the async interrupt stop will be attributed the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"will be attributed to the" - missing "to"

/// specified thread.
///
/// \param[in] thread
/// The thread from which to attribute the async interrupt stop to.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The thread the async interrupt will be attributed to"


~ThreadPlanSingleThreadTimeout() override;

// Create a new instance from fresh new state.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this API doesn't return the Created thread plan, then this API has to do more than just create it - it also pushes the plan, and sometimes doesn't. So CreateNew is not a good name, and the description of what the function does is too abbreviated.

@@ -42,8 +45,13 @@ class ThreadPlanStepOverRange : public ThreadPlanStepRange,

void SetupAvoidNoDebug(LazyBool step_out_avoids_code_without_debug_info);
bool IsEquivalentContext(const SymbolContext &context);
// Clear and create a new ThreadPlanSingleThreadTimeout to detect if program
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ThreadPlanStepOverRange isn't the only thread plan that we want to have this behavior for. For instance, it would be great to do this for ThreadPlanStepOut, which we don't even try to run on one thread, IIRC, and in order not to duplicate code, we will want to switch the ThreadPlanCallUserExpression over to this method as well. So the code to handle the interrupt has to be shareable - maybe a mix-in class that does the management of the interrupt?

// Handle thread specific async interrupt and return the original thread
// that requested the async interrupt. It can be null if original thread
// has exited.
lldb::ThreadSP HandleThreadAsyncInterrupt(uint8_t signo,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see any reason why the ability to manage this interrupt event is GDBRemote specific. So this should be a virtual method in Process, and overridden here.

// We requested the interrupt, so mark this as such in the stop event
// so clients can tell an interrupted process from a natural stop
ProcessEventData::SetInterruptedInEvent(event_sp.get(), true);
// Oly mark interrupt event if it is not thread specific async
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only

bool ShouldStop(Event *event_ptr) override {
ThreadSP thread_sp(m_thread_wp.lock());
if (thread_sp)
return thread_sp->GetProcess()->GetUnixSignals()->GetShouldStop(m_value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right. This is a signal that you are using to implement your async interrupt. So you shouldn't be governed by whether the user would or would not want to stop for that signal if it had been generated by some other means.

@jeffreytan81
Copy link
Contributor Author

@jimingham , let me know any other comments/thoughts you have.

@jimingham
Copy link
Collaborator

jimingham commented Jun 6, 2024 via email

@jeffreytan81
Copy link
Contributor Author

@jimingham, hope WWDC is going well. Do you have time to review this now? Thanks.

@@ -2838,6 +2844,14 @@ void PruneThreadPlans();
return std::nullopt;
}

/// Handle thread specific async interrupt and return the original thread
/// that requested the async interrupt. It can be null if original thread
/// has exited.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should say what the description string is for.


namespace lldb_private {

// Mixin class that provides capability for ThreadPlan that supports single
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar:

// Mixin class that provides the capability for ThreadPlan to support single
// thread execution that resumes all threads after a timeout.

// This means this thread plan may be created/destroyed multiple times by the
// parent execution plan.
//
// When timeout happens, the thread plan resolves the potential deadlock by
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"When a timeout happens"

//
// When timeout happens, the thread plan resolves the potential deadlock by
// issuing a thread specific async interrupt to enter stop state, then all
// threads execution are resumed to resolve the potential deadlock.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"then execution is resumed with all threads running to resolve the potential deadlock"

};

public:
struct TimeoutInfo {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't something you need to do in this patch, but it would be good to allow different thread plan invocations to use different timeouts, and not just depend on the thread timeout. For instance, we should be able to reimplement RunThreadPlan with this infrastructure. But the timeouts you want for expression evaluation and stepping are likely to be very different (and different executions also use different timeouts). So being able to pass in the timeout, rather than relying on one thread specific one will be necessary. We could for instance add the timeout to the TimeoutInfo and plumb that so that the ThreadPlan that mixes in TimeoutResumeAll can pass in a timeout.

// If input \param thread is running in single thread mode, push a
// new ThreadPlanSingleThreadTimeout based on timeout setting from fresh new
// state. The reference of \param info is passed in so that when
// ThreadPlanSingleThreadTimeout got popped out its last state can be stored
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"popped out" is confusing. We just say "popped" for the most part.


ThreadPlanSingleThreadTimeout::~ThreadPlanSingleThreadTimeout() {
m_info.m_instance = nullptr;
if (m_state == State::Done)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an odd thing to do in the destructor. Why is this necessary?


void ThreadPlanSingleThreadTimeout::GetDescription(
Stream *s, lldb::DescriptionLevel level) {
s->Printf("Single thread timeout, state(%s)", StateToString(m_state).c_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be super handy to print the time remaining here.

auto status = thread.QueueThreadPlan(thread_plan_sp,
/*abort_other_plans*/ false);
Log *log = GetLog(LLDBLog::Step);
LLDB_LOGF(log, "ThreadPlanSingleThreadTimeout pushing a brand new one");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely want the timeout in this log message.

auto status = thread.QueueThreadPlan(thread_plan_sp,
/*abort_other_plans*/ false);
Log *log = GetLog(LLDBLog::Step);
LLDB_LOGF(log, "ThreadPlanSingleThreadTimeout reset from previous state");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here, for debugging purposes it will be really handy to have the timeout value here.


// Reset the state during stop.
m_info.m_last_state = State::WaitTimeout;
m_info.m_instance = this;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear what this m_instance variable is for? You don't use it as a pointer, so I don't think it needs to hold this. It seems to be a should_resume_from_prev_state flag. Is that right? Anyway, this is confusing. If this needs to be the this pointer you should say why somewhere. If it's a state variable, use a bool instead.

lldb::StateType stop_state =
Process::ProcessEventData::GetStateFromEvent(event_ptr);
Log *log = GetLog(LLDBLog::Step);
LLDB_LOGF(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another place where it will probably be useful to see the timeout value if we're logging.

}

void ThreadPlanSingleThreadTimeout::TimeoutThreadFunc(
ThreadPlanSingleThreadTimeout *self) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how much this gets used, we might not want to spin up this thread afresh every time we invoke a single-thread timeout style thread plan. There will only ever be one client of this timeout at a time, since this is to implement running only one thread, and gets cleared each time you stop. So it would be more efficient to have a worker thread you spin up the first time someone does a one-thread-with-timeout and have it just be a timeout-server thread.
I think it's likely we will want to do that, but that can be for a follow-up patch.

ThreadPlanSingleThreadTimeout *self) {
std::unique_lock<std::mutex> lock(self->m_mutex);
uint64_t timeout_in_ms = self->GetThread().GetSingleThreadPlanTimeout();
self->m_wakeup_cv.wait_for(lock, std::chrono::milliseconds(timeout_in_ms),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to log both before and after the wait call here. I think that will help read the step logs.

return HandleEvent(event_ptr);
}

void ThreadPlanSingleThreadTimeout::SetStopOthers(bool new_value) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is assuming that the SingleThreadTimeout plan is always going to be pushed on behalf of the plan directly above it. That seems like a valid assumption to me, but it's worth calling out somewhere.

@@ -58,8 +58,13 @@ class ThreadPlanStepRange : public ThreadPlan {
// run' plan, then just single step.
bool SetNextBranchBreakpoint();

// Whether the input stop info is caused by the next branch breakpoint.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This routine does NOT distinguish between the cases where the next branch breakpoint is the ONLY breakpoint at this site, and where it's one of many breakpoints that share this site. Later, you distinguish between the two cases where this is used (e.g. in ThreadPlanStepRange::NextRangeBreakpointExplainsStop.
It might be interesting to see if it makes sense to incorporate the logic that does that into IsNextBranchBreakpointStop. If it does that would allow some code reuse. If it does not, you should note in the comment here that it does not distinguish between the two cases, so someone doesn't make a mistake later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function only checks the branch breakpoint without caring whether its site is shared by others or not. Let me add a comment.

@jimingham
Copy link
Collaborator

I left some more comments, most of them either nits or future work - unless you want to do them here...

This makes sense to me, so it's likely right. But the tests you've written so far are just about making sure that deadlocks are resolved in a simple stepping situation where nothing user-visible happens while you are stepping.

We need to make sure that if the "next" is interrupted - for instance by hitting a breakpoint in the code being stepped over, that once stopped, lldb can do other step operations having stopped and they work as expected.

Also, in lldb, if you do a next and that steps over a function that hits a breakpoint, the next stays on the stack and then "continue" will resume execution of the "next" ThreadPlan, completing the step. We should make sure that works and keeps tracking the single thread in a case where there's no deadlock to resolve, or goes on to resolve the deadlock.

We also really want to make sure that the async interrupt that you use here and the one that the user uses (through SBProcess.SendAsyncInterrupt) don't interfere with one another. So step over something with a really long user step timeout, then send an asynchronous interrupt and make sure we stop with the right stop reason. And then make sure the next continues successfully after that.

Also, if we can it would be nice to test that this succeeds in NOT allowing the other threads to run before the timeout elapses. You could have your worker thread stop right before code that's going to crash, and then switch to the main thread and do a bunch of steps over code that calls some function that will do a pthread yield. If the process doesn't crash after those steps, you're good.

@jeffreytan81
Copy link
Contributor Author

jeffreytan81 commented Jul 8, 2024

@jimingham, thanks for the comment.

When I try to add a new testcase for stepping over a function with a user set breakpoint, there is a bug found -- the user set breakpoint did not stop with ThreadPlanSingleThreadTimeout present (which is c case in your RFC comment)

I did some investigation, when we issue normal step-over and hit a user breakpoint, the all thread plans in the stack return false for PlanExplainsStop, so base thread plan got a chance to check thread's stop info and stop the execution.

However, while doing single thread stepping hitting a user breakpoint, ThreadPlanSingleThreadTimeout as leaf plan always return true for PlanExplainsStop, causing the logic to use the second while loop in Thread::ShouldStop(), eventually, ThreadPlanStepOut::MischiefManaged() returns false causing the loop to terminate early without giving base thread plan a chance to decide should stop or not:

} else {
break;
}

So overall the Thread::ShouldStop() returns false and ignores the hit inner breakpoint during stepping.

Any suggestion how to fix this?

@jimingham
Copy link
Collaborator

jimingham commented Jul 8, 2024 via email

@jeffreytan81
Copy link
Contributor Author

jeffreytan81 commented Jul 8, 2024

Why does the ThreadPlanSingleThreadTimeout have to claim to explain stops (like a random breakpoint hit) that on the face of it it's not responsible for

Just a quick comment to add the context for this question: we are doing so to ensure ThreadPlanSingleThreadTimeout can stay in the leaf thread plan. And it comes from your suggestion So ThreadPlanSingleThreadTimeout should always say it explains the stop in this comment:
https://discourse.llvm.org/t/improve-single-thread-stepping/74599/17?u=jeffreytan81

@jimingham
Copy link
Collaborator

Why does the ThreadPlanSingleThreadTimeout have to claim to explain stops (like a random breakpoint hit) that on the face of it it's not responsible for

Just a quick comment to add the context for this question: we are doing so to ensure ThreadPlanSingleThreadTimeout can stay in the leaf thread plan. And it comes from your suggestion So ThreadPlanSingleThreadTimeout should always say it explains the stop in this comment: https://discourse.llvm.org/t/improve-single-thread-stepping/74599/17?u=jeffreytan81

So maybe we need to do something explicit here.

The ThreadPlanSingleThreadTimeout's desired behavior, IIUC, is either it explains the stop or it should get discarded. I can't think of any instance where it should stay on the stack when there's a stop that's not for the timeout. If, for instance, you hit a "next range" breakpoint for a "step over" then we're going to remove the plan and have the thread plan push it again when you resume.

So instead of semi-lying about whether it actually explains stops it knows nothing about, can we add the concept of a plan that "has to be a leaf plan". So if it doesn't explain a stop during ShouldStop, it will get popped right away. So after we check "plan explains stop" and get false back, we check IsLeafPlan, and if that returns true, pop the plan and go on to the next plan.

The harder option is to either convert the ShouldStop to a single loop that re-enters if the plan that explained the stop returns True to MischiefManaged, or to figure out why the second loop is behaving asymmetrically to the first. We ought to do the first at some point, but that's probably a lot more work.

Copy link

github-actions bot commented Jul 10, 2024

✅ With the latest revision this PR passed the Python code formatter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants