Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lldb] Fix the flakey Concurrent tests on macOS #81710

Conversation

jasonmolenda
Copy link
Collaborator

@jasonmolenda jasonmolenda commented Feb 14, 2024

The concurrent tests all do a pthread_join at the end, and concurrent_base.py stops after that pthread_join and sanity checks that only 1 thread is running. On macOS, after pthread_join() has completed, there can be an extra thread still running which is completing the details of that task asynchronously; this causes testsuite failures. When this happens, we see the second thread is in

frame #0: 0x0000000180ce7700 libsystem_kernel.dylib`__ulock_wake + 8
frame #1: 0x0000000180d25ad4 libsystem_pthread.dylib`_pthread_joiner_wake + 52
frame #2: 0x0000000180d23c18 libsystem_pthread.dylib`_pthread_terminate + 384
frame #3: 0x0000000180d23a98 libsystem_pthread.dylib`_pthread_terminate_invoke + 92
frame #4: 0x0000000180d26740 libsystem_pthread.dylib`_pthread_exit + 112
frame #5: 0x0000000180d26040 libsystem_pthread.dylib`_pthread_start + 148

there are none of the functions from the test file present on this thread.

In this patch, instead of counting the number of threads, I iterate over the threads looking for functions from our test file (by name) and only count threads that have at least one of them.

It's a lower frequency failure than the darwin kernel bug causing an extra step instruction mach exception when hardware breakpoint/watchpoints are used, but once I fixed that, this came up as the next most common failure for these tests.

rdar://110555062

The concurrent tests all do a pthread_join at the end, and
concurrent_base.py stops after that pthread_join and sanity checks
that only 1 thread is running.  On macOS, after pthread_join() has
completed, there can be an extra thread still running which is
completing the details of that task asynchronously; this causes
testsuite failures.  When this happens, we see the second thread is
in

frame #0: 0x0000000180ce7700 libsystem_kernel.dylib`__ulock_wake + 8
frame apple#1: 0x0000000180d25ad4 libsystem_pthread.dylib`_pthread_joiner_wake + 52
frame apple#2: 0x0000000180d23c18 libsystem_pthread.dylib`_pthread_terminate + 384
frame apple#3: 0x0000000180d23a98 libsystem_pthread.dylib`_pthread_terminate_invoke + 92
frame apple#4: 0x0000000180d26740 libsystem_pthread.dylib`_pthread_exit + 112
frame apple#5: 0x0000000180d26040 libsystem_pthread.dylib`_pthread_start + 148

there are none of the functions from the test file present on this
thread.

In this patch, instead of counting the number of threads, I iterate
over the threads looking for functions from our test file (by name)
and only count threads that have at least one of them.

It's a lower frequency failure than the darwin kernel bug causing
an extra step instruction mach exception when hardware
breakpoint/watchpoints are used, but once I fixed that, this came
up as the next most common failure for these tests.

rdar://110555062
@llvmbot
Copy link
Collaborator

llvmbot commented Feb 14, 2024

@llvm/pr-subscribers-lldb

Author: Jason Molenda (jasonmolenda)

Changes

The concurrent tests all do a pthread_join at the end, and concurrent_base.py stops after that pthread_join and sanity checks that only 1 thread is running. On macOS, after pthread_join() has completed, there can be an extra thread still running which is completing the details of that task asynchronously; this causes testsuite failures. When this happens, we see the second thread is in

frame #<!-- -->0: 0x0000000180ce7700 libsystem_kernel.dylib`__ulock_wake + 8
frame #<!-- -->1: 0x0000000180d25ad4 libsystem_pthread.dylib`_pthread_joiner_wake + 52
frame #<!-- -->2: 0x0000000180d23c18 libsystem_pthread.dylib`_pthread_terminate + 384
frame #<!-- -->3: 0x0000000180d23a98 libsystem_pthread.dylib`_pthread_terminate_invoke + 92
frame #<!-- -->4: 0x0000000180d26740 libsystem_pthread.dylib`_pthread_exit + 112
frame #<!-- -->5: 0x0000000180d26040 libsystem_pthread.dylib`_pthread_start + 148

there are none of the functions from the test file present on this thread.

In this patch, instead of counting the number of threads, I iterate over the threads looking for functions from our test file (by name) and only count threads that have at least one of them.

It's a lower frequency failure than the darwin kernel bug causing an extra step instruction mach exception when hardware breakpoint/watchpoints are used, but once I fixed that, this came up as the next most common failure for these tests.

rdar://110555062


Full diff: https://github.com/llvm/llvm-project/pull/81710.diff

1 Files Affected:

  • (modified) lldb/packages/Python/lldbsuite/test/concurrent_base.py (+31-3)
diff --git a/lldb/packages/Python/lldbsuite/test/concurrent_base.py b/lldb/packages/Python/lldbsuite/test/concurrent_base.py
index 39eb27fd997471..46d71666d06977 100644
--- a/lldb/packages/Python/lldbsuite/test/concurrent_base.py
+++ b/lldb/packages/Python/lldbsuite/test/concurrent_base.py
@@ -264,12 +264,40 @@ def do_thread_actions(
                 "Expected main thread (finish) breakpoint to be hit once",
             )
 
-            num_threads = self.inferior_process.GetNumThreads()
+            # There should be a single active thread (the main one) which hit
+            # the breakpoint after joining.  Depending on the pthread
+            # implementation we may have a worker thread finishing the pthread_join()
+            # after it has returned.  Filter the threads to only count those
+            # with user functions on them from our test case file,
+            # lldb/test/API/functionalities/thread/concurrent_events/main.cpp
+            user_code_funcnames = [
+                "breakpoint_func",
+                "crash_func",
+                "do_action_args",
+                "dotest",
+                "main",
+                "register_signal_handler",
+                "signal_func",
+                "sigusr1_handler",
+                "start_threads",
+                "watchpoint_func",
+            ]
+            num_threads_with_usercode = 0
+            for t in self.inferior_process.threads:
+                thread_has_user_code = False
+                for f in t.frames:
+                    for funcname in user_code_funcnames:
+                        if funcname in f.GetDisplayFunctionName():
+                            thread_has_user_code = True
+                            break
+                if thread_has_user_code:
+                    num_threads_with_usercode += 1
+
             self.assertEqual(
                 1,
-                num_threads,
+                num_threads_with_usercode,
                 "Expecting 1 thread but seeing %d. Details:%s"
-                % (num_threads, "\n\t".join(self.describe_threads())),
+                % (num_threads_with_usercode, "\n\t".join(self.describe_threads())),
             )
             self.runCmd("continue")
 

Copy link
Member

@JDevlieghere JDevlieghere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jasonmolenda jasonmolenda merged commit dbc40b3 into llvm:main Feb 14, 2024
6 checks passed
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Feb 24, 2024
The concurrent tests all do a pthread_join at the end, and
concurrent_base.py stops after that pthread_join and sanity checks that
only 1 thread is running. On macOS, after pthread_join() has completed,
there can be an extra thread still running which is completing the
details of that task asynchronously; this causes testsuite failures.
When this happens, we see the second thread is in

```
frame #0: 0x0000000180ce7700 libsystem_kernel.dylib`__ulock_wake + 8
frame apple#1: 0x0000000180d25ad4 libsystem_pthread.dylib`_pthread_joiner_wake + 52
frame apple#2: 0x0000000180d23c18 libsystem_pthread.dylib`_pthread_terminate + 384
frame apple#3: 0x0000000180d23a98 libsystem_pthread.dylib`_pthread_terminate_invoke + 92
frame apple#4: 0x0000000180d26740 libsystem_pthread.dylib`_pthread_exit + 112
frame apple#5: 0x0000000180d26040 libsystem_pthread.dylib`_pthread_start + 148
```

there are none of the functions from the test file present on this
thread.

In this patch, instead of counting the number of threads, I iterate over
the threads looking for functions from our test file (by name) and only
count threads that have at least one of them.

It's a lower frequency failure than the darwin kernel bug causing an
extra step instruction mach exception when hardware
breakpoint/watchpoints are used, but once I fixed that, this came up as
the next most common failure for these tests.

rdar://110555062
(cherry picked from commit dbc40b3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants