
Ensure kernels submitted by multiple threads to synchronous execution spaces are enqueued correctly #6151

Open · wants to merge 9 commits into develop from improve_thread_safety
Conversation

@masterleinad (Contributor) commented May 19, 2023

Related to thread-safety questions in #6051 and #4385.
This pull request ensures that kernels submitted to the same execution space instance from multiple threads don't run concurrently. Also, calling fence ensures that any kernel that is running when fence is called completes before fence returns. This should be enough to ensure that is_running in #6051 can use

fence();
return false;

as fallback implementation.

There are still some cases where the added test fails, but I think it's better to address those in follow-up pull requests to limit the size of this pull request (and unblock #6051).

Note that the behavior we aim for is that of the GPU backends:
Kernels submitted to the same execution space instance don't run concurrently and are executed in submission order (which needs external thread-synchronization to be well-defined if multiple threads submit to the same instance).

@masterleinad masterleinad requested a review from crtrott May 19, 2023 15:32
@crtrott (Member) commented May 19, 2023

We need to look at the overhead for this; also, it's not clear to me how fence() would help with the is_running thing here.

@masterleinad (Contributor, Author)

> We need to look at the overhead for this; also, it's not clear to me how fence() would help with the is_running thing here.

In #6051 I am now using a fence-based implementation for all synchronous execution spaces, which should be legal (but possibly pessimistic) since we assume that is_running returns false after fence has been called.
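
For reference, a minimal sketch of such a fallback (hypothetical code, not the actual #6051 implementation; the free-function name is made up for illustration):

#include <Kokkos_Core.hpp>

// Hypothetical sketch of the fence-based fallback: for a synchronous execution space,
// fence() only returns once any kernel that was running when it was called has
// completed, so reporting "not running" afterwards is legal, if pessimistic.
template <class ExecutionSpace>
bool is_running_fallback(const ExecutionSpace& exec) {
  exec.fence("is_running fallback fence");  // wait for all enqueued kernels
  return false;                             // nothing can still be running now
}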

@ajpowelsnl (Contributor)

@masterleinad - would you like to continue discussing this issue in the Wed. meeting?

@masterleinad (Contributor, Author)

> @masterleinad - would you like to continue discussing this issue in the Wed. meeting?

Yes, we should discuss this; depending on priority, either tomorrow or later.

@masterleinad (Contributor, Author)

> We need to look at the overhead for this, [...]

For an empty lambda submitted from a single thread, i.e., no contention (which should be a worst case for applications using one thread), using

#include <Kokkos_Core.hpp>

#include <iostream>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);

  Kokkos::Timer timer;
  timer.reset();

  unsigned int N=10'000'000;

  for (unsigned int i=0; i<N; ++i)
    Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::Serial>(0, 1), KOKKOS_LAMBDA(int){});

  Kokkos::fence();
  auto count_time = timer.seconds();
  std::cout << count_time << std::endl;

  Kokkos::finalize();
  return 0;
}

I observe locally that the execution time increases from 3.47 s to 3.62 s, i.e., it's about 4% slower.

@masterleinad (Contributor, Author)

Needs #6441.

@masterleinad (Contributor, Author)

Retest this please.

@masterleinad masterleinad force-pushed the improve_thread_safety branch 2 times, most recently from 0345344 to 2150074 Compare January 11, 2024 21:38
@masterleinad masterleinad marked this pull request as ready for review January 12, 2024 03:32
@masterleinad masterleinad force-pushed the improve_thread_safety branch 2 times, most recently from fbc6570 to d53c3e6 Compare January 23, 2024 22:21
@masterleinad masterleinad marked this pull request as draft January 25, 2024 14:16
@masterleinad masterleinad marked this pull request as ready for review January 25, 2024 14:16
@nmm0 (Contributor) left a comment

What exactly are the thread-safety guarantees we are getting? E.g., are we also letting two threads submit to the same exec space index? If so, we should test that.

auto it = std::find(all_instances.begin(), all_instances.end(), this);
if (it == all_instances.end()) Kokkos::abort("error");
all_instances.erase(
    std::find(all_instances.begin(), all_instances.end(), this));
Contributor

Why not reuse the iterator it here, since it's guaranteed not to be end()?

Contributor (Author)

Sure.

@@ -338,6 +343,9 @@ class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_instance->resize_thread_data(pool_reduce_size, team_reduce_size,
team_shared_size, thread_local_size);

// Serialize kernels on the same execution space instance
std::lock_guard<std::mutex> lock(m_instance->m_instance_mutex);
Contributor

Same as above

@@ -106,6 +109,9 @@ class ParallelReduce<CombinedFunctorReducerType, Kokkos::RangePolicy<Traits...>,
update);

reducer.final(ptr);

m_instance->release_lock();
Contributor

If user code throws, this could cause a deadlock by never releasing the lock.

Contributor (Author)

I'm happy to rework OpenMP's acquire_lock/release_lock in a different pull request.
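
For illustration, a minimal sketch of how the acquire/release pattern could be made exception-safe with an RAII guard; the instance type here is a hypothetical stand-in, not the real OpenMP backend class:

#include <mutex>

// Stand-in for the backend-internal instance type; only the locking interface matters here.
struct InstanceSketch {
  std::mutex m;
  void acquire_lock() { m.lock(); }
  void release_lock() { m.unlock(); }
};

// RAII guard: the destructor runs during stack unwinding, so the instance lock is
// released even if the user functor throws inside the parallel region.
class InstanceLockGuard {
 public:
  explicit InstanceLockGuard(InstanceSketch& instance) : m_instance(instance) {
    m_instance.acquire_lock();
  }
  ~InstanceLockGuard() { m_instance.release_lock(); }
  InstanceLockGuard(const InstanceLockGuard&) = delete;
  InstanceLockGuard& operator=(const InstanceLockGuard&) = delete;

 private:
  InstanceSketch& m_instance;
};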

@@ -86,6 +91,8 @@ class ParallelScan<FunctorType, Kokkos::RangePolicy<Traits...>,
ParallelScan::template exec_range<WorkTag>(m_functor, m_policy.begin(),
m_policy.end(), update, true);

m_instance->release_lock();
Contributor

Same as before, but it's a pattern we use in existing code...

Kokkos::Tools::Experimental::SpecialSynchronizationCases::
GlobalDeviceSynchronization,
[]() {
std::lock_guard<std::mutex> lock_all_instances(
Contributor

I don't quite understand this code. If this is supposed to ensure that the mutex inside of, e.g., a parallel_for is released before this function exits, what happens if this mutex is acquired before the parallel_for mutex and the order is flipped?

Contributor (Author)

Yes, if any execution space instance is executing a kernel at the time the fence is called, we wait for that kernel to finish before returning here.
If the mutex is acquired first and another thread then tries to run a kernel, that kernel has to wait for the fence. This seems to be the best we can do.
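
A condensed sketch of the scheme being discussed, assuming a registry of instances and a per-instance mutex as in the diff above (names are stand-ins, not the literal PR code):

#include <mutex>
#include <vector>

// Hypothetical stand-in for the per-instance state used in this pull request.
struct InstanceState {
  std::mutex m_instance_mutex;  // held while a kernel runs on this instance
};

std::mutex all_instances_mutex;             // guards the registry below
std::vector<InstanceState*> all_instances;  // every live execution space instance

// Global fence: taking every instance's mutex in turn means we only return once any
// kernel that was running when the fence was called has finished; a kernel submitted
// after the fence started simply blocks on its instance mutex until we release it.
void fence_all_instances_sketch() {
  std::lock_guard<std::mutex> lock_all(all_instances_mutex);
  for (InstanceState* instance : all_instances) {
    std::lock_guard<std::mutex> lock(instance->m_instance_mutex);
  }
}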

@@ -134,15 +141,25 @@ class Serial {
name,
Kokkos::Tools::Experimental::SpecialSynchronizationCases::
GlobalDeviceSynchronization,
[]() {}); // TODO: correct device ID
[]() {
std::lock_guard<std::mutex> lock_all_instances(
Contributor

Same comment as before (OpenMP).

@masterleinad (Contributor, Author)

> What exactly are the thread-safety guarantees we are getting? E.g., are we also letting two threads submit to the same exec space index? If so, we should test that.

The thread-safety guarantee addressed here is that kernels submitted to the same execution space instance are enqueued and don't run concurrently, even if submitted from different threads. That's what the test added here checks.
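
A hedged sketch of a test in that spirit (not the actual test added in this pull request; the helper name and counts are made up):

#include <Kokkos_Core.hpp>

#include <thread>

using CounterView = Kokkos::View<long long, Kokkos::HostSpace>;

// Each call submits n single-iteration kernels to the default host execution space
// instance; every kernel performs a non-atomic increment of the shared counter.
void submit_kernels(CounterView count, int n) {
  for (int i = 0; i < n; ++i)
    Kokkos::parallel_for(
        Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, 1),
        KOKKOS_LAMBDA(int) { ++count(); });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    constexpr int n_submissions = 10000;
    CounterView count("count");

    // Two threads submit to the same (default) instance at the same time.
    std::thread t0(submit_kernels, count, n_submissions);
    std::thread t1(submit_kernels, count, n_submissions);
    t0.join();
    t1.join();
    Kokkos::fence();

    // If kernels on one instance never overlap, no non-atomic increment can be lost.
    if (count() != 2LL * n_submissions) Kokkos::abort("kernels overlapped");
  }
  Kokkos::finalize();
  return 0;
}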

@dalg24 (Member) left a comment

(only reviewed the Serial backend changes so far)

core/src/Serial/Kokkos_Serial.hpp (outdated, resolved)
core/src/Serial/Kokkos_Serial_Parallel_Team.hpp (outdated, resolved)
// Make sure kernels are running sequentially even when using multiple
// threads
std::lock_guard<std::mutex> instance_lock(
    internal_instance->m_instance_mutex);
Member

You should probably lock first thing, or add a comment on why you deferred locking until after the resize.
Same comment below.

Contributor (Author)

We need to lock both the m_thread_team_data_mutex and the m_instance_mutex since we want to make sure that no other kernel modifies the scratch allocations while this kernel is executing.
We can't just take the m_instance_mutex alone (before resize_thread_team_data) since that function might call deallocate, which implies a global fence, and that fence tries to lock the instance mutex, resulting in a deadlock.
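
In other words, the intended ordering looks roughly like the following sketch, where the type and its members are hypothetical stand-ins for the ones referenced above:

#include <cstddef>
#include <mutex>

// Hypothetical stand-in mirroring the members discussed above.
struct SerialInternalSketch {
  std::mutex m_thread_team_data_mutex;  // guards the shared scratch allocations
  std::mutex m_instance_mutex;          // serializes kernels on this instance
  void resize_thread_team_data(std::size_t, std::size_t, std::size_t, std::size_t) {
    // May deallocate old scratch; in Kokkos that deallocation issues a global fence,
    // and the global fence tries to take every instance's m_instance_mutex.
  }
};

void execute_sketch(SerialInternalSketch* internal_instance) {
  // 1. Protect the scratch allocations and do the (possibly fencing) resize while
  //    the instance mutex is NOT yet held; otherwise the fence would deadlock.
  std::lock_guard<std::mutex> scratch_lock(internal_instance->m_thread_team_data_mutex);
  internal_instance->resize_thread_team_data(512, 512, 8, 0);

  // 2. Only now serialize with other kernels submitted to this instance.
  std::lock_guard<std::mutex> instance_lock(internal_instance->m_instance_mutex);
  // ... run the kernel while holding both locks ...
}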

Member

> that function might call deallocate, which implies a global fence

I don't think that's true in this case. We call deallocate from HostSpace which does not fence.

Contributor (Author)

    frame #12: 0x000000010066c384 Kokkos_CoreUnitTest_Serial1`Kokkos::fence(name="HostSpace::impl_deallocate before free") at Kokkos_Core.cpp:1113:47 [opt]
    frame #13: 0x000000010066d60c Kokkos_CoreUnitTest_Serial1`Kokkos::HostSpace::impl_deallocate(this=<unavailable>, arg_label="Kokkos::Serial::scratch_mem", arg_alloc_ptr=0x000000013980b800, arg_alloc_size=11264, arg_logical_size=0, arg_handle=(name = "Host")) const at Kokkos_HostSpace.cpp:122:5 [opt]
    frame #14: 0x000000010066d560 Kokkos_CoreUnitTest_Serial1`Kokkos::HostSpace::deallocate(this=<unavailable>, arg_label=<unavailable>, arg_alloc_ptr=<unavailable>, arg_alloc_size=<unavailable>, arg_logical_size=<unavailable>) const at Kokkos_HostSpace.cpp:115:3 [opt]
    frame #15: 0x00000001006773bc Kokkos_CoreUnitTest_Serial1`Kokkos::Impl::SerialInternal::resize_thread_team_data(this=0x0000600001798000, pool_reduce_bytes=512, team_reduce_bytes=512, team_shared_bytes=8, thread_local_bytes=0) at Kokkos_Serial.cpp:119:13 [opt]
    frame #16: 0x00000001000477e0 Kokkos_CoreUnitTest_Serial1`Kokkos::Impl::ParallelFor<Test::TestSharedAtomicsFunctor<Kokkos::Serial>, Kokkos::TeamPolicy<Kokkos::Serial>, Kokkos::Serial>::execute(this=0x000000016fdfefe0) const at Kokkos_Serial_Parallel_Team.hpp:255:24 [opt]
    frame #17: 0x0000000100047634 Kokkos_CoreUnitTest_Serial1`void Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Serial>, Test::TestSharedAtomicsFunctor<Kokkos::Serial>, void>(str=<unavailable>, policy=0x000000016fdff0d0, functor=0x000000016fdff080) at Kokkos_Parallel.hpp:144:11 [opt]
    frame #18: 0x0000000100045e30 Kokkos_CoreUnitTest_Serial1`Test::serial_atomic_shared_Test::TestBody() [inlined] void Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Serial>, Test::TestSharedAtomicsFunctor<Kokkos::Serial>>(policy=0x000000016fdff0d0, functor=0x000000016fdff080, (null)=0x0000000000000000) at Kokkos_Parallel.hpp:153:3 [opt]
    frame #19: 0x0000000100045e18 Kokkos_CoreUnitTest_Serial1`Test::serial_atomic_shared_Test::TestBody(this=<unavailable>) at TestAtomicOperations_shared.hpp:49:3 [opt]
    frame #20: 0x000000010063cb14 Kokkos_CoreUnitTest_Serial1`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [inlined] void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(object=<unavailable>, method=0x00000000000000010000000000000020, location="the test body") at gtest-all.cc:4101:10 [opt]
    frame #21: 0x000000010063cb04 Kokkos_CoreUnitTest_Serial1`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(object=0x0000600002e88100, method=<unavailable>, location="the test body") at gtest-all.cc:4137:14 [opt]
    frame #22: 0x000000010063ca68 Kokkos_CoreUnitTest_Serial1`testing::Test::Run(this=0x0000600002e88100) at gtest-all.cc:4176:5 [opt]
    frame #23: 0x000000010063e1d4 Kokkos_CoreUnitTest_Serial1`testing::TestInfo::Run(this=0x0000000139605a00) at gtest-all.cc:4326:11 [opt]
    frame #24: 0x000000010063ed90 Kokkos_CoreUnitTest_Serial1`testing::TestSuite::Run(this=0x00000001396048f0) at gtest-all.cc:4480:28 [opt]
    frame #25: 0x000000010064eb74 Kokkos_CoreUnitTest_Serial1`testing::internal::UnitTestImpl::RunAllTests(this=0x000000013a804080) at gtest-all.cc:7320:44 [opt]
    frame #26: 0x000000010064e208 Kokkos_CoreUnitTest_Serial1`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) [inlined] bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=<unavailable>, method=(Kokkos_CoreUnitTest_Serial1`testing::internal::UnitTestImpl::RunAllTests() at gtest-all.cc:7211), location="auxiliary test code (environments or event listeners)") at gtest-all.cc:4101:10 [opt]
    frame #27: 0x000000010064e1f8 Kokkos_CoreUnitTest_Serial1`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x000000013a804080, method=<unavailable>, location="auxiliary test code (environments or event listeners)") at gtest-all.cc:4137:14 [opt]
    frame #28: 0x000000010064e178 Kokkos_CoreUnitTest_Serial1`testing::UnitTest::Run(this=0x0000000100854948) at gtest-all.cc:6903:10 [opt]
    frame #29: 0x000000010000369c Kokkos_CoreUnitTest_Serial1`main [inlined] RUN_ALL_TESTS() at gtest.h:12372:46 [opt]
    frame #30: 0x0000000100003694 Kokkos_CoreUnitTest_Serial1`main(argc=1, argv=0x000000016fdff638) at UnitTestMainInit.cpp:26:16 [opt]
    frame #31: 0x0000000189ebe0e0 dyld`start + 2360

Contributor (Author)

@Rombur The above backtrace shows that we are calling Kokkos::fence in deallocate and that this calls the static fence introduced in this pull request.

Member

I saw that, but my understanding is that the global fence was added to fix a deadlock in the Threads backend, and the code has changed quite a bit since then. I want to understand whether the fence is still necessary or whether we could do something in the Threads backend instead, so that we don't pollute the other backends.

Contributor (Author)

We need the fence (introduced in #4499) for asynchronous host backends like HPX so that we don't free an allocation that is still in use (same reasoning as for the GPU backends). As for the scope of the fence, see 69ce416#r747018936.

Contributor (Author)

I'm open to rediscussing whether deallocate should only fence asynchronous execution spaces (that can access the respective allocation).

if (it == all_instances.end())
  Kokkos::abort(
      "Execution space instance to be removed couldn't be found!");
all_instances.erase(it);
Member

This is interesting: it means we could now diagnose at finalize that some exec space instance has not been destroyed yet.
I am not suggesting you implement it now, just a thought.

Contributor (Author)

Sure. I'm happy to look into that in a follow-up for backends that have a similar implementation.
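
For illustration, such a finalize-time check could be as simple as the following sketch, assuming the all_instances registry from the diff (hypothetical code, not actual Kokkos internals):

#include <cstdio>
#include <mutex>
#include <vector>

std::mutex all_instances_mutex;          // guards the registry below
std::vector<const void*> all_instances;  // stand-in for the registered instance pointers

// Called from finalize: anything still registered was never destroyed by the user.
void warn_on_leaked_instances() {
  std::lock_guard<std::mutex> lock(all_instances_mutex);
  if (!all_instances.empty())
    std::fprintf(stderr,
                 "Kokkos::finalize: %zu execution space instance(s) still alive\n",
                 all_instances.size());
}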

@@ -147,7 +166,9 @@ Serial::Serial(NewInstance)
: m_space_instance(new Impl::SerialInternal, [](Impl::SerialInternal* ptr) {
ptr->finalize();
delete ptr;
}) {}
}) {
m_space_instance->initialize();
Member

Ugggg

@dalg24 (Member) left a comment

I was about to comment that these changes do not guarantee that kernels, fences, and such get enqueued properly as the title line says, only that they don't run concurrently.
But your description is more accurate:

> This pull request ensures that kernels submitted to the same execution space instance from multiple threads don't run concurrently.

Please update the title and make sure the description says that it does not actually guarantee proper enqueuing.

@masterleinad (Contributor, Author)

> Please update the title and make sure the description says that it does not actually guarantee proper enqueuing.

How would you define "enqueueing properly"? The definition I'm following is given by the behavior of the GPU backends, i.e., kernels submitted to the same execution space instance don't run concurrently and are executed in submission order (which needs external thread-synchronization to be well-defined if multiple threads submit to the same instance).

@dalg24 (Member) commented Apr 5, 2024

> Please update the title and make sure the description says that it does not actually guarantee proper enqueuing.

> How would you define "enqueueing properly"? The definition I'm following is given by the behavior of the GPU backends, i.e., kernels submitted to the same execution space instance don't run concurrently and are executed in submission order (which needs external thread-synchronization to be well-defined if multiple threads submit to the same instance).

I was thinking we were lacking the "execution in the same order as submission" part, but I suppose we have some leeway if we clearly document the caveat you added in your reply.

@masterleinad (Contributor, Author)

I amended the description to spell out the expected behavior.

@dalg24 dalg24 requested a review from Rombur April 5, 2024 20:52
@Rombur (Member) commented Apr 5, 2024

The serial tests are timing out

@masterleinad (Contributor, Author)

> The serial tests are timing out

Reverting ac408dd fixed that.
