
Ensure kernels submitted by multiple threads to synchronous execution spaces are enqueued correctly #6151

Open · wants to merge 9 commits into develop from improve_thread_safety
Conversation

@masterleinad (Contributor) commented May 19, 2023

Related to thread-safety questions in #6051 and #4385.
This pull request ensures that kernels submitted to the same execution space instance from multiple threads don't run concurrently. Also, calling fence ensures that any kernel that is running when fence is called completes before fence returns. This should be enough to ensure that is_running in #6051 can use

fence();
return false;

as fallback implementation.

There are still some cases where the added test fails, but I think it's better to address those in follow-up pull requests to limit the size of this pull request (and unblock #6051).

Note that the behavior we aim for is that of the GPU backends:
Kernels submitted to the same execution space instance don't run concurrently and are executed in submission order (which needs external thread-synchronization to be well-defined if multiple threads submit to the same instance).

@masterleinad masterleinad requested a review from crtrott May 19, 2023 15:32
@crtrott (Member) commented May 19, 2023

We need to look at the overhead for this; also, it's not clear to me how fence() would help with the is_running thing here.

@masterleinad (Contributor, Author)

> We need to look at the overhead for this; also, it's not clear to me how fence() would help with the is_running thing here.

In #6051 I am now using a fence-based implementation for all synchronous execution spaces, which should be legal (but possibly pessimistic) since we assume that is_running returns false after fence has been called.
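
For reference, a minimal sketch of such a fallback (hypothetical code, not the actual #6051 implementation; the free-function name is made up for illustration):

#include <Kokkos_Core.hpp>

// Hypothetical sketch of the fence-based fallback: for a synchronous execution space,
// fence() only returns once any kernel that was running when it was called has
// completed, so reporting "not running" afterwards is legal, if pessimistic.
template <class ExecutionSpace>
bool is_running_fallback(const ExecutionSpace& exec) {
  exec.fence("is_running fallback fence");  // wait for all enqueued kernels
  return false;                             // nothing can still be running now
}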

@ajpowelsnl (Contributor)

@masterleinad - would you like to continue discussing this issue in the Wed. meeting?

@masterleinad (Contributor, Author)

> @masterleinad - would you like to continue discussing this issue in the Wed. meeting?

Yes, we should discuss this; depending on priority, either tomorrow or later.

@masterleinad (Contributor, Author)

> We need to look at the overhead for this, [...]

For an empty lambda submitted from a single thread, i.e., no contention (which should be a worst case for applications using one thread), using

#include <Kokkos_Core.hpp>

#include <iostream>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);

  Kokkos::Timer timer;
  timer.reset();

  unsigned int N=10'000'000;

  for (unsigned int i=0; i<N; ++i)
    Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::Serial>(0, 1), KOKKOS_LAMBDA(int){});

  Kokkos::fence();
  auto count_time = timer.seconds();
  std::cout << count_time << std::endl;

  Kokkos::finalize();
  return 0;
}

I observe locally that the execution time increases from 3.47 s to 3.62 s, i.e., it's about 4% slower.

@masterleinad (Contributor, Author)

Needs #6441.

@masterleinad (Contributor, Author)

Retest this please.

@masterleinad masterleinad force-pushed the improve_thread_safety branch 2 times, most recently from 0345344 to 2150074 Compare January 11, 2024 21:38
@masterleinad masterleinad marked this pull request as ready for review January 12, 2024 03:32
@masterleinad masterleinad force-pushed the improve_thread_safety branch 2 times, most recently from fbc6570 to d53c3e6 Compare January 23, 2024 22:21
@masterleinad masterleinad marked this pull request as draft January 25, 2024 14:16
@masterleinad masterleinad marked this pull request as ready for review January 25, 2024 14:16
@nmm0 (Contributor) left a comment

What exactly are the thread-safety guarantees we are getting? E.g., are we also letting two threads submit to the same exec space index? If so, we should test that.

auto it = std::find(all_instances.begin(), all_instances.end(), this);
if (it == all_instances.end()) Kokkos::abort("error");
all_instances.erase(
    std::find(all_instances.begin(), all_instances.end(), this));
Contributor

Why not reuse the iterator it here, since it's guaranteed not to be end()?

Contributor (Author)

Sure.

@@ -338,6 +343,9 @@ class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_instance->resize_thread_data(pool_reduce_size, team_reduce_size,
team_shared_size, thread_local_size);

// Serialize kernels on the same execution space instance
std::lock_guard<std::mutex> lock(m_instance->m_instance_mutex);
Contributor

Same as above

@@ -106,6 +109,9 @@ class ParallelReduce<CombinedFunctorReducerType, Kokkos::RangePolicy<Traits...>,
update);

reducer.final(ptr);

m_instance->release_lock();
Contributor

If user code throws, this could cause a deadlock by never releasing the lock.

Contributor (Author)

I'm happy to rework OpenMP's acquire_lock/release_lock in a different pull request.
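
For illustration, a minimal sketch of how the acquire/release pattern could be made exception-safe with an RAII guard; the instance type here is a hypothetical stand-in, not the real OpenMP backend class:

#include <mutex>

// Stand-in for the backend-internal instance type; only the locking interface matters here.
struct InstanceSketch {
  std::mutex m;
  void acquire_lock() { m.lock(); }
  void release_lock() { m.unlock(); }
};

// RAII guard: the destructor runs during stack unwinding, so the instance lock is
// released even if the user functor throws inside the parallel region.
class InstanceLockGuard {
 public:
  explicit InstanceLockGuard(InstanceSketch& instance) : m_instance(instance) {
    m_instance.acquire_lock();
  }
  ~InstanceLockGuard() { m_instance.release_lock(); }
  InstanceLockGuard(const InstanceLockGuard&) = delete;
  InstanceLockGuard& operator=(const InstanceLockGuard&) = delete;

 private:
  InstanceSketch& m_instance;
};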

@@ -86,6 +91,8 @@ class ParallelScan<FunctorType, Kokkos::RangePolicy<Traits...>,
ParallelScan::template exec_range<WorkTag>(m_functor, m_policy.begin(),
m_policy.end(), update, true);

m_instance->release_lock();
Contributor

Same as before, but it's a pattern we use in existing code...

Kokkos::Tools::Experimental::SpecialSynchronizationCases::
GlobalDeviceSynchronization,
[]() {
std::lock_guard<std::mutex> lock_all_instances(
Contributor

I don't quite understand this code. If this is supposed to ensure that the mutex inside of, e.g., a parallel_for is released before this function exits, what happens if this mutex is acquired before the parallel_for mutex and the order is flipped?

Contributor (Author)

Yes, if any execution space instance is executing a kernel at the time the fence is called, we wait for that kernel to finish before returning here.
If the mutex is acquired first and another thread then tries to run a kernel, that kernel has to wait for the fence. This seems to be the best we can do.
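
A condensed sketch of the scheme being discussed, assuming a registry of instances and a per-instance mutex as in the diff above (names are stand-ins, not the literal PR code):

#include <mutex>
#include <vector>

// Hypothetical stand-in for the per-instance state used in this pull request.
struct InstanceState {
  std::mutex m_instance_mutex;  // held while a kernel runs on this instance
};

std::mutex all_instances_mutex;             // guards the registry below
std::vector<InstanceState*> all_instances;  // every live execution space instance

// Global fence: taking every instance's mutex in turn means we only return once any
// kernel that was running when the fence was called has finished; a kernel submitted
// after the fence started simply blocks on its instance mutex until we release it.
void fence_all_instances_sketch() {
  std::lock_guard<std::mutex> lock_all(all_instances_mutex);
  for (InstanceState* instance : all_instances) {
    std::lock_guard<std::mutex> lock(instance->m_instance_mutex);
  }
}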

@@ -134,15 +141,25 @@ class Serial {
name,
Kokkos::Tools::Experimental::SpecialSynchronizationCases::
GlobalDeviceSynchronization,
[]() {}); // TODO: correct device ID
[]() {
std::lock_guard<std::mutex> lock_all_instances(
Contributor

Same comment as before (OpenMP).

@masterleinad (Contributor, Author)

> What exactly are the thread-safety guarantees we are getting? E.g., are we also letting two threads submit to the same exec space index? If so, we should test that.

The thread-safety guarantee addressed here is that kernels submitted to the same execution space instance are enqueued and don't run concurrently, even if submitted from different threads. That's what the test added here checks.
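
A hedged sketch of a test in that spirit (not the actual test added in this pull request; the helper name and counts are made up):

#include <Kokkos_Core.hpp>

#include <thread>

using CounterView = Kokkos::View<long long, Kokkos::HostSpace>;

// Each call submits n single-iteration kernels to the default host execution space
// instance; every kernel performs a non-atomic increment of the shared counter.
void submit_kernels(CounterView count, int n) {
  for (int i = 0; i < n; ++i)
    Kokkos::parallel_for(
        Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, 1),
        KOKKOS_LAMBDA(int) { ++count(); });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    constexpr int n_submissions = 10000;
    CounterView count("count");

    // Two threads submit to the same (default) instance at the same time.
    std::thread t0(submit_kernels, count, n_submissions);
    std::thread t1(submit_kernels, count, n_submissions);
    t0.join();
    t1.join();
    Kokkos::fence();

    // If kernels on one instance never overlap, no non-atomic increment can be lost.
    if (count() != 2LL * n_submissions) Kokkos::abort("kernels overlapped");
  }
  Kokkos::finalize();
  return 0;
}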

@dalg24 (Member) left a comment

(only reviewed the Serial backend changes so far)

core/src/Serial/Kokkos_Serial.hpp (outdated, resolved)
core/src/Serial/Kokkos_Serial_Parallel_Team.hpp (outdated, resolved)
// Make sure kernels are running sequentially even when using multiple
// threads
std::lock_guard<std::mutex> instance_lock(
    internal_instance->m_instance_mutex);
Member

You should probably lock first thing, or add a comment on why you deferred locking until after the resize.
Same comment below.

Contributor (Author)

We need to lock both the m_thread_team_data_mutex and the m_instance_mutex since we want to make sure that no other kernel modifies the scratch allocations while this kernel is executing.
We can't just take the m_instance_mutex alone (before resize_thread_team_data) since that function might call deallocate, which implies a global fence, and that fence tries to lock the instance mutex, resulting in a deadlock.
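
In other words, the intended ordering looks roughly like the following sketch, where the type and its members are hypothetical stand-ins for the ones referenced above:

#include <cstddef>
#include <mutex>

// Hypothetical stand-in mirroring the members discussed above.
struct SerialInternalSketch {
  std::mutex m_thread_team_data_mutex;  // guards the shared scratch allocations
  std::mutex m_instance_mutex;          // serializes kernels on this instance
  void resize_thread_team_data(std::size_t, std::size_t, std::size_t, std::size_t) {
    // May deallocate old scratch; in Kokkos that deallocation issues a global fence,
    // and the global fence tries to take every instance's m_instance_mutex.
  }
};

void execute_sketch(SerialInternalSketch* internal_instance) {
  // 1. Protect the scratch allocations and do the (possibly fencing) resize while
  //    the instance mutex is NOT yet held; otherwise the fence would deadlock.
  std::lock_guard<std::mutex> scratch_lock(internal_instance->m_thread_team_data_mutex);
  internal_instance->resize_thread_team_data(512, 512, 8, 0);

  // 2. Only now serialize with other kernels submitted to this instance.
  std::lock_guard<std::mutex> instance_lock(internal_instance->m_instance_mutex);
  // ... run the kernel while holding both locks ...
}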

Member

> that function might call deallocate, which implies a global fence

I don't think that's true in this case. We call deallocate from HostSpace which does not fence.

Contributor (Author)

    frame #12: 0x000000010066c384 Kokkos_CoreUnitTest_Serial1`Kokkos::fence(name="HostSpace::impl_deallocate before free") at Kokkos_Core.cpp:1113:47 [opt]
    frame #13: 0x000000010066d60c Kokkos_CoreUnitTest_Serial1`Kokkos::HostSpace::impl_deallocate(this=<unavailable>, arg_label="Kokkos::Serial::scratch_mem", arg_alloc_ptr=0x000000013980b800, arg_alloc_size=11264, arg_logical_size=0, arg_handle=(name = "Host")) const at Kokkos_HostSpace.cpp:122:5 [opt]
    frame #14: 0x000000010066d560 Kokkos_CoreUnitTest_Serial1`Kokkos::HostSpace::deallocate(this=<unavailable>, arg_label=<unavailable>, arg_alloc_ptr=<unavailable>, arg_alloc_size=<unavailable>, arg_logical_size=<unavailable>) const at Kokkos_HostSpace.cpp:115:3 [opt]
    frame #15: 0x00000001006773bc Kokkos_CoreUnitTest_Serial1`Kokkos::Impl::SerialInternal::resize_thread_team_data(this=0x0000600001798000, pool_reduce_bytes=512, team_reduce_bytes=512, team_shared_bytes=8, thread_local_bytes=0) at Kokkos_Serial.cpp:119:13 [opt]
    frame #16: 0x00000001000477e0 Kokkos_CoreUnitTest_Serial1`Kokkos::Impl::ParallelFor<Test::TestSharedAtomicsFunctor<Kokkos::Serial>, Kokkos::TeamPolicy<Kokkos::Serial>, Kokkos::Serial>::execute(this=0x000000016fdfefe0) const at Kokkos_Serial_Parallel_Team.hpp:255:24 [opt]
    frame #17: 0x0000000100047634 Kokkos_CoreUnitTest_Serial1`void Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Serial>, Test::TestSharedAtomicsFunctor<Kokkos::Serial>, void>(str=<unavailable>, policy=0x000000016fdff0d0, functor=0x000000016fdff080) at Kokkos_Parallel.hpp:144:11 [opt]
    frame #18: 0x0000000100045e30 Kokkos_CoreUnitTest_Serial1`Test::serial_atomic_shared_Test::TestBody() [inlined] void Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Serial>, Test::TestSharedAtomicsFunctor<Kokkos::Serial>>(policy=0x000000016fdff0d0, functor=0x000000016fdff080, (null)=0x0000000000000000) at Kokkos_Parallel.hpp:153:3 [opt]
    frame #19: 0x0000000100045e18 Kokkos_CoreUnitTest_Serial1`Test::serial_atomic_shared_Test::TestBody(this=<unavailable>) at TestAtomicOperations_shared.hpp:49:3 [opt]
    frame #20: 0x000000010063cb14 Kokkos_CoreUnitTest_Serial1`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [inlined] void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(object=<unavailable>, method=0x00000000000000010000000000000020, location="the test body") at gtest-all.cc:4101:10 [opt]
    frame #21: 0x000000010063cb04 Kokkos_CoreUnitTest_Serial1`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(object=0x0000600002e88100, method=<unavailable>, location="the test body") at gtest-all.cc:4137:14 [opt]
    frame #22: 0x000000010063ca68 Kokkos_CoreUnitTest_Serial1`testing::Test::Run(this=0x0000600002e88100) at gtest-all.cc:4176:5 [opt]
    frame #23: 0x000000010063e1d4 Kokkos_CoreUnitTest_Serial1`testing::TestInfo::Run(this=0x0000000139605a00) at gtest-all.cc:4326:11 [opt]
    frame #24: 0x000000010063ed90 Kokkos_CoreUnitTest_Serial1`testing::TestSuite::Run(this=0x00000001396048f0) at gtest-all.cc:4480:28 [opt]
    frame #25: 0x000000010064eb74 Kokkos_CoreUnitTest_Serial1`testing::internal::UnitTestImpl::RunAllTests(this=0x000000013a804080) at gtest-all.cc:7320:44 [opt]
    frame #26: 0x000000010064e208 Kokkos_CoreUnitTest_Serial1`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) [inlined] bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=<unavailable>, method=(Kokkos_CoreUnitTest_Serial1`testing::internal::UnitTestImpl::RunAllTests() at gtest-all.cc:7211), location="auxiliary test code (environments or event listeners)") at gtest-all.cc:4101:10 [opt]
    frame #27: 0x000000010064e1f8 Kokkos_CoreUnitTest_Serial1`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x000000013a804080, method=<unavailable>, location="auxiliary test code (environments or event listeners)") at gtest-all.cc:4137:14 [opt]
    frame #28: 0x000000010064e178 Kokkos_CoreUnitTest_Serial1`testing::UnitTest::Run(this=0x0000000100854948) at gtest-all.cc:6903:10 [opt]
    frame #29: 0x000000010000369c Kokkos_CoreUnitTest_Serial1`main [inlined] RUN_ALL_TESTS() at gtest.h:12372:46 [opt]
    frame #30: 0x0000000100003694 Kokkos_CoreUnitTest_Serial1`main(argc=1, argv=0x000000016fdff638) at UnitTestMainInit.cpp:26:16 [opt]
    frame #31: 0x0000000189ebe0e0 dyld`start + 2360

Contributor (Author)

@Rombur The above backtrace shows that we are calling Kokkos::fence in deallocate and that this calls the static fence introduced in this pull request.

Member

I saw that, but my understanding is that the global fence was added to fix a deadlock in the Threads backend, and the code has changed quite a bit since then. I want to understand whether the fence is still necessary or whether we could do something in the Threads backend instead, so that we don't pollute the other backends.

Contributor (Author)

We need the fence (introduced in #4499) for asynchronous host backends like HPX so that we don't free an allocation that is still in use (same reasoning as for the GPU backends). As for the scope of the fence, see 69ce416#r747018936.

Contributor (Author)

I'm open to rediscussing whether deallocate should only fence asynchronous execution spaces (that can access the respective allocation).

if (it == all_instances.end())
  Kokkos::abort(
      "Execution space instance to be removed couldn't be found!");
all_instances.erase(it);
Member

This is interesting: it means we could now diagnose at finalize that some exec space instance has not been destroyed yet.
I am not suggesting you implement it now, just a thought.

Contributor (Author)

Sure. I'm happy to look into that in a follow-up for backends that have a similar implementation.
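
For illustration, such a finalize-time check could be as simple as the following sketch, assuming the all_instances registry from the diff (hypothetical code, not actual Kokkos internals):

#include <cstdio>
#include <mutex>
#include <vector>

std::mutex all_instances_mutex;          // guards the registry below
std::vector<const void*> all_instances;  // stand-in for the registered instance pointers

// Called from finalize: anything still registered was never destroyed by the user.
void warn_on_leaked_instances() {
  std::lock_guard<std::mutex> lock(all_instances_mutex);
  if (!all_instances.empty())
    std::fprintf(stderr,
                 "Kokkos::finalize: %zu execution space instance(s) still alive\n",
                 all_instances.size());
}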

@@ -147,7 +166,9 @@ Serial::Serial(NewInstance)
: m_space_instance(new Impl::SerialInternal, [](Impl::SerialInternal* ptr) {
ptr->finalize();
delete ptr;
}) {}
}) {
m_space_instance->initialize();
Member

Ugggg

@dalg24 (Member) left a comment

I was about to comment that these changes do not guarantee that kernels, fences, and such get enqueued properly as the title line says, only that they don't run concurrently.
But your description is more accurate:

> This pull request ensures that kernels submitted to the same execution space instance from multiple threads don't run concurrently.

Please update the title and make sure the description says that it does not actually guarantee proper enqueuing.

@masterleinad (Contributor, Author)

> Please update the title and make sure the description says that it does not actually guarantee proper enqueuing.

How would you define "enqueueing properly"? The definition I'm following is given by the behavior of the GPU backends, i.e., kernels submitted to the same execution space instance don't run concurrently and are executed in submission order (which needs external thread-synchronization to be well-defined if multiple threads submit to the same instance).

@dalg24 (Member) commented Apr 5, 2024

> Please update the title and make sure the description says that it does not actually guarantee proper enqueuing.

> How would you define "enqueueing properly"? The definition I'm following is given by the behavior of the GPU backends, i.e., kernels submitted to the same execution space instance don't run concurrently and are executed in submission order (which needs external thread-synchronization to be well-defined if multiple threads submit to the same instance).

I was thinking we were lacking the "execution in the same order as submission" part, but I suppose we have some leeway if we clearly document the caveat you added in your reply.

@masterleinad (Contributor, Author)

I amended the description to spell out the expected behavior.

@dalg24 dalg24 requested a review from Rombur April 5, 2024 20:52
@Rombur (Member) commented Apr 5, 2024

The serial tests are timing out

@masterleinad (Contributor, Author)

> The serial tests are timing out

Reverting ac408dd fixed that.
