New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add partition_space to OpenMP #5105
Conversation
I would never merge anything without him looking :) |
|
||
int resources_used = 0; | ||
for (unsigned int i = 0; i < weights.size() - 1; ++i) { | ||
int instance_pool_size = (weights[i] / total_weight) * main_pool_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it is 0
? Are we delegating this to the instance constructor?
If that is not legal (haven't yet looked at the constructor) should we raise a warning/an error here to produce better diagnostic for the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do something for. I just realized that there is also a problem if you pass an empty weights vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about it, the more I am convinced that we should have a generic version that validates the input and then delegate to backend-specific implementations. That way we can guarantee uniformity. (Not asking you to do that as part of this PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other backends do not care about the inputs. They are thrown away. The only thing that matters is the number of weights
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you also want to check for total_weight
not to be zero? Maybe, just check weights[i]
instead of instance_pool_size
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I don't plan to check total_weight
. The goal is to check that someone doesn't have two weights (1, 99)
with main_pool_size = 3
. You can do more checks on the input (like negative weight) but I don't believe it's worth the hassle.
|
||
int resources_used = 0; | ||
for (unsigned int i = 0; i < weights.size() - 1; ++i) { | ||
int instance_pool_size = (weights[i] / total_weight) * main_pool_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you also want to check for total_weight
not to be zero? Maybe, just check weights[i]
instead of instance_pool_size
?
std::cout << "instance " << i << " " << instance_pool_size << " " | ||
<< weights[i] << " " << total_weight << " " << main_pool_size | ||
<< std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you intend to keep this output?
Worked a bit with a testcode to figure out whats going on and what should be going on: #include <Kokkos_Core.hpp>
#include <cmath>
thread_local int id;
int id_counter;
double global_counter;
void busyloop(int N) {
double tmp = 3.0;
for(int i=0; i<N; i++)
tmp += std::pow(tmp,0.9);
global_counter += tmp;
}
int main(int argc, char* argv[]) {
Kokkos::initialize(argc, argv);
{
int N = argc > 1 ? atoi(argv[1]) : 1000000;
int R = argc > 2 ? atoi(argv[2]) : 10;
#pragma omp parallel
{
id = 100 + Kokkos::atomic_fetch_inc(&id_counter);
}
{
Kokkos::Timer timer;
int num_threads = Kokkos::OpenMP().concurrency();
#pragma omp parallel num_threads(4)
{
printf("omp num id nesting level: %i %i %i %i %i\n",int(omp_get_num_threads()),int(omp_get_thread_num()),id,int(omp_get_nested()),int(omp_get_level()));
#pragma omp parallel num_threads(num_threads/4)
{
printf("nested_omp num id nesting level: %i %i %i %i %i\n",int(omp_get_num_threads()),int(omp_get_thread_num()),id,int(omp_get_nested()),int(omp_get_level()));
busyloop(N);
}
}
printf("time: %lf\n",timer.seconds());
}
auto instances = Kokkos::Experimental::partition_space(Kokkos::OpenMP(),0.25,0.25,0.25,0.25);
{
Kokkos::Timer timer;
int num_threads = Kokkos::OpenMP().concurrency();
#pragma omp parallel num_threads(4)
{
auto exec = instances[omp_get_thread_num()];
printf("omp num id nesting level: %i %i %i %i %i %i\n",int(omp_get_num_threads()),int(omp_get_thread_num()),id,int(omp_get_nested()),int(omp_get_level()),exec.concurrency());
Kokkos::parallel_for(Kokkos::RangePolicy<>(exec,0,num_threads/4),[=](int i)
{
printf("nested_omp num id nesting level: %i (%i %i %i) %i %i\n",int(omp_get_num_threads()),int(omp_get_thread_num()),i,id,int(omp_get_nested()),int(omp_get_level()));
busyloop(N);
//Kokkos::parallel_for(Kokkos::RangePolicy<>(exec,0,num_threads/4),[=](int i)
//{});
});
}
printf("time: %lf\n",timer.seconds());
}
}
Kokkos::finalize();
} |
@crtrott I have fixed |
@crtrott ping |
Retest this please |
// static | ||
return OpenMP::in_parallel() | ||
inline int OpenMP::impl_thread_pool_size(OpenMP const& exec_space) noexcept { | ||
return OpenMP::in_parallel(exec_space) | ||
? omp_get_num_threads() | ||
: (Impl::t_openmp_instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which cases do we actually need to use t_openmp_instance
? Is that only necessary for partition_master
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's only for partition_master
std::cout << "instance " << i << " " << instance_pool_size << " " | ||
<< weights[i] << " " << total_weight << " " << main_pool_size | ||
<< std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::cout << "instance " << i << " " << instance_pool_size << " " | |
<< weights[i] << " " << total_weight << " " << main_pool_size | |
<< std::endl; |
// Unpack the arguments and create the weight vector. Note that if some of the | ||
// types are not same, you will get a narrowing warning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Unpack the arguments and create the weight vector. Note that if some of the | |
// types are not same, you will get a narrowing warning. | |
// Unpack the arguments and create the weight vector. Note that if not all of the | |
// types are the same, you will get a narrowing warning. |
cda2ca9
to
638578d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Provided a PR here Rombur#2 fixes some issues exposing others.
@@ -135,8 +137,7 @@ class OpenMP { | |||
int requested_partition_size = 0); | |||
#endif | |||
|
|||
// use UniqueToken | |||
static int concurrency(); | |||
static int concurrency(OpenMP const& = OpenMP()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why you had to implement it this way but I feel like this needs a broader design discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't this be a non-static member function as it kind of is supposed to be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that for consistency. The function exists in several backends and it is always static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK we need an immediate follow up on this: this is ONLY static in the other guys because someone suggested that it is the same for every instance of currently existing types: it should NOT be static.
@@ -147,13 +148,12 @@ class OpenMP { | |||
/// \brief Free any resources being consumed by the default execution space | |||
static void impl_finalize(); | |||
|
|||
inline static int impl_thread_pool_size() noexcept; | |||
inline static int impl_thread_pool_size(OpenMP const& = OpenMP()) noexcept; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason to prefer this to a non-static member function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that for consistency. The function exists in several backends and it is always static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
ASSERT_NE(exec1.impl_internal_space_instance(), | ||
exec2.impl_internal_space_instance()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would need to be updated after #5398
@@ -135,8 +137,7 @@ class OpenMP { | |||
int requested_partition_size = 0); | |||
#endif | |||
|
|||
// use UniqueToken | |||
static int concurrency(); | |||
static int concurrency(OpenMP const& = OpenMP()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't this be a non-static member function as it kind of is supposed to be?
|
||
inline static int impl_thread_pool_size(int depth); | ||
inline static int impl_thread_pool_size(int depth, OpenMP const& = OpenMP()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not make this non-static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that for consistency. The function exists in several backends and it is always static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
38385c4
to
5d409bb
Compare
int OpenMP::impl_thread_pool_rank() noexcept { | ||
// FIXME_OPENMP We are forced to use t_openmp_instance because the function is | ||
// static | ||
inline int OpenMP::impl_thread_pool_rank() noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to drop the KOKKOS_INLINE_FUNCTION
annotation? If so we probably should drop the KOKKOS_IF_ON_{HOST,DEVICE}
if (OpenMP::in_parallel() && | ||
!(omp_get_nested() && (omp_get_level() == 1))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We check this condition a bunch of times. Did you consider encapsulating it in a function?
2fcfe3b
to
8a6ef43
Compare
The code has significantly changed since the review.
@@ -74,7 +74,11 @@ class ParallelFor<FunctorType, Kokkos::WorkGraphPolicy<Traits...>, | |||
|
|||
public: | |||
inline void execute() { | |||
#pragma omp parallel num_threads(OpenMP::impl_thread_pool_size()) | |||
// Work around NVHPC 22.5 ICE | |||
int pool_size = OpenMP::impl_thread_pool_size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not [[maybe_unused]]
instead of the workaround below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll change it. The PR started 6 months ago so we couldn't use C++17
8b43366
to
01c47f6
Compare
@@ -194,10 +193,7 @@ void OpenMPInternal::resize_thread_data(size_t pool_reduce_bytes, | |||
|
|||
memory_fence(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need the memory fences here and below after serializing the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can you please comment on the necessity to serialize the loop in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need the memory fences here and below after serializing the loop?
I think we want to keep this memory_fence but we can get rid of the one in the loop.
Also, can you please comment on the necessity to serialize the loop in the first place?
The problem is when you don't have nested parallelism, you only get rank=0
. That's why I am doing the loop in serial to make sure that I really perform the loop.
inline bool in_serial() { | ||
return (OpenMP::in_parallel() && | ||
!(omp_get_nested() && (omp_get_level() == 1))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does in_serial
require in_parallel
? Should this function rather be called is_nested
or execute_in_serial
(also compared to what it's replacing below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is not that it's nested. It's that you want to execute the code in parallel but the loop require nested parallelism which was not enable. I don't think that it should be called is_nested
because that's not the issue itself. I can change it to execute_in_serial
if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me (assuming that there will be some cleanup for the static member functions later).
668551a
to
8cd5fdb
Compare
I see about 2-3% performance regression for LAMMPS with in.lj with size 4,2,2. This is on SKX with GCC 11.1 using 12 threads. 2.27s vs 2.32s. Ran test where I just commented out the lock stuff: and that on its own closes the gap. I guess we can't do anything about that really. |
I did some more experiments and implemented a different locking scheme here. Looks like mutex lock implies stronger memory fencing than we need: Rombur#4, and if we relax that a bit we recover the perf. |
desul::MemoryScopeDevice()); | ||
} | ||
|
||
void release_lock() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy/paste error
20d8f0d
to
ab2984c
Compare
Co-authored-by: Christian Trott <crtrott@sandia.gov>
ab2984c
to
98e5b70
Compare
int lock = desul::atomic_compare_exchange(&m_pool_mutex, 0, 1, | ||
desul::MemoryOrderAcquire(), | ||
desul::MemoryScopeDevice()); | ||
while (lock == 1) | ||
lock = desul::atomic_compare_exchange(&m_pool_mutex, 0, 1, | ||
desul::MemoryOrderAcquire(), | ||
desul::MemoryScopeDevice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int lock = desul::atomic_compare_exchange(&m_pool_mutex, 0, 1, | |
desul::MemoryOrderAcquire(), | |
desul::MemoryScopeDevice()); | |
while (lock == 1) | |
lock = desul::atomic_compare_exchange(&m_pool_mutex, 0, 1, | |
desul::MemoryOrderAcquire(), | |
desul::MemoryScopeDevice()); | |
while (1 == desul::atomic_compare_exchange(&m_pool_mutex, 0, 1, | |
desul::MemoryOrderAcquire(), | |
desul::MemoryScopeDevice())) { | |
// nothing | |
} |
void acquire_lock(); | ||
|
||
void release_lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment that these are related to the pool mutex or update the name to reflect how they relate to the thread pool
c322f35
to
f983dda
Compare
f983dda
to
c534f25
Compare
Here is the Serial fix corresponding to this for the Intel ICE: #5671 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not formatted not tested)
@@ -36,14 +36,18 @@ | |||
|
|||
#include <omp.h> | |||
|
|||
#include <mutex> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <mutex> |
@@ -329,7 +333,7 @@ class UniqueToken<OpenMP, UniqueTokenScope::Global> { | |||
/// \brief acquire value such that 0 <= value < size() | |||
KOKKOS_INLINE_FUNCTION | |||
int acquire() const noexcept { | |||
KOKKOS_IF_ON_HOST((return Kokkos::Impl::t_openmp_hardware_id;)) | |||
KOKKOS_IF_ON_HOST((return omp_get_thread_num();)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping
KOKKOS_PRAGMA_IVDEP_IF_ENABLED | ||
for (auto iwork = m_policy.begin(); iwork < m_policy.end(); ++iwork) { | ||
for (auto iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (auto iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) { | |
for (typename Policy::index_type iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) { |
KOKKOS_PRAGMA_IVDEP_IF_ENABLED | ||
for (auto iwork = m_policy.begin(); iwork < m_policy.end(); ++iwork) { | ||
for (auto iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (auto iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) { | |
for (typename Policy::index_type iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) { |
This is for the MDRangePolicy specialization of ParallelFor and ParallelReduce. Removing the extra policy object fixes classic Intel compiler ICE.
c534f25
to
27ed831
Compare
This is the last part of #4935
The PR does the following:
OpenMP()
ExecutionSpace
inimpl_thread_pool_size()
impl_thread_pool_size()
where necessaryExecutionSpace
by const reference instead of by copy in the default implementation ofpartition_space
. This was already done in the backend implementationspartition_space
for the OpenMP backend. I had to modify a test because the new function emits a warning when the variadic arguments are different types. So you cannot mixint
anddouble
anymore. I don't think that's a big deal and the function is still inExperimental
.@dalg24 please do not merge until @crtrott has time to take a look. It has written the
partition_space
for all the other backends and I would like to have his feedback