
Add partition_space to OpenMP #5105

Merged: 30 commits, Dec 8, 2022

Conversation

@Rombur (Member) commented Jun 9, 2022

This is the last part of #4935

The PR does the following:

  • add a new constructor for OpenMP()
  • use the ExecutionSpace in impl_thread_pool_size()
  • use the new impl_thread_pool_size() where necessary
  • pass the ExecutionSpace by const reference instead of by copy in the default implementation of partition_space (this was already done in the backend implementations)
  • add partition_space for the OpenMP backend. I had to modify a test because the new function emits a narrowing warning when the variadic arguments have different types, so you cannot mix int and double anymore (see the sketch below). I don't think that's a big deal, and the function is still in Experimental.
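For reference, a minimal sketch of the call this enables. The equal 0.25 weights are taken from the test code later in this thread; the commented-out mixed-type call is a made-up counterexample for the narrowing warning:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // OK: all weights share a single type (double).
    auto instances = Kokkos::Experimental::partition_space(
        Kokkos::OpenMP(), 0.25, 0.25, 0.25, 0.25);
    // Would now emit a narrowing warning: int and double weights are mixed.
    // auto bad = Kokkos::Experimental::partition_space(Kokkos::OpenMP(), 1, 0.5);
  }
  Kokkos::finalize();
}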

@dalg24 please do not merge until @crtrott has time to take a look. He has written partition_space for all the other backends and I would like to have his feedback.

@dalg24 requested a review from crtrott June 9, 2022 20:59
@dalg24 (Member) commented Jun 9, 2022

I would never merge anything without him looking :)
I requested a review from him.

core/src/OpenMP/Kokkos_OpenMP_Instance.hpp (resolved)

int resources_used = 0;
for (unsigned int i = 0; i < weights.size() - 1; ++i) {
  int instance_pool_size = (weights[i] / total_weight) * main_pool_size;
Member:

What if it is 0? Are we delegating this to the instance constructor?
If that is not legal (I haven't yet looked at the constructor), should we raise a warning or an error here to produce a better diagnostic for the user?

Member Author:

We should do something about it. I just realized that there is also a problem if you pass an empty weights vector.

Member:

The more I think about it, the more I am convinced that we should have a generic version that validates the input and then delegates to backend-specific implementations. That way we can guarantee uniformity. (Not asking you to do that as part of this PR.)

Member Author:

The other backends do not care about the inputs; they are thrown away. The only thing that matters is the number of weights.

Contributor:

Do you also want to check that total_weight is not zero? Maybe just check weights[i] instead of instance_pool_size?

Member Author:

No, I don't plan to check total_weight. The goal is to check that someone doesn't have two weights (1, 99) with main_pool_size = 3. You could do more checks on the input (like negative weights) but I don't believe it's worth the hassle.
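To make the failure mode concrete, a standalone sketch of the arithmetic under review; the variable names mirror the snippet above and the numbers are the hypothetical (1, 99) case:

#include <cstdio>
#include <vector>

int main() {
  std::vector<double> weights{1., 99.};
  double total_weight   = weights[0] + weights[1];  // 100
  int    main_pool_size = 3;
  // (1 / 100) * 3 = 0.03, which truncates to 0 when converted to int,
  // i.e. the first instance would be handed an empty thread pool.
  int instance_pool_size =
      static_cast<int>((weights[0] / total_weight) * main_pool_size);
  std::printf("instance_pool_size = %d\n", instance_pool_size);  // prints 0
}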

@cz4rs self-requested a review June 10, 2022 18:18

int resources_used = 0;
for (unsigned int i = 0; i < weights.size() - 1; ++i) {
  int instance_pool_size = (weights[i] / total_weight) * main_pool_size;
Contributor:

Do you also want to check that total_weight is not zero? Maybe just check weights[i] instead of instance_pool_size?

Comment on lines 390 to 392
std::cout << "instance " << i << " " << instance_pool_size << " "
          << weights[i] << " " << total_weight << " " << main_pool_size
          << std::endl;
Contributor:

Do you intend to keep this output?

@crtrott (Member) commented Jun 16, 2022

I worked a bit with a test code to figure out what's going on and what should be going on:

#include <Kokkos_Core.hpp>
#include <omp.h>    // omp_get_num_threads, omp_get_thread_num, omp_get_nested, omp_get_level
#include <cmath>    // std::pow
#include <cstdio>   // printf
#include <cstdlib>  // atoi

thread_local int id;
int id_counter;

double global_counter;

void busyloop(int N) {
  double tmp = 3.0;
  for(int i=0; i<N; i++)
    tmp += std::pow(tmp,0.9);
  global_counter += tmp;
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int N = argc > 1 ? atoi(argv[1]) : 1000000;
    int R = argc > 2 ? atoi(argv[2]) : 10;

    #pragma omp parallel
    {
       id = 100 + Kokkos::atomic_fetch_inc(&id_counter);
    }

    {
      Kokkos::Timer timer;
      int num_threads = Kokkos::OpenMP().concurrency();
      #pragma omp parallel num_threads(4)
      {
        printf("omp num id nesting level: %i %i %i %i %i\n",int(omp_get_num_threads()),int(omp_get_thread_num()),id,int(omp_get_nested()),int(omp_get_level()));
        #pragma omp parallel num_threads(num_threads/4)
        {
          printf("nested_omp num id nesting level: %i %i %i %i %i\n",int(omp_get_num_threads()),int(omp_get_thread_num()),id,int(omp_get_nested()),int(omp_get_level()));
          busyloop(N);
        }
      }
      printf("time: %lf\n",timer.seconds());
    }

    auto instances = Kokkos::Experimental::partition_space(Kokkos::OpenMP(),0.25,0.25,0.25,0.25);
    {
      Kokkos::Timer timer;
      int num_threads = Kokkos::OpenMP().concurrency();
      #pragma omp parallel num_threads(4)
      {
        auto exec = instances[omp_get_thread_num()];
        printf("omp num id nesting level: %i %i %i %i %i %i\n",int(omp_get_num_threads()),int(omp_get_thread_num()),id,int(omp_get_nested()),int(omp_get_level()),exec.concurrency());
        Kokkos::parallel_for(Kokkos::RangePolicy<>(exec,0,num_threads/4),[=](int i)
        {
          printf("nested_omp num id nesting level: %i (%i %i %i) %i %i\n",int(omp_get_num_threads()),int(omp_get_thread_num()),i,id,int(omp_get_nested()),int(omp_get_level()));
          busyloop(N);
          //Kokkos::parallel_for(Kokkos::RangePolicy<>(exec,0,num_threads/4),[=](int i)
          //{});
        });
      }
      printf("time: %lf\n",timer.seconds());
    }

  }
  Kokkos::finalize();
}

@Rombur (Member Author) commented Jun 22, 2022

@crtrott I have fixed concurrency. By default the behavior is unchanged, but you can pass an execution space if you have used partition_space. I have added a mutex and locks before accessing the m_pool data. I don't think the code can hang if you launch kernels from different threads; the kernels will be executed serially, similar to what is done in the other backends.
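A sketch of the adjusted query, assuming the static-with-default-argument signature that appears later in the diff (static int concurrency(OpenMP const& = OpenMP())):

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    auto instances = Kokkos::Experimental::partition_space(
        Kokkos::OpenMP(), 0.25, 0.25, 0.25, 0.25);
    // Default behavior is unchanged: the default instance's pool size.
    std::printf("total: %d\n", Kokkos::OpenMP::concurrency());
    // Passing a partitioned instance reports that partition's pool size.
    std::printf("partition 0: %d\n", Kokkos::OpenMP::concurrency(instances[0]));
  }
  Kokkos::finalize();
}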

@Rombur (Member Author) commented Jul 6, 2022

@crtrott ping

@Rombur (Member Author) commented Jul 6, 2022

Retest this please

   // static
-  return OpenMP::in_parallel()
+inline int OpenMP::impl_thread_pool_size(OpenMP const& exec_space) noexcept {
+  return OpenMP::in_parallel(exec_space)
       ? omp_get_num_threads()
       : (Impl::t_openmp_instance
Contributor:

In which cases do we actually need to use t_openmp_instance? Is that only necessary for partition_master?

Member Author:

Yes, it's only for partition_master.

Comment on lines 393 to 395
std::cout << "instance " << i << " " << instance_pool_size << " "
          << weights[i] << " " << total_weight << " " << main_pool_size
          << std::endl;
Contributor:

Suggested change:
-std::cout << "instance " << i << " " << instance_pool_size << " "
-          << weights[i] << " " << total_weight << " " << main_pool_size
-          << std::endl;

Comment on lines 413 to 414
// Unpack the arguments and create the weight vector. Note that if some of the
// types are not same, you will get a narrowing warning.
Contributor:

Suggested change:
-// Unpack the arguments and create the weight vector. Note that if some of the
-// types are not same, you will get a narrowing warning.
+// Unpack the arguments and create the weight vector. Note that if not all of the
+// types are the same, you will get a narrowing warning.

@masterleinad previously approved these changes Aug 26, 2022
@crtrott (Member) left a comment:

Provided a PR here, Rombur#2, that fixes some issues while exposing others.

@@ -135,8 +137,7 @@ class OpenMP {
       int requested_partition_size = 0);
 #endif

-  // use UniqueToken
-  static int concurrency();
+  static int concurrency(OpenMP const& = OpenMP());
Member:

I understand why you had to implement it this way but I feel like this needs a broader design discussion.

Member:

Why can't this be a non-static member function as it kind of is supposed to be?

Member Author:

I did that for consistency. The function exists in several backends and it is always static.

Member:

OK, we need an immediate follow-up on this: this is ONLY static in the other backends because someone suggested that it is the same for every instance of the currently existing types. It should NOT be static.
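For the record, a sketch of the non-static shape this follow-up argues for; the signature is hypothetical and not part of this PR:

class OpenMP {
 public:
  // Reports the pool size of *this* instance, so a partitioned instance
  // can answer differently from the default one.
  int concurrency() const;
};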

@@ -147,13 +148,12 @@ class OpenMP {
   /// \brief Free any resources being consumed by the default execution space
   static void impl_finalize();

-  inline static int impl_thread_pool_size() noexcept;
+  inline static int impl_thread_pool_size(OpenMP const& = OpenMP()) noexcept;
Member:

Any particular reason to prefer this to a non-static member function?

Member Author:

I did that for consistency. The function exists in several backends and it is always static.

Member:

same as above

Comment on lines 81 to 82
ASSERT_NE(exec1.impl_internal_space_instance(),
          exec2.impl_internal_space_instance());
Member:

Would need to be updated after #5398

@@ -135,8 +137,7 @@ class OpenMP {
       int requested_partition_size = 0);
 #endif

-  // use UniqueToken
-  static int concurrency();
+  static int concurrency(OpenMP const& = OpenMP());
Member:

Why can't this be a non-static member function as it kind of is supposed to be?


-  inline static int impl_thread_pool_size(int depth);
+  inline static int impl_thread_pool_size(int depth, OpenMP const& = OpenMP());
Member:

why not make this non-static?

Member Author:

I did that for consistency. The function exists in several backends and it is always static.

Member:

same as above

@Rombur force-pushed the new_partition_space branch 2 times, most recently from 38385c4 to 5d409bb on October 26, 2022 17:59
core/unit_test/TestExecSpacePartitioning.hpp (outdated, resolved)
core/unit_test/TestExecSpacePartitioning.hpp (resolved)
core/src/OpenMP/Kokkos_OpenMP_WorkGraphPolicy.hpp (outdated, resolved)
core/src/OpenMP/Kokkos_OpenMP_WorkGraphPolicy.hpp (outdated, resolved)
-int OpenMP::impl_thread_pool_rank() noexcept {
+// FIXME_OPENMP We are forced to use t_openmp_instance because the function is
+// static
+inline int OpenMP::impl_thread_pool_rank() noexcept {
Member:

Did you mean to drop the KOKKOS_INLINE_FUNCTION annotation? If so, we probably should drop the KOKKOS_IF_ON_{HOST,DEVICE} as well.

Comment on lines 393 to 394
if (OpenMP::in_parallel() &&
    !(omp_get_nested() && (omp_get_level() == 1))) {
Member:

We check this condition a bunch of times. Did you consider encapsulating it in a function?

@Rombur dismissed masterleinad’s stale review October 27, 2022 17:55

The code has significantly changed since the review.

@@ -74,7 +74,11 @@ class ParallelFor<FunctorType, Kokkos::WorkGraphPolicy<Traits...>,

 public:
   inline void execute() {
-#pragma omp parallel num_threads(OpenMP::impl_thread_pool_size())
+    // Work around NVHPC 22.5 ICE
+    int pool_size = OpenMP::impl_thread_pool_size();
Member:

Why not [[maybe_unused]] instead of the workaround below?

Member Author:

Sure, I'll change it. The PR started 6 months ago, when we couldn't use C++17 yet.
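For illustration, the C++17 spelling being suggested, as a standalone sketch; omp_get_max_threads() stands in here for the Kokkos-internal impl_thread_pool_size():

#include <omp.h>
#include <cstdio>

int main() {
  // Hoist the call out of the pragma (the NVHPC 22.5 ICE workaround) and use
  // [[maybe_unused]] instead of a cast-to-void to silence unused warnings.
  [[maybe_unused]] int pool_size = omp_get_max_threads();
#pragma omp parallel num_threads(pool_size)
  { std::printf("thread %d\n", omp_get_thread_num()); }
}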

@@ -194,10 +193,7 @@ void OpenMPInternal::resize_thread_data(size_t pool_reduce_bytes,

memory_fence();
Contributor:

Do you still need the memory fences here and below after serializing the loop?

Contributor:

Also, can you please comment on the necessity to serialize the loop in the first place?

Member Author:

> Do you still need the memory fences here and below after serializing the loop?

I think we want to keep this memory_fence, but we can get rid of the one in the loop.

> Also, can you please comment on the necessity to serialize the loop in the first place?

The problem is that when you don't have nested parallelism, you only get rank 0. That's why I run the loop serially, to make sure the loop is actually performed.

Comment on lines 75 to 50
inline bool in_serial() {
  return (OpenMP::in_parallel() &&
          !(omp_get_nested() && (omp_get_level() == 1)));
}
Contributor:

Why does in_serial require in_parallel? Should this function rather be called is_nested or execute_in_serial (also compared to what it's replacing below)?

Member Author:

The problem is not that it's nested; it's that you want to execute the code in parallel but the loop requires nested parallelism, which was not enabled. I don't think it should be called is_nested because nesting is not the issue itself. I can change it to execute_in_serial if you prefer.
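Under that rename the helper would read as follows; the body is the same as the snippet quoted above:

inline bool execute_in_serial() {
  // We are already inside a parallel region and nested parallelism is not
  // enabled, so a nested "parallel" loop has to run serially.
  return OpenMP::in_parallel() &&
         !(omp_get_nested() && (omp_get_level() == 1));
}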

@masterleinad (Contributor) left a comment:

Looks OK to me (assuming that there will be some cleanup for the static member functions later).

@Rombur (Member Author) commented Dec 5, 2022

After the rebase on #5601, we are hitting #5651. There is also an error with HIP but it is spurious.

@crtrott (Member) commented Dec 7, 2022

I see about a 2-3% performance regression for LAMMPS with in.lj at size 4,2,2. This is on SKX with GCC 11.1 using 12 threads: 2.27 s vs 2.32 s.

I ran a test where I just commented out the lock code, and that on its own closes the gap. I guess we can't really do anything about that.

@crtrott (Member) commented Dec 7, 2022

I did some more experiments and implemented a different locking scheme in Rombur#4. It looks like a mutex lock implies stronger memory fencing than we need; if we relax that a bit, we recover the performance.

desul::MemoryScopeDevice());
}

void release_lock() {}
Member:

what is this?

Member Author:

copy/paste error

Co-authored-by: Christian Trott <crtrott@sandia.gov>
Comment on lines 38 to 44
int lock = desul::atomic_compare_exchange(&m_pool_mutex, 0, 1,
                                          desul::MemoryOrderAcquire(),
                                          desul::MemoryScopeDevice());
while (lock == 1)
  lock = desul::atomic_compare_exchange(&m_pool_mutex, 0, 1,
                                        desul::MemoryOrderAcquire(),
                                        desul::MemoryScopeDevice());
Member:

Suggested change:
-int lock = desul::atomic_compare_exchange(&m_pool_mutex, 0, 1,
-                                          desul::MemoryOrderAcquire(),
-                                          desul::MemoryScopeDevice());
-while (lock == 1)
-  lock = desul::atomic_compare_exchange(&m_pool_mutex, 0, 1,
-                                        desul::MemoryOrderAcquire(),
-                                        desul::MemoryScopeDevice());
+while (1 == desul::atomic_compare_exchange(&m_pool_mutex, 0, 1,
+                                           desul::MemoryOrderAcquire(),
+                                           desul::MemoryScopeDevice())) {
+  // nothing
+}
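For symmetry, a hedged sketch of the matching release; it assumes the same m_pool_mutex member and is not taken from the PR:

void release_lock() {
  // Pair the acquire CAS above with a release store so writes made while
  // holding the lock are visible to the next acquirer.
  desul::atomic_store(&m_pool_mutex, 0, desul::MemoryOrderRelease(),
                      desul::MemoryScopeDevice());
}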

Comment on lines 89 to 91
void acquire_lock();

void release_lock();
Member:

Comment that these are related to the pool mutex, or update the names to reflect how they relate to the thread pool.

@crtrott (Member) commented Dec 8, 2022

Here is the Serial fix corresponding to this for the Intel ICE: #5671

@dalg24 (Member) left a comment:

(not formatted, not tested)

@@ -36,14 +36,18 @@

 #include <omp.h>

+#include <mutex>
Member:

Suggested change:
-#include <mutex>

@@ -329,7 +333,7 @@ class UniqueToken<OpenMP, UniqueTokenScope::Global> {
   /// \brief acquire value such that 0 <= value < size()
   KOKKOS_INLINE_FUNCTION
   int acquire() const noexcept {
-    KOKKOS_IF_ON_HOST((return Kokkos::Impl::t_openmp_hardware_id;))
+    KOKKOS_IF_ON_HOST((return omp_get_thread_num();))
Member:

Ping

   KOKKOS_PRAGMA_IVDEP_IF_ENABLED
-  for (auto iwork = m_policy.begin(); iwork < m_policy.end(); ++iwork) {
+  for (auto iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) {
Member:

Suggested change:
-for (auto iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) {
+for (typename Policy::index_type iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) {

   KOKKOS_PRAGMA_IVDEP_IF_ENABLED
-  for (auto iwork = m_policy.begin(); iwork < m_policy.end(); ++iwork) {
+  for (auto iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) {
Member:

Suggested change:
-for (auto iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) {
+for (typename Policy::index_type iwork = 0; iwork < m_mdr_policy.m_num_tiles; ++iwork) {

This is for the MDRangePolicy specializations of ParallelFor and ParallelReduce. Removing the extra policy object fixes a classic Intel compiler ICE.