-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenMPTarget: scratch memory implementation #3611
OpenMPTarget: scratch memory implementation #3611
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to comment out code in the incremental tests?
inline OpenMPTargetExecTeamMember( | ||
const int league_rank, const int league_size, const int team_size, | ||
const int vector_length // const TeamPolicyInternal< OpenMPTarget, | ||
// Properties ...> & team | ||
, | ||
void* const glb_scratch, const int shmem_size_L1, const int shmem_size_L2) | ||
: m_team_shared(nullptr, 0), | ||
: m_team_shared(((char*)glb_scratch + | ||
league_rank * (shmem_size_L1 + shmem_size_L2 + 8)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the pointer be shifted by another 8 bytes since you said you are saving it for reductions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes it should be 16 to support all data types.
Did you mean to comment out code in the incremental tests?
I am not commenting any code in the incremental tests. I am only protecting an input size for the OpenMPTarget backend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are commenting a lot of code in core/unit_test/incremental/Test12a_ThreadScratch.hpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes sorry about that. I did not commit that file 🙄 . Also made a small change in the same tests, where I replace the hard coded scratch level to a const variable.
@@ -733,13 +734,22 @@ class OpenMPTargetExecTeamMember { | |||
using space = execution_space::scratch_memory_space; | |||
|
|||
public: | |||
// FIXME: OPENMPTARGET - 8 bytes at the begining of the scratch space for each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weren't we using FIXME_OPENMPTARGET as token?
Also typo begining -> beginning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dint realize that. I will replace the token and fix the typo.
@@ -105,9 +107,17 @@ struct TeamScratch { | |||
|
|||
TEST(TEST_CATEGORY, IncrTest_12b_TeamScratch) { | |||
TeamScratch<TEST_EXECSPACE> test; | |||
// Fails with larger vector sizes and if the team-size is not a multiple | |||
// of 32. | |||
#ifdef KOKKOS_ENABLE_OPENMPTARGET |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok as a temporal workaround. Has the issue of failing with non-multiples of 32 been communicating to the CLANG folks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this has been communicated to the clang compiler developers yet. I am in the process of writing a small reproducer for this issue to be submitted as a bug. Hopefully this is something that is easily fixable.
MAX_ACTIVE_THREADS * thread_local_bytes; // Thread Private Scratch | ||
const int64_t shmem_size = | ||
shmem_size_L0 + shmem_size_L1; // L0 + L1 scratch memory per team. | ||
const int64_t padding = shmem_size * 10 / 100; // Padding per team. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Over allocating the requested scratch memory by 10% per team to find an open slot sooner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
??? How is this supposed to work? Why
#if
?kokkos/core/src/impl/Kokkos_SharedAlloc.hpp
Lines 51 to 69 in b3e6acd
// undefined at end of file #if defined(KOKKOS_ENABLE_OPENMPTARGET) #if defined(KOKKOS_COMPILER_PGI) #define KOKKOS_IMPL_IF_ON_HOST if (!__builtin_is_device_code()) #else // Note: OpenMPTarget enforces C++17 at configure time #pragma omp begin declare variant match(device = {kind(host)}) static constexpr bool omp_is_initial_device2() { return true; } #pragma omp end declare variant #pragma omp begin declare variant match(device = {kind(nohost)}) static constexpr bool omp_is_initial_device2() { return false; } #pragma omp end declare variant #define KOKKOS_IMPL_IF_ON_HOST if constexpr (omp_is_initial_device2()) #endif #else #define KOKKOS_IMPL_IF_ON_HOST if (true) #endif
Yes, you are right. It has to be #ifdef here.
…d correctly, fails at runtime.
…e scratch array is being used for reduction.
no-undefined-internal : avoid warnings on internal linkage due to the is_omp_initial_device2 that is inside the declare no-pass-failed : avoids warnings arising from simd directives in debug mode.
…mory for the OpenMPTarget backend.
27c8ae9
to
b3e6acd
Compare
The PR is ready for review. |
core/src/Kokkos_View.hpp
Outdated
@@ -1618,7 +1618,7 @@ class View : public ViewTraits<DataType, Properties...> { | |||
: View(arg_prop, | |||
typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3, | |||
arg_N4, arg_N5, arg_N6, arg_N7)) { | |||
#ifdef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST | |||
#if KOKKOS_IMPL_IF_ON_HOST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
??? How is this supposed to work? Why #if
?
kokkos/core/src/impl/Kokkos_SharedAlloc.hpp
Lines 51 to 69 in b3e6acd
// undefined at end of file | |
#if defined(KOKKOS_ENABLE_OPENMPTARGET) | |
#if defined(KOKKOS_COMPILER_PGI) | |
#define KOKKOS_IMPL_IF_ON_HOST if (!__builtin_is_device_code()) | |
#else | |
// Note: OpenMPTarget enforces C++17 at configure time | |
#pragma omp begin declare variant match(device = {kind(host)}) | |
static constexpr bool omp_is_initial_device2() { return true; } | |
#pragma omp end declare variant | |
#pragma omp begin declare variant match(device = {kind(nohost)}) | |
static constexpr bool omp_is_initial_device2() { return false; } | |
#pragma omp end declare variant | |
#define KOKKOS_IMPL_IF_ON_HOST if constexpr (omp_is_initial_device2()) | |
#endif | |
#else | |
#define KOKKOS_IMPL_IF_ON_HOST if (true) | |
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also
kokkos/core/src/impl/Kokkos_SharedAlloc.hpp
Line 519 in b3e6acd
#undef KOKKOS_IMPL_IF_ON_HOST |
…oc.hpp. Removed the #if for KOKKOS_IMPL_IF_ON_HOST macro in Kokkos_View.hpp.
…enMPTarget backend.
…fined(KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST) in Kokkos_View.hpp.
@@ -1618,7 +1618,17 @@ class View : public ViewTraits<DataType, Properties...> { | |||
: View(arg_prop, | |||
typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3, | |||
arg_N4, arg_N5, arg_N6, arg_N7)) { | |||
#ifdef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST | |||
#ifdef KOKKOS_ENABLE_OPENMPTARGET |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK for now we need a better long term solution.
// the number of threads per threadblock will not exceed 1024 threads. In | ||
// future this value should be based on the chosen architecture for the | ||
// OpenMPTarget backend. | ||
int64_t total_size = (shmem_size + 16 + padding) * ((1024 * 80) / team_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably need to build our own list for this, how fun ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also this needs to be 2048*80
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I used 1024 instead of 2048 is because it might be very rare that all the threads in a thread block will be used. Do you want me to use 2048 anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "which scratch buffer position do I need to use" thing needs to be lock based since the number of teams can exceed what the hardware can have in flight concurrently, and thus you run out of space.
void* const glb_scratch, const int shmem_size_L0, const int shmem_size_L1) | ||
: m_team_shared( | ||
((char*)glb_scratch + | ||
league_rank * (shmem_size_L0 + shmem_size_L1 + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not how this works: note league_rank can be larger than the 2048*80/team_size, you need to use a lock array to find an open spot like in CUDA. And then its just "open_index * (shmem_size_L0+shmem_size_L1+16)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm what about my comment here?
// the number of threads per threadblock will not exceed 1024 threads. In | ||
// future this value should be based on the chosen architecture for the | ||
// OpenMPTarget backend. | ||
int64_t total_size = (shmem_size + 16 + padding) * ((1024 * 80) / team_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also this needs to be 2048*80
// reduction. | ||
m_reduce_scratch = | ||
(char*)glb_scratch + | ||
league_rank * (shmem_size_L0 + shmem_size_L1 + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, needs lock based index and correct offset calculation
core/src/impl/Kokkos_SharedAlloc.hpp
Outdated
@@ -54,7 +54,15 @@ | |||
#define KOKKOS_IMPL_IF_ON_HOST if (!__builtin_is_device_code()) | |||
#else | |||
// Note: OpenMPTarget enforces C++17 at configure time | |||
#define KOKKOS_IMPL_IF_ON_HOST if constexpr (omp_is_initial_device()) | |||
#pragma omp begin declare variant match(device = {kind(host)}) | |||
static constexpr bool omp_is_initial_device2() { return true; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets call this differently, to make clear it is NOT a OpenMP runtime function. How about kokkos_omp_on_device()
…d omp_is_initial_device2 routine to kokkos_omp_on_device.
…m in a round-robin manner.
while (lock_array[shmem_block_index] == 0) | ||
; | ||
lock_team = | ||
atomic_compare_exchange(&lock_array[shmem_block_index], 1, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the lock_team unusued?
typename Policy::member_type team(i, league_size, team_size, | ||
vector_length, scratch_ptr, | ||
shmem_size_L0, shmem_size_L1); | ||
m_functor(team); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't call the functor here, just compute the lock_id in the if block, then call the functor, then unlock in another if_block. Pass the lock_id to the team constructor, and if no scratch memory is used pass in a 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments. In particular the lock id need to be used.
…etExecTeamMember. Removed the lock-team while loop in TeamPolicy ParallelFor.
…mory to implementation to dynamically assign shared memory blocks.
|
||
// A lock array to keep track of scratch memory blocks. | ||
// The first element to keep track of the number of free-blocks. | ||
int* lock_array = new int[max_active_league_size + 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be a member of the execution space instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably use openmpTargetSpace allocate for that and DeepCOpy to set it and then is_device_ptr later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also add like 10% or so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably use openmpTargetSpace allocate for that and DeepCOpy to set it and then is_device_ptr later.
I used another target region to initialize it currently. Do you think its better to set it on host and do a DeepCopy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also add like 10% or so
Do you mean just allocate an additional 10% but not use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a) its fine to use a kernel to initialize, put that into a function on the execution space though. I.e. have something like "resize scratch array size"
b) No you do use it for the cycle around. Note this is also the number of scratch partitions you need to allocate.
} else { | ||
// Block until there is atleast one free scratch block is available. | ||
while (lock_array[0] == 0) | ||
; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make much sense why not start with a count of 0? Also are we actually guaranteed that teams are launched in order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and generally the convention is that "its locked" means the value is 1, and finding a 0 means its not locked.
|
||
// Iterate till the first free scratch block is encountered and try and | ||
// get a lock on the block. | ||
int iter = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start at i%length_of_lock_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so they don't step on each others feet all the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified the logic to identify a free block to incorporate all the above comments.
…he allocation of block index in scratch array.
// Loop as long as a shmem_block_index is not found. | ||
while (shmem_block_index == 0) { | ||
// Proceed if there are free blocks available. | ||
if (lock_array[0] > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed. This should ALWAYS be true. The point of the lock array is that it is actually a bit larger then the max number of active teams. So the atomic counter can never reach zero ...
lock_array[0] = max_active_league_size; | ||
// Allocate the lock_array on the device and initialize it to 0's in a | ||
// target region. | ||
int* lock_array = static_cast<int*>(omp_target_alloc( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ptr still needs to come from the OpenMPTarget instance, in particular so you don't reallocate it all the time. Only reallocate if it has to grow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, working on it. I am allocating the size of the lock_array to the maximum possible by assuming the team-size to 32. Since the OpenMPTarget backend has a restriction of needing a minimum of 32 threads per team, the array will not need to grow.
We have
|
…ed a clear_lock routine to free the corresponding memory during finalize.
int* OpenMPTargetExec::get_lock_array(int num_teams) { | ||
Kokkos::Experimental::OpenMPTargetSpace space; | ||
int max_active_league_size = MAX_ACTIVE_THREADS / 32; | ||
if (max_active_league_size < num_teams) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that doesn't make sense. We never need more than max_active_league_size or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just allocate MAX_ACTIVE_THREADS/32 here always. Just delete the first branch. That branch doesn't make sense anyway. This would only be true if you had a team_size < 32 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there. Left a couple comments.
int* OpenMPTargetExec::get_lock_array(int num_teams) { | ||
Kokkos::Experimental::OpenMPTargetSpace space; | ||
int max_active_league_size = MAX_ACTIVE_THREADS / 32; | ||
if (max_active_league_size < num_teams) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just allocate MAX_ACTIVE_THREADS/32 here always. Just delete the first branch. That branch doesn't make sense anyway. This would only be true if you had a team_size < 32 .
shmem_block_index = iter; | ||
else | ||
iter = ++iter % max_active_teams; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good now.
// Free the locked block and increment the number of available free | ||
// blocks. | ||
lock_team = atomic_compare_exchange(&lock_array[shmem_block_index], 1, 0); | ||
atomic_increment(&lock_array[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line needs to go. leftover from the previous approach
iter = ++iter % max_active_teams; | ||
} | ||
// Decrement the number of available free blocks. | ||
atomic_decrement(&lock_array[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line needs to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes deleted.
// Free the locked block and increment the number of available free | ||
// blocks. | ||
lock_team = atomic_compare_exchange(&lock_array[shmem_block_index], 1, 0); | ||
atomic_increment(&lock_array[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line needs to go
Yes I left the logic in there to be able to address the issue in case in future we have team_size < 32 |
…ach. Updated the logic in get_lock_array routine.
A basic implementation for scratch memory.
Modified the KOKKOS_IMPL_IF_ON_HOST macro to use "declare variant " directives.
Reverted the incremental test levels to match the file names.
Patrick Steinbrecher is a major collaborator on this PR .