Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenMPTarget: scratch memory implementation #3611

Merged
merged 24 commits into from
Jan 22, 2021

Conversation

rgayatri23
Copy link
Contributor

A basic implementation for scratch memory.
Modified the KOKKOS_IMPL_IF_ON_HOST macro to use "declare variant " directives.
Reverted the incremental test levels to match the file names.
Patrick Steinbrecher is a major collaborator on this PR .

Copy link
Member

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to comment out code in the incremental tests?

inline OpenMPTargetExecTeamMember(
const int league_rank, const int league_size, const int team_size,
const int vector_length // const TeamPolicyInternal< OpenMPTarget,
// Properties ...> & team
,
void* const glb_scratch, const int shmem_size_L1, const int shmem_size_L2)
: m_team_shared(nullptr, 0),
: m_team_shared(((char*)glb_scratch +
league_rank * (shmem_size_L1 + shmem_size_L2 + 8)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the pointer be shifted by another 8 bytes since you said you are saving it for reductions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes it should be 16 to support all data types.

Did you mean to comment out code in the incremental tests?

I am not commenting any code in the incremental tests. I am only protecting an input size for the OpenMPTarget backend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are commenting a lot of code in core/unit_test/incremental/Test12a_ThreadScratch.hpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sorry about that. I did not commit that file 🙄 . Also made a small change in the same tests, where I replace the hard coded scratch level to a const variable.

@@ -733,13 +734,22 @@ class OpenMPTargetExecTeamMember {
using space = execution_space::scratch_memory_space;

public:
// FIXME: OPENMPTARGET - 8 bytes at the begining of the scratch space for each
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weren't we using FIXME_OPENMPTARGET as token?
Also typo begining -> beginning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dint realize that. I will replace the token and fix the typo.

@@ -105,9 +107,17 @@ struct TeamScratch {

TEST(TEST_CATEGORY, IncrTest_12b_TeamScratch) {
TeamScratch<TEST_EXECSPACE> test;
// Fails with larger vector sizes and if the team-size is not a multiple
// of 32.
#ifdef KOKKOS_ENABLE_OPENMPTARGET
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok as a temporal workaround. Has the issue of failing with non-multiples of 32 been communicating to the CLANG folks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this has been communicated to the clang compiler developers yet. I am in the process of writing a small reproducer for this issue to be submitted as a bug. Hopefully this is something that is easily fixable.

.jenkins Show resolved Hide resolved
MAX_ACTIVE_THREADS * thread_local_bytes; // Thread Private Scratch
const int64_t shmem_size =
shmem_size_L0 + shmem_size_L1; // L0 + L1 scratch memory per team.
const int64_t padding = shmem_size * 10 / 100; // Padding per team.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Over allocating the requested scratch memory by 10% per team to find an open slot sooner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??? How is this supposed to work? Why #if?

// undefined at end of file
#if defined(KOKKOS_ENABLE_OPENMPTARGET)
#if defined(KOKKOS_COMPILER_PGI)
#define KOKKOS_IMPL_IF_ON_HOST if (!__builtin_is_device_code())
#else
// Note: OpenMPTarget enforces C++17 at configure time
#pragma omp begin declare variant match(device = {kind(host)})
static constexpr bool omp_is_initial_device2() { return true; }
#pragma omp end declare variant
#pragma omp begin declare variant match(device = {kind(nohost)})
static constexpr bool omp_is_initial_device2() { return false; }
#pragma omp end declare variant
#define KOKKOS_IMPL_IF_ON_HOST if constexpr (omp_is_initial_device2())
#endif
#else
#define KOKKOS_IMPL_IF_ON_HOST if (true)
#endif

Yes, you are right. It has to be #ifdef here.

@rgayatri23
Copy link
Contributor Author

rgayatri23 commented Dec 4, 2020

The PR is ready for review.

@@ -1618,7 +1618,7 @@ class View : public ViewTraits<DataType, Properties...> {
: View(arg_prop,
typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3,
arg_N4, arg_N5, arg_N6, arg_N7)) {
#ifdef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST
#if KOKKOS_IMPL_IF_ON_HOST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??? How is this supposed to work? Why #if?

// undefined at end of file
#if defined(KOKKOS_ENABLE_OPENMPTARGET)
#if defined(KOKKOS_COMPILER_PGI)
#define KOKKOS_IMPL_IF_ON_HOST if (!__builtin_is_device_code())
#else
// Note: OpenMPTarget enforces C++17 at configure time
#pragma omp begin declare variant match(device = {kind(host)})
static constexpr bool omp_is_initial_device2() { return true; }
#pragma omp end declare variant
#pragma omp begin declare variant match(device = {kind(nohost)})
static constexpr bool omp_is_initial_device2() { return false; }
#pragma omp end declare variant
#define KOKKOS_IMPL_IF_ON_HOST if constexpr (omp_is_initial_device2())
#endif
#else
#define KOKKOS_IMPL_IF_ON_HOST if (true)
#endif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also

#undef KOKKOS_IMPL_IF_ON_HOST

rgayatri added 3 commits December 4, 2020 13:52
…oc.hpp. Removed the #if for KOKKOS_IMPL_IF_ON_HOST macro in Kokkos_View.hpp.
…fined(KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST) in Kokkos_View.hpp.
@@ -1618,7 +1618,17 @@ class View : public ViewTraits<DataType, Properties...> {
: View(arg_prop,
typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3,
arg_N4, arg_N5, arg_N6, arg_N7)) {
#ifdef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST
#ifdef KOKKOS_ENABLE_OPENMPTARGET
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK for now we need a better long term solution.

// the number of threads per threadblock will not exceed 1024 threads. In
// future this value should be based on the chosen architecture for the
// OpenMPTarget backend.
int64_t total_size = (shmem_size + 16 + padding) * ((1024 * 80) / team_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need to build our own list for this, how fun ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this needs to be 2048*80

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I used 1024 instead of 2048 is because it might be very rare that all the threads in a thread block will be used. Do you want me to use 2048 anyway?

Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "which scratch buffer position do I need to use" thing needs to be lock based since the number of teams can exceed what the hardware can have in flight concurrently, and thus you run out of space.

void* const glb_scratch, const int shmem_size_L0, const int shmem_size_L1)
: m_team_shared(
((char*)glb_scratch +
league_rank * (shmem_size_L0 + shmem_size_L1 +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not how this works: note league_rank can be larger than the 2048*80/team_size, you need to use a lock array to find an open spot like in CUDA. And then its just "open_index * (shmem_size_L0+shmem_size_L1+16)"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm what about my comment here?

// the number of threads per threadblock will not exceed 1024 threads. In
// future this value should be based on the chosen architecture for the
// OpenMPTarget backend.
int64_t total_size = (shmem_size + 16 + padding) * ((1024 * 80) / team_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this needs to be 2048*80

// reduction.
m_reduce_scratch =
(char*)glb_scratch +
league_rank * (shmem_size_L0 + shmem_size_L1 +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, needs lock based index and correct offset calculation

@@ -54,7 +54,15 @@
#define KOKKOS_IMPL_IF_ON_HOST if (!__builtin_is_device_code())
#else
// Note: OpenMPTarget enforces C++17 at configure time
#define KOKKOS_IMPL_IF_ON_HOST if constexpr (omp_is_initial_device())
#pragma omp begin declare variant match(device = {kind(host)})
static constexpr bool omp_is_initial_device2() { return true; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets call this differently, to make clear it is NOT a OpenMP runtime function. How about kokkos_omp_on_device()

…d omp_is_initial_device2 routine to kokkos_omp_on_device.
while (lock_array[shmem_block_index] == 0)
;
lock_team =
atomic_compare_exchange(&lock_array[shmem_block_index], 1, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the lock_team unusued?

typename Policy::member_type team(i, league_size, team_size,
vector_length, scratch_ptr,
shmem_size_L0, shmem_size_L1);
m_functor(team);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't call the functor here, just compute the lock_id in the if block, then call the functor, then unlock in another if_block. Pass the lock_id to the team constructor, and if no scratch memory is used pass in a 0.

Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments. In particular the lock id need to be used.

rgayatri added 2 commits December 20, 2020 21:24
…etExecTeamMember. Removed the lock-team while loop in TeamPolicy ParallelFor.
…mory to implementation to dynamically assign shared memory blocks.

// A lock array to keep track of scratch memory blocks.
// The first element to keep track of the number of free-blocks.
int* lock_array = new int[max_active_league_size + 1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be a member of the execution space instance

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably use openmpTargetSpace allocate for that and DeepCOpy to set it and then is_device_ptr later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add like 10% or so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably use openmpTargetSpace allocate for that and DeepCOpy to set it and then is_device_ptr later.

I used another target region to initialize it currently. Do you think its better to set it on host and do a DeepCopy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add like 10% or so

Do you mean just allocate an additional 10% but not use it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) its fine to use a kernel to initialize, put that into a function on the execution space though. I.e. have something like "resize scratch array size"
b) No you do use it for the cycle around. Note this is also the number of scratch partitions you need to allocate.

} else {
// Block until there is atleast one free scratch block is available.
while (lock_array[0] == 0)
;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make much sense why not start with a count of 0? Also are we actually guaranteed that teams are launched in order?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and generally the convention is that "its locked" means the value is 1, and finding a 0 means its not locked.


// Iterate till the first free scratch block is encountered and try and
// get a lock on the block.
int iter = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start at i%length_of_lock_array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so they don't step on each others feet all the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the logic to identify a free block to incorporate all the above comments.

@masterleinad masterleinad linked an issue Jan 13, 2021 that may be closed by this pull request
…he allocation of block index in scratch array.
// Loop as long as a shmem_block_index is not found.
while (shmem_block_index == 0) {
// Proceed if there are free blocks available.
if (lock_array[0] > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed. This should ALWAYS be true. The point of the lock array is that it is actually a bit larger then the max number of active teams. So the atomic counter can never reach zero ...

lock_array[0] = max_active_league_size;
// Allocate the lock_array on the device and initialize it to 0's in a
// target region.
int* lock_array = static_cast<int*>(omp_target_alloc(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ptr still needs to come from the OpenMPTarget instance, in particular so you don't reallocate it all the time. Only reallocate if it has to grow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, working on it. I am allocating the size of the lock_array to the maximum possible by assuming the team-size to 32. Since the OpenMPTarget backend has a restriction of needing a minimum of 32 threads per team, the array will not need to grow.

@masterleinad
Copy link
Contributor

We have

4: Libomptarget error: Call to omp_target_memcpy with invalid arguments

in https://cloud.cees.ornl.gov/jenkins-ci/blue/organizations/jenkins/Kokkos/detail/Kokkos/4055/pipeline#step-155-log-1139 and also

4/42 Test  #4: KokkosCore_UnitTest_OpenMPTarget .............***Timeout 1500.16 sec

int* OpenMPTargetExec::get_lock_array(int num_teams) {
Kokkos::Experimental::OpenMPTargetSpace space;
int max_active_league_size = MAX_ACTIVE_THREADS / 32;
if (max_active_league_size < num_teams) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that doesn't make sense. We never need more than max_active_league_size or?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just allocate MAX_ACTIVE_THREADS/32 here always. Just delete the first branch. That branch doesn't make sense anyway. This would only be true if you had a team_size < 32 .

Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there. Left a couple comments.

int* OpenMPTargetExec::get_lock_array(int num_teams) {
Kokkos::Experimental::OpenMPTargetSpace space;
int max_active_league_size = MAX_ACTIVE_THREADS / 32;
if (max_active_league_size < num_teams) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just allocate MAX_ACTIVE_THREADS/32 here always. Just delete the first branch. That branch doesn't make sense anyway. This would only be true if you had a team_size < 32 .

shmem_block_index = iter;
else
iter = ++iter % max_active_teams;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good now.

// Free the locked block and increment the number of available free
// blocks.
lock_team = atomic_compare_exchange(&lock_array[shmem_block_index], 1, 0);
atomic_increment(&lock_array[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line needs to go. leftover from the previous approach

iter = ++iter % max_active_teams;
}
// Decrement the number of available free blocks.
atomic_decrement(&lock_array[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line needs to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes deleted.

// Free the locked block and increment the number of available free
// blocks.
lock_team = atomic_compare_exchange(&lock_array[shmem_block_index], 1, 0);
atomic_increment(&lock_array[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line needs to go

@rgayatri23
Copy link
Contributor Author

I think you can just allocate MAX_ACTIVE_THREADS/32 here always. Just delete the first branch. That branch doesn't make sense anyway. This would only be true if you had a team_size < 32 .

Yes I left the logic in there to be able to address the issue in case in future we have team_size < 32

…ach. Updated the logic in get_lock_array routine.
@dalg24 dalg24 merged commit f8fd210 into kokkos:develop Jan 22, 2021
@rgayatri23 rgayatri23 deleted the OpenMPTarget_scratch_trials branch March 24, 2023 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

omp_is_initial_device is non-constexpr at least on icpx
5 participants