OpenMPTarget: scratch implementation for parallel_reduce #3776
Conversation
…s avoided in the OpenMPTarget backend.
Essentially copy/paste from the code in the ParallelFor specialization. The code duplication is unfortunate, but I don't necessarily have a better idea.
I am unsure how the lock_array would play with streams, but I guess we have the same issue with parallel_for.
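For context on the scheme being copied: each team claims one slot in the shared `lock_array` before it touches its scratch block and releases it afterwards. Below is a minimal host-side sketch of that claim/release idea, using `std::atomic` in place of the device atomics; the function names are illustrative and not the backend's API.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Host-side model of the lock_array protocol; device code would use the
// backend's atomics instead of std::atomic.
int claim_scratch_block(std::vector<std::atomic<int>>& lock_array, int team_id) {
  const int max_active_teams = static_cast<int>(lock_array.size());
  int iter = team_id % max_active_teams;  // start probing at "our" slot
  while (true) {
    int expected = 0;
    // A successful 0 -> 1 transition means this team now owns block `iter`.
    if (lock_array[iter].compare_exchange_strong(expected, 1)) return iter;
    iter = (iter + 1) % max_active_teams;  // slot busy, probe the next one
  }
}

void release_scratch_block(std::vector<std::atomic<int>>& lock_array, int block_index) {
  // Reset to 0 so the slot can be reused by the next team/kernel.
  lock_array[block_index].store(0);
}

int main() {
  std::vector<std::atomic<int>> lock_array(4);
  for (auto& slot : lock_array) slot.store(0);  // all slots start free
  const int idx = claim_scratch_block(lock_array, /*team_id=*/7);
  assert(idx >= 0 && idx < 4);
  release_scratch_block(lock_array, idx);
  return 0;
}
```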
Add FIXMEs in the unit tests
Make sure you add a comment when it is not obvious
while (lock_team != 1) {
  // Avoid tripping over other team's block index.
  // Find a free block index.
  iter = (omp_get_team_num() % max_active_teams);
Shouldn't this happen outside the while loop? Is that also wrong in the ParallelFor?
Also, why do we need two while loops? Isn't the only way to escape the inner while loop for lock_team == 1 to hold, in which case shmem_block_index != -1?
Yes, the additional loop is not needed. Deleted it.
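Following up on that resolution, here is a hedged sketch of what the collapsed acquisition loop could look like. The variable names follow the snippet above, and `Kokkos::atomic_compare_exchange` returns the old value, so 0 signals success; this is illustrative only, not the exact patch.

```cpp
#include <Kokkos_Core.hpp>
#include <omp.h>

// Illustrative only: acquire a scratch-block index with a single loop; the
// only exit is a successful 0 -> 1 compare-exchange on some slot.
inline int acquire_scratch_block_index(int* lock_array, int max_active_teams) {
  int shmem_block_index = -1;
  int team_num          = omp_get_team_num();

  while (shmem_block_index == -1) {
    // Probe a slot derived from the team id.
    const int iter = team_num % max_active_teams;

    // atomic_compare_exchange returns the previous value, so 0 means this
    // call just acquired the lock on slot `iter`.
    const int lock_team = Kokkos::atomic_compare_exchange(&lock_array[iter], 0, 1);

    if (lock_team == 0)
      shmem_block_index = iter;
    else
      ++team_num;  // slot busy, try the next slot on the next iteration
  }
  return shmem_block_index;
}
```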
#pragma omp target teams distribute parallel for is_device_ptr(lock_array) | ||
for (int i = 0; i < max_active_teams; ++i) { | ||
lock_array[i] = 0; | ||
} |
Why do this again and again? (I probably overlooked that in the ParallelFor too.) We only need to do it when reallocating (so get_lock_array could do that), since otherwise this should already be back to zero: a kernel will never leave it non-zero unless it crashes.
Yes, I was initializing it in case a previous kernel crashed. But I guess we don't need that, since if a kernel crashes the application would quit anyway.
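A hedged sketch of the alternative floated here: the helper name `get_lock_array` comes from the comment above, but the (re)allocation logic below is illustrative and not the backend's actual implementation.

```cpp
#include <omp.h>
#include <cstddef>

// Illustrative only: zero the lock array just once, when it is (re)allocated,
// instead of before every kernel launch. Kernels are expected to release every
// slot they acquired (set it back to 0) before they finish.
static int* g_lock_array     = nullptr;
static int g_lock_array_size = 0;

int* get_lock_array(int max_active_teams) {
  if (max_active_teams > g_lock_array_size) {
    const int device = omp_get_default_device();
    if (g_lock_array != nullptr) omp_target_free(g_lock_array, device);

    g_lock_array = static_cast<int*>(
        omp_target_alloc(max_active_teams * sizeof(int), device));
    g_lock_array_size = max_active_teams;

    // Initialize only on (re)allocation; afterwards the slots stay at 0
    // between kernels unless a kernel crashes.
    int* lock_array = g_lock_array;
    const int n     = g_lock_array_size;
#pragma omp target teams distribute parallel for is_device_ptr(lock_array)
    for (int i = 0; i < n; ++i) lock_array[i] = 0;
  }
  return g_lock_array;
}
```

The trade-off is exactly the one discussed above: this relies on every kernel restoring its slots to 0, which holds unless a kernel aborts mid-flight.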
// Loop as long as a shmem_block_index is not found.
while (shmem_block_index == -1) {
  // Loop until a lock can be acquired on a team
  while (lock_team != 1) {
same as above
…llelReduce` when calculating the shared block index.
Updates in this PR: