
Fix a potential race condition in the CUDA backend #3999

Merged 1 commit into kokkos:develop from cuda_race_condition on May 10, 2021

Conversation

@Rombur (Member) commented May 4, 2021

There is a potential race condition when using multiple Team parallel_for/parallel_reduce with the same instance: the scratch memory pointer can get invalidated before the kernel is done. In HIP this was fixed by using a guard lock, but that does not work for CUDA because we need ParallelFor/ParallelReduce to have a copy constructor. Instead, I have created a pool of 10 pointers (an arbitrary choice; I don't have a strong reason for 10) and we cycle through them when we need one. I know that there isn't any test yet, but I am working on that.

Commit: …ads launch different Team ParallelFor/Reduce
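A minimal sketch of the cycling-pool idea described above, using made-up names (ScratchPool, acquire) rather than the actual Kokkos code; it only illustrates rotating through a fixed set of device scratch pointers so that a new dispatch never reuses the pointer of a kernel that may still be in flight:

```cpp
#include <cuda_runtime.h>
#include <array>
#include <cstddef>
#include <mutex>

// Illustrative only: ScratchPool and acquire() are hypothetical names, not Kokkos API.
class ScratchPool {
  static constexpr int pool_size = 10;           // arbitrary, as in the PR description
  std::array<void*, pool_size> m_ptr{};          // device scratch pointers (initially null)
  std::array<std::size_t, pool_size> m_bytes{};  // current allocation size of each slot
  int m_next = 0;                                // round-robin cursor
  std::mutex m_mutex;                            // serializes slot selection and reallocation

 public:
  // Hand out the next slot, growing it if it is too small.  A kernel that is
  // still in flight keeps the slot it was launched with, so the next dispatch
  // gets a different pointer instead of invalidating the old one.
  void* acquire(std::size_t bytes) {
    std::lock_guard<std::mutex> guard(m_mutex);
    int const slot = m_next;
    m_next = (m_next + 1) % pool_size;
    if (m_bytes[slot] < bytes) {
      // Reallocation blocks the device first, so no in-flight kernel can
      // still be reading the old pointer when it is freed.
      cudaDeviceSynchronize();
      cudaFree(m_ptr[slot]);
      cudaMalloc(&m_ptr[slot], bytes);
      m_bytes[slot] = bytes;
    }
    return m_ptr[slot];
  }
};
```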
@masterleinad (Contributor) commented:
What's the reason that ParallelFor needs to be copy-constructible for Cuda but not for HIP? I guess the backends should be consistent in this regard.

@crtrott (Member) commented May 5, 2021

This is a race condition only with multiple threads dispatching work, right?

@Rombur (Member, Author) commented May 5, 2021

What's the reason that ParallelFor needs to be copy-constructible for Cuda but not for HIP? I guess the backends should be consistent in this regard.

No, the backends should not be consistent here. The reason we don't use a copy in HIP is that the compiler sometimes generates a buggy kernel when the kernel is copied, so we had to create a workaround, which has performance implications.

This is a race condition only with multiple threads dispatching work, right?

Correct.
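For concreteness, a sketch of the scenario under discussion (hypothetical kernel and sizes, not the test mentioned in the PR description): two host threads dispatch Team kernels with different scratch requirements on the same execution space instance, so the second dispatch may need to reallocate the scratch pointer while the first kernel is still in flight.

```cpp
#include <Kokkos_Core.hpp>
#include <thread>

// Launch one Team kernel that requests per-team scratch memory.
void dispatch(int scratch_bytes) {
  using policy_t = Kokkos::TeamPolicy<Kokkos::Cuda>;
  auto policy = policy_t(100, Kokkos::AUTO)
                    .set_scratch_size(0, Kokkos::PerTeam(scratch_bytes));
  Kokkos::parallel_for(
      "team_kernel", policy,
      KOKKOS_LAMBDA(policy_t::member_type const& team) {
        (void)team.team_scratch(0);  // touch the team scratch space
      });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Two host threads dispatch Team kernels with different scratch sizes on
    // the same execution space instance; before this fix, the second dispatch
    // could invalidate the scratch pointer of the still-running first kernel.
    std::thread t1(dispatch, 1024);
    std::thread t2(dispatch, 4096);
    t1.join();
    t2.join();
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```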

@crtrott (Member) left a comment


Ok, I get it. This works because the race is only on the reallocation/launch of the kernel. If you deallocate something while another kernel already in flight is using the same scratch, you are fine anyway, since the reallocation will block the device. Hence releasing the lock at the end of ParallelFor (which is NOT the end of the asynchronous kernel) is fine.
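To illustrate that timing argument, a sketch with hypothetical names (scratch_mutex, dispatch), not the actual Kokkos code: the lock only needs to span the scratch (re)allocation and the asynchronous launch, and can be released before the kernel completes.

```cpp
#include <functional>
#include <mutex>

static std::mutex scratch_mutex;  // hypothetical per-instance lock

// The guard spans the scratch resize and the asynchronous launch only.  It is
// released when dispatch() returns, even though the kernel may still be
// running: a later reallocation of the same scratch blocks the device before
// freeing, so in-flight kernels remain safe.
void dispatch(std::function<void()> const& resize_scratch,
              std::function<void()> const& launch_kernel_async) {
  std::lock_guard<std::mutex> guard(scratch_mutex);
  resize_scratch();       // may free and reallocate the scratch pointer
  launch_kernel_async();  // returns immediately; kernel runs asynchronously
}                         // lock released here, NOT at kernel completion
```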

@crtrott merged commit 1120689 into kokkos:develop on May 10, 2021
@Rombur deleted the cuda_race_condition branch on March 31, 2023