
Fix a potential race condition in the CUDA backend #3999

Merged 1 commit into kokkos:develop from cuda_race_condition on May 10, 2021

Conversation

@Rombur (Member) commented May 4, 2021

There is a potential race condition when using multiple Team parallel_for/parallel_reduce with the same instance: the scratch memory pointer can get invalidated before the kernel is done. In HIP this was fixed by using a guard lock, but that does not work for CUDA because we need ParallelFor/ParallelReduce to have a copy constructor. Instead, I have created a pool of 10 pointers (an arbitrary choice; I don't have a strong reason for 10) and we cycle through them when we need one. I know that there isn't any test yet, but I am working on that.

Commit: …ads launch different Team ParallelFor/Reduce
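A minimal sketch of the cycling-pool idea described above, using made-up names (ScratchPool, acquire) rather than the actual Kokkos code; it only illustrates rotating through a fixed set of device scratch pointers so that a new dispatch never reuses the pointer of a kernel that may still be in flight:

```cpp
#include <cuda_runtime.h>
#include <array>
#include <cstddef>
#include <mutex>

// Illustrative only: ScratchPool and acquire() are hypothetical names, not Kokkos API.
class ScratchPool {
  static constexpr int pool_size = 10;           // arbitrary, as in the PR description
  std::array<void*, pool_size> m_ptr{};          // device scratch pointers (initially null)
  std::array<std::size_t, pool_size> m_bytes{};  // current allocation size of each slot
  int m_next = 0;                                // round-robin cursor
  std::mutex m_mutex;                            // serializes slot selection and reallocation

 public:
  // Hand out the next slot, growing it if it is too small.  A kernel that is
  // still in flight keeps the slot it was launched with, so the next dispatch
  // gets a different pointer instead of invalidating the old one.
  void* acquire(std::size_t bytes) {
    std::lock_guard<std::mutex> guard(m_mutex);
    int const slot = m_next;
    m_next = (m_next + 1) % pool_size;
    if (m_bytes[slot] < bytes) {
      // Reallocation blocks the device first, so no in-flight kernel can
      // still be reading the old pointer when it is freed.
      cudaDeviceSynchronize();
      cudaFree(m_ptr[slot]);
      cudaMalloc(&m_ptr[slot], bytes);
      m_bytes[slot] = bytes;
    }
    return m_ptr[slot];
  }
};
```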
@masterleinad (Contributor) commented:
What's the reason that ParallelFor needs to be copy-constructible for Cuda but not for HIP? I guess the backends should be consistent in this regard.

@crtrott (Member) commented May 5, 2021

This is a race condition only with multiple threads dispatching work, right?

@Rombur (Member, Author) commented May 5, 2021

What's the reason that ParallelFor needs to be copy-constructible for Cuda but not for HIP? I guess the backends should be consistent in this regard.

No, the backends should not be consistent here. The reason we don't use a copy in HIP is that the compiler sometimes generates a buggy kernel when the kernel is copied, so we had to create a workaround, which has performance implications.

This is a race condition only with multiple threads dispatching work, right?

Correct.
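For concreteness, a sketch of the scenario under discussion (hypothetical kernel and sizes, not the test mentioned in the PR description): two host threads dispatch Team kernels with different scratch requirements on the same execution space instance, so the second dispatch may need to reallocate the scratch pointer while the first kernel is still in flight.

```cpp
#include <Kokkos_Core.hpp>
#include <thread>

// Launch one Team kernel that requests per-team scratch memory.
void dispatch(int scratch_bytes) {
  using policy_t = Kokkos::TeamPolicy<Kokkos::Cuda>;
  auto policy = policy_t(100, Kokkos::AUTO)
                    .set_scratch_size(0, Kokkos::PerTeam(scratch_bytes));
  Kokkos::parallel_for(
      "team_kernel", policy,
      KOKKOS_LAMBDA(policy_t::member_type const& team) {
        (void)team.team_scratch(0);  // touch the team scratch space
      });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Two host threads dispatch Team kernels with different scratch sizes on
    // the same execution space instance; before this fix, the second dispatch
    // could invalidate the scratch pointer of the still-running first kernel.
    std::thread t1(dispatch, 1024);
    std::thread t2(dispatch, 4096);
    t1.join();
    t2.join();
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```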

@crtrott (Member) left a comment


Ok, I get it. This works because the race is only on the reallocation/launch of the kernel. If you deallocate something while another kernel already in flight is using the same scratch, you are fine anyway, since the reallocation will block the device. Hence releasing the lock at the end of ParallelFor (which is NOT the end of the asynchronous kernel) is fine.
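To illustrate that timing argument, a sketch with hypothetical names (scratch_mutex, dispatch), not the actual Kokkos code: the lock only needs to span the scratch (re)allocation and the asynchronous launch, and can be released before the kernel completes.

```cpp
#include <functional>
#include <mutex>

static std::mutex scratch_mutex;  // hypothetical per-instance lock

// The guard spans the scratch resize and the asynchronous launch only.  It is
// released when dispatch() returns, even though the kernel may still be
// running: a later reallocation of the same scratch blocks the device before
// freeing, so in-flight kernels remain safe.
void dispatch(std::function<void()> const& resize_scratch,
              std::function<void()> const& launch_kernel_async) {
  std::lock_guard<std::mutex> guard(scratch_mutex);
  resize_scratch();       // may free and reallocate the scratch pointer
  launch_kernel_async();  // returns immediately; kernel runs asynchronously
}                         // lock released here, NOT at kernel completion
```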

@crtrott merged commit 1120689 into kokkos:develop on May 10, 2021
@Rombur deleted the cuda_race_condition branch on March 31, 2023