Don't rely on synchronization behavior of default stream in CUDA and HIP #5391

Merged: 1 commit merged into kokkos:develop on Sep 2, 2022

Conversation

@masterleinad (Contributor) commented Aug 25, 2022

From https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html:

> The legacy default stream is an implicit stream which synchronizes with all other streams in the same CUcontext except for non-blocking streams, described below. (For applications using the runtime APIs only, there will be one context per device.) When an action is taken in the legacy stream such as a kernel launch or cudaStreamWaitEvent(), the legacy stream first waits on all blocking streams, the action is queued in the legacy stream, and then all blocking streams wait on the legacy stream.

HIP behaves similarly; see ROCm/HIP#129 (comment).
As discussed on Slack, the default execution space instance shouldn't have any special synchronization behavior with respect to other execution space instances. Hence, this pull request also creates a dedicated stream for the singleton that is used for the default execution space instance.
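For illustration, here is a minimal CUDA sketch of the quoted semantics; the kernel and stream names are made up for this example and are not part of the PR. Launches on the legacy default stream implicitly order themselves against all blocking streams, while streams created with cudaStreamNonBlocking opt out of that coupling.

// Minimal sketch of legacy default stream semantics (illustrative only).
#include <cuda_runtime.h>

__global__ void work() {}

int main() {
  cudaStream_t blocking, non_blocking;
  // A stream created with default flags is "blocking": it participates in
  // the implicit synchronization with the legacy default stream.
  cudaStreamCreate(&blocking);
  // A non-blocking stream opts out of that implicit synchronization.
  cudaStreamCreateWithFlags(&non_blocking, cudaStreamNonBlocking);

  work<<<1, 1, 0, blocking>>>();
  // This launch on the legacy default stream first waits on all prior work
  // in `blocking`, and later work submitted to `blocking` waits on it.
  // Work in `non_blocking` is not ordered against it at all.
  work<<<1, 1>>>();
  work<<<1, 1, 0, non_blocking>>>();

  cudaDeviceSynchronize();
  cudaStreamDestroy(blocking);
  cudaStreamDestroy(non_blocking);
  return 0;
}

With the default execution space instance backed by its own explicitly created stream, Kokkos kernels launched on the default instance no longer go through the legacy default stream, so the default instance has no special ordering relative to other instances.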

@PhilMiller (Contributor)

It would be helpful to document or reference the particular behavior in the PR description. I believe this relates to kernel launches on CUDA/HIP's default stream implicitly synchronizing more broadly than launches on any other stream that gets created explicitly?

@PhilMiller (Contributor)

Also, this will pose a merge conflict with #5390 if/when that gets cherry-picked to develop

@masterleinad (Contributor, Author) commented Aug 25, 2022

> It would be helpful to document or reference the particular behavior in the PR description. I believe this relates to kernel launches on CUDA/HIP's default stream implicitly synchronizing more broadly than launches on any other stream that gets created explicitly?

This was still just a draft. 🙂 I updated the description now, though.

masterleinad marked this pull request as ready for review on August 25, 2022 19:18
dalg24 requested a review from Rombur on August 26, 2022 03:06
@dalg24 (Member) previously requested changes and left a comment on Aug 26, 2022

For the record, at the moment none of the CUDA or HIP builds passes.

core/src/Cuda/Kokkos_Cuda_Instance.cpp (outdated review comment, resolved)
core/src/HIP/Kokkos_HIP_Space.cpp (outdated review comment, resolved)
masterleinad marked this pull request as draft on August 26, 2022 12:21
masterleinad marked this pull request as ready for review on August 26, 2022 15:37
@masterleinad (Contributor, Author)

Retest this please.

1 similar comment
@masterleinad (Contributor, Author)

Retest this please.

test_cuda_spaces_int_value<<<1, 1>>>(uvm_ptr);
Kokkos::Cuda().fence();
Member

Ah ok this makes sense.

@crtrott (Member) commented Aug 29, 2022

hip_DeathTest.abort_from_device is timing out.

@masterleinad (Contributor, Author)

> hip_DeathTest.abort_from_device is timing out.

Yeah, I couldn't reproduce on crusher or spock easily and need to dig a little deeper.

masterleinad marked this pull request as draft on August 30, 2022 15:44
Comment on lines 62 to 65
ExecutionSpace exec;
Kokkos::parallel_for(Kokkos::RangePolicy<ExecutionSpace>(exec, 0, 1),
                     *this);
exec.fence();
@masterleinad (Contributor, Author)

I'm still not quite sure why the HIP CI was timing out without this change, nor how much we care.
I couldn't reproduce it on crusher or spock.

@masterleinad (Contributor, Author)

Also, @JBludau can only reproduce it with ROCm < 5.

Member

So you are saying this is not hanging with 5.2?

@crtrott (Member) left a comment

OK, let's revert that abort test change again and move testing to 5.2.

@Rombur (Member) commented Aug 31, 2022

> OK, let's revert that abort test change again and move testing to 5.2.

We cannot require 5.2 yet unless your machine has been updated.

masterleinad marked this pull request as ready for review on August 31, 2022 21:10
@masterleinad (Contributor, Author)

As expected, this works after updating the CI to ROCm 5.2.

@masterleinad (Contributor, Author)

Waiting for #5416.

@masterleinad (Contributor, Author)

Rebased after #5416 has been merged.

@masterleinad (Contributor, Author)

Rebased after #5410 has been merged.

@masterleinad (Contributor, Author)

Only the OpenMPTarget CI is failing, with:

[ RUN      ] openmptarget.unique_token_global
4: /var/jenkins/workspace/Kokkos/core/unit_test/TestUniqueToken.hpp:151: Failure
4: Expected equality of these values:
4:   sum
4:     Which is: 9999990
4:   int64_t(N) * R
4:     Which is: 10000000
4: [  FAILED  ] openmptarget.unique_token_global (267 ms)

Since only CUDA and HIP were changed, this is clearly unrelated.

dalg24 dismissed their stale review on September 2, 2022 12:28

Changes requested have been applied

dalg24 merged commit 9645d46 into kokkos:develop on Sep 2, 2022
masterleinad mentioned this pull request on Sep 7, 2022