CUDA streams #1723
I just came across this: Peterson, Brad, Alan Humphrey, John Holmen, Todd Harman, Martin Berzins, Dan Sunderland, and H. Carter Edwards. "Demonstrating GPU Code Portability and Scalability for Radiative Heat Transfer Computations." Journal of Computational Science (2018). What is the state of the Kokkos work on Cuda streams and tasking reported in section 6 of that paper? Is that sitting in an unmerged branch somewhere? |
Yes, I've been told the CUDA stream work is in an unmerged branch in an unknown fork. That would ideally be the starting point for this. As for the tasking system used, that may not be part of Kokkos; it may be something in the UIntah code base. I'm not sure where the tasking is. |
Does this mean things like "having |
@mhoemmen the first pass would be execution space instances containing CUDA streams and getting the |
@ibaned Thanks for the info! I'm wondering whether it would make sense to do both tasks (
The output View y could have nondeterministic results if |
I just opened a pull request which gets most of the infrastructure in place. |
For now the way this works is that you create a stream and then construct a Cuda class object:

```cpp
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
Cuda cuda1(stream1);
Cuda cuda2(stream2);
parallel_for(RangePolicy<>(cuda1, 0, N), f1);
parallel_for(RangePolicy<>(cuda2, 0, N), f2);
```

In this case the two functors f1 and f2 would be executing simultaneously. |
What we need beyond this is a backend-agnostic way of partitioning an execution space:

```cpp
ExecSpace spaces[N];
partition_space(ExecSpace(), N, spaces);
```

Any thoughts? |
Would "partition" mean "create a stream for each available GPU"? If so, then I think that could be the right word. However, I worry it would suggest something more like |
From an abstraction perspective I mean partition. On a std::threads backend the generated sub-execution space instances would be disjoint sets of threads of the thread pool. Since GPUs are a throughput architecture, though, we use streams to achieve the same. Say you have three kernels, none of which has enough parallelism to saturate your socket or GPU, but all of which can be executed at the same time: in that case it makes sense to use streams on a GPU and split the thread pool on a CPU-like architecture. So I think of streams here still as partitioning; it's just that a GPU by nature is more like a throughput service than dedicated resources. |
To add to that: there is no way to split a GPU into dedicated sub-GPUs right now. So if partitioning is what you want, streams are the only way to achieve it. |
A stream also comes with a promise about sequencing operations, but I guess that would be true of a partition of an std::thread-based thread pool as well, no?
MPS server on the most recent GPUs does something like this, but yes, there's no API for doing that. |
The same would be true for a std::thread-based thread pool: work submitted to an instance of an execution space is done in sequence as long as we don't add other explicit semantics through things like futures etc. Also, the MPS approach would only work for partitioning between processes, not within one. |
This works now, but you need to do a little more; there was no acceptable non-interface-based way to solve the constant-cache optimization issue.

```cpp
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
Cuda cuda1(stream1);
Cuda cuda2(stream2);
auto range_policy1 = require( RangePolicy<Cuda>(cuda1, 0, N)
                            , WorkItemProperty::HintLightWeight );
auto range_policy2 = require( RangePolicy<Cuda>(cuda2, 0, N)
                            , WorkItemProperty::HintLightWeight );
parallel_for(range_policy1, f1);
parallel_for(range_policy2, f2);
```

Note: |
This is not the endpoint though; we are adding the partitioning soon too, and then we'll probably have something like this (without all the namespaces):

```cpp
auto range_policy1 = require( on( cuda1, RangePolicy<>(0, N) )
                            , WorkItemProperty::HintLightWeight );
```
We were just wondering whether the stream partitioning works with Kokkos, and it is nice to find out that there is a design already. |
@ascheinb I'm not a Kokkos developer at the moment -- I'm interested in the answer to this question, but it would be best to ask the Kokkos developers. Thanks! |