
CUDA streams #1723

Closed · ibaned opened this issue Aug 1, 2018 · 18 comments
Labels: Feature Request (Create new capability; will potentially require voting)

ibaned (Contributor) commented Aug 1, 2018

No description provided.

ibaned added the Feature Request label on Aug 1, 2018
ibaned added this to the 2018 September milestone on Aug 1, 2018
sslattery commented Aug 2, 2018

I just came across this:

Peterson, Brad, Alan Humphrey, John Holmen, Todd Harman, Martin Berzins, Dan Sunderland, and H. Carter Edwards. "Demonstrating GPU Code Portability and Scalability for Radiative Heat Transfer Computations." Journal of Computational Science (2018).

What is the state of the Kokkos work on CUDA streams and tasking reported in section 6 of that paper? Is that sitting in an unmerged branch somewhere?

ibaned (Contributor, Author) commented Aug 2, 2018

Yes, I've been told the CUDA stream work is in an unmerged branch in an unknown fork. That would ideally be the starting point for this. As for the tasking system used, that may not be part of Kokkos; it may be something in the Uintah code base. I'm not sure where the tasking lives.

mhoemmen (Contributor) commented Aug 2, 2018

Does this mean things like "having deep_copy actually use the execution space instance argument"?

ibaned (Contributor, Author) commented Aug 2, 2018

@mhoemmen the first pass would be execution space instances containing CUDA streams and getting the parallel_* functions to work with that, but yes, a subsequent step could be deep_copy acting on a stream through the execution space, possibly calling cudaMemcpyAsync underneath.
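To make that concrete, here is a minimal sketch of what such a stream-aware copy might do underneath; the stream_aware_copy name and the exec.cuda_stream() accessor are assumptions for illustration, not an agreed Kokkos interface:

// Hypothetical sketch: enqueue an asynchronous copy on the stream owned
// by a particular Cuda execution space instance. cuda_stream() is an
// assumed accessor, not a settled API.
template <class DstView, class SrcView>
void stream_aware_copy(const Kokkos::Cuda& exec,
                       const DstView& dst, const SrcView& src) {
  cudaMemcpyAsync(dst.data(), src.data(),
                  src.span() * sizeof(typename SrcView::value_type),
                  cudaMemcpyDefault, exec.cuda_stream());
}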

mhoemmen (Contributor) commented Aug 2, 2018

@ibaned Thanks for the info! I'm wondering whether it would make sense to do both tasks (parallel_* and deep_copy with nondefault streams) at the same time. Suppose that parallel_* are done but deep_copy still ignores its stream argument, and a user does the following:

  1. Launch a parallel_for with nondefault stream S, that outputs to View x
  2. Launch a deep_copy with nondefault stream S, from x to y

The output View y could have nondeterministic results if deep_copy ignores S and uses the default stream.
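Roughly, in code (a sketch that assumes the instance-aware overloads under discussion; x, y, and compute are placeholders):

Kokkos::Cuda s(stream_S);  // execution space instance wrapping stream S

// 1. Kernel writes x, enqueued on stream S.
parallel_for(RangePolicy<Cuda>(s, 0, N),
             KOKKOS_LAMBDA(int i) { x(i) = compute(i); });

// 2. If deep_copy ignores s and enqueues on the default stream, this copy
//    is not ordered after the kernel above, and y may receive stale data.
deep_copy(s, y, x);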

crtrott assigned crtrott and unassigned ibaned and dsunder on Nov 7, 2018
ibaned modified the milestones (2018 September → 2018 December) on Nov 14, 2018
crtrott (Member) commented Dec 1, 2018

I just opened a pull request which gets most of the infrastructure in place:
#1919

crtrott (Member) commented Dec 1, 2018

For now the way this works is that you create a stream and then construct a Cuda class object around it:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

Cuda cuda1(stream1);
Cuda cuda2(stream2);

parallel_for(RangePolicy<>(cuda1, 0, N), f1);
parallel_for(RangePolicy<>(cuda2, 0, N), f2);

In this case the two functors f1 and f2 can execute concurrently.
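One caveat worth adding (my note, not part of the PR description): before the host consumes the results it still needs to synchronize, e.g. by fencing each instance, assuming per-instance fence semantics:

cuda1.fence();  // block until work submitted on stream1 has completed
cuda2.fence();  // block until work submitted on stream2 has completed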

crtrott (Member) commented Dec 1, 2018

What we need beyond this is a backend-agnostic way of partitioning an execution space. Maybe something like this:

ExecSpace spaces[N];
partition_space(ExecSpace(), N, spaces);

Any thoughts?
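For concreteness, usage might look like the following; this is purely a sketch of the proposed interface, and work_per_piece and do_piece are placeholders:

ExecSpace spaces[N];
partition_space(ExecSpace(), N, spaces);

// One (small) kernel per sub-instance; the pieces may run concurrently.
for (int p = 0; p < N; ++p)
  parallel_for(RangePolicy<ExecSpace>(spaces[p], 0, work_per_piece),
               KOKKOS_LAMBDA(int i) { do_piece(p, i); });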

mhoemmen (Contributor) commented Dec 1, 2018

Would "partition" mean "create a stream for each available GPU"? If so, then I think that could be the right word. However, I worry it would suggest something more like MPI_Comm_split. That's really not what this means. A stream is a "timeline" that sequences operations. Creating a new stream is more like MPI_Comm_dup. What you call "partition" is really more like MPI_Comm_spawn, that acquires new parallel resources, rather than dividing old ones.

crtrott (Member) commented Dec 1, 2018

From an abstraction perspective I mean partition. On a std::threads backend the generated sub-execution-space instances would be disjoint sets of threads of the thread pool. Since GPUs are a throughput architecture, though, we use streams to achieve the same thing. Say you have three kernels, none of which has enough parallelism to saturate your socket or GPU, but all of which can be executed at the same time: in that case it makes sense to use streams on a GPU and to split the thread pool on a CPU-like architecture.

So I think of streams here still as partitioning; it's just that a GPU is by nature more like a throughput service than a set of dedicated resources.

crtrott (Member) commented Dec 1, 2018

To add to that: there is no way to split a GPU into dedicated sub-GPUs right now. So if partitioning is what you want, streams are the only way to achieve it.

mhoemmen (Contributor) commented Dec 1, 2018

So I think of streams here still as partitioning; it's just that a GPU is by nature more like a throughput service than a set of dedicated resources.

A stream also comes with a promise about sequencing operations, but I guess that would be true of a partition of an std::thread-based thread pool as well, no?

there is no way to split a GPU into dedicated sub-GPUs right now.

The MPS server on the most recent GPUs does something like this, but yes, there's no API for doing that.

crtrott (Member) commented Dec 1, 2018

The same would be true for a std::thread-based thread pool: work submitted to an instance of an execution space is done in sequence, as long as we don't add other explicit semantics through things like futures. Also, MPS would only work for partitioning between processes, not within one.

ibaned mentioned this issue on Dec 4, 2018
ndellingwood modified the milestones (2018 December → 2019 February) on Feb 6, 2019
ndellingwood modified the milestones (2019 February → 2019 April) on Feb 6, 2019
crtrott (Member) commented Feb 16, 2019

This works now, but you need to do a little more; there was no acceptable way to solve the constant-cache optimization issue without going through the interface.

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

Cuda cuda1(stream1);
Cuda cuda2(stream2);

auto range_policy1 = require( RangePolicy<Cuda>(cuda1, 0, N)
                            , WorkItemProperty::HintLightWeight );
auto range_policy2 = require( RangePolicy<Cuda>(cuda2, 0, N)
                            , WorkItemProperty::HintLightWeight );
parallel_for(range_policy1, f1);
parallel_for(range_policy2, f2);

Note: require and WorkItemProperty currently live in the namespace Kokkos::Experimental.
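Spelled out with those namespaces, the same launch reads:

auto range_policy1 =
    Kokkos::Experimental::require(
        Kokkos::RangePolicy<Kokkos::Cuda>(cuda1, 0, N),
        Kokkos::Experimental::WorkItemProperty::HintLightWeight);
auto range_policy2 =
    Kokkos::Experimental::require(
        Kokkos::RangePolicy<Kokkos::Cuda>(cuda2, 0, N),
        Kokkos::Experimental::WorkItemProperty::HintLightWeight);
Kokkos::parallel_for(range_policy1, f1);
Kokkos::parallel_for(range_policy2, f2);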

crtrott (Member) commented Feb 16, 2019

This is not the endpoint though; we are adding the partitioning soon too, and then we will probably have something like this (without all the namespaces):

auto range_policy1 = require( on( cuda1, RangePolicy<>(0, N))
                            , WorkItemProperty::HintLightWeight );

freifrauvonbleifrei commented
We were just wondering whether stream partitioning works with Kokkos, and it is nice to find out that there is a design already. Is it implemented at this point? If so, where can we find it and how do we use it?

ascheinb commented Apr 1, 2020

Seconding the interest in an update on this topic. @crtrott @mhoemmen Is this still the recommended way to launch parallel_* operations on specified CUDA streams? What about deep_copy? Thanks.

mhoemmen (Contributor) commented Apr 2, 2020

@ascheinb I'm not a Kokkos developer at the moment; I'm interested in the answer to this question, but it would be best to ask the Kokkos developers. Thanks!
