CUDA streams #1723
I just came across this: Peterson, Brad, Alan Humphrey, John Holmen, Todd Harman, Martin Berzins, Dan Sunderland, and H. Carter Edwards. "Demonstrating GPU Code Portability and Scalability for Radiative Heat Transfer Computations." Journal of Computational Science (2018). What is the state of the Kokkos work on Cuda streams and tasking reported in section 6 of that paper? Is that sitting in an unmerged branch somewhere? |
Yes, I've been told the CUDA stream work is in an unmerged branch in an unknown fork. That would ideally be the starting point for this. As for the tasking system used, that may not be part of Kokkos; it may be something in the UIntah code base. I'm not sure where the tasking is. |
Does this mean things like "having |
@mhoemmen the first pass would be execution space instances containing CUDA streams and getting the |
@ibaned Thanks for the info! I'm wondering whether it would make sense to do both tasks (
The output View y could have nondeterministic results if |
I just opened a pull request which gets most of the infrastructure in place. |
For now the way this works is that you create a stream and then construct a Cuda class object:

```cpp
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
Cuda cuda1(stream1);
Cuda cuda2(stream2);
parallel_for(RangePolicy<>(cuda1, 0, N), f1);
parallel_for(RangePolicy<>(cuda2, 0, N), f2);
```

In this case the two functors f1 and f2 would be executing simultaneously. |
What we need beyond this is a backend-agnostic way of partitioning an execution space:

```cpp
ExecSpace spaces[N];
partition_space(ExecSpace(), N, spaces);
```

Any thoughts? |
Would "partition" mean "create a stream for each available GPU"? If so, then I think that could be the right word. However, I worry it would suggest something more like |
From an abstraction perspective I mean partition. On a std::threads backend the generated sub-execution space instances would be disjoint sets of threads of the thread pool. Since GPUs are a throughput architecture, though, we use streams to achieve the same. Say you have three kernels, none of which has enough parallelism to saturate your socket or GPU, but all of which can be executed at the same time: in that case it makes sense to use streams on a GPU and split the thread pool on a CPU-like architecture. So I think of streams here still as partitioning; it's just that a GPU by nature is more like a throughput service than dedicated resources. |
To add to that: there is no way to split a GPU into dedicated sub-GPUs right now. So if partitioning is what you want, streams are the only way to achieve it. |
A stream also comes with a promise about sequencing operations, but I guess that would be true of a partition of an std::thread-based thread pool as well, no?
MPS server on the most recent GPUs does something like this, but yes, there's no API for doing that. |
The same would be true for a std::thread-based thread pool: work submitted to an instance of an execution space is done in sequence as long as we don't add other explicit semantics through things like futures etc. Also, the MPS approach would only work for partitioning between processes, not within one. |
This works now, but you need to do a little more; there was no acceptable non-interface-based way to solve the constant-cache optimization issue.

```cpp
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
Cuda cuda1(stream1);
Cuda cuda2(stream2);
auto range_policy1 = require( RangePolicy<Cuda>(cuda1, 0, N)
                            , WorkItemProperty::HintLightWeight );
auto range_policy2 = require( RangePolicy<Cuda>(cuda2, 0, N)
                            , WorkItemProperty::HintLightWeight );
parallel_for(range_policy1, f1);
parallel_for(range_policy2, f2);
```

Note: |
This is not the endpoint though; we are adding the partitioning soon too, and then we'll probably have something like this (without all the namespaces):

```cpp
auto range_policy1 = require( on( cuda1, RangePolicy<>(0, N) )
                            , WorkItemProperty::HintLightWeight );
```
We were just wondering whether the stream partitioning works with Kokkos, and it is nice to find out that there is a design already. |
@ascheinb I'm not a Kokkos developer at the moment -- I'm interested in the answer to this question, but it would be best to ask the Kokkos developers. Thanks! |