
CUDA: Implement support for PTDS globally #6936

Merged (5 commits into numba:master, Jun 22, 2021)
Conversation

@gmarkall (Member) commented on Apr 15, 2021:

This PR adds support for using the per-thread default stream (PTDS). A config variable, CUDA_PER_THREAD_DEFAULT_STREAM, is added, which controls whether the default stream is the legacy or the per-thread default stream. Presently the default is the legacy default stream; once this implementation is proven in use, we could consider making per-thread the default in future.
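
For illustration, a minimal sketch of how the setting would be enabled, assuming the usual Numba config mechanism (the NUMBA_-prefixed environment variable shown below maps onto the config variable named above). Since the driver bindings are memoized (see the discussion further down), the variable needs to be in effect before any CUDA work happens in the process:

import os

# Assumption: set the variable before Numba initialises its CUDA driver
# bindings, i.e. before any CUDA work is done in this process.
os.environ["NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"

from numba import cuda
import numpy as np

@cuda.jit
def inc(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1

a = cuda.to_device(np.zeros(256, dtype=np.float32))
inc[1, 256](a)  # launched on the (now per-thread) default stream
print(a.copy_to_host()[:4])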

All tests pass for me locally with the legacy and per-thread default stream, run with:

$ python -m numba.runtests numba.cuda.tests -vf -m

and

$ NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM=1 python -m numba.runtests numba.cuda.tests -vf -m

respectively. Additionally, a test program:

from numba import cuda, float32, void
from time import perf_counter
import numpy as np
import threading

N = 2 ** 16
N_THREADS = 10
N_ADDITIONS = 4096

np.random.seed(1)
x = np.random.random(N).astype(np.float32)
r = np.zeros_like(x)

xs = [cuda.to_device(x) for _ in range(N_THREADS)]
rs = [cuda.to_device(r) for _ in range(N_THREADS)]

n_threads = 256
n_blocks = N // n_threads
stream = cuda.default_stream()


@cuda.jit(void(float32[::1], float32[::1]))
def f(r, x):
    i = cuda.grid(1)

    if i >= len(r):
        return

    # Accumulate x into r
    for j in range(N_ADDITIONS):
        r[i] += x[i]


def kernel_thread(n):
    f[n_blocks, n_threads, stream](rs[n], xs[n])


def main():
    print("Creating threads")
    threads = [threading.Thread(target=kernel_thread, args=(i,))
               for i in range(N_THREADS)]

    print("Starting threads")
    start = perf_counter()

    for thread in threads:
        thread.start()

    print("Waiting for threads to finish")
    for thread in threads:
        thread.join()

    print("Synchronizing with device")
    cuda.synchronize()

    end = perf_counter()
    print(f"Elapsed time: {end - start}")

    print("Checking output")
    expected = x * N_ADDITIONS

    for i in range(N_THREADS):
        print(f"Checking output {i}")
        # Use a looser tolerance than usual because our method of accumulation
        # is not particularly accurate
        rtol = 1.0e-4
        np.testing.assert_allclose(rs[i].copy_to_host(), expected, rtol=rtol)

    print("Done!")


if __name__ == '__main__':
    main()

can be shown to correctly use the legacy default stream or the per-thread default stream depending on the setting of the config variable, using Nsight Systems to record the streams of the kernel launches. With the legacy default stream set:

[Nsight Systems timeline: kernel launches with the legacy default stream]

With the per-thread default stream set:

[Nsight Systems timeline: kernel launches with the per-thread default stream]

EDIT: A test was added in commit 8577d9b. Original question: However, I'm struggling to think of a fully automated / programmatic way of testing this. Any thoughts on this, anyone?

Fixes #5137.

@pentschev (Contributor):

This looks great! Thanks @gmarkall for the work here!

> However, I'm struggling to think of a fully automated / programmatic way of testing this. Any thoughts on this, anyone?

This was indeed something I could not find an optimal solution for in CuPy. Because CuPy uses the CUDA runtime API, what we did was replace cudaStreamLegacy with cudaStreamPerThread when CUPY_CUDA_PER_THREAD_DEFAULT_STREAM is enabled, and then test whether the CuPy default stream is cudaStreamLegacy or cudaStreamPerThread, depending on whether the test was executed with or without CUPY_CUDA_PER_THREAD_DEFAULT_STREAM. Of course, that means we don't actually test whether each thread is running on a different CUDA stream, which IMO would be the optimal test.
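
For concreteness, a rough sketch of that style of check (the accessor supplying the default stream handle is hypothetical; the constants are the CUDA runtime's special stream handles, cudaStreamLegacy = 0x1 and cudaStreamPerThread = 0x2):

import os

# Special stream handles defined by the CUDA runtime API.
CUDA_STREAM_LEGACY = 0x1      # cudaStreamLegacy
CUDA_STREAM_PER_THREAD = 0x2  # cudaStreamPerThread

def expected_default_stream():
    # Mirrors the environment variable described above; '1' requests PTDS.
    if os.environ.get('CUPY_CUDA_PER_THREAD_DEFAULT_STREAM') == '1':
        return CUDA_STREAM_PER_THREAD
    return CUDA_STREAM_LEGACY

def check_default_stream(default_stream_handle):
    # default_stream_handle is whatever raw handle the library under test
    # reports for its default stream (a hypothetical accessor, not a real
    # CuPy or Numba API).
    assert default_stream_handle == expected_default_stream()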

It seems that with the driver API things are even more complicated. The best I can think of is to test whether the API calls Numba makes are the _ptds/_ptsz variants, but that would still require some sort of injection into each driver API call to report back the function name, and I'm not sure that's feasible in a simple manner.

@gmarkall (Member, Author):

Thanks for the feedback / thoughts @pentschev !

> It seems that with the driver API things are even more complicated. The best I can think of is to test whether the API calls Numba makes are the _ptds/_ptsz variants, but that would still require some sort of injection into each driver API call to report back the function name, and I'm not sure that's feasible in a simple manner.

That's a good point. The env var NUMBA_CUDA_LOG_LEVEL=DEBUG can be set and the names of the driver API function calls will be printed. It won't be easy (or perhaps possible) to use this in a unit test though, because Numba memoizes the bindings to the driver API functions so PTDS can't just be switched on for one test. Perhaps there is a way to run the test in a subprocess, but I need to be careful of the "CUDA initialized before forking" situation - presently all Numba CUDA tests run in a single test process to avoid this. I am not sure what the IPC tests do about this though, so I will look at them and see if they might provide a clue about how to test this.
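
As a very rough illustration of how the debug log might be used from a child process (a sketch only; it assumes the _ptds/_ptsz names appear in the captured output of the child and that the log goes to the child's stdout/stderr, and it is not the test that ended up in the PR):

import os
import subprocess
import sys

# Hypothetical child program: does a little CUDA work so that some
# stream-related driver API calls get made and logged.
CHILD = (
    "from numba import cuda\n"
    "import numpy as np\n"
    "arr = cuda.to_device(np.arange(16))\n"
    "arr.copy_to_host()\n"
)

def driver_log(ptds):
    env = dict(os.environ)
    env['NUMBA_CUDA_LOG_LEVEL'] = 'DEBUG'
    env['NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM'] = '1' if ptds else '0'
    proc = subprocess.run([sys.executable, '-c', CHILD], env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True)
    return proc.stdout

# Expectation (assumption): with PTDS enabled the logged driver calls include
# _ptds/_ptsz variants; with it disabled they do not.
assert '_ptds' in driver_log(ptds=True)
assert '_ptds' not in driver_log(ptds=False)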

@gmarkall (Member, Author):

> I am not sure what the IPC tests do about this though, so I will look at them and see if they might provide a clue about how to test this.

The IPC tests spawn instead of forking, so it looks like I might be able to use a similar approach to test PTDS.
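
For example, roughly following the spawn pattern (a sketch under assumptions: the child reads the default stream's raw handle via a hypothetical .handle attribute and compares it against the driver API's special handles CU_STREAM_LEGACY = 0x1 and CU_STREAM_PER_THREAD = 0x2):

import multiprocessing as mp
import os

CU_STREAM_LEGACY = 0x1       # driver API special handle
CU_STREAM_PER_THREAD = 0x2   # driver API special handle

def child_check(result_queue):
    # Runs in a freshly spawned interpreter, so CUDA has not yet been
    # initialised in this process and the environment variable takes effect.
    from numba import cuda
    handle = cuda.default_stream().handle   # assumed accessor for the raw handle
    result_queue.put(int(getattr(handle, 'value', handle) or 0))

def default_stream_handle(ptds):
    # Spawned children inherit os.environ, so set the config variable first.
    os.environ['NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM'] = '1' if ptds else '0'
    ctx = mp.get_context('spawn')
    queue = ctx.Queue()
    proc = ctx.Process(target=child_check, args=(queue,))
    proc.start()
    handle = queue.get()
    proc.join()
    return handle

if __name__ == '__main__':
    legacy = default_stream_handle(ptds=False)
    per_thread = default_stream_handle(ptds=True)
    # The legacy handle may be exposed as 0 or CU_STREAM_LEGACY depending on
    # the binding, so only the per-thread case is pinned down here.
    assert per_thread == CU_STREAM_PER_THREAD
    assert legacy != per_thread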

@pentschev (Contributor):

> The IPC tests spawn instead of forking, so it looks like I might be able to use a similar approach to test PTDS.

We have been increasingly using this approach for Dask/UCX-Py tests where we need different configurations (such as different environment variables); it has the added benefit of avoiding leaks of unexpected state into other tests. It's definitely a bit slower, but it's also much easier to track down failures.

@insertinterestingnamehere:

Confirmed working downstream! Thank you for taking care of this!

Commit: This was added in Python 3.4, so the check is no longer necessary.
@gmarkall (Member, Author):

> We have been increasingly using this approach for Dask/UCX-Py tests where we need different configurations (such as different environment variables); it has the added benefit of avoiding leaks of unexpected state into other tests. It's definitely a bit slower, but it's also much easier to track down failures.

Thanks for the insights here @pentschev - this does seem to be the best way to go. A test for PTDS is now added in 8577d9b.

Whilst implementing this based on the pattern in test_ipc, I noticed that test_ipc checks for the existence of multiprocessing.get_context - this is no longer necessary as it was added in Python 3.4 and the minimum supported version is 3.6, so I've removed that check.

@pentschev (Contributor):

> Thanks for the insights here @pentschev - this does seem to be the best way to go. A test for PTDS is now added in 8577d9b.

Thanks for working on that, it's great to know that this is indeed a feasible way of testing PTDS!

@stuartarchibald (Contributor) left a review:

Thanks for the patch and for writing the somewhat involved unit test to check this! Just one minor comment to resolve in the docs, else this looks good.

Review comment on docs/source/reference/envvars.rst (outdated, resolved).
@stuartarchibald added the labels "4 - Waiting on author" and "Effort - medium" and removed "3 - Ready for Review" (Jun 16, 2021).
@gmarkall added the label "4 - Waiting on reviewer" and removed "4 - Waiting on author" (Jun 16, 2021).
@stuartarchibald (Contributor) left a review:

Thanks for the patch and fixes.

@stuartarchibald added the labels "4 - Waiting on CI" and "Pending BuildFarm" and removed "4 - Waiting on reviewer" (Jun 16, 2021).
@stuartarchibald (Contributor):

Buildfarm ID: numba_smoketest_cuda_yaml_75.

@stuartarchibald (Contributor):

> Buildfarm ID: numba_smoketest_cuda_yaml_75.

Passed.

@stuartarchibald added the labels "5 - Ready to merge" and "BuildFarm Passed" and removed "4 - Waiting on CI" and "Pending BuildFarm" (Jun 21, 2021).
@sklam merged commit 52f6ee1 into numba:master on Jun 22, 2021.
Labels: 5 - Ready to merge, BuildFarm Passed, CUDA, Effort - medium
Linked issue: Support per-thread default streams in CUDA
5 participants