
CUDA: Implement support for PTDS globally #6936

Merged (5 commits into numba:master, Jun 22, 2021)
Conversation

@gmarkall (Member) commented on Apr 15, 2021:

This PR adds support for using the per-thread default stream (PTDS). A config variable, CUDA_PER_THREAD_DEFAULT_STREAM, is added, which controls whether the default stream is the legacy or the per-thread default stream. Presently the default is the legacy default stream; once this implementation is proven in use, we could consider making per-thread the default in future.
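
For illustration, a minimal sketch of how the setting would be enabled, assuming the usual Numba config mechanism (the NUMBA_-prefixed environment variable shown below maps onto the config variable named above). Since the driver bindings are memoized (see the discussion further down), the variable needs to be in effect before any CUDA work happens in the process:

import os

# Assumption: set the variable before Numba initialises its CUDA driver
# bindings, i.e. before any CUDA work is done in this process.
os.environ["NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"

from numba import cuda
import numpy as np

@cuda.jit
def inc(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1

a = cuda.to_device(np.zeros(256, dtype=np.float32))
inc[1, 256](a)  # launched on the (now per-thread) default stream
print(a.copy_to_host()[:4])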

All tests pass for me locally with the legacy and per-thread default stream, run with:

$ python -m numba.runtests numba.cuda.tests -vf -m

and

$ NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM=1 python -m numba.runtests numba.cuda.tests -vf -m

respectively. Additionally, a test program:

from numba import cuda, float32, void
from time import perf_counter
import numpy as np
import threading

N = 2 ** 16
N_THREADS = 10
N_ADDITIONS = 4096

np.random.seed(1)
x = np.random.random(N).astype(np.float32)
r = np.zeros_like(x)

xs = [cuda.to_device(x) for _ in range(N_THREADS)]
rs = [cuda.to_device(r) for _ in range(N_THREADS)]

n_threads = 256
n_blocks = N // n_threads
stream = cuda.default_stream()


@cuda.jit(void(float32[::1], float32[::1]))
def f(r, x):
    i = cuda.grid(1)

    if i >= len(r):
        return

    # Accumulate x into r
    for j in range(N_ADDITIONS):
        r[i] += x[i]


def kernel_thread(n):
    f[n_blocks, n_threads, stream](rs[n], xs[n])


def main():
    print("Creating threads")
    threads = [threading.Thread(target=kernel_thread, args=(i,))
               for i in range(N_THREADS)]

    print("Starting threads")
    start = perf_counter()

    for thread in threads:
        thread.start()

    print("Waiting for threads to finish")
    for thread in threads:
        thread.join()

    print("Synchronizing with device")
    cuda.synchronize()

    end = perf_counter()
    print(f"Elapsed time: {end - start}")

    print("Checking output")
    expected = x * N_ADDITIONS

    for i in range(N_THREADS):
        print(f"Checking output {i}")
        # Use a looser tolerance than usual because our method of accumulation
        # is not particularly accurate
        rtol = 1.0e-4
        np.testing.assert_allclose(rs[i].copy_to_host(), expected, rtol=rtol)

    print("Done!")


if __name__ == '__main__':
    main()

can be shown to correctly use the legacy default stream or the per-thread default stream depending on the setting of the config variable, using Nsight Systems to record the streams of the kernel launches. With the legacy default stream set:

[Nsight Systems timeline: kernel launches with the legacy default stream]

With the per-thread default stream set:

[Nsight Systems timeline: kernel launches with the per-thread default stream]

EDIT: A test was added in commit 8577d9b. Original question: However, I'm struggling to think of a fully automated / programmatic way of testing this. Any thoughts on this, anyone?

Fixes #5137.

@pentschev (Contributor):

This looks great! Thanks @gmarkall for the work here!

> However, I'm struggling to think of a fully automated / programmatic way of testing this. Any thoughts on this, anyone?

This was indeed something I could not find an optimal solution for in CuPy. Because CuPy uses the CUDA runtime API, what we did was replace cudaStreamLegacy with cudaStreamPerThread when CUPY_CUDA_PER_THREAD_DEFAULT_STREAM is enabled, and then test whether the CuPy default stream is cudaStreamLegacy or cudaStreamPerThread, depending on whether the test was executed with or without CUPY_CUDA_PER_THREAD_DEFAULT_STREAM. Of course, that means we don't actually test whether each thread is running on a different CUDA stream, which IMO would be the optimal test.
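
For concreteness, a rough sketch of that style of check (the accessor supplying the default stream handle is hypothetical; the constants are the CUDA runtime's special stream handles, cudaStreamLegacy = 0x1 and cudaStreamPerThread = 0x2):

import os

# Special stream handles defined by the CUDA runtime API.
CUDA_STREAM_LEGACY = 0x1      # cudaStreamLegacy
CUDA_STREAM_PER_THREAD = 0x2  # cudaStreamPerThread

def expected_default_stream():
    # Mirrors the environment variable described above; '1' requests PTDS.
    if os.environ.get('CUPY_CUDA_PER_THREAD_DEFAULT_STREAM') == '1':
        return CUDA_STREAM_PER_THREAD
    return CUDA_STREAM_LEGACY

def check_default_stream(default_stream_handle):
    # default_stream_handle is whatever raw handle the library under test
    # reports for its default stream (a hypothetical accessor, not a real
    # CuPy or Numba API).
    assert default_stream_handle == expected_default_stream()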

It seems that with the driver API things are even more complicated. The best I can think of is to test whether the API calls Numba makes are the _ptds/_ptsz variants, but that would still require some sort of injection into each driver API call to report back the function name, and I'm not sure that's feasible in a simple manner.

@gmarkall (Member, Author):

Thanks for the feedback / thoughts @pentschev !

> It seems that with the driver API things are even more complicated. The best I can think of is to test whether the API calls Numba makes are the _ptds/_ptsz variants, but that would still require some sort of injection into each driver API call to report back the function name, and I'm not sure that's feasible in a simple manner.

That's a good point. The env var NUMBA_CUDA_LOG_LEVEL=DEBUG can be set and the names of the driver API function calls will be printed. It won't be easy (or perhaps possible) to use this in a unit test though, because Numba memoizes the bindings to the driver API functions so PTDS can't just be switched on for one test. Perhaps there is a way to run the test in a subprocess, but I need to be careful of the "CUDA initialized before forking" situation - presently all Numba CUDA tests run in a single test process to avoid this. I am not sure what the IPC tests do about this though, so I will look at them and see if they might provide a clue about how to test this.
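
As a very rough illustration of how the debug log might be used from a child process (a sketch only; it assumes the _ptds/_ptsz names appear in the captured output of the child and that the log goes to the child's stdout/stderr, and it is not the test that ended up in the PR):

import os
import subprocess
import sys

# Hypothetical child program: does a little CUDA work so that some
# stream-related driver API calls get made and logged.
CHILD = (
    "from numba import cuda\n"
    "import numpy as np\n"
    "arr = cuda.to_device(np.arange(16))\n"
    "arr.copy_to_host()\n"
)

def driver_log(ptds):
    env = dict(os.environ)
    env['NUMBA_CUDA_LOG_LEVEL'] = 'DEBUG'
    env['NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM'] = '1' if ptds else '0'
    proc = subprocess.run([sys.executable, '-c', CHILD], env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True)
    return proc.stdout

# Expectation (assumption): with PTDS enabled the logged driver calls include
# _ptds/_ptsz variants; with it disabled they do not.
assert '_ptds' in driver_log(ptds=True)
assert '_ptds' not in driver_log(ptds=False)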

@gmarkall (Member, Author):

> I am not sure what the IPC tests do about this though, so I will look at them and see if they might provide a clue about how to test this.

The IPC tests spawn instead of forking, so it looks like I might be able to use a similar approach to test PTDS.
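
For example, roughly following the spawn pattern (a sketch under assumptions: the child reads the default stream's raw handle via a hypothetical .handle attribute and compares it against the driver API's special handles CU_STREAM_LEGACY = 0x1 and CU_STREAM_PER_THREAD = 0x2):

import multiprocessing as mp
import os

CU_STREAM_LEGACY = 0x1       # driver API special handle
CU_STREAM_PER_THREAD = 0x2   # driver API special handle

def child_check(result_queue):
    # Runs in a freshly spawned interpreter, so CUDA has not yet been
    # initialised in this process and the environment variable takes effect.
    from numba import cuda
    handle = cuda.default_stream().handle   # assumed accessor for the raw handle
    result_queue.put(int(getattr(handle, 'value', handle) or 0))

def default_stream_handle(ptds):
    # Spawned children inherit os.environ, so set the config variable first.
    os.environ['NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM'] = '1' if ptds else '0'
    ctx = mp.get_context('spawn')
    queue = ctx.Queue()
    proc = ctx.Process(target=child_check, args=(queue,))
    proc.start()
    handle = queue.get()
    proc.join()
    return handle

if __name__ == '__main__':
    legacy = default_stream_handle(ptds=False)
    per_thread = default_stream_handle(ptds=True)
    # The legacy handle may be exposed as 0 or CU_STREAM_LEGACY depending on
    # the binding, so only the per-thread case is pinned down here.
    assert per_thread == CU_STREAM_PER_THREAD
    assert legacy != per_thread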

@pentschev (Contributor):

> The IPC tests spawn instead of forking, so it looks like I might be able to use a similar approach to test PTDS.

We have been increasingly using this approach for Dask/UCX-Py tests where we need different configurations (such as different environment variables); it has the added benefit of avoiding leaks of unexpected state into other tests. It's definitely a bit slower, but it's also much easier to track down failures.

@insertinterestingnamehere:

Confirmed working downstream! Thank you for taking care of this!

Commit: This was added in Python 3.4, so the check is no longer necessary.
@gmarkall (Member, Author):

> We have been increasingly using this approach for Dask/UCX-Py tests where we need different configurations (such as different environment variables); it has the added benefit of avoiding leaks of unexpected state into other tests. It's definitely a bit slower, but it's also much easier to track down failures.

Thanks for the insights here @pentschev - this does seem to be the best way to go. A test for PTDS is now added in 8577d9b.

Whilst implementing this based on the pattern in test_ipc, I noticed that test_ipc checks for the existence of multiprocessing.get_context - this is no longer necessary as it was added in Python 3.4 and the minimum supported version is 3.6, so I've removed that check.

@pentschev (Contributor):

> Thanks for the insights here @pentschev - this does seem to be the best way to go. A test for PTDS is now added in 8577d9b.

Thanks for working on that, it's great to know that this is indeed a feasible way of testing PTDS!

@stuartarchibald (Contributor) left a review:

Thanks for the patch and for writing the somewhat involved unit test to check this! Just one minor comment to resolve in the docs, else this looks good.

Review comment on docs/source/reference/envvars.rst (outdated, resolved).
@stuartarchibald added the labels "4 - Waiting on author" and "Effort - medium" and removed "3 - Ready for Review" (Jun 16, 2021).
@gmarkall added the label "4 - Waiting on reviewer" and removed "4 - Waiting on author" (Jun 16, 2021).
@stuartarchibald (Contributor) left a review:

Thanks for the patch and fixes.

@stuartarchibald added the labels "4 - Waiting on CI" and "Pending BuildFarm" and removed "4 - Waiting on reviewer" (Jun 16, 2021).
@stuartarchibald (Contributor):

Buildfarm ID: numba_smoketest_cuda_yaml_75.

@stuartarchibald (Contributor):

> Buildfarm ID: numba_smoketest_cuda_yaml_75.

Passed.

@stuartarchibald added the labels "5 - Ready to merge" and "BuildFarm Passed" and removed "4 - Waiting on CI" and "Pending BuildFarm" (Jun 21, 2021).
@sklam merged commit 52f6ee1 into numba:master on Jun 22, 2021.
Labels: 5 - Ready to merge, BuildFarm Passed, CUDA, Effort - medium
Linked issue: Support per-thread default streams in CUDA
5 participants