Support NVIDIA's CUDA Python bindings #7461

Merged: 77 commits merged into numba:master on Nov 24, 2021

Conversation

@gmarkall (Member) commented on Oct 6, 2021

This PR adds support for using NVIDIA's CUDA Python bindings to make CUDA Driver API calls. The ecosystem around these bindings is presently at an early stage, with CuPy recently gaining support for them. Eventually, common use of the NVIDIA bindings by Python libraries in the CUDA ecosystem will facilitate better interoperability for primitives other than just arrays (which are presently dealt with nicely by the CUDA Array Interface), and eliminate the need for converting types between libraries (see e.g. #4797).

An overview of differences between the NVIDIA bindings and Numba's ctypes bindings:

  • The NVIDIA bindings are more strongly typed than the ctypes ones, so values held internally in Numba need to be wrapped in the appropriate type before they can be passed to the bindings.
  • The return value of functions in the NVIDIA bindings is always a tuple of the error code followed by the output values (of which there can be several, e.g. for cuMemGetInfo, which returns the free and total memory). The way in which values are returned is made consistent with the ctypes bindings by some adaptation in _check_cuda_python_error (see the sketch after this list).
  • The differences are handled mostly with branches on the config variable CUDA_USE_CUDA_PYTHON. Module, Function and Linker are now ABCs with separate implementations for each binding, as they are mostly specific to the binding in use.
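
To illustrate the second point above, here is a minimal sketch (not Numba's actual `_check_cuda_python_error` implementation) of how the `(error, *outputs)` tuples returned by the NVIDIA bindings can be checked and unpacked so that call sites look the same as with the ctypes bindings:

```python
from cuda import cuda  # NVIDIA's CUDA Python bindings


def check(result):
    """Simplified stand-in for the adaptation done in _check_cuda_python_error:
    split the (error, *outputs) tuple, raise on failure, and return only the
    outputs so the call site resembles the ctypes bindings."""
    err, *values = result
    if err != cuda.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"CUDA driver call failed: {err}")
    return values[0] if len(values) == 1 else values


check(cuda.cuInit(0))                   # cuInit returns only an error code
count = check(cuda.cuDeviceGetCount())  # cuDeviceGetCount returns (err, count)
# cuMemGetInfo similarly returns (err, free, total) once a context is current.
```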

Some remarks:

  • Using the bindings is optional, and turned off by default, because the official bindings are yet to support Per-Thread Default Streams and profiling. Once these are supported, I expect to submit another PR that makes using them the default if they are installed (see the usage sketch after this list).
  • Runtime API calls still use a ctypes interface - changing the Runtime API to use the NVIDIA bindings will come in a follow-up PR.
  • I expect the NVIDIA and Numba ctypes bindings to coexist for a period of time, and for the ctypes binding to eventually be retired once the NVIDIA bindings are ubiquitous.
  • I've left the commit history in this PR intact because I've kept things working at each commit, and I think that if these changes introduce any issues, it'll be much easier to track down the source of the problem with a bisect through the individual changes instead of one mega-commit.
  • I've tested this on my workstation with Windows + Linux and Linux on a Jetson AGX Xavier, both in environments with the NVIDIA bindings installed and without them. However, a buildfarm run early on would be great in case there are any issues on the buildfarm machines that I've not spotted - @esc / @stuartarchibald could this have a CUDA buildfarm run to check, please?
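
As a usage sketch for the first remark: Numba config variables are normally controllable via NUMBA_-prefixed environment variables, so opting in might look roughly like the following. The exact variable name is an assumption based on the CUDA_USE_CUDA_PYTHON config variable mentioned above; check the documentation added in this PR for the definitive spelling.

```python
# Hedged sketch: opt in to the NVIDIA binding before importing numba.cuda.
# NUMBA_CUDA_USE_CUDA_PYTHON is assumed from the config variable name above.
import os

os.environ["NUMBA_CUDA_USE_CUDA_PYTHON"] = "1"

from numba import cuda

print(cuda.is_available())  # True when a GPU (and a working binding) is present
cuda.detect()               # prints a summary of the detected devices
```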

CCs for visibility: @quasiben @leofang @stuartarchibald @mmccarty @shwina

This provides enough to successfully run `cuda.detect()` and for
`cuda.is_available()` to return `True`, whilst maintaining correctness
of the inbuilt ctypes bindings.
All tests pass with ctypes bindings. With CUDA Python bindings:

```
Ran 1201 tests in 30.800s

FAILED (failures=5, errors=900, skipped=10, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 31.862s

FAILED (failures=49, errors=863, skipped=10, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 32.145s

FAILED (failures=6, errors=892, skipped=10, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 35.580s

FAILED (failures=6, errors=892, skipped=10, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 36.668s

FAILED (failures=6, errors=1580, skipped=10, expected failures=1)
```
CUDA Python test results are presently:

```
Ran 1201 tests in 39.895s

FAILED (failures=30, errors=1500, skipped=10, expected failures=1)
```
Unfortunately the testsuite segfaults before completion at present, so
there are no metrics available.
Test results with CUDA Python:

```
Ran 1174 tests in 60.225s

FAILED (failures=30, errors=591, skipped=10, expected failures=1)
```
Test results with CUDA Python are now:

```
Ran 1201 tests in 47.054s

FAILED (failures=31, errors=248, skipped=41, expected failures=1)
```
It should not be the responsibility of launch_kernel to get the ctypes
pointer from a higher-level abstraction object such as a device array or
record.
CUDA Python test results are now:

```
Ran 1201 tests in 72.650s

FAILED (failures=32, errors=224, skipped=50, expected failures=1)
```
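
Regarding the launch_kernel remark above, the intended division of responsibility could look roughly like this hypothetical sketch: the caller unwraps high-level objects (device arrays, records) to raw pointers, so that launch_kernel only ever sees plain addresses. The helper and the stand-in array class here are illustrative, not Numba's actual code; Numba device arrays do expose their address via a `device_ctypes_pointer` attribute.

```python
import ctypes


def as_kernel_arg(obj):
    """Hypothetical helper: unwrap a device array or record to its raw ctypes
    pointer so that the low-level launch path only deals with addresses."""
    ptr = getattr(obj, "device_ctypes_pointer", None)
    return ptr if ptr is not None else obj


class FakeDeviceArray:
    # Stand-in for a device array, which exposes its address as a ctypes
    # value via the device_ctypes_pointer attribute.
    device_ctypes_pointer = ctypes.c_void_p(0xDEADBEEF)


args = [FakeDeviceArray(), ctypes.c_int(42)]
prepared = [as_kernel_arg(a) for a in args]
print(prepared)  # raw ctypes values, ready to hand to a launch_kernel-style API
```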
These caused some problems for the ctypes bindings.
CUDA Python test results are now:

```
Ran 1201 tests in 68.140s

FAILED (failures=15, errors=217, skipped=41, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 54.820s

FAILED (failures=13, errors=146, skipped=41, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 65.175s

FAILED (failures=15, errors=139, skipped=41, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 55.840s

FAILED (failures=17, errors=105, skipped=41, expected failures=1)
```
CUDA Python results are now:

```
Ran 1201 tests in 74.768s

FAILED (failures=18, errors=99, skipped=41, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 81.919s

FAILED (failures=15, errors=49, skipped=41, expected failures=1)
```
CUDA Python results are now:

```
Ran 1201 tests in 73.974s

FAILED (failures=15, errors=39, skipped=41, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 73.229s

FAILED (failures=10, errors=38, skipped=41, expected failures=1)
```
Tests now unskipped. CUDA Python test results are now:

```
Ran 1201 tests in 78.288s

FAILED (failures=10, errors=38, skipped=13, expected failures=1)
```
CUDA Python test results now look like:

```
Ran 1201 tests in 74.361s

FAILED (failures=10, errors=38, skipped=10, expected failures=1)
```
Test results with CUDA Python are now:

```
Ran 1201 tests in 74.033s

FAILED (failures=7, errors=34, skipped=10, expected failures=1)
```
Test results with CUDA Python are now:

```
Ran 1201 tests in 73.647s

FAILED (failures=7, errors=32, skipped=10, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 75.513s

FAILED (failures=7, errors=25, skipped=10, expected failures=1)
```
CUDA Python test results are now:

```
Ran 1201 tests in 74.231s

FAILED (failures=3, errors=25, skipped=10, expected failures=2)
```
CUDA Python test results are now:

```
Ran 1201 tests in 77.240s

FAILED (failures=1, errors=22, skipped=10, expected failures=2)
```
CUDA Python test results are now:

```
Ran 1201 tests in 73.284s

FAILED (failures=1, errors=14, skipped=10, expected failures=2)
```
@stuartarchibald (Contributor) left a comment

Thanks for the update, a couple more minor things to look at, otherwise it looks good.

@gmarkall added the "4 - Waiting on reviewer" label and removed the "4 - Waiting on author" label on Nov 23, 2021
@gmarkall (Member, Author)

gpuci run tests

@gmarkall (Member, Author)

@stuartarchibald Thanks for the feedback - I've turned the hard error into a warning now. I think this should be ready for another look.
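
The exact check being discussed isn't shown in this thread, but the general shape of downgrading a hard configuration error to a warning with a fallback might look like this hypothetical sketch (the function, condition, and message are assumptions for illustration, not the code in this PR):

```python
import warnings


def resolve_cuda_binding(use_cuda_python_requested: bool) -> bool:
    """Hypothetical sketch: if the NVIDIA binding is requested but cannot be
    imported, warn and fall back to the ctypes binding instead of raising."""
    if not use_cuda_python_requested:
        return False
    try:
        import cuda  # noqa: F401  (the NVIDIA cuda-python package)
    except ImportError:
        warnings.warn(
            "CUDA Python bindings requested but the 'cuda' package is not "
            "importable; falling back to the built-in ctypes bindings."
        )
        return False
    return True
```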

@stuartarchibald added the "4 - Waiting on author" label and removed the "4 - Waiting on reviewer" and "4 - Waiting on CI" labels on Nov 23, 2021
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
@gmarkall added the "4 - Waiting on reviewer" label and removed the "4 - Waiting on author" label on Nov 23, 2021
@stuartarchibald (Contributor) left a comment

Many thanks for all your efforts on this @gmarkall, really glad to see this working and to get the performance boost this provides to CUDA users. This is waiting on a final gpuci run and also the buildfarm for confirmation. Thanks again.

@stuartarchibald added the "4 - Waiting on CI" label and removed the "4 - Waiting on reviewer" label on Nov 23, 2021
@esc (Member) commented on Nov 23, 2021

Build numba_smoketest_cuda_yaml_104 has started

@gmarkall (Member, Author)

gpuci run tests

@gmarkall (Member, Author)

@esc Did the buildfarm run succeed?

@stuartarchibald (Contributor)

> @esc Did the buildfarm run succeed?

From asking @esc OOB, yes! Which means I think this is now done!!! Thanks again @gmarkall.

@stuartarchibald added the "5 - Ready to merge" and "BuildFarm Passed" labels and removed the "4 - Waiting on CI" and "Pending BuildFarm" labels on Nov 23, 2021
@sklam merged commit 0849ae7 into numba:master on Nov 24, 2021
Labels: 5 - Ready to merge, BuildFarm Passed, CUDA, Effort - long
4 participants