
CUDA: Fix potential leaks when initialization fails #7360

Merged
4 commits merged into numba:master on Sep 22, 2021

Conversation

gmarkall (Member) commented Sep 2, 2021

When CUDA driver initialization fails, the driver singleton object persists holding on to an exception object that references calling
frames and potentially other objects. This can create a leak in the case where modules that attempt to initialize the driver and then fail are created and destroyed.

This change rectifies the issue by holding on to a string describing the error instead of the exception object. There are a couple of small related changes:

  • initialize() is changed to ensure_initialized(), and the caller no longer needs to check whether it should be called - it can always call it when it needs to ensure that the driver is initialized.
  • cuda.cuda_error() returns the error string instead of an exception object. Constructing an exception object here just to maintain the original behavior seems a bit convoluted; it's likely that any code using cuda_error() is checking whether its return value is None rather than looking for a specific instance of an exception class to see if an exception occurred.
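The leak-avoidance pattern described above can be sketched as follows. This is a minimal illustration, not Numba's actual driver code: the `_Driver` class, `_initialize` method, and the error message are stand-ins invented for the example.

```python
class _Driver:
    """Minimal sketch of a lazily initialized driver singleton."""

    def __init__(self):
        self.is_initialized = False
        # Hold a string, not an exception: keeping the exception alive
        # would also keep its traceback, calling frames, and anything
        # those frames reference.
        self.initialization_error = None

    def ensure_initialized(self):
        # Safe to call repeatedly; callers no longer check a flag first.
        if self.is_initialized:
            return
        self.is_initialized = True
        try:
            self._initialize()  # e.g. load the library and call cuInit(0)
        except RuntimeError as e:
            # Store only a description of the failure.
            self.initialization_error = str(e)

    def _initialize(self):
        # Stand-in for a failing initialization.
        raise RuntimeError("CUDA driver library cannot be found")


driver = _Driver()


def cuda_error():
    """Return the error string from initialization, or None on success."""
    driver.ensure_initialized()
    return driver.initialization_error
```

With this shape, callers test `cuda_error() is None` rather than inspecting an exception instance, and nothing keeps the failing call's frames alive.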

Some initialization tests are added - for the failing cases we need to run in a subprocess to avoid interfering with the initialization of the driver in the process in which we're actually running tests. The failure of cuInit(0) is accomplished by a slightly unorthodox patching of driver.cuInit, which is needed because driver functions are added to the Driver object on-demand, so there is nothing for mock.patch.object() to replace at the time we need to set up the mock.
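The subprocess-based testing approach can be sketched like this. It is a simplified stand-in: the target function and its error message are hypothetical, and the real tests force the failure by assigning a stub `cuInit` onto the driver (because driver functions are bound on demand, there is no attribute for `mock.patch.object()` to replace).

```python
import multiprocessing as mp


def initialization_target(result_queue):
    """Run in a child process so a failed init can't pollute the parent."""
    try:
        # Hypothetical stand-in for initialization failing inside the
        # child process, e.g. a patched cuInit(0) returning an error.
        raise RuntimeError("cuInit failed: CUDA_ERROR_UNKNOWN (999)")
    except RuntimeError as e:
        result_queue.put((False, str(e)))
    else:
        result_queue.put((True, None))


def run_in_subprocess(target):
    ctx = mp.get_context()
    result_queue = ctx.Queue()
    proc = ctx.Process(target=target, args=(result_queue,))
    proc.start()
    proc.join(30)  # timeout so a hang can't stall the whole suite
    return result_queue.get()


if __name__ == "__main__":
    success, msg = run_in_subprocess(initialization_target)
    assert not success
    assert "cuInit failed" in msg
```

The parent process never touches the driver itself, so its own (real) initialization state is unaffected by whatever happens in the child.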

This includes the addition of some stubs needed for the test to import correctly in the simulator.
gmarkall (Member Author) commented Sep 2, 2021

@stuartarchibald @sklam I've tested this with cuDF:

  • In the normal case (cuInit(0) succeeds), all is well.
  • In the failing case (cuInit(0) fails), cuDF errors on all tests before it even reaches Numba - every test fails with an RMM error instead.

So I believe this change will have no effect on cuDF (and the other RAPIDS libraries).

gmarkall (Member Author) commented Sep 2, 2021

@philippjfr Are you able to verify that this resolves holoviz/panel#2640?

philippjfr commented

Thanks for checking with me - what's the best way to install the PR? I've never installed numba from source; can I just check it out and `pip install -e .`?

gmarkall (Member Author) commented Sep 2, 2021

> Thanks for checking with me, what's the best way to install the PR? I've never installed numba from source, can I just check it out and pip install -e .?

That might work - you will need the latest dev version of llvmlite though, and I'm not sure if that will be a problem with pip - I usually install the dev version of llvmlite with:

conda install numba/label/dev::llvmlite

If you're in a conda env, then I would guess you could do the above and then run `pip install --no-deps -e .`.

@@ -223,7 +223,10 @@ def __init__(self):
self.is_initialized = True
self.initialization_error = e


Still seeing the same issue because of this line (when I comment it out the issue goes away).


Presumably .initialize() is never exercised in the scenario I'm testing (which is simply to import datashader on a machine without CUDA)

gmarkall (Member Author):

Argh! Sorry I missed this, will fix up tomorrow.

gmarkall (Member Author):

@philippjfr This should now be resolved - could you let me know if it's fixed up all the issues in your test please?


Can confirm, fixed now.

gmarkall (Member Author):

Many thanks for re-testing!

…SupportError

This addresses a further cause of a memory leak similar to the previous
commit. Additional tests are added with CUDA disabled to force an error
during `Driver.__init__()`.

A very small edit is made to `initialization_error_test()`, because `cuda_error()` should be available at any time, not just when catching an exception from initializing CUDA. This does not affect the test, but does more closely mirror potential use cases.
@stuartarchibald stuartarchibald added the Effort - medium Medium size effort needed label Sep 6, 2021
@gmarkall gmarkall added this to the Numba 0.54.1 milestone Sep 6, 2021
stuartarchibald (Contributor) left a comment:

Thanks for the patch - looks good, and on inspection it should stop the reported issue of holding references to objects via the numba cuda driver singleton->exception->backtrace->frame path. There are a few minor things to look at, but otherwise this is ready for testing on the buildfarm.

numba/cuda/cudadrv/driver.py (resolved)
- self.initialization_error = e
+ self.initialization_error = e.msg

def ensure_initialized(self):
stuartarchibald (Contributor):

As noted out of band, I'm not convinced this (or the init) is thread-safe in its current state; it seems like the is_initialized bit is set too eagerly. Suggest fixing for the next release. xref #7387
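One common way to address this kind of race is to guard initialization with a lock and publish the initialized flag only after the attempt completes. This is a general sketch of that pattern, not the fix that eventually landed via #7387; `LazyDriver` and `_initialize` are hypothetical names.

```python
import threading


class LazyDriver:
    """Sketch of lock-guarded, one-shot lazy initialization."""

    def __init__(self):
        self._lock = threading.Lock()
        self._initialized = False
        self.initialization_error = None

    def ensure_initialized(self):
        # Double-checked: the fast path skips the lock once initialized.
        if self._initialized:
            return
        with self._lock:
            if self._initialized:
                return
            try:
                self._initialize()
            except RuntimeError as e:
                self.initialization_error = str(e)
            finally:
                # Publish the flag only after the attempt has finished,
                # so other threads never observe a half-initialized state.
                self._initialized = True

    def _initialize(self):
        # Stand-in for loading the library and calling cuInit(0).
        pass
```

Setting the flag inside the lock, after the attempt, is what avoids the "set too eagerly" problem: a second thread either waits on the lock or sees a fully settled state.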

result_queue = ctx.Queue()
proc = ctx.Process(target=target, args=(result_queue,))
proc.start()
proc.join()
stuartarchibald (Contributor):

Suggested change:
- proc.join()
+ proc.join(30)  # should complete within 30s

Perhaps add a timeout just in case something gets stuck so the test suite doesn't hang?

proc.join()
success, msg = result_queue.get()

# Ensure the child process raised an except during initialization
stuartarchibald (Contributor):

Suggested change:
- # Ensure the child process raised an except during initialization
+ # Ensure the child process raised an exception during initialization

stuartarchibald (Contributor) commented

CC @sklam, did you perhaps want to take a look at this patch too? I helped debug the original issue, so some additional eyes on it might be good, especially given it's scheduled for the 0.54.1 patch release. Thanks!

@stuartarchibald stuartarchibald added 4 - Waiting on author Waiting for author to respond to review and removed 3 - Ready for Review labels Sep 8, 2021
@gmarkall gmarkall added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Sep 13, 2021
gmarkall (Member Author) commented

@stuartarchibald Many thanks for the review - comments now addressed.

stuartarchibald (Contributor) commented

> @stuartarchibald Many thanks for the review - comments now addressed.

Thanks, looks good.

stuartarchibald (Contributor) left a comment:

Thanks for the patch!

@stuartarchibald stuartarchibald removed the 4 - Waiting on reviewer Waiting for reviewer to respond to author label Sep 13, 2021
@stuartarchibald stuartarchibald added 4 - Waiting on CI Review etc done, waiting for CI to finish Pending BuildFarm For PRs that have been reviewed but pending a push through our buildfarm labels Sep 13, 2021
philippjfr commented

Appreciate the quick turnaround, thanks everyone!

stuartarchibald (Contributor) commented

Buildfarm ID: numba_smoketest_cuda_yaml_94.

stuartarchibald (Contributor) commented

> Buildfarm ID: numba_smoketest_cuda_yaml_94.

Passed.

stuartarchibald (Contributor) commented

> Appreciate the quick turnaround, thanks everyone!

@philippjfr No problem, thanks for testing it! If you still have the setup available, is there any chance you could please test 17e112b against the original problem, just to make sure the patch that will get merged does indeed still fix it? Many thanks.

@stuartarchibald stuartarchibald added BuildFarm Passed For PRs that have been through the buildfarm and passed and removed Pending BuildFarm For PRs that have been reviewed but pending a push through our buildfarm labels Sep 13, 2021
@stuartarchibald stuartarchibald added 5 - Ready to merge Review and testing done, is ready to merge and removed 4 - Waiting on CI Review etc done, waiting for CI to finish labels Sep 22, 2021
@sklam sklam merged commit 15d8eb1 into numba:master Sep 22, 2021
sklam added a commit to sklam/numba that referenced this pull request Sep 22, 2021
CUDA: Fix potential leaks when initialization fails
Labels
  • 5 - Ready to merge: Review and testing done, is ready to merge
  • BuildFarm Passed: For PRs that have been through the buildfarm and passed
  • CUDA: CUDA related issue/PR
  • Effort - medium: Medium size effort needed

4 participants