New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUDA: Fix potential leaks when initialization fails #7360
Conversation
When CUDA driver initialization fails, the driver singleton object persists holding on to an exception object that references calling frames and potentially other objects. This can create a leak in the case where modules that attempt to initialize the driver and then fail are created and destroyed. This change rectifies the issue by holding on to a string describing the error instead of the exception object. There are a couple of small related changes: - `initialize()` is changed to `ensure_initialized()`, and the caller no longer needs to check whether it should be called - it can always call it when it needs to ensure that the driver is initialized. - `cuda.cuda_error()` returns the error string instead of an exception object. Constructing an exception object here just to maintain the original behavior seems a bit convoluted; it's likely that any code using `cuda_error()` is checking whether its return value is `None` rather than looking for a specific instance of an exception class to see if an exception occurred. Some initialization tests are added - for the failing cases we need to run in a subprocess to avoid interfering with the initialization of the driver in the process in which we're actually running tests. The failure of `cuInit(0)` is accomplished by a slightly unorthodox patching of `driver.cuInit`, which is needed because driver functions are added to the `Driver` object on-demand, so there is nothing for `mock.patch.object()` to replace at the time we need to set up the mock.
This includes addition of some necessary stubs for the test to correctly import in the simulator.
4c4b275
to
9de2c66
Compare
@stuartarchibald @sklam I've tested this with cuDF:
So I believe this change will have no effect on cuDF (and the other RAPIDS libraries). |
@philippjfr Are you able to verify that this resolves holoviz/panel#2640? |
Thanks for checking with me, what's the best way to install the PR? I've never installed numba from source, can I just check it out and |
That might work - you will need the latest dev version of llvmlite though, and I'm not sure if that will be a problem with pip - I usually install the dev version of llvmlite with:
If you're in a conda env, then I would guess you could do the above then run |
numba/cuda/cudadrv/driver.py
Outdated
@@ -223,7 +223,10 @@ def __init__(self): | |||
self.is_initialized = True | |||
self.initialization_error = e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still seeing the same issue because of this line (when I comment it out the issue goes away).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably .initialize()
is never exercised in the scenario I'm testing (which is simply to import datashader
on a machine without CUDA)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh! Sorry I missed this, will fix up tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@philippjfr This should now be resolved - could you let me know if it's fixed up all the issues in your test please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can confirm, fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for re-testing!
…SupportError This addresses a further cause of a memory leak similar to the previous commit. Additional tests are added with CUDA disabled to force an error during `Driver.__init__()`. A very small edit is made to `initialization_error_test()`, because `cuda_error()` should be available any time, not just when catching an exception from initializing CUDA. This does not affect the test, but does more closely mirror any potential use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch, looks good and on inspection should stop the issue reported regarding holding reference to objects via the numba cuda driver singleton->exception->backtrace->frame path. There's a few minor things to look at but otherwise is ready for testing on the buildfarm.
self.initialization_error = e | ||
self.initialization_error = e.msg | ||
|
||
def ensure_initialized(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted out of band, am not convinced this (or the init) is threadsafe in its current state, seems like the is_initialized
bit is set too eagerly. Suggest fixing for the next release. xref #7387
result_queue = ctx.Queue() | ||
proc = ctx.Process(target=target, args=(result_queue,)) | ||
proc.start() | ||
proc.join() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proc.join() | |
proc.join(30) # should complete within 30s |
Perhaps add a timeout just in case something gets stuck so the test suite doesn't hang?
proc.join() | ||
success, msg = result_queue.get() | ||
|
||
# Ensure the child process raised an except during initialization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Ensure the child process raised an except during initialization | |
# Ensure the child process raised an exception during initialization |
CC @sklam, did you perhaps want to take a look at this patch too? I helped debug the original issue etc. so some additional eyes on it might be good especially given it's scheduled the 0.54.1 patch release. Thanks! |
@stuartarchibald Many thanks for the review - comments now addressed. |
Thanks, looks good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch!
Appreciate the quick turnaround, thanks everyone! |
Buildfarm ID: |
Passed. |
@philippjfr No problem, thanks for testing it! If you still have the set up available, is there any chance you could please test 17e112b against the original problem just to make sure the patch that will get merged does indeed still fix it!? Many thanks. |
CUDA: Fix potential leaks when initialization fails
When CUDA driver initialization fails, the driver singleton object persists holding on to an exception object that references calling
frames and potentially other objects. This can create a leak in the case where modules that attempt to initialize the driver and then fail are created and destroyed.
This change rectifies the issue by holding on to a string describing the error instead of the exception object. There are a couple of small related changes:
initialize()
is changed toensure_initialized()
, and the caller no longer needs to check whether it should be called - it can always call it when it needs to ensure that the driver is initialized.cuda.cuda_error()
returns the error string instead of an exception object. Constructing an exception object here just to maintain the original behavior seems a bit convoluted; it's likely that any code usingcuda_error()
is checking whether its return value isNone
rather than looking for a specific instance of an exception class to see if an exception occurred.Some initialization tests are added - for the failing cases we need to run in a subprocess to avoid interfering with the initialization of the driver in the process in which we're actually running tests. The failure of
cuInit(0)
is accomplished by a slightly unorthodox patching ofdriver.cuInit
, which is needed because driver functions are added to theDriver
object on-demand, so there is nothing formock.patch.object()
to replace at the time we need to set up the mock.