
CUDA: Fix potential leaks when initialization fails #7360

Merged
4 commits merged into numba:master on Sep 22, 2021

Conversation

gmarkall (Member) commented Sep 2, 2021

When CUDA driver initialization fails, the driver singleton object persists holding on to an exception object that references calling
frames and potentially other objects. This can create a leak in the case where modules that attempt to initialize the driver and then fail are created and destroyed.

This change rectifies the issue by holding on to a string describing the error instead of the exception object. There are a couple of small related changes:

  • initialize() is changed to ensure_initialized(), and the caller no longer needs to check whether it should be called - it can always call it when it needs to ensure that the driver is initialized.
  • cuda.cuda_error() returns the error string instead of an exception object. Constructing an exception object here just to maintain the original behavior seems a bit convoluted; it's likely that any code using cuda_error() is checking whether its return value is None rather than looking for a specific instance of an exception class to see if an exception occurred.
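The leak-avoidance pattern described above can be sketched as follows. This is a minimal illustration, not Numba's actual driver code: the `_Driver` class, `_initialize` method, and the error message are stand-ins invented for the example.

```python
class _Driver:
    """Minimal sketch of a lazily initialized driver singleton."""

    def __init__(self):
        self.is_initialized = False
        # Hold a string, not an exception: keeping the exception alive
        # would also keep its traceback, calling frames, and anything
        # those frames reference.
        self.initialization_error = None

    def ensure_initialized(self):
        # Safe to call repeatedly; callers no longer check a flag first.
        if self.is_initialized:
            return
        self.is_initialized = True
        try:
            self._initialize()  # e.g. load the library and call cuInit(0)
        except RuntimeError as e:
            # Store only a description of the failure.
            self.initialization_error = str(e)

    def _initialize(self):
        # Stand-in for a failing initialization.
        raise RuntimeError("CUDA driver library cannot be found")


driver = _Driver()


def cuda_error():
    """Return the error string from initialization, or None on success."""
    driver.ensure_initialized()
    return driver.initialization_error
```

With this shape, callers test `cuda_error() is None` rather than inspecting an exception instance, and nothing keeps the failing call's frames alive.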

Some initialization tests are added - for the failing cases we need to run in a subprocess to avoid interfering with the initialization of the driver in the process in which we're actually running tests. The failure of cuInit(0) is accomplished by a slightly unorthodox patching of driver.cuInit, which is needed because driver functions are added to the Driver object on-demand, so there is nothing for mock.patch.object() to replace at the time we need to set up the mock.
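The subprocess-based testing approach can be sketched like this. It is a simplified stand-in: the target function and its error message are hypothetical, and the real tests force the failure by assigning a stub `cuInit` onto the driver (because driver functions are bound on demand, there is no attribute for `mock.patch.object()` to replace).

```python
import multiprocessing as mp


def initialization_target(result_queue):
    """Run in a child process so a failed init can't pollute the parent."""
    try:
        # Hypothetical stand-in for initialization failing inside the
        # child process, e.g. a patched cuInit(0) returning an error.
        raise RuntimeError("cuInit failed: CUDA_ERROR_UNKNOWN (999)")
    except RuntimeError as e:
        result_queue.put((False, str(e)))
    else:
        result_queue.put((True, None))


def run_in_subprocess(target):
    ctx = mp.get_context()
    result_queue = ctx.Queue()
    proc = ctx.Process(target=target, args=(result_queue,))
    proc.start()
    proc.join(30)  # timeout so a hang can't stall the whole suite
    return result_queue.get()


if __name__ == "__main__":
    success, msg = run_in_subprocess(initialization_target)
    assert not success
    assert "cuInit failed" in msg
```

The parent process never touches the driver itself, so its own (real) initialization state is unaffected by whatever happens in the child.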

This includes the addition of some stubs needed for the test to import correctly in the simulator.
gmarkall (Member Author) commented Sep 2, 2021

@stuartarchibald @sklam I've tested this with cuDF:

  • In the normal case (cuInit(0) succeeds), all is well.
  • In the failing case (cuInit(0) fails), cuDF errors on all tests before it even reaches Numba - every test fails with an RMM error instead.

So I believe this change will have no effect on cuDF (and the other RAPIDS libraries).

gmarkall (Member Author) commented Sep 2, 2021

@philippjfr Are you able to verify that this resolves holoviz/panel#2640?

philippjfr commented

Thanks for checking with me - what's the best way to install the PR? I've never installed numba from source; can I just check it out and `pip install -e .`?

gmarkall (Member Author) commented Sep 2, 2021

> Thanks for checking with me, what's the best way to install the PR? I've never installed numba from source, can I just check it out and pip install -e .?

That might work - you will need the latest dev version of llvmlite though, and I'm not sure if that will be a problem with pip - I usually install the dev version of llvmlite with:

conda install numba/label/dev::llvmlite

If you're in a conda env, then I would guess you could do the above and then run `pip install --no-deps -e .`.

@@ -223,7 +223,10 @@ def __init__(self):
self.is_initialized = True
self.initialization_error = e


Still seeing the same issue because of this line (when I comment it out the issue goes away).


Presumably .initialize() is never exercised in the scenario I'm testing (which is simply to import datashader on a machine without CUDA)

gmarkall (Member Author):

Argh! Sorry I missed this, will fix up tomorrow.

gmarkall (Member Author):

@philippjfr This should now be resolved - could you let me know if it's fixed up all the issues in your test please?


Can confirm, fixed now.

gmarkall (Member Author):

Many thanks for re-testing!

…SupportError

This addresses a further cause of a memory leak similar to the previous
commit. Additional tests are added with CUDA disabled to force an error
during `Driver.__init__()`.

A very small edit is made to `initialization_error_test()`, because `cuda_error()` should be available at any time, not just when catching an exception from initializing CUDA. This does not affect the test, but does more closely mirror potential use cases.
@stuartarchibald stuartarchibald added the Effort - medium Medium size effort needed label Sep 6, 2021
@gmarkall gmarkall added this to the Numba 0.54.1 milestone Sep 6, 2021
stuartarchibald (Contributor) left a comment:

Thanks for the patch - looks good, and on inspection it should stop the reported issue of holding references to objects via the numba cuda driver singleton->exception->backtrace->frame path. There are a few minor things to look at, but otherwise this is ready for testing on the buildfarm.

numba/cuda/cudadrv/driver.py (resolved)
- self.initialization_error = e
+ self.initialization_error = e.msg

def ensure_initialized(self):
stuartarchibald (Contributor):

As noted out of band, I'm not convinced this (or the init) is thread-safe in its current state; it seems like the is_initialized bit is set too eagerly. Suggest fixing for the next release. xref #7387
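One common way to address this kind of race is to guard initialization with a lock and publish the initialized flag only after the attempt completes. This is a general sketch of that pattern, not the fix that eventually landed via #7387; `LazyDriver` and `_initialize` are hypothetical names.

```python
import threading


class LazyDriver:
    """Sketch of lock-guarded, one-shot lazy initialization."""

    def __init__(self):
        self._lock = threading.Lock()
        self._initialized = False
        self.initialization_error = None

    def ensure_initialized(self):
        # Double-checked: the fast path skips the lock once initialized.
        if self._initialized:
            return
        with self._lock:
            if self._initialized:
                return
            try:
                self._initialize()
            except RuntimeError as e:
                self.initialization_error = str(e)
            finally:
                # Publish the flag only after the attempt has finished,
                # so other threads never observe a half-initialized state.
                self._initialized = True

    def _initialize(self):
        # Stand-in for loading the library and calling cuInit(0).
        pass
```

Setting the flag inside the lock, after the attempt, is what avoids the "set too eagerly" problem: a second thread either waits on the lock or sees a fully settled state.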

result_queue = ctx.Queue()
proc = ctx.Process(target=target, args=(result_queue,))
proc.start()
proc.join()
stuartarchibald (Contributor):

Suggested change:
- proc.join()
+ proc.join(30)  # should complete within 30s

Perhaps add a timeout just in case something gets stuck so the test suite doesn't hang?

proc.join()
success, msg = result_queue.get()

# Ensure the child process raised an except during initialization
stuartarchibald (Contributor):

Suggested change:
- # Ensure the child process raised an except during initialization
+ # Ensure the child process raised an exception during initialization

stuartarchibald (Contributor) commented

CC @sklam, did you perhaps want to take a look at this patch too? I helped debug the original issue, so some additional eyes on it might be good, especially given it's scheduled for the 0.54.1 patch release. Thanks!

@stuartarchibald stuartarchibald added 4 - Waiting on author Waiting for author to respond to review and removed 3 - Ready for Review labels Sep 8, 2021
@gmarkall gmarkall added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Sep 13, 2021
gmarkall (Member Author) commented

@stuartarchibald Many thanks for the review - comments now addressed.

stuartarchibald (Contributor) commented

> @stuartarchibald Many thanks for the review - comments now addressed.

Thanks, looks good.

stuartarchibald (Contributor) left a comment:

Thanks for the patch!

@stuartarchibald stuartarchibald removed the 4 - Waiting on reviewer Waiting for reviewer to respond to author label Sep 13, 2021
@stuartarchibald stuartarchibald added 4 - Waiting on CI Review etc done, waiting for CI to finish Pending BuildFarm For PRs that have been reviewed but pending a push through our buildfarm labels Sep 13, 2021
philippjfr commented

Appreciate the quick turnaround, thanks everyone!

stuartarchibald (Contributor) commented

Buildfarm ID: numba_smoketest_cuda_yaml_94.

stuartarchibald (Contributor) commented

> Buildfarm ID: numba_smoketest_cuda_yaml_94.

Passed.

stuartarchibald (Contributor) commented

> Appreciate the quick turnaround, thanks everyone!

@philippjfr No problem, thanks for testing it! If you still have the setup available, is there any chance you could please test 17e112b against the original problem, just to make sure the patch that will get merged does indeed still fix it? Many thanks.

@stuartarchibald stuartarchibald added BuildFarm Passed For PRs that have been through the buildfarm and passed and removed Pending BuildFarm For PRs that have been reviewed but pending a push through our buildfarm labels Sep 13, 2021
@stuartarchibald stuartarchibald added 5 - Ready to merge Review and testing done, is ready to merge and removed 4 - Waiting on CI Review etc done, waiting for CI to finish labels Sep 22, 2021
@sklam sklam merged commit 15d8eb1 into numba:master Sep 22, 2021
sklam added a commit to sklam/numba that referenced this pull request Sep 22, 2021
CUDA: Fix potential leaks when initialization fails
Labels
  • 5 - Ready to merge: Review and testing done, is ready to merge
  • BuildFarm Passed: For PRs that have been through the buildfarm and passed
  • CUDA: CUDA related issue/PR
  • Effort - medium: Medium size effort needed

4 participants