
Delayed CUDA deallocation breaks pinned/mapped context managers #3508

Closed
danielwe opened this issue Nov 18, 2018 · 3 comments

Labels: bug, CUDA

Comments

@danielwe (Contributor)
Context managers cannot be used to repeatedly pin/map an existing array: the call to cuMemHostUnregister is delayed by the same mechanism as device memory deallocation, hence in many cases the memory will still be pinned on subsequent context manager invocations, raising CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED.

This fails:

import numpy as np
from numba import cuda

arr = np.zeros(1)
with cuda.pinned(arr):
    pass
with cuda.pinned(arr):
    pass
...
CudaAPIError: [712] Call to cuMemHostRegister results in CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED

This works:

import numpy as np
from numba import cuda

arr = np.zeros(1)
with cuda.pinned(arr):
    pass
cuda.current_context().deallocations.clear()
with cuda.pinned(arr):
    pass

Are there good reasons for routing finalizers that wrap cuMemHostUnregister through the deallocation queue instead of calling them immediately?

And what about cuMemFreeHost, i.e., deallocation of memory that was allocated with cuda.{pinned,mapped}_array? It seems odd that a chunk of host memory has to wait in line together with objects in device memory to be freed; however, I don't think I fully appreciate the implications of these events for asynchronous execution and system freezing in the case of corrupt contexts.
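
For illustration, a minimal sketch of the behavior being questioned, assuming host buffers from cuda.pinned_array are indeed finalized through the same queue as device memory (the buffer name and size below are arbitrary):

import numpy as np
from numba import cuda

buf = cuda.pinned_array(10**6, dtype=np.float64)
del buf  # under this assumption, the host buffer's finalizer is queued; memory not yet released
cuda.current_context().deallocations.clear()  # flushing the queue releases it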

@danielwe (Contributor, Author)
A cheap solution could be to always flush deallocations before the context manager yields, like this:
numba/cuda/api.py, lines 238-252

# Page lock
@require_context
@contextlib.contextmanager
def pinned(*arylist):
    """A context manager for temporary pinning a sequence of host ndarrays.
    """
+   current_context().deallocations.clear()
    pmlist = []
    for ary in arylist:
        pm = current_context().mempin(ary, driver.host_pointer(ary),
                                      driver.host_memory_size(ary),
                                      mapped=False)
        pmlist.append(pm)
    yield
    del pmlist

However, this doesn't help if the context manager is called inside a defer_cleanup() context. That is, the following still fails:

with cuda.pinned(arr):
    pass
with cuda.defer_cleanup():
    with cuda.pinned(arr):
        pass

Side note: To ensure cleanup in case of an exception within the with block, the last two lines should be wrapped in a try/finally statement:

    try:
        yield
    finally:
        del pmlist
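
For reference, a sketch of the pinned context manager with both suggestions applied (the entry-time flush and the try/finally); as noted above, this still would not help inside defer_cleanup():

# Page lock
@require_context
@contextlib.contextmanager
def pinned(*arylist):
    """A context manager for temporary pinning a sequence of host ndarrays.
    """
    # Flush pending deallocations so any earlier unpin finalizers run first.
    current_context().deallocations.clear()
    pmlist = []
    for ary in arylist:
        pm = current_context().mempin(ary, driver.host_pointer(ary),
                                      driver.host_memory_size(ary),
                                      mapped=False)
        pmlist.append(pm)
    try:
        yield
    finally:
        del pmlist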

@stuartarchibald added the bug and CUDA labels on Nov 20, 2018
@stuartarchibald (Contributor)
Thanks for the report; I can reproduce it. I think your assessment of the problem is correct, and evidently some thought needs to go into a suitable fix. Thanks for pointing out test cases and providing initial thoughts, very useful.

@sklam (Member) commented Nov 20, 2018

I think you are right that on exit from with cuda.pinned(arr):, arr should be unpinned without delay. The same goes for mapped.

The delayed device deallocation is needed to avoid breaking asynchronous execution, because device arrays can be created automatically and go out of scope in odd places. Pinned and mapped memory, on the other hand, are explicitly created by users, so they won't have the same problem of going out of scope unknowingly. So I don't think they need delayed cleanup.
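
Until such a change lands, a user-level workaround along the lines already shown in this report could look like the sketch below. The wrapper name pinned_now is hypothetical, and, as discussed above, flushing the queue does not help inside cuda.defer_cleanup():

import contextlib
import numpy as np
from numba import cuda

@contextlib.contextmanager
def pinned_now(*arylist):
    # Pin via cuda.pinned, then flush the deallocation queue on exit so the
    # queued cuMemHostUnregister finalizers run immediately.
    try:
        with cuda.pinned(*arylist):
            yield
    finally:
        cuda.current_context().deallocations.clear()

arr = np.zeros(1)
with pinned_now(arr):
    pass
with pinned_now(arr):
    pass  # does not raise CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED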
