
Delayed CUDA deallocation breaks pinned/mapped context managers #3508

Closed
danielwe opened this issue Nov 18, 2018 · 3 comments

Labels: bug, CUDA

Comments

@danielwe (Contributor)
Context managers cannot be used to repeatedly pin/map an existing array: the call to cuMemHostUnregister is delayed by the same mechanism as device memory deallocation, hence in many cases the memory will still be pinned on subsequent context manager invocations, raising CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED.

This fails:

import numpy as np
from numba import cuda

arr = np.zeros(1)
with cuda.pinned(arr):
    pass
with cuda.pinned(arr):
    pass
...
CudaAPIError: [712] Call to cuMemHostRegister results in CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED

This works:

import numpy as np
from numba import cuda

arr = np.zeros(1)
with cuda.pinned(arr):
    pass
cuda.current_context().deallocations.clear()
with cuda.pinned(arr):
    pass

Are there good reasons for routing finalizers that wrap cuMemHostUnregister through the deallocation queue instead of calling them immediately?

And what about cuMemFreeHost, i.e., deallocation of memory that was allocated with cuda.{pinned,mapped}_array? It seems odd that a chunk of host memory has to wait in line together with objects in device memory to be freed; however, I don't think I fully appreciate the implications of these events for asynchronous execution and system freezing in the case of corrupt contexts.
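
For illustration, a minimal sketch of the behavior being questioned, assuming host buffers from cuda.pinned_array are indeed finalized through the same queue as device memory (the buffer name and size below are arbitrary):

import numpy as np
from numba import cuda

buf = cuda.pinned_array(10**6, dtype=np.float64)
del buf  # under this assumption, the host buffer's finalizer is queued; memory not yet released
cuda.current_context().deallocations.clear()  # flushing the queue releases it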

@danielwe (Contributor, Author)
A cheap solution could be to always flush deallocations before the context manager yields, like this:
numba/cuda/api.py, lines 238-252

# Page lock
@require_context
@contextlib.contextmanager
def pinned(*arylist):
    """A context manager for temporary pinning a sequence of host ndarrays.
    """
+   current_context().deallocations.clear()
    pmlist = []
    for ary in arylist:
        pm = current_context().mempin(ary, driver.host_pointer(ary),
                                      driver.host_memory_size(ary),
                                      mapped=False)
        pmlist.append(pm)
    yield
    del pmlist

However, this doesn't help if the context manager is called inside a defer_cleanup() context. That is, the following still fails:

with cuda.pinned(arr):
    pass
with cuda.defer_cleanup():
    with cuda.pinned(arr):
        pass

Side note: To ensure cleanup in case of an exception within the with block, the last two lines should be wrapped in a try/finally statement:

    try:
        yield
    finally:
        del pmlist
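
For reference, a sketch of the pinned context manager with both suggestions applied (the entry-time flush and the try/finally); as noted above, this still would not help inside defer_cleanup():

# Page lock
@require_context
@contextlib.contextmanager
def pinned(*arylist):
    """A context manager for temporary pinning a sequence of host ndarrays.
    """
    # Flush pending deallocations so any earlier unpin finalizers run first.
    current_context().deallocations.clear()
    pmlist = []
    for ary in arylist:
        pm = current_context().mempin(ary, driver.host_pointer(ary),
                                      driver.host_memory_size(ary),
                                      mapped=False)
        pmlist.append(pm)
    try:
        yield
    finally:
        del pmlist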

@stuartarchibald added the bug and CUDA labels on Nov 20, 2018
@stuartarchibald (Contributor)
Thanks for the report; I can reproduce it. I think your assessment of the problem is correct, and evidently some thought needs to go into a suitable fix. Thanks for pointing out test cases and providing initial thoughts, very useful.

@sklam (Member) commented Nov 20, 2018

I think you are right that on exit from with cuda.pinned(arr):, arr should be unpinned without delay. The same goes for mapped.

The delayed device deallocation is needed to avoid breaking asynchronous execution, because device arrays can be created automatically and go out of scope in odd places. Pinned and mapped memory, on the other hand, are explicitly created by users, so they won't have the same problem of going out of scope unknowingly. So I don't think they need delayed cleanup.
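
Until such a change lands, a user-level workaround along the lines already shown in this report could look like the sketch below. The wrapper name pinned_now is hypothetical, and, as discussed above, flushing the queue does not help inside cuda.defer_cleanup():

import contextlib
import numpy as np
from numba import cuda

@contextlib.contextmanager
def pinned_now(*arylist):
    # Pin via cuda.pinned, then flush the deallocation queue on exit so the
    # queued cuMemHostUnregister finalizers run immediately.
    try:
        with cuda.pinned(*arylist):
            yield
    finally:
        cuda.current_context().deallocations.clear()

arr = np.zeros(1)
with pinned_now(arr):
    pass
with pinned_now(arr):
    pass  # does not raise CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED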
