Conversation
8a2d0e3 to
6d27f2a
Compare
…tration Previously, UR_MEM_FLAG_USE_HOST_POINTER was effectively ignored in both adapters due to a dead `EnableUseHostPtr = false` flag, causing the host data to be copied to the device instead of being mapped. This change removes that dead flag and properly enables the USE_HOST_PTR path, calling cuMemHostRegister / hipHostRegister to map the caller-provided host pointer for device access. In allocateMemObjOnDeviceIfNeeded, CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED / hipErrorHostMemoryAlreadyRegistered is now tolerated to handle the case where the same host pointer is registered more than once across multiple lazy device allocations. The conformance test urMemBufferCreate.UseHostPointer is updated to remove CUDA and HIP from the known-failure list. Resolves: #9789
|
This PR started from an attempt to get the UseHostPointer conformance test in urMemBufferCreate to pass for CUDA and HIP. UR_MEM_FLAG_USE_HOST_POINTER had been silently treated as a copy operation in both adapters, hidden behind a hardcoded EnableUseHostPtr = false guard. The guard was accompanied by a comment citing a "weird segfault after program ends," but no such segfault was observed, so the dead code has been removed. Enabling the feature revealed one real issue: cuMemHostRegister/hipHostRegister is called once during buffer creation in urMemBufferCreate, and then again in allocateMemObjOnDeviceIfNeeded at kernel submission time, resulting in a double-registration. This is handled by tolerating CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED/hipErrorHostMemoryAlreadyRegistered in allocateMemObjOnDeviceIfNeeded instead of propagating it as an error. |
There was a problem hiding this comment.
Pull request overview
This PR updates Unified Runtime buffer creation behavior to better support UR_MEM_FLAG_USE_HOST_POINTER by using host-memory registration (CUDA/HIP) rather than falling back to an initial copy, and adjusts conformance expectations accordingly.
Changes:
- Enable
UR_MEM_FLAG_USE_HOST_POINTERpaths for HIP and CUDA adapters (removing the previous “disabled” gating logic). - Refine “initial copy” logic so it only triggers for
UR_MEM_FLAG_ALLOC_COPY_HOST_POINTER. - Update conformance test expectations by removing HIP/CUDA from the known-failing
UseHostPointertest list.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| unified-runtime/test/conformance/memory/urMemBufferCreate.cpp | Removes HIP/CUDA from known failures for UseHostPointer conformance test. |
| unified-runtime/source/adapters/hip/memory.cpp | Switches USE_HOST_POINTER to UseHostPtr alloc mode and adjusts registration behavior/error handling. |
| unified-runtime/source/adapters/cuda/memory.cpp | Switches USE_HOST_POINTER to UseHostPtr alloc mode and adjusts registration error handling. |
Comments suppressed due to low confidence (1)
unified-runtime/source/adapters/hip/memory.cpp:107
UR_MEM_FLAG_USE_HOST_POINTERnow setsAllocMode::UseHostPtr, buturMemBufferCreateno longer registers the host memory. SinceBufferMem::clear()unconditionally callshipHostUnregister(HostPtr)forUseHostPtr, releasing a buffer that was never used on a device will attempt to unregister an unregistered pointer and can fail/crash. Register the host pointer inurMemBufferCreatewhenUSE_HOST_POINTERis set, or track whether registration actually happened and only unregister in that case.
auto HostPtr = pProperties ? pProperties->pHost : nullptr;
BufferMem::AllocMode AllocMode = BufferMem::AllocMode::Classic;
if (flags & UR_MEM_FLAG_USE_HOST_POINTER) {
AllocMode = BufferMem::AllocMode::UseHostPtr;
} else if (flags & UR_MEM_FLAG_ALLOC_HOST_POINTER) {
UR_CHECK_ERROR(hipHostMalloc(&HostPtr, size));
AllocMode = BufferMem::AllocMode::AllocHostPtr;
} else if (flags & UR_MEM_FLAG_ALLOC_COPY_HOST_POINTER) {
AllocMode = BufferMem::AllocMode::CopyIn;
}
Fix CU_MEMHOSTALLOC_DEVICEMAP → CU_MEMHOSTREGISTER_DEVICEMAP in the CUDA adapter's allocateMemObjOnDeviceIfNeeded (wrong flag for cuMemHostRegister). Add HostPtrRegisteredByUR flag to HIP BufferMem to track whether UR performed the hipHostRegister call. clear() now only calls hipHostUnregister when UR owns the registration, preventing incorrect unregistration of user-provided already-registered memory.
Remove eager cuMemHostRegister from urMemBufferCreate; registration is now lazy-only in allocateMemObjOnDeviceIfNeeded, matching the HIP approach.
Replace the defaulted BufferMem copy constructor with an explicit one that does not copy HostPtrRegisteredByUR, so sub-buffers do not inherit registration ownership from their parent.
Add explicit copy constructor to HIP BufferMem that does not copy HostPtrRegisteredByUR, matching the fix already applied to the CUDA adapter. Prevents sub-buffers from inheriting registration ownership from their parent.
|
Review please @intel/llvm-reviewers-cuda |
|
@kweronsx please update the description of this PR (it is empty now) #21725 (comment), because it will be used as the description of the final commit squashed from all commits from this PR. |
This PR fixes a long-standing issue (#9789) where UR_MEM_FLAG_USE_HOST_POINTER was silently ignored in the CUDA and HIP adapters, causing the host data to be copied to the device instead of being mapped.
What changed:
Removed the dead EnableUseHostPtr = false flag in both cuda/memory.cpp and hip/memory.cpp. This flag was hardcoded to false with a TODO comment, meaning the USE_HOST_PTR path was never actually taken — the buffer always fell back to copying.
Enabled the UseHostPtr allocation mode by properly entering that branch when UR_MEM_FLAG_USE_HOST_POINTER is set. The hipHostRegister / cuMemHostRegister call was moved from urMemBufferCreate into allocateMemObjOnDeviceIfNeeded (i.e. it is deferred to lazy device allocation), which is where it already lived.
Added HostPtrRegisteredByUR ownership flag to BufferMem in both headers. This boolean is set to true only when UR itself performed the registration. It ensures cuMemHostUnregister / hipHostUnregister is only called by the owner, preventing a double-unregister.
Handled already-registered pointers gracefully: CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED / hipErrorHostMemoryAlreadyRegistered is now tolerated in allocateMemObjOnDeviceIfNeeded. This covers cases where the same host pointer is registered across multiple lazy device allocations (e.g. multi-device contexts).
Explicit copy constructors were added to BufferMem in both adapters to deliberately not copy HostPtrRegisteredByUR, making ownership semantics clear — only the original object is responsible for unregistering.
Conformance test update: CUDA and HIP are removed from the UseHostPointer known-failure list in urMemBufferCreate.cpp, reflecting that the feature now works correctly on those backends.