Skip to content

Conversation

ldorau
Copy link
Contributor

@ldorau ldorau commented Sep 12, 2025

urDeviceRetain(MDevice) should not be called in the device_impl constructor at all,
because RefCounter is initialized with 1, when the device is created.

It fixes URT-961.

@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from b07c1d4 to 51a4b37 Compare September 12, 2025 13:43
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 51a4b37 to ab4b9b9 Compare September 12, 2025 13:48
@ldorau ldorau changed the title [UR] Do not call urDeviceRetain() in constructor of device_impl [UR] Do not call urDeviceRetain() in case of subdevices Sep 12, 2025
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from ab4b9b9 to 7c033d0 Compare September 15, 2025 10:05
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 7c033d0 to 054acf1 Compare September 15, 2025 10:15
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 054acf1 to 7eb40c4 Compare September 15, 2025 15:27
@ldorau ldorau changed the title [UR] Do not call urDeviceRetain() in case of subdevices [SYCL] Do not call urDeviceRetain() in case of subdevices Sep 15, 2025
@ldorau ldorau changed the title [SYCL] Do not call urDeviceRetain() in case of subdevices [SYCL] Do not call urDeviceRetain() in case of subdevices Sep 15, 2025
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 7eb40c4 to 1e745dc Compare September 15, 2025 15:33
@ldorau ldorau marked this pull request as ready for review September 15, 2025 15:34
@ldorau ldorau requested a review from a team as a code owner September 15, 2025 15:34
@ldorau ldorau requested a review from againull September 15, 2025 15:34
@ldorau
Copy link
Contributor Author

ldorau commented Sep 15, 2025

@pbalcer please review

@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 1e745dc to 1ab563c Compare September 15, 2025 16:21
Copy link
Contributor

@intel/llvm-gatekeepers please consider merging

Comment on lines 33 to 37
if (!IsSubDevice) {
// Interoperability Constructor already calls DeviceRetain in
// urDeviceCreateWithNativeHandle.
getAdapter().call<UrApiKind::urDeviceRetain>(MDevice);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The destructor device_impl::~device_impl calls urDeviceRelease, which is meant to balance the urDeviceRetain in the constructor. However, after this change, urDeviceRetain is no longer called for subdevices in the constructor, while urDeviceRelease is still invoked for them in the destructor. This imbalance is somewhat concerning. It seems that the actual root cause of the memory leak may lie elsewhere. Could you please clarify what you believe the underlying issue is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that when a device is created RefCount is equal 1 without any call to retain(), but it is destroyed when RefCount is equal 0, so there should be one more call to release() than a number of calls to retain() ... or the implementation of class RefCount is wrong ?

class RefCount {
public:
  RefCount(uint32_t count = 1) : Count(count) {}
  ...
  uint32_t retain() { return ++Count; }
  bool release() { return --Count == 0; }
  void reset(uint32_t value = 1) { Count = value; }

private:
  std::atomic_uint32_t Count;
};

...

ur_result_t urDeviceRelease(ur_device_handle_t Device) {
  // Root devices are destroyed during the piTearDown process.
  if (Device->isSubDevice()) {
    if (Device->RefCount.release()) {
      delete Device;
    }
  }

  return UR_RESULT_SUCCESS;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@againull againull Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldorau @pbalcer In such case, why we differentiate between root device and sub-device in this case? can we just remove retain from the constructor completely? (only sub-devices being ref counted seems like implementation detail of the adapter)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"(only sub-devices being ref counted seems like implementation detail of the adapter)"

That's the reason. I pointed it out because it's possible that the existing sequence of creates/retains/releases was always broken, and we never noticed because the leak happens in the rare case of subdevices. So in a way, I'm agreeing with your earlier comment.

Copy link
Contributor Author

@ldorau ldorau Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am testing not calling urDeviceRetain() in device_impl ctor at all (#20144) - one unittest (sycl/unittests/context_device/DeviceRefCounter.cpp) will have to be changed at least ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@againull I have removed urDeviceRetain(MDevice) from the device_impl constructor completely.
Please re-review.

@ldorau ldorau requested a review from againull September 17, 2025 09:45
Copy link
Contributor

@intel/llvm-gatekeepers please consider merging

1 similar comment
Copy link
Contributor

@intel/llvm-gatekeepers please consider merging

@ldorau ldorau changed the title [SYCL] Do not call urDeviceRetain() in case of subdevices [SYCL][UR] Do not call urDeviceRetain() in device_impl constructor at all Sep 22, 2025
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 119f6a2 to 6396c62 Compare September 22, 2025 20:12
@ldorau
Copy link
Contributor Author

ldorau commented Sep 22, 2025

@pbalcer @againull I have removed urDeviceRetain(MDevice) from the device_impl constructor completely.
Please re-review.

urDeviceRetain(MDevice) should not be called in device_impl ctor at all,
because RefCounter is initialized with 1, when the device is created.

It fixes URT-961.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
@pbalcer
Copy link
Contributor

pbalcer commented Sep 23, 2025

@pbalcer @againull I have removed urDeviceRetain(MDevice) from the device_impl constructor completely. Please re-review.

I don't know enough about how the SYCL device object is managed to be able to say whether this is correct. But, generally, this makes sense to me - devices (and all other UR objects) start with refcount 1, so retaining them immediately after creation is unnecessary.

// So for this test, we just do it.
sycl::detail::GlobalHandler::instance().getPlatformCache().clear();

EXPECT_EQ(DevRefCounter, 0);
Copy link
Contributor

@againull againull Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit off that ref count is zero here, because callback for urDeviceGet (redefinedDevicesGetAfter) increments the DevRefCounter. So, even if we don't call Retain anymore, ref count is still expected to be at least 1 because of urDevicesGet. It would be nice to either understand why it is normal to have zero ref count here or to update the test if necessary if there is something wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@againull I have debugged this unittest. It calls 3 times platform_impl::getDevicesImplHelper():

void platform_impl::getDevicesImplHelper(ur_device_type_t UrDeviceType,
which calls:

  1. urDeviceGet(NumDevices == 0) what sets DevRefCounter == 0 (using redefinedDevicesGetAfter()):
    MAdapter->call<UrApiKind::urDeviceGet>(MPlatform, UrDeviceType,
    0u, // CP info::device_type::all
    nullptr, &NumDevices);
  2. urDeviceGet(NumDevices == 1) what sets DevRefCounter == 1 (using redefinedDevicesGetAfter()):
    MAdapter->call<UrApiKind::urDeviceGet>(
    MPlatform,
    UrDeviceType, // CP info::device_type::all
    NumDevices, UrDevices.data(), nullptr);
  3. urDeviceRelease() on this device what decrements DevRefCounter to 0 (using redefinedDeviceReleaseAfter()):
    // The reference counter for handles, that we used to create sycl objects, is
    // incremented, so we need to call release here.
    for (ur_device_handle_t &UrDev : UrDevicesToCleanUp)
    MAdapter->call<UrApiKind::urDeviceRelease>(UrDev);

... what looks like:

- redefinedDevicesGetAfter(*params.pphDevices == 0)    DevRefCounter == 0  from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1)   DevRefCounter == 1  from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter()                        DevRefCounter == 0  from urDeviceRelease()

- redefinedDevicesGetAfter(*params.pphDevices == 0)    DevRefCounter == 0  from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1)   DevRefCounter == 1  from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter()                        DevRefCounter == 0  from urDeviceRelease()

- redefinedDevicesGetAfter(*params.pphDevices == 0)    DevRefCounter == 0  from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1)   DevRefCounter == 1  from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter()                        DevRefCounter == 0  from urDeviceRelease()

@againull Does it make sense or is there anything wrong here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit off that ref count is zero here, because callback for urDeviceGet (redefinedDevicesGetAfter) increments the DevRefCounter. So, even if we don't call Retain anymore, ref count is still expected to be at least 1 because of urDevicesGet. It would be nice to either understand why it is normal to have zero ref count here or to update the test if necessary if there is something wrong.

@againull So, summarizing, each time urDeviceGet() (redefinedDevicesGetAfter()) increments the DevRefCounter() at

MAdapter->call<UrApiKind::urDeviceGet>(
MPlatform,
UrDeviceType, // CP info::device_type::all
NumDevices, UrDevices.data(), nullptr);

a few moments later it is decremented to 0 by urDeviceRelease() (redefinedDeviceReleaseAfter()) at

// The reference counter for handles, that we used to create sycl objects, is
// incremented, so we need to call release here.
for (ur_device_handle_t &UrDev : UrDevicesToCleanUp)
MAdapter->call<UrApiKind::urDeviceRelease>(UrDev);

So, IMHO, all is OK here and works correctly. Do you agree @againull ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@againull ping

Copy link
Contributor

@againull againull Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for delay. It looks like that this code:

 // The reference counter for handles, that we used to create sycl objects, is 
 // incremented, so we need to call release here. 
 for (ur_device_handle_t &UrDev : UrDevicesToCleanUp) 
   MAdapter->call<UrApiKind::urDeviceRelease>(UrDev); 

needs to be removed as well. I believe these release calls were needed just because of redundant retain in device_impl constructor, but you are removing that retain now. Could you please try to remove this part as well.

Copy link
Contributor Author

@ldorau ldorau Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@againull It seems it works with this code removed, only the sycl/unittests/context_device/DeviceRefCounter.cpp test should be changed, but frankly I do not know how (see https://github.com/intel/llvm/pull/20341/files) - could you help with that? Do you have any idea?

Copy link
Contributor

@againull againull Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldorau I have taken a closer look. Actually, I believe initial device_impl implementation was correct (urRetain in constructor and urRelease in destructor) because device_impl doesn't own UR handle exclusively, there is shared ownership. For example, there might be a situation when two device_impl objects are created with the same UR handle, in this case it is supposed to look like this:

 urDeviceGet  ->  we get ur_device_handle_t with ref count == 1
 create first object using that handle ->  ref count == 2
 create second object using same handle  -> ref count == 3
 urRelease -> we call this when current code doesn't need device handle anymore. (ref count 2)
 destructor of first object (ref count 1)
 destructor of first object (ref count 0)

Same for sub-devices: urDevicePartition is similar to urDeviceGet (it returns +1 refcount).

So, I believe the correct fix will be this (includes test which fails without changes):
againull@bbd8db9

Copy link
Contributor Author

@ldorau ldorau Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@againull Right, it makes sense. Thanks for this fix. Could you submit a pull request with this fix, please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, sure, opened #20370

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

@ldorau ldorau requested a review from againull September 29, 2025 09:19
@ldorau
Copy link
Contributor Author

ldorau commented Sep 30, 2025

@pbalcer please re-review

@ldorau
Copy link
Contributor Author

ldorau commented Oct 6, 2025

@pbalcer @againull please re-review

@ldorau ldorau requested a review from cperkinsintel October 11, 2025 10:30
Copy link
Contributor

@intel/llvm-gatekeepers please consider merging

@intel intel deleted a comment from github-actions bot Oct 14, 2025
@intel intel deleted a comment from github-actions bot Oct 14, 2025
@intel intel deleted a comment from github-actions bot Oct 14, 2025
@intel intel deleted a comment from github-actions bot Oct 14, 2025
@intel intel deleted a comment from github-actions bot Oct 14, 2025
@intel intel deleted a comment from github-actions bot Oct 14, 2025
@ldorau ldorau marked this pull request as draft October 15, 2025 13:54
@ldorau
Copy link
Contributor Author

ldorau commented Oct 16, 2025

Replaced with #20370

@ldorau ldorau closed this Oct 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants