What is the purpose of calling MarkAsWillBeFreed before freeing device memory in stream_safe_custom_device_allocator? #63300

continue-coding · 2024-04-08T03:33:35Z

Please ask your question

stream_safe_custom_device_allocator calls MarkAsWillBeFreed before it frees device memory:

Paddle/paddle/fluid/memory/allocation/stream_safe_custom_device_allocator.cc

Line 193 in a76e7ed

stream_safe_cuda_allocation->MarkAsWillBeFreed();

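As far as I can tell, if the stream that owns the current allocation has no event bound to it, this function creates a new event and records it on that stream. (The update to the will_be_freed_ flag inside MarkAsWillBeFreed also looks wrong: will_be_freed_ always stays false.)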

Paddle/paddle/fluid/memory/allocation/stream_safe_custom_device_allocator.cc

Line 60 in a76e7ed

void StreamSafeCustomDeviceAllocation::MarkAsWillBeFreed() {
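To make the behaviour described above concrete, here is a minimal sketch. It is not Paddle's implementation: `MockStream`, `MockEvent`, and `MockAllocation` are hypothetical stand-ins, and the stream is modelled as a simple counter. Setting the flag to true is the assumed intent; the real flag is reported above to stay false.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-ins; these are not Paddle's real stream/event types.
struct MockStream {
  long submitted = 0;  // work items enqueued on this stream so far
};

struct MockEvent {
  long ticket = 0;  // value of `submitted` captured by Record()
  void Record(const MockStream& s) { ticket = s.submitted; }
};

class MockAllocation {
 public:
  explicit MockAllocation(MockStream* owning_stream)
      : owning_stream_(owning_stream) {
    assert(owning_stream_ != nullptr);
  }

  // As described in the question: if the owning stream has no event yet,
  // create one and record it on that stream.
  void MarkAsWillBeFreed() {
    will_be_freed_ = true;  // assumed intent, see the note above
    if (outstanding_event_map_.count(owning_stream_) == 0) {
      MockEvent event;
      event.Record(*owning_stream_);
      outstanding_event_map_.emplace(owning_stream_, event);
    }
  }

  bool will_be_freed() const { return will_be_freed_; }
  std::size_t NumOutstandingEvents() const {
    return outstanding_event_map_.size();
  }

 private:
  MockStream* owning_stream_;
  bool will_be_freed_ = false;
  std::map<MockStream*, MockEvent> outstanding_event_map_;
};
```

In this model, calling `MarkAsWillBeFreed()` on an allocation with no events adds exactly one event for the owning stream, and a second call adds nothing.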

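What puzzles me is what calling MarkAsWillBeFreed is for, and whether it is necessary. My understanding is that, before freeing device memory, it is enough to make sure all existing events have completed. Could adding a new event degrade performance or even cause functional errors? In the XPU stream-safe allocator, the free path decides whether memory can be released directly by having CanBeFreed query the status of all events. The GPU stream-safe allocator's CanBeFreed does record on streams, but only on the streams in graph_capturing_stream_set_, and that set seems to be empty in the usual case?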

Paddle/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc

Line 54 in de49489

if (UNLIKELY(phi::backends::gpu::CUDAGraph::IsThisThreadCapturing())) {

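In practice, with the stream-safe allocator enabled for a custom device, I ran into two bugs:
1. In a multi-process scenario, when the CPU allocator and the custom device allocator both free storage at about the same time (the custom device slightly earlier than the CPU), the event recorded by the CPU allocator through MarkAsWillBeFreed is null (my guess is that it was deleted when the custom device allocator called CanBeFreed), so the subsequent query in CanBeFreed fails with an error.
2. A model that runs fine without the stream-safe allocator may, once it is enabled, hit OOM, possibly because the event added by MarkAsWillBeFreed prevents device memory from being released in time.
Both bugs went away after I removed the call to MarkAsWillBeFreed. So I would like to ask the Paddle maintainers: can the call to MarkAsWillBeFreed be removed?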
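One possible reading of the second symptom can be sketched as follows. This is a simplified model under stated assumptions, not the real allocator: `MockStream`, `MockEvent`, and `MockAllocation` are hypothetical, and event completion is modelled by comparing two counters. It shows that an allocation with no outstanding events is releasable at once, while the extra event recorded at free time keeps it alive until the stream has drained.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins; these are not Paddle's real stream/event types.
struct MockStream {
  long submitted = 0;  // work items enqueued so far
  long completed = 0;  // work items the device has finished
};

struct MockEvent {
  long ticket = 0;  // value of `submitted` captured by Record()
  void Record(const MockStream& s) { ticket = s.submitted; }
  // Complete once everything enqueued before Record() has finished.
  bool Query(const MockStream& s) const { return s.completed >= ticket; }
};

class MockAllocation {
 public:
  explicit MockAllocation(MockStream* owning_stream)
      : owning_stream_(owning_stream) {
    assert(owning_stream_ != nullptr);
  }

  // Records a fresh event on the owning stream if it has none.
  void MarkAsWillBeFreed() {
    if (events_.count(owning_stream_) == 0) {
      events_[owning_stream_].Record(*owning_stream_);
    }
  }

  // Polls the outstanding events in order, erasing completed ones, and
  // stops at the first pending event. Returns true once none are left.
  bool CanBeFreed() {
    for (auto it = events_.begin(); it != events_.end();) {
      if (!it->second.Query(*it->first)) return false;
      it = events_.erase(it);
    }
    return true;
  }

 private:
  MockStream* owning_stream_;
  std::map<MockStream*, MockEvent> events_;
};
```

With a busy stream (`submitted` greater than `completed`), an allocation that skips `MarkAsWillBeFreed()` passes `CanBeFreed()` immediately, whereas a marked one has to wait for all queued work, which would delay reuse of its memory.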

continue-coding · 2024-04-08T03:36:19Z

@ronny1996 Could the author of stream_safe_custom_device_allocator please help answer this? Thanks :)

continue-coding · 2024-04-08T05:57:08Z

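I have one more question. StreamSafeCustomDeviceAllocation::RecordStream only records an event for streams that are not yet in outstanding_event_map_. For a stream that is already in outstanding_event_map_, shouldn't it reuse the corresponding event and record it again?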

Paddle/paddle/fluid/memory/allocation/stream_safe_custom_device_allocator.cc

Line 47 in a61d20a

if (it == outstanding_event_map_.end()) {
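The difference between the two policies can be illustrated with a small model. This is a sketch under assumptions, not the actual Paddle code: the types and both method names (`RecordStreamSkipExisting`, `RecordStreamReuse`) are hypothetical, and event completion is modelled with counters.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins; these are not Paddle's real stream/event types.
struct MockStream {
  long submitted = 0;  // work items enqueued so far
  long completed = 0;  // work items the device has finished
};

struct MockEvent {
  long ticket = 0;  // value of `submitted` captured by Record()
  void Record(const MockStream& s) { ticket = s.submitted; }
  bool Query(const MockStream& s) const { return s.completed >= ticket; }
};

class MockAllocation {
 public:
  // Behaviour described in the question: a stream that already has an
  // event is skipped, so later work on it is not covered by any event.
  void RecordStreamSkipExisting(MockStream* s) {
    assert(s != nullptr);
    if (events_.find(s) == events_.end()) {
      events_[s].Record(*s);
    }
  }

  // Suggested behaviour: reuse the existing event and record it again so
  // that it also covers the newer work.
  void RecordStreamReuse(MockStream* s) {
    assert(s != nullptr);
    events_[s].Record(*s);  // operator[] creates the event on first use
  }

  // True only when every recorded event has completed.
  bool CanBeFreed() const {
    for (const auto& kv : events_) {
      if (!kv.second.Query(*kv.first)) return false;
    }
    return true;
  }

 private:
  std::map<MockStream*, MockEvent> events_;
};
```

In this model, if a second piece of work is submitted after the first record and only the first has finished, the skip-existing variant already reports the allocation as freeable, while the reuse variant keeps waiting for the newer work.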

ronny1996 · 2024-04-08T15:08:37Z

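Hi, the custom stream-safe allocator implements the non-CUDA-graph case, i.e. the case where phi::backends::gpu::CUDAGraph::IsThisThreadCapturing() is false. MarkAsWillBeFreed can indeed be removed, and a stream that is already in outstanding_event_map_ should also reuse its existing event. We will open a PR to fix this.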

continue-coding · 2024-04-09T01:51:07Z

Thanks for the reply. That answers my question :)

ronny1996 · 2024-04-10T07:37:49Z

PR #63369 fixes this.

continue-coding · 2024-04-10T07:48:19Z

Thanks a lot! This issue can be closed.

continue-coding added status/new-issue 新建 type/question 用户提问 labels Apr 8, 2024

paddle-bot bot assigned pkuzyc Apr 8, 2024

pkuzyc assigned ronny1996 Apr 8, 2024

paddle-bot bot added status/following-up 跟进中 type/bug-report 报bug and removed status/new-issue 新建 type/question 用户提问 labels Apr 9, 2024

continue-coding closed this as completed Apr 10, 2024

paddle-bot bot added status/close 已关闭 and removed status/following-up 跟进中 labels Apr 10, 2024
