What is the purpose of calling MarkAsWillBeFreed before freeing device memory in stream_safe_custom_device_allocator? #63300
Comments
@ronny1996 Could the author of stream_safe_custom_device_allocator please help answer this? Thanks :)
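I have one more question. StreamSafeCustomDeviceAllocation::RecordStream records an event only for streams that are not yet in outstanding_event_map_. For a stream that is already in outstanding_event_map_, shouldn't it reuse the corresponding event and record it again?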
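Hello. The custom stream safe allocator implements only the non-CUDA-graph case, i.e. the case where phi::backends::gpu::CUDAGraph::IsThisThreadCapturing() is false. MarkAsWillBeFreed can indeed be removed, and a stream that is already in outstanding_event_map_ should reuse its existing event. We will open a PR to fix this.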
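The event reuse agreed on above can be sketched roughly as follows. This is a minimal illustration, not the code from the actual fix: StreamId, MockEvent and AllocationSketch are hypothetical stand-ins for the device runtime and for StreamSafeCustomDeviceAllocation, and skipping the owning stream is an assumption modelled on how the question describes the allocator.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins for the device runtime; not Paddle's real types.
using StreamId = int;

struct MockEvent {
  int record_count = 0;  // how many times this event has been recorded
  void Record(StreamId /*stream*/) { ++record_count; }
};

class AllocationSketch {
 public:
  explicit AllocationSketch(StreamId owning_stream)
      : owning_stream_(owning_stream) {}

  // Note that `stream` uses this allocation.
  void RecordStream(StreamId stream) {
    // Assumption: work on the owning stream needs no extra event.
    if (stream == owning_stream_) return;
    // operator[] creates the event on first use and returns the existing
    // one afterwards, so a repeated stream re-records its event instead
    // of being skipped.
    outstanding_event_map_[stream].Record(stream);
  }

  const std::map<StreamId, MockEvent>& events() const {
    return outstanding_event_map_;
  }

 private:
  StreamId owning_stream_;
  std::map<StreamId, MockEvent> outstanding_event_map_;
};
```

With this shape, calling RecordStream twice for the same stream leaves a single map entry whose event has been recorded twice, so the event always reflects the most recent use of the allocation on that stream.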
Thanks for the reply, my question has been resolved :)
PR #63369 fixes this.
Thank you very much! This issue can be closed.
Please ask your question
stream_safe_custom_device_allocator calls MarkAsWillBeFreed before it frees device memory:
Paddle/paddle/fluid/memory/allocation/stream_safe_custom_device_allocator.cc
Line 193 in a76e7ed
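If the stream that owns the current allocation has no event bound to it, this function creates a new event and records it on that stream. (The update of the will_be_freed_ flag inside MarkAsWillBeFreed also appears to be wrong: will_be_freed_ always stays false.)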
Paddle/paddle/fluid/memory/allocation/stream_safe_custom_device_allocator.cc
Line 60 in a76e7ed
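What puzzles me is what the call to MarkAsWillBeFreed is for, and whether it is necessary. My understanding is that, before freeing device memory, it is enough to make sure all existing events have completed. Could adding a new event degrade performance or even cause functional errors? In the XPU stream safe allocator, CanBeFreed queries the status of all events before freeing device memory to decide whether the memory can be freed immediately. The GPU stream safe allocator's CanBeFreed does record on streams, but only on the streams in graph_capturing_stream_set_, and that set seems to be empty in the usual case?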
Paddle/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc
Line 54 in de49489
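To make the concern concrete, here is a small model of the two behaviours being compared. It is a sketch under stated assumptions, not Paddle's implementation: MockEvent, its done flag, AddEvent and Complete are hypothetical helpers, and MarkAsWillBeFreed below only mimics the behaviour described in this question (add a fresh event on the owning stream when none is bound).

```cpp
#include <cassert>
#include <map>

using StreamId = int;

// Hypothetical event; `done` stands in for the runtime's event query result.
struct MockEvent {
  bool done = false;
  bool Query() const { return done; }
};

class AllocationSketch {
 public:
  // Test helper: register an event for `stream` with a given completion state.
  void AddEvent(StreamId stream, bool done) {
    outstanding_event_map_[stream] = MockEvent{done};
  }

  // Test helper: mark the event of `stream` as completed.
  void Complete(StreamId stream) { outstanding_event_map_.at(stream).done = true; }

  // Mimics the described behaviour: if the owning stream has no event yet,
  // add a fresh one, which has not completed at this point.
  void MarkAsWillBeFreed(StreamId owning_stream) {
    outstanding_event_map_.emplace(owning_stream, MockEvent{false});
  }

  // Poll every outstanding event; drop completed ones and stop at the
  // first event that is still pending.
  bool CanBeFreed() {
    for (auto it = outstanding_event_map_.begin();
         it != outstanding_event_map_.end();) {
      if (!it->second.Query()) return false;
      it = outstanding_event_map_.erase(it);
    }
    return true;
  }

 private:
  std::map<StreamId, MockEvent> outstanding_event_map_;
};
```

In this model an allocation whose existing events have all completed is freeable at once, while the same allocation after MarkAsWillBeFreed has to wait for one more event. That matches the delayed-release hypothesis raised in this question, but it is only a model of it.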
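In practice, with the custom device stream safe allocator enabled, I ran into two bugs:
1. In a multi-process scenario, when the CPU allocator and the custom device allocator both free storage at around the same time (the custom device slightly earlier than the CPU), the event recorded by the CPU allocator through MarkAsWillBeFreed is null (presumably it was deleted when the custom device allocator called CanBeFreed), so the later query in CanBeFreed fails with an error.
2. A model that runs fine without the stream safe allocator may, once it is enabled, fail to release device memory in time because of the extra event added by MarkAsWillBeFreed, and run out of memory (OOM).
Both bugs went away after I removed the call to MarkAsWillBeFreed. So I would like to ask the Paddle maintainers: can the call to MarkAsWillBeFreed be removed?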