[bugfix] fix safe_ddp_context hang#47
Conversation
There was a problem hiding this comment.
Code Review
This pull request modifies the patch_get_dynamic_module utility in utils.py to disable the synchronization barrier by setting use_barrier=False within the safe_ddp_context. While this change is intended to prevent hangs in non-collective execution paths where ranks might diverge, it introduces a potential risk of race conditions in multi-node environments with shared filesystems if robust distributed locking is not otherwise ensured.
|
|
||
| def new_get_cached_module_file(pretrained_model_name_or_path, *args, **kwargs): | ||
| with safe_ddp_context(hash_id=str(pretrained_model_name_or_path)): | ||
| with safe_ddp_context(hash_id=str(pretrained_model_name_or_path), use_barrier=False): |
There was a problem hiding this comment.
Disabling the barrier in safe_ddp_context may introduce race conditions in multi-node environments with shared filesystems (e.g., NFS, Lustre) if synchronization is not handled by other means, such as file-based locking. While this change avoids hangs in non-collective execution paths (where some ranks might skip the call due to local cache hits), it relies on safe_ddp_context or the underlying transformers library to provide robust distributed locking to prevent multiple ranks from concurrently downloading or accessing incomplete files.
No description provided.