Skip to content

[bugfix] fix safe_ddp_context hang#47

Merged
Jintao-Huang merged 1 commit intomodelscope:mainfrom
Jintao-Huang:fix_safe_ddp_context_hang
Apr 21, 2026
Merged

[bugfix] fix safe_ddp_context hang#47
Jintao-Huang merged 1 commit intomodelscope:mainfrom
Jintao-Huang:fix_safe_ddp_context_hang

Conversation

@Jintao-Huang
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the patch_get_dynamic_module utility in utils.py to disable the synchronization barrier by setting use_barrier=False within the safe_ddp_context. While this change is intended to prevent hangs in non-collective execution paths where ranks might diverge, it introduces a potential risk of race conditions in multi-node environments with shared filesystems if robust distributed locking is not otherwise ensured.


def new_get_cached_module_file(pretrained_model_name_or_path, *args, **kwargs):
with safe_ddp_context(hash_id=str(pretrained_model_name_or_path)):
with safe_ddp_context(hash_id=str(pretrained_model_name_or_path), use_barrier=False):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Disabling the barrier in safe_ddp_context may introduce race conditions in multi-node environments with shared filesystems (e.g., NFS, Lustre) if synchronization is not handled by other means, such as file-based locking. While this change avoids hangs in non-collective execution paths (where some ranks might skip the call due to local cache hits), it relies on safe_ddp_context or the underlying transformers library to provide robust distributed locking to prevent multiple ranks from concurrently downloading or accessing incomplete files.

@Jintao-Huang Jintao-Huang merged commit 4546657 into modelscope:main Apr 21, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant