
from_pretrained orchestration + distributed save/load#45409

Open
3outeille wants to merge 2 commits into moe-sequence-parallel from orchestration-save-load

Conversation

@3outeille (Member)

Summary

  • Full distributed_config integration in from_pretrained() — mesh creation, apply TP + FSDP, attach model.device_mesh
  • gather_full_state_dict() for streaming DTensor→full tensor saving (rank 0 only)
  • convert_strided_to_shard() / restore_strided_from_shard() for DCP compatibility with _StridedShard
  • save_optimizer() / load_optimizer() in distributed/utils.py
  • Rename apply_fsdp2 → apply_fully_shard_data_parallel
  • Trainer integration with distributed_config

Part of the distributed training API chain: #44989

Chain: main ← #44989 ← #44083 ← #44974 ← #45028 ← #45408 ← this PR
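For reviewers skimming the summary, the streaming DTensor→full-tensor gather can be pictured roughly as below. This is a minimal, framework-free sketch: `FakeDTensor` stands in for `torch.distributed.tensor.DTensor` (whose `full_tensor()` runs an all-gather collective), and the `gather_full_state_dict` signature here is an illustrative assumption, not the PR's actual implementation.

```python
# Sketch of the streaming DTensor -> full-tensor gather idea.
# FakeDTensor is a hypothetical stand-in for torch's DTensor; its
# full_tensor() would normally trigger a collective across ranks.
class FakeDTensor:
    def __init__(self, local_shards):
        self.local_shards = local_shards  # what each rank holds

    def full_tensor(self):
        # In real code this is an all-gather; here we just concatenate.
        out = []
        for shard in self.local_shards:
            out.extend(shard)
        return out


def gather_full_state_dict(state_dict, rank):
    """Stream one entry at a time so only a single full tensor is
    materialized at once; only rank 0 retains the gathered copies."""
    full = {}
    for key, value in state_dict.items():
        if isinstance(value, FakeDTensor):
            value = value.full_tensor()  # every rank participates
        if rank == 0:
            full[key] = value  # non-zero ranks drop the result
    return full
```

The streaming loop is the point: gathering entry-by-entry keeps peak memory at one full tensor instead of the whole model.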

Review question

Does from_pretrained wire things up in the right order? Is the save/load round trip correct?

Test plan

  • End-to-end from_pretrained with distributed_config
  • gather_full_state_dict() roundtrip verification
  • save_optimizer() / load_optimizer() roundtrip
  • Run existing FSDP and TP mixin tests
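The save_optimizer() / load_optimizer() roundtrip item above boils down to the invariant sketched below. This is a hedged, stdlib-only illustration: the real helpers in distributed/utils.py operate on torch optimizer state dicts and distributed checkpoints, not JSON files, and these signatures are assumptions for the sketch.

```python
import json
import os
import tempfile


def save_optimizer(opt_state, path):
    # Real code serializes a torch optimizer state dict (possibly via
    # DCP); JSON stands in for that serialization here.
    with open(path, "w") as f:
        json.dump(opt_state, f)


def load_optimizer(path):
    with open(path) as f:
        return json.load(f)


# Roundtrip invariant the test plan checks: load(save(x)) == x
state = {"step": 10, "lr": 0.001, "exp_avg": [0.1, 0.2]}
path = os.path.join(tempfile.mkdtemp(), "optim.json")
save_optimizer(state, path)
assert load_optimizer(path) == state
```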

- Add gather_full_state_dict() for DTensor→full tensor saving
- Add convert_strided_to_shard() / restore_strided_from_shard() for DCP
- Add _redistribute_dtensor() helper
- Full distributed_config integration in from_pretrained/save_pretrained
- Rename apply_fsdp2 → apply_fully_shard_data_parallel
- save_optimizer() / load_optimizer() in distributed/utils
- Trainer integration with distributed_config
- Updated FSDP and TP tests for new orchestration API
- DTensor shard-on-read test updates
"ALL_PARALLEL_STYLES",
"translate_to_torch_parallel_style",
]
_import_structure["tensor_parallel"] = []

Why is it empty? Do we still need it?

def get_correct_experts_implementation(self, requested_experts: str | None) -> str:
applicable_experts = "grouped_mm" if requested_experts is None else requested_experts
-    if applicable_experts not in ["eager", "grouped_mm", "batched_mm", "deepgemm"]:
+    if applicable_experts not in ["eager", "grouped_mm", "batched_mm"]:

Why was deepgemm removed?
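For context, the validation in the hunk above amounts to the standalone sketch below; the error-message wording and the free-function form are assumptions, and the sketch reflects the list after this PR's change (without "deepgemm").

```python
def get_correct_experts_implementation(requested_experts):
    # Default to grouped_mm when no implementation was requested.
    applicable_experts = "grouped_mm" if requested_experts is None else requested_experts
    # After this PR's change, "deepgemm" is no longer accepted.
    if applicable_experts not in ("eager", "grouped_mm", "batched_mm"):
        raise ValueError(f"Unknown experts implementation: {applicable_experts!r}")
    return applicable_experts
```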

# so that custom models calling .item() during __init__ (e.g. drop-path
# schedules) don't crash on meta tensors.
-        init_contexts.extend([torch.device("meta"), init.meta_device_safe_creation_ops()])
+        init_contexts.append(torch.device("meta"))

Revert this back.

if device_mesh is not None:
shard_and_distribute_module(
self, value, param, key, None, False, device_mesh.get_local_rank(), device_mesh
from torch.distributed.tensor import DTensor

Remove this import.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@3outeille force-pushed the moe-sequence-parallel branch from e04c7d9 to 24ca327 on April 14, 2026 at 09:54
# Conflicts:
#	src/transformers/distributed/utils.py
@3outeille force-pushed the orchestration-save-load branch from 815b5b2 to 7361deb on April 14, 2026 at 09:55
@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45409&sha=7361de
