FSDP orchestration: apply + loading/saving #46990

Open
3outeille wants to merge 31 commits into split/a-pr-3-dual-path-loading from split/a-pr-4-fsdp-orchestration

@3outeille 3outeille commented Jul 1, 2026

Member

CI

Summary

  • FSDP:
    • Only one model has a base_fsdp_plan for now; a follow-up PR will update the other models.
    • FSDP is now wired through from_pretrained.
    • Loading goes through shard-on-read; saving works like TP (DCP optional).
    • Adds FSDP CI.
    • DistributedMixin
  • TP:
    • DistributedConfig is wired everywhere (no more tp_plan=auto).
    • TP itself is left untouched (no DTensor yet).
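
For readers unfamiliar with the term, here is a minimal sketch of what shard-on-read generally means: each rank computes the slice it owns and reads only those bytes, instead of loading the full tensor and sharding afterwards. This is not the code in this PR. The helper names `shard_bounds` and `read_shard`, the raw float32 file layout, and the ceil-sized dim-0 chunking are all assumptions made for illustration.

```python
import math
import struct
import tempfile
from pathlib import Path


def shard_bounds(num_rows, world_size, rank):
    """Row range [start, stop) owned by `rank` (hypothetical helper).

    Assumes dim-0 sharding with ceil-sized chunks, so trailing ranks
    may receive a smaller or even empty shard.
    """
    chunk = math.ceil(num_rows / world_size)
    start = min(rank * chunk, num_rows)
    stop = min(start + chunk, num_rows)
    return start, stop


def read_shard(path, num_rows, num_cols, world_size, rank):
    """Read only this rank's rows from a row-major float32 matrix on disk.

    The raw binary file is a stand-in for a real checkpoint format.
    """
    start, stop = shard_bounds(num_rows, world_size, rank)
    row_bytes = num_cols * 4  # float32 is 4 bytes per element
    with open(path, "rb") as f:
        f.seek(start * row_bytes)  # skip rows owned by lower ranks
        raw = f.read((stop - start) * row_bytes)
    count = (stop - start) * num_cols
    flat = struct.unpack(f"<{count}f", raw)
    return [list(flat[r * num_cols:(r + 1) * num_cols]) for r in range(stop - start)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "weight.bin"
        values = [float(i) for i in range(10)]  # a 5 x 2 matrix
        path.write_bytes(struct.pack("<10f", *values))
        for rank in range(2):
            print(rank, read_shard(path, 5, 2, world_size=2, rank=rank))
```

The point of the pattern is that peak host memory per rank scales with the shard size rather than the full parameter size.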

Stack

3outeille and others added 4 commits July 1, 2026 05:08
Wire distributed_config from_pretrained/save_pretrained alongside the legacy tp_plan path, add distributed/utils.py for mesh orchestration and checkpoint I/O, and extend sharding_utils with DTensor gather/optimizer fusion helpers needed by save/load.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@3outeille 3outeille marked this pull request as draft July 1, 2026 08:49
3outeille added a commit to 3outeille/transformers-test-ci that referenced this pull request Jul 3, 2026
Wire FSDP tests into the dynamic PR CI caller, mirroring tests_tensor_parallel_ci:
detect tests_fsdp_ci_test_list.txt, run with is_fsdp_test marker and RUN_FSDP_TESTS,
and exclude FSDP tests from the tests_torch job.

Companion to huggingface/transformers#46990 (tests_fetcher changes stay in transformers).

Co-authored-by: Cursor <cursoragent@cursor.com>
@3outeille 3outeille changed the title FSDP orchestration: mesh init, distribute-before-load, DCP save FSDP orchestration: apply + loading/saving Jul 3, 2026
@3outeille 3outeille marked this pull request as ready for review July 3, 2026 03:49
@3outeille

Member Author

run-slow: cohere2_moe, deepseek_v4, glm_moe_dsa, gpt_oss

@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/cohere2_moe", "models/deepseek_v4", "models/glm_moe_dsa", "models/gpt_oss"]
quantizations: []

@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
|---|---|---|
| RUN | 1bcc0fb0 | workflow commit (merge commit) |
| PR | 37df13ee | branch commit (from PR) |
| main | 70544cd9 | base commit (on main) |

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: cohere2_moe, deepseek_v4, glm_moe_dsa, gpt_oss

@3outeille 3outeille requested a review from ArthurZucker July 3, 2026 05:10
@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

CI recap

Dashboard: View test results in Grafana
Latest run: 28639792108:1
Result: success | Jobs: 15 | Tests: 79,359 | Failures: 0 | Duration: 16h 45m

tarekziade pushed a commit to huggingface/transformers-test-ci that referenced this pull request Jul 3, 2026
Wire FSDP tests into the dynamic PR CI caller, mirroring tests_tensor_parallel_ci:
detect tests_fsdp_ci_test_list.txt, run with is_fsdp_test marker and RUN_FSDP_TESTS,
and exclude FSDP tests from the tests_torch job.

Companion to huggingface/transformers#46990 (tests_fetcher changes stay in transformers).

Co-authored-by: Cursor <cursoragent@cursor.com>
3 participants