
@xmfan
Member

@xmfan xmfan commented Nov 20, 2025

Stacked PRs:


Add numerics comparison script from Devmate

Intended usage: python examples/run_ds3_numerics_check.py

This creates a temp folder, e.g. /tmp/ds3_numerics_check_ufc54rz8, then runs:

torchrun --standalone --nproc-per-node 4 examples/example_ds3_local_map.py --rng-seed 42
torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py --rng-seed 42
diff out/0/weights.log out/1/pp_weights.log
diff out/0/diff.log out/1/diff.log
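
For readers who want to see the shape of that flow, here is a rough sketch of how a driver could create the temp folder and launch the two runs. It is not the script from this PR: `build_commands` and `run_all` are hypothetical names. How each run's logs end up under out/0 and out/1 is not shown in this thread, so the sketch leaves that part out.

```python
import subprocess
import tempfile
from pathlib import Path

SEED = 42  # same seed for both runs, as in the commands above


def build_commands(seed=SEED):
    """Return the two torchrun invocations from the PR description as argv lists."""
    return [
        ["torchrun", "--standalone", "--nproc-per-node", "4",
         "examples/example_ds3_local_map.py", "--rng-seed", str(seed)],
        ["torchrun", "--standalone", "--nproc-per-node", "8",
         "examples/example_ds3_pp.py", "--rng-seed", str(seed)],
    ]


def run_all(run=subprocess.run):
    """Create a temp folder, then launch each run in order.

    `run` is injectable so the flow can be dry-run without GPUs or torchrun.
    Hypothetical helper: the real script may wire the folder into the runs.
    """
    workdir = Path(tempfile.mkdtemp(prefix="ds3_numerics_check_"))
    for cmd in build_commands():
        # check=True makes a failed launch raise instead of continuing silently
        run(cmd, check=True)
    return workdir
```

Passing a stub for `run` (for example one that only records the argv lists) lets you inspect the commands without starting any processes.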

It currently only warns on a mismatch, because there's a silly mismatch wrt buffer registration: the pp model ends up with 4 freq_cis buffers. Once that's fixed, I can change the warning back to a RuntimeError, and we can add the check into CI.
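
A minimal sketch of the warn-versus-raise behaviour described above, assuming the comparison is a plain line-by-line diff of two log files. `diff_logs`, `check_logs`, and the `strict` parameter are hypothetical names for illustration, not the script's actual API.

```python
import difflib
import warnings
from pathlib import Path


def diff_logs(path_a, path_b):
    """Return unified-diff lines between two text logs (empty list if identical)."""
    a = Path(path_a).read_text().splitlines()
    b = Path(path_b).read_text().splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile=str(path_a), tofile=str(path_b), lineterm="")
    )


def check_logs(path_a, path_b, strict=False):
    """Compare two logs; warn on mismatch, or raise RuntimeError when strict=True."""
    delta = diff_logs(path_a, path_b)
    if not delta:
        return True
    message = f"numerics mismatch between {path_a} and {path_b}:\n" + "\n".join(delta)
    if strict:
        # The behaviour intended for CI once the buffer mismatch is fixed
        raise RuntimeError(message)
    warnings.warn(message)
    return False
```

With this layout, switching from the current warning to a hard failure would be a one-argument change, e.g. `check_logs("out/0/diff.log", "out/1/diff.log", strict=True)`.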

xmfan added a commit that referenced this pull request Nov 20, 2025
stack-info: PR: #259, branch: xmfan/stack/23
@meta-cla meta-cla bot added the CLA Signed label Nov 20, 2025
@xmfan xmfan changed the base branch from xmfan/stack/20 to main November 20, 2025 06:03
@xmfan xmfan changed the base branch from main to xmfan/stack/20 November 20, 2025 06:03
@xmfan xmfan changed the base branch from xmfan/stack/20 to main November 20, 2025 20:05
@xmfan xmfan changed the base branch from main to xmfan/stack/20 November 20, 2025 20:05
@xmfan xmfan changed the base branch from xmfan/stack/20 to main November 20, 2025 20:55
@sanketpurandare
Contributor

Have you verified numerics with the flag --use-loss-fn and by changing the schedules using --schedule-name?

@xmfan
Member Author

xmfan commented Nov 20, 2025

nope, for loss_fn, we would need to add it to the non-pp script

and for schedule names, i'll need to validate them. will add a CLI flag to switch between schedules for now
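
One way such a flag could look, as a sketch only: the check script accepts `--schedule-name` and forwards it unchanged to the pp run. The `--schedule-name` flag on the pp script is taken from the question above. `parse_args` and `pp_command` are hypothetical helpers, and no assumption is made here about which schedule names the pp script accepts.

```python
import argparse


def parse_args(argv=None):
    """Parse the check script's own CLI (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="DS3 numerics check (sketch)")
    parser.add_argument(
        "--schedule-name",
        default=None,
        help="forwarded as-is to the pp example; omitted if not given",
    )
    return parser.parse_args(argv)


def pp_command(schedule_name=None, seed=42):
    """Build the pp torchrun command, appending the schedule flag only when set."""
    cmd = [
        "torchrun", "--standalone", "--nproc-per-node", "8",
        "examples/example_ds3_pp.py", "--rng-seed", str(seed),
    ]
    if schedule_name is not None:
        cmd += ["--schedule-name", schedule_name]
    return cmd
```

Leaving the flag out keeps the pp command identical to the one in the PR description, so the default run is unchanged.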

stack-info: PR: #259, branch: xmfan/stack/23
@xmfan
Member Author

xmfan commented Nov 20, 2025

@sanketpurandare interleave1f1b and zbpp are both fine, dualpipev needs some minor logging changes. will do separately and add into ci for next pr

@xmfan xmfan merged commit 6e2085e into main Nov 21, 2025
4 of 6 checks passed
