@joecummings joecummings commented Oct 1, 2025

@felipemello1 correctly diagnosed that memory usage was very high for our default Qwen3 8B script. The solution he proposed in #278 was to drop max_req_tokens and max_res_tokens, which was valid, but the high memory still seemed suspect. After looking into it, I found that although we specified "bfloat16" in the Trainer/Ref sections of our configs, it was not being applied, hence the massive memory. The culprit was an out-of-date Torchtitan package :/

In this PR, I update the Torchtitan package so that bf16 is applied correctly.
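For a rough sense of scale, here is a back-of-the-envelope sketch of weight memory in fp32 vs. bf16. The 8e9 parameter count is an assumed round number for an "8B" model, not a measured value, and the estimate covers weights only (no gradients, optimizer state, or activations), so it will not match the screenshots exactly.

```python
# Bytes per element for the two dtypes discussed in this PR.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


# Assumption: ~8e9 parameters as a round figure for an "8B" model.
NUM_PARAMS = 8_000_000_000

fp32_gib = weight_memory_gib(NUM_PARAMS, "float32")
bf16_gib = weight_memory_gib(NUM_PARAMS, "bfloat16")

print(f"fp32 weights: {fp32_gib:.1f} GiB")
print(f"bf16 weights: {bf16_gib:.1f} GiB")
```

If the configured dtype is silently ignored and the model stays in fp32, each copy of the weights costs roughly twice what was intended.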

Memory usage before (with 486 seq len):
[Screenshot, 2025-10-01 4:14:24 PM: memory usage before]

Memory usage after (with 512 seq len):
[Screenshot, 2025-10-01 4:00:36 PM: memory usage after]

Side note: this has the fortunate side effect of halving the weight sync time, since bf16 weights are half the size of fp32 weights. #impact
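A minimal sketch of why the sync time scales with dtype size, assuming the sync is dominated by moving the weight payload over a fixed-bandwidth link. The bandwidth figure and parameter count below are hypothetical placeholders, and real syncs carry extra overhead that this ignores.

```python
def sync_seconds(num_params: int, bytes_per_param: int, bandwidth_gb_per_s: float) -> float:
    """Estimate the time to move all weights once over a link of the given bandwidth."""
    payload_bytes = num_params * bytes_per_param
    return payload_bytes / (bandwidth_gb_per_s * 1e9)


NUM_PARAMS = 8_000_000_000   # assumption: round figure for an "8B" model
BANDWIDTH_GB_PER_S = 50.0    # hypothetical link bandwidth, for illustration only

fp32_s = sync_seconds(NUM_PARAMS, 4, BANDWIDTH_GB_PER_S)  # 4 bytes per fp32 element
bf16_s = sync_seconds(NUM_PARAMS, 2, BANDWIDTH_GB_PER_S)  # 2 bytes per bf16 element

print(f"fp32 / bf16 sync time ratio: {fp32_s / bf16_s:.1f}x")
```

Under this payload-bound assumption the ratio is 2x regardless of the bandwidth chosen.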

@meta-cla meta-cla bot added the CLA Signed label Oct 1, 2025
@joecummings joecummings marked this pull request as ready for review October 1, 2025 20:19
)
use_vllm_builtin_load: bool = True
compile: Compile = field(default_factory=Compile)
float8: Float8Dense = field(default_factory=Float8Dense)
Member Author

This was also a part of the updated Torchtitan package

Contributor

@felipemello1 felipemello1 left a comment

10x engineer

@felipemello1
Contributor

nice wandb logging :D

@joecummings
Member Author

> nice wandb logging :D

🫡

@joecummings joecummings merged commit 3186797 into meta-pytorch:main Oct 1, 2025
5 checks passed
photomz pushed a commit to photomz/forge that referenced this pull request Oct 25, 2025