Support dpo/grpo/gkd/sft padding_free #181
Conversation
Code Review
This pull request centralizes the logic for unpacking packed sequences (padding-free mode) into the InputProcessor class, moving it out of specific loss implementations like GRPO. It introduces a canonical method to detect packing and unpack tensors such as log-probabilities and labels into a per-sequence batch format. These changes are integrated into both Megatron and Transformers sequence parallel strategies. The review feedback identifies several improvement opportunities: ensuring the boundary detection logic explicitly includes the first sequence, relaxing the packing detection heuristic to support sequences of length one, and optimizing the unpacking process for better performance and consistency.
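The unpacking logic described above can be sketched as follows. This is a minimal, hypothetical illustration (the function name and the `cu_seqlens` convention are assumptions, not the PR's actual API): a flat packed tensor of per-token values is split into per-sequence chunks using cumulative sequence lengths, with the leading `0` boundary included explicitly so the first sequence is not dropped, and with length-1 sequences handled correctly.

```python
# Hypothetical sketch of unpacking a packed (padding-free) batch into
# per-sequence chunks. Boundaries follow a FlashAttention-style
# cumulative-lengths convention (cu_seqlens starts at 0).
# The name unpack_packed_sequence is illustrative, not the PR's API.

def unpack_packed_sequence(values, cu_seqlens):
    """Split a flat packed list of per-token values (e.g. log-probs or
    labels) into one list per sequence.

    cu_seqlens must start with 0 so the first sequence's boundary is
    included explicitly, per the review feedback above.
    """
    if not values:  # guard against empty input
        return []
    sequences = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        sequences.append(values[start:end])
    return sequences

# A length-1 sequence is valid packing: the detection heuristic should
# not assume every sequence has at least two tokens.
packed = [0.1, 0.2, 0.3, 0.4]
print(unpack_packed_sequence(packed, [0, 1, 4]))  # → [[0.1], [0.2, 0.3, 0.4]]
```

In a real implementation the slicing would typically operate on tensors (e.g. via `torch.split` with per-sequence lengths) rather than Python lists, but the boundary handling is the same.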
/gemini review
Code Review
This pull request refactors the handling of packed sequences by moving unpacking logic into the InputProcessor and introduces a require_logits attribute to loss classes to optimize memory usage. It also updates Megatron and Transformers models to support variable sequence lengths by default, implements rank-aware logging, and includes FSDP2 compatibility fixes for LoRA dtypes. Feedback points out an inconsistent error message in the sampling configuration and a potential crash in the sequence unpacking utility when processing empty lists.
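The `require_logits` idea can be illustrated with a small sketch. The class and function names below are assumptions for illustration only (the PR's actual classes are not shown here): a loss declares whether it needs the full logits tensor, and the forward path drops logits when they are not required, saving activation memory.

```python
# Illustrative sketch (not the PR's actual code) of gating full-logits
# retention on a per-loss require_logits attribute.

class BaseLoss:
    require_logits: bool = True   # default: keep the full logits tensor

class GRPOLoss(BaseLoss):
    require_logits = False        # only per-token log-probs are needed

def collect_outputs(model_outputs, loss):
    # model_outputs is a hypothetical dict of forward-pass tensors;
    # keep "logits" only when the active loss actually consumes them.
    if loss.require_logits:
        return {"logits": model_outputs["logits"],
                "logps": model_outputs["logps"]}
    return {"logps": model_outputs["logps"]}  # logits can be freed early

outputs = {"logits": [[1.0, 2.0]], "logps": [-0.3]}
print(collect_outputs(outputs, GRPOLoss()))  # → {'logps': [-0.3]}
```

The empty-list crash mentioned in the feedback argues for the same defensive pattern: any unpacking utility should return an empty result for an empty input instead of indexing into it.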
…eet/twinkle into feat/padding_free_fix