
[bugfix] sync template.padding_free with args after prepare_model for…#9031

Merged
Jintao-Huang merged 1 commit into modelscope:main from yaoruda:fix/sync-template-padding-free
Apr 8, 2026

Conversation

@yaoruda
Contributor

@yaoruda yaoruda commented Apr 7, 2026

PR type

  - [x] Bug Fix
  - [ ] New Feature
  - [ ] Document Updates
  - [ ] More Models or Datasets Support

PR information

Problem

When running Megatron DPO/KTO training with DSA-based models (e.g. DeepSeek-V3.2),
training produces NaN loss on every step, regardless of --packing settings.

Root Cause

In the Megatron training path, MegatronArguments defines padding_free: bool = True as the default.
The template object is created before model initialization, so template.padding_free = True.

During BaseMegatronTrainer.prepare_model(), _check_padding_free() in
swift/megatron/model/utils.py detects DSA attention and forces args.padding_free = False.
However, template.padding_free is never updated — it remains True.

This mismatch causes:

  1. Data collator (template.padding_free=True): packs chosen+rejected into a single row,
     so labels.shape = [1, N]
  2. _prepare_batch (args.padding_free=False): does not create packed_seq_params,
     so packed_seq_params = None
  3. loss_func: computes num_samples = labels.shape[0] // 2 = 1 // 2 = 0
     → chosen_logps is empty → loss = NaN
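The failure chain above can be reproduced in miniature. The snippet below is a plain-Python sketch with stand-in shapes and names, not the actual swift/Megatron code:

```python
import math

# Stand-in for the collator output under template.padding_free=True:
# chosen+rejected packed into a single row -> labels.shape = [1, N].
labels = [[0] * 8]

# _prepare_batch sees args.padding_free=False, so it builds no packed_seq_params.
packed_seq_params = None

# loss_func assumes half the rows are "chosen" -- but there is only one row.
num_samples = len(labels) // 2             # 1 // 2 == 0
chosen_logps = [-0.1, -0.2][:num_samples]  # empty slice

# Averaging over zero samples yields NaN, matching the observed loss.
loss = sum(chosen_logps) / len(chosen_logps) if chosen_logps else float('nan')
print(num_samples, chosen_logps, math.isnan(loss))  # 0 [] True
```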

This bug affects all DSA models by default (since Megatron defaults padding_free=True).
Non-DSA models (e.g. Qwen) are not affected because _check_padding_free does not override
args.padding_free for them.

Fix

Sync template.padding_free to args.padding_free right after self.prepare_model()
in BaseMegatronTrainer.__init__(), before the data collator is created.

```python
self.prepare_model()
# Sync template.padding_free after prepare_model(), because _check_padding_free
# may override args.padding_free for certain models (e.g. DSA attention).
if template.padding_free != args.padding_free:
    logger.warning(
        f'template.padding_free({template.padding_free}) != args.padding_free({args.padding_free}), '
        f'syncing template.padding_free to {args.padding_free}.')
    template.padding_free = args.padding_free
```
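A quick stand-alone check of the sync logic, using SimpleNamespace stand-ins for the real template and args objects:

```python
from types import SimpleNamespace

template = SimpleNamespace(padding_free=True)   # set when the template was created
args = SimpleNamespace(padding_free=False)      # overridden later for DSA models

# The sync performed by this PR (warning log omitted):
if template.padding_free != args.padding_free:
    template.padding_free = args.padding_free

print(template.padding_free)  # False -- collator and _prepare_batch now agree
```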

Scope

| Scenario | Before Fix | After Fix |
| --- | --- | --- |
| DSA model + default padding_free (True) | NaN | ✅ Normal |
| DSA model + `--padding_free true` | NaN | ✅ Normal |
| DSA model + `--packing true` | NaN | ✅ Normal |
| DSA model + `--padding_free false` | Normal (workaround) | ✅ Normal |
| Non-DSA model (any setting) | Normal | ✅ No change |

Verification

Tested with DeepSeek-V3.2 Megatron DPO training:

  • Before fix: NaN loss from step 1 (confirmed chosen_logps.shape=[0] via logging)
  • After fix: training runs normally with valid loss values


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a synchronization step in the MegatronTrainer base class to ensure that template.padding_free is consistent with args.padding_free after the model is prepared. This change accounts for potential overrides of the padding configuration during model-specific initialization, such as for DSA attention models. I have no feedback to provide.

@Jintao-Huang
Collaborator

thanks!

@Jintao-Huang
Collaborator

please run:

```shell
pip install pre-commit
pre-commit run --all-files
```

@Jintao-Huang Jintao-Huang merged commit e8b3876 into modelscope:main Apr 8, 2026
2 of 3 checks passed
