Conversation

@Jintao-Huang
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances data handling capabilities by introducing a 'split' truncation strategy, which intelligently divides large data samples into smaller, manageable chunks during pretraining to optimize resource usage. It also refines the default max_length parameter for quantization processes, making it context-aware. Additionally, the documentation has been updated to reflect these changes and to inform users about the benefits of Mcore-Bridge for Megatron-SWIFT workflows.

Highlights

  • New Truncation Strategy: 'split': Introduced a new truncation_strategy option, 'split', which allows oversized data samples to be broken down into multiple smaller samples. This aims to prevent Out-Of-Memory (OOM) errors and improve token utilization during pretraining. (A sketch of the idea follows this list.)
  • Conditional max_length for Quantization: The default max_length for calibration sets in quantization has been updated. It now defaults to None but will be automatically set to 2048 if a quant_method is specified, providing more flexible configuration.
  • Restrictions for 'split' Strategy: The 'split' truncation strategy is specifically designed for plain text causal language model pretraining (swift/megatron pt). It is not compatible with cached_dataset, chat templates, or multimodal models, and requires ms-swift>=3.11.
  • Mcore-Bridge Documentation: Added a note to the Megatron-SWIFT quick start guides, highlighting the support for Mcore-Bridge to help avoid the additional time cost associated with weight conversion.
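
As a rough illustration of the 'split' strategy from the first highlight, the sketch below chunks an over-long pretraining sample into max_length-sized pieces rather than discarding the overflow. It is a minimal sketch of the idea only, not the actual ms-swift implementation: the helper name split_sample and the encoded dict layout are assumptions, and the first-label masking mirrors the repository snippets quoted later in this review.

# Minimal sketch of the 'split' truncation strategy (hypothetical helper,
# not the ms-swift implementation).
def split_sample(encoded: dict, max_length: int) -> list:
    input_ids = encoded['input_ids']
    labels = encoded['labels']
    chunks = []
    for i in range(0, len(input_ids), max_length):
        chunk_labels = labels[i:i + max_length]
        # As in the repo snippets quoted below: mask the first label of each
        # chunk with -100 so the cross-entropy loss ignores it. The emptiness
        # check avoids the IndexError discussed later in this review.
        if chunk_labels:
            chunk_labels[0] = -100
        chunks.append({
            'input_ids': input_ids[i:i + max_length],
            'labels': chunk_labels,
        })
    return chunks

The conditional max_length default described in the second bullet amounts to falling back to 2048 only when a quant_method is set, and otherwise leaving the value unset.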

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for a new truncation_strategy, 'split', which is useful for pre-training tasks to avoid wasting tokens on long sequences. The changes span argument definitions, template encoding logic, and documentation.

I've identified a critical bug in the data collation logic when using the 'split' strategy without packing, and a missing validation check for cached_dataset incompatibility. I've also included a suggestion to improve code clarity.

By the way, there's a small typo in the pull request title: "spllit" should be "split".

Comment on lines 320 to 323
if args.truncation_strategy == 'split' and (args.task_type != 'causal_lm' or template.mode != 'train'
                                            or args.use_chat_template or args.model_meta.is_multimodal):
    raise ValueError(
        '`--truncation_strategy split` is currently only supported for plain text model pretraining')

high

The documentation mentions that truncation_strategy='split' is incompatible with cached_dataset. It would be good to add a check here to enforce this and prevent potential issues. For example, you could add if args.cached_dataset: raise ValueError(...) before this block or within a refactored conditional.
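
For concreteness, such a guard might look like the following (a sketch only; the args.cached_dataset attribute name is taken from this discussion and not verified against the codebase):

# Hypothetical guard per the suggestion above.
if args.truncation_strategy == 'split' and args.cached_dataset:
    raise ValueError('`--truncation_strategy split` is not compatible with `--cached_dataset`')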

while i < length:
    splited = {}
    for key in ['input_ids', 'labels', 'loss_scale']:
        value = locals()[key]

medium

Using locals() can be a bit magical and less explicit. It would be clearer and more robust to access the variables directly from the encoded dictionary, since they were populated from it just a few lines above. This improves maintainability by making the data flow more explicit.

Suggested change
- value = locals()[key]
+ value = encoded.get(key)
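
With that change, the surrounding loop would read roughly as follows (a sketch; the None check assumes encoded may omit some keys, e.g. loss_scale):

while i < length:
    splited = {}
    for key in ['input_ids', 'labels', 'loss_scale']:
        value = encoded.get(key)
        if value is None:
            continue  # e.g. loss_scale may be absent from encoded
        splited[key] = value[i:i + self.max_length]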

@Jintao-Huang
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a 'split' truncation strategy, primarily for pre-training, to handle long sequences by splitting them into multiple samples. The changes span argument definitions, template encoding logic, and the training pipeline, with corresponding documentation updates. My review identifies a critical bug that could lead to an IndexError during data processing and suggests a fix. I also recommend improving an assertion message for better clarity and adding a check to enforce a documented incompatibility. Lastly, there is a minor typo in the pull request title ('spllit' instead of 'split').

Comment on lines 1245 to 1249
value = value[i:i + self.max_length]
if key == 'labels':
    value[0] = -100
elif key == 'loss_scale':
    value[0] = 0

high

There's a potential IndexError here. If len(encoded.get(key)) is less than or equal to i, the slice value[i:i + self.max_length] will be an empty list, and accessing value[0] on an empty list raises an IndexError. This can happen if, for example, input_ids is longer than labels and the loop is still running based on len(input_ids). You should add a check to ensure the sliced value is not empty before accessing its first element.

Suggested change
- value = value[i:i + self.max_length]
- if key == 'labels':
-     value[0] = -100
- elif key == 'loss_scale':
-     value[0] = 0
+ value = value[i:i + self.max_length]
+ if value:
+     if key == 'labels':
+         value[0] = -100
+     elif key == 'loss_scale':
+         value[0] = 0
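
The failure mode is easy to reproduce in a plain Python session: slicing past the end of a list is silent, but indexing the empty result is not.

>>> value = [1, 2, 3]
>>> value[5:5 + 4]      # slice starting past the end: empty list, no error
[]
>>> value[5:5 + 4][0]   # indexing the empty slice raises
Traceback (most recent call last):
  ...
IndexError: list index out of range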

                                            or args.model_meta.is_multimodal):
    raise ValueError(
        '`--truncation_strategy split` is currently only supported for plain text model pretraining')
assert not args.lazy_tokenize, 'not support'

medium

The assertion message 'not support' is not very descriptive. It would be more helpful for developers if it clearly stated why the assertion failed. Additionally, the documentation mentions that truncation_strategy='split' is incompatible with cached_dataset. It would be good to add an assertion here to enforce this and prevent potential issues.

Suggested change
- assert not args.lazy_tokenize, 'not support'
+ assert not args.lazy_tokenize, '`--truncation_strategy split` does not support lazy tokenization'
+ assert not args.cached_dataset, '`--truncation_strategy split` is not compatible with `--cached_dataset`'

@Jintao-Huang merged commit e8c0282 into modelscope:main on Nov 19, 2025
1 of 2 checks passed