Conversation

@Jintao-Huang
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances data handling capabilities by introducing a 'split' truncation strategy, which intelligently divides large data samples into smaller, manageable chunks during pretraining to optimize resource usage. It also refines the default max_length parameter for quantization processes, making it context-aware. Additionally, the documentation has been updated to reflect these changes and to inform users about the benefits of Mcore-Bridge for Megatron-SWIFT workflows.

Highlights

  • New Truncation Strategy: 'split': Introduced a new truncation_strategy option, 'split', which allows oversized data samples to be broken down into multiple smaller samples. This aims to prevent Out-Of-Memory (OOM) errors and improve token utilization during pretraining. (A sketch of the idea follows this list.)
  • Conditional max_length for Quantization: The default max_length for calibration sets in quantization has been updated. It now defaults to None but will be automatically set to 2048 if a quant_method is specified, providing more flexible configuration.
  • Restrictions for 'split' Strategy: The 'split' truncation strategy is specifically designed for plain text causal language model pretraining (swift/megatron pt). It is not compatible with cached_dataset, chat templates, or multimodal models, and requires ms-swift>=3.11.
  • Mcore-Bridge Documentation: Added a note to the Megatron-SWIFT quick start guides, highlighting the support for Mcore-Bridge to help avoid the additional time cost associated with weight conversion.
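
As a rough illustration of the 'split' strategy from the first highlight, the sketch below chunks an over-long pretraining sample into max_length-sized pieces rather than discarding the overflow. It is a minimal sketch of the idea only, not the actual ms-swift implementation: the helper name split_sample and the encoded dict layout are assumptions, and the first-label masking mirrors the repository snippets quoted later in this review.

# Minimal sketch of the 'split' truncation strategy (hypothetical helper,
# not the ms-swift implementation).
def split_sample(encoded: dict, max_length: int) -> list:
    input_ids = encoded['input_ids']
    labels = encoded['labels']
    chunks = []
    for i in range(0, len(input_ids), max_length):
        chunk_labels = labels[i:i + max_length]
        # As in the repo snippets quoted below: mask the first label of each
        # chunk with -100 so the cross-entropy loss ignores it. The emptiness
        # check avoids the IndexError discussed later in this review.
        if chunk_labels:
            chunk_labels[0] = -100
        chunks.append({
            'input_ids': input_ids[i:i + max_length],
            'labels': chunk_labels,
        })
    return chunks

The conditional max_length default described in the second bullet amounts to falling back to 2048 only when a quant_method is set, and otherwise leaving the value unset.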

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for a new truncation_strategy, 'split', which is useful for pre-training tasks to avoid wasting tokens on long sequences. The changes span argument definitions, template encoding logic, and documentation.

I've identified a critical bug in the data collation logic when using the 'split' strategy without packing, and a missing validation check for cached_dataset incompatibility. I've also included a suggestion to improve code clarity.

By the way, there's a small typo in the pull request title: "spllit" should be "split".

Comment on lines 320 to 323
if args.truncation_strategy == 'split' and (args.task_type != 'causal_lm' or template.mode != 'train'
                                            or args.use_chat_template or args.model_meta.is_multimodal):
    raise ValueError(
        '`--truncation_strategy split` is currently only supported for plain text model pretraining')

high

The documentation mentions that truncation_strategy='split' is incompatible with cached_dataset. It would be good to add a check here to enforce this and prevent potential issues. For example, you could add if args.cached_dataset: raise ValueError(...) before this block or within a refactored conditional.
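
For concreteness, such a guard might look like the following (a sketch only; the args.cached_dataset attribute name is taken from this discussion and not verified against the codebase):

# Hypothetical guard per the suggestion above.
if args.truncation_strategy == 'split' and args.cached_dataset:
    raise ValueError('`--truncation_strategy split` is not compatible with `--cached_dataset`')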

while i < length:
    splited = {}
    for key in ['input_ids', 'labels', 'loss_scale']:
        value = locals()[key]

medium

Using locals() can be a bit magical and less explicit. It would be clearer and more robust to access the variables directly from the encoded dictionary, since they were populated from it just a few lines above. This improves maintainability by making the data flow more explicit.

Suggested change
- value = locals()[key]
+ value = encoded.get(key)
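
With that change, the surrounding loop would read roughly as follows (a sketch; the None check assumes encoded may omit some keys, e.g. loss_scale):

while i < length:
    splited = {}
    for key in ['input_ids', 'labels', 'loss_scale']:
        value = encoded.get(key)
        if value is None:
            continue  # e.g. loss_scale may be absent from encoded
        splited[key] = value[i:i + self.max_length]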

@Jintao-Huang
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a 'split' truncation strategy, primarily for pre-training, to handle long sequences by splitting them into multiple samples. The changes span argument definitions, template encoding logic, and the training pipeline, with corresponding documentation updates. My review identifies a critical bug that could lead to an IndexError during data processing and suggests a fix. I also recommend improving an assertion message for better clarity and adding a check to enforce a documented incompatibility. Lastly, there is a minor typo in the pull request title ('spllit' instead of 'split').

Comment on lines 1245 to 1249
value = value[i:i + self.max_length]
if key == 'labels':
    value[0] = -100
elif key == 'loss_scale':
    value[0] = 0

high

There's a potential IndexError here. If len(encoded.get(key)) is less than or equal to i, the slice value[i:i + self.max_length] will be an empty list, and accessing value[0] on an empty list raises an IndexError. This can happen if, for example, input_ids is longer than labels and the loop is still running based on len(input_ids). You should add a check to ensure the sliced value is not empty before accessing its first element.

Suggested change
- value = value[i:i + self.max_length]
- if key == 'labels':
-     value[0] = -100
- elif key == 'loss_scale':
-     value[0] = 0
+ value = value[i:i + self.max_length]
+ if value:
+     if key == 'labels':
+         value[0] = -100
+     elif key == 'loss_scale':
+         value[0] = 0
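
The failure mode is easy to reproduce in a plain Python session: slicing past the end of a list is silent, but indexing the empty result is not.

>>> value = [1, 2, 3]
>>> value[5:5 + 4]      # slice starting past the end: empty list, no error
[]
>>> value[5:5 + 4][0]   # indexing the empty slice raises
Traceback (most recent call last):
  ...
IndexError: list index out of range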

                                            or args.model_meta.is_multimodal):
    raise ValueError(
        '`--truncation_strategy split` is currently only supported for plain text model pretraining')
assert not args.lazy_tokenize, 'not support'

medium

The assertion message 'not support' is not very descriptive. It would be more helpful for developers if it clearly stated why the assertion failed. Additionally, the documentation mentions that truncation_strategy='split' is incompatible with cached_dataset. It would be good to add an assertion here to enforce this and prevent potential issues.

Suggested change
- assert not args.lazy_tokenize, 'not support'
+ assert not args.lazy_tokenize, '`--truncation_strategy split` does not support lazy tokenization'
+ assert not args.cached_dataset, '`--truncation_strategy split` is not compatible with `--cached_dataset`'

@Jintao-Huang merged commit e8c0282 into modelscope:main on Nov 19, 2025
1 of 2 checks passed