[bugfix] fix cache_dataset truncation_strategy #9210
Conversation
Code Review
This pull request updates the dataset selection logic in `swift/pipelines/utils.py` to support configurable truncation strategies, including a 'delete' strategy and a mapping-based truncation. The review feedback identifies potential runtime errors, such as a `TypeError` when handling scalar length values in the truncation function and a `ValueError` when calling `max()` on an empty list. Additionally, it recommends using `sum()` instead of `max()` when filtering multi-turn conversation segments, so the total sequence length is compared against the maximum limit.
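The filtering concern above can be sketched as a plain predicate. This is a hypothetical helper (`keep_row` is not a name from the PR), assuming rows carry a `lengths` field that may be either a scalar (single-turn) or a list of per-turn lengths: `sum()` compares the total against the limit, and since `sum([]) == 0`, an empty list is kept safely instead of raising the `ValueError` that `max([])` would.

```python
def keep_row(lengths, max_length):
    """Return True if a row fits within max_length under the 'delete' strategy.

    `lengths` may be a scalar (single-turn sample) or a list of per-turn
    lengths (multi-turn sample). Using sum() rather than max() compares the
    total sequence length against the limit; sum([]) is 0, so empty lists
    never raise (unlike max([]), which raises ValueError).
    """
    if isinstance(lengths, list):
        total = sum(lengths)
    else:
        total = lengths
    return total <= max_length
```

A `dataset.filter(...)` call would then apply this per row, e.g. `dataset.filter(lambda row: keep_row(row['lengths'], max_length), num_proc=args.dataset_num_proc)`.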
/gemini review
Code Review
This pull request updates the `_select_dataset` function in `swift/pipelines/utils.py` to support a configurable truncation strategy. When the strategy is set to 'delete', the dataset is filtered by maximum length; otherwise, the lengths are truncated using a mapping function. The review feedback highlights a potential `TypeError` in the mapping logic when handling scalar values and suggests resolving variable shadowing. Additionally, it recommends using `dataset.filter` with multi-processing to improve performance and safely handling empty lists in the deletion logic.
```python
new_dataset = dataset.map(
    lambda rows: {'lengths': [[min(length, max_length) for length in lengths] for lengths in rows['lengths']]},
    num_proc=args.dataset_num_proc,
    load_from_cache_file=args.load_from_cache_file,
    batched=True)
```
There are two issues in this `map` call:

- **Potential Crash:** The lambda assumes `row_lengths` is always a list (via the inner loop `for length in lengths`). However, the 'delete' strategy logic (line 58) indicates that 'lengths' can also be a scalar. If a scalar is encountered, this will raise a `TypeError: 'int' object is not iterable`.
- **Variable Shadowing:** The variable name `lengths` is used for both the batch of lengths and the individual sample's lengths, which reduces readability and can lead to bugs.

The suggested fix handles both scalars and lists while avoiding variable shadowing.
```diff
-new_dataset = dataset.map(
-    lambda rows: {'lengths': [[min(length, max_length) for length in lengths] for lengths in rows['lengths']]},
-    num_proc=args.dataset_num_proc,
-    load_from_cache_file=args.load_from_cache_file,
-    batched=True)
+new_dataset = dataset.map(
+    lambda rows: {'lengths': [([min(l, max_length) for l in row_lengths] if isinstance(row_lengths, list) else min(row_lengths, max_length)) for row_lengths in rows['lengths']]},
+    num_proc=args.dataset_num_proc,
+    load_from_cache_file=args.load_from_cache_file,
+    batched=True)
```
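The suggested one-line lambda can also be expressed as a named helper, which makes the scalar/list branching easier to read. This is a sketch, not code from the PR; `truncate_lengths` is a hypothetical name:

```python
def truncate_lengths(row_lengths, max_length):
    """Clip lengths to max_length.

    Handles both representations the review identifies: a scalar length
    (single-turn sample) and a list of per-turn lengths (multi-turn sample).
    """
    if isinstance(row_lengths, list):
        return [min(length, max_length) for length in row_lengths]
    return min(row_lengths, max_length)
```

The batched `map` call would then become `dataset.map(lambda rows: {'lengths': [truncate_lengths(rl, max_length) for rl in rows['lengths']]}, batched=True)`, with the helper keeping the batch variable (`rows['lengths']`) and the per-sample variable (`row_lengths`) clearly distinct.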