
[bugfix] fix cache_dataset truncation_strategy#9210

Merged
Jintao-Huang merged 2 commits into modelscope:main from Jintao-Huang:fix_cache_dataset_truncation_strategy
Apr 25, 2026

Conversation

@Jintao-Huang
Collaborator

No description provided.

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the dataset selection logic in swift/pipelines/utils.py to support configurable truncation strategies, including a 'delete' strategy and a mapping-based truncation. The review feedback identifies potential runtime errors, such as a TypeError when handling scalar length values in the truncation function and a ValueError when calling max() on empty lists. Additionally, it is recommended to use sum() instead of max() when filtering multi-turn conversation segments to accurately reflect the total sequence length against the maximum limit.
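The sum()-versus-max() point from the review can be illustrated with a minimal sketch. The helper below is hypothetical (the function name `keeps_sample` and the standalone `segments` data are not from the PR); it only demonstrates why comparing the longest segment against the limit is weaker than comparing the total:

```python
def keeps_sample(lengths, max_length):
    """Decide whether a multi-turn sample fits within max_length.

    `lengths` holds the token count of each conversation segment.
    Comparing max(lengths) only checks the longest single segment;
    sum(lengths) reflects the total sequence length that actually goes
    into the model input, which is what the limit is meant to guard.
    """
    return sum(lengths) <= max_length


segments = [400, 350, 300]               # each segment fits individually
assert max(segments) <= 1000             # a max()-based check would keep this sample
assert not keeps_sample(segments, 1000)  # but the total, 1050, exceeds the limit
```

This is why the reviewer suggests `sum()` when the per-sample value is a list of per-turn lengths rather than a single scalar.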

Comment thread swift/pipelines/utils.py Outdated
Comment thread swift/pipelines/utils.py Outdated
@Jintao-Huang
Collaborator Author

/gemini review

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the _select_dataset function in swift/pipelines/utils.py to support a configurable truncation strategy. When the strategy is set to 'delete', the dataset is filtered based on maximum length; otherwise, the lengths are truncated using a mapping function. The review feedback highlights a potential TypeError in the mapping logic when handling scalar values and suggests resolving variable shadowing. Additionally, it recommends using dataset.filter with multi-processing to improve performance and safely handle empty lists in the deletion logic.
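The filter-based 'delete' strategy the review recommends can be sketched as a standalone predicate. This is an illustration under assumptions, not the merged code: the helper name `build_delete_filter` is hypothetical, and the schema (a `'lengths'` field that may be a scalar or a list) is inferred from the review comments. It also shows the empty-list guard the review asks for, since `max([])` raises `ValueError`:

```python
def build_delete_filter(max_length):
    """Predicate for the 'delete' truncation strategy: keep a sample only
    if its recorded length(s) fit within max_length.

    Handles both scalar and list 'lengths' values, and treats an empty
    list as length 0 so max() never raises ValueError on empty input.
    Intended to be passed to Hugging Face datasets' filter, e.g.
    dataset.filter(keep, num_proc=args.dataset_num_proc), which scans
    samples in parallel worker processes.
    """
    def keep(example):
        lengths = example['lengths']
        if isinstance(lengths, list):
            return (max(lengths) if lengths else 0) <= max_length
        return lengths <= max_length
    return keep
```

Returning a closure keeps `max_length` baked into the predicate, so the same builder works for different limits without globals.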

Comment thread swift/pipelines/utils.py
Comment on lines +62 to +66
new_dataset = dataset.map(
    lambda rows: {'lengths': [[min(length, max_length) for length in lengths] for lengths in rows['lengths']]},
    num_proc=args.dataset_num_proc,
    load_from_cache_file=args.load_from_cache_file,
    batched=True)
Contributor


Severity: high

There are two issues in this map call:

  1. Potential Crash: The lambda assumes each sample's 'lengths' value is always a list (via the inner loop for length in lengths). However, the 'delete' strategy logic (line 58) indicates that 'lengths' can also be a scalar. If a scalar is encountered, this will raise a TypeError: 'int' object is not iterable.
  2. Variable Shadowing: The variable name lengths is used for both the batch of lengths and an individual sample's lengths, which reduces readability and can lead to bugs.

The suggested fix handles both scalars and lists while avoiding variable shadowing.

Suggested change
new_dataset = dataset.map(
    lambda rows: {'lengths': [[min(length, max_length) for length in lengths] for lengths in rows['lengths']]},
    num_proc=args.dataset_num_proc,
    load_from_cache_file=args.load_from_cache_file,
    batched=True)
new_dataset = dataset.map(
    lambda rows: {'lengths': [([min(l, max_length) for l in row_lengths] if isinstance(row_lengths, list) else min(row_lengths, max_length)) for row_lengths in rows['lengths']]},
    num_proc=args.dataset_num_proc,
    load_from_cache_file=args.load_from_cache_file,
    batched=True)
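Extracted as a plain function, the mixed scalar/list handling from the suggestion behaves as follows. This is a standalone sketch for illustration (the name `truncate_lengths` is hypothetical, not the merged code):

```python
def truncate_lengths(batch_lengths, max_length):
    """Clamp recorded lengths to max_length for a batch of samples.

    Each entry may be a list of per-segment lengths or a single scalar,
    mirroring the mixed schema the reviewer flagged. Clamping a scalar
    with the original list comprehension alone would raise
    TypeError: 'int' object is not iterable.
    """
    clamped = []
    for row_lengths in batch_lengths:
        if isinstance(row_lengths, list):
            clamped.append([min(seg_len, max_length) for seg_len in row_lengths])
        else:
            clamped.append(min(row_lengths, max_length))
    return clamped


# Lists are clamped element-wise; scalars are clamped directly.
assert truncate_lengths([[3, 15], 20, [7]], 10) == [[3, 10], 10, [7]]
```

Using a distinct name (`row_lengths` / `seg_len`) for each loop level also resolves the shadowing the review points out.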

Comment thread swift/pipelines/utils.py
@Jintao-Huang Jintao-Huang merged commit d49add2 into modelscope:main Apr 25, 2026
2 of 3 checks passed
