Skip to content

[dataset] support "loss_scale" in dataset#9214

Merged
Jintao-Huang merged 7 commits into
modelscope:mainfrom
Jintao-Huang:support_disable_auto_column_mapping
Apr 26, 2026
Merged

[dataset] support "loss_scale" in dataset#9214
Jintao-Huang merged 7 commits into
modelscope:mainfrom
Jintao-Huang:support_disable_auto_column_mapping

Conversation

@Jintao-Huang
Copy link
Copy Markdown
Collaborator

No description provided.

@Jintao-Huang Jintao-Huang changed the title [dataset] support "loss_scale" in template [dataset] support "loss_scale" in dataset Apr 26, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several enhancements to dataset handling and loss scaling. Key changes include the addition of a loss_scale field in datasets, a new disable_auto_column_mapping parameter, and support for is_binary_loss_scale to optimize memory usage with liger_kernel. The LossScale and Template classes were updated to handle aggregated message fields when merging consecutive assistant responses. Feedback highlights a potential risk where merging message content into lists might break downstream string-based processing, and identifies an unused import in swift/template/base.py.

Comment thread swift/template/base.py
Comment thread swift/template/base.py Outdated
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for a per-message loss_scale field in datasets and adds a disable_auto_column_mapping parameter to control automatic dataset column renaming. It also implements is_binary_loss_scale to optimize memory usage via liger_kernel when loss scales are binary. Key changes include updating the dataset preprocessor schema, refining message merging logic to handle multiple loss scales, and implementing lazy loading for templates. Feedback highlights a potential TypeError when processing merged assistant messages and concerns regarding backward compatibility for the loss field schema change.

Comment thread swift/template/base.py Outdated
Comment on lines +1155 to +1165
if self.template_backend == 'swift' and pre_role == 'assistant':
new_message = {}
for key in ['content', 'loss', 'loss_scale']:
values = []
for msg in [pre_message, message]:
value = msg.get(key)
if not isinstance(value, list):
value = [value]
values += value
new_message[key] = values
pre_message.update(new_message)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When merging consecutive assistant messages, the content field is now converted into a list of strings. However, the downstream logic in _swift_encode (specifically around line 1240) expects response to be either a string or a list of token IDs (integers). If it receives a list of strings, tokenizer.decode will be called on strings, leading to a TypeError. You should ensure that if response is a list of strings, it is handled correctly (e.g., by joining them or updating the decoding logic).

'role': Value(dtype='string'),
'content': Value(dtype='string'),
'loss': Value(dtype='float64'),
'loss': Value(dtype='bool'),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Changing the loss field type from float64 to bool in the Arrow schema might cause issues with existing cached datasets or datasets that use 0.0/1.0 for this field. While bool is more semantic for a toggle, ensure that this change doesn't break compatibility with existing data that might be loaded as floats.

@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several enhancements to dataset handling and loss calculation, including the addition of a loss_scale field for individual messages and a disable_auto_column_mapping parameter to control automatic dataset column mapping. It also adds the is_binary_loss_scale parameter to optimize memory usage when compatible with liger_kernel. The internal logic for merging consecutive assistant messages and calculating loss scales was refactored to support these new fields. Review feedback identified critical issues in the handling of token ID lists during context processing and message merging, where broad type checks could lead to data corruption or crashes.

Comment thread swift/loss_scale/base.py Outdated
Comment thread swift/template/base.py
@Jintao-Huang Jintao-Huang merged commit addedd6 into modelscope:main Apr 26, 2026
1 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants