Add sequence parallel compatibility with transformers >= 5.4.0 #9167
Conversation
Code Review
This pull request updates the minimum Python version requirement to 3.10 in the documentation and modifies the loss scaling logic in seq2seq_trainer.py. A review comment identifies that changing the label indexing from labels[:, 1:] to labels for calculating num_items_in_batch introduces an inconsistency with other parts of the trainer and may lead to incorrect gradient scaling in Causal Language Modeling tasks.
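As a toy illustration of why the indexing matters for causal LM loss averaging (the label values below are made up, not from the PR): logits at position i are scored against the label at position i + 1, so a label in the first column is never a prediction target and arguably should not be counted in the denominator.

```python
import torch

# Toy batch: 2 sequences of length 5; -100 marks positions ignored by the loss.
labels = torch.tensor([
    [17, 23, 42, -100, -100],
    [-100, 9, 11, 13, -100],
])

# Denominator the new code uses: every non-ignored label.
tokens_full = (labels != -100).sum()            # tensor(6)

# Denominator matching the shifted targets a causal LM is actually trained on.
tokens_shifted = (labels[:, 1:] != -100).sum()  # tensor(5)

# The extra token in the first column slightly deflates the averaged loss.
assert tokens_full > tokens_shifted
```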
/gemini review
Code Review
This pull request updates the minimum Python version requirement to 3.10 and refactors sequence parallel masking functions to improve compatibility with different versions of the transformers library. It also adjusts the loss normalization logic in the seq2seq trainer. However, the refactoring of flash_attention_mask and sdpa_mask introduces potential bugs: by switching to *args and **kwargs but only checking kwargs for specific parameters like attention_mask and cache_position, the code may fail to capture these values if they are passed positionally. This could result in missing masks or TypeError exceptions during execution.
/gemini review
Code Review
This pull request updates the minimum Python version to 3.10 and refactors attention masking and loss calculation logic to improve compatibility with newer transformer versions and sequence parallelism. Feedback highlights potential issues with positional argument handling in the sdpa_mask wrapper, concerns about the accuracy of the num_items_in_batch calculation for non-sequence-parallel training, and a discrepancy between the code and comments regarding the reduction group used for batch item counts.
```diff
 def sdpa_mask(*args, **kwargs):
     if self.world_size == 1:
-        return masking_utils.ALL_MASK_ATTENTION_FUNCTIONS._global_mapping['sdpa_origin'](batch_size, cache_position,
-                                                                                         kv_length, *args, **kwargs)
-    device = cache_position.device
+        return masking_utils.ALL_MASK_ATTENTION_FUNCTIONS._global_mapping['sdpa_origin'](*args, **kwargs)
+    if 'cache_position' in kwargs:
+        device = kwargs['cache_position'].device
+    else:
+        # transformers>=5.4.0
+        device = kwargs['device']
     cache_position = self.real_position_ids[0]
     cache_position = self.pad(cache_position, padding_value=-1, position_ids=self.real_position_ids, dim=0)
     cache_position = torch.arange(0, cache_position.shape[0], device=device)
-    kv_length = cache_position.shape[0]
-    return masking_utils.ALL_MASK_ATTENTION_FUNCTIONS._global_mapping['sdpa_origin'](batch_size, cache_position,
-                                                                                     kv_length, *args, **kwargs)
+    kwargs['kv_length'] = cache_position.shape[0]
+    if 'cache_position' in kwargs:
+        kwargs['cache_position'] = cache_position
+    else:
+        kwargs['q_length'] = kwargs['kv_length']
+    return masking_utils.ALL_MASK_ATTENTION_FUNCTIONS._global_mapping['sdpa_origin'](*args, **kwargs)
```
Similar to flash_attention_mask, the sdpa_mask wrapper fails to handle positional arguments correctly. If cache_position is passed positionally (index 1), it will not be detected in kwargs, leading to a KeyError when trying to access kwargs['device'] (or kwargs['cache_position']). Furthermore, updating kwargs while leaving the original positional args intact will cause a TypeError when calling sdpa_origin due to multiple values for the same parameter, or it will cause sdpa_origin to use the old positional value instead of the updated one.
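A minimal sketch of the kind of normalization this comment is asking for, assuming the pre-5.4.0 positional order `(batch_size, cache_position, kv_length)` seen in the diff above. The helper `normalize_mask_args` and the name `_POSITIONAL_NAMES` are illustrative, not part of the PR; the idea is to fold positional arguments into `kwargs` first, so lookups and overrides like `kwargs['cache_position']` work regardless of call style, and the wrapped function can then be invoked with keywords only.

```python
import torch

# Assumed positional order, taken from the pre-5.4.0 call in the diff above.
_POSITIONAL_NAMES = ('batch_size', 'cache_position', 'kv_length')


def normalize_mask_args(args, kwargs):
    """Fold positional arguments into a single kwargs dict.

    After this, a wrapper can read and override 'cache_position' / 'kv_length'
    by key regardless of how the caller passed them, and can forward keywords
    only, avoiding "multiple values for argument" errors.
    """
    merged = dict(zip(_POSITIONAL_NAMES, args))
    duplicated = merged.keys() & kwargs.keys()
    if duplicated:
        raise TypeError(f'got multiple values for: {sorted(duplicated)}')
    merged.update(kwargs)
    return merged


# A positional call (older transformers) and a keyword call (newer transformers)
# normalize to the same dict.
pos_call = normalize_mask_args((2, torch.arange(4), 4), {'dtype': torch.float32})
kw_call = normalize_mask_args(
    (), {'batch_size': 2, 'cache_position': torch.arange(4), 'kv_length': 4, 'dtype': torch.float32})
assert pos_call.keys() == kw_call.keys()

# Overriding by key is now safe for either call style.
pos_call['cache_position'] = torch.arange(8)
pos_call['kv_length'] = pos_call['cache_position'].shape[0]
```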
```diff
 if num_items_in_batch is None:
-    num_items_in_batch = (labels[:, 1:] != -100).sum()
+    # https://github.com/huggingface/transformers/blob/9dff7ca5c9693f4c02cdd2a9c2abc4772fcea5da/src/transformers/trainer.py#L2137
+    num_items_in_batch = (labels != -100).sum()  # compat SP
```
Changing the num_items_in_batch calculation to (labels != -100).sum() for all cases may introduce a slight inaccuracy in non-sequence-parallel training for Causal LMs. In the non-SP case, labels are typically not shifted yet, and the loss is computed on labels[:, 1:]. Using the full labels count as the denominator will result in a slightly lower loss value than intended. It is better to keep the conditional logic to ensure the count matches the tokens that actually contribute to the loss.
Suggested change:

```python
num_items_in_batch = (labels != -100).sum() if self.template.sequence_parallel_size > 1 else (labels[:, 1:] != -100).sum()  # compat SP
```

```diff
 # to full length via GatherLoss. Reduce the denominator
 # across the SP group so it matches the gathered loss.
-dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM, group=sequence_parallel.sp_group)
+dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM)
```
The all_reduce call for num_items_in_batch now performs a global reduction (default group) instead of reducing only across the sp_group. However, the comment on lines 201-203 still states that the reduction is 'across the SP group'. If a global reduction is intended (e.g., to support global token-based averaging in newer transformers versions), the comment should be updated. Otherwise, if the goal is to reconstruct the local batch size for the SP group, the group argument should be restored.
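For concreteness, the two resolutions described above look roughly like this. The names `num_items_in_batch` and `sp_group` mirror the diff; the single-process gloo setup is only there so the sketch runs standalone, and in real SP training the two calls would reduce over different sets of ranks and yield different sums.

```python
import os
import torch
import torch.distributed as dist

# Single-process illustration (gloo backend) of the two reduction scopes the
# comment contrasts; with multiple ranks, sp_group would be a proper subgroup.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

num_items_in_batch = torch.tensor(128)
sp_group = dist.new_group(ranks=[0])  # stand-in for sequence_parallel.sp_group

# Option A: restore the SP-group reduction so the denominator matches the loss
# gathered across that same group (keeps the existing comment accurate).
dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM, group=sp_group)

# Option B: keep the new global reduction over the default group (e.g. for
# global token-based averaging in newer transformers) and update the comment.
dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM)

dist.destroy_process_group()
```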