Conversation
Summary of Changes

Hello @meichangsu1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on enhancing the robustness and correctness of sequence parallelism, particularly for scenarios involving "packed batches" where multiple sequences are concatenated. Key changes include refined handling of attention masks and position IDs to ensure proper attention computation and prevent incorrect label supervision across sequence boundaries. The PR also improves the loss reduction mechanism for distributed training, corrects a gradient scaling error, and introduces a new example for packed datasets. These updates collectively aim to provide a more stable and accurate sequence parallel training experience.

Highlights
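The "prevent incorrect label supervision across sequence boundaries" point can be sketched on plain lists. This is a hypothetical illustration (the helper name and list-based shapes are assumptions, not the PR's actual tensor code): with shifted-label loss, the last token of one packed sequence would otherwise be trained to predict the first token of the next, so boundary labels are blanked with the ignore index.

```python
IGNORE_INDEX = -100  # standard ignore index for cross-entropy loss

def mask_cross_boundary_labels(labels, position_ids):
    # In a packed batch, position ids restart at 0 at each sequence start.
    # Blanking the label at each restart prevents the previous sequence's
    # final token from being supervised across the packing boundary.
    out = list(labels)
    for i, pos in enumerate(position_ids):
        if pos == 0:
            out[i] = IGNORE_INDEX
    return out
```

For example, two packed sequences with position ids `[0, 1, 2, 0, 1]` get their first labels masked at indices 0 and 3.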
Activity
Code Review
This pull request introduces significant fixes and improvements for sequence parallelism, with a focus on supporting packed datasets. Key changes include refactoring sequence parallel logic for clarity, fixing critical bugs in gradient calculation and label handling, and adding a new test file for sequence parallel attention. While the core changes are solid, I've identified a couple of areas for improvement in an example script and a minor code quality issue.
```python
if step % 1 == 0:
    metric = model.calculate_metric(is_training=True, adapter_name='default')
    _metrics = {}
    for key, value in metric.items():
        try:
            value = float(value)
            _metrics[key] = value
        except:
            pass
    logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
```
This block has a few issues:

- Inefficiency: Calculating metrics on every step (`step % 1 == 0`) can significantly slow down training. It's better to do this periodically, for example every 50 or 100 steps, as was done previously (`if step % 50 == 0 and step > 0:`).
- Redundant condition: The `if step % 1 == 0:` condition is always true and can be removed.
- Dead code: The `_metrics` dictionary is created and populated but never used. The log message uses the original `metric` dictionary. This block of code can be removed.
- Bare except: The `except:` on line 86 is a bare except, which is bad practice as it can hide unexpected errors. It should catch specific exceptions like `ValueError` or `TypeError`.
Consider refactoring this to calculate metrics periodically and removing the unused code for better performance and readability.
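A minimal sketch of the bare-except fix suggested above, assuming the metric dict can mix numeric and non-numeric values (the helper name is illustrative, not from the PR): only conversion failures are skipped, while unexpected errors still propagate.

```python
def collect_numeric_metrics(metric):
    """Convert metric values to float, skipping only values that cannot be converted."""
    _metrics = {}
    for key, value in metric.items():
        try:
            _metrics[key] = float(value)
        except (ValueError, TypeError):
            # specific exceptions instead of a bare `except:`, so bugs
            # such as AttributeError or KeyError are not silently hidden
            continue
    return _metrics
```

For example, `{'loss': '0.42', 'acc': 0.91, 'note': 'n/a'}` yields `{'loss': 0.42, 'acc': 0.91}`.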
```diff
-if step % 1 == 0:
-    metric = model.calculate_metric(is_training=True, adapter_name='default')
-    _metrics = {}
-    for key, value in metric.items():
-        try:
-            value = float(value)
-            _metrics[key] = value
-        except:
-            pass
-    logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
+if step > 0 and step % 50 == 0:
+    metric = model.calculate_metric(is_training=True, adapter_name='default')
+    logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
```
```diff
@@ -1,5 +1,6 @@
 # Copyright (c) ModelScope Contributors. All rights reserved.
 import contextlib
+import os
```
Add a new test file `test_sequence_parallel_single_attention.py` to verify the correctness of the sequence parallel attention implementation. The test includes a distributed setup using torch.distributed and compares outputs between sequence parallel and local attention modes. Also adds an empty `__init__.py` to the transformers test directory for proper module imports.
- Add `_enable_strict_determinism` helper to disable TF32 and enable deterministic algorithms
- Add `_to_local` helper to unwrap DTensors for gradient comparison
- Update test to use full world size for sequence parallel group and increase head count
- Switch to float32 dtype for stricter numerical alignment
- Improve gradient comparison by cloning and unwrapping tensors
…free logic

- Replace HfConfigFactory utility with direct get_config_attr function
- Move get_llm_model to shared transformers utilities
- Remove padding_free parameter and related conditional logic
- Simplify attention mask construction for padded tokens
- Update SequenceParallelConfig to drop padding_free field
- Add detection of packed batches via `_is_packed_position_ids` heuristic
- Raise RuntimeError when the SDPA backend is used with packed batches, as SDPA lacks native packed/varlen support
- Build a 2D attention_mask for padded sequences to ensure correct FlashAttention2 unpad behavior
- Avoid unnecessary 4D causal mask generation for packed/padding-free batches
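The idea behind a packed-batch heuristic like `_is_packed_position_ids` can be sketched on plain lists (the real helper presumably operates on tensors; this list version is an assumption for illustration): position ids that restart at 0 after the first token signal that several sequences were concatenated into one sample.

```python
def is_packed_position_ids(position_ids):
    # A packed sample concatenates several sequences, so its position ids
    # restart from 0 at each internal boundary, e.g. [0, 1, 2, 0, 1, 0, 1].
    # A single unpacked sequence only has 0 at index 0.
    return any(pos == 0 for pos in position_ids[1:])
```

A backend that lacks varlen support (such as SDPA, per the commit above) could then refuse such batches up front instead of silently attending across boundaries.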
Introduce a new cookbook script demonstrating supervised fine-tuning with a single controller using sequence parallelism (SP) and FSDP across 4 GPUs. The example includes:

- Device mesh configuration with dp=2 and fsdp=2 dimensions
- PackingDataset setup with self-cognition data and left truncation
- Training loop with LoRA adapter, AdamW optimizer, and periodic evaluation
- Checkpoint saving based on loss improvement
- Validation of FSDP + SP input slicing across multiple GPUs
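The "SP input slicing" validated by the cookbook can be sketched as follows; this is a hypothetical list-based helper (name and shapes assumed, not taken from the PR), showing how each sequence-parallel rank receives one contiguous shard of the token sequence.

```python
def sp_slice(tokens, sp_rank, sp_world_size):
    # Each sequence-parallel rank holds one contiguous shard of the sequence;
    # the sequence length must divide evenly across the SP group.
    assert len(tokens) % sp_world_size == 0, 'sequence length must divide sp_world_size'
    shard = len(tokens) // sp_world_size
    return tokens[sp_rank * shard:(sp_rank + 1) * shard]
```

With 8 tokens and an SP group of 4, rank 1 would hold tokens 2 and 3.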
9b9d4bc to 2086e87
…formers cookbook

- Add new single_controller_sp.py example demonstrating FSDP + SP validation over 4 GPUs
- Move legacy single_controller_sp.py to transformers/sp_fsdp_dense.py
- Add shell script sp_fsdp_dense.sh for running the example
- Update imports and structure to use twinkle framework components
…irectory

Relocate test_sequence_parallel_single_attention.py from tests/transformers/ to tests/sequence_parallel/ to better organize test files by feature area. This improves maintainability and aligns with the project's test structure conventions.
- Add bash script header and comments to `sp_fsdp_dense.sh` explaining how to enable sequence parallelism with ulysses_size
- Remove duplicate `import os` statement in transformers.py for cleaner code
- Fix minor formatting by removing an extra blank line in transformers_utils.py
- Switch from `ray` to `local` mode for twinkle initialization
- Add evaluation function with separate dataset slice
- Increase dataset size from 100 to 500 samples
- Add cosine warmup learning rate scheduler
- Remove unused torch import and remote_group parameters
- Adjust batch size from 4 to 8 and logging frequency to every 20 steps
- Improve logging with train configs and total steps information
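The "cosine warmup learning rate scheduler" mentioned above typically has the following shape; this is a generic sketch (function name and signature are assumptions, not the cookbook's actual scheduler): linear warmup to the base learning rate, then cosine decay toward zero.

```python
import math

def cosine_warmup_lr(step, total_steps, warmup_steps, base_lr):
    # Linear warmup: ramp from base_lr/warmup_steps up to base_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay: base_lr at the end of warmup, approaching 0 at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=100`, `warmup_steps=10`, and `base_lr=0.1`, the rate ramps to 0.1 by step 10 and decays smoothly afterwards.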
Removed unnecessary imports (`math`, `os`, `SimpleNamespace`) from the sequence_parallel strategy file to clean up the codebase and improve maintainability.
No description provided.