sequence parallel #25

Merged
meichangsu1 merged 7 commits into dev from sp_ljl_dev
Feb 3, 2026

Conversation

@meichangsu1
Collaborator

No description provided.

Add `to_transformers_dict` function to convert InputFeature instances into a dictionary compatible with transformers models. The function extracts relevant keys and ensures values are either numpy arrays or torch tensors as required by the transformers library.
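A minimal sketch of such a converter, assuming the key list from the diff; the coercion rules here are illustrative, not the real implementation:

```python
import numpy as np

def to_transformers_dict(feature: dict) -> dict:
    """Sketch: keep only transformers-relevant keys, coercing plain sequences
    to numpy arrays. Key list taken from the PR; coercion rules are assumed."""
    try:
        import torch  # the original imports torch lazily inside the function
        tensor_types = (np.ndarray, torch.Tensor)
    except ImportError:
        tensor_types = (np.ndarray,)
    _keys = {'input_ids', 'input_embeddings', 'attention_mask', 'position_ids',
             'labels', 'completion_mask', 'logits_to_keep', 'num_items_in_batch'}
    out = {}
    for key, value in feature.items():
        if key not in _keys or value is None:
            continue
        if isinstance(value, (list, tuple)):
            value = np.asarray(value)  # plain sequences become numpy arrays
        out[key] = value  # arrays/tensors and scalars pass through unchanged
    return out
```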

- Update `eval` function to pass `adapter_name="default"` to `forward_only`, `calculate_loss`, and `calculate_metric` methods
- In `train` function, set optimizer for adapter and include `adapter_name` in `get_train_configs`, `forward_backward`, and `clip_grad_and_step` calls
- Ensures proper adapter-specific operations during training and evaluation

- Precompute decay and no-decay parameter name lists before optimizer group creation
- Add explicit param_names field to optimizer groups for better debugging and transparency
- Maintain identical functional behavior while improving code readability

Modify TransformersModel to only apply sp_strategy.postprocess_outputs when labels are None, preventing unintended postprocessing during training or evaluation with labels present. This ensures postprocessing is reserved for inference scenarios.

Add conditional loss reduction using sp_strategy when labels are present in inputs. This ensures that the loss calculation accounts for the sp_strategy's specific reduction logic, improving model training consistency and alignment with the strategy's objectives.
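Why a dedicated reduction matters: with uneven sequence shards, averaging per-rank mean losses differs from the true token-mean loss. A simulated version of the reduction (the real sp_strategy.reduce_loss would obtain both totals via an all-reduce over the SP process group):

```python
def reduce_loss(rank_loss_sums, rank_token_counts):
    """Simulated SP loss reduction: each rank contributes the sum of its
    token losses and its token count; the global token-mean loss divides
    the two totals. In the real strategy both totals come from an
    all-reduce over the sequence-parallel process group."""
    return sum(rank_loss_sums) / sum(rank_token_counts)

# Rank 0 holds 3 tokens (loss sum 6.0); rank 1 holds 1 token (loss sum 1.0).
global_mean = reduce_loss([6.0, 1.0], [3, 1])  # 7.0 / 4 = 1.75
naive_mean = (6.0 / 3 + 1.0 / 1) / 2           # 1.5 — biased toward the short shard
```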

- Add comprehensive docstring to `_get_sp_group_from_device_mesh` explaining how SP groups are derived when no explicit "sp" mesh dimension exists
- Include inline comments in backward passes and attention logic to clarify gradient handling and layout transformations
- Improve readability and maintainability of sequence parallel implementation
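The group derivation that the new docstring explains can be illustrated with a pure function. The consecutive-rank layout below is one common convention, not necessarily what `_get_sp_group_from_device_mesh` actually does; the real helper reads the layout off the device mesh:

```python
def derive_sp_groups(world_size: int, sp_size: int) -> list[list[int]]:
    """Illustrative rank layout when the mesh has no explicit "sp" dimension:
    block consecutive ranks so each group of sp_size ranks shares one
    sequence-parallel group (an assumed convention, for illustration only)."""
    assert world_size % sp_size == 0, "sp_size must divide world_size"
    return [list(range(start, start + sp_size))
            for start in range(0, world_size, sp_size)]
```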
@gemini-code-assist
Contributor

Summary of Changes

Hello @meichangsu1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the implementation of sequence parallelism within the twinkle framework, focusing on better integration with various training strategies and adapter-based fine-tuning. It introduces explicit strategy selection for TransformersModel, ensures adapter-aware operations in a sample SFT cookbook, and significantly enhances the clarity of the sequence parallelism logic through detailed code comments. Additionally, a new utility function is provided to facilitate data preparation for transformers models.

Highlights

  • Enhanced Sequence Parallelism Integration: Improved handling of sequence parallelism (SP) by adding conditional logic for post-processing outputs and integrating loss reduction across SP ranks, ensuring correct behavior in distributed training.
  • Flexible Training Strategy Selection: Introduced the ability to choose between 'accelerate' and 'native_fsdp' strategies for TransformersModel initialization, enhancing flexibility for distributed training setups.
  • Adapter-Aware Operations: Modified single_controller_sp.py to explicitly pass adapter_name="default" to model operations, ensuring correct behavior when using adapters like LoRA in a sequence parallel context.
  • Improved Code Clarity for SP: Added comprehensive docstrings and comments to the sequence_parallel.py module, explaining the intricate logic of SP group derivation, data padding, splitting, gathering, and gradient handling.
  • New Data Utility Function: Introduced to_transformers_dict in input_feature.py to streamline the conversion of InputFeature objects into a format compatible with transformers models.
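The strategy-selection highlight can be sketched as a simple dispatch. The class names follow the changelog (NativeFSDPStrategy is imported by the PR; AccelerateStrategy is assumed), and the constructors and the exact switch inside TransformersModel.__init__ are illustrative, not the real code:

```python
class AccelerateStrategy:      # stand-in for the real twinkle strategy class
    name = 'accelerate'

class NativeFSDPStrategy:      # named in the PR; body here is a stub
    name = 'native_fsdp'

_STRATEGIES = {'accelerate': AccelerateStrategy, 'native_fsdp': NativeFSDPStrategy}

def select_strategy(strategy: str = 'accelerate'):
    """Hypothetical shape of the new `strategy` parameter: map the string to
    a strategy class and fail fast on unknown names."""
    try:
        return _STRATEGIES[strategy]()
    except KeyError:
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {sorted(_STRATEGIES)}"
        ) from None
```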

Changelog
  • cookbook/sft/single_controller_sp.py
    • Updated model calls (forward_only, calculate_loss, calculate_metric, forward_backward, clip_grad_and_step, get_train_configs) to include adapter_name="default".
    • Added model.set_optimizer call with adapter_name="default".
  • src/twinkle/data_format/input_feature.py
    • Added to_transformers_dict function to convert InputFeature to a transformers-compatible dictionary.
  • src/twinkle/model/transformers/strategy/sequence_parallel.py
    • Added extensive docstrings and comments to functions and methods explaining sequence parallel logic (e.g., _get_sp_group_from_device_mesh, _LossReduction, _SeqAllToAllAttention, pad, gather, split).
    • Removed the _mask_qkv method.
  • src/twinkle/model/transformers/transformers.py
    • Imported NativeFSDPStrategy.
    • Added strategy parameter to TransformersModel.__init__ to select between 'accelerate' and 'native_fsdp'.
    • Modified self.strategy initialization based on the new strategy parameter.
    • Updated forward and forward_only to conditionally call sp_strategy.postprocess_outputs only when labels is None.
    • Integrated self.sp_strategy.reduce_loss into calculate_loss for sequence parallel setups.
    • Refactored _get_optimizer_grouped_parameters to include param_names and improve parameter grouping logic.
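As a toy illustration of the pad/split/gather steps mentioned in the sequence_parallel.py changelog (the real code operates on batched torch tensors and communicates over an SP process group):

```python
def pad_and_split(tokens: list[int], sp_size: int, pad_id: int = 0) -> list[list[int]]:
    """Toy version of the SP pad/split step: pad the sequence to a multiple
    of sp_size, then hand each rank one contiguous shard."""
    remainder = len(tokens) % sp_size
    if remainder:
        tokens = tokens + [pad_id] * (sp_size - remainder)
    shard_len = len(tokens) // sp_size
    return [tokens[i * shard_len:(i + 1) * shard_len] for i in range(sp_size)]

def gather(shards: list[list[int]]) -> list[int]:
    """Inverse of the split; in the real strategy this is an all-gather
    over the sequence-parallel process group."""
    return [tok for shard in shards for tok in shard]
```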
Activity
  • No activity (comments, reviews, etc.) was provided for this pull request.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for sequence parallelism, primarily by adding a NativeFSDPStrategy and integrating it into the TransformersModel. The changes also involve extensive additions of comments and docstrings to clarify the new sequence parallel logic, especially in src/twinkle/model/transformers/strategy/sequence_parallel.py. My review focuses on improving code efficiency and maintainability in the new and modified code sections. I've suggested using constants for magic strings, optimizing data handling functions, and improving the efficiency of parameter grouping logic.

Comment on lines +76 to +78
model.forward_only(inputs=batch, adapter_name="default")
model.calculate_loss(adapter_name="default")
metrics = model.calculate_metric(is_training=False, adapter_name="default")

medium

The string "default" is used as a magic string for adapter_name in multiple places in this file (e.g., lines 76, 77, 78, 101, 102, 104, 115, 119). It would be better to define this as a constant at the beginning of the file to improve readability and maintainability. For example: DEFAULT_ADAPTER_NAME = "default".

Comment on lines +55 to +56
_keys = ['input_ids', 'input_embeddings', 'attention_mask', 'position_ids', 'labels', 'completion_mask', 'logits_to_keep', 'num_items_in_batch']
for key in list(feature.keys()):

medium

For performance, it's better to use a set for _keys for O(1) average time complexity for membership testing. Also, iterating over list(feature.keys()) is inefficient as it creates a new list. You can iterate directly over the dictionary keys.

Additionally, import torch is inside the function. According to PEP 8, imports should usually be at the top of the file, unless there's a specific reason for lazy loading.

Suggested change:

Before:
    _keys = ['input_ids', 'input_embeddings', 'attention_mask', 'position_ids', 'labels', 'completion_mask', 'logits_to_keep', 'num_items_in_batch']
    for key in list(feature.keys()):

After:
    _keys = {'input_ids', 'input_embeddings', 'attention_mask', 'position_ids', 'labels', 'completion_mask', 'logits_to_keep', 'num_items_in_batch'}
    for key in feature:

Comment on lines +446 to +451
decay_param_names = [
n for n, p in params.items() if n in decay_parameters and p.requires_grad
]
no_decay_param_names = [
n for n, p in params.items() if n not in decay_parameters and p.requires_grad
]

medium

This logic iterates over params.items() twice to create decay_param_names and no_decay_param_names. You could achieve the same result with a single loop for better performance, especially if params is large.

Consider this alternative:

decay_param_names = []
no_decay_param_names = []
for n, p in params.items():
    if p.requires_grad:
        if n in decay_parameters:
            decay_param_names.append(n)
        else:
            no_decay_param_names.append(n)

Also, for better performance of n in decay_parameters, consider converting decay_parameters to a set after it's created.
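Putting the review's three suggestions together — a single pass, set membership, and the explicit param_names field this PR adds — a hedged sketch (the real helper's signature and group layout may differ; SimpleNamespace stands in for torch parameters):

```python
from types import SimpleNamespace

def grouped_parameters(named_params, decay_parameters, weight_decay=0.01):
    """Sketch of the refactored grouping: one pass over the parameters,
    set membership for decay_parameters, and an explicit param_names entry
    per optimizer group for debugging."""
    decay_parameters = set(decay_parameters)  # O(1) membership tests
    decay, no_decay = [], []
    for name, param in named_params:
        if not param.requires_grad:
            continue  # frozen parameters are excluded from both groups
        (decay if name in decay_parameters else no_decay).append((name, param))
    return [
        {'params': [p for _, p in decay],
         'param_names': [n for n, _ in decay],
         'weight_decay': weight_decay},
        {'params': [p for _, p in no_decay],
         'param_names': [n for n, _ in no_decay],
         'weight_decay': 0.0},
    ]
```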

@meichangsu1 meichangsu1 merged commit 3837bc9 into dev Feb 3, 2026