Conversation
Summary of Changes

Hello @meichangsu1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on enhancing the robustness and correctness of sequence parallelism, particularly for scenarios involving "packed batches" where multiple sequences are concatenated. Key changes include refined handling of attention masks and position IDs to ensure proper attention computation and prevent incorrect label supervision across sequence boundaries. The PR also improves the loss reduction mechanism for distributed training, corrects a gradient scaling error, and introduces a new example for packed datasets. These updates collectively aim to provide a more stable and accurate sequence parallel training experience.

Highlights
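The "prevent incorrect label supervision across sequence boundaries" point can be sketched on plain lists. This is a hypothetical illustration (the helper name and list-based shapes are assumptions, not the PR's actual tensor code): with shifted-label loss, the last token of one packed sequence would otherwise be trained to predict the first token of the next, so boundary labels are blanked with the ignore index.

```python
IGNORE_INDEX = -100  # standard ignore index for cross-entropy loss

def mask_cross_boundary_labels(labels, position_ids):
    # In a packed batch, position ids restart at 0 at each sequence start.
    # Blanking the label at each restart prevents the previous sequence's
    # final token from being supervised across the packing boundary.
    out = list(labels)
    for i, pos in enumerate(position_ids):
        if pos == 0:
            out[i] = IGNORE_INDEX
    return out
```

For example, two packed sequences with position ids `[0, 1, 2, 0, 1]` get their first labels masked at indices 0 and 3.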
Activity
Code Review
This pull request introduces significant fixes and improvements for sequence parallelism, with a focus on supporting packed datasets. Key changes include refactoring sequence parallel logic for clarity, fixing critical bugs in gradient calculation and label handling, and adding a new test file for sequence parallel attention. While the core changes are solid, I've identified a couple of areas for improvement in an example script and a minor code quality issue.
```python
if step % 1 == 0:
    metric = model.calculate_metric(is_training=True, adapter_name='default')
    _metrics = {}
    for key, value in metric.items():
        try:
            value = float(value)
            _metrics[key] = value
        except:
            pass
    logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
```
This block has a few issues:

- Inefficiency: Calculating metrics on every step (`step % 1 == 0`) can significantly slow down training. It's better to do this periodically, for example every 50 or 100 steps, as was done previously (`if step % 50 == 0 and step > 0:`).
- Redundant condition: The `if step % 1 == 0:` condition is always true and can be removed.
- Dead code: The `_metrics` dictionary is created and populated but never used. The log message uses the original `metric` dictionary. This block of code can be removed.
- Bare except: The `except:` on line 86 is a bare except, which is bad practice as it can hide unexpected errors. It should catch specific exceptions like `ValueError` or `TypeError`.
Consider refactoring this to calculate metrics periodically and removing the unused code for better performance and readability.
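A minimal sketch of the bare-except fix suggested above, assuming the metric dict can mix numeric and non-numeric values (the helper name is illustrative, not from the PR): only conversion failures are skipped, while unexpected errors still propagate.

```python
def collect_numeric_metrics(metric):
    """Convert metric values to float, skipping only values that cannot be converted."""
    _metrics = {}
    for key, value in metric.items():
        try:
            _metrics[key] = float(value)
        except (ValueError, TypeError):
            # specific exceptions instead of a bare `except:`, so bugs
            # such as AttributeError or KeyError are not silently hidden
            continue
    return _metrics
```

For example, `{'loss': '0.42', 'acc': 0.91, 'note': 'n/a'}` yields `{'loss': 0.42, 'acc': 0.91}`.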
```diff
-if step % 1 == 0:
-    metric = model.calculate_metric(is_training=True, adapter_name='default')
-    _metrics = {}
-    for key, value in metric.items():
-        try:
-            value = float(value)
-            _metrics[key] = value
-        except:
-            pass
-    logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
+if step > 0 and step % 50 == 0:
+    metric = model.calculate_metric(is_training=True, adapter_name='default')
+    logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
```
```diff
@@ -1,5 +1,6 @@
 # Copyright (c) ModelScope Contributors. All rights reserved.
 import contextlib
+import os
```
Add a new test file `test_sequence_parallel_single_attention.py` to verify the correctness of the sequence parallel attention implementation. The test includes a distributed setup using torch.distributed and compares outputs between sequence parallel and local attention modes. Also adds an empty `__init__.py` to the transformers test directory for proper module imports.
- Add `_enable_strict_determinism` helper to disable TF32 and enable deterministic algorithms
- Add `_to_local` helper to unwrap DTensors for gradient comparison
- Update test to use full world size for sequence parallel group and increase head count
- Switch to float32 dtype for stricter numerical alignment
- Improve gradient comparison by cloning and unwrapping tensors
…free logic

- Replace HfConfigFactory utility with direct get_config_attr function
- Move get_llm_model to shared transformers utilities
- Remove padding_free parameter and related conditional logic
- Simplify attention mask construction for padded tokens
- Update SequenceParallelConfig to drop padding_free field
- Add detection of packed batches via `_is_packed_position_ids` heuristic
- Raise RuntimeError when the SDPA backend is used with packed batches, as SDPA lacks native packed/varlen support
- Build a 2D attention_mask for padded sequences to ensure correct FlashAttention2 unpad behavior
- Avoid unnecessary 4D causal mask generation for packed/padding-free batches
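The idea behind a packed-batch heuristic like `_is_packed_position_ids` can be sketched on plain lists (the real helper presumably operates on tensors; this list version is an assumption for illustration): position ids that restart at 0 after the first token signal that several sequences were concatenated into one sample.

```python
def is_packed_position_ids(position_ids):
    # A packed sample concatenates several sequences, so its position ids
    # restart from 0 at each internal boundary, e.g. [0, 1, 2, 0, 1, 0, 1].
    # A single unpacked sequence only has 0 at index 0.
    return any(pos == 0 for pos in position_ids[1:])
```

A backend that lacks varlen support (such as SDPA, per the commit above) could then refuse such batches up front instead of silently attending across boundaries.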
Introduce a new cookbook script demonstrating supervised fine-tuning with a single controller using sequence parallelism (SP) and FSDP across 4 GPUs. The example includes:

- Device mesh configuration with dp=2 and fsdp=2 dimensions
- PackingDataset setup with self-cognition data and left truncation
- Training loop with LoRA adapter, AdamW optimizer, and periodic evaluation
- Checkpoint saving based on loss improvement
- Validation of FSDP + SP input slicing across multiple GPUs
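The "SP input slicing" validated by the cookbook can be sketched as follows; this is a hypothetical list-based helper (name and shapes assumed, not taken from the PR), showing how each sequence-parallel rank receives one contiguous shard of the token sequence.

```python
def sp_slice(tokens, sp_rank, sp_world_size):
    # Each sequence-parallel rank holds one contiguous shard of the sequence;
    # the sequence length must divide evenly across the SP group.
    assert len(tokens) % sp_world_size == 0, 'sequence length must divide sp_world_size'
    shard = len(tokens) // sp_world_size
    return tokens[sp_rank * shard:(sp_rank + 1) * shard]
```

With 8 tokens and an SP group of 4, rank 1 would hold tokens 2 and 3.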
9b9d4bc to 2086e87
…formers cookbook

- Add new single_controller_sp.py example demonstrating FSDP + SP validation over 4 GPUs
- Move legacy single_controller_sp.py to transformers/sp_fsdp_dense.py
- Add shell script sp_fsdp_dense.sh for running the example
- Update imports and structure to use twinkle framework components
…irectory

Relocate test_sequence_parallel_single_attention.py from tests/transformers/ to tests/sequence_parallel/ to better organize test files by feature area. This improves maintainability and aligns with the project's test structure conventions.
- Add bash script header and comments to `sp_fsdp_dense.sh` explaining how to enable sequence parallelism with ulysses_size
- Remove duplicate `import os` statement in transformers.py for cleaner code
- Fix minor formatting by removing an extra blank line in transformers_utils.py
- Switch from `ray` to `local` mode for twinkle initialization
- Add evaluation function with separate dataset slice
- Increase dataset size from 100 to 500 samples
- Add cosine warmup learning rate scheduler
- Remove unused torch import and remote_group parameters
- Adjust batch size from 4 to 8 and logging frequency to every 20 steps
- Improve logging with train configs and total steps information
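The "cosine warmup learning rate scheduler" mentioned above typically has the following shape; this is a generic sketch (function name and signature are assumptions, not the cookbook's actual scheduler): linear warmup to the base learning rate, then cosine decay toward zero.

```python
import math

def cosine_warmup_lr(step, total_steps, warmup_steps, base_lr):
    # Linear warmup: ramp from base_lr/warmup_steps up to base_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay: base_lr at the end of warmup, approaching 0 at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=100`, `warmup_steps=10`, and `base_lr=0.1`, the rate ramps to 0.1 by step 10 and decays smoothly afterwards.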
Removed unnecessary imports (`math`, `os`, `SimpleNamespace`) from the sequence_parallel strategy file to clean up the codebase and improve maintainability.
No description provided.