
fix native_fsdp+moe #33

Merged
kevssim merged 6 commits into dev from dev-wkw
Feb 6, 2026

Conversation

@kevssim
Collaborator

@kevssim kevssim commented Feb 5, 2026

No description provided.

@kevssim kevssim changed the base branch from main to dev February 5, 2026 01:32
@gemini-code-assist
Contributor

Summary of Changes

Hello @kevssim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the twinkle framework's capabilities for distributed training of large language models. It focuses on robust integration of FSDP and MoE, particularly for Qwen3 architectures, by introducing new parallelism strategies and specialized model handling. The changes also bring a new client code generation system, advanced data handling features for datasets and dataloaders, and an expanded set of loss functions and metrics. A new kernel module is added to enhance performance, while development workflows are improved through updated pre-commit hooks and new project documentation.

Highlights

  • Enhanced Distributed Training for FSDP and MoE: Introduced comprehensive support and fixes for Fully Sharded Data Parallel (FSDP) and Mixture of Experts (MoE) configurations, particularly for Qwen3 models. This includes new strategies for native FSDP2 with expert parallel compatibility, and specialized handling for Megatron-Core models.
  • New Client Generation Tooling: Added a new client code generator that streamlines the creation of client wrappers for processors, models, and samplers, improving the usability and integration of twinkle components.
  • Refined Data Handling and Dataloaders: Implemented new data structures for input features, messages, and trajectories. Significant enhancements to dataloaders include retry mechanisms and device mesh-aware sampling, alongside new iterable, lazy, and packing dataset types for more efficient data processing.
  • Expanded Loss Functions and Metrics: Introduced a suite of new loss functions, including specialized ones for Megatron's vocab-parallel cross-entropy and various GRPO (Group Relative Policy Optimization) variants. New metric classes for accuracy, loss, and training progress are also added.
  • Integrated Kernel Module for Performance: A new Twinkle Kernel Module is integrated, offering both layer-level and function-level kernel replacement capabilities, with support for HuggingFace kernels to optimize model performance.
  • Development Tooling and Documentation Updates: Updated pre-commit hooks to newer versions of flake8, isort, yapf, and pre-commit-hooks. New documentation includes guidelines for AI coding agents and a project roadmap.


Changelog
  • .github/copilot-instructions.md
    • Added new AI coding agent guidelines.
  • .gitignore
    • Added new entries for src/twinkle_client subdirectories, .locks/, .qoder, test_cookbook/, and /test*.py.
    • Removed transformers/.
  • .pre-commit-config.yaml
    • Updated flake8 from 4.0.0 to 7.3.0.
    • Updated isort from 4.3.21 to 7.0.0.
    • Updated yapf from v0.30.0 to v0.43.0.
    • Updated pre-commit-hooks from v3.1.0 to v6.0.0.
    • Removed fix-encoding-pragma hook.
  • ROADMAP.md
    • Added a new project roadmap document.
  • client_tools/client_generator.py
    • Added a new tool for generating client wrappers for processors, models, and samplers.
  • cookbook/client/tinker/megatron/lora.py
    • Added a new example for Megatron LoRA training.
  • cookbook/client/tinker/megatron/server.py
    • Added a new server configuration for Megatron.
  • cookbook/client/tinker/megatron/server_config.yaml
    • Added a new server configuration file for Megatron.
  • cookbook/client/tinker/transformer/lora.py
    • Added a new example for Transformer LoRA training.
  • cookbook/client/tinker/transformer/sample.py
    • Added a new example for Transformer sampling.
  • cookbook/client/tinker/transformer/self_congnition.py
    • Added a new example for Transformer self-cognition training.
  • cookbook/client/tinker/transformer/server.py
    • Added a new server configuration for Transformer.
  • cookbook/client/tinker/transformer/server_config.yaml
    • Added a new server configuration file for Transformer.
  • cookbook/client/twinkle/megatron/lora.py
    • Added a new example for Twinkle Megatron LoRA training.
  • cookbook/client/twinkle/megatron/server.py
    • Added a new server configuration for Twinkle Megatron.
  • cookbook/client/twinkle/megatron/server_config.yaml
    • Added a new server configuration file for Twinkle Megatron.
  • cookbook/client/twinkle/transformer/grpo_lora.py
    • Added a new example for Twinkle Transformer GRPO LoRA training.
  • cookbook/client/twinkle/transformer/lora.py
    • Added a new example for Twinkle Transformer LoRA training.
  • cookbook/client/twinkle/transformer/server.py
    • Added a new server configuration for Twinkle Transformer.
  • cookbook/client/twinkle/transformer/server_config.yaml
    • Added a new server configuration file for Twinkle Transformer.
  • cookbook/grpo/lora.py
    • Added a new GRPO training cookbook with LoRA.
  • cookbook/grpo/lora_gpu.py
    • Added a new GRPO LoRA training script for GPU.
  • cookbook/grpo/lora_npu.py
    • Added a new GRPO LoRA training script for NPU.
  • cookbook/megatron/lora.py
    • Added a new Megatron-Core LoRA training example.
  • cookbook/megatron/moe_lora.py
    • Added a new Megatron-Core MoE LoRA training example.
  • cookbook/megatron/vlm_lora.py
    • Added a new Megatron-Core VLM LoRA training example.
  • cookbook/remote/tinker/ascend/lora.py
    • Added a new remote Tinker Ascend LoRA example.
  • cookbook/remote/tinker/ascend/server.py
    • Added a new remote Tinker Ascend server configuration.
  • cookbook/remote/tinker/ascend/server_config.yaml
    • Added a new remote Tinker Ascend server configuration file.
  • cookbook/remote/tinker/lora.py
    • Added a new remote Tinker LoRA example.
  • cookbook/remote/tinker/server.py
    • Added a new remote Tinker server configuration.
  • cookbook/remote/tinker/server_config.yaml
    • Added a new remote Tinker server configuration file.
  • cookbook/remote/twinkle/lora.py
    • Added a new remote Twinkle LoRA example.
  • cookbook/remote/twinkle/server.py
    • Added a new remote Twinkle server configuration.
  • cookbook/remote/twinkle/server_config.yaml
    • Added a new remote Twinkle server configuration file.
  • cookbook/sampler_demo.py
    • Added a new sampler demo.
  • cookbook/sft/ep_fsdp_qwen3_moe.py
    • Added a new SFT example for Qwen3-30B EP+FSDP2 training.
  • cookbook/sft/fsdp_qwen3_moe.py
    • Added a new SFT example for Qwen3-30B FSDP training.
  • cookbook/sft/full_sft.py
    • Added a new full SFT example.
  • cookbook/sft/local_dataset.py
    • Added a new SFT example for local dataset.
  • cookbook/sft/lora_npu.py
    • Added a new SFT example for LoRA NPU training.
  • cookbook/sft/multi_lora.py
    • Added a new SFT example for multi-LoRA.
  • cookbook/sft/single_controller.py
    • Added a new SFT example for single controller.
  • cookbook/sft/single_controller_sp.py
    • Added a new SFT example for single controller with sequence parallelism.
  • cookbook/sft/single_program.py
    • Added a new SFT example for single program.
  • cookbook/sft/single_program_full.py
    • Added a new SFT example for full single program.
  • cookbook/sft/single_program_megatron.py
    • Added a new SFT example for single program Megatron.
  • cookbook/sft/streaming_dataset.py
    • Added a new SFT example for streaming dataset.
  • cookbook/sft/vlm_lora.py
    • Added a new SFT example for VLM LoRA training.
  • examples/expert_parallel/train_qwen3_30b_ep_fsdp_demo.py
    • Added a new example for Qwen3-30B EP+FSDP2 training.
  • pyproject.toml
    • Added twinkle_client to poetry packages.
  • src/twinkle/__init__.py
    • Modified _import_structure to include twinkle.model.transformers.moe.apply_expert_parallel.
  • src/twinkle/data_format/__init__.py
    • Added Tool, ToolCall, Message, Trajectory, InputFeature imports.
  • src/twinkle/data_format/input_feature.py
    • Added a new file defining InputFeature TypedDict.
  • src/twinkle/data_format/message.py
    • Added a new file defining ToolCall, Tool, Message TypedDicts.
  • src/twinkle/data_format/trajectory.py
    • Added a new file defining Trajectory TypedDict.
  • src/twinkle/dataloader/__init__.py
    • Added DeviceMeshIterableFetcher and RetrySampler imports.
  • src/twinkle/dataloader/dataloader.py
    • Modified DataLoader to use DeviceMeshIterableFetcher and RetrySampler.
    • Updated _seed_worker and _lazy_init_dataloader methods.
  • src/twinkle/dataloader/device_mesh_fetcher.py
    • Added a new file implementing DeviceMeshIterableFetcher.
  • src/twinkle/dataloader/device_mesh_sampler.py
    • Added a new file implementing DeviceMeshSampler.
  • src/twinkle/dataloader/retry_sampler.py
    • Added a new file implementing RetrySampler.
  • src/twinkle/dataset/__init__.py
    • Added LazyDataset, IterableDataset, IterablePackingDataset, PackingDataset imports.
  • src/twinkle/dataset/base.py
    • Modified Dataset class to include DatasetMeta, _load_dataset, map, filter, add_dataset, mix_dataset methods.
    • Added set_template, encode, check methods.
  • src/twinkle/dataset/iterable_dataset.py
    • Added a new file implementing IterableDataset.
  • src/twinkle/dataset/iterable_packing_dataset.py
    • Added a new file implementing IterablePackingDataset.
  • src/twinkle/dataset/lazy_dataset.py
    • Added a new file implementing LazyDataset.
  • src/twinkle/dataset/packing_dataset.py
    • Added a new file implementing PackingDataset.
  • src/twinkle/gym/__init__.py
    • Added Gym import.
  • src/twinkle/gym/base.py
    • Added a new file defining Gym class.
  • src/twinkle/hub/__init__.py
    • Added MSHub, HFHub, HubOperation imports.
  • src/twinkle/hub/hub.py
    • Added a new file implementing HubOperation, MSHub, HFHub classes for model/dataset hub interactions.
  • src/twinkle/infra/__init__.py
    • Modified initialize to handle groups and global_device_mesh.
    • Updated remote_class and remote_function to support _lazy_collect and sync.
    • Updated get_device_placement for better visualization.
  • src/twinkle/infra/_ray/__init__.py
    • Added RayHelper and ResourceManager imports.
  • src/twinkle/infra/_ray/ray_helper.py
    • Added a new file implementing RayHelper for Ray cluster management.
  • src/twinkle/infra/_ray/resource_manager.py
    • Added a new file implementing ResourceManager for managing Ray resources.
  • src/twinkle/kernel/README.md
    • Added a new README for the Twinkle Kernel Module.
  • src/twinkle/kernel/__init__.py
    • Added kernelize_model, register_layer_kernel, register_function_kernel, register_external_layer, register_kernels functions.
  • src/twinkle/kernel/base.py
    • Added a new file defining base types and utility functions for the kernel module.
  • src/twinkle/kernel/function.py
    • Added a new file implementing function-level kernel replacement.
  • src/twinkle/kernel/layer.py
    • Added a new file implementing layer-level kernel replacement.
  • src/twinkle/kernel/registry.py
    • Added a new file implementing kernel registries.
  • src/twinkle/loss/__init__.py
    • Added Loss, ChunkedCrossEntropyLoss, VocabParallelCrossEntropyLoss, GRPOLoss, GSPOLoss, SAPOLoss, CISPOLoss, BNPOLoss, DRGRPOLoss imports.
    • Updated torch_loss_mapping.
  • src/twinkle/loss/base.py
    • Added a new file defining Loss base class.
  • src/twinkle/loss/chunked_cross_entropy.py
    • Added a new file implementing ChunkedCrossEntropyLoss.
  • src/twinkle/loss/cross_entropy.py
    • Added a new file implementing CrossEntropyLoss.
  • src/twinkle/loss/grpo.py
    • Added a new file implementing GRPO-related loss functions.
  • src/twinkle/loss/mse.py
    • Added a new file implementing MSELoss.
  • src/twinkle/loss/vocab_parallel_cross_entropy.py
    • Added a new file implementing VocabParallelCrossEntropyLoss.
  • src/twinkle/loss_scale/__init__.py
    • Added LossScale import.
  • src/twinkle/loss_scale/base.py
    • Added a new file defining LossScale base class.
  • src/twinkle/metric/__init__.py
    • Added Metric, Accuracy, LossMetric, TrainMetric imports.
  • src/twinkle/metric/accuracy.py
    • Added a new file implementing Accuracy metric.
  • src/twinkle/metric/base.py
    • Added a new file defining Metric base class.
  • src/twinkle/metric/loss.py
    • Added a new file implementing LossMetric.
  • src/twinkle/metric/train_metric.py
    • Added a new file implementing TrainMetric.
  • src/twinkle/model/__init__.py
    • Modified _import_structure to include MultiLoraTransformersModel, MegatronModel, MultiLoraMegatronModel.
  • src/twinkle/model/base.py
    • Added a new file defining TwinkleModel abstract base class.
  • src/twinkle/model/megatron/__init__.py
    • Added MegatronStrategy, MegatronModel, MultiLoraMegatronModel imports.
  • src/twinkle/model/megatron/args.py
    • Added a new file defining TwinkleMegatronArgs for Megatron model configuration.
  • src/twinkle/model/megatron/megatron.py
    • Added a new file implementing MegatronModel for Megatron-Core integration.
  • src/twinkle/model/megatron/model/__init__.py
    • Added gpts, mm_gpts, MegatronModelType, MegatronModelMeta, get_megatron_model_meta, register_megatron_model, GPTBridge imports.
  • src/twinkle/model/megatron/model/constant.py
    • Added a new file defining LLMModelType, MLLMModelType, ModelType, LLMMegatronModelType, MLLMMegatronModelType, MegatronModelType constants.
  • src/twinkle/model/megatron/model/gpt_bridge.py
    • Added a new file implementing GPTBridge for converting between HuggingFace and Megatron model weights.
  • src/twinkle/model/megatron/model/gpt_model.py
    • Added a new file implementing GPTModel for Megatron-Core.
  • src/twinkle/model/megatron/model/gpts/__init__.py
    • Added register_megatron_model for GPT models.
  • src/twinkle/model/megatron/model/mm_gpt_model.py
    • Added a new file implementing MultimodalGPTModel.
  • src/twinkle/model/megatron/model/mm_gpts/__init__.py
    • Added qwen, qwen3_vl, utils imports.
  • src/twinkle/model/megatron/model/mm_gpts/qwen.py
    • Added a new file implementing Qwen2_5VL_Vit, Qwen2_5VLBridge, Qwen2VL_Vit for Qwen2/2.5-VL models.
  • src/twinkle/model/megatron/model/mm_gpts/qwen3_vl.py
    • Added a new file implementing Qwen3Omni_Vit, Qwen3VLTransformerBlock, Qwen3VLGPTModel, Qwen3OmniBridge, Qwen3VL_Vit for Qwen3-VL models.
  • src/twinkle/model/megatron/model/mm_gpts/utils.py
    • Added a new file implementing patch_hf_initialize_weight, patch_device_map_meta, HuggingFaceModule utilities.
  • src/twinkle/model/megatron/model/register.py
    • Added a new file implementing MegatronModelMeta, register_megatron_model, get_megatron_model_meta for model registration.
  • src/twinkle/model/megatron/model/rope.py
    • Added a new file implementing RoPE-related utilities for Megatron.
  • src/twinkle/model/megatron/multi_lora_megatron.py
    • Added a new file implementing MultiLoraMegatronModel for multi-LoRA support in Megatron.
  • src/twinkle/model/megatron/strategy/__init__.py
    • Added MegatronStrategy import.
  • src/twinkle/model/megatron/strategy/megatron.py
    • Added a new file implementing MegatronStrategy for Megatron-Core training.
  • src/twinkle/model/megatron/tuners/__init__.py
    • Added LoraParallelLinear, dispatch_megatron imports.
  • src/twinkle/model/megatron/tuners/lora.py
    • Added a new file implementing LoraParallelLinear for Megatron-compatible LoRA.
  • src/twinkle/model/megatron/tuners/utils.py
    • Added a new file implementing utility functions for Megatron-Core tuners.
  • src/twinkle/model/megatron/utils/__init__.py
    • Added split_cp_inputs, convert_hf_config imports.
  • src/twinkle/model/megatron/utils/config.py
    • Added a new file implementing config_mapping and convert_hf_config for HuggingFace to Megatron config conversion.
  • src/twinkle/model/megatron/utils/utils.py
    • Added a new file implementing split_cp_inputs utility.
  • src/twinkle/model/multi_lora.py
    • Added a new file implementing MultiLora for managing multiple LoRA adapters.
  • src/twinkle/model/transformers/__init__.py
    • Added MultiLoraTransformersModel import.
  • src/twinkle/model/transformers/moe/__init__.py
    • Added apply_expert_parallel import.
  • src/twinkle/model/transformers/moe/expert_parallel.py
    • Added a new file implementing apply_expert_parallel for Transformer MoE.
  • src/twinkle/model/transformers/multi_lora_transformers.py
    • Added a new file implementing MultiLoraTransformersModel for multi-LoRA support in Transformers.
  • src/twinkle/model/transformers/strategy/__init__.py
    • Added AccelerateStrategy, NativeFSDPStrategy imports.
  • src/twinkle/model/transformers/strategy/accelerate.py
    • Added a new file implementing AccelerateStrategy.
  • src/twinkle/model/transformers/strategy/native_fsdp.py
    • Added a new file implementing NativeFSDPStrategy for FSDP2.
  • tests/infra/__init__.py
    • Renamed from src/twinkle/infra/ray/__init__.py.
  • tests/kernel/__init__.py
    • Renamed from src/twinkle/plugin/__init__.py.
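Among the dataloader changes listed above, the retry mechanism is the easiest to sketch in isolation. The helper below is hypothetical and only illustrates the retry idea; the actual RetrySampler presumably operates at the sampler level rather than wrapping a fetch callable:

```python
# Generic sketch of retry-on-failure for sample fetching.
# NOTE: `fetch_with_retry` is an illustrative helper, not twinkle's RetrySampler.
import logging

def fetch_with_retry(fetch, index, max_retries=3):
    """Call fetch(index), retrying transient failures up to max_retries times.

    Re-raises the last exception once the retry budget is exhausted, so a
    persistently bad sample still surfaces as an error."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(index)
        except Exception:
            logging.warning("fetch of sample %s failed (attempt %d/%d)",
                            index, attempt, max_retries)
            if attempt == max_retries:
                raise

# Usage: a flaky fetch that succeeds on its second invocation.
calls = {"n": 0}
def flaky(i):
    calls["n"] += 1
    if calls["n"] < 2:
        raise IOError("transient read error")
    return i * 2

print(fetch_with_retry(flaky, 21))  # prints 42
```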
Activity
  • The pull request was created by kevssim.
  • The title "fix fsdp+moe" indicates the primary intent of the changes.
  • Numerous new files and modifications across core modules suggest a significant development effort to enhance distributed training capabilities and framework extensibility.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix for running FSDP with MoE models, specifically allowing expert parallelism to be disabled. This is achieved by adding an enable_ep flag to the NativeFSDPStrategy. The changes also include bug fixes in the data_world_size calculation within DeviceMesh. A new example script for fine-tuning a Qwen3 MoE model with FSDP is added, likely to test this new capability.

My review has identified a couple of issues in the new example script:

  • A critical issue with the gradient accumulation logic in the training loop, which will lead to incorrect model updates.
  • A minor point of confusion in how the data processor is selected, which could be improved for clarity.

The core changes to enable FSDP without EP and the bug fixes in platform.py look correct.

I am having trouble creating individual review comments; my feedback is included below.

cookbook/sft/fsdp_qwen3_moe.py (lines 83-98): critical

The gradient accumulation logic in this training loop is incorrect. Due to an issue in the underlying twinkle library's do_grad_sync method, the first optimizer update will use gradients from N+1 steps instead of N (where N is GRAD_ACCUM_STEPS). This will lead to incorrect training behavior. While the root cause is in the library, this script is directly affected and will not function as expected. The do_grad_sync method should be fixed to ensure updates happen after exactly N steps.
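The off-by-one described here is easiest to see with the counter logic isolated from the training loop. The helper below is a hypothetical illustration of a correct N-step schedule, not twinkle's `do_grad_sync`:

```python
def optimizer_step_schedule(num_microbatches, accum_steps):
    """Return the 1-based microbatch indices at which the optimizer steps.

    Correct accumulation: step as soon as accum_steps gradients have been
    accumulated, then reset the counter. A buggy variant that only steps
    once the counter *exceeds* accum_steps fires the first update at
    microbatch accum_steps + 1, folding N+1 gradients into that update."""
    steps = []
    count = 0
    for i in range(1, num_microbatches + 1):
        count += 1           # one microbatch's gradients accumulated
        if count == accum_steps:
            steps.append(i)  # optimizer.step() + zero_grad() would go here
            count = 0
    return steps

print(optimizer_step_schedule(8, 4))  # prints [4, 8]
```

With GRAD_ACCUM_STEPS = 4, a correct schedule steps at microbatches 4 and 8; the described bug would instead step first at microbatch 5.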

cookbook/sft/fsdp_qwen3_moe.py (lines 60-62): medium

For better readability and to avoid confusion, check and modify the processor variable directly rather than re-checking the original PROCESSOR_ID:

    processor = PROCESSOR_ID
    if processor.lower() == "alpaca":
        processor = "AlpacaProcessor"

@kevssim kevssim changed the title from "fix fsdp+moe" to "fix native_fsdp+moe" Feb 6, 2026
@kevssim kevssim marked this pull request as ready for review February 6, 2026 02:57
@kevssim kevssim merged commit 5522ff1 into dev Feb 6, 2026