
Add claude skill for adding a new MoE model#22

Merged
haok1402 merged 1 commit into mlc-ai:main from MasterJH5574:04-19-add-model-skill
Apr 19, 2026

Conversation

@MasterJH5574
Member

The add-new-model skill walks a developer through integrating a new MoE language model into PithTrain end to end: analyzing HuggingFace's reference implementation, writing the model file against the 5-stage DualPipeV protocol, wiring it into setup_model / apply_fsdp / test_fsdp, running the pp/ep scaling ladder from pp=1/ep=1 up to pp=2/ep=2, and (when needed) adding a checkpoint converter and an ad-hoc real-weight inference test.

Layout is a phase-gated SKILL.md entry point plus six reference docs (protocol, conventions, compile, checkpoint, testing, pitfalls) and two templates (a structural model skeleton and a DualPipeV inference harness). The reference docs are loaded on demand per phase to keep the entry point compact. Every pitfall learned from the gpt-oss bring-up - NaN-padding in grouped_mm, .view vs .transpose layout bugs, silent-zero experts from missing fill_weights branches, dynamic seq_len compile thrash - is encoded as a checkable rule.
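The `.view`-vs-`.transpose` pitfall in particular is easy to trip over. A minimal framework-free sketch (toy 2x3 row-major buffer, no torch required) of why reinterpreting memory is not a transpose:

```python
# A 2x3 "weight" stored row-major, the way checkpoints lay tensors out.
w = [[1, 2, 3],
     [4, 5, 6]]
flat = [x for row in w for x in row]          # [1, 2, 3, 4, 5, 6]

# A real transpose maps element (i, j) -> (j, i).
transpose = [[w[i][j] for i in range(2)] for j in range(3)]

# A ".view(3, 2)"-style reinterpretation just rechunks the flat buffer.
view = [flat[i * 2:(i + 1) * 2] for i in range(3)]

print(transpose)  # [[1, 4], [2, 5], [3, 6]]
print(view)       # [[1, 2], [3, 4], [5, 6]] -- same shape, scrambled data
```

Both results have shape (3, 2), so the bug passes shape checks and only shows up as wrong numerics downstream, which is why the skill encodes it as an explicit rule.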

The template files carry placeholder syntax (`<Model>Model`, `<model>`) that is not valid Python, so .pre-commit-config.yaml excludes .claude/skills/*/templates/ from every hook.


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a comprehensive workflow and documentation for adding new MoE language models to PithTrain, including templates and reference guides for the 5-stage protocol, compilation, and testing. Feedback includes aligning class naming conventions in the skill guide with the templates, ensuring slicing logic for fused projections is verified against reference implementations, and improving the router template's robustness by dynamically handling potential bias terms.

`decoder_layer_backward`, and runs the prolog backward via
`run_backward(record.outs, dx)`.

**Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>"`).

medium

The gate condition mentions <Model>, but the templates (model_skeleton.py and inference_test.py) consistently use the naming convention <Model>Model. It's better to align the gate check with the actual class name used in the templates to avoid confusion.

Suggested change

Original:
**Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>"`).

Suggested:
**Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>Model"`).

Comment on lines +93 to +95
gate_up = F.grouped_mm(x, self.gate_up_proj.transpose(-2, -1), offs=offs)
gate = gate_up[:, ::2] # interleaved
up = gate_up[:, 1::2]

medium

The example assumes an interleaved layout for fused projections (using ::2 slicing). While common, some models use a concatenated layout (e.g., [:inter] and [inter:]). It would be helpful to explicitly mention that the slicing logic must be verified against the HuggingFace reference implementation.

Suggested change

Original:
gate_up = F.grouped_mm(x, self.gate_up_proj.transpose(-2, -1), offs=offs)
gate = gate_up[:, ::2] # interleaved
up = gate_up[:, 1::2]

Suggested:
gate_up = F.grouped_mm(x, self.gate_up_proj.transpose(-2, -1), offs=offs)
# Verify if HF uses interleaved (::2) or concatenated ([:inter]) layout
gate = gate_up[:, ::2]
up = gate_up[:, 1::2]
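For illustration, a framework-free sketch (toy values, `inter = 3`) of how the two fused layouts differ and why the slicing convention must match the checkpoint:

```python
# One token's fused gate_up output, intermediate size inter = 3.
interleaved  = [10, 20, 11, 21, 12, 22]   # [g0, u0, g1, u1, g2, u2]
concatenated = [10, 11, 12, 20, 21, 22]   # [g0, g1, g2, u0, u1, u2]
inter = 3

# Interleaved convention: stride-2 slicing.
gate_i, up_i = interleaved[::2], interleaved[1::2]

# Concatenated convention: split at inter.
gate_c, up_c = concatenated[:inter], concatenated[inter:]

# Each slicing recovers gate/up correctly -- but only for its own layout.
assert gate_i == gate_c == [10, 11, 12]
assert up_i == up_c == [20, 21, 22]
```

Applying `::2` slicing to a concatenated checkpoint (or vice versa) would silently mix gate and up channels, which is exactly the class of bug the reviewer asks the template to flag.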

# topk_weight, topk_idx = torch.topk(scores, k=..., dim=-1, sorted=False)
# if self.norm_topk_prob:
# topk_weight = topk_weight / topk_weight.sum(dim=-1, keepdim=True)
logits = F.linear(hidden_states, self.weight, None) # TODO_HF: add bias if HF has it
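A minimal pure-Python sketch of what the commented-out top-k lines compute, assuming the HF-style `norm_topk_prob` renormalization (toy scores, `k = 2`):

```python
# Router scores for one token over 3 experts, top-k with k = 2.
scores = [0.1, 0.6, 0.3]
k = 2

# Pick the k highest-scoring expert indices and their weights.
topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
topk_weight = [scores[i] for i in topk]

# norm_topk_prob: renormalize so the selected weights sum to 1.
total = sum(topk_weight)
topk_weight = [w / total for w in topk_weight]

print(topk)  # [1, 2]
# topk_weight now sums to 1.0 (0.6/0.9 and 0.3/0.9)
```

In the real template these are tensor ops (`torch.topk` plus a `sum`/divide along the last dim), but the normalization logic is the same.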

medium

The F.linear call hardcodes None for the bias. If the router/gate in the HuggingFace reference implementation includes a bias term, it will be silently ignored here. Using getattr allows the implementation to be more generic and robust to different model architectures.

logits = F.linear(hidden_states, self.weight, getattr(self, "bias", None))
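A tiny sketch (hypothetical `Router` stub with string stand-ins for tensors) of why the `getattr` fallback keeps one call path working whether or not the module defines a bias:

```python
class Router:
    """Hypothetical router stub; a real module would hold tensors."""
    def __init__(self, with_bias: bool):
        self.weight = "W"
        if with_bias:
            self.bias = "b"

# getattr falls back to None when the attribute is absent, so the same
# F.linear(..., getattr(self, "bias", None)) call covers both variants.
assert getattr(Router(False), "bias", None) is None
assert getattr(Router(True), "bias", None) == "b"
```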

@haok1402 haok1402 merged commit e80f25c into mlc-ai:main Apr 19, 2026
1 check passed
