Add claude skill for adding a new MoE model #22
Conversation
The add-new-model skill walks a developer through integrating a new MoE language model into PithTrain end to end: analyzing HuggingFace's reference implementation, writing the model file against the 5-stage DualPipeV protocol, wiring it into setup_model / apply_fsdp / test_fsdp, running the pp/ep scaling ladder from pp=1/ep=1 up to pp=2/ep=2, and (when needed) adding a checkpoint converter and an ad-hoc real-weight inference test.

Layout is a phase-gated SKILL.md entry point plus six reference docs (protocol, conventions, compile, checkpoint, testing, pitfalls) and two templates (a structural model skeleton and a DualPipeV inference harness). The reference docs are loaded on demand per phase to keep the entry point compact. Every pitfall learned from the gpt-oss bring-up - NaN-padding in grouped_mm, `.view` vs `.transpose` layout bugs, silent-zero experts from missing fill_weights branches, dynamic seq_len compile thrash - is encoded as a checkable rule.

The template files carry placeholder syntax (`<Model>Model`, `<model>`) that is not valid Python, so .pre-commit-config.yaml excludes .claude/skills/*/templates/ from every hook.
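One of the encoded pitfalls, the `.view` vs `.transpose` layout bug, can be made concrete with a minimal numpy sketch (numpy is used here only to keep the example dependency-free; torch's `.view`/`.transpose` behave analogously):

```python
import numpy as np

# A 2x3 "weight" whose entries make position mixups visible.
w = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]

# transpose actually swaps the axes: element (i, j) moves to (j, i).
wt = w.transpose()                  # [[0, 3], [1, 4], [2, 5]]

# reshape (torch .view) only reinterprets the flat buffer in memory order,
# so the values land in the WRONG positions for a transpose.
wv = w.reshape(3, 2)                # [[0, 1], [2, 3], [4, 5]]

assert wt.shape == wv.shape == (3, 2)
assert not np.array_equal(wt, wv)   # same shape, silently different data
```

Because both results have the same shape, the bug produces no error, only wrong numbers, which is why the skill encodes it as a checkable rule.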
Code Review
This pull request introduces a comprehensive workflow and documentation for adding new MoE language models to PithTrain, including templates and reference guides for the 5-stage protocol, compilation, and testing. Feedback includes aligning class naming conventions in the skill guide with the templates, ensuring slicing logic for fused projections is verified against reference implementations, and improving the router template's robustness by dynamically handling potential bias terms.
> `decoder_layer_backward`, and runs the prolog backward via
> `run_backward(record.outs, dx)`.
>
> **Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>"`).
The gate condition mentions `<Model>`, but the templates (`model_skeleton.py` and `inference_test.py`) consistently use the naming convention `<Model>Model`. It's better to align the gate check with the actual class name used in the templates to avoid confusion.
Suggested change:

```diff
-**Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>"`).
+**Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>Model"`).
```
```python
gate_up = F.grouped_mm(x, self.gate_up_proj.transpose(-2, -1), offs=offs)
gate = gate_up[:, ::2]  # interleaved
up = gate_up[:, 1::2]
```
The example assumes an interleaved layout for fused projections (using `::2` slicing). While common, some models use a concatenated layout (e.g., `[:inter]` and `[inter:]`). It would be helpful to explicitly mention that the slicing logic must be verified against the HuggingFace reference implementation.
Suggested change:

```diff
 gate_up = F.grouped_mm(x, self.gate_up_proj.transpose(-2, -1), offs=offs)
-gate = gate_up[:, ::2]  # interleaved
+# Verify if HF uses interleaved (::2) or concatenated ([:inter]) layout
+gate = gate_up[:, ::2]
 up = gate_up[:, 1::2]
```
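To see why the verification matters, a dependency-free numpy sketch (with a hypothetical intermediate size `inter`) shows that the two conventions select different columns from the same fused buffer:

```python
import numpy as np

inter = 4                          # hypothetical intermediate size
fused = np.arange(2 * inter)       # stand-in for one row of gate_up

# Interleaved layout: columns alternate (g0, u0, g1, u1, ...).
gate_i, up_i = fused[::2], fused[1::2]       # gate_i == [0, 2, 4, 6]

# Concatenated layout: first `inter` columns are gate, the rest are up.
gate_c, up_c = fused[:inter], fused[inter:]  # gate_c == [0, 1, 2, 3]

# Same shapes, different elements: the wrong choice runs without error.
assert gate_i.shape == gate_c.shape
assert not np.array_equal(gate_i, gate_c)
```

Like the `.view`/`.transpose` pitfall, picking the wrong convention produces correctly shaped but numerically wrong activations, so only a check against the HF reference catches it.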
```python
# topk_weight, topk_idx = torch.topk(scores, k=..., dim=-1, sorted=False)
# if self.norm_topk_prob:
#     topk_weight = topk_weight / topk_weight.sum(dim=-1, keepdim=True)
logits = F.linear(hidden_states, self.weight, None)  # TODO_HF: add bias if HF has it
```
The `F.linear` call hardcodes `None` for the bias. If the router/gate in the HuggingFace reference implementation includes a bias term, it will be silently ignored here. Using `getattr` allows the implementation to be more generic and robust to different model architectures.
Suggested change:

```diff
-logits = F.linear(hidden_states, self.weight, None)  # TODO_HF: add bias if HF has it
+logits = F.linear(hidden_states, self.weight, getattr(self, "bias", None))
```
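The `getattr` pattern can be sketched without torch; below, `linear` is a numpy stand-in for `F.linear` semantics and `Gate` is a hypothetical router class (neither name is from the templates):

```python
import numpy as np

def linear(x, weight, bias=None):
    """numpy stand-in for F.linear: x @ weight.T, plus bias when present."""
    out = x @ weight.T
    return out if bias is None else out + bias

class Gate:
    """Hypothetical router gate; only some architectures define a bias."""
    def __init__(self, weight, bias=None):
        self.weight = weight
        if bias is not None:
            self.bias = bias      # attribute exists only when HF defines one

hidden = np.ones((1, 2))
plain = Gate(weight=np.ones((3, 2)))                        # no .bias attribute
biased = Gate(weight=np.ones((3, 2)), bias=np.full(3, 0.5))

# getattr(obj, "bias", None) degrades gracefully when the attribute is absent,
# and picks the bias up automatically when it is present.
assert np.allclose(linear(hidden, plain.weight, getattr(plain, "bias", None)), 2.0)
assert np.allclose(linear(hidden, biased.weight, getattr(biased, "bias", None)), 2.5)
```

The same call site then serves bias-free and biased gates without a per-model branch, which is the robustness the review comment asks for.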