Add claude skill for adding a new MoE model #22
Conversation
The add-new-model skill walks a developer through integrating a new MoE language model into PithTrain end to end: analyzing HuggingFace's reference implementation, writing the model file against the 5-stage DualPipeV protocol, wiring it into setup_model / apply_fsdp / test_fsdp, running the pp/ep scaling ladder from pp=1/ep=1 up to pp=2/ep=2, and (when needed) adding a checkpoint converter and an ad-hoc real-weight inference test.

Layout is a phase-gated SKILL.md entry point plus six reference docs (protocol, conventions, compile, checkpoint, testing, pitfalls) and two templates (a structural model skeleton and a DualPipeV inference harness). The reference docs are loaded on demand per phase to keep the entry point compact. Every pitfall learned from the gpt-oss bring-up - NaN-padding in grouped_mm, `.view` vs `.transpose` layout bugs, silent-zero experts from missing fill_weights branches, dynamic seq_len compile thrash - is encoded as a checkable rule.

The template files carry placeholder syntax (`<Model>Model`, `<model>`) that is not valid Python, so .pre-commit-config.yaml excludes .claude/skills/*/templates/ from every hook.
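One of the encoded pitfalls, the `.view` vs `.transpose` layout bug, can be made concrete with a minimal numpy sketch (numpy is used here only to keep the example dependency-free; torch's `.view`/`.transpose` behave analogously):

```python
import numpy as np

# A 2x3 "weight" whose entries make position mixups visible.
w = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]

# transpose actually swaps the axes: element (i, j) moves to (j, i).
wt = w.transpose()                  # [[0, 3], [1, 4], [2, 5]]

# reshape (torch .view) only reinterprets the flat buffer in memory order,
# so the values land in the WRONG positions for a transpose.
wv = w.reshape(3, 2)                # [[0, 1], [2, 3], [4, 5]]

assert wt.shape == wv.shape == (3, 2)
assert not np.array_equal(wt, wv)   # same shape, silently different data
```

Because both results have the same shape, the bug produces no error, only wrong numbers, which is why the skill encodes it as a checkable rule.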
Code Review
This pull request introduces a comprehensive workflow and documentation for adding new MoE language models to PithTrain, including templates and reference guides for the 5-stage protocol, compilation, and testing. Feedback includes aligning class naming conventions in the skill guide with the templates, ensuring slicing logic for fused projections is verified against reference implementations, and improving the router template's robustness by dynamically handling potential bias terms.
> `decoder_layer_backward`, and runs the prolog backward via
> `run_backward(record.outs, dx)`.
>
> **Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>"`).
The gate condition mentions `<Model>`, but the templates (`model_skeleton.py` and `inference_test.py`) consistently use the naming convention `<Model>Model`. It's better to align the gate check with the actual class name used in the templates to avoid confusion.
Suggested change:

```diff
-**Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>"`).
+**Gate:** file imports cleanly (`python -c "from pithtrain.models.<model> import <Model>Model"`).
```
```python
gate_up = F.grouped_mm(x, self.gate_up_proj.transpose(-2, -1), offs=offs)
gate = gate_up[:, ::2]  # interleaved
up = gate_up[:, 1::2]
```
The example assumes an interleaved layout for fused projections (using `::2` slicing). While common, some models use a concatenated layout (e.g., `[:inter]` and `[inter:]`). It would be helpful to explicitly mention that the slicing logic must be verified against the HuggingFace reference implementation.
Suggested change:

```diff
 gate_up = F.grouped_mm(x, self.gate_up_proj.transpose(-2, -1), offs=offs)
-gate = gate_up[:, ::2]  # interleaved
+# Verify if HF uses interleaved (::2) or concatenated ([:inter]) layout
+gate = gate_up[:, ::2]
 up = gate_up[:, 1::2]
```
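To see why the verification matters, a dependency-free numpy sketch (with a hypothetical intermediate size `inter`) shows that the two conventions select different columns from the same fused buffer:

```python
import numpy as np

inter = 4                          # hypothetical intermediate size
fused = np.arange(2 * inter)       # stand-in for one row of gate_up

# Interleaved layout: columns alternate (g0, u0, g1, u1, ...).
gate_i, up_i = fused[::2], fused[1::2]       # gate_i == [0, 2, 4, 6]

# Concatenated layout: first `inter` columns are gate, the rest are up.
gate_c, up_c = fused[:inter], fused[inter:]  # gate_c == [0, 1, 2, 3]

# Same shapes, different elements: the wrong choice runs without error.
assert gate_i.shape == gate_c.shape
assert not np.array_equal(gate_i, gate_c)
```

Like the `.view`/`.transpose` pitfall, picking the wrong convention produces correctly shaped but numerically wrong activations, so only a check against the HF reference catches it.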
```python
# topk_weight, topk_idx = torch.topk(scores, k=..., dim=-1, sorted=False)
# if self.norm_topk_prob:
#     topk_weight = topk_weight / topk_weight.sum(dim=-1, keepdim=True)
logits = F.linear(hidden_states, self.weight, None)  # TODO_HF: add bias if HF has it
```
The `F.linear` call hardcodes `None` for the bias. If the router/gate in the HuggingFace reference implementation includes a bias term, it will be silently ignored here. Using `getattr` allows the implementation to be more generic and robust to different model architectures.
Suggested change:

```diff
-logits = F.linear(hidden_states, self.weight, None)  # TODO_HF: add bias if HF has it
+logits = F.linear(hidden_states, self.weight, getattr(self, "bias", None))
```
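The `getattr` pattern can be sketched without torch; below, `linear` is a numpy stand-in for `F.linear` semantics and `Gate` is a hypothetical router class (neither name is from the templates):

```python
import numpy as np

def linear(x, weight, bias=None):
    """numpy stand-in for F.linear: x @ weight.T, plus bias when present."""
    out = x @ weight.T
    return out if bias is None else out + bias

class Gate:
    """Hypothetical router gate; only some architectures define a bias."""
    def __init__(self, weight, bias=None):
        self.weight = weight
        if bias is not None:
            self.bias = bias      # attribute exists only when HF defines one

hidden = np.ones((1, 2))
plain = Gate(weight=np.ones((3, 2)))                        # no .bias attribute
biased = Gate(weight=np.ones((3, 2)), bias=np.full(3, 0.5))

# getattr(obj, "bias", None) degrades gracefully when the attribute is absent,
# and picks the bias up automatically when it is present.
assert np.allclose(linear(hidden, plain.weight, getattr(plain, "bias", None)), 2.0)
assert np.allclose(linear(hidden, biased.weight, getattr(biased, "bias", None)), 2.5)
```

The same call site then serves bias-free and biased gates without a per-model branch, which is the robustness the review comment asks for.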