docs: add CLAUDE.md with codebase guide for AI assistants (#1)
Merged

Comprehensive documentation covering project structure, build system (CMake/CUDA), test suite, code conventions, and architecture notes including attention dispatch, quantization, and CUDA 13.1 features. https://claude.ai/code/session_013apimo3KbFHVuw5gpHS6ua
kekzl added a commit that referenced this pull request on Mar 21, 2026
- Add h_sample_pinned_ to GraphExecutor for truly async D2H sample copies (stack variables force a synchronous cudaMemcpyAsync fallback)
- forward() uses sample_greedy_device/sample_topk_topp_device when the pinned buffer is available, with an explicit cudaStreamSynchronize
- Batch all logprobs D2H copies in the engine.cpp decode loop: N async copies + a single sync instead of N copies with N syncs
- ensure_logits_pinned() now accepts total_floats for multi-sequence batches

Tested: #1 (RMSNorm+NVFP4 fusion), #2 (__ldcs dp4a), #4 (AccessPolicyWindow) confirmed as dead ends — reverted. See memory for details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
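The pinned-buffer pattern this commit describes (queue N async device-to-host copies, then synchronize once) can be sketched roughly as below. Names and sizes here are illustrative, not the actual GraphExecutor members; the key point is that cudaMemcpyAsync into pageable memory (e.g. a stack variable) silently falls back to a synchronous copy, which the page-locked buffer avoids.

```cuda
#include <cuda_runtime.h>

int main() {
    const int n_seqs = 4, floats_per_seq = 8;           // illustrative sizes
    const size_t total_floats = n_seqs * floats_per_seq;

    float* d_logprobs = nullptr;
    float* h_pinned   = nullptr;                        // analogous to h_sample_pinned_
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d_logprobs, total_floats * sizeof(float));
    cudaMallocHost(&h_pinned, total_floats * sizeof(float));  // page-locked host buffer

    // N async copies queued on the same stream ...
    for (int s = 0; s < n_seqs; ++s) {
        const size_t off = (size_t)s * floats_per_seq;
        cudaMemcpyAsync(h_pinned + off, d_logprobs + off,
                        floats_per_seq * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }
    // ... followed by a single synchronization, instead of N copy+sync pairs.
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_pinned);
    cudaFree(d_logprobs);
    cudaStreamDestroy(stream);
    return 0;
}
```

With pageable destinations, each of the N copies would block the host; with one pinned buffer the host pays the latency once, at the final sync.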
kekzl added a commit that referenced this pull request on Apr 13, 2026
Final test results (Gemma-4 26B-A4B Q4_K_M, chat template):
- "Capital of France?" → "Capital of France is **Paris**." ✓
- "Capital of Germany?" → "Capital of Germany is **Berlin** (though..." ✓✓
- "Hello, who are you?" → "I am Gemma" ✓
- "What color is the sky?" → "Blue" ✓
- "What is 2+2?" → "2+2" (partial) ~

Key fixes that made this work:
1. Docker CMake cache mount removed (the #1 blocker)
2. Custom router: rmsnorm(h) * gate_inp_scale / sqrt(d)
3. RoPE decode/prefill consistency: hd/2 for global layers
4. Chat template: <|turn> / <turn|> token IDs
5. FP32 norm for decode (fused norm+Q8_1 bypass)
6. Q4_K MMVQ for attention Q/K projections
7. All other structural fixes

Performance: pp=123 tok/s, tg=73 tok/s
Decode degenerates after ~15-20 tokens (Q5_K/Q8_0 MMVQ needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request on Apr 30, 2026

kekzl added a commit that referenced this pull request on Apr 30, 2026
kekzl added a commit that referenced this pull request on Apr 30, 2026
This was referenced on May 2, 2026:
NVFP4 stability bundle: encoder clamp, prefill chunk fix, warmup default off, validation harness (#94, merged)