docs: add CLAUDE.md with codebase guide for AI assistants #1

Merged
kekzl merged 1 commit into main from claude/add-claude-documentation-nMwLe on Feb 23, 2026

Conversation

kekzl (Owner) commented Feb 23, 2026

Comprehensive documentation covering project structure, build system
(CMake/CUDA), test suite, code conventions, and architecture notes
including attention dispatch, quantization, and CUDA 13.1 features.

https://claude.ai/code/session_013apimo3KbFHVuw5gpHS6ua

@kekzl kekzl merged commit b121a42 into main Feb 23, 2026
@kekzl kekzl deleted the claude/add-claude-documentation-nMwLe branch February 23, 2026 02:05
kekzl added a commit that referenced this pull request Mar 21, 2026
- Add h_sample_pinned_ to GraphExecutor for truly async D2H sample copies
  (stack variables force synchronous cudaMemcpyAsync fallback)
- forward() uses sample_greedy_device/sample_topk_topp_device when pinned
  buffer is available, with explicit cudaStreamSynchronize
- Batch all logprobs D2H copies in engine.cpp decode loop: N async copies +
  single sync instead of N copies with N syncs
- ensure_logits_pinned() now accepts total_floats for multi-sequence batches
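
The copy-batching pattern described above can be sketched roughly as follows. This is a minimal illustration only, assuming hypothetical names (`h_pinned`, `d_logprobs`, `copy_logprobs_batched`), not the engine's actual code:

```cuda
#include <cuda_runtime.h>

// Sketch: one pinned host buffer, N async D2H copies, one synchronize.
// cudaMemcpyAsync only stays asynchronous when the host destination is
// page-locked (cudaMallocHost); a pageable stack buffer silently
// degrades it to a synchronous copy.
void copy_logprobs_batched(const float* const* d_logprobs, // device pointers
                           float* h_pinned,                // cudaMallocHost'd
                           size_t floats_per_seq,
                           int num_seqs,
                           cudaStream_t stream) {
    for (int i = 0; i < num_seqs; ++i) {
        cudaMemcpyAsync(h_pinned + i * floats_per_seq,
                        d_logprobs[i],
                        floats_per_seq * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }
    // One synchronize for the whole batch instead of one per copy.
    cudaStreamSynchronize(stream);
}
```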

Tested: #1 (RMSNorm+NVFP4 fusion), #2 (__ldcs dp4a), #4 (AccessPolicyWindow)
confirmed as dead ends — reverted. See memory for details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request Apr 13, 2026
Final test results (Gemma-4 26B-A4B Q4_K_M, chat template):
- "Capital of France?" → "Capital of France is **Paris**." ✓
- "Capital of Germany?" → "Capital of Germany is **Berlin** (though..." ✓✓
- "Hello, who are you?" → "I am Gemma" ✓
- "What color is the sky?" → "Blue" ✓
- "What is 2+2?" → "2+2" (partial) ~

Key fixes that made this work:
1. Docker CMake cache mount removed (the #1 blocker)
2. Custom router: rmsnorm(h) * gate_inp_scale / sqrt(d)
3. RoPE decode/prefill consistency: hd/2 for global layers
4. Chat template: <|turn> / <turn|> token IDs
5. FP32 norm for decode (fused norm+Q8_1 bypass)
6. Q4_K MMVQ for attention Q/K projections
7. All other structural fixes

Performance: pp=123 tok/s, tg=73 tok/s
Decode degenerates after ~15-20 tokens (Q5_K/Q8_0 MMVQ needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request Apr 30, 2026
kekzl added a commit that referenced this pull request Apr 30, 2026