docs: add CLAUDE.md with codebase guide for AI assistants #1

Merged
kekzl merged 1 commit into main from claude/add-claude-documentation-nMwLe on Feb 23, 2026

Conversation

kekzl (Owner) commented Feb 23, 2026

Comprehensive documentation covering project structure, build system
(CMake/CUDA), test suite, code conventions, and architecture notes
including attention dispatch, quantization, and CUDA 13.1 features.

https://claude.ai/code/session_013apimo3KbFHVuw5gpHS6ua

@kekzl kekzl merged commit b121a42 into main Feb 23, 2026
@kekzl kekzl deleted the claude/add-claude-documentation-nMwLe branch February 23, 2026 02:05
kekzl added a commit that referenced this pull request Mar 21, 2026
- Add h_sample_pinned_ to GraphExecutor for truly async D2H sample copies
  (stack variables force synchronous cudaMemcpyAsync fallback)
- forward() uses sample_greedy_device/sample_topk_topp_device when pinned
  buffer is available, with explicit cudaStreamSynchronize
- Batch all logprobs D2H copies in engine.cpp decode loop: N async copies +
  single sync instead of N copies with N syncs
- ensure_logits_pinned() now accepts total_floats for multi-sequence batches
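
The copy-batching pattern described above can be sketched roughly as follows. This is a minimal illustration only, assuming hypothetical names (`h_pinned`, `d_logprobs`, `copy_logprobs_batched`), not the engine's actual code:

```cuda
#include <cuda_runtime.h>

// Sketch: one pinned host buffer, N async D2H copies, one synchronize.
// cudaMemcpyAsync only stays asynchronous when the host destination is
// page-locked (cudaMallocHost); a pageable stack buffer silently
// degrades it to a synchronous copy.
void copy_logprobs_batched(const float* const* d_logprobs, // device pointers
                           float* h_pinned,                // cudaMallocHost'd
                           size_t floats_per_seq,
                           int num_seqs,
                           cudaStream_t stream) {
    for (int i = 0; i < num_seqs; ++i) {
        cudaMemcpyAsync(h_pinned + i * floats_per_seq,
                        d_logprobs[i],
                        floats_per_seq * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }
    // One synchronize for the whole batch instead of one per copy.
    cudaStreamSynchronize(stream);
}
```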

Tested: #1 (RMSNorm+NVFP4 fusion), #2 (__ldcs dp4a), #4 (AccessPolicyWindow)
confirmed as dead ends — reverted. See memory for details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request Apr 13, 2026
Final test results (Gemma-4 26B-A4B Q4_K_M, chat template):
- "Capital of France?" → "Capital of France is **Paris**." ✓
- "Capital of Germany?" → "Capital of Germany is **Berlin** (though..." ✓✓
- "Hello, who are you?" → "I am Gemma" ✓
- "What color is the sky?" → "Blue" ✓
- "What is 2+2?" → "2+2" (partial) ~

Key fixes that made this work:
1. Docker CMake cache mount removed (the #1 blocker)
2. Custom router: rmsnorm(h) * gate_inp_scale / sqrt(d)
3. RoPE decode/prefill consistency: hd/2 for global layers
4. Chat template: <|turn> / <turn|> token IDs
5. FP32 norm for decode (fused norm+Q8_1 bypass)
6. Q4_K MMVQ for attention Q/K projections
7. All other structural fixes

Performance: pp=123 tok/s, tg=73 tok/s
Decode degenerates after ~15-20 tokens (Q5_K/Q8_0 MMVQ needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request Apr 30, 2026
kekzl added a commit that referenced this pull request Apr 30, 2026