Step 8: Sampling pipeline, stop conditions, and CLI options #24
Conversation
Implement composable sampling pipeline with temperature, top-K, top-P, min-P, repetition penalty, and categorical sampling. Add EOS, max-tokens, and stop-string stop conditions. Wire sampling options into `dotllm run` CLI with llama.cpp-style flags (--temp, --top-k, --top-p, --min-p, --repeat-penalty, --repeat-last-n, --seed). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codex please review this PR thoroughly - performance issues and possible improvements, potential bugs, code quality and architecture

@gemini please review this PR thoroughly - performance issues and possible improvements, potential bugs, code quality and architecture
Gemini ✦ Great work on the composable architecture! However, there are several critical performance issues and hidden allocations on the inference hot path that violate the zero-GC-pressure goal and will severely degrade performance for modern models with large vocabularies (e.g., Llama 3 has a 128k vocab). Here is the review, categorized by severity:

🚨 Critical Performance Bugs

1. O(N²) insertion sort in `TopPSampler`

```csharp
// src/DotLLM.Engine/Samplers/TopPSampler.cs
private static void SortDescendingByProbability(Span<int> indices, ReadOnlySpan<float> probs)
```

You implemented a manual insertion sort. For a model with a 128,000-token vocabulary, this can execute up to ~8 billion operations per token, completely freezing the engine.

2. Unnecessary softmax and allocations in `MinPSampler`

```csharp
// src/DotLLM.Engine/Samplers/MinPSampler.cs
TensorPrimitives.SoftMax(logits, probs);
float maxProb = TensorPrimitives.Max(probs);
float threshold = minP * maxProb;
```

You don't need to compute probabilities (which requires an expensive softmax pass over the full vocabulary): softmax is monotonic, so the same cutoff can be applied directly in logit space by keeping only tokens with `logit >= maxLogit + ln(minP)`.
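The logit-space rewrite follows from the identity `p_i >= minP * p_max  <=>  logit_i >= maxLogit + ln(minP)` (the softmax normalizer cancels on both sides). A language-agnostic sketch in Python, not the actual C# code:

```python
import math

def min_p_filter(logits, min_p):
    """Mask tokens below the min-P threshold without computing softmax.

    p_i >= min_p * p_max  <=>  logit_i >= max_logit + ln(min_p),
    because softmax is monotonic and the normalizer Z cancels.
    """
    max_logit = max(logits)
    threshold = max_logit + math.log(min_p)
    # Filtered-out tokens get -inf so a later softmax assigns them probability 0.
    return [l if l >= threshold else float("-inf") for l in logits]

# With min_p = 0.1, the last token survives only if exp(-3 - 2) >= 0.1,
# which is false, so it is masked out.
filtered = min_p_filter([2.0, 1.0, -3.0], min_p=0.1)
```

No exponentials, no scratch probability buffer: one max reduction plus one comparison pass over the logits.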
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 84fbfb9a32
```csharp
generatedIds.Add(firstTokenId);
string decodedText = _tokenizer.Decode(generatedIds.ToArray().AsSpan());
```
Skip first token when max token budget is zero
The generator always appends the first sampled token before enforcing the token budget, so MaxTokens = 0 (or negative values) still returns one generated token. In TextGenerator.Generate, generatedIds.Add(firstTokenId) runs unconditionally and then MaxTokensStopCondition(0) reports StopInclude, which violates strict token-budget semantics for callers expecting prompt-only responses.
Fixed in ec2b338 — added if (maxTokens <= 0) early return before prefill, returning empty response with FinishReason.Length.
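The fixed budget semantics can be sketched as follows (a Python illustration with hypothetical names, not the actual `TextGenerator` code):

```python
def generate(prompt_ids, sample_next, max_tokens):
    """Respect a strict token budget: max_tokens <= 0 yields no tokens."""
    if max_tokens <= 0:
        return []  # prompt-only response; finish reason would be "length"
    generated = []
    token = sample_next(prompt_ids)       # first sampled token (after prefill)
    while True:
        generated.append(token)
        if len(generated) >= max_tokens:  # budget checked against every append
            break
        token = sample_next(prompt_ids + generated)
    return generated
```

The key point is that the guard runs before the first append, so a zero or negative budget never emits the prefill-sampled token.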
```csharp
int j = i - 1;
while (j >= 0 && probs[indices[j]] < keyProb)
{
    indices[j + 1] = indices[j];
    j--;
```
Replace quadratic top-p ordering with scalable sorting
Top-p filtering currently uses insertion sort over the full vocabulary each decoding step, which is O(V^2) and becomes prohibitively slow at typical LLM vocab sizes (tens of thousands of tokens). This can dominate generation latency whenever top_p < 1.0; using an O(V log V) sort/selection strategy is needed for production-scale sampling performance.
Fixed in ec2b338 — replaced O(V²) insertion sort with Array.Sort (IntroSort, O(V log V)), walking backwards from the top. Also replaced the heap-allocated bool[] for large vocabs with ArrayPool<bool> rental.
```csharp
if (stopResult == StopResult.Stop)
    generatedIds.RemoveAt(generatedIds.Count - 1);
```
Trim stop suffix instead of dropping the whole token
When a stop string matches, the code removes the entire last token to exclude the stop text. If the tokenizer merges extra prefix characters with the stop substring in the same token (common with BPE), this also removes valid content before the stop sequence, so returned text is incorrect. Stop handling should remove only the matched suffix text, not always delete the full triggering token.
Deferred — added a TODO comment. The correct fix requires byte-level decode tracking to trim only the matched suffix; current behavior (dropping the whole token) is acceptable for v1 since stop strings are typically token-aligned.
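For reference, the deferred behavior amounts to trimming in text space rather than token space. A Python sketch of the intent (illustrative only; as noted, the real fix needs byte-level decode tracking):

```python
def trim_stop_suffix(decoded_text, stop_strings):
    """Cut generation at the earliest stop string, keeping text before it.

    Unlike dropping the whole triggering token, this preserves any valid
    prefix characters a BPE tokenizer merged into the same token.
    """
    cut = len(decoded_text)
    for stop in stop_strings:
        idx = decoded_text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return decoded_text[:cut]
```

For example, if a single token decodes to `".\nUser:"`, dropping the token loses the valid `"."`, while suffix trimming keeps it.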
```csharp
if (options.SamplerSteps is not null)
{
    _greedy = false;
    _processors = options.LogitProcessors?.ToArray() ?? [];
    _steps = options.SamplerSteps.ToArray();
```
Keep repetition penalty active with custom sampler steps
In the SamplerSteps branch, processors default to empty unless LogitProcessors is explicitly provided, so InferenceOptions.RepetitionPenalty/RepetitionPenaltyWindow are silently ignored whenever custom sampler steps are used. This breaks mixed configurations (custom step order plus flat repetition-penalty settings) and diverges from the option comments that only sampler flat properties are replaced.
Fixed in ec2b338 — when SamplerSteps is explicit but LogitProcessors is null, we now auto-build processors from flat properties (e.g., RepetitionPenalty). Only skipped when LogitProcessors is explicitly provided.
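The resolution order after the fix can be sketched like this (Python illustration with hypothetical names; the real code builds `ILogitProcessor` instances):

```python
def resolve_pipeline(sampler_steps, logit_processors, repetition_penalty=1.0):
    """Mirror the fixed branch: an explicit processor list wins; otherwise
    flat properties (here just repetition penalty) are auto-built even when
    custom sampler steps are supplied."""
    steps = list(sampler_steps)
    if logit_processors is not None:
        return steps, list(logit_processors)   # explicit list wins, even if empty
    processors = []
    if repetition_penalty != 1.0:              # 1.0 means disabled (llama.cpp convention)
        processors.append(("repetition_penalty", repetition_penalty))
    return steps, processors
```

So custom step ordering no longer silently disables the flat `RepetitionPenalty` setting, while passing an explicitly empty processor list still opts out.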
```csharp
positions[i] = i;

int firstTokenId;
using (ITensor prefillLogits = _model.Forward(promptIds, positions, deviceId: -1, kvCache))
```
Handle empty prompts before calling prefill forward
Generate forwards promptIds directly even when the encoded prompt is empty, which can trigger failures in the KV-cache forward path (it expects at least one position). Since API callers can pass an empty prompt, this leads to runtime exceptions instead of valid generation behavior; the method should inject a start token or special-case zero-length prompts before prefill.
Fixed in ec2b338 — empty promptIds now falls back to [BosTokenId] as a seed, avoiding the empty-array forward pass crash.
- TopPSampler: replace O(V²) insertion sort with Array.Sort; pool bool[]
- MinPSampler: eliminate SoftMax, use logit-space threshold (zero alloc)
- CategoricalSampler: ArrayPool for large vocabs (avoid LOH)
- RepetitionPenaltyProcessor: replace HashSet with sorted rented array
- TextGenerator: CollectionsMarshal.AsSpan instead of ToArray; guard MaxTokens=0 and empty prompt; TODO for stop-string suffix trimming
- SamplerPipeline: auto-build processors when SamplerSteps is explicit but LogitProcessors is null

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
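The categorical sampling step this commit touches (softmax plus CDF inversion over the filtered logits) reduces to the following Python sketch, not the C# implementation:

```python
import math
import random

def sample_categorical(logits, rng):
    """Draw a token id from softmax(logits) by inverting the CDF."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    r = rng.random() * z                 # uniform in [0, z): skips normalizing
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e
        if r < cum:
            return i
    return len(logits) - 1               # guard against rounding at the edge

rng = random.Random(42)                  # --seed-style reproducibility
token = sample_categorical([0.0, 5.0, 1.0], rng)
```

Tokens already masked to -inf by top-K/top-P/min-P contribute `exp(-inf) = 0` weight, so they can never be drawn.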
Addressed in ec2b338. Here's the triage:

**Fixed:** the O(V²) top-p sort, the MinPSampler softmax, the MaxTokens=0 guard, the empty-prompt crash, and the SamplerSteps/LogitProcessors interaction.

**Deferred (with TODOs):** stop-string suffix trimming.
Steps 8 (sampling pipeline) and 9 (stop conditions) marked complete. Phase 1 is now 9/9 — status updated to Done. Added news entry. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop no-KV-cache history. Eval per token now 24.4 ms (3x llama.cpp), prompt eval 300 ms (27x), total 31.9 tok/s — all Release + Dynamic PGO. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes #10
Summary
- Composable sampling pipeline (`SamplerPipeline`) with temperature, top-K, top-P, min-P, repetition penalty, and categorical sampling
- `TextGenerator` autoregressive generation loop using `LlamaModel` + `SamplerPipeline`
- `dotllm run`: `--temp`, `--top-k`, `--top-p`, `--min-p`, `--repeat-penalty`, `--repeat-last-n`, `--seed` (llama.cpp-style)
- `InferenceOptions` extended with composable `SamplerSteps`, `LogitProcessors`, `StopConditions` properties

Design

- `SamplerPipeline` auto-builds from flat `InferenceOptions` properties (skipping disabled steps) or accepts explicit `ISamplerStep[]` for full composability
- … (`SamplerContext`) constructors
- `CategoricalSampler` is a static helper (softmax + CDF), not an `ISamplerStep`
- Greedy decoding (`--temp 0`) matches previous behavior

Test plan

- `dotnet build` succeeds, all 223 unit tests pass
- `dotllm run <model> -p "prompt"` — greedy (default) produces the same output as before
- `dotllm run <model> -p "prompt" --temp 0.8 --top-k 40 --seed 42` — sampling mode

🤖 Generated with Claude Code