
Step 8: Sampling pipeline, stop conditions, and CLI options #24

Merged
kkokosa merged 4 commits into main from issue/10-sampling-pipeline
Mar 4, 2026
Conversation

@kkokosa
Owner

@kkokosa kkokosa commented Mar 4, 2026

Closes #10

Summary

  • Composable sampling pipeline (SamplerPipeline) with temperature, top-K, top-P, min-P, repetition penalty, and categorical sampling
  • Stop conditions: EOS token, max tokens, stop string
  • TextGenerator autoregressive generation loop using LlamaModel + SamplerPipeline
  • CLI flags on dotllm run: --temp, --top-k, --top-p, --min-p, --repeat-penalty, --repeat-last-n, --seed (llama.cpp-style)
  • InferenceOptions extended with composable SamplerSteps, LogitProcessors, StopConditions properties
  • Sample console app updated to demonstrate composable sampling API

Design

  • SamplerPipeline auto-builds from flat InferenceOptions properties (skip disabled steps) or accepts explicit ISamplerStep[] for full composability
  • Temperature ≤ 0 → greedy (argmax), skipping entire pipeline
  • Each step has both parameterized (self-configuring) and default (reads SamplerContext) constructors
  • CategoricalSampler is a static helper (softmax + CDF), not an ISamplerStep
  • CLI defaults to greedy (--temp 0), matching previous behavior
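
The design above can be illustrated with a toy model of the pipeline. A minimal sketch, assuming the composable shape described here (the real code uses an ISamplerStep interface and a SamplerContext; plain delegates stand in for them, and names like Sample/ArgMax are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Each step mutates the logits in place; temperature <= 0 bypasses the
// whole pipeline and falls back to greedy argmax, as described above.
int Sample(float[] logits, float temp, List<Action<float[]>> steps)
{
    if (temp <= 0f) return ArgMax(logits); // greedy: skip the entire pipeline

    for (int i = 0; i < logits.Length; i++)
        logits[i] /= temp;                 // temperature scaling
    foreach (var step in steps)
        step(logits);                      // top-K, top-P, min-P, ...
    return ArgMax(logits);                 // a real pipeline samples categorically here
}

int ArgMax(float[] logits)
{
    int best = 0;
    for (int i = 1; i < logits.Length; i++)
        if (logits[i] > logits[best]) best = i;
    return best;
}

// Greedy path ignores the steps entirely, matching the --temp 0 default.
Console.WriteLine(Sample(new[] { 0.1f, 2.0f, 1.0f }, 0f, new List<Action<float[]>>()));
```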

Test plan

  • Unit tests for each sampler step (Temperature, TopK, TopP, MinP) with known inputs/outputs
  • Unit tests for RepetitionPenaltyProcessor
  • Unit tests for CategoricalSampler (distribution validity, seed determinism)
  • Unit tests for SamplerPipeline (greedy, seeded, composable, repetition penalty)
  • Unit tests for all stop conditions (EOS, MaxTokens, StopString)
  • Integration test: end-to-end text generation with SmolLM-135M Q8_0
  • dotnet build succeeds, all 223 unit tests pass
  • Manual: dotllm run <model> -p "prompt" in greedy mode (the default) produces the same output as before
  • Manual: dotllm run <model> -p "prompt" --temp 0.8 --top-k 40 --seed 42 exercises sampling mode

🤖 Generated with Claude Code

Implement composable sampling pipeline with temperature, top-K, top-P,
min-P, repetition penalty, and categorical sampling. Add EOS, max-tokens,
and stop-string stop conditions. Wire sampling options into `dotllm run`
CLI with llama.cpp-style flags (--temp, --top-k, --top-p, --min-p,
--repeat-penalty, --repeat-last-n, --seed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kkokosa
Owner Author

kkokosa commented Mar 4, 2026

@codex please review this PR thoroughly - performance issues and possible improvements, potential bugs, code quality and architecture

@kkokosa
Owner Author

kkokosa commented Mar 4, 2026

@gemini please review this PR thoroughly - performance issues and possible improvements, potential bugs, code quality and architecture

@github-actions

github-actions Bot commented Mar 4, 2026

Gemini

Great work on the composable architecture! The design of SamplerPipeline and the integration with InferenceOptions is clean and extensible.

However, there are several critical performance issues and hidden allocations on the inference hot path that violate the zero-GC pressure goal and will severely degrade performance for modern models with large vocabularies (e.g., Llama 3 has a 128k vocab).

Here is the review, categorized by severity:

🚨 Critical Performance Bugs

1. O(N²) Insertion Sort in TopPSampler.cs

// src/DotLLM.Engine/Samplers/TopPSampler.cs
private static void SortDescendingByProbability(Span<int> indices, ReadOnlySpan<float> probs)

You implemented a manual insertion sort. For a model with a 128,000 vocabulary size, this will execute up to ~8 billion operations per token, completely freezing the engine.

  • Fix: Use the highly optimized BCL Array.Sort. Since you already rent arrays, sort them together ascending, then iterate backwards:
    Array.Sort(rentedProbs, rentedIndices, 0, vocabSize);
    // Then iterate backwards from vocabSize - 1 down to 0 to accumulate Top-P
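
    The suggested fix can be sketched as a self-contained example (names and signatures here are illustrative, not the actual DotLLM code):

```csharp
using System;

// Sort the paired probability/index arrays ascending with the BCL
// IntroSort, then walk backwards so the highest-probability tokens
// are visited first while accumulating mass up to topP.
int CountTopP(float[] probs, int[] indices, float topP)
{
    Array.Sort(probs, indices, 0, probs.Length); // ascending; pairs stay aligned

    float cumulative = 0f;
    int kept = 0;
    for (int i = probs.Length - 1; i >= 0; i--)  // descending walk
    {
        cumulative += probs[i];
        kept++;
        if (cumulative >= topP) break;           // nucleus is complete
    }
    return kept; // tokens outside the nucleus would be masked to -inf
}

// Keeps the 0.7 and 0.2 tokens (cumulative 0.9 >= 0.8).
Console.WriteLine(CountTopP(new[] { 0.1f, 0.2f, 0.7f }, new[] { 0, 1, 2 }, 0.8f));
```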

2. Unnecessary Softmax & Allocations in MinPSampler.cs

// src/DotLLM.Engine/Samplers/MinPSampler.cs
TensorPrimitives.SoftMax(logits, probs);
float maxProb = TensorPrimitives.Max(probs);
float threshold = minP * maxProb;

You don't need to compute probabilities (which requires an expensive SoftMax and an ArrayPool rental) to apply Min-P! You can do this entirely in logit space using a mathematical equivalence: prob < minP * maxProb is equivalent to logit < maxLogit + ln(minP).

  • Fix: This reduces the step to O(N) with zero allocations:
    float maxLogit = TensorPrimitives.Max(logits);
    float threshold = maxLogit + MathF.Log(minP);
    
    for (int i = 0; i < logits.Length; i++)
    {
        if (logits[i] < threshold)
            logits[i] = float.NegativeInfinity;
    }
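
    The equivalence holds because softmax assigns p_i = exp(l_i)/Z, so p_i < minP * p_max ⇔ exp(l_i) < minP * exp(l_max) ⇔ l_i < l_max + ln(minP); the normalizer Z cancels. A small self-contained sanity check (the helper SameMask is hypothetical, not part of the PR):

```csharp
using System;
using System.Linq;

// Verifies that probability-space min-P filtering selects exactly the
// same tokens as the logit-space threshold.
bool SameMask(float[] logits, float minP)
{
    float maxLogit = logits.Max();
    float logitThreshold = maxLogit + MathF.Log(minP);

    double z = logits.Sum(l => Math.Exp(l - maxLogit)); // stable softmax denominator
    double maxProb = 1.0 / z;                           // exp(maxLogit - maxLogit) / z

    return logits.All(l =>
        (Math.Exp(l - maxLogit) / z < minP * maxProb) == (l < logitThreshold));
}

Console.WriteLine(SameMask(new[] { 1f, 2f, 5f, -3f }, 0.1f));
```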

⚠️ Hot-Path GC Allocations (Zero-GC Violations)

1. ToArray() inside the autoregressive loop

// src/DotLLM.Engine/TextGenerator.cs (Lines 95 & 127)
decodedText = _tokenizer.Decode(generatedIds.ToArray().AsSpan());

Calling .ToArray() on a List<T> allocates a new array on the heap every single time a token is generated. This creates O(N²) memory traffic over the sequence length.

  • Fix: Use CollectionsMarshal to get a span directly from the list's backing array without allocating:
    using System.Runtime.InteropServices;
    // ...
    decodedText = _tokenizer.Decode(CollectionsMarshal.AsSpan(generatedIds));
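
    One caveat worth keeping in mind: the span aliases the list's current backing array, so it must be consumed before the next Add, which may reallocate. A small stand-alone illustration (generatedIds here is a stand-in list, not the TextGenerator field):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

var generatedIds = new List<int> { 10, 20, 30 };

// Zero-allocation view over the list's current backing array.
Span<int> span = CollectionsMarshal.AsSpan(generatedIds);
int sum = 0;
foreach (int id in span) sum += id;

// Caveat: consume the span before the next Add(). If the list grows
// and reallocates, an old span still points at the detached array.
generatedIds.Add(40);
span = CollectionsMarshal.AsSpan(generatedIds); // re-acquire after mutation

Console.WriteLine(sum);
```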

2. Heap allocations for large vocabularies

// src/DotLLM.Engine/Samplers/CategoricalSampler.cs (Line 20)
Span<float> probs = vocabSize <= 4096 ? stackalloc float[vocabSize] : new float[vocabSize];

// src/DotLLM.Engine/Samplers/TopPSampler.cs (Line 64)
Span<bool> keep = vocabSize <= 4096 ? stackalloc bool[vocabSize] : new bool[vocabSize];

Modern LLMs have vocabularies far exceeding 4096 (Llama 3 is 128k, Qwen is 151k). new float[128000] allocates ~512KB on the Large Object Heap (LOH) per token.

  • Fix: Use ArrayPool<T>.Shared.Rent(vocabSize) and ensure you return it in a finally block, just like you did earlier in TopPSampler.
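
    The rent/return pattern being suggested looks roughly like this (a sketch with an illustrative helper, not the DotLLM source): rent once per call and return in a finally block so an exception on the hot path cannot leak the pooled array.

```csharp
using System;
using System.Buffers;

int ArgMaxPooled(ReadOnlySpan<float> logits)
{
    int n = logits.Length;
    float[] rented = ArrayPool<float>.Shared.Rent(n); // may be longer than n
    try
    {
        logits.CopyTo(rented);
        int best = 0;
        for (int i = 1; i < n; i++)   // only the first n slots are meaningful
            if (rented[i] > rented[best]) best = i;
        return best;
    }
    finally
    {
        ArrayPool<float>.Shared.Return(rented); // always give the buffer back
    }
}

Console.WriteLine(ArgMaxPooled(new[] { 0f, 3f, 1f }));
```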

3. HashSet allocation per token

// src/DotLLM.Engine/Samplers/RepetitionPenaltyProcessor.cs (Line 23)
var penalizedTokens = new HashSet<int>();

Allocating a HashSet on every forward pass generates significant GC pressure.

  • Fix: Rent an array for the window, copy the previous tokens into it, sort it, and iterate to apply the penalty to unique tokens:
    int length = previousTokens.Count - startIndex;
    int[] rented = ArrayPool<int>.Shared.Rent(length);
    try
    {
        for (int i = 0; i < length; i++)
            rented[i] = previousTokens[startIndex + i]; // copy window into pooled buffer
        Array.Sort(rented, 0, length);

        for (int i = 0; i < length; i++)
        {
            if (i > 0 && rented[i] == rented[i - 1]) continue; // skip duplicates
            // apply penalty to logits[rented[i]]
        }
    }
    finally
    {
        ArrayPool<int>.Shared.Return(rented);
    }

💡 Minor Improvements

  • TopKSampler.cs: sorted.Sort(); uses Span<T>.Sort() which is O(N log N). While acceptable for now, for a 128k vocab, sorting the entire array just to find the top K is heavy. Consider implementing a QuickSelect (O(N)) or a Min-Heap (O(N log K)) in the future.
  • CategoricalSampler.cs: The fallback return vocabSize - 1; at the end of the method is slightly biased. If floating-point math causes cumulative to fall just short of r, you should return the last valid (non-negative infinity) token, not blindly the last index in the vocab.
  • TextGenerator.cs: FinishReason.Length is assigned if stopResult == StopResult.StopInclude. This logic seems slightly conflated. StopInclude usually applies to MaxTokens, which correctly maps to Length, but if a user implements a custom StopInclude condition, it might incorrectly report Length instead of Stop.
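
The CategoricalSampler fallback point can be made concrete. A sketch of CDF sampling with the safer fallback (illustrative names, assuming probabilities are already normalized): track the last token with nonzero mass, so floating-point underflow in the cumulative sum returns a valid token instead of a possibly masked final vocab index.

```csharp
using System;

int SampleCdf(float[] probs, float r)
{
    float cumulative = 0f;
    int lastValid = -1;
    for (int i = 0; i < probs.Length; i++)
    {
        if (probs[i] <= 0f) continue; // filtered / -inf tokens carry zero mass
        lastValid = i;
        cumulative += probs[i];
        if (r < cumulative) return i;
    }
    return lastValid; // rounding fallback: last non-masked token, never a masked one
}

Console.WriteLine(SampleCdf(new[] { 0.5f, 0f, 0.5f }, 0.99f));
```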


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84fbfb9a32


Comment thread src/DotLLM.Engine/TextGenerator.cs Outdated
Comment on lines +94 to +95
generatedIds.Add(firstTokenId);
string decodedText = _tokenizer.Decode(generatedIds.ToArray().AsSpan());


P1 Badge Skip first token when max token budget is zero

The generator always appends the first sampled token before enforcing the token budget, so MaxTokens = 0 (or negative values) still returns one generated token. In TextGenerator.Generate, generatedIds.Add(firstTokenId) runs unconditionally and then MaxTokensStopCondition(0) reports StopInclude, which violates strict token-budget semantics for callers expecting prompt-only responses.


Fixed in ec2b338 — added if (maxTokens <= 0) early return before prefill, returning empty response with FinishReason.Length.

Comment on lines +91 to +95
int j = i - 1;
while (j >= 0 && probs[indices[j]] < keyProb)
{
indices[j + 1] = indices[j];
j--;

P1 Badge Replace quadratic top-p ordering with scalable sorting

Top-p filtering currently uses insertion sort over the full vocabulary each decoding step, which is O(V^2) and becomes prohibitively slow at typical LLM vocab sizes (tens of thousands of tokens). This can dominate generation latency whenever top_p < 1.0; using an O(V log V) sort/selection strategy is needed for production-scale sampling performance.


Fixed in ec2b338 — replaced O(V²) insertion sort with Array.Sort (IntroSort, O(V log V)), walking backwards from the top. Also replaced the heap-allocated bool[] for large vocabs with ArrayPool<bool> rental.

Comment on lines +100 to +101
if (stopResult == StopResult.Stop)
generatedIds.RemoveAt(generatedIds.Count - 1);
P2 Badge Trim stop suffix instead of dropping the whole token

When a stop string matches, the code removes the entire last token to exclude the stop text. If the tokenizer merges extra prefix characters with the stop substring in the same token (common with BPE), this also removes valid content before the stop sequence, so returned text is incorrect. Stop handling should remove only the matched suffix text, not always delete the full triggering token.



Deferred — added a TODO comment. The correct fix requires byte-level decode tracking to trim only the matched suffix; current behavior (dropping the whole token) is acceptable for v1 since stop strings are typically token-aligned.
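
A text-level sketch of what the deferred fix could look like (the helper TrimAtStop is hypothetical; as noted above, a fully robust version needs byte-level decode tracking): trim only from the first occurrence of the stop string, keeping any valid prefix the tokenizer merged into the same token.

```csharp
using System;

string TrimAtStop(string decoded, string stop)
{
    int idx = decoded.IndexOf(stop, StringComparison.Ordinal);
    return idx < 0 ? decoded : decoded[..idx]; // keep everything before the stop text
}

// A BPE token like ".\nUser" merges valid content with the stop text;
// trimming the suffix keeps the "." that dropping the whole token would lose.
Console.WriteLine(TrimAtStop("Hello.\nUser:", "\nUser:"));
```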

Comment on lines +60 to +64
if (options.SamplerSteps is not null)
{
_greedy = false;
_processors = options.LogitProcessors?.ToArray() ?? [];
_steps = options.SamplerSteps.ToArray();

P2 Badge Keep repetition penalty active with custom sampler steps

In the SamplerSteps branch, processors default to empty unless LogitProcessors is explicitly provided, so InferenceOptions.RepetitionPenalty/RepetitionPenaltyWindow are silently ignored whenever custom sampler steps are used. This breaks mixed configurations (custom step order plus flat repetition-penalty settings) and diverges from the option comments that only sampler flat properties are replaced.



Fixed in ec2b338 — when SamplerSteps is explicit but LogitProcessors is null, we now auto-build processors from flat properties (e.g., RepetitionPenalty). Only skipped when LogitProcessors is explicitly provided.

positions[i] = i;

int firstTokenId;
using (ITensor prefillLogits = _model.Forward(promptIds, positions, deviceId: -1, kvCache))

P1 Badge Handle empty prompts before calling prefill forward

Generate forwards promptIds directly even when the encoded prompt is empty, which can trigger failures in the KV-cache forward path (it expects at least one position). Since API callers can pass an empty prompt, this leads to runtime exceptions instead of valid generation behavior; the method should inject a start token or special-case zero-length prompts before prefill.



Fixed in ec2b338 — empty promptIds now falls back to [BosTokenId] as a seed, avoiding the empty-array forward pass crash.

- TopPSampler: replace O(V²) insertion sort with Array.Sort; pool bool[]
- MinPSampler: eliminate SoftMax, use logit-space threshold (zero alloc)
- CategoricalSampler: ArrayPool for large vocabs (avoid LOH)
- RepetitionPenaltyProcessor: replace HashSet with sorted rented array
- TextGenerator: CollectionsMarshal.AsSpan instead of ToArray; guard
  MaxTokens=0 and empty prompt; TODO for stop-string suffix trimming
- SamplerPipeline: auto-build processors when SamplerSteps is explicit
  but LogitProcessors is null

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dotllm-claude-code-bot

Addressed in ec2b338. Here's the triage:

Fixed

| Issue | Fix |
| --- | --- |
| O(V²) `TopPSampler` sort | Replaced insertion sort with `Array.Sort` (IntroSort, O(V log V)); walks backwards for descending order |
| `MinPSampler` unnecessary SoftMax | Eliminated entirely; now uses the logit-space threshold `logit < maxLogit + ln(minP)` (zero allocations) |
| `CategoricalSampler` LOH alloc | `ArrayPool<float>` rental for vocabs > 4096 |
| `TopPSampler` `bool[]` LOH alloc | `ArrayPool<bool>` rental |
| `RepetitionPenaltyProcessor` `HashSet` | Sorted rented `int[]` with duplicate skip; zero managed allocations |
| `generatedIds.ToArray()` in decode loop | `CollectionsMarshal.AsSpan(generatedIds)` throughout |

Deferred (with TODOs)

| Issue | Reason |
| --- | --- |
| `TopKSampler` QuickSelect | `Span.Sort()` IntroSort is O(V log V), acceptable for now |
| `CategoricalSampler` fallback bias | Practically harmless FP edge case; added TODO comment |
| `FinishReason` conflation | Current mapping is correct for built-in conditions |

kkokosa and others added 2 commits March 4, 2026 18:02
Steps 8 (sampling pipeline) and 9 (stop conditions) marked complete.
Phase 1 is now 9/9 — status updated to Done. Added news entry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop no-KV-cache history. Eval per token now 24.4 ms (3x llama.cpp),
prompt eval 300 ms (27x), total 31.9 tok/s — all Release + Dynamic PGO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kkokosa kkokosa merged commit 482b750 into main Mar 4, 2026
4 checks passed
@kkokosa kkokosa deleted the issue/10-sampling-pipeline branch March 4, 2026 17:36


Development

Successfully merging this pull request may close these issues.

Implement sampling pipeline and stop conditions
