
Step 8: Sampling pipeline, stop conditions, and CLI options #24

Merged
kkokosa merged 4 commits into main from issue/10-sampling-pipeline
Mar 4, 2026
Conversation

@kkokosa
Owner

@kkokosa kkokosa commented Mar 4, 2026

Closes #10

Summary

  • Composable sampling pipeline (SamplerPipeline) with temperature, top-K, top-P, min-P, repetition penalty, and categorical sampling
  • Stop conditions: EOS token, max tokens, stop string
  • TextGenerator autoregressive generation loop using LlamaModel + SamplerPipeline
  • CLI flags on dotllm run: --temp, --top-k, --top-p, --min-p, --repeat-penalty, --repeat-last-n, --seed (llama.cpp-style)
  • InferenceOptions extended with composable SamplerSteps, LogitProcessors, StopConditions properties
  • Sample console app updated to demonstrate composable sampling API

Design

  • SamplerPipeline auto-builds from flat InferenceOptions properties (skip disabled steps) or accepts explicit ISamplerStep[] for full composability
  • Temperature ≤ 0 → greedy (argmax), skipping entire pipeline
  • Each step has both parameterized (self-configuring) and default (reads SamplerContext) constructors
  • CategoricalSampler is a static helper (softmax + CDF), not an ISamplerStep
  • CLI defaults to greedy (--temp 0), matching previous behavior
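
The design above can be illustrated with a toy model of the pipeline. A minimal sketch, assuming the composable shape described here (the real code uses an ISamplerStep interface and a SamplerContext; plain delegates stand in for them, and names like Sample/ArgMax are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Each step mutates the logits in place; temperature <= 0 bypasses the
// whole pipeline and falls back to greedy argmax, as described above.
int Sample(float[] logits, float temp, List<Action<float[]>> steps)
{
    if (temp <= 0f) return ArgMax(logits); // greedy: skip the entire pipeline

    for (int i = 0; i < logits.Length; i++)
        logits[i] /= temp;                 // temperature scaling
    foreach (var step in steps)
        step(logits);                      // top-K, top-P, min-P, ...
    return ArgMax(logits);                 // a real pipeline samples categorically here
}

int ArgMax(float[] logits)
{
    int best = 0;
    for (int i = 1; i < logits.Length; i++)
        if (logits[i] > logits[best]) best = i;
    return best;
}

// Greedy path ignores the steps entirely, matching the --temp 0 default.
Console.WriteLine(Sample(new[] { 0.1f, 2.0f, 1.0f }, 0f, new List<Action<float[]>>()));
```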

Test plan

  • Unit tests for each sampler step (Temperature, TopK, TopP, MinP) with known inputs/outputs
  • Unit tests for RepetitionPenaltyProcessor
  • Unit tests for CategoricalSampler (distribution validity, seed determinism)
  • Unit tests for SamplerPipeline (greedy, seeded, composable, repetition penalty)
  • Unit tests for all stop conditions (EOS, MaxTokens, StopString)
  • Integration test: end-to-end text generation with SmolLM-135M Q8_0
  • dotnet build succeeds, all 223 unit tests pass
  • Manual: dotllm run <model> -p "prompt" in greedy mode (the default) produces the same output as before
  • Manual: dotllm run <model> -p "prompt" --temp 0.8 --top-k 40 --seed 42 exercises sampling mode

🤖 Generated with Claude Code

Implement composable sampling pipeline with temperature, top-K, top-P,
min-P, repetition penalty, and categorical sampling. Add EOS, max-tokens,
and stop-string stop conditions. Wire sampling options into `dotllm run`
CLI with llama.cpp-style flags (--temp, --top-k, --top-p, --min-p,
--repeat-penalty, --repeat-last-n, --seed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kkokosa
Owner Author

kkokosa commented Mar 4, 2026

@codex please review this PR thoroughly - performance issues and possible improvements, potential bugs, code quality and architecture

@kkokosa
Owner Author

kkokosa commented Mar 4, 2026

@gemini please review this PR thoroughly - performance issues and possible improvements, potential bugs, code quality and architecture

@github-actions

github-actions Bot commented Mar 4, 2026

Gemini

Great work on the composable architecture! The design of SamplerPipeline and the integration with InferenceOptions is clean and extensible.

However, there are several critical performance issues and hidden allocations on the inference hot path that violate the zero-GC pressure goal and will severely degrade performance for modern models with large vocabularies (e.g., Llama 3 has a 128k vocab).

Here is the review, categorized by severity:

🚨 Critical Performance Bugs

1. O(N²) Insertion Sort in TopPSampler.cs

// src/DotLLM.Engine/Samplers/TopPSampler.cs
private static void SortDescendingByProbability(Span<int> indices, ReadOnlySpan<float> probs)

You implemented a manual insertion sort. For a model with a 128,000 vocabulary size, this will execute up to ~8 billion operations per token, completely freezing the engine.

  • Fix: Use the highly optimized BCL Array.Sort. Since you already rent arrays, sort them together ascending, then iterate backwards:
    Array.Sort(rentedProbs, rentedIndices, 0, vocabSize);
    // Then iterate backwards from vocabSize - 1 down to 0 to accumulate Top-P
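
    The suggested fix can be sketched as a self-contained example (names and signatures here are illustrative, not the actual DotLLM code):

```csharp
using System;

// Sort the paired probability/index arrays ascending with the BCL
// IntroSort, then walk backwards so the highest-probability tokens
// are visited first while accumulating mass up to topP.
int CountTopP(float[] probs, int[] indices, float topP)
{
    Array.Sort(probs, indices, 0, probs.Length); // ascending; pairs stay aligned

    float cumulative = 0f;
    int kept = 0;
    for (int i = probs.Length - 1; i >= 0; i--)  // descending walk
    {
        cumulative += probs[i];
        kept++;
        if (cumulative >= topP) break;           // nucleus is complete
    }
    return kept; // tokens outside the nucleus would be masked to -inf
}

// Keeps the 0.7 and 0.2 tokens (cumulative 0.9 >= 0.8).
Console.WriteLine(CountTopP(new[] { 0.1f, 0.2f, 0.7f }, new[] { 0, 1, 2 }, 0.8f));
```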

2. Unnecessary Softmax & Allocations in MinPSampler.cs

// src/DotLLM.Engine/Samplers/MinPSampler.cs
TensorPrimitives.SoftMax(logits, probs);
float maxProb = TensorPrimitives.Max(probs);
float threshold = minP * maxProb;

You don't need to compute probabilities (which requires an expensive SoftMax and an ArrayPool rental) to apply Min-P! You can do this entirely in logit space using a mathematical equivalence: prob < minP * maxProb is equivalent to logit < maxLogit + ln(minP).

  • Fix: This reduces the step to O(N) with zero allocations:
    float maxLogit = TensorPrimitives.Max(logits);
    float threshold = maxLogit + MathF.Log(minP);
    
    for (int i = 0; i < logits.Length; i++)
    {
        if (logits[i] < threshold)
            logits[i] = float.NegativeInfinity;
    }
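
    The equivalence holds because softmax assigns p_i = exp(l_i)/Z, so p_i < minP * p_max ⇔ exp(l_i) < minP * exp(l_max) ⇔ l_i < l_max + ln(minP); the normalizer Z cancels. A small self-contained sanity check (the helper SameMask is hypothetical, not part of the PR):

```csharp
using System;
using System.Linq;

// Verifies that probability-space min-P filtering selects exactly the
// same tokens as the logit-space threshold.
bool SameMask(float[] logits, float minP)
{
    float maxLogit = logits.Max();
    float logitThreshold = maxLogit + MathF.Log(minP);

    double z = logits.Sum(l => Math.Exp(l - maxLogit)); // stable softmax denominator
    double maxProb = 1.0 / z;                           // exp(maxLogit - maxLogit) / z

    return logits.All(l =>
        (Math.Exp(l - maxLogit) / z < minP * maxProb) == (l < logitThreshold));
}

Console.WriteLine(SameMask(new[] { 1f, 2f, 5f, -3f }, 0.1f));
```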

⚠️ Hot-Path GC Allocations (Zero-GC Violations)

1. ToArray() inside the autoregressive loop

// src/DotLLM.Engine/TextGenerator.cs (Lines 95 & 127)
decodedText = _tokenizer.Decode(generatedIds.ToArray().AsSpan());

Calling .ToArray() on a List<T> allocates a new array on the heap every single time a token is generated. This creates O(N²) memory traffic over the sequence length.

  • Fix: Use CollectionsMarshal to get a span directly from the list's backing array without allocating:
    using System.Runtime.InteropServices;
    // ...
    decodedText = _tokenizer.Decode(CollectionsMarshal.AsSpan(generatedIds));
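
    One caveat worth keeping in mind: the span aliases the list's current backing array, so it must be consumed before the next Add, which may reallocate. A small stand-alone illustration (generatedIds here is a stand-in list, not the TextGenerator field):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

var generatedIds = new List<int> { 10, 20, 30 };

// Zero-allocation view over the list's current backing array.
Span<int> span = CollectionsMarshal.AsSpan(generatedIds);
int sum = 0;
foreach (int id in span) sum += id;

// Caveat: consume the span before the next Add(). If the list grows
// and reallocates, an old span still points at the detached array.
generatedIds.Add(40);
span = CollectionsMarshal.AsSpan(generatedIds); // re-acquire after mutation

Console.WriteLine(sum);
```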

2. Heap allocations for large vocabularies

// src/DotLLM.Engine/Samplers/CategoricalSampler.cs (Line 20)
Span<float> probs = vocabSize <= 4096 ? stackalloc float[vocabSize] : new float[vocabSize];

// src/DotLLM.Engine/Samplers/TopPSampler.cs (Line 64)
Span<bool> keep = vocabSize <= 4096 ? stackalloc bool[vocabSize] : new bool[vocabSize];

Modern LLMs have vocabularies far exceeding 4096 (Llama 3 is 128k, Qwen is 151k). new float[128000] allocates ~512KB on the Large Object Heap (LOH) per token.

  • Fix: Use ArrayPool<T>.Shared.Rent(vocabSize) and ensure you return it in a finally block, just like you did earlier in TopPSampler.
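
    The rent/return pattern being suggested looks roughly like this (a sketch with an illustrative helper, not the DotLLM source): rent once per call and return in a finally block so an exception on the hot path cannot leak the pooled array.

```csharp
using System;
using System.Buffers;

int ArgMaxPooled(ReadOnlySpan<float> logits)
{
    int n = logits.Length;
    float[] rented = ArrayPool<float>.Shared.Rent(n); // may be longer than n
    try
    {
        logits.CopyTo(rented);
        int best = 0;
        for (int i = 1; i < n; i++)   // only the first n slots are meaningful
            if (rented[i] > rented[best]) best = i;
        return best;
    }
    finally
    {
        ArrayPool<float>.Shared.Return(rented); // always give the buffer back
    }
}

Console.WriteLine(ArgMaxPooled(new[] { 0f, 3f, 1f }));
```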

3. HashSet allocation per token

// src/DotLLM.Engine/Samplers/RepetitionPenaltyProcessor.cs (Line 23)
var penalizedTokens = new HashSet<int>();

Allocating a HashSet on every forward pass generates significant GC pressure.

  • Fix: Rent an array for the window, copy the previous tokens into it, sort it, and iterate to apply the penalty to unique tokens:
    int length = previousTokens.Count - startIndex;
    int[] rented = ArrayPool<int>.Shared.Rent(length);
    try
    {
        for (int i = 0; i < length; i++)
            rented[i] = previousTokens[startIndex + i]; // copy window into pooled buffer
        Array.Sort(rented, 0, length);

        for (int i = 0; i < length; i++)
        {
            if (i > 0 && rented[i] == rented[i - 1]) continue; // skip duplicates
            // apply penalty to logits[rented[i]]
        }
    }
    finally
    {
        ArrayPool<int>.Shared.Return(rented);
    }

💡 Minor Improvements

  • TopKSampler.cs: sorted.Sort(); uses Span<T>.Sort() which is O(N log N). While acceptable for now, for a 128k vocab, sorting the entire array just to find the top K is heavy. Consider implementing a QuickSelect (O(N)) or a Min-Heap (O(N log K)) in the future.
  • CategoricalSampler.cs: The fallback return vocabSize - 1; at the end of the method is slightly biased. If floating-point math causes cumulative to fall just short of r, you should return the last valid (non-negative infinity) token, not blindly the last index in the vocab.
  • TextGenerator.cs: FinishReason.Length is assigned if stopResult == StopResult.StopInclude. This logic seems slightly conflated. StopInclude usually applies to MaxTokens, which correctly maps to Length, but if a user implements a custom StopInclude condition, it might incorrectly report Length instead of Stop.
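
The CategoricalSampler fallback point can be made concrete. A sketch of CDF sampling with the safer fallback (illustrative names, assuming probabilities are already normalized): track the last token with nonzero mass, so floating-point underflow in the cumulative sum returns a valid token instead of a possibly masked final vocab index.

```csharp
using System;

int SampleCdf(float[] probs, float r)
{
    float cumulative = 0f;
    int lastValid = -1;
    for (int i = 0; i < probs.Length; i++)
    {
        if (probs[i] <= 0f) continue; // filtered / -inf tokens carry zero mass
        lastValid = i;
        cumulative += probs[i];
        if (r < cumulative) return i;
    }
    return lastValid; // rounding fallback: last non-masked token, never a masked one
}

Console.WriteLine(SampleCdf(new[] { 0.5f, 0f, 0.5f }, 0.99f));
```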


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84fbfb9a32


Comment thread src/DotLLM.Engine/TextGenerator.cs Outdated
Comment on lines +94 to +95
generatedIds.Add(firstTokenId);
string decodedText = _tokenizer.Decode(generatedIds.ToArray().AsSpan());


P1 Badge Skip first token when max token budget is zero

The generator always appends the first sampled token before enforcing the token budget, so MaxTokens = 0 (or negative values) still returns one generated token. In TextGenerator.Generate, generatedIds.Add(firstTokenId) runs unconditionally and then MaxTokensStopCondition(0) reports StopInclude, which violates strict token-budget semantics for callers expecting prompt-only responses.


Fixed in ec2b338 — added if (maxTokens <= 0) early return before prefill, returning empty response with FinishReason.Length.

Comment on lines +91 to +95
int j = i - 1;
while (j >= 0 && probs[indices[j]] < keyProb)
{
indices[j + 1] = indices[j];
j--;

P1 Badge Replace quadratic top-p ordering with scalable sorting

Top-p filtering currently uses insertion sort over the full vocabulary each decoding step, which is O(V^2) and becomes prohibitively slow at typical LLM vocab sizes (tens of thousands of tokens). This can dominate generation latency whenever top_p < 1.0; using an O(V log V) sort/selection strategy is needed for production-scale sampling performance.


Fixed in ec2b338 — replaced O(V²) insertion sort with Array.Sort (IntroSort, O(V log V)), walking backwards from the top. Also replaced the heap-allocated bool[] for large vocabs with ArrayPool<bool> rental.

Comment on lines +100 to +101
if (stopResult == StopResult.Stop)
generatedIds.RemoveAt(generatedIds.Count - 1);
P2 Badge Trim stop suffix instead of dropping the whole token

When a stop string matches, the code removes the entire last token to exclude the stop text. If the tokenizer merges extra prefix characters with the stop substring in the same token (common with BPE), this also removes valid content before the stop sequence, so returned text is incorrect. Stop handling should remove only the matched suffix text, not always delete the full triggering token.



Deferred — added a TODO comment. The correct fix requires byte-level decode tracking to trim only the matched suffix; current behavior (dropping the whole token) is acceptable for v1 since stop strings are typically token-aligned.
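
A text-level sketch of what the deferred fix could look like (the helper TrimAtStop is hypothetical; as noted above, a fully robust version needs byte-level decode tracking): trim only from the first occurrence of the stop string, keeping any valid prefix the tokenizer merged into the same token.

```csharp
using System;

string TrimAtStop(string decoded, string stop)
{
    int idx = decoded.IndexOf(stop, StringComparison.Ordinal);
    return idx < 0 ? decoded : decoded[..idx]; // keep everything before the stop text
}

// A BPE token like ".\nUser" merges valid content with the stop text;
// trimming the suffix keeps the "." that dropping the whole token would lose.
Console.WriteLine(TrimAtStop("Hello.\nUser:", "\nUser:"));
```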

Comment on lines +60 to +64
if (options.SamplerSteps is not null)
{
_greedy = false;
_processors = options.LogitProcessors?.ToArray() ?? [];
_steps = options.SamplerSteps.ToArray();

P2 Badge Keep repetition penalty active with custom sampler steps

In the SamplerSteps branch, processors default to empty unless LogitProcessors is explicitly provided, so InferenceOptions.RepetitionPenalty/RepetitionPenaltyWindow are silently ignored whenever custom sampler steps are used. This breaks mixed configurations (custom step order plus flat repetition-penalty settings) and diverges from the option comments that only sampler flat properties are replaced.



Fixed in ec2b338 — when SamplerSteps is explicit but LogitProcessors is null, we now auto-build processors from flat properties (e.g., RepetitionPenalty). Only skipped when LogitProcessors is explicitly provided.

positions[i] = i;

int firstTokenId;
using (ITensor prefillLogits = _model.Forward(promptIds, positions, deviceId: -1, kvCache))

P1 Badge Handle empty prompts before calling prefill forward

Generate forwards promptIds directly even when the encoded prompt is empty, which can trigger failures in the KV-cache forward path (it expects at least one position). Since API callers can pass an empty prompt, this leads to runtime exceptions instead of valid generation behavior; the method should inject a start token or special-case zero-length prompts before prefill.



Fixed in ec2b338 — empty promptIds now falls back to [BosTokenId] as a seed, avoiding the empty-array forward pass crash.

- TopPSampler: replace O(V²) insertion sort with Array.Sort; pool bool[]
- MinPSampler: eliminate SoftMax, use logit-space threshold (zero alloc)
- CategoricalSampler: ArrayPool for large vocabs (avoid LOH)
- RepetitionPenaltyProcessor: replace HashSet with sorted rented array
- TextGenerator: CollectionsMarshal.AsSpan instead of ToArray; guard
  MaxTokens=0 and empty prompt; TODO for stop-string suffix trimming
- SamplerPipeline: auto-build processors when SamplerSteps is explicit
  but LogitProcessors is null

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dotllm-claude-code-bot

Addressed in ec2b338. Here's the triage:

Fixed

| Issue | Fix |
| --- | --- |
| O(V²) `TopPSampler` sort | Replaced insertion sort with `Array.Sort` (IntroSort, O(V log V)); walks backwards for descending order |
| `MinPSampler` unnecessary SoftMax | Eliminated entirely; now uses the logit-space threshold `logit < maxLogit + ln(minP)` (zero allocations) |
| `CategoricalSampler` LOH alloc | `ArrayPool<float>` rental for vocabs > 4096 |
| `TopPSampler` `bool[]` LOH alloc | `ArrayPool<bool>` rental |
| `RepetitionPenaltyProcessor` `HashSet` | Sorted rented `int[]` with duplicate skip; zero managed allocations |
| `generatedIds.ToArray()` in decode loop | `CollectionsMarshal.AsSpan(generatedIds)` throughout |

Deferred (with TODOs)

| Issue | Reason |
| --- | --- |
| `TopKSampler` QuickSelect | `Span.Sort()` IntroSort is O(V log V), acceptable for now |
| `CategoricalSampler` fallback bias | Practically harmless FP edge case; added TODO comment |
| `FinishReason` conflation | Current mapping is correct for built-in conditions |

kkokosa and others added 2 commits March 4, 2026 18:02
Steps 8 (sampling pipeline) and 9 (stop conditions) marked complete.
Phase 1 is now 9/9 — status updated to Done. Added news entry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop no-KV-cache history. Eval per token now 24.4 ms (3x llama.cpp),
prompt eval 300 ms (27x), total 31.9 tok/s — all Release + Dynamic PGO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kkokosa kkokosa merged commit 482b750 into main Mar 4, 2026
4 checks passed
@kkokosa kkokosa deleted the issue/10-sampling-pipeline branch March 4, 2026 17:36


Development

Successfully merging this pull request may close these issues.

Implement sampling pipeline and stop conditions
