Fix TextChunker orphan chunk merging to use token count instead of word count#13714

Open
joaquinhuigomez wants to merge 2 commits into microsoft:main from joaquinhuigomez:fix/text-chunker-token-count

Conversation

@joaquinhuigomez

Problem

ProcessParagraphs in TextChunker.cs has an orphan-chunk gluing path that splits the last two paragraphs on whitespace and uses the resulting word counts to decide whether they can be merged. It then compares those word counts against adjustedMaxTokensPerParagraph, which is a token limit, not a word limit.

When a custom TokenCounter is provided (e.g., a tiktoken-based counter), word count and token count can diverge significantly, causing merged chunks to exceed the target token limit. This leads to silent data loss when the oversized chunks are later truncated by embedding models or rerankers.
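To make the divergence concrete, here is a minimal stand-alone sketch. The counter below is a hypothetical stand-in for a tiktoken-style counter, not the actual delegate used in TextChunker:

```csharp
using System;

// Hypothetical stand-in for TextChunker's TokenCounter delegate.
delegate int TokenCounter(string input);

class TokenVsWordDemo
{
    static void Main()
    {
        // Long words tokenize into many subword tokens, so word count
        // badly underestimates token count.
        string paragraph = "internationalization, antidisestablishmentarianism, pneumonoultramicroscopic";

        // Old-style measurement: split on spaces and count words.
        int wordCount = paragraph.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length; // 3

        // Rough BPE-like heuristic for illustration: about one token per four characters.
        TokenCounter counter = s => (s.Length + 3) / 4;
        int tokenCount = counter(paragraph); // 19

        Console.WriteLine($"words={wordCount}, tokens={tokenCount}"); // words=3, tokens=19
    }
}
```

Under the old logic, a merge decision based on wordCount would treat this paragraph as tiny even though the counter reports several times more tokens, so the glued chunk could blow past adjustedMaxTokensPerParagraph.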

Fix

Replace the String.Split word counting with GetTokenCount() calls, which already dispatch correctly to either the custom TokenCounter or the default length-based heuristic. This also simplifies the merge: the original paragraph strings are concatenated directly instead of being split and re-joined, which preserves their original formatting.
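A rough sketch of the corrected gluing path, under assumed names and shapes — GetTokenCount, the paragraph list, and the limit here are simplified stand-ins, not the actual Semantic Kernel source:

```csharp
using System;
using System.Collections.Generic;

class ChunkMergeSketch
{
    // Stand-in for TextChunker's TokenCounter delegate and default heuristic.
    delegate int TokenCounter(string input);

    static int GetTokenCount(string input, TokenCounter? counter) =>
        counter?.Invoke(input) ?? (input.Length >> 2); // default: ~chars / 4

    // Sketch of the corrected orphan-chunk gluing: compare token counts,
    // not space-split word counts, and concatenate the originals directly.
    static void MergeShortLastParagraph(List<string> paragraphs, int maxTokens, TokenCounter? counter)
    {
        if (paragraphs.Count < 2) return;

        string last = paragraphs[^1];
        string secondLast = paragraphs[^2];

        int lastTokens = GetTokenCount(last, counter);
        if (lastTokens < maxTokens / 4 &&
            lastTokens + GetTokenCount(secondLast, counter) <= maxTokens)
        {
            paragraphs[^2] = $"{secondLast} {last}"; // original strings, original formatting
            paragraphs.RemoveAt(paragraphs.Count - 1);
        }
    }

    static void Main()
    {
        var chunks = new List<string> { "first paragraph text", "tail" };
        MergeShortLastParagraph(chunks, maxTokens: 32, counter: s => s.Length / 4);
        Console.WriteLine($"{chunks.Count}: {chunks[0]}"); // 1: first paragraph text tail
    }
}
```

The key point is that both the "is the last chunk small?" guard and the "does the merged result still fit?" check consult the same counter, so a custom delegate can never be bypassed.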

Fixes #13713

… count

In ProcessParagraphs, the orphan chunk merging logic was splitting
strings by spaces to get word counts and comparing those against the
token limit. This caused merged chunks to exceed the target token count
when the token counter produces different counts than naive word splitting.

Use GetTokenCount() consistently, matching how the rest of the method
already measures chunk sizes.

Fixes microsoft#13713
@joaquinhuigomez joaquinhuigomez requested a review from a team as a code owner March 27, 2026 01:21

@github-actions github-actions bot left a comment


Automated Code Review

Reviewers: 4 | Confidence: 89%

✓ Correctness

This diff fixes a real token-counting consistency bug: the old code counted 'tokens' by splitting on spaces (word count) and compared the result against adjustedMaxTokensPerParagraph, which is derived from GetTokenCount — a function that, by default, computes length >> 2 (character length / 4), not word count. When a custom tokenCounter is supplied, the mismatch is even more pronounced. The fix correctly uses GetTokenCount for both paragraphs. A secondary benefit is that whitespace in user content is no longer silently normalized. The only issue is a minor redundancy: GetTokenCount(lastParagraph, tokenCounter) is computed on line 193 for the guard condition and then recomputed identically on the first line inside the block.

✓ Security Reliability

The change replaces an inconsistent space-splitting word-count proxy with the canonical GetTokenCount helper for the merge-short-last-paragraph logic, and stops normalizing internal whitespace during the merge. This is a correctness improvement with no security or reliability regressions. The only minor issue is a redundant GetTokenCount call (the same value was already computed two lines earlier in the guard condition).

✓ Test Coverage

The diff changes the paragraph-merge logic in ProcessParagraphs to use GetTokenCount (which delegates to a custom tokenCounter or the default length/4 heuristic) instead of counting space-separated words, and stops normalizing whitespace when merging. The existing test CanSplitTextParagraphsEvenly exercises this merge path but only with the default tokenCounter. No existing test exercises the merge path with a custom tokenCounter, which is the most behaviorally significant change. Additionally, the old code normalized multiple consecutive spaces during merge (split+rejoin), while the new code preserves original whitespace—this behavioral difference is untested. These are not blocking issues since the primary code path is covered, but dedicated tests would prevent regressions.

✓ Design Approach

The change correctly fixes a real inconsistency: the original code used word count (.Split(' ').Length) as a proxy for token count when deciding whether to merge the last two paragraphs, completely ignoring any custom TokenCounter delegate. The fix replaces that proxy with the same GetTokenCount() helper used everywhere else. The simplified concatenation ($"{secondLastParagraph} {lastParagraph}") is safe because BuildParagraph already .Trim()s every paragraph before storing it. As a bonus, the old split-then-rejoin actually destroyed internal newlines (e.g., multi-line paragraphs built with AppendLine), which the new code avoids. One minor inefficiency: GetTokenCount(lastParagraph, tokenCounter) is now called twice — once in the outer if guard on line 193 and once inside the block. With a custom tokenizer this is an extra round-trip, but it is not a design problem.

Suggestions

  • GetTokenCount(lastParagraph, tokenCounter) is evaluated on line 193 for the if guard and again on line 195 for assignment. Hoist the result into a local variable before the if to avoid invoking a potentially expensive custom tokenCounter twice on the same string.
  • Add a test exercising the short-last-paragraph merge logic with a custom tokenCounter (e.g., CanSplitTextParagraphsEvenlyWithCustomTokenCounter) to verify the merge decision now respects the supplied delegate rather than space-split word count.
  • Add a test with multi-space or tab-separated content in the last/second-to-last paragraphs to confirm whitespace preservation is intentional — the old code collapsed consecutive spaces during merge; the new code preserves them.
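The first of these suggestions might be sketched as follows (xUnit; the test name and inputs are illustrative, and the exact TextChunker surface, including the SKEXP0050 experimental flag, may vary by Semantic Kernel version):

```csharp
using System.Collections.Generic;
using Microsoft.SemanticKernel.Text;
using Xunit;

#pragma warning disable SKEXP0050 // TextChunker is experimental in some package versions

public class TextChunkerCustomTokenCounterTests
{
    [Fact]
    public void CanSplitTextParagraphsEvenlyWithCustomTokenCounter()
    {
        // Inflates counts relative to space-split word counting, loosely
        // mimicking a BPE tokenizer: one token per two characters.
        static int Counter(string input) => input.Length / 2;

        var lines = new List<string>
        {
            "This is a test of the emergency broadcast system.",
            "This is only a test.",
        };
        const int maxTokens = 32;

        var chunks = TextChunker.SplitPlainTextParagraphs(lines, maxTokens, tokenCounter: Counter);

        // With the fix, the merge decision consults Counter, so no chunk
        // produced by the gluing path should exceed the token limit.
        Assert.All(chunks, chunk => Assert.True(Counter(chunk) <= maxTokens));
    }
}
```

Before the fix, the gluing path would see two "short" paragraphs by word count (9 and 5 words) and could merge them even though the custom counter reports 24 and 10 tokens, whose sum exceeds the 32-token limit.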

Automated review by joaquinhuigomez's agents

Comment on lines +193 to +195:

```diff
+if (GetTokenCount(lastParagraph, tokenCounter) < adjustedMaxTokensPerParagraph / 4)
+{
-    var lastParagraphTokens = lastParagraph.Split(s_spaceChar, StringSplitOptions.RemoveEmptyEntries);
-    var secondLastParagraphTokens = secondLastParagraph.Split(s_spaceChar, StringSplitOptions.RemoveEmptyEntries);
+    var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
```

Nit: GetTokenCount(lastParagraph, tokenCounter) is already computed on line 193 for the if guard, then recomputed identically here. Hoist the result into a local before the if to avoid invoking the token counter twice — especially relevant when tokenCounter is a non-trivial delegate.

Suggested change:

```diff
-if (GetTokenCount(lastParagraph, tokenCounter) < adjustedMaxTokensPerParagraph / 4)
-{
-    var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
+var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
+if (lastParagraphTokenCount < adjustedMaxTokensPerParagraph / 4)
+{
```

Addresses review feedback: GetTokenCount(lastParagraph) was computed
twice — once for the size guard and again inside the block. Moved the
variable declaration before the if-check so the result is reused.

Development

Successfully merging this pull request may close these issues.

Bug: the TextChunker.SplitPlainTextParagraphs sometimes overcounts the chunk sizes
