Fix TextChunker orphan chunk merging to use token count instead of word count#13714

Open
joaquinhuigomez wants to merge 2 commits into microsoft:main from joaquinhuigomez:fix/text-chunker-token-count

Conversation

@joaquinhuigomez

Problem

ProcessParagraphs in TextChunker.cs has an orphan-chunk gluing path that splits the last two paragraphs on whitespace and uses the resulting word counts to decide whether they can be merged. It then compares those word counts against adjustedMaxTokensPerParagraph, which is a token limit, not a word limit.

When a custom TokenCounter is provided (e.g., a tiktoken-based counter), word count and token count can diverge significantly, causing merged chunks to exceed the target token limit. This leads to silent data loss when the oversized chunks are later truncated by embedding models or rerankers.
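To make the divergence concrete, here is a minimal stand-alone sketch. The counter below is a hypothetical stand-in for a tiktoken-style counter, not the actual delegate used in TextChunker:

```csharp
using System;

// Hypothetical stand-in for TextChunker's TokenCounter delegate.
delegate int TokenCounter(string input);

class TokenVsWordDemo
{
    static void Main()
    {
        // Long words tokenize into many subword tokens, so word count
        // badly underestimates token count.
        string paragraph = "internationalization, antidisestablishmentarianism, pneumonoultramicroscopic";

        // Old-style measurement: split on spaces and count words.
        int wordCount = paragraph.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length; // 3

        // Rough BPE-like heuristic for illustration: about one token per four characters.
        TokenCounter counter = s => (s.Length + 3) / 4;
        int tokenCount = counter(paragraph); // 19

        Console.WriteLine($"words={wordCount}, tokens={tokenCount}"); // words=3, tokens=19
    }
}
```

Under the old logic, a merge decision based on wordCount would treat this paragraph as tiny even though the counter reports several times more tokens, so the glued chunk could blow past adjustedMaxTokensPerParagraph.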

Fix

Replace the String.Split word counting with GetTokenCount() calls, which already dispatch correctly to either the custom TokenCounter or the default length-based heuristic. This also simplifies the merge: the original paragraph strings are concatenated directly instead of being split and re-joined, which preserves their original formatting.
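A rough sketch of the corrected gluing path, under assumed names and shapes — GetTokenCount, the paragraph list, and the limit here are simplified stand-ins, not the actual Semantic Kernel source:

```csharp
using System;
using System.Collections.Generic;

class ChunkMergeSketch
{
    // Stand-in for TextChunker's TokenCounter delegate and default heuristic.
    delegate int TokenCounter(string input);

    static int GetTokenCount(string input, TokenCounter? counter) =>
        counter?.Invoke(input) ?? (input.Length >> 2); // default: ~chars / 4

    // Sketch of the corrected orphan-chunk gluing: compare token counts,
    // not space-split word counts, and concatenate the originals directly.
    static void MergeShortLastParagraph(List<string> paragraphs, int maxTokens, TokenCounter? counter)
    {
        if (paragraphs.Count < 2) return;

        string last = paragraphs[^1];
        string secondLast = paragraphs[^2];

        int lastTokens = GetTokenCount(last, counter);
        if (lastTokens < maxTokens / 4 &&
            lastTokens + GetTokenCount(secondLast, counter) <= maxTokens)
        {
            paragraphs[^2] = $"{secondLast} {last}"; // original strings, original formatting
            paragraphs.RemoveAt(paragraphs.Count - 1);
        }
    }

    static void Main()
    {
        var chunks = new List<string> { "first paragraph text", "tail" };
        MergeShortLastParagraph(chunks, maxTokens: 32, counter: s => s.Length / 4);
        Console.WriteLine($"{chunks.Count}: {chunks[0]}"); // 1: first paragraph text tail
    }
}
```

The key point is that both the "is the last chunk small?" guard and the "does the merged result still fit?" check consult the same counter, so a custom delegate can never be bypassed.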

Fixes #13713

… count

In ProcessParagraphs, the orphan chunk merging logic was splitting
strings by spaces to get word counts and comparing those against the
token limit. This caused merged chunks to exceed the target token count
when the token counter produces different counts than naive word splitting.

Use GetTokenCount() consistently, matching how the rest of the method
already measures chunk sizes.

Fixes microsoft#13713
@joaquinhuigomez joaquinhuigomez requested a review from a team as a code owner March 27, 2026 01:21

@github-actions github-actions bot left a comment


Automated Code Review

Reviewers: 4 | Confidence: 89%

✓ Correctness

This diff fixes a real token-counting consistency bug: the old code counted 'tokens' by splitting on spaces (word count) and compared the result against adjustedMaxTokensPerParagraph, which is derived from GetTokenCount — a function that, by default, computes length >> 2 (character length / 4), not word count. When a custom tokenCounter is supplied, the mismatch is even more pronounced. The fix correctly uses GetTokenCount for both paragraphs. A secondary benefit is that whitespace in user content is no longer silently normalized. The only issue is a minor redundancy: GetTokenCount(lastParagraph, tokenCounter) is computed on line 193 for the guard condition and then recomputed identically on the first line inside the block.

✓ Security Reliability

The change replaces an inconsistent space-splitting word-count proxy with the canonical GetTokenCount helper for the merge-short-last-paragraph logic, and stops normalizing internal whitespace during the merge. This is a correctness improvement with no security or reliability regressions. The only minor issue is a redundant GetTokenCount call (the same value was already computed two lines earlier in the guard condition).

✓ Test Coverage

The diff changes the paragraph-merge logic in ProcessParagraphs to use GetTokenCount (which delegates to a custom tokenCounter or the default length/4 heuristic) instead of counting space-separated words, and stops normalizing whitespace when merging. The existing test CanSplitTextParagraphsEvenly exercises this merge path but only with the default tokenCounter. No existing test exercises the merge path with a custom tokenCounter, which is the most behaviorally significant change. Additionally, the old code normalized multiple consecutive spaces during merge (split+rejoin), while the new code preserves original whitespace—this behavioral difference is untested. These are not blocking issues since the primary code path is covered, but dedicated tests would prevent regressions.

✓ Design Approach

The change correctly fixes a real inconsistency: the original code used word count (.Split(' ').Length) as a proxy for token count when deciding whether to merge the last two paragraphs, completely ignoring any custom TokenCounter delegate. The fix replaces that proxy with the same GetTokenCount() helper used everywhere else. The simplified concatenation ($"{secondLastParagraph} {lastParagraph}") is safe because BuildParagraph already .Trim()s every paragraph before storing it. As a bonus, the old split-then-rejoin actually destroyed internal newlines (e.g., multi-line paragraphs built with AppendLine), which the new code avoids. One minor inefficiency: GetTokenCount(lastParagraph, tokenCounter) is now called twice — once in the outer if guard on line 193 and once inside the block. With a custom tokenizer this is an extra round-trip, but it is not a design problem.

Suggestions

  • GetTokenCount(lastParagraph, tokenCounter) is evaluated on line 193 for the if guard and again on line 195 for assignment. Hoist the result into a local variable before the if to avoid invoking a potentially expensive custom tokenCounter twice on the same string.
  • Add a test exercising the short-last-paragraph merge logic with a custom tokenCounter (e.g., CanSplitTextParagraphsEvenlyWithCustomTokenCounter) to verify the merge decision now respects the supplied delegate rather than space-split word count.
  • Add a test with multi-space or tab-separated content in the last/second-to-last paragraphs to confirm whitespace preservation is intentional — the old code collapsed consecutive spaces during merge; the new code preserves them.
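The first of these suggestions might be sketched as follows (xUnit; the test name and inputs are illustrative, and the exact TextChunker surface, including the SKEXP0050 experimental flag, may vary by Semantic Kernel version):

```csharp
using System.Collections.Generic;
using Microsoft.SemanticKernel.Text;
using Xunit;

#pragma warning disable SKEXP0050 // TextChunker is experimental in some package versions

public class TextChunkerCustomTokenCounterTests
{
    [Fact]
    public void CanSplitTextParagraphsEvenlyWithCustomTokenCounter()
    {
        // Inflates counts relative to space-split word counting, loosely
        // mimicking a BPE tokenizer: one token per two characters.
        static int Counter(string input) => input.Length / 2;

        var lines = new List<string>
        {
            "This is a test of the emergency broadcast system.",
            "This is only a test.",
        };
        const int maxTokens = 32;

        var chunks = TextChunker.SplitPlainTextParagraphs(lines, maxTokens, tokenCounter: Counter);

        // With the fix, the merge decision consults Counter, so no chunk
        // produced by the gluing path should exceed the token limit.
        Assert.All(chunks, chunk => Assert.True(Counter(chunk) <= maxTokens));
    }
}
```

Before the fix, the gluing path would see two "short" paragraphs by word count (9 and 5 words) and could merge them even though the custom counter reports 24 and 10 tokens, whose sum exceeds the 32-token limit.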

Automated review by joaquinhuigomez's agents

Comment on lines +193 to +195:

```diff
+if (GetTokenCount(lastParagraph, tokenCounter) < adjustedMaxTokensPerParagraph / 4)
+{
-    var lastParagraphTokens = lastParagraph.Split(s_spaceChar, StringSplitOptions.RemoveEmptyEntries);
-    var secondLastParagraphTokens = secondLastParagraph.Split(s_spaceChar, StringSplitOptions.RemoveEmptyEntries);
+    var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
```

Nit: GetTokenCount(lastParagraph, tokenCounter) is already computed on line 193 for the if guard, then recomputed identically here. Hoist the result into a local before the if to avoid invoking the token counter twice — especially relevant when tokenCounter is a non-trivial delegate.

Suggested change:

```diff
-if (GetTokenCount(lastParagraph, tokenCounter) < adjustedMaxTokensPerParagraph / 4)
-{
-    var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
+var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
+if (lastParagraphTokenCount < adjustedMaxTokensPerParagraph / 4)
+{
```

Addresses review feedback: GetTokenCount(lastParagraph) was computed
twice — once for the size guard and again inside the block. Moved the
variable declaration before the if-check so the result is reused.

Development

Successfully merging this pull request may close these issues.

Bug: the TextChunker.SplitPlainTextParagraphs sometimes overcounts the chunk sizes
