Fix TextChunker orphan chunk merging to use token count instead of word count#13714
joaquinhuigomez wants to merge 2 commits into microsoft:main from
Conversation
… count

In ProcessParagraphs, the orphan chunk merging logic split strings by spaces to get word counts and compared those against the token limit. This caused merged chunks to exceed the target token count whenever the token counter produced different counts than naive word splitting. Use GetTokenCount() consistently, matching how the rest of the method already measures chunk sizes.

Fixes microsoft#13713
Automated Code Review
Reviewers: 4 | Confidence: 89%
✓ Correctness
This diff fixes a real token-counting consistency bug: the old code counted 'tokens' by splitting on spaces (word count) and compared the result against adjustedMaxTokensPerParagraph, which is derived from GetTokenCount, a function that by default computes length >> 2 (character length / 4), not word count. When a custom tokenCounter is supplied, the mismatch is even more pronounced. The fix correctly uses GetTokenCount for both paragraphs. A secondary benefit is that whitespace in user content is no longer silently normalized. The only issue is a minor redundancy: GetTokenCount(lastParagraph, tokenCounter) is computed on line 193 for the guard condition and then recomputed identically on the first line inside the block.
✓ Security & Reliability
The change replaces an inconsistent space-splitting word-count proxy with the canonical GetTokenCount helper for the merge-short-last-paragraph logic, and stops normalizing internal whitespace during the merge. This is a correctness improvement with no security or reliability regressions. The only minor issue is a redundant GetTokenCount call (the same value was already computed two lines earlier in the guard condition).
✓ Test Coverage
The diff changes the paragraph-merge logic in ProcessParagraphs to use GetTokenCount (which delegates to a custom tokenCounter or the default length/4 heuristic) instead of counting space-separated words, and stops normalizing whitespace when merging. The existing test CanSplitTextParagraphsEvenly exercises this merge path but only with the default tokenCounter. No existing test exercises the merge path with a custom tokenCounter, which is the most behaviorally significant change. Additionally, the old code normalized multiple consecutive spaces during merge (split+rejoin), while the new code preserves original whitespace—this behavioral difference is untested. These are not blocking issues since the primary code path is covered, but dedicated tests would prevent regressions.
✓ Design Approach
The change correctly fixes a real inconsistency: the original code used word count (.Split(' ').Length) as a proxy for token count when deciding whether to merge the last two paragraphs, completely ignoring any custom TokenCounter delegate. The fix replaces that proxy with the same GetTokenCount() helper used everywhere else. The simplified concatenation ($"{secondLastParagraph} {lastParagraph}") is safe because BuildParagraph already .Trim()s every paragraph before storing it. As a bonus, the old split-then-rejoin actually destroyed internal newlines (e.g., multi-line paragraphs built with AppendLine), which the new code avoids. One minor inefficiency: GetTokenCount(lastParagraph, tokenCounter) is now called twice, once in the outer if guard on line 193 and once inside the block. With a custom tokenizer this is an extra round-trip, but it is not a design problem.
Suggestions
- GetTokenCount(lastParagraph, tokenCounter) is evaluated on line 193 for the if guard and again on line 195 for assignment. Hoist the result into a local variable before the if to avoid invoking a potentially expensive custom tokenCounter twice on the same string.
- Add a test exercising the short-last-paragraph merge logic with a custom tokenCounter (e.g., CanSplitTextParagraphsEvenlyWithCustomTokenCounter) to verify the merge decision now respects the supplied delegate rather than space-split word count.
- Add a test with multi-space or tab-separated content in the last/second-to-last paragraphs to confirm whitespace preservation is intentional: the old code collapsed consecutive spaces during merge; the new code preserves them.
Automated review by joaquinhuigomez's agents
    if (GetTokenCount(lastParagraph, tokenCounter) < adjustedMaxTokensPerParagraph / 4)
    {
-       var lastParagraphTokens = lastParagraph.Split(s_spaceChar, StringSplitOptions.RemoveEmptyEntries);
-       var secondLastParagraphTokens = secondLastParagraph.Split(s_spaceChar, StringSplitOptions.RemoveEmptyEntries);
+       var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
Nit: GetTokenCount(lastParagraph, tokenCounter) is already computed on line 193 for the if guard, then recomputed identically here. Hoist the result into a local before the if to avoid invoking the token counter twice — especially relevant when tokenCounter is a non-trivial delegate.
Suggested change:
-   if (GetTokenCount(lastParagraph, tokenCounter) < adjustedMaxTokensPerParagraph / 4)
-   {
-       var lastParagraphTokens = lastParagraph.Split(s_spaceChar, StringSplitOptions.RemoveEmptyEntries);
-       var secondLastParagraphTokens = secondLastParagraph.Split(s_spaceChar, StringSplitOptions.RemoveEmptyEntries);
-       var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
+   var lastParagraphTokenCount = GetTokenCount(lastParagraph, tokenCounter);
+   if (lastParagraphTokenCount < adjustedMaxTokensPerParagraph / 4)
+   {
Addresses review feedback: GetTokenCount(lastParagraph) was computed twice — once for the size guard and again inside the block. Moved the variable declaration before the if-check so the result is reused.
Problem

ProcessParagraphs in TextChunker.cs has an orphan chunk gluing path that splits the last two paragraphs by whitespace and uses the resulting word counts to decide whether they can be merged. It then compares this word count against adjustedMaxTokensPerParagraph, which is a token limit, not a word limit.

When a custom TokenCounter is provided (e.g., a tiktoken-based counter), word count and token count can diverge significantly, causing merged chunks to exceed the target token limit. This leads to silent data loss when the oversized chunks are later truncated by embedding models or rerankers.

Fix

Replace the String.Split word counting with GetTokenCount() calls, which already correctly dispatch to either the custom TokenCounter or the default length-based heuristic. This also simplifies the merge: the original paragraph strings are concatenated directly instead of being split and re-joined, preserving their original formatting.

Fixes #13713