.Net: [codex] Fix TextChunker orphan merge token count#14020

Open
pragnyanramtha wants to merge 1 commit into microsoft:main from pragnyanramtha:codex/dotnet-textchunker-token-merge

@pragnyanramtha

Summary

Fixes #13713. TextChunker.SplitPlainTextParagraphs now checks the configured token counter before merging a short final paragraph into the previous paragraph. This prevents the orphan-paragraph balancing step from creating a chunk that exceeds the requested token limit when a custom token counter is used.

Root Cause

The orphan merge logic compared the number of whitespace-delimited words in the last two paragraphs against adjustedMaxTokensPerParagraph. That word count can be lower than the actual token count reported by the provided TokenCounter, so the merge could produce an oversized chunk.
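
The gap between the two measures can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not part of TextChunker; the length-based counter (one token per character) is an assumption chosen only to make the difference obvious.

```csharp
using System;

public static class WordCountVsTokenCount
{
    // Hypothetical custom counter: one token per character.
    public static int LengthCounter(string input) => input.Length;

    // Whitespace-delimited word count, the measure the old check relied on
    // according to the description above.
    public static int WordCount(string input) =>
        input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;

    // True when the two measures disagree about whether the text fits the limit.
    public static bool MeasuresDisagree(string text, int maxTokens) =>
        (WordCount(text) <= maxTokens) != (LengthCounter(text) <= maxTokens);
}
```

For the merged text "alpha beta gamma delta" and a limit of 20, the word count is 4 (fits) while the length-based counter reports 22 (does not fit), so a check based on words alone would allow the oversized merge.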

Change

  • Build the candidate merged paragraph using the existing paragraph strings.
  • Call GetTokenCount(mergedParagraph, tokenCounter) before merging.
  • Add a regression test using a length-based token counter to cover the oversized merge case.
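
The guarded merge described in the list above might look roughly like the following. This is a sketch under stated assumptions, not the code in the PR: `OrphanMergeSketch`, `CountTokens`, and `TryMergeOrphan` are hypothetical names, and the check that decides whether the final paragraph is short enough to count as an orphan is assumed to have happened already.

```csharp
using System;
using System.Collections.Generic;

public static class OrphanMergeSketch
{
    // Hypothetical stand-in for the token counter callback passed to the chunker.
    public delegate int CountTokens(string input);

    // Merges the last paragraph into the previous one only when the merged text,
    // measured by the supplied counter, stays within the limit.
    // Returns true when a merge happened.
    public static bool TryMergeOrphan(
        List<string> paragraphs, int maxTokensPerParagraph, CountTokens countTokens)
    {
        if (paragraphs.Count < 2)
        {
            return false;
        }

        string secondLast = paragraphs[paragraphs.Count - 2];
        string last = paragraphs[paragraphs.Count - 1];

        // Build the candidate from the existing paragraph strings.
        string merged = secondLast + " " + last;

        // Ask the configured counter, not a word count, whether the candidate fits.
        if (countTokens(merged) > maxTokensPerParagraph)
        {
            return false;
        }

        paragraphs[paragraphs.Count - 2] = merged;
        paragraphs.RemoveAt(paragraphs.Count - 1);
        return true;
    }
}
```

A regression check in the same spirit would pass a length-based counter (`s => s.Length`), the paragraphs "alpha beta gamma" and "delta", and a limit of 20, then confirm that both paragraphs remain separate because the merged text would measure 22.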

Validation

  • PATH=/tmp/dotnet:$PATH dotnet test dotnet/src/SemanticKernel.UnitTests/SemanticKernel.UnitTests.csproj --filter FullyQualifiedName~TextChunkerTests
    • Passed: 40, Failed: 0, Skipped: 0

The full repository test suite was not run; only the focused TextChunker unit tests, which cover the changed behavior, were executed.

@moonbox3 added the .NET and kernel labels May 17, 2026
@github-actions (bot) changed the title from "[codex] Fix TextChunker orphan merge token count" to ".Net: [codex] Fix TextChunker orphan merge token count" May 17, 2026
@pragnyanramtha marked this pull request as ready for review May 17, 2026 00:37
@pragnyanramtha requested a review from a team as a code owner May 17, 2026 00:37
Copilot AI review requested due to automatic review settings May 17, 2026 00:37
@github-actions (bot, Contributor) left a comment

Automated Code Review

Reviewers: 4 | Confidence: 93% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach


Automated review by pragnyanramtha's agents

Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Labels

  • kernel: Issues or pull requests impacting the core kernel
  • .NET: Issue or Pull requests regarding .NET code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: the TextChunker.SplitPlainTextParagraphs sometimes overcount the chunk sizes

3 participants