Skip to content

Conversation

@aowen87
Copy link

@aowen87 aowen87 commented May 13, 2025

Description:
I've updated the base text splitter so that it correctly captures the start_index. Without this fix, there are many cases of start_index receiving -1. This was a result of the chunk overlap being in number of tokens, not characters.

The search for a matching text would often begin at an index that was past the true starting point of the chunk. In the best case scenario, we would end up with -1 as a start_index. In the worst case scenario, we would end up with an incorrect start_index due to finding another instance of that chunk later in the text.

Issue:
Fixes #29884

Dependencies:
None

@vercel
Copy link

vercel bot commented May 13, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Jul 3, 2025 3:26pm

@dosubot dosubot bot added size:M bug Related to a bug, vulnerability, unexpected error with an existing feature labels May 13, 2025
@aowen87 aowen87 marked this pull request as draft May 13, 2025 22:30
@aowen87 aowen87 marked this pull request as ready for review May 13, 2025 23:26
@aowen87 aowen87 marked this pull request as draft May 13, 2025 23:30
@aowen87 aowen87 marked this pull request as ready for review May 14, 2025 23:43
@dosubot dosubot bot added size:L and removed size:M labels May 14, 2025
@aowen87 aowen87 marked this pull request as draft June 17, 2025 00:11
@codspeed-hq
Copy link

codspeed-hq bot commented Jun 17, 2025

CodSpeed WallTime Performance Report

Merging #31222 will not alter performance

Comparing aowen87:bugfix/text_splitter_start_index (f2e5795) with master (572020c)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

✅ 13 untouched benchmarks

@codspeed-hq
Copy link

codspeed-hq bot commented Jun 17, 2025

CodSpeed Instrumentation Performance Report

Merging #31222 will not alter performance

Comparing aowen87:bugfix/text_splitter_start_index (f2e5795) with master (572020c)

Summary

✅ 14 untouched benchmarks

@aowen87
Copy link
Author

aowen87 commented Jul 10, 2025

Unfortunately, this approach of finding the index offset from the tokenized/encoded chunk doesn't work because the tokenizers and encoders often don't preserve the original text (removing whitespace, for example). I don't see an easy way to fix this without a heavy rewrite.

@aowen87 aowen87 closed this Jul 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Related to a bug, vulnerability, unexpected error with an existing feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TokenTextSplitter start indices are sometimes -1

1 participant