text-splitters: fix text splitter start index #31222
Unfortunately, this approach of finding the index offset from the tokenized/encoded chunk doesn't work because the tokenizers and encoders often don't preserve the original text (removing whitespace, for example). I don't see an easy way to fix this without a heavy rewrite.
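To illustrate the round-trip problem described in this comment, here is a minimal sketch. `toy_encode` and `toy_decode` are hypothetical stand-ins for a real tokenizer; they are not part of any library. They only mimic one kind of lossy behavior, collapsing runs of whitespace.

```python
def toy_encode(text: str) -> list[str]:
    # Hypothetical tokenizer: splits on any run of whitespace,
    # so the exact spacing of the original text is discarded.
    return text.split()


def toy_decode(tokens: list[str]) -> str:
    # Hypothetical decoder: rejoins tokens with single spaces.
    return " ".join(tokens)


text = "Hello,   world!\n\nSecond  paragraph."
tokens = toy_encode(text)

# Decode the first two tokens back into a "chunk".
chunk = toy_decode(tokens[:2])

# The decoded chunk no longer occurs verbatim in the original text,
# so a substring search cannot recover its position.
round_trip_matches = toy_decode(tokens) == text
found_at = text.find(chunk)

print(repr(chunk))         # 'Hello, world!'
print(round_trip_matches)  # False
print(found_at)            # -1
```

Any offset computed from decoded chunks inherits this mismatch, which may be why a character-accurate fix would need more than a change to the search offset.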
Description:
I've updated the base text splitter so that it correctly captures the start_index. Without this fix, there are many cases of start_index receiving -1. This was a result of the chunk overlap being measured in tokens, not characters.

The search for matching text would often begin at an index past the true starting point of the chunk. In the best case, we would end up with -1 as the start_index. In the worst case, we would end up with an incorrect start_index because another instance of that chunk was found later in the text.

Issue:
Fixes #29884
Dependencies:
None
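
The failure mode in the description can be reproduced with a small sketch. `find_start_indices` is a hypothetical helper written for illustration, not the library's code. It assumes the search for each chunk starts at the previous chunk's start plus its length minus the overlap. The chunks are built by hand as if a tokenizer treated each word as one token, with an overlap of one token.

```python
def find_start_indices(text: str, chunks: list[str], overlap: int) -> list[int]:
    """Locate each chunk in text, searching from just before the previous chunk's end.

    `overlap` is subtracted as a number of characters. Passing a token count
    here is the mismatch the description refers to.
    """
    indices = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        offset = index + previous_chunk_len - overlap
        index = text.find(chunk, max(0, offset))
        indices.append(index)
        previous_chunk_len = len(chunk)
    return indices


# Best case: the search starts one character too late and returns -1.
text = "aa bb cc dd ee ff"
chunks = ["aa bb cc", "cc dd ee", "ee ff"]  # overlap of one word ("cc", "ee")
token_overlap = find_start_indices(text, chunks, overlap=1)  # 1 token
char_overlap = find_start_indices(text, chunks, overlap=2)   # 2 characters

# Worst case: a later duplicate is matched, giving a plausible but wrong index.
repeated = "ab cd ab cd ab cd"
repeated_chunks = ["ab cd", "cd ab"]  # the second chunk truly starts at 3
wrong = find_start_indices(repeated, repeated_chunks, overlap=1)

print(token_overlap)  # [0, -1, 12]
print(char_overlap)   # [0, 6, 12]
print(wrong)          # [0, 9]
```

The `overlap=2` call only shows what a search start expressed in characters would give for this toy input; it is not a claim about how this PR computes the offset.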