text-splitters: fix text splitter start index #31222

aowen87 · 2025-05-13T22:24:34Z

Description:
I've updated the base text splitter so that it records the correct start_index for each chunk. Without this fix, start_index is set to -1 in many cases. This happens because the chunk overlap is measured in tokens, not characters.

The search for the chunk's text would often begin at an index past the chunk's true starting point. In the best case, we would end up with -1 as the start_index. In the worst case, we would end up with an incorrect start_index because the search matched another instance of that chunk later in the text.
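
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It does not use the library's code: `split_by_words` and `naive_start_indices` are hypothetical helpers, a "token" is assumed to be a space-separated word, and the offset arithmetic is an assumption modelled on the description (an overlap counted in tokens being subtracted as if it were a character count).

```python
def split_by_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Toy token splitter (hypothetical): one token is one space-separated word."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split(" ")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def naive_start_indices(text: str, chunks: list[str], chunk_overlap: int) -> list[int]:
    """Locate each chunk with str.find, starting near the end of the previous chunk."""
    index = 0
    previous_chunk_len = 0
    starts = []
    for chunk in chunks:
        # Assumed bug: chunk_overlap counts tokens but is subtracted as characters,
        # so the search can begin after the chunk's true start.
        offset = index + previous_chunk_len - chunk_overlap
        index = text.find(chunk, max(0, offset))
        starts.append(index)
        previous_chunk_len = len(chunk)
    return starts


# Best case: the search starts too late and nothing is found.
text = "alpha beta gamma delta epsilon"
chunks = split_by_words(text, chunk_size=3, chunk_overlap=2)
print(naive_start_indices(text, chunks, chunk_overlap=2))  # [0, -1, -1]
print([text.find(c) for c in chunks])                      # [0, 6, 11]

# Worst case: repeated text lets the late search match a later copy.
repeated = "one two one two one two"
rep_chunks = split_by_words(repeated, chunk_size=2, chunk_overlap=1)
print(naive_start_indices(repeated, rep_chunks, chunk_overlap=1)[:2])  # [0, 12]
```

In the second example the true start of the second chunk ("two one") is 4, but the search begins at character 6 and so reports the later occurrence at 12.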

Issue:
Fixes #29884

Dependencies:
None

vercel · 2025-05-13T22:24:39Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Comments	Updated (UTC)
langchain	⬜️ Ignored (Inspect)	Visit Preview		Jul 3, 2025 3:26pm

libs/text-splitters/tests/integration_tests/test_text_splitter.py

codspeed-hq · 2025-06-17T00:49:05Z

CodSpeed WallTime Performance Report

Merging #31222 will not alter performance

Comparing aowen87:bugfix/text_splitter_start_index (f2e5795) with master (572020c)

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

✅ 13 untouched benchmarks

codspeed-hq · 2025-06-17T00:54:04Z

CodSpeed Instrumentation Performance Report

Merging #31222 will not alter performance

Comparing aowen87:bugfix/text_splitter_start_index (f2e5795) with master (572020c)

Summary

✅ 14 untouched benchmarks

aowen87 · 2025-07-10T22:56:22Z

Unfortunately, this approach of finding the index offset from the tokenized/encoded chunk doesn't work because the tokenizers and encoders often don't preserve the original text (removing whitespace, for example). I don't see an easy way to fix this without a heavy rewrite.
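
The round-trip problem can be seen with a toy example. The sketch below uses a hypothetical whitespace-collapsing encode/decode pair as a stand-in for a real tokenizer (`toy_encode` and `toy_decode` are not any library's API). The point is only that once decoding fails to reproduce the original characters, a substring search for the decoded chunk has nothing to match.

```python
def toy_encode(text: str) -> list[str]:
    # str.split() with no argument discards every run of whitespace,
    # so the exact spacing of the original text is lost here.
    return text.split()


def toy_decode(tokens: list[str]) -> str:
    # Decoding can only guess at the separators: it always uses one space.
    return " ".join(tokens)


original = "first  line\nsecond line"
tokens = toy_encode(original)

# A chunk that spans the newline decodes to "line second",
# which never appears verbatim in the original text.
decoded_chunk = toy_decode(tokens[1:3])
print(original.find(decoded_chunk))  # -1
```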

aowen87 added 4 commits May 13, 2025 13:32

fix start_index: can't use chunk_overlap

c90fc23

add test for change

68ade97

cleanup

2724124

ruff fix

b570923

dosubot bot added size:M and bug labels May 13, 2025

aowen87 marked this pull request as draft May 13, 2025 22:30

optional dependency

182a835

aowen87 marked this pull request as ready for review May 13, 2025 23:26

aowen87 marked this pull request as draft May 13, 2025 23:30

aowen87 added 6 commits May 13, 2025 16:41

formatter

bd15cfb

more robust solution

cda0c9d

handle TokenTextSplitter

b04f2d7

ruff check

e9bebda

ruff format

2e15e47

fixing type hints

8272cc9

vercel bot deployed to Preview May 14, 2025 21:45

aowen87 added 2 commits May 14, 2025 15:00

huggingface test

1686610

uncomment

639a5c9

aowen87 commented May 14, 2025

libs/text-splitters/tests/integration_tests/test_text_splitter.py (outdated, resolved)

vercel bot deployed to Preview May 14, 2025 22:24

fixing import

632fb29

vercel bot deployed to Preview May 14, 2025 23:26

vercel bot deployed to Preview May 14, 2025 23:42

aowen87 marked this pull request as ready for review May 14, 2025 23:43

dosubot bot added size:L and removed size:M labels May 14, 2025

aowen87 marked this pull request as draft June 17, 2025 00:11

vercel bot deployed to Preview June 17, 2025 00:59

vercel bot deployed to Preview June 17, 2025 20:51

vercel bot deployed to Preview June 17, 2025 21:06

aowen87 marked this pull request as ready for review June 17, 2025 23:57

dosubot bot added size:M and removed size:L labels Jun 17, 2025

aowen87 marked this pull request as draft July 2, 2025 21:46

fixes

04b1ea9

aowen87 force-pushed the bugfix/text_splitter_start_index branch from 07fe389 to 04b1ea9 July 2, 2025 23:00

vercel bot deployed to Preview July 2, 2025 23:14

Merge branch 'master' into bugfix/text_splitter_start_index

f2e5795

aowen87 closed this Jul 10, 2025
