RecursiveCharacterTextSplitter uses regex value instead of original separator when merging and keep_separator is false #23394

loperamos · 2024-06-25T09:39:09Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain.text_splitter import RecursiveCharacterTextSplitter


if __name__ == "__main__":
    # Wrong behaviour, using \s instead of regular space
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == r"Hello\sworld"

    # Expected behaviour, keeping regular space
    splitter_no_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=True,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_no_keep.split_text("Hello world")[0] == r"Hello world"

Error Message and Stack Trace (if applicable)

No response

Description

I am trying to use the langchain library to split a test using regex separators. I expect the output strings to contain the original separators, but what happens is that when using the keep_separator flag as False it uses the regex value instead of the original separator.

Possible code pointer where the problem might be coming from: libs/text-splitters/langchain_text_splitters/character.py#L98

System Info

langchain==0.2.5
langchain-core==0.2.9
langchain-text-splitters==0.2.1

Platform: Apple M1 Pro
macOS: 14.5 (23F79)

python version: Python 3.12.3

The text was updated successfully, but these errors were encountered:

loperamos · 2024-06-25T12:04:11Z

Adding some post clarification, I believe that the expected behaviour should be as the following code snippet, since I would expect we want to keep the chunk exactly as the original text. The keep separator flag should only affect when the separator is at the begining of a chunk.

if __name__ == "__main__":
    # Wrong behaviour, using \s instead of regular space
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == r"Hello world"

wulifu2hao · 2024-06-25T13:25:07Z

I drafted a fix for it and will open an pull request once I have the time to add some test cases tomorrow

dosubot bot added Ɑ: text splitters Related to text splitters package 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jun 25, 2024

wenngong mentioned this issue Jun 25, 2024

text-splitters: fix text_splitter keep_seprator git bug #23397

Closed

wulifu2hao mentioned this issue Jun 26, 2024

text-splitters: bug fix for CharacterTextSplitter replacing original separator with regex pattern when merging #23519

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RecursiveCharacterTextSplitter uses regex value instead of original separator when merging and keep_separator is false #23394

RecursiveCharacterTextSplitter uses regex value instead of original separator when merging and keep_separator is false #23394

loperamos commented Jun 25, 2024

loperamos commented Jun 25, 2024

wulifu2hao commented Jun 25, 2024 •

edited

Loading

RecursiveCharacterTextSplitter uses regex value instead of original separator when merging and keep_separator is false #23394

RecursiveCharacterTextSplitter uses regex value instead of original separator when merging and keep_separator is false #23394

Comments

loperamos commented Jun 25, 2024

Checked other resources

Example Code

Error Message and Stack Trace (if applicable)

Description

System Info

loperamos commented Jun 25, 2024

wulifu2hao commented Jun 25, 2024 • edited Loading

wulifu2hao commented Jun 25, 2024 •

edited

Loading