Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecursiveCharacterTextSplitter uses regex value instead of original separator when merging and keep_separator is false #23394

Open
5 tasks done
loperamos opened this issue Jun 25, 2024 · 2 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: text splitters Related to text splitters package

Comments

@loperamos
Copy link

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain.text_splitter import RecursiveCharacterTextSplitter


if __name__ == "__main__":
    # Wrong behaviour, using \s instead of regular space
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == r"Hello\sworld"

    # Expected behaviour, keeping regular space
    splitter_no_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=True,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_no_keep.split_text("Hello world")[0] == r"Hello world"

Error Message and Stack Trace (if applicable)

No response

Description

I am trying to use the langchain library to split a test using regex separators. I expect the output strings to contain the original separators, but what happens is that when using the keep_separator flag as False it uses the regex value instead of the original separator.

Possible code pointer where the problem might be coming from: libs/text-splitters/langchain_text_splitters/character.py#L98

System Info

langchain==0.2.5
langchain-core==0.2.9
langchain-text-splitters==0.2.1

Platform: Apple M1 Pro
macOS: 14.5 (23F79)

python version: Python 3.12.3

@dosubot dosubot bot added Ɑ: text splitters Related to text splitters package 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jun 25, 2024
@loperamos
Copy link
Author

Adding some post clarification, I believe that the expected behaviour should be as the following code snippet, since I would expect we want to keep the chunk exactly as the original text. The keep separator flag should only affect when the separator is at the begining of a chunk.

if __name__ == "__main__":
    # Wrong behaviour, using \s instead of regular space
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == r"Hello world"

@wulifu2hao
Copy link
Contributor

wulifu2hao commented Jun 25, 2024

I drafted a fix for it and will open an pull request once I have the time to add some test cases tomorrow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: text splitters Related to text splitters package
Projects
None yet
Development

No branches or pull requests

2 participants