text-splitters: fix text_splitter keep_separator git bug #23397

Closed

@wenngong wenngong commented Jun 25, 2024

Description: fix the text_splitter keep_separator bug:

  1. Both _split_text_with_regex and self._merge_splits deal with the self._keep_separator param, so this PR removes the _keep_separator-related handling in self._merge_splits.
  2. To minimize the impact of the code change, the second (separator) argument of self._merge_splits is simply set to "". I can remove the separator-related code from _merge_splits itself in a follow-up patch.
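
The two points above can be illustrated with a minimal standalone sketch. It does not call LangChain; `split_without_separator` and `merge` are hypothetical helpers that only mimic the split-then-merge flow, assuming the separator is dropped at split time (the keep_separator=False case) and that the merge step then receives either the raw regex or "".

```python
import re


def split_without_separator(text: str, pattern: str) -> list[str]:
    # Hypothetical stand-in for the split step: split on a regex
    # and drop the matched separators and any empty pieces.
    return [piece for piece in re.split(pattern, text) if piece]


def merge(pieces: list[str], joiner: str) -> str:
    # Hypothetical stand-in for the merge step: re-join the pieces
    # with whatever string is passed as the separator.
    return joiner.join(pieces)


pieces = split_without_separator("Hello world", r"\s")

# Re-joining with the regex source text inserts the literal characters "\s".
joined_with_pattern = merge(pieces, r"\s")

# Re-joining with "" (what point 2 passes) drops the whitespace entirely.
joined_with_empty = merge(pieces, "")

print(joined_with_pattern)  # Hello\sworld
print(joined_with_empty)    # Helloworld
```

Neither result restores the original space, because a regex such as `\s` does not record which character it matched.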

Issue: #23394

from langchain.text_splitter import RecursiveCharacterTextSplitter


if __name__ == "__main__":
    # keep_separator=False with a regex separator: the case reported in the
    # issue, where chunks were re-joined with the literal r"\s". With this
    # patch they are joined with "" instead.
    splitter_no_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_no_keep.split_text("Hello world")[0] == "Helloworld"

    # keep_separator=True: expected behaviour, the regular space is kept
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=True,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == "Hello world"


@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. Ɑ: text splitters Related to text splitters package 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jun 25, 2024
@wenngong
Contributor Author

This fix is wrong as it stands; I will fix it in another patch.

@wenngong wenngong closed this Jun 25, 2024