CharacterTextSplitter with keep_separator=True sets the separator to the beginning of each chunk instead of an end #20908

Closed
5 tasks done
VPetukhov opened this issue Apr 25, 2024 · 1 comment · Fixed by #21130
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
Ɑ: text splitters (Related to text splitters package)


VPetukhov commented Apr 25, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
    separator=". ",
    keep_separator=True,
)

for d in splitter.create_documents(["Text 1. Text 2. Text 3. Text 4."]):
    print(d.page_content)
    print("---")

Error Message and Stack Trace (if applicable)

Text 1
---
. Text 2
---
. Text 3
---
. Text 4.
---

Description

I'm trying to split text by sentence, while keeping end-of-sentence punctuation. Instead of putting the punctuation back at the end of the corresponding chunk, the library adds it to the front of the following chunk.

This problem is quite critical if the output is used for text-to-speech input.
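
The difference between the two placements can be illustrated without LangChain. The following is a minimal standalone sketch, not the library's implementation: `split_keep_separator` and its `where` parameter are hypothetical names introduced here for illustration, and the sketch ignores `chunk_size` and `chunk_overlap` merging entirely.

```python
import re


def split_keep_separator(text: str, separator: str, where: str = "end") -> list[str]:
    """Split `text` on a literal `separator`, keeping the separator attached.

    Hypothetical helper for illustration only (not LangChain's API).
    where="end"   -> separator stays at the end of the preceding chunk.
    where="start" -> separator moves to the start of the following chunk.
    """
    # A capturing group makes re.split return the separators too, so
    # `parts` alternates: text, sep, text, sep, ..., text (odd length).
    parts = re.split(f"({re.escape(separator)})", text)

    if where == "end":
        # Pair each text piece with the separator that follows it.
        chunks = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        chunks.append(parts[-1])
    elif where == "start":
        # Pair each separator with the text piece that follows it.
        chunks = [parts[0]]
        chunks += [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]
    else:
        raise ValueError("where must be 'start' or 'end'")

    # Trim surrounding whitespace and drop empty chunks.
    return [c.strip() for c in chunks if c.strip()]


text = "Text 1. Text 2. Text 3. Text 4."
print(split_keep_separator(text, ". ", where="start"))
print(split_keep_separator(text, ". ", where="end"))
```

With `where="start"` the sketch reproduces the chunks reported above (`Text 1`, `. Text 2`, ...); with `where="end"` each chunk keeps its own closing period, which is the behavior the issue asks for.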

System Info

langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.46
langchain-text-splitters==0.0.1

MacOS, Python 3.10.13

@VPetukhov VPetukhov changed the title CharacterTextSplitter with keep_separator=True sets the separator to the beginning of each chung instead of an end CharacterTextSplitter with keep_separator=True sets the separator to the beginning of each chunk instead of an end Apr 25, 2024
@harsh204016

Hi @VPetukhov, can I work on this issue?

xbouroseu added a commit to xbouroseu/langchain that referenced this issue Apr 30, 2024
baskaryan added a commit that referenced this issue May 22, 2024
…ty (#21130)

**Description:** Added extra functionality to the `CharacterTextSplitter` and
`TextSplitter` classes.
The user can now choose to append the separator to the previous
chunk with `keep_separator='end'`; otherwise it is prepended to the next chunk.
Previously, the separator was prepended to the next chunk by default.
  
**Issue:** Fixes #20908

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
JuHyung-Son pushed a commit to JuHyung-Son/langchain that referenced this issue May 23, 2024