text-splitters: bug fix for incorrect start_index if the chunk is substring of another chunk #21477

Open. Wants to merge 1 commit into base: master


maggonravi

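Description: Fixes the start_index value reported when a chunk is a substring of another chunk. Previously such a chunk could be assigned the index of an earlier occurrence of the same text instead of its own position.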

Issue: #21475

Sample code:

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.docstore.document import Document
    splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=5, separators=[" ", ""], add_start_index=True)
    splitter.split_documents([Document(page_content="chunk chunk")])

Before this commit:

    [Document(page_content='chunk', metadata={'start_index': 0}),
     Document(page_content='chun', metadata={'start_index': 0}),
     Document(page_content='chunk', metadata={'start_index': 0})]

After this commit:

    [Document(page_content='chunk', metadata={'start_index': 0}),
     Document(page_content='chun', metadata={'start_index': 6}),
     Document(page_content='chunk', metadata={'start_index': 6})]
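
To illustrate how the "before" indices can all collapse to 0, here is a minimal sketch of an offset-based lookup built on `str.find`. It is an assumption for illustration only: `naive_start_indices` is a hypothetical helper, not the library's actual code and not the patch in this PR. The chunk list is taken directly from the output above.

```python
def naive_start_indices(text, chunks, chunk_overlap):
    """Hypothetical helper: locate each chunk with str.find, starting the
    search at (previous index + previous chunk length - overlap)."""
    indices = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        # When chunk_overlap >= the previous chunk's length, the offset
        # does not move past the previous match.
        offset = index + previous_chunk_len - chunk_overlap
        # str.find returns the FIRST match at or after the offset, so a
        # chunk that also occurs earlier in the text resolves to that
        # earlier position.
        index = text.find(chunk, max(0, offset))
        indices.append(index)
        previous_chunk_len = len(chunk)
    return indices


text = "chunk chunk"
chunks = ["chunk", "chun", "chunk"]  # chunks as reported in the PR output
print(naive_start_indices(text, chunks, chunk_overlap=5))  # [0, 0, 0]
```

With chunk_overlap equal to chunk_size, the search offset never advances beyond 0, so both 'chun' and the second 'chunk' match the first word of the text rather than the second.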


dosubot bot added labels on May 9, 2024:
- size:S (This PR changes 10-29 lines, ignoring generated files.)
- Ɑ: text splitters (Related to text splitters package)
- 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)

maggonravi (Author)

@baskaryan / @efriis: could one of you please take a look at this PR and the linked issue?
