ParentDocumentRetriever: parent_splitter and ids are incompatible #11982

pprados · 2023-10-18T15:39:29Z

System Info

Description
With parent_splitter, the number of parent chunks, and therefore the number of IDs needed, cannot be known before the split.
So it is not possible to know the ID of each fragment in advance.
As a result, the life cycle of the fragments cannot be managed, because there is no way to know the list of IDs associated with the original large document.

Who can help?

@eyurtsev

Information

The official example notebooks/scripts
My own modified scripts

Related Components

Reproduction

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.schema import Document
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.chroma import Chroma

vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()

docs = [Document(page_content=txt, metadata={"id": id}) for txt, id in [("aaaaaa", 1), ("bbbbbb", 2)]]
ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key="id",
    parent_splitter=RecursiveCharacterTextSplitter(
        chunk_size = 2,
        chunk_overlap  = 0,
        length_function = len,
        add_start_index = True,
    ),
    child_splitter=RecursiveCharacterTextSplitter(
        chunk_size = 1,
        chunk_overlap  = 0,
        length_function = len,
        add_start_index = True,
    ),
).add_documents(docs,ids=[doc.metadata["id"] for doc in docs])

This produces:

ValueError: Got uneven list of documents and ids. If `ids` is provided, should be same length as `documents`.

Expected behavior

No error.


dosubot · 2023-10-18T15:47:49Z

🤖

Hello,

Thank you for reaching out and providing a detailed description of the issue you're facing.

The ParentDocumentRetriever class in LangChain is designed to handle the splitting of documents into smaller chunks for more accurate embeddings while retaining enough context. It does this by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent ids for those chunks and returns those larger documents. The "parent document" refers to the document that a small chunk originated from. This can either be the whole raw document or a larger chunk.
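The retrieval flow described above can be sketched in plain Python. This is a toy model, not LangChain's code: the dictionary docstore, the `child_hits` list standing in for vector-search results, and the `resolve_parents` helper are all assumptions made for illustration.

```python
from typing import Dict, List

# Toy docstore: parent id -> parent text (stand-in for a real docstore).
docstore: Dict[str, str] = {
    "p1": "aaaaaa",
    "p2": "bbbbbb",
}

# Stand-in for vector-search hits: each child chunk records its parent id.
child_hits: List[Dict[str, str]] = [
    {"text": "a", "parent_id": "p1"},
    {"text": "b", "parent_id": "p2"},
    {"text": "a", "parent_id": "p1"},
]


def resolve_parents(hits: List[Dict[str, str]], store: Dict[str, str]) -> List[str]:
    """Map child hits to their parents, keeping first-seen order without duplicates."""
    seen: List[str] = []
    for hit in hits:
        parent_id = hit["parent_id"]
        if parent_id not in seen:
            seen.append(parent_id)
    # Skip ids that are missing from the store.
    return [store[pid] for pid in seen if pid in store]


parents = resolve_parents(child_hits, docstore)
print(parents)
```

Two of the three child hits share a parent, so only two parent documents come back.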

The error you're encountering comes from a mismatch between the number of documents and the number of IDs at the moment add_documents compares them. When parent_splitter is set, the documents are split into parent chunks before that comparison, so the two IDs you pass are checked against the number of parent chunks rather than against the two original documents. If the lengths differ, the method raises a ValueError.

Here's the relevant code snippet:

if ids is None:
    doc_ids = [str(uuid.uuid4()) for _ in documents]
    if not add_to_docstore:
        raise ValueError(
            "If ids are not passed in, `add_to_docstore` MUST be True"
        )
else:
    if len(documents) != len(ids):
        raise ValueError(
            "Got uneven list of documents and ids. "
            "If `ids` is provided, should be same length as `documents`."
        )
    doc_ids = ids

If you don't provide IDs, the method generates random UUIDs for each document. If you do provide IDs, you need to ensure that the number of IDs matches the number of documents. If you don't want to add documents to the docstore, you must provide IDs. If you don't provide IDs and set add_to_docstore to False, the method raises a ValueError.
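To see why the reproduction fails even though it passes two IDs for two documents, here is a minimal sketch of the order of operations, assuming (as the reproduction suggests) that the parent split runs before the length check in the snippet above. The `split_fixed` and `assign_ids` helpers are hypothetical stand-ins, not LangChain functions, and the fixed-width split only approximates a real text splitter.

```python
import uuid
from typing import List, Optional


def split_fixed(text: str, size: int) -> List[str]:
    """Hypothetical stand-in for a parent splitter: fixed-width slices."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def assign_ids(
    texts: List[str],
    ids: Optional[List[str]],
    parent_chunk_size: Optional[int],
) -> List[str]:
    """Mimic the assumed order of operations: split first, then compare lengths."""
    if parent_chunk_size is not None:
        texts = [chunk for text in texts for chunk in split_fixed(text, parent_chunk_size)]
    if ids is None:
        # No ids supplied: generate one random id per (possibly split) document.
        return [str(uuid.uuid4()) for _ in texts]
    if len(texts) != len(ids):
        raise ValueError(f"Got {len(texts)} documents and {len(ids)} ids after splitting.")
    return ids


# Two documents and two ids pass when there is no parent split...
ok_ids = assign_ids(["aaaaaa", "bbbbbb"], ["1", "2"], parent_chunk_size=None)

# ...but fail once each document is split into three parent chunks.
try:
    assign_ids(["aaaaaa", "bbbbbb"], ["1", "2"], parent_chunk_size=2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

In this model the caller would need six IDs, a number that depends on the splitter's output and is not known before the call.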

To resolve the ValueError, the number of IDs must match the number of documents that reach that check, which with parent_splitter means one ID per parent chunk. Alternatively, omit ids and leave add_to_docstore set to True so that UUIDs are generated for you.

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

Best regards.

eyurtsev · 2023-10-19T02:38:11Z

@pprados That part of the interface doesn't make sense. Not sure of a good solution right now, but I don't think that a parent splitter should be allowed as an argument
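One possible workaround in the meantime, sketched under assumptions: split the documents into parent chunks yourself, derive a predictable ID for each chunk, and keep a mapping from the original document ID to its chunk IDs. The helper names (`split_fixed`, `make_parent_chunks`) and the `"{doc_id}-{index}"` ID scheme are hypothetical; a real setup would use the actual text splitter, then pass the resulting chunks and IDs to `add_documents` on a retriever configured without `parent_splitter`.

```python
from typing import Dict, List, Tuple


def split_fixed(text: str, size: int) -> List[str]:
    """Hypothetical stand-in for a parent splitter: fixed-width slices."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def make_parent_chunks(
    docs: Dict[str, str], size: int
) -> Tuple[List[str], List[str], Dict[str, List[str]]]:
    """Pre-split documents and give every parent chunk a deterministic id."""
    chunks: List[str] = []
    chunk_ids: List[str] = []
    ids_by_doc: Dict[str, List[str]] = {}
    for doc_id, text in docs.items():
        for index, chunk in enumerate(split_fixed(text, size)):
            chunk_id = f"{doc_id}-{index}"  # hypothetical id scheme
            chunks.append(chunk)
            chunk_ids.append(chunk_id)
            # Remember which chunk ids belong to which original document.
            ids_by_doc.setdefault(doc_id, []).append(chunk_id)
    return chunks, chunk_ids, ids_by_doc


chunks, chunk_ids, ids_by_doc = make_parent_chunks({"1": "aaaaaa", "2": "bbbbbb"}, size=2)
print(ids_by_doc["1"])
```

Because `chunks` and `chunk_ids` always have the same length, the length check passes, and `ids_by_doc` gives the list of IDs to delete or update when an original document changes.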

pprados · 2023-11-10T15:46:01Z

@eyurtsev
I propose a new implementation of the same idea.
See my full response in pull request number 11968.

I have started submitting preliminary pull requests to enrich LangChain, ahead of my proposal eventually being accepted.

One of those pull requests cannot pass validation because of a lint error in code that is not mine.

pprados · 2023-11-29T09:35:21Z

@eyurtsev
A complete new implementation is in pull request 13910.

dosubot bot added the "Ɑ: vector store", "🤖:bug", and "🔌: chroma" labels on Oct 18, 2023

eyurtsev self-assigned this on Oct 19, 2023

dosubot bot added the "stale" label on Feb 28, 2024

dosubot bot closed this as not planned on Mar 13, 2024

dosubot bot removed the "stale" label on Mar 13, 2024
