
ParentDocumentRetriever: parent_splitter and ids are incompatible #11982

Closed
1 of 14 tasks
pprados opened this issue Oct 18, 2023 · 4 comments
Labels: 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature); 🔌: chroma (Primarily related to ChromaDB integrations); Ɑ: vector store (Related to vector store module)

Comments

@pprados
Contributor

pprados commented Oct 18, 2023

System Info

description
With parent_splitter, the number of fragments is not known before the split, so the number of IDs to provide is not known either.
As a result, it is not possible to know the ID of each fragment.
This makes it impossible to manage the life cycle of the fragments, because there is no way to know the list of IDs associated with the original large document.

Who can help?

@eyurtsev

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.schema import Document
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.chroma import Chroma

vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()

docs = [
    Document(page_content=txt, metadata={"id": doc_id})
    for txt, doc_id in [("aaaaaa", 1), ("bbbbbb", 2)]
]
ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key="id",
    parent_splitter=RecursiveCharacterTextSplitter(
        chunk_size=2,
        chunk_overlap=0,
        length_function=len,
        add_start_index=True,
    ),
    child_splitter=RecursiveCharacterTextSplitter(
        chunk_size=1,
        chunk_overlap=0,
        length_function=len,
        add_start_index=True,
    ),
).add_documents(docs, ids=[doc.metadata["id"] for doc in docs])

This produces:

ValueError: Got uneven list of documents and ids. If `ids` is provided, should be same length as `documents`.

Expected behavior

No error.

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: chroma Primarily related to ChromaDB integrations labels Oct 18, 2023
@dosubot

dosubot bot commented Oct 18, 2023

🤖

Hello,

Thank you for reaching out and providing a detailed description of the issue you're facing.

The ParentDocumentRetriever class in LangChain is designed to handle the splitting of documents into smaller chunks for more accurate embeddings while retaining enough context. It does this by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent ids for those chunks and returns those larger documents. The "parent document" refers to the document that a small chunk originated from. This can either be the whole raw document or a larger chunk.

The error you're encountering is due to the mismatch between the number of documents and the number of IDs you're providing. The add_documents method in the ParentDocumentRetriever class checks if the length of the documents and IDs are equal. If they are not, it raises a ValueError.

Here's the relevant code snippet:

if ids is None:
    doc_ids = [str(uuid.uuid4()) for _ in documents]
    if not add_to_docstore:
        raise ValueError(
            "If ids are not passed in, `add_to_docstore` MUST be True"
        )
else:
    if len(documents) != len(ids):
        raise ValueError(
            "Got uneven list of documents and ids. "
            "If `ids` is provided, should be same length as `documents`."
        )
    doc_ids = ids

If you don't provide IDs, the method generates a random UUID for each document, and add_to_docstore must be True; otherwise the method raises a ValueError. If you do provide IDs, the number of IDs must match the number of documents.

To resolve the ValueError, you need to ensure that you provide an ID for each document if you are providing IDs, or set add_to_docstore to True if you are not providing IDs.
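
One detail the snippet above does not show: the error in the reproduction is consistent with the length check running after parent_splitter has already expanded the input. The following is a minimal, LangChain-free sketch of that situation. `split_fixed` and `check_ids` are hypothetical stand-ins, and the ordering (split first, then check) is an assumption inferred from the reported error rather than quoted from the library.

```python
from typing import List


def split_fixed(text: str, chunk_size: int) -> List[str]:
    """Hypothetical stand-in for a text splitter: cut text into fixed-size pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def check_ids(documents: List[str], ids: List[int]) -> None:
    """Same length check as in the snippet quoted above."""
    if len(documents) != len(ids):
        raise ValueError(
            "Got uneven list of documents and ids. "
            "If `ids` is provided, should be same length as `documents`."
        )


originals = ["aaaaaa", "bbbbbb"]
ids = [1, 2]

# Assumption: the parent split is applied before the length check.
parents = [chunk for text in originals for chunk in split_fixed(text, 2)]

# 2 originals, 6 parent chunks, 2 ids
print(len(originals), len(parents), len(ids))

try:
    check_ids(parents, ids)
    mismatch = False
except ValueError:
    mismatch = True
```

Under that assumption, the caller passes one ID per original document, but the check compares the IDs against the parent chunks, whose count is only known after the split.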

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

Best regards.


@eyurtsev eyurtsev self-assigned this Oct 19, 2023
@eyurtsev
Collaborator

@pprados That part of the interface doesn't make sense. Not sure of a good solution right now, but I don't think that a parent splitter should be allowed as an argument
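
If the parent split is moved out of the retriever, as the comment above suggests, one possible approach is to split up front and derive one ID per chunk from the original document's ID. The sketch below is illustrative only: `split_fixed`, `presplit_with_ids`, and the `"<doc id>-<index>"` ID scheme are hypothetical and not part of LangChain.

```python
from typing import Dict, List, Tuple


def split_fixed(text: str, chunk_size: int) -> List[str]:
    """Hypothetical stand-in for a text splitter: cut text into fixed-size pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def presplit_with_ids(
    originals: Dict[str, str], chunk_size: int
) -> Tuple[List[str], List[str], Dict[str, List[str]]]:
    """Split each original document and assign a derived ID to every chunk.

    Returns the chunks, their IDs (same order, same length), and a mapping
    from each original document ID to the IDs of its chunks.
    """
    chunks: List[str] = []
    chunk_ids: List[str] = []
    ids_by_original: Dict[str, List[str]] = {}
    for doc_id, text in originals.items():
        for index, piece in enumerate(split_fixed(text, chunk_size)):
            chunk_id = f"{doc_id}-{index}"  # hypothetical ID scheme
            chunks.append(piece)
            chunk_ids.append(chunk_id)
            ids_by_original.setdefault(doc_id, []).append(chunk_id)
    return chunks, chunk_ids, ids_by_original


chunks, chunk_ids, ids_by_original = presplit_with_ids(
    {"1": "aaaaaa", "2": "bbbbbb"}, chunk_size=2
)
```

The resulting `chunks` and `chunk_ids` have equal length, so they would satisfy the length check quoted earlier, and `ids_by_original` records which chunk IDs belong to each original document for later updates or deletion.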

@pprados
Contributor Author

pprados commented Nov 10, 2023

@eyurtsev
I propose a new implementation of the same idea.
See my full response in pull-request number 11968.

I have started submitting preliminary pull requests to enrich langchain, ahead of my proposal eventually being accepted.

I can't validate one of these pull requests because of a lint error in code that is not mine.

@pprados
Contributor Author

pprados commented Nov 29, 2023

@eyurtsev
A completely new implementation is in pull request 13910.

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 28, 2024
@dosubot dosubot bot closed this as not planned Mar 13, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 13, 2024