
Type error in ParentDocumentRetriever using LocalFileStore #9345

Open
thedevstone opened this issue Aug 16, 2023 · 25 comments
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature), Ɑ: vector store (Related to vector store module)

Comments


thedevstone commented Aug 16, 2023

Bug

LocalFileStore tries to treat a Document as bytes:

store = LocalFileStore(get_project_relative_path("doc_store"))
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
retriever = ParentDocumentRetriever(vectorstore=vectorstore,
                                    docstore=store,
                                    parent_splitter=parent_splitter,
                                    child_splitter=child_splitter)
if embed:
    docs = []
    data_folder = get_project_relative_path("documents")
    for i, file_path in enumerate(data_folder.iterdir()):
        document = TextLoader(str(file_path))
        docs.extend(document.load())
    retriever.add_documents(docs, None)

Here is the broken method:

    def mset(self, key_value_pairs: Sequence[Tuple[str, bytes]]) -> None:
        """Set the values for the given keys.

        Args:
            key_value_pairs: A sequence of key-value pairs.

        Returns:
            None
        """
        for key, value in key_value_pairs:
            full_path = self._get_full_path(key)
            full_path.parent.mkdir(parents=True, exist_ok=True)
            full_path.write_bytes(value)

TypeError: memoryview: a bytes-like object is required, not 'Document'

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Create a LocalFileStore
Use a ParentDocumentRetriever

Expected behavior

Serialize the documents as bytes

dosubot bot added the Ɑ: vector store and 🤖:bug labels on Aug 16, 2023

dosubot bot commented Aug 16, 2023

🤖

Hello,

Thank you for bringing this issue to our attention. The error occurs because the LocalFileStore is trying to write a Document object directly to the file system, which fails because a Document is not a bytes-like object.

To resolve this issue, you need to serialize the Document objects before storing them in the LocalFileStore. You can use Python's built-in pickle module to do this. Here's how you can modify the add_documents method in the ParentDocumentRetriever class to serialize the documents before storing them:

import pickle

# ...

def add_documents(
    self,
    documents: List[Document],
    ids: Optional[List[str]],
    add_to_docstore: bool = True,
) -> None:
    # ...
    if add_to_docstore:
        # Serialize the documents before storing them
        serialized_docs = [(id, pickle.dumps(doc)) for id, doc in full_docs]
        self.docstore.mset(serialized_docs)

In this code, pickle.dumps(doc) is used to serialize each Document object. The serialized documents are then stored in the LocalFileStore using the mset method.

Please note that you will also need to deserialize the documents when retrieving them from the LocalFileStore. You can use pickle.loads() for this purpose. When retrieving the documents using the mget method from the LocalFileStore, remember to deserialize the returned bytes into Document objects.
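For example, the read path might look like this (a minimal sketch; docstore and ids are placeholders for your store instance and keys):

import pickle

# Minimal sketch: mget returns raw bytes, which must be deserialized
# back into Document objects. `docstore` and `ids` are placeholders.
raw_values = docstore.mget(ids)
documents = [pickle.loads(raw) for raw in raw_values if raw is not None]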

I hope this helps! If you have any other questions or run into any other issues, please don't hesitate to ask.

Best,
Dosu


thedevstone (Author) commented

Thanks, but it seems like a bug. The workaround is fine for now, but I think this should be treated as a bug and fixed in the framework.


k1ng commented Aug 20, 2023

Same problem here, please fix it.


k1ng commented Aug 20, 2023

And with RedisStore:
redis.exceptions.DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first.

nickhausman commented

Bumping this; both RedisStore and LocalFileStore are not working.

eyurtsev added a commit that referenced this issue Aug 30, 2023
This PR makes the following changes:

1. Documents become serializable using langchain serialization
2. Make a utility to create a docstore kv store

Will help to address issue here:
#9345

cam-barts commented Oct 3, 2023

@nickhausman @k1ng @Giulianini I came to this thread because I had the same issue, but after looking at the commit @eyurtsev (thank you so much!) linked to this issue, I was able to resolve my issue and wanted to make sure that was documented here. Here's what I had before that didn't work:

    store = LocalFileStore("./store_location")
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    vectorstore = ...
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )

Here are the changes I made that got it to work:

from langchain.storage._lc_store import create_kv_docstore
# ...
    fs = LocalFileStore("./store_location")
    store = create_kv_docstore(fs)
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    vectorstore = ...
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )

It looks like this has been available since v0.0.277.
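For context, create_kv_docstore appears to wrap the byte store with an encoder that serializes Documents on write and rebuilds them on read; roughly like this (a simplified sketch based on my reading of the source, not the exact implementation):

from langchain.load import dumps, loads
from langchain.storage.encoder_backed import EncoderBackedStore

def create_kv_docstore_sketch(byte_store):
    # Serialize Documents to bytes on write, rebuild them on read.
    return EncoderBackedStore(
        store=byte_store,
        key_encoder=lambda key: key,
        value_serializer=lambda doc: dumps(doc).encode("utf-8"),
        value_deserializer=lambda data: loads(data.decode("utf-8")),
    )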

I hope this helps!


weissenbacherpwc commented Nov 16, 2023

@cam-barts This helped a lot! What is confusing to me is the relationship between the file store ("fs") and the vector store. E.g. here is my code using Chroma as the vector store:

def run_db_build():
    loader = DirectoryLoader(cfg.DATA_PATH,
                             glob='*.pdf',
                             loader_cls=PyPDFLoader)
    documents = loader.load()
    
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'}, encode_kwargs={'device': 'mps', 'batch_size': 32})
    
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    #store = InMemoryStore()
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)


    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/") 
    big_chunks_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    big_chunks_retriever.add_documents(documents)

In several examples I saw that the LocalFileStore path and the persist_directory path are different. Why is this? And if I want to use the saved ParentDocumentRetriever in a RetrievalQA chain, which folder do I have to load?

cam-barts commented

In several examples I saw that the LocalFileStore path and the persist_directory path are different. Why is this?

@weissenbacherpwc glad it helped. To the best of my knowledge, the persist_directory you pass into the vector store initialization is specifically where the vector store's own data lives. In your example, Chroma doesn't know about ParentDocumentRetriever at all. What ParentDocumentRetriever does for your chain is create associations between document IDs, which are stored in the vector store (which happens to be Chroma but could be anything). Those associations need to live somewhere, and that is what you declare with the LocalFileStore path. That path could exist inside your Chroma directory if you wish, and indeed that's what I've done in the past, but it doesn't have to.

And if I want to use the saved ParentDocumentRetriever in a RetrievalQA chain, which folder do I have to load?

I think when you initialize most chains that use docs in this way, you pass in the retriever itself and not a path. So in this case, if you wanted to use the newly created big_chunks_retriever, which is your parent-child retriever object, you'd pass that in directly:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI # Or whatever makes sense for you

my_llm = OpenAI() # Or whatever makes sense for you

# ...

big_chunks_retriever.add_documents(documents)
retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=big_chunks_retriever)

weissenbacherpwc commented

Got it with the directories, thanks!

In the script, the easiest way would be to use the created big_chunks_retriever in the RetrievalQA chain. But if one wants to use the retriever in production, it will take too long to create the retriever and call big_chunks_retriever.add_documents(documents) every time at startup, right? That's why I wanted to store the retriever somewhere.

cam-barts commented

@weissenbacherpwc correct! I usually have different ingestion and use scripts, but the way that you build the retriever will be the same. So in one part of your process, you do document upload, and in another part you just use the retriever.

For example, your function that you shared would be really good for ingest. Depending on my requirements, I'd either return the retriever from that function to be used elsewhere, or I'd rebuild it where I need it. Here are two examples:

def run_db_build():
    # ....
    big_chunks_retriever.add_documents(documents)
    return big_chunks_retriever

retriever = run_db_build() # get retriever object as a global to be reused

def run_qa_chain(query):
    retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=retriever) # use global retriever
    return retrievalQA.run(query)

Alternatively:

# src/ingest.py
def run_db_build():
    # ....
# src/build_retriever.py
def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'}, encode_kwargs={'device': 'mps', 'batch_size': 32})
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/") 
    big_chunks_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    return big_chunks_retriever

That will return the same usable retriever, without needing to actually load any docs.
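Using the rebuilt retriever is then just (sketch):

# Sketch: reconnect to the persisted stores and query; no re-ingestion needed.
retriever = rebuild_retriever()
parent_docs = retriever.get_relevant_documents("your query here")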

gcheron (Contributor) commented Jan 11, 2024

This PR aims at adding support for document storage in a SQL database: #15909

kallebl0mquist commented

I am having the same problem with LocalFileStore, but with MultiVectorRetriever: I get the type error when I try to store texts, and no type error when I store images.

So when I use create_kv_docstore as a workaround, my problem is the other way around: a type error with images and no type error with texts. The texts are a list of strings, not Documents, at that point.

With InMemoryStore everything is fine, but that's not an option for real use.
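A possible workaround (an untested sketch; doc_ids and texts stand in for your own IDs and summary strings) might be to wrap the plain strings as Documents before calling mset:

from langchain.schema import Document

# Untested sketch: wrap plain-text entries as Documents so the
# create_kv_docstore wrapper can serialize them like any other Document.
text_pairs = [(doc_id, Document(page_content=text)) for doc_id, text in zip(doc_ids, texts)]
store.mset(text_pairs)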


tyatabe commented Jan 25, 2024

Was this ever fixed for Redis? I'm getting the same error when adding documents to the retriever:
DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first


rchen19 commented Feb 15, 2024

(quoting @cam-barts's ingest / rebuild_retriever example above)

Thanks for the example. One question: in your rebuild_retriever() example, the two text splitters are in fact not used, correct? I think they are only used when ParentDocumentRetriever.add_documents() is called, so their parameters do not matter here, I guess?


muazhari commented Apr 1, 2024

Any fix for the Redis store? I have the same issue when using it as the docstore with MultiVectorRetriever:

DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first.

thdesc (Contributor) commented Apr 8, 2024

(quoting @cam-barts's rebuild_retriever example and @rchen19's question above)

@rchen19 Yes, the two text splitters are not useful here and could be removed. In fact, the ParentDocumentRetriever could be replaced by a MultiVectorRetriever instance, since the only difference between the two is the add_documents method that ParentDocumentRetriever adds. The rebuild_retriever function can be implemented this way:

# src/build_retriever.py
from langchain.retrievers import MultiVectorRetriever

def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'}, encode_kwargs={'device': 'mps', 'batch_size': 32})
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/") 
    big_chunks_retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
    )
    return big_chunks_retriever

weissenbacherpwc commented

(quoting @thdesc's MultiVectorRetriever example above)

Nice, this works! Is there any way to use PGVector as the vector database?
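Something like this, maybe (an untested sketch; the connection string is a placeholder)?

from langchain_community.vectorstores import PGVector

# Untested sketch: swap Chroma for PGVector; the connection string is a placeholder.
vectorstore = PGVector(
    collection_name="split_parents",
    connection_string="postgresql+psycopg2://user:password@localhost:5432/mydb",
    embedding_function=embeddings,
)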


rchen19 commented Apr 16, 2024

(quoting @thdesc's reply above)

Very informative, thank you. I did not realize the only difference between the two classes is the add_documents method.


parthamadhira commented Apr 24, 2024

# src/build_retriever.py
from langchain.retrievers import MultiVectorRetriever

def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'}, encode_kwargs={'device': 'mps', 'batch_size': 32})
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    # vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
    #                      persist_directory="chroma_db/")
    vectorstore = Chroma(client=chroma_client, collection_name="ap_collection_parent",
                         embedding_function=embeddings)
    big_chunks_retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
    )
    return big_chunks_retriever

In the above code, only the parent store is passed, which is a local directory. How is it linked to the child collection that is persisted in ChromaDB?
When using a conversational chain as below:

conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=big_chunks_retriever,
    memory=memory, verbose=True, return_source_documents=True
)

How will the query first search the child documents and then return the corresponding parent documents? Thanks in advance for the clarification.
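(For reference, MultiVectorRetriever's query flow is roughly the following, a simplified sketch based on the langchain source; details may vary by version. The child chunks are searched first, and their id_key metadata is used to look up the parents in the docstore:)

# Simplified sketch of MultiVectorRetriever._get_relevant_documents:
sub_docs = vectorstore.similarity_search(query)       # 1. search the child chunks
ids = []
for d in sub_docs:
    if d.metadata[id_key] not in ids:                 # 2. collect unique parent ids
        ids.append(d.metadata[id_key])
parent_docs = docstore.mget(ids)                      # 3. fetch parents from the docstore
results = [d for d in parent_docs if d is not None]   # 4. return the parent documents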


codemigs commented Jun 9, 2024

(quoting @cam-barts's rebuild_retriever example above)

Without loading any docs, we won't get any context, right?

In a production environment, I'd have to add the docs so we can retrieve something, and adding the documents while the system starts up will take some time. Part of the reason I want to load the vector store from a Chroma DB is for faster retrieval, so I can skip the indexing part.

Is there a faster way to add docs, or a way to skip that part and still get context in a DB environment? Loading the files and then adding the docs takes time. How do we skip that step and make the docs part of the Chroma DB while using ParentDocument retrieval?
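One way to avoid re-ingesting on every start is to guard the slow build step (a sketch; the directory check is a stand-in for whatever signal fits your deployment):

import os

# Sketch: run the slow ingestion only once; afterwards rebuild_retriever()
# simply reconnects to the persisted Chroma directory and LocalFileStore.
if not os.path.isdir("chroma_db/"):
    run_db_build()               # slow path: load, split, embed, store
retriever = rebuild_retriever()  # fast path: no document loading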

huangpan2507 commented

(quoting @cam-barts's create_kv_docstore fix above)

Great! I will try it, genius!!!


huangpan2507 commented Jul 4, 2024

(quoting @cam-barts's explanation of the store directories above)

Hi @cam-barts, thanks for your solution. Here is my code:

# Get elements
one_raw_pdf_elements = partition_pdf(
    filename=file_name,
    languages=["chinese"],
    # strategy='hi_res',
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    extract_image_block_output_dir=self._img_path,
    form_extraction_skip_tables=False,
)
raw_pdf_elements.extend(one_raw_pdf_elements)

# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]

embeddings = HuggingFaceEmbeddings(model_name="/mnt/AI/models/embedding_model")
vectorstore = Chroma(
    persist_directory=self._persist_directory,
    embedding_function=embeddings,
)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
fs = LocalFileStore("/mnt/AI/data_base/vector_db/chroma_docs_raw_text_table_image")
store = create_kv_docstore(fs)
id_key = "doc_id"
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

If I want to use retriever.add_documents(...), should I call retriever.add_documents(one_raw_pdf_elements)? Is that right?

huangpan2507 commented

How will the query first search the child documents and then return the corresponding parent documents?

Good question! I have the same doubt about the child documents (stored in the vector store, e.g. Chroma) and the parent documents (stored in the LocalFileStore): how are they handled when I send a query to the LLM?


rbs333 commented Jul 23, 2024

Touched base with some langchain folks on this issue. One way of handling this without overriding the default is to use the EncoderBackedStore class. We did this with Redis as our base store, but it would be extendable to other stores.

from langchain.storage.encoder_backed import EncoderBackedStore
from langchain.storage import RedisStore
from langchain.schema import Document
import pickle

def key_encoder(key: int | str) -> str:
    return str(key)

def value_serializer(value: Document) -> bytes:
    return pickle.dumps(value)

def value_deserializer(serialized_value: bytes) -> Document:
    return pickle.loads(serialized_value)

# Create an instance of the abstract store
abstract_store = RedisStore(redis_url="redis://localhost:6379", namespace="parent_docs")

# Create an instance of the encoder-backed store
encoder_store = EncoderBackedStore(
    store=abstract_store,
    key_encoder=key_encoder,
    value_serializer=value_serializer,
    value_deserializer=value_deserializer
)

from langchain.retrievers import ParentDocumentRetriever

parent_doc_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=encoder_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
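A quick round-trip check (sketch) confirms that Documents survive the pickle encode/decode:

from langchain.schema import Document

# Sketch: verify a Document survives the serialize/deserialize round trip.
encoder_store.mset([("doc-1", Document(page_content="hello"))])
restored = encoder_store.mget(["doc-1"])[0]
assert restored.page_content == "hello"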

jaideep11061982 commented

(quoting @cam-barts's explanation above)

@rbs333 if I have a huge number of documents, how do I write scalable ingestion logic?
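(One common pattern, as a sketch: ingest in batches rather than one giant add_documents call, so progress is incremental and failures can be retried per batch.)

# Sketch: batch ingestion so a large corpus doesn't go through one giant call.
batch_size = 200  # tune for your documents and infrastructure
for i in range(0, len(documents), batch_size):
    parent_doc_retriever.add_documents(documents[i : i + batch_size])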
