Type error in ParentDocumentRetriever using LocalFileStore #9345
🤖 Hello,

Thank you for bringing this issue to our attention. The error you're encountering is due to the fact that `LocalFileStore` only accepts byte values, while `ParentDocumentRetriever` hands it `Document` objects. To resolve this issue, you need to serialize the documents before storing them, for example with `pickle`:

```python
import pickle

# ...

def add_documents(
    self,
    documents: List[Document],
    ids: Optional[List[str]],
    add_to_docstore: bool = True,
) -> None:
    # ...
    if add_to_docstore:
        # Serialize the documents before storing them
        serialized_docs = [(id, pickle.dumps(doc)) for id, doc in full_docs]
        self.docstore.mset(serialized_docs)
```

In this code, `pickle.dumps` converts each `Document` into bytes before it is written to the store. Please note that you will also need to deserialize the documents when retrieving them from the docstore. I hope this helps! If you have any other questions or run into any other issues, please don't hesitate to ask.
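To make the round trip concrete, here is a minimal stdlib-only sketch of the serialize-on-write, deserialize-on-read pattern the bot describes. `Document` and `ByteStore` below are hypothetical stand-ins for the LangChain classes, not the real API:

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class Document:
    # Stand-in for langchain's Document (illustrative only)
    page_content: str
    metadata: dict = field(default_factory=dict)

class ByteStore:
    """Stand-in for LocalFileStore: accepts and returns only bytes."""
    def __init__(self):
        self._data = {}

    def mset(self, pairs):
        for key, value in pairs:
            if not isinstance(value, bytes):
                # This mirrors the TypeError reported in the issue
                raise TypeError(
                    f"a bytes-like object is required, not {type(value).__name__!r}"
                )
            self._data[key] = value

    def mget(self, keys):
        return [self._data.get(k) for k in keys]

store = ByteStore()
docs = [("doc-1", Document("parent text", {"source": "a.txt"}))]

# Serialize on write...
store.mset([(doc_id, pickle.dumps(doc)) for doc_id, doc in docs])
# ...and deserialize on read.
restored = [pickle.loads(b) for b in store.mget(["doc-1"])]
print(restored[0].page_content)  # -> parent text
```

Passing a raw `Document` to `mset` reproduces the `memoryview` error; serializing first avoids it.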
Thanks!

Same problem, fix it please.

And with RedisStore:

Bump on this, both RedisStore and LocalFileStore are not working.
This PR makes the following changes:
1. Documents become serializable using langchain serialization.
2. Adds a utility to create a docstore KV store.

Will help to address the issue here: #9345
@nickhausman @k1ng @Giulianini I came to this thread because I had the same issue, but after looking at the commit @eyurtsev (thank you so much!) linked to this issue, I was able to resolve my issue and wanted to make sure that was documented here. Here's what I had before that didn't work:

```python
store = LocalFileStore("./store_location")
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = ...
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```

Here are the changes I made that got it working:

```python
from langchain.storage._lc_store import create_kv_docstore

# ...
fs = LocalFileStore("./store_location")
store = create_kv_docstore(fs)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = ...
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```

It looks like this was made available since v0.0.277. I hope this helps!
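Conceptually, `create_kv_docstore` wraps a bytes-only store so that `Document`s are serialized on write and deserialized on read. This stdlib-only sketch shows the pattern; it is an illustration, not the real implementation (LangChain uses its own serialization rather than `pickle`, and the class names here are made up):

```python
import pickle

class KVDocstoreWrapper:
    """Illustrative wrapper: turns a bytes-only store into a document store.
    Conceptual sketch only; not LangChain's actual create_kv_docstore."""
    def __init__(self, byte_store):
        self._store = byte_store

    def mset(self, key_doc_pairs):
        # Encode values to bytes before handing them to the byte store
        self._store.mset([(k, pickle.dumps(d)) for k, d in key_doc_pairs])

    def mget(self, keys):
        # Decode bytes back into objects on the way out
        return [None if b is None else pickle.loads(b)
                for b in self._store.mget(keys)]

class DictByteStore:
    """Minimal in-memory byte store standing in for LocalFileStore."""
    def __init__(self):
        self._d = {}
    def mset(self, pairs):
        self._d.update(dict(pairs))
    def mget(self, keys):
        return [self._d.get(k) for k in keys]

store = KVDocstoreWrapper(DictByteStore())
store.mset([("id-1", {"page_content": "hello"})])
print(store.mget(["id-1"])[0])  # -> {'page_content': 'hello'}
```

This is why the wrapped store works where the bare `LocalFileStore` raised the `TypeError`: the retriever still sees objects, while the underlying store only ever sees bytes.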
@cam-barts This helped a lot! What is confusing to me is the file store `fs` versus the vectorstore. E.g. my code here using Chroma as the vectorstore:

In several examples I saw that the LocalFileStore path and the persist_directory path are different. Why is this? And if I want to use the saved ParentDocumentRetriever in a RetrievalQA chain, which folder do I have to load?
@weissenbacherpwc glad it helped. To the best of my knowledge, the LocalFileStore path and the Chroma persist_directory are independent: the file store holds the serialized parent documents, while persist_directory holds the vector index, so they can live in different folders.

I think when you initialize most chains that use docs in this way, you pass in the retriever itself and not a path. So in this case, if you wanted to use the newly created retriever:

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI  # Or whatever makes sense for you

my_llm = OpenAI()  # Or whatever makes sense for you
# ...
big_chunks_retriever.add_documents(documents)
retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=big_chunks_retriever)
```
Got it with the directories, thanks! In the script, the easiest way would be to use the created retriever directly.
@weissenbacherpwc correct! I usually have different ingestion and use scripts, but the way that you build the retriever will be the same. So in one part of your process, you do document upload, and in another part you just use the retriever. For example, your function that you shared would be really good for ingest. Depending on my requirements, I'd either return the retriever from that function to be used elsewhere, or I'd rebuild it where I need it. Here are two examples:

```python
def run_db_build():
    # ....
    big_chunks_retriever.add_documents(documents)
    return big_chunks_retriever

retriever = run_db_build()  # get retriever object as a global to be reused

def run_qa_chain(query):
    retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=retriever)  # use global retriever
    return retrievalQA.run(query)
```

Alternatively:

```python
# src/ingest.py
def run_db_build():
    # ....
```

```python
# src/build_retriever.py
def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'},
                                       encode_kwargs={'device': 'mps', 'batch_size': 32})
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/")
    big_chunks_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    return big_chunks_retriever
```

That will return the same usable retriever, without needing to actually load any docs.
This PR aims at adding support for document storage in a SQL database: #15909
I am having the same problem with LocalFileStore, but with MultiVectorRetriever. I get the type error when I try to store texts; no type error when I store images. So when I use create_kv_docstore as a workaround, my problem is the other way around: a type error with images, no type error with texts. The texts are a list of strings and not Documents at that point. With InMemoryStore everything is fine, but that's not an option for real use.
Was this ever fixed for Redis? I'm getting the same error when adding documents to the retriever.
Thanks for the example. One question: in your example of rebuild_retriever, are the two text splitters actually needed?
Any fix to the Redis store? I have the same issue when using it with MultiVectorRetriever as the docstore.
@rchen19 Yes, the two text splitters are not useful here and could be removed. In fact, the MultiVectorRetriever doesn't take splitter arguments at all:

```python
# src/build_retriever.py
from langchain.retrievers import MultiVectorRetriever

def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'},
                                       encode_kwargs={'device': 'mps', 'batch_size': 32})
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/")
    big_chunks_retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
    )
    return big_chunks_retriever
```
Nice, this works!! Is there any way to select PGVector as the vector database?
Very informative, thank you. I did not realize the only difference between the two classes is the splitting logic.
In the above code, only the parent collection is passed, which is a local dir. How is it linked to the child collection that is persisted in ChromaDB? How will the query first search for child documents and then return the corresponding parent documents? Thanks for the clarification in advance.
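At a high level, the link works because each child chunk carries its parent's id: a query first matches children in the vector store, and the retriever then fetches the parents from the docstore by those ids. Below is a minimal stdlib-only sketch of that flow, where substring matching stands in for vector similarity and all names are illustrative, not LangChain's API:

```python
# Illustrative sketch: substring search stands in for vector similarity.
parents = {}        # "docstore": parent_id -> full parent text
child_index = []    # "vector store": (child_chunk, parent_id) pairs

def add_parent(parent_id, text, chunk_size=10):
    """Store the parent, then index its child chunks tagged with its id."""
    parents[parent_id] = text
    for i in range(0, len(text), chunk_size):
        child_index.append((text[i:i + chunk_size], parent_id))

def retrieve(query):
    # 1) search the child chunks, 2) collect their parent ids,
    # 3) return the full parent documents
    hit_ids = []
    for chunk, pid in child_index:
        if query in chunk and pid not in hit_ids:
            hit_ids.append(pid)
    return [parents[pid] for pid in hit_ids]

add_parent("p1", "alpha beta gamma delta")
add_parent("p2", "epsilon zeta eta")
print(retrieve("beta"))  # -> ['alpha beta gamma delta']
```

The real retriever does the same three steps, but with embeddings for step 1 and the configured docstore for step 3; the parent id travels in the child chunk's metadata.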
Without loading any docs, we won't get any context, right? In a production environment, I'd have to add the docs so we can retrieve something, and adding the documents while kicking off the system will take some time. Part of the reason I want to load the vector store from a Chroma DB is faster retrieval and skipping the indexing step. Is there a faster way to add docs, or to skip that step and still get context in a DB environment? Loading the files and then adding docs takes time. How do we skip that part and make the docs part of the Chroma DB while still using ParentDocument retrieval?
Great! I will try it, genius!!!
Hi @cam-barts, thanks for your solution.
Good question! Same doubt about the process with child documents (stored in the vector store, e.g. Chroma) and parent documents (stored in LocalFileStore): how are they handled when I send a query to the LLM?
Touched base with some LangChain folks on this issue. One way of handling this without overriding the default is to use the EncoderBackedStore class. We did this with Redis as our base store, but it would be extendable.
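The EncoderBackedStore idea is to wrap any base store with a key encoder plus value serializer/deserializer functions. Here is a stdlib-only sketch of that pattern, where a plain dict stands in for the Redis base store; the class and parameter names mirror the concept but are not guaranteed to match LangChain's exact signature:

```python
import json

class EncoderBackedStoreSketch:
    """Pattern sketch: delegate storage, but encode keys and
    serialize/deserialize values with user-supplied functions."""
    def __init__(self, store, key_encoder, value_serializer, value_deserializer):
        self.store = store
        self.key_encoder = key_encoder
        self.value_serializer = value_serializer
        self.value_deserializer = value_deserializer

    def mset(self, pairs):
        # Values reach the base store only as encoded bytes
        self.store.update(
            (self.key_encoder(k), self.value_serializer(v)) for k, v in pairs
        )

    def mget(self, keys):
        return [
            self.value_deserializer(self.store.get(self.key_encoder(k)))
            for k in keys
        ]

base = {}  # stands in for a Redis client here
store = EncoderBackedStoreSketch(
    base,
    key_encoder=lambda k: f"doc:{k}",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    value_deserializer=lambda b: None if b is None else json.loads(b),
)
store.mset([("42", {"page_content": "parent doc"})])
print(store.mget(["42"])[0])  # -> {'page_content': 'parent doc'}
```

Because the encoding is injected rather than baked in, the same wrapper works over Redis, the filesystem, or any other bytes-oriented backend.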
@rbs333 If I have a huge number of documents, how do I write scalable logic for this?
Bug
LocalFileStore tries to treat a Document as bytes.
Here is the broken method:

```
TypeError: memoryview: a bytes-like object is required, not 'Document'
```
Who can help?
No response
Information
Related Components
Reproduction
Create a LocalFileStore
Use a ParentDocumentRetriever
Expected behavior
Serialize the documents as bytes