Type error in ParentDocumentRetriever using LocalFileStore #9345
🤖 Hello,

Thank you for bringing this issue to our attention. The error you're encountering is due to the fact that `LocalFileStore` only accepts byte values, while `ParentDocumentRetriever` hands it `Document` objects. To resolve this issue, you need to serialize the documents before storing them, for example with `pickle`:

```python
import pickle

# ...

def add_documents(
    self,
    documents: List[Document],
    ids: Optional[List[str]],
    add_to_docstore: bool = True,
) -> None:
    # ...
    if add_to_docstore:
        # Serialize the documents before storing them
        serialized_docs = [(id, pickle.dumps(doc)) for id, doc in full_docs]
        self.docstore.mset(serialized_docs)
```

In this code, `pickle.dumps` converts each `Document` into bytes before it is written to the store. Please note that you will also need to deserialize the documents when retrieving them from the docstore. I hope this helps! If you have any other questions or run into any other issues, please don't hesitate to ask.
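To make the round trip concrete, here is a minimal stdlib-only sketch of the serialize-on-write, deserialize-on-read pattern the bot describes. `Document` and `ByteStore` below are hypothetical stand-ins for the LangChain classes, not the real API:

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class Document:
    # Stand-in for langchain's Document (illustrative only)
    page_content: str
    metadata: dict = field(default_factory=dict)

class ByteStore:
    """Stand-in for LocalFileStore: accepts and returns only bytes."""
    def __init__(self):
        self._data = {}

    def mset(self, pairs):
        for key, value in pairs:
            if not isinstance(value, bytes):
                # This mirrors the TypeError reported in the issue
                raise TypeError(
                    f"a bytes-like object is required, not {type(value).__name__!r}"
                )
            self._data[key] = value

    def mget(self, keys):
        return [self._data.get(k) for k in keys]

store = ByteStore()
docs = [("doc-1", Document("parent text", {"source": "a.txt"}))]

# Serialize on write...
store.mset([(doc_id, pickle.dumps(doc)) for doc_id, doc in docs])
# ...and deserialize on read.
restored = [pickle.loads(b) for b in store.mget(["doc-1"])]
print(restored[0].page_content)  # -> parent text
```

Passing a raw `Document` to `mset` reproduces the `memoryview` error; serializing first avoids it.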
Thanks!

Same problem, fix it please.

And with RedisStore:

Bump on this, both RedisStore and LocalFileStore are not working.
This PR makes the following changes:
1. Documents become serializable using langchain serialization.
2. Adds a utility to create a docstore KV store.

Will help to address the issue here: #9345
@nickhausman @k1ng @Giulianini I came to this thread because I had the same issue, but after looking at the commit @eyurtsev (thank you so much!) linked to this issue, I was able to resolve my issue and wanted to make sure that was documented here. Here's what I had before that didn't work:

```python
store = LocalFileStore("./store_location")
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = ...
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```

Here are the changes I made that got it working:

```python
from langchain.storage._lc_store import create_kv_docstore

# ...
fs = LocalFileStore("./store_location")
store = create_kv_docstore(fs)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = ...
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```

It looks like this was made available since v0.0.277. I hope this helps!
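Conceptually, `create_kv_docstore` wraps a bytes-only store so that `Document`s are serialized on write and deserialized on read. This stdlib-only sketch shows the pattern; it is an illustration, not the real implementation (LangChain uses its own serialization rather than `pickle`, and the class names here are made up):

```python
import pickle

class KVDocstoreWrapper:
    """Illustrative wrapper: turns a bytes-only store into a document store.
    Conceptual sketch only; not LangChain's actual create_kv_docstore."""
    def __init__(self, byte_store):
        self._store = byte_store

    def mset(self, key_doc_pairs):
        # Encode values to bytes before handing them to the byte store
        self._store.mset([(k, pickle.dumps(d)) for k, d in key_doc_pairs])

    def mget(self, keys):
        # Decode bytes back into objects on the way out
        return [None if b is None else pickle.loads(b)
                for b in self._store.mget(keys)]

class DictByteStore:
    """Minimal in-memory byte store standing in for LocalFileStore."""
    def __init__(self):
        self._d = {}
    def mset(self, pairs):
        self._d.update(dict(pairs))
    def mget(self, keys):
        return [self._d.get(k) for k in keys]

store = KVDocstoreWrapper(DictByteStore())
store.mset([("id-1", {"page_content": "hello"})])
print(store.mget(["id-1"])[0])  # -> {'page_content': 'hello'}
```

This is why the wrapped store works where the bare `LocalFileStore` raised the `TypeError`: the retriever still sees objects, while the underlying store only ever sees bytes.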
@cam-barts This helped a lot! What is confusing to me is the file store `fs` versus the vectorstore. E.g. my code here using Chroma as the vectorstore:

In several examples I saw that the LocalFileStore path and the persist_directory path are different. Why is this? And if I want to use the saved ParentDocumentRetriever in a RetrievalQA chain, which folder do I have to load?
@weissenbacherpwc glad it helped. To the best of my knowledge, the LocalFileStore path and the Chroma persist_directory are independent: the file store holds the serialized parent documents, while persist_directory holds the vector index, so they can live in different folders.

I think when you initialize most chains that use docs in this way, you pass in the retriever itself and not a path. So in this case, if you wanted to use the newly created retriever:

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI  # Or whatever makes sense for you

my_llm = OpenAI()  # Or whatever makes sense for you
# ...
big_chunks_retriever.add_documents(documents)
retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=big_chunks_retriever)
```
Got it with the directories, thanks! In the script, the easiest way would be to use the created retriever directly.
@weissenbacherpwc correct! I usually have different ingestion and use scripts, but the way that you build the retriever will be the same. So in one part of your process, you do document upload, and in another part you just use the retriever. For example, your function that you shared would be really good for ingest. Depending on my requirements, I'd either return the retriever from that function to be used elsewhere, or I'd rebuild it where I need it. Here are two examples:

```python
def run_db_build():
    # ....
    big_chunks_retriever.add_documents(documents)
    return big_chunks_retriever

retriever = run_db_build()  # get retriever object as a global to be reused

def run_qa_chain(query):
    retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=retriever)  # use global retriever
    return retrievalQA.run(query)
```

Alternatively:

```python
# src/ingest.py
def run_db_build():
    # ....
```

```python
# src/build_retriever.py
def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'},
                                       encode_kwargs={'device': 'mps', 'batch_size': 32})
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/")
    big_chunks_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    return big_chunks_retriever
```

That will return the same usable retriever, without needing to actually load any docs.
This PR aims at adding support for document storage in a SQL database: #15909
I am having the same problem with LocalFileStore, but with MultiVectorRetriever. I get the type error when I try to store texts; no type error when I store images. So when I use create_kv_docstore as a workaround, my problem is the other way around: a type error with images, no type error with texts. The texts are a list of strings and not Documents at that point. With InMemoryStore everything is fine, but that's not an option for real use.
Was this ever fixed for Redis? I'm getting the same error when adding documents to the retriever.
Thanks for the example. One question: in your example of rebuild_retriever, are the two text splitters actually needed?
Any fix to the Redis store? I have the same issue when using it with MultiVectorRetriever as the docstore.
@rchen19 Yes, the two text splitters are not useful here and could be removed. In fact, the MultiVectorRetriever doesn't take splitter arguments at all:

```python
# src/build_retriever.py
from langchain.retrievers import MultiVectorRetriever

def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'},
                                       encode_kwargs={'device': 'mps', 'batch_size': 32})
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/")
    big_chunks_retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
    )
    return big_chunks_retriever
```
Nice, this works!! Is there any way to select PGVector as the vector database?
Very informative, thank you. I did not realize the only difference between the two classes is the splitting logic.
In the above code, only the parent collection is passed, which is a local dir. How is it linked to the child collection that is persisted in ChromaDB? How will the query first search for child documents and then return the corresponding parent documents? Thanks for the clarification in advance.
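At a high level, the link works because each child chunk carries its parent's id: a query first matches children in the vector store, and the retriever then fetches the parents from the docstore by those ids. Below is a minimal stdlib-only sketch of that flow, where substring matching stands in for vector similarity and all names are illustrative, not LangChain's API:

```python
# Illustrative sketch: substring search stands in for vector similarity.
parents = {}        # "docstore": parent_id -> full parent text
child_index = []    # "vector store": (child_chunk, parent_id) pairs

def add_parent(parent_id, text, chunk_size=10):
    """Store the parent, then index its child chunks tagged with its id."""
    parents[parent_id] = text
    for i in range(0, len(text), chunk_size):
        child_index.append((text[i:i + chunk_size], parent_id))

def retrieve(query):
    # 1) search the child chunks, 2) collect their parent ids,
    # 3) return the full parent documents
    hit_ids = []
    for chunk, pid in child_index:
        if query in chunk and pid not in hit_ids:
            hit_ids.append(pid)
    return [parents[pid] for pid in hit_ids]

add_parent("p1", "alpha beta gamma delta")
add_parent("p2", "epsilon zeta eta")
print(retrieve("beta"))  # -> ['alpha beta gamma delta']
```

The real retriever does the same three steps, but with embeddings for step 1 and the configured docstore for step 3; the parent id travels in the child chunk's metadata.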
Without loading any docs, we won't get any context, right? In a production environment, I'd have to add the docs so we can retrieve something, and adding the documents while kicking off the system will take some time. Part of the reason I want to load the vector store from a Chroma DB is faster retrieval and skipping the indexing step. Is there a faster way to add docs, or to skip that step and still get context in a DB environment? Loading the files and then adding docs takes time. How do we skip that part and make the docs part of the Chroma DB while still using ParentDocument retrieval?
Great! I will try it, genius!!!
Hi @cam-barts, thanks for your solution.
Good question! Same doubt about the process with child documents (stored in the vector store, e.g. Chroma) and parent documents (stored in LocalFileStore): how are they handled when I send a query to the LLM?
Touched base with some LangChain folks on this issue. One way of handling this without overriding the default is to use the EncoderBackedStore class. We did this with Redis as our base store, but it would be extendable.
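The EncoderBackedStore idea is to wrap any base store with a key encoder plus value serializer/deserializer functions. Here is a stdlib-only sketch of that pattern, where a plain dict stands in for the Redis base store; the class and parameter names mirror the concept but are not guaranteed to match LangChain's exact signature:

```python
import json

class EncoderBackedStoreSketch:
    """Pattern sketch: delegate storage, but encode keys and
    serialize/deserialize values with user-supplied functions."""
    def __init__(self, store, key_encoder, value_serializer, value_deserializer):
        self.store = store
        self.key_encoder = key_encoder
        self.value_serializer = value_serializer
        self.value_deserializer = value_deserializer

    def mset(self, pairs):
        # Values reach the base store only as encoded bytes
        self.store.update(
            (self.key_encoder(k), self.value_serializer(v)) for k, v in pairs
        )

    def mget(self, keys):
        return [
            self.value_deserializer(self.store.get(self.key_encoder(k)))
            for k in keys
        ]

base = {}  # stands in for a Redis client here
store = EncoderBackedStoreSketch(
    base,
    key_encoder=lambda k: f"doc:{k}",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    value_deserializer=lambda b: None if b is None else json.loads(b),
)
store.mset([("42", {"page_content": "parent doc"})])
print(store.mget(["42"])[0])  # -> {'page_content': 'parent doc'}
```

Because the encoding is injected rather than baked in, the same wrapper works over Redis, the filesystem, or any other bytes-oriented backend.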
@rbs333 If I have a huge number of documents, how do I write scalable logic for this?
Bug
LocalFileStore tries to treat a Document as bytes.
Here is the broken method:

```
TypeError: memoryview: a bytes-like object is required, not 'Document'
```
Who can help?
No response
Information
Related Components
Reproduction
Create a LocalFileStore
Use a ParentDocumentRetriever
Expected behavior
Serialize the documents as bytes