[Pinecone] Controlling index IDs #8118
-
When testing the Pinecone functionality within LangChain, it's not quite clear to me how indexing works. If I insert a couple of documents that are chunked into smaller parts, how are the resulting records indexed? Is there a standard way of giving those records an ID? And how do I update one document, which consists of an unknown number of chunks stored as records in the database, if I don't know their IDs? I'm getting the feeling that I'd be better off using the Pinecone API directly so I can keep a list of the IDs myself. Is there a good way of defining IDs per document when a document consists of multiple records? In the end you want to update a whole document when a new version exists, so you need to be able to select all the records belonging to that file inside the vector database. Hopefully somebody has some nice thoughts. Thank you!
-
Answer generated by a 🤖

I understand that you're trying to figure out how indexing works within the Pinecone functionality of LangChain, particularly when dealing with documents that are chunked into smaller parts. You're also interested in how to update a document consisting of multiple records when the IDs of those records are unknown.

In LangChain, when you insert texts through the Pinecone vectorstore (e.g., via `from_texts` or `add_texts`) without passing explicit `ids`, a random UUID is generated for each chunk, so there is no built-in link between a source document and the IDs of its chunks.

If you need to update a document, you'll need to know its IDs. If a document is split into multiple records, each with its own ID, you'll need to maintain a record of which IDs correspond to which document. This could be achieved by storing a mapping from document names to lists of IDs in a separate data structure. To update a whole document, you would delete the old records corresponding to that document and insert the new ones; the underlying Pinecone index's `delete` method can be used for the deletion.

Please note that this is one potential solution based on the current functionality of LangChain. If you find this solution helpful and believe it could benefit other users, I encourage you to make a pull request to update the documentation. This way, you can contribute to making LangChain more user-friendly for everyone.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
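For concreteness, here is a minimal sketch of that delete-and-reinsert approach. It assumes pinecone-client 2.x and the langchain Pinecone wrapper of that era; the API key, environment, index name, document text, and the `doc-42` ID scheme are all placeholders, not anything prescribed by either library:

```python
# Minimal sketch: choose your own vector IDs so a whole document can be
# updated later. Assumes pinecone-client 2.x and langchain's Pinecone
# wrapper; keys, names, and the ID scheme below are placeholders.
import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("my-index")  # placeholder index name
docsearch = Pinecone(index, OpenAIEmbeddings().embed_query, "text")

doc_id = "doc-42"  # the document's ID in your own system
full_document_text = "...the raw document contents..."

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(full_document_text)

# Derive one deterministic vector ID per chunk from the document ID.
chunk_ids = [f"{doc_id}-chunk-{i}" for i in range(len(chunks))]
docsearch.add_texts(chunks, ids=chunk_ids)

# Persist {doc_id: chunk_ids} somewhere: a new version of the document may
# produce a different number of chunks, so you need the old ID list.
# Updating then becomes: delete the old vectors, re-split, re-insert.
index.delete(ids=chunk_ids)
```

The trade-off is that you have to persist the ID bookkeeping yourself, which is exactly what the metadata approach in the reply below avoids.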
-
So after some extra research I think I have a good solution that doesn't cost much performance. I'm now adding the external document ID from whatever service we're using to the metadata of the extracted document. This metadata is duplicated onto every chunk when chunking, embedding, and inserting into the vector database. When I want to delete one full document, I can just filter on that metadata. Very powerful and quick.
```python
docsearch.delete(filter={'external_id': '2'})
```
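To flesh that one-liner out, here is a hedged sketch of the metadata approach end to end. It reuses the `docsearch` vectorstore from the earlier sketch, and the `external_id` value, chunk sizes, and document texts are placeholders:

```python
# Sketch: tag every chunk with its parent document's external ID via
# metadata, then delete/update whole documents with a metadata filter.
# Reuses `docsearch` from the sketch above; values are placeholders.
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

doc = Document(
    page_content="...full document text...",
    metadata={"external_id": "2"},  # ID from the upstream document service
)

# split_documents copies the parent's metadata onto every chunk it
# produces, so every chunk of this document carries external_id = "2".
docsearch.add_documents(splitter.split_documents([doc]))

# Updating the document later is a filtered delete plus a fresh insert.
# (Note: Pinecone supports delete-by-metadata-filter on pod-based indexes.)
docsearch.delete(filter={"external_id": "2"})
new_doc = Document(
    page_content="...the updated text...",
    metadata={"external_id": "2"},
)
docsearch.add_documents(splitter.split_documents([new_doc]))
```

The nice property of this design is that no separate document-to-IDs mapping has to be maintained: the grouping lives in the vectors' own metadata.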