langchain index incremental mode fails to detect existing documents once they exceed the default batch_size #19335
Comments
sorry, I should provide more info about the package I use.

Could you output the contents of:

I am unable to reproduce this issue @sukiluvcode. Could you provide system information, and feel free to review the unit tests to see if you can suggest how to reproduce.

Could you confirm that

My package info:
@tianhanwen Below is the reproducible code:

```python
from chromadb import PersistentClient
from langchain.vectorstores.chroma import Chroma
from langchain.indexes import SQLRecordManager, index
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

chroma_client = PersistentClient()
embedding = OpenAIEmbeddings()
chroma_db = Chroma(client=chroma_client, collection_name="vector_database", embedding_function=embedding)
record_manager = SQLRecordManager(namespace="chroma/vector_database", db_url="sqlite:///record_manager.db")
record_manager.create_schema()

def create_vector_db_test():
    # Three documents from the same source; batch_size=2 splits them across two batches.
    docs = [
        Document(page_content="1", metadata={"source": 1}),
        Document(page_content="2", metadata={"source": 1}),
        Document(page_content="3", metadata={"source": 1}),
    ]
    info = index(
        docs_source=docs,
        record_manager=record_manager,
        vector_store=chroma_db,
        cleanup="incremental",
        source_id_key="source",
        batch_size=2,
    )
    return info

def clear_database():
    # Indexing an empty source with cleanup="full" removes everything.
    info = index(
        docs_source=[],
        record_manager=record_manager,
        vector_store=chroma_db,
        cleanup="full",
        source_id_key="source",
    )
    return info

clear_info = clear_database()
first_create_info = create_vector_db_test()
second_create_info = create_vector_db_test()
print(f"clear info: {clear_info}, first create: {first_create_info}, second create: {second_create_info}")
```

The issue is that during the second indexing run, the first batch was identified and skipped, but the second batch failed. My scenario is that documents from the same source are split into different chunks, which causes the problem.
Can anyone tell me why calling this function returns the error "'Index' object is not callable"?
@AliHaider0343, hello, index is a function; please make sure that you are importing it correctly (see the snippet below), or maybe it's your langchain version. Run:
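For reference, this is the import used in the reproduction code above:

```python
# `index` is a function exported by langchain.indexes; calling an instance of
# some other class named `Index` (e.g. from another package) would raise
# "'Index' object is not callable".
from langchain.indexes import index
```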
@sukiluvcode, as far as I can tell the code is working correctly. For your attached code, could you write what you expect to see vs. what you actually see? What does it mean that the batch failed? Could you include the stack trace? I don't think the issue is the indexing API; it could be the vectorstore itself. I added even more unit tests that cover the above test case, and everything is passing (#20387).
@eyurtsev I changed the vector db from Chroma to the in-memory database you used in the unit tests, but it still does not work as I expected. This means that during my second indexing run, it deletes and recreates the embeddings for unchanged docs, except for the first batch.
output:
@eyurtsev, I believe I've identified the reason: when deleting in incremental mode, the mechanism removes documents from the same source if they're not found in the current batch. In my scenario, during the first batch of the re-indexing run, two documents that had not yet been seen in that run were deleted. While I understand this situation is somewhat uncommon, the solution would be to increase the batch size.
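To illustrate the mechanism described above, here is a simplified sketch of a per-batch incremental cleanup (illustrative only, not langchain's actual implementation; `doc_uid` is a hypothetical stand-in for the library's internal content hashing):

```python
import hashlib
import json

def doc_uid(doc):
    # Hypothetical stand-in for langchain's internal content hashing.
    payload = json.dumps([doc.page_content, doc.metadata], sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()

def cleanup_batch_incremental(batch, record_manager, vector_store):
    # Simplified per-batch incremental cleanup: every key recorded under a
    # source id that appears in this batch, but which is not itself part of
    # the batch, is treated as stale and deleted.
    batch_uids = {doc_uid(doc) for doc in batch}
    batch_sources = [doc.metadata["source"] for doc in batch]

    stale = set(record_manager.list_keys(group_ids=batch_sources)) - batch_uids
    if stale:
        vector_store.delete(ids=list(stale))
        record_manager.delete_keys(list(stale))

# With batch_size=2, doc "3" shares source 1 with batch ["1", "2"] but is not
# in that batch, so it is deleted during batch 1 and re-added during batch 2.
```

This also makes the suggested workaround concrete: with a batch_size of at least 3, all documents from source 1 land in the same batch and nothing is flagged as stale.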
Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
I ran the function twice; below is the info returned by the second run.
{'num_added': 42, 'num_updated': 0, 'num_skipped': 100, 'num_deleted': 42}
Description
This is not what I expected: since all 142 documents (42 added + 100 skipped) were unchanged, the second run should report num_skipped: 142 with no deletions, instead of skipping only 100 and deleting and re-adding 42. I think there is something wrong with the record manager. I hope the langchain developers can fix it.
System Info
langchain = 0.1.12
windows 11
python = 3.10