
langchain index incremental mode fails to detect existing documents once they exceed the default batch_size #19335

Open
5 tasks done
sukiluvcode opened this issue Mar 20, 2024 · 11 comments
Assignees
Labels
🔌: chroma Primarily related to ChromaDB integrations unable-to-reproduce Ɑ: vector store Related to vector store module

Comments

@sukiluvcode

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import logging

from chromadb import PersistentClient
from langchain.vectorstores.chroma import Chroma
from langchain.indexes import SQLRecordManager, index
from langchain_openai import OpenAIEmbeddings

from matextract_langchain_prototype.text_splitter import MatSplitter

TEST_FILE_NAME = "nlo_test.txt"
# logging.basicConfig(level=10)
# logger = logging.getLogger(__name__)

def create_vector_db():
    """simply create a vector database for a paper"""
    chroma_client = PersistentClient()
    embedding = OpenAIEmbeddings()
    chroma_db = Chroma(client=chroma_client, collection_name="vector_database", embedding_function=embedding)
    record_manager = SQLRecordManager(namespace="chroma/vector_database", db_url="sqlite:///record_manager.db")
    record_manager.create_schema()

    text_splitter = MatSplitter(chunk_size=100, chunk_overlap=0)
    with open(TEST_FILE_NAME, encoding='utf-8') as file:
        content = file.read()
    documents = text_splitter.create_documents([content], [{"source": TEST_FILE_NAME}])

    info = index(
        docs_source=documents,
        record_manager=record_manager,
        vector_store=chroma_db,
        cleanup="incremental",
        source_id_key="source",
        batch_size=100
    )
    print(info)

Error Message and Stack Trace (if applicable)

I ran the function twice; below is the info returned by the second run.
{'num_added': 42, 'num_updated': 0, 'num_skipped': 100, 'num_deleted': 42}

Description

This is not what I expected: it should report num_skipped: 142 instead of 100, since nothing changed between runs. I think there is something wrong with the record manager. I hope the LangChain developers can fix it.
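Spelling out the arithmetic (the 142 total is inferred from the numbers reported above):

```python
# Second-run figures reported above:
observed = {"num_added": 42, "num_updated": 0, "num_skipped": 100, "num_deleted": 42}
batch_size = 100

# 100 skipped + 42 re-added = 142 chunks in total
total_chunks = observed["num_skipped"] + observed["num_added"]

# Since nothing changed between the two runs, the expected second-run result:
expected = {"num_added": 0, "num_updated": 0, "num_skipped": total_chunks, "num_deleted": 0}
```

Note that the number of skipped documents exactly equals batch_size, which hints the first batch is handled correctly and the problem starts at the batch boundary.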

System Info

langchain = 0.1.12
windows 11
python = 3.10

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🔌: chroma Primarily related to ChromaDB integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 20, 2024
@sukiluvcode
Author

Sorry, I should have provided more info about the packages I use:
chromadb = 0.4.24
As for the MatSplitter class, it's a simple customized text splitter, and its output is deterministic.

@eyurtsev
Collaborator

Could you output the contents of: python -m langchain_core.sys_info

@eyurtsev
Collaborator

eyurtsev commented Mar 20, 2024

I am unable to reproduce this issue @sukiluvcode. Could you provide system information? Also feel free to review the unit tests and suggest how to reproduce it.

@eyurtsev
Collaborator

Could you confirm that text_splitter = MatSplitter(chunk_size=100, chunk_overlap=0) works as expected and that it propagates source information?

@eyurtsev eyurtsev added unable-to-reproduce and removed 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 20, 2024
@sukiluvcode
Author

My package info:

langchain_core: 0.1.32
langchain: 0.1.12
langchain_community: 0.0.28
langsmith: 0.1.31
langchain_openai: 0.0.8
langchain_text_splitters: 0.0.

Could you confirm that text_splitter = MatSplitter(chunk_size=100, chunk_overlap=0) works as expected and that it propagates source information?

@sukiluvcode
Author

sukiluvcode commented Mar 21, 2024

@tianhanwen Below is reproducible code:

from chromadb import PersistentClient
from langchain.vectorstores.chroma import Chroma
from langchain.indexes import SQLRecordManager, index
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

chroma_client = PersistentClient()
embedding = OpenAIEmbeddings()
chroma_db = Chroma(client=chroma_client, collection_name="vector_database", embedding_function=embedding)
record_manager = SQLRecordManager(namespace="chroma/vector_database", db_url="sqlite:///record_manager.db")
record_manager.create_schema()

def create_vector_db_test():
    docs = [
        Document(page_content="1", metadata={"source": 1}),
        Document(page_content="2", metadata={"source": 1}),
        Document(page_content="3", metadata={"source": 1}),
    ]
    info = index(
        docs_source=docs,
        record_manager=record_manager,
        vector_store=chroma_db,
        cleanup="incremental",
        source_id_key="source",
        batch_size=2
    )
    return info

def clear_database():
    info = index(
        docs_source=[],
        record_manager=record_manager,
        vector_store=chroma_db,
        cleanup="full",
        source_id_key="source"
    )
    return info

clear_info = clear_database()
first_create_info = create_vector_db_test()
second_create_info = create_vector_db_test()
print(f"clear info: {clear_info}, first create: {first_create_info}, second create: {second_create_info}")

The issue is that during the second indexing run, the first batch was correctly identified and skipped, but the second batch was not. My scenario is that documents from the same source are split across different batches, which causes the problem.
Thanks in advance!

@AliHaider0343

Can anyone tell me why calling this function returns the error "'Index' object is not callable"?

info = index(
    docs_source=texts,
    record_manager=record_manager,
    vector_store=chroma_db,
    cleanup="incremental",
    source_id_key="source"
)

@sukiluvcode
Author

@AliHaider0343 Hello, index is a function; please make sure you are importing it:

from langchain.indexes import index

Or maybe it's your langchain version. Run:

pip install -U langchain
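For what it's worth, that error usually means the name index is bound to a non-callable object rather than the imported function. A stdlib-only sketch of the failure mode (the Index class and the index stand-in here are hypothetical, not LangChain code):

```python
class Index:
    """Hypothetical stand-in for whatever object is shadowing the name."""

def index(docs_source, **kwargs):
    """Stand-in for langchain.indexes.index, which is a plain function."""
    return {"num_added": len(docs_source)}

index_fn = index   # keep a reference to the real function
index = Index()    # rebinding the name shadows the function

try:
    index(docs_source=[])  # raises TypeError: 'Index' object is not callable
    message = ""
except TypeError as err:
    message = str(err)
```

So it is worth checking whether a variable named index is assigned somewhere earlier in the script, in addition to verifying the import and the installed version.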

@eyurtsev
Collaborator

@sukiluvcode as far as I can tell the code is working correctly.

For your attached code could you write what you expect to see vs. what you see? What does it mean that the batch failed? Could you include the stack trace?

I don't think the issue is the indexing API; it could be the vectorstore itself.

I added even more unit tests that cover the above test case; everything is passing (#20387).

@sukiluvcode
Author

@eyurtsev I changed the vector db from Chroma to the in-memory database you used in the unit tests, but it still does not work as I expected.

What does it mean that the batch failed?

It means that during my second indexing run, embeddings for unchanged docs were deleted and recreated in every batch except the first.
Below is my refactored code.

from langchain.indexes import SQLRecordManager, index
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_community.vectorstores.inmemory import InMemoryVectorStore

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

embedding = OpenAIEmbeddings()
db = InMemoryVectorStore(embedding=embedding)
record_manager = SQLRecordManager(namespace="memo/vector_database", db_url="sqlite:///record_manager.db")
record_manager.create_schema()

def create_vector_db_test():
    docs = [
        Document(page_content="1", metadata={"source": 1}),
        Document(page_content="2", metadata={"source": 1}),
        Document(page_content="3", metadata={"source": 1}),
        Document(page_content="4", metadata={"source": 1}),
    ]
    info = index(
        docs_source=docs,
        record_manager=record_manager,
        vector_store=db,
        cleanup="incremental",
        source_id_key="source",
        batch_size=2
    )
    return info

def clear_database():
    info = index(
        docs_source=[],
        record_manager=record_manager,
        vector_store=db,
        cleanup="full",
        source_id_key="source"
    )
    return info

clear_info = clear_database()
first_create_info = create_vector_db_test()
second_create_info = create_vector_db_test()
print(f"clear info: {clear_info}, first create: {first_create_info}, second create: {second_create_info}")

output:
clear info: {'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}, first create: {'num_added': 4, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}, second create: {'num_added': 2, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 2}
Is my understanding of incremental mode wrong, or is this actually a bug? 😕

@sukiluvcode
Author

sukiluvcode commented May 4, 2024

@eyurtsev, I believe I've identified the cause: in incremental mode, the cleanup removes documents from the same source if they're not found in the current batch. In my scenario, during the first batch of re-indexing, the two documents that had not yet appeared in that run were deleted. While I understand this situation is somewhat uncommon, the workaround is to increase the batch size.
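The mechanism described above can be sketched as a toy, stdlib-only model (a simplified reading of the incremental cleanup, not LangChain's actual implementation): every record touched during the current run gets a fresh timestamp, and after each batch, same-source records stamped before the run started are purged as stale.

```python
import itertools

_clock = itertools.count()  # fake monotonically increasing timestamps

def incremental_index(docs, records, batch_size):
    """Toy model of incremental-mode indexing (my reading, not LangChain's code).

    docs: list of (doc_id, source) pairs; records: dict doc_id -> (source, ts).
    Docs touched this run get a fresh timestamp; after each batch, records
    sharing a source with the batch but stamped before this run started are
    treated as stale and deleted.
    """
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    index_start = next(_clock)
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        batch_sources = {source for _, source in batch}
        for doc_id, source in batch:
            stats["num_skipped" if doc_id in records else "num_added"] += 1
            records[doc_id] = (source, next(_clock))  # touch the timestamp
        # incremental cleanup runs per batch, not once at the end of the run:
        stale = [d for d, (s, ts) in records.items()
                 if s in batch_sources and ts < index_start]
        for d in stale:
            del records[d]
            stats["num_deleted"] += 1
    return stats
```

With batch_size=2 and four same-source docs this reproduces the 2 added / 2 skipped / 2 deleted figures above; with batch_size=4 (large enough to hold every chunk from the source in one batch) the second run skips all four and deletes nothing, which is why increasing the batch size works around the behavior.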
