
langchain index incremental mode fails to detect existing documents once they exceed the default batch_size #19335

Open
5 tasks done
sukiluvcode opened this issue Mar 20, 2024 · 11 comments
Assignees
Labels
🔌: chroma Primarily related to ChromaDB integrations unable-to-reproduce Ɑ: vector store Related to vector store module

Comments

@sukiluvcode

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import logging

from chromadb import PersistentClient
from langchain.vectorstores.chroma import Chroma
from langchain.indexes import SQLRecordManager, index
from langchain_openai import OpenAIEmbeddings

from matextract_langchain_prototype.text_splitter import MatSplitter

TEST_FILE_NAME = "nlo_test.txt"
# logging.basicConfig(level=10)
# logger = logging.getLogger(__name__)

def create_vector_db():
    """simply create a vector database for a paper"""
    chroma_client = PersistentClient()
    embedding = OpenAIEmbeddings()
    chroma_db = Chroma(client=chroma_client, collection_name="vector_database", embedding_function=embedding)
    record_manager = SQLRecordManager(namespace="chroma/vector_database", db_url="sqlite:///record_manager.db")
    record_manager.create_schema()

    text_splitter = MatSplitter(chunk_size=100, chunk_overlap=0)
    with open(TEST_FILE_NAME, encoding='utf-8') as file:
        content = file.read()
    documents = text_splitter.create_documents([content], [{"source": TEST_FILE_NAME}])

    info = index(
        docs_source=documents,
        record_manager=record_manager,
        vector_store=chroma_db,
        cleanup="incremental",
        source_id_key="source",
        batch_size=100
    )
    print(info)

Error Message and Stack Trace (if applicable)

I ran the function twice; below is the info returned by the second run.
{'num_added': 42, 'num_updated': 0, 'num_skipped': 100, 'num_deleted': 42}

Description

This is not what I expected: it should report num_skipped: 142 instead of 100, since nothing changed between runs. I think there is something wrong with the record manager. I hope the LangChain developers can fix it.
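Spelling out the arithmetic (the 142 total is inferred from the numbers reported above):

```python
# Second-run figures reported above:
observed = {"num_added": 42, "num_updated": 0, "num_skipped": 100, "num_deleted": 42}
batch_size = 100

# 100 skipped + 42 re-added = 142 chunks in total
total_chunks = observed["num_skipped"] + observed["num_added"]

# Since nothing changed between the two runs, the expected second-run result:
expected = {"num_added": 0, "num_updated": 0, "num_skipped": total_chunks, "num_deleted": 0}
```

Note that the number of skipped documents exactly equals batch_size, which hints the first batch is handled correctly and the problem starts at the batch boundary.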

System Info

langchain = 0.1.12
windows 11
python = 3.10

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🔌: chroma Primarily related to ChromaDB integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 20, 2024
@sukiluvcode
Author

Sorry, I should have provided more info about the packages I use:
chromadb = 0.4.24
As for the MatSplitter class, it's a simple customized text splitter, and its output is deterministic.

@eyurtsev
Collaborator

Could you output the contents of: python -m langchain_core.sys_info

@eyurtsev
Collaborator

eyurtsev commented Mar 20, 2024

I am unable to reproduce this issue @sukiluvcode. Could you provide system information? Also feel free to review the unit tests and suggest how to reproduce it.

@eyurtsev
Collaborator

Could you confirm that text_splitter = MatSplitter(chunk_size=100, chunk_overlap=0) works as expected and that it propagates source information?

@eyurtsev eyurtsev added unable-to-reproduce and removed 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 20, 2024
@sukiluvcode
Author

My package info:

langchain_core: 0.1.32
langchain: 0.1.12
langchain_community: 0.0.28
langsmith: 0.1.31
langchain_openai: 0.0.8
langchain_text_splitters: 0.0.

Could you confirm that text_splitter = MatSplitter(chunk_size=100, chunk_overlap=0) works as expected and that it propagates source information?

@sukiluvcode
Author

sukiluvcode commented Mar 21, 2024

@tianhanwen Below is reproducible code:

from chromadb import PersistentClient
from langchain.vectorstores.chroma import Chroma
from langchain.indexes import SQLRecordManager, index
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

chroma_client = PersistentClient()
embedding = OpenAIEmbeddings()
chroma_db = Chroma(client=chroma_client, collection_name="vector_database", embedding_function=embedding)
record_manager = SQLRecordManager(namespace="chroma/vector_database", db_url="sqlite:///record_manager.db")
record_manager.create_schema()

def create_vector_db_test():
    docs = [
        Document(page_content="1", metadata={"source": 1}),
        Document(page_content="2", metadata={"source": 1}),
        Document(page_content="3", metadata={"source": 1}),
    ]
    info = index(
        docs_source=docs,
        record_manager=record_manager,
        vector_store=chroma_db,
        cleanup="incremental",
        source_id_key="source",
        batch_size=2
    )
    return info

def clear_database():
    info = index(
        docs_source=[],
        record_manager=record_manager,
        vector_store=chroma_db,
        cleanup="full",
        source_id_key="source"
    )
    return info

clear_info = clear_database()
first_create_info = create_vector_db_test()
second_create_info = create_vector_db_test()
print(f"clear info: {clear_info}, first create: {first_create_info}, second create: {second_create_info}")

The issue is that during the second indexing run, the first batch was correctly identified and skipped, but the second batch was not. My scenario is that documents from the same source are split across different batches, which causes the problem.
Thanks in advance!

@AliHaider0343

Can anyone tell me why calling this function returns the error "'Index' object is not callable"?

info = index(
    docs_source=texts,
    record_manager=record_manager,
    vector_store=chroma_db,
    cleanup="incremental",
    source_id_key="source"
)

@sukiluvcode
Author

@AliHaider0343 Hello, index is a function; please make sure you are importing it:

from langchain.indexes import index

Or maybe it's your langchain version. Run:

pip install -U langchain
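For what it's worth, that error usually means the name index is bound to a non-callable object rather than the imported function. A stdlib-only sketch of the failure mode (the Index class and the index stand-in here are hypothetical, not LangChain code):

```python
class Index:
    """Hypothetical stand-in for whatever object is shadowing the name."""

def index(docs_source, **kwargs):
    """Stand-in for langchain.indexes.index, which is a plain function."""
    return {"num_added": len(docs_source)}

index_fn = index   # keep a reference to the real function
index = Index()    # rebinding the name shadows the function

try:
    index(docs_source=[])  # raises TypeError: 'Index' object is not callable
    message = ""
except TypeError as err:
    message = str(err)
```

So it is worth checking whether a variable named index is assigned somewhere earlier in the script, in addition to verifying the import and the installed version.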

@eyurtsev
Collaborator

@sukiluvcode as far as I can tell the code is working correctly.

For your attached code could you write what you expect to see vs. what you see? What does it mean that the batch failed? Could you include the stack trace?

I don't think the issue is the indexing API; it could be the vectorstore itself.

I added even more unit tests that cover the above test case; everything is passing (#20387).

@sukiluvcode
Author

@eyurtsev I changed the vector db from Chroma to the in-memory database you used in the unit tests, but it still does not work as I expected.

What does it mean that the batch failed?

It means that during my second indexing run, embeddings for unchanged docs were deleted and recreated in every batch except the first.
Below is my refactored code.

from langchain.indexes import SQLRecordManager, index
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_community.vectorstores.inmemory import InMemoryVectorStore

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

embedding = OpenAIEmbeddings()
db = InMemoryVectorStore(embedding=embedding)
record_manager = SQLRecordManager(namespace="memo/vector_database", db_url="sqlite:///record_manager.db")
record_manager.create_schema()

def create_vector_db_test():
    docs = [
        Document(page_content="1", metadata={"source": 1}),
        Document(page_content="2", metadata={"source": 1}),
        Document(page_content="3", metadata={"source": 1}),
        Document(page_content="4", metadata={"source": 1}),
    ]
    info = index(
        docs_source=docs,
        record_manager=record_manager,
        vector_store=db,
        cleanup="incremental",
        source_id_key="source",
        batch_size=2
    )
    return info

def clear_database():
    info = index(
        docs_source=[],
        record_manager=record_manager,
        vector_store=db,
        cleanup="full",
        source_id_key="source"
    )
    return info

clear_info = clear_database()
first_create_info = create_vector_db_test()
second_create_info = create_vector_db_test()
print(f"clear info: {clear_info}, first create: {first_create_info}, second create: {second_create_info}")

output:
clear info: {'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}, first create: {'num_added': 4, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}, second create: {'num_added': 2, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 2}
Is my understanding of incremental mode wrong, or is this actually a bug? 😕

@sukiluvcode
Author

sukiluvcode commented May 4, 2024

@eyurtsev, I believe I've identified the cause: in incremental mode, the cleanup removes documents from the same source if they're not found in the current batch. In my scenario, during the first batch of re-indexing, the two documents that had not yet appeared in that run were deleted. While I understand this situation is somewhat uncommon, the workaround is to increase the batch size.
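The mechanism described above can be sketched as a toy, stdlib-only model (a simplified reading of the incremental cleanup, not LangChain's actual implementation): every record touched during the current run gets a fresh timestamp, and after each batch, same-source records stamped before the run started are purged as stale.

```python
import itertools

_clock = itertools.count()  # fake monotonically increasing timestamps

def incremental_index(docs, records, batch_size):
    """Toy model of incremental-mode indexing (my reading, not LangChain's code).

    docs: list of (doc_id, source) pairs; records: dict doc_id -> (source, ts).
    Docs touched this run get a fresh timestamp; after each batch, records
    sharing a source with the batch but stamped before this run started are
    treated as stale and deleted.
    """
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    index_start = next(_clock)
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        batch_sources = {source for _, source in batch}
        for doc_id, source in batch:
            stats["num_skipped" if doc_id in records else "num_added"] += 1
            records[doc_id] = (source, next(_clock))  # touch the timestamp
        # incremental cleanup runs per batch, not once at the end of the run:
        stale = [d for d, (s, ts) in records.items()
                 if s in batch_sources and ts < index_start]
        for d in stale:
            del records[d]
            stats["num_deleted"] += 1
    return stats
```

With batch_size=2 and four same-source docs this reproduces the 2 added / 2 skipped / 2 deleted figures above; with batch_size=4 (large enough to hold every chunk from the source in one batch) the second run skips all four and deletes nothing, which is why increasing the batch size works around the behavior.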
