ChromaDb EmbeddingFunction definition updated #13061

Closed
3 of 14 tasks
jayant-yadav opened this issue Nov 8, 2023 · 9 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: chroma Primarily related to ChromaDB integrations Ɑ: embeddings Related to text embedding models module Ɑ: vector store Related to vector store module

Comments

@jayant-yadav

jayant-yadav commented Nov 8, 2023

System Info

Using Google Colab Free version with T4 GPU.
chromadb==0.4.16

Who can help?

@agola11 @hwchase17

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

As per the latest ChromaDB migration logs (link), the EmbeddingFunction definition has been updated, and the change affects all custom embedding functions.
As a result, langchain.embeddings.HuggingFaceBgeEmbeddings is inconsistent with the new definition and throws the following error:

ValueError: Expected EmbeddingFunction.__call__ to have the following signature: odict_keys(['self', 'input']), got odict_keys(['self', 'args', 'kwargs'])
Please see https://docs.trychroma.com/embeddings for details of the EmbeddingFunction interface.
Please note the recent change to the EmbeddingFunction interface: https://docs.trychroma.com/migration#migration-to-0416---november-7-2023 

The above error can be reproduced by inserting documents embedded with HuggingFaceBgeEmbeddings into ChromaDB, like so:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceBgeEmbeddings
from transformers import AutoTokenizer

embedding_function = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'normalize_embeddings': True},
    query_instruction="Represent this sentence for searching relevant passages: "
)  
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
text = 'Some text that needs to be embedded.'
print(len(embedding_function.embed_query(text))) # works so far
splits = text_splitter.create_documents([text])
db = Chroma.from_documents(splits, embedding_function, persist_directory="./chroma_db")

I am not sure, but the fix might lie in correcting the HuggingFaceBgeEmbeddings class (link)?

Expected behavior

The expected behaviour is that running the following line creates a valid db object:

db = Chroma.from_documents(splits, embedding_function, persist_directory="./chroma_db")  
@dosubot dosubot bot added 🔌: chroma Primarily related to ChromaDB integrations Ɑ: embeddings Related to text embedding models module Ɑ: vector store Related to vector store module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Nov 8, 2023

dosubot bot commented Nov 8, 2023

🤖

Hello @jayant-yadav,

Nice to meet you! I'm here to assist you with your issues and questions related to the LangChain repository. I'm a bot created to provide immediate responses while you're waiting for a human maintainer. Please don't hesitate to ask me anything you need. Let's get started!

Based on the error message you provided, it seems that the EmbeddingFunction interface has indeed changed. The error message suggests that the EmbeddingFunction.__call__ method is now expected to have the signature odict_keys(['self', 'input']), but the HuggingFaceBgeEmbeddings class has the signature odict_keys(['self', 'args', 'kwargs']).
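The mismatch described above can be illustrated with a small signature check. This is only a sketch of the kind of comparison the error message implies, not ChromaDB's actual validation code; `call_parameter_names`, `matches_expected`, `OldStyle`, and `NewStyle` are hypothetical names introduced for illustration.

```python
import inspect
from typing import List


def call_parameter_names(obj: object) -> List[str]:
    """Return the parameter names of the __call__ defined on obj's class."""
    return list(inspect.signature(type(obj).__call__).parameters.keys())


def matches_expected(obj: object) -> bool:
    """Check for the (self, input) shape named in the error message."""
    return call_parameter_names(obj) == ["self", "input"]


class OldStyle:
    """Hypothetical callable with the catch-all signature reported in the error."""

    def __call__(self, *args, **kwargs):
        return []


class NewStyle:
    """Hypothetical callable with the single 'input' parameter the error asks for."""

    def __call__(self, input):
        return []


old_names = call_parameter_names(OldStyle())
new_names = call_parameter_names(NewStyle())
```

Here `old_names` corresponds to the "got" half of the error message and `new_names` to the "expected" half.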

The EmbeddingFunction is not explicitly defined in the LangChain codebase, but it is used as an interface for embedding functions. The HuggingFaceBgeEmbeddings class, which implements this interface, uses the sentence_transformers.SentenceTransformer to encode or embed the documents and queries.

To resolve this issue, you might need to update the HuggingFaceBgeEmbeddings class to match the new EmbeddingFunction interface. Specifically, you might need to change the embed_documents and embed_query methods to accept a single 'input' argument instead of 'args' and 'kwargs'. Here's an example of how you might do this:

def embed_documents(self, input: List[str]) -> List[List[float]]:
    """Compute doc embeddings using a HuggingFace transformer model.

    Args:
        input: The list of texts to embed.

    Returns:
        List of embeddings, one for each text.
    """
    input = [t.replace("\n", " ") for t in input]
    embeddings = self.client.encode(input, **self.encode_kwargs)
    return embeddings.tolist()

def embed_query(self, input: str) -> List[float]:
    """Compute query embeddings using a HuggingFace transformer model.

    Args:
        input: The text to embed.

    Returns:
        Embeddings for the text.
    """
    input = input.replace("\n", " ")
    embedding = self.client.encode(
        self.query_instruction + input, **self.encode_kwargs
    )
    return embedding.tolist()

Please note that this is just a suggestion and might not fully resolve the issue. You might need to make additional changes to the HuggingFaceBgeEmbeddings class to fully comply with the new EmbeddingFunction interface. I recommend checking the ChromaDB migration logs and the EmbeddingFunction interface documentation for more details.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

@mattalford

Quick workaround until HuggingFaceEmbeddings is updated:
pip install chromadb==0.4.15

This breaking change affects chromadb 0.4.16

@efriis
Member

efriis commented Nov 8, 2023

Langchain 0.0.332 released with the chroma team's fix for compatibility with chromadb>=0.4.16

You can now use the latest of both: pip install -U langchain chromadb
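The version pairing reported in this thread can be written down as a small check. This is a sketch that encodes only what the comments here state (chromadb 0.4.15 works as a workaround, and chromadb 0.4.16 or later needs langchain 0.0.332 or later when using the LangChain Chroma wrapper). `parse_version` and `wrapper_is_compatible` are hypothetical helpers, and the parser assumes plain dotted numeric versions.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as '0.4.16' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def wrapper_is_compatible(chromadb_version: str, langchain_version: str) -> bool:
    """Hypothetical check based only on the versions reported in this thread."""
    if parse_version(chromadb_version) < (0, 4, 16):
        # Versions before the EmbeddingFunction interface change.
        return True
    # From 0.4.16 on, the thread reports that langchain 0.0.332 carries the fix.
    return parse_version(langchain_version) >= (0, 0, 332)


pinned_ok = wrapper_is_compatible("0.4.15", "0.0.331")
broken = wrapper_is_compatible("0.4.16", "0.0.331")
fixed = wrapper_is_compatible("0.4.16", "0.0.332")
```

The check says nothing about passing a LangChain embeddings object directly to a native chromadb client, which is a separate code path.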

@efriis efriis closed this as completed Nov 8, 2023
@efriis
Member

efriis commented Nov 8, 2023

Actually looks like there was something specific with HuggingFaceBgeEmbeddings as well - could you confirm 0.0.332 with the chroma fix addresses this, and reopen if it's something that needs to be addressed in hugging face?

@jayant-yadav
Author

jayant-yadav commented Nov 8, 2023

@efriis The fix in 0.0.332 works! LangChain's latest version (0.0.332) is now compatible with chromadb==0.4.16.
If possible, I would like to know where the changes that fixed this issue were made.

@efriis
Member

efriis commented Nov 9, 2023

#13085

@BharatBindage

BharatBindage commented Nov 28, 2023

Actually looks like there was something specific with HuggingFaceBgeEmbeddings as well - could you confirm 0.0.332 with the chroma fix addresses this, and reopen if it's something that needs to be addressed in hugging face?

This is fixed now. Thank you, @efriis.

@xsuryanshx

Langchain 0.0.332 released with the chroma team's fix for compatibility with chromadb>=0.4.16

Can now use latest of both pip install -U langchain chromadb

thanks this fixed my error!

bdoohan-goog added a commit to bdoohan-goog/generative-ai that referenced this issue Feb 21, 2024
this change fixes this error that arises due to a code breaking change from chromadb: langchain-ai/langchain#13061
holtskinner added a commit to GoogleCloudPlatform/generative-ai that referenced this issue Feb 21, 2024
this change fixes this error that arises due to a code breaking change
from chromadb: langchain-ai/langchain#13061

Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
kartikeygupta24 pushed a commit to nitchandak/generative-ai that referenced this issue May 3, 2024
this change fixes this error that arises due to a code breaking change
from chromadb: langchain-ai/langchain#13061

Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
@jashshah

I am using chromadb-0.5.0 and langchain-0.2.1 and I still run into this error when I try to host ChromaDB using a docker container.


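import chromadb
from langchain.embeddings import HuggingFaceBgeEmbeddings

# modelPath, model_kwargs and encode_kwargs are defined elsewhere (not shown).
hf = HuggingFaceBgeEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    cache_folder="./cache"
)

# Connect to the ChromaDB server running in the Docker container.
chroma_client = chromadb.HttpClient(host='localhost', port=8000)

# Passing the LangChain embeddings object directly is where the error is raised.
collection = chroma_client.create_collection(name="DATA_V3", embedding_function=hf)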
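One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the native chromadb client expects a callable with the `__call__(self, input)` shape from the error message, while a LangChain embeddings object exposes `embed_documents` and `embed_query` instead. A minimal adapter sketch follows; `LangChainEmbeddingAdapter` and `FakeEmbedder` are hypothetical names, and the fake embedder stands in for `HuggingFaceBgeEmbeddings` so the example runs without downloading a model.

```python
import inspect
from typing import List, Sequence


class LangChainEmbeddingAdapter:
    """Hypothetical adapter exposing a LangChain-style embedder via __call__(self, input)."""

    def __init__(self, embedder) -> None:
        # embedder is assumed to provide embed_documents(list_of_texts).
        self._embedder = embedder

    def __call__(self, input: Sequence[str]) -> List[List[float]]:
        # Delegate to the wrapped embedder and normalise to plain float lists.
        vectors = self._embedder.embed_documents(list(input))
        return [[float(x) for x in vector] for vector in vectors]


class FakeEmbedder:
    """Stand-in embedder: returns a two-dimensional vector per text."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(text)), 1.0] for text in texts]


adapter = LangChainEmbeddingAdapter(FakeEmbedder())
vectors = adapter(["hello", "hi"])
param_names = list(inspect.signature(LangChainEmbeddingAdapter.__call__).parameters)
```

With the real libraries, one would presumably pass `LangChainEmbeddingAdapter(hf)` as `embedding_function` in the `create_collection` call above. That step is untested here and may need adjusting for the chromadb version in use.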
