Problem with embedding model #9486

Closed
npk7264 opened this issue Aug 19, 2023 · 12 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: embeddings Related to text embedding models module

Comments

npk7264 commented Aug 19, 2023

Issue you'd like to raise.

I use the Hugging Face embedding model vinai/phobert-base:
[image: screenshot not preserved]
It then produces this warning:
WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/vinai_phobert-base. Creating a new one with MEAN pooling.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Suggestion:

No response

@dosubot dosubot bot added Ɑ: embeddings Related to text embedding models module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Aug 19, 2023
@npk7264 npk7264 closed this as completed Aug 19, 2023
@npk7264 npk7264 reopened this Aug 19, 2023
@npk7264 npk7264 closed this as completed Aug 20, 2023

dosubot bot commented Aug 20, 2023

🤖

Hello,

Thank you for reaching out. From your description, it seems like you're trying to use the 'vinai/phobert-base' model from Hugging Face as an embedding model with the LangChain framework.

LangChain does support Hugging Face models for embeddings, and the HuggingFaceEmbeddings class loads them through the 'sentence-transformers' library. The 'vinai/phobert-base' repository, however, is a plain transformer checkpoint rather than a packaged sentence-transformers model.

The message you're seeing is a warning from 'sentence-transformers', not an error. Because the library finds no sentence-transformers configuration under that name, it wraps the transformer in a new model that applies MEAN pooling to the token embeddings. Embeddings are still produced, but the model was not trained for sentence similarity, so retrieval quality may be poor.

The HuggingFaceEndpoint class in LangChain covers a different use case: calling a hosted model as an LLM for tasks like 'text-generation' or 'text2text-generation'. It does not produce embeddings. For reference, here's an example:

from langchain.llms import HuggingFaceEndpoint

endpoint_url = "https://<your-endpoint>.us-east-1.aws.endpoints.huggingface.cloud"
hf = HuggingFaceEndpoint(
    endpoint_url=endpoint_url,
    task="text-generation",
    model_kwargs={"max_new_tokens": 10},
    huggingfacehub_api_token="your-api-token"
)

Please note that the 'vinai/phobert-base' model should be compatible with the task you are trying to perform. If it's not, you will get a ValueError indicating that the task is not supported. Also, you need to have the huggingface_hub python package installed and the environment variable HUGGINGFACEHUB_API_TOKEN set with your API token, or pass it as a named parameter to the constructor.

I hope this helps! If you have any other questions or need further clarification, please let me know.

Best,
Dosu

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@npk7264 npk7264 reopened this Aug 20, 2023

npk7264 commented Aug 21, 2023

@dosu-bot Can you tell me which Hugging Face model task is suitable for retrieval in LangChain?

dosubot bot commented Aug 21, 2023

🤖

Hello,

For a retrieval task in the LangChain Python framework, you can use the HuggingFaceInstructEmbeddings class which is suitable for this purpose. This class uses a specific model for embedding documents and queries. The model used can be specified during instantiation. For example, in the tests, the "hkunlp/instructor-base" model is used.

Here is the relevant code:

from langchain.embeddings import HuggingFaceInstructEmbeddings


def test_huggingface_instructor_embedding_documents() -> None:
    """Test huggingface instructor embeddings for documents."""
    documents = ["foo bar"]
    model_name = "hkunlp/instructor-base"
    embedding = HuggingFaceInstructEmbeddings(model_name=model_name)
    output = embedding.embed_documents(documents)
    # One embedding per document, each with 768 dimensions.
    assert len(output) == 1
    assert len(output[0]) == 768


def test_huggingface_instructor_embedding_query() -> None:
    """Test huggingface instructor embeddings for a query."""
    query = "foo bar"
    model_name = "hkunlp/instructor-base"
    embedding = HuggingFaceInstructEmbeddings(model_name=model_name)
    output = embedding.embed_query(query)
    assert len(output) == 768

In this code, the embed_documents and embed_query methods are used to compute document and query embeddings using the specified HuggingFaceInstructEmbeddings model.

The HuggingFaceInstructEmbeddings class in the LangChain framework is a wrapper around the sentence_transformers embedding models. It is specifically designed to work with instruction-based models, which are models that generate embeddings based on a given instruction and a text.

This class provides methods to compute document and query embeddings using a HuggingFace instruction-based model. The embed_documents method computes document embeddings by pairing each document with an instruction for representation. The embed_query method computes query embeddings by pairing the query with an instruction for representation.
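To make the pairing step concrete, here is a minimal sketch in plain Python of how an instruction can be attached to each text before encoding. The two instruction strings and the helper name pair_with_instruction are assumptions for illustration only; check the class itself for the values it actually uses.

```python
from typing import List

# Assumed instruction strings, for illustration only.
EMBED_INSTRUCTION = "Represent the document for retrieval: "
QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents: "


def pair_with_instruction(instruction: str, texts: List[str]) -> List[List[str]]:
    """Build one [instruction, text] pair per input text."""
    return [[instruction, text] for text in texts]


# Documents and queries get different instructions, so the same text can be
# represented differently depending on its role in retrieval.
doc_pairs = pair_with_instruction(EMBED_INSTRUCTION, ["foo bar", "baz"])
query_pair = pair_with_instruction(QUERY_INSTRUCTION, ["foo bar"])[0]
```

An instruction-tuned encoder would then receive these pairs instead of the bare texts.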

You can find more details in the source code:

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu

npk7264 commented Aug 21, 2023

@dosu-bot I mean the Hugging Face tasks: sentence similarity, fill-mask, token classification, and so on. When I use HuggingFaceEmbeddings, which task should the model have so that I avoid the problem I get with vinai/phobert-base?

dosubot bot commented Aug 21, 2023

🤖

Hello,

The message you're seeing is a warning rather than an error. It appears because 'vinai/phobert-base' is not packaged as a SentenceTransformer model, which is what the HuggingFaceEmbeddings class in LangChain loads. The sentence-transformers library therefore builds a new SentenceTransformer model around it with MEAN pooling. The second message comes from the tokenizer and notes that special tokens have been added to the vocabulary, so their embeddings should be fine-tuned or trained.

HuggingFaceEmbeddings does not take a task parameter. It returns sentence embeddings, so models published for sentence similarity are the natural fit for it, whereas fill-mask and token-classification models are not trained to produce sentence embeddings. The 'text-generation', 'text2text-generation', and 'summarization' tasks belong to the HuggingFaceHub LLM class and do not apply to embeddings.

To resolve this issue, you have a couple of options:

  1. You could use a SentenceTransformer model that is compatible with the tasks you're interested in. You can find a list of SentenceTransformer models on the Hugging Face Model Hub.

  2. If you're set on using the 'vinai/phobert-base' model, you would need to modify the LangChain Python framework to support this model and the tasks you're interested in. This would involve extending the HuggingFaceEmbeddings class or creating a new class to handle the 'vinai/phobert-base' model and the tasks of sentence similarity, fill mask, and token classification.

I hope this helps! If you have any further questions, please don't hesitate to ask.

Best,
Dosu

npk7264 commented Aug 21, 2023

@dosu-bot Tell me more about option 2. I want to modify the LangChain Python framework to support this model and the "sentence-similarity" task.

dosubot bot commented Aug 21, 2023

🤖

Hello,

Thank you for reaching out. To use the 'vinai/phobert-base' model for the "sentence-similarity" use case, you can create a new class that extends the HuggingFaceEmbeddings class. In this new class, you can override the model_name attribute to use the 'vinai/phobert-base' model and override the embed_documents, embed_query, aembed_documents, and aembed_query methods. Note that these methods return embedding vectors; the similarity between a query and the documents is computed afterwards from those vectors, for example by a vector store.

Here is an example of how you can do this:

from typing import List
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

class PhoBertEmbeddings(HuggingFaceEmbeddings):
    """PhoBert sentence_transformers embedding models."""

    model_name: str = "vinai/phobert-base"
    """Model name to use."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Compute document embeddings using a HuggingFace transformer model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        raise NotImplementedError  # Your implementation here

    def embed_query(self, text: str) -> List[float]:
        """Compute a query embedding using a HuggingFace transformer model.

        Args:
            text: The text to embed.

        Returns:
            Embedding for the text.
        """
        raise NotImplementedError  # Your implementation here

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        """Asynchronous version of embed_documents."""
        raise NotImplementedError  # Your implementation here

    async def aembed_query(self, text: str) -> List[float]:
        """Asynchronous version of embed_query."""
        raise NotImplementedError  # Your implementation here

In the embed_documents, embed_query, aembed_documents, and aembed_query methods, you should implement the logic to compute sentence embeddings with the 'vinai/phobert-base' model, for example by pooling its token embeddings. The exact implementation will depend on the pooling strategy you choose.

Please note that the current implementation of the HuggingFaceEmbeddings class in the LangChain Python framework is a wrapper around the HuggingFace sentence_transformers embedding models. It is used to compute document and query embeddings using a HuggingFace transformer model. You can find more details about the current implementation in the source code.
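Since the embedding methods only return vectors, the similarity itself is computed in a later step. The following is a minimal sketch of that step in plain Python, assuming cosine similarity as the metric (a vector store may use a different one); cosine_similarity and rank_documents are hypothetical helper names, not LangChain functions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: Sequence[float],
                   doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

Here query_vec would come from embed_query and doc_vecs from embed_documents.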

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu

npk7264 commented Aug 21, 2023

@dosu-bot vinai/phobert-base is a word-embedding model, while HuggingFaceEmbeddings in LangChain requires a sentence-embedding model, so it doesn't work. Is my understanding right?

dosubot bot commented Aug 21, 2023

🤖

Hello,

You're correct in your understanding that the HuggingFaceEmbeddings class in LangChain is designed to work with sentence-embedding models, and the 'vinai/phobert-base' model from Hugging Face is a word-embedding model.

To use the 'vinai/phobert-base' model for the "sentence-similarity" task, you would need to create a new class that inherits from the Embeddings base class and implements the embed_documents and embed_query methods to generate sentence embeddings from the word embeddings produced by the 'vinai/phobert-base' model.

Here's an example of how you can create this new class:

from transformers import AutoTokenizer, AutoModel
import torch
from langchain.embeddings.base import Embeddings
from typing import List

phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

class PhoBertEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Generate sentence embeddings from the 'vinai/phobert-base' model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
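        # replace newlines, which can negatively affect performance.
        texts = [text.replace("\n", " ") for text in texts]
        responses = []
        for text in texts:
            # Note: encode() does not truncate here, so a text longer than the
            # model's maximum sequence length will fail inside the model.
            input_ids = torch.tensor([tokenizer.encode(text)])
            with torch.no_grad():
                features = phobert(input_ids)
            # Mean-pool the token vectors of the single sequence in the batch.
            sentence_embedding = torch.mean(features.last_hidden_state[0], dim=0)
            responses.append(sentence_embedding.tolist())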
        return responses

    def embed_query(self, text: str) -> List[float]:
        """Generate a sentence embedding for a query from the 'vinai/phobert-base' model.

        Args:
            text: The text to embed.

        Returns:
            Embedding for the text.
        """
        return self.embed_documents([text])[0]

This class produces a contextual embedding for each token of the input text (subword pieces plus the special start and end tokens) using the 'vinai/phobert-base' model, and then averages these token embeddings to obtain one sentence embedding per text.
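The averaging step can be sketched without any model. The version below is a plain-Python illustration, not the code of any library: it adds an attention mask, which only matters if texts are padded and encoded in batches (the class above encodes one text at a time, so every position counts). The name masked_mean_pool is hypothetical.

```python
from typing import List


def masked_mean_pool(token_vectors: List[List[float]],
                     attention_mask: List[int]) -> List[float]:
    """Average only the token vectors whose mask entry is 1."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding position that must be ignored.
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(vectors, mask)
```

Without the mask, the padding vector would be averaged in and would distort the sentence embedding.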

Please note that this is a simplified example and may not cover all possible edge cases. You may need to add additional error handling and input validation code to ensure robustness.

I hope this helps! If you have any further questions, feel free to ask.

npk7264 commented Aug 26, 2023

@dosu-bot When I use the custom PhoBertEmbeddings():
hf_embedding = PhoBertEmbeddings()
vectordb = Chroma.from_documents(
    documents=data,
    embedding=hf_embedding,
    persist_directory=persist_directory
)

I get two problems. Please suggest how to fix them:

Token indices sequence length is longer than the specified maximum sequence length for this model (555 > 256). Running this sequence through the model will result in indexing errors
index out of range in self

dosubot bot commented Aug 26, 2023

🤖

Hello,

Based on the context you've provided, it seems like you're encountering two issues related to sequence length and indexing.

  1. The warning means that one of your inputs tokenizes to 555 tokens, while the 'vinai/phobert-base' tokenizer reports a maximum sequence length of 256. The limit comes from the model itself, not from LangChain. To resolve this, truncate each text at tokenization time, or split your documents into smaller chunks before indexing so that each chunk stays within 256 tokens. For truncation, a helper like the following can replace the tokenizer.encode(text) call inside embed_documents:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

def encode_truncated(text, max_length=256):
    # Cut the token sequence at the model limit reported in the warning.
    return tokenizer.encode(text, max_length=max_length, truncation=True)
  2. The "index out of range in self" error is most likely a consequence of the first problem rather than a separate bug. The over-long sequence is passed to the model, which has no position embeddings for tokens beyond its maximum length, so the embedding lookup fails. Once every input is truncated or split to fit the limit, this error should go away.

If you continue to encounter issues, please provide more details about your PhoBertEmbeddings class and how you're using it with the LangChain framework. This will help me provide a more accurate solution.
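If truncation would discard too much text, one alternative is to split a long token sequence into windows that each fit the model. Below is a minimal sketch in plain Python; split_into_windows is a hypothetical helper, and the window size of 254 assumes a 256-token limit with two positions reserved for the special start and end tokens.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int,
                       overlap: int = 0) -> List[List[int]]:
    """Split token ids into consecutive windows of at most max_len items."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end
    return windows


# A stand-in for the 555 token ids from the warning.
ids = list(range(555))
windows = split_into_windows(ids, max_len=254)
```

Each window can then be embedded separately, and the resulting vectors either stored as separate chunks or averaged into one vector; which option works better depends on the retrieval setup.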

Best,
Dosu

dosubot bot commented Nov 25, 2023

Hi, @npk7264! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were experiencing a problem with embedding a model from huggingface vinai/phobert-base. You mentioned receiving a warning message about no sentence-transformers model found. In response, Dosu-bot suggested using the HuggingFaceEndpoint class in LangChain for tasks like text-generation or text2text-generation. They provided code examples and links to relevant source code.

You also asked about tasks like sentence similarity, fill mask, and token classification. Dosu-bot explained that these tasks are not directly supported by the HuggingFaceEmbeddings class in LangChain. They suggested using a SentenceTransformer model compatible with the desired tasks or modifying the LangChain framework to support the vinai/phobert-base model and the tasks. Dosu-bot provided code examples and explanations for creating a new class that extends the HuggingFaceEmbeddings class to support the vinai/phobert-base model and the sentence-similarity task.

You also encountered issues related to sequence length and indexing, and Dosu-bot provided suggestions for resolving these issues.

Now, we would like to know if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Nov 25, 2023
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 2, 2023
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Dec 2, 2023