Problem with embedding model #9486
Comments
🤖 Hello,

Thank you for reaching out. From your description, it seems you are trying to use the 'vinai/phobert-base' model from Hugging Face as an embedding model with the LangChain framework.

LangChain does support integration with Hugging Face models, but the 'vinai/phobert-base' model is not directly supported for embeddings. The framework would need to be extended or modified to support this specific model.

The warning message you're seeing comes from the 'sentence-transformers' library, which tries to load a SentenceTransformer model named 'vinai/phobert-base'. Because it cannot find a sentence-transformers model with that name, it creates a new one with MEAN pooling.

If you want to use the 'vinai/phobert-base' model for tasks like 'text-generation' or 'text2text-generation', you can do so using the HuggingFaceEndpoint class:

from langchain.llms import HuggingFaceEndpoint

endpoint_url = "https://<your-endpoint>.us-east-1.aws.endpoints.huggingface.cloud"
hf = HuggingFaceEndpoint(
    endpoint_url=endpoint_url,
    task="text-generation",
    model_kwargs={"max_new_tokens": 10},
    huggingfacehub_api_token="your-api-token",
)

Please note that the 'vinai/phobert-base' model must be compatible with the task you are trying to perform. If it is not, you will get a ValueError indicating that the task is not supported. Also, you need to have the [...]

I hope this helps! If you have any other questions or need further clarification, please let me know.

Best,
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot. |
@dosu-bot Can you tell me which Hugging Face model task is suitable for retrieval in LangChain?
🤖 Hello,

For a retrieval task in the LangChain Python framework, you can use the HuggingFaceInstructEmbeddings class. Here is the relevant code:

from langchain.embeddings import HuggingFaceInstructEmbeddings


def test_huggingface_instructor_embedding_documents() -> None:
    """Test huggingface embeddings."""
    documents = ["foo bar"]
    model_name = "hkunlp/instructor-base"
    embedding = HuggingFaceInstructEmbeddings(model_name=model_name)
    output = embedding.embed_documents(documents)
    # One embedding per document, each with 768 dimensions.
    assert len(output) == 1
    assert len(output[0]) == 768


def test_huggingface_instructor_embedding_query() -> None:
    """Test huggingface embeddings."""
    query = "foo bar"
    model_name = "hkunlp/instructor-base"
    embedding = HuggingFaceInstructEmbeddings(model_name=model_name)
    output = embedding.embed_query(query)
    assert len(output) == 768

This class provides methods to compute document and query embeddings using a HuggingFace instruction-based model.

I hope this helps! If you have any other questions, feel free to ask.

Best,
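To make the retrieval use case concrete, here is a minimal sketch of how a retriever typically uses the two methods: documents are embedded with embed_documents, the query with embed_query, and the documents are ranked by cosine similarity to the query. The ToyEmbeddings class and its letter-count vectors are hypothetical stand-ins so the example runs without downloading a model; this is not LangChain's retriever implementation.

```python
import math
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in for an embedding class (not part of LangChain)."""

    alphabet = "abcdefghijklmnopqrstuvwxyz"

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # One vector per text: the count of each letter (a toy "embedding").
        return [[float(t.lower().count(c)) for c in self.alphabet] for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: str, docs: List[str], emb: ToyEmbeddings, k: int = 1) -> List[str]:
    """Return the k documents whose embeddings are closest to the query."""
    doc_vecs = emb.embed_documents(docs)
    q_vec = emb.embed_query(query)
    ranked = sorted(zip(docs, doc_vecs), key=lambda p: cosine(q_vec, p[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

With a real model, only the embedding class changes; the ranking step stays the same, which is why the embedding class must return one fixed-size vector per text.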
@dosu-bot I am referring to tasks such as sentence similarity, fill mask, and token classification. When I use HuggingFaceEmbeddings, which task is suitable for fixing the error I get with vinai/phobert-base?
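🤖 Hello,

The error you're encountering is due to the fact that the 'vinai/phobert-base' model is not a SentenceTransformer model, which is what the HuggingFaceEmbeddings class expects. The tasks you're interested in (sentence similarity, fill mask, and token classification) are not directly supported by the HuggingFaceEmbeddings class.

To resolve this issue, you have a couple of options:

1. Use a SentenceTransformer model that is compatible with the desired tasks.
2. Modify the LangChain framework to support the 'vinai/phobert-base' model and these tasks.

I hope this helps! If you have any further questions, please don't hesitate to ask.

Best,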
@dosu-bot Tell me more about option 2. I want to modify the LangChain Python framework to support this model and the "sentence-similarity" task.
🤖 Hello,

Thank you for reaching out. To support the 'vinai/phobert-base' model and the "sentence-similarity" task, you can create a new class that extends the HuggingFaceEmbeddings class. Here is an example of how you can do this:

from typing import List

from langchain.embeddings.huggingface import HuggingFaceEmbeddings


class PhoBertEmbeddings(HuggingFaceEmbeddings):
    """PhoBert sentence_transformers embedding models."""

    model_name: str = "vinai/phobert-base"
    """Model name to use."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Compute document embeddings using a HuggingFace transformer model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        # Your implementation here
        raise NotImplementedError

    def embed_query(self, text: str) -> List[float]:
        """Compute a query embedding using a HuggingFace transformer model.

        Args:
            text: The text to embed.

        Returns:
            Embedding for the text.
        """
        # Your implementation here
        raise NotImplementedError

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        """Asynchronous version of embed_documents."""
        # Your implementation here
        raise NotImplementedError

    async def aembed_query(self, text: str) -> List[float]:
        """Asynchronous version of embed_query."""
        # Your implementation here
        raise NotImplementedError

Please note that the current implementation of the [...]

I hope this helps! If you have any other questions, feel free to ask.

Best,
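The asynchronous stubs in the skeleton do not necessarily need a separate asynchronous model. One common approach, sketched here as an assumption rather than as LangChain's own implementation, is to run the synchronous method in a thread pool through asyncio. The SyncEmbedder class and its dummy vectors are hypothetical placeholders for a real model call.

```python
import asyncio
from typing import List


class SyncEmbedder:
    """Hypothetical placeholder with blocking embed methods."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Dummy one-dimensional vectors (text length); a real model call goes here.
        return [[float(len(t))] for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        # Run the blocking call in the default thread pool so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.embed_documents, texts)

    async def aembed_query(self, text: str) -> List[float]:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.embed_query, text)
```

A caller would use it as `asyncio.run(SyncEmbedder().aembed_query("some text"))`. This keeps a single code path for the model, at the cost of occupying a worker thread per call.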
@dosu-bot vinai/phobert-base is a word-embedding model, while HuggingFaceEmbeddings in LangChain requires a sentence-embedding model, so it doesn't work. Am I thinking about this correctly?
🤖 Hello,

You're correct in your understanding. To use the 'vinai/phobert-base' model for the "sentence-similarity" task, you would need to create a new class that inherits from the Embeddings base class. Here's an example of how you can create this new class:

from typing import List

import torch
from langchain.embeddings.base import Embeddings
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")


class PhoBertEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Generate sentence embeddings from the 'vinai/phobert-base' model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        # Replace newlines, which can negatively affect performance.
        texts = [text.replace("\n", " ") for text in texts]
        responses = []
        for text in texts:
            input_ids = torch.tensor([tokenizer.encode(text)])
            with torch.no_grad():
                features = phobert(input_ids)
            # Average the token embeddings of the single sequence in the batch.
            sentence_embedding = torch.mean(features.last_hidden_state[0], dim=0)
            responses.append(sentence_embedding.tolist())
        return responses

    def embed_query(self, text: str) -> List[float]:
        """Generate a sentence embedding for a query from the 'vinai/phobert-base' model.

        Args:
            text: The text to embed.

        Returns:
            Embedding for the text.
        """
        return self.embed_documents([text])[0]

This class uses the 'vinai/phobert-base' model to generate an embedding for each token in the input text, and then averages these token embeddings to produce a sentence embedding.

Please note that this is a simplified example and may not cover all possible edge cases. You may need to add additional error handling and input validation code to ensure robustness.

I hope this helps! If you have any further questions, feel free to ask.
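The averaging step described here is mean pooling. In the class above each text is processed on its own, so there is no padding and a plain mean is enough. If several texts were batched and padded to a common length, the padding positions would usually be excluded from the average by using the attention mask. The following pure-Python sketch illustrates that idea; the function name mean_pool and the toy vectors are illustrative and not part of any library.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask masks out every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token positions; the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```

Without the mask, the padding vector would dominate the average, which is why batched implementations generally apply it.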
@dosu-bot When I use the custom PhoBertEmbeddings() class, I run into two problems. Please suggest how to fix them: Token indices sequence length is longer than the specified maximum sequence length for this model (555 > 256). Running this sequence through the model will result in indexing errors
🤖 Hello,

Based on the context you've provided, it seems you are encountering two issues related to sequence length and indexing. For the sequence-length warning, you can truncate the input when tokenizing:

from transformers import PhobertTokenizer

tokenizer = PhobertTokenizer.from_pretrained('vinai/phobert-base')
# `data` is your list of input texts; anything beyond max_length tokens is cut off.
data = [tokenizer.encode(text, max_length=128, truncation=True) for text in data]

If you continue to encounter issues, please provide more details about your [...]

Best,
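Truncation discards everything past the limit. If the tail of a long document matters, one alternative is to split the token ids into windows that fit the model, embed each window, and average the window embeddings. The sketch below shows only the splitting step in pure Python. The function name, the default of 256 (taken from the warning in the question), and the overlap parameter are assumptions, and any special tokens the tokenizer adds would need to be budgeted for separately.

```python
from typing import List


def split_token_ids(token_ids: List[int], max_len: int = 256, overlap: int = 0) -> List[List[int]]:
    """Split a list of token ids into windows of at most max_len ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows


# A 555-token input, as in the warning, is split into three windows.
windows = split_token_ids(list(range(555)), max_len=256)
```

A small overlap between windows can reduce the chance that a sentence is cut exactly at a window boundary, at the cost of embedding some tokens twice.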
Hi, @npk7264! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were experiencing a problem using the Hugging Face model vinai/phobert-base as an embedding model. You mentioned receiving a warning message that no sentence-transformers model was found. In response, Dosu-bot suggested using the HuggingFaceEndpoint class in LangChain for tasks like text-generation or text2text-generation, and provided code examples and links to relevant source code.

You also asked about tasks like sentence similarity, fill mask, and token classification. Dosu-bot explained that these tasks are not directly supported by the HuggingFaceEmbeddings class in LangChain, and suggested either using a SentenceTransformer model compatible with the desired tasks or modifying the LangChain framework to support the vinai/phobert-base model and the tasks. Dosu-bot provided code examples and explanations for creating a new class that extends the HuggingFaceEmbeddings class to support the vinai/phobert-base model and the sentence-similarity task. You also encountered issues related to sequence length and indexing, and Dosu-bot provided suggestions for resolving them.

Now, we would like to know if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!
Issue you'd like to raise.
I use the embedding model vinai/phobert-base from Hugging Face:
Then I get this problem:
WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/vinai_phobert-base. Creating a new one with MEAN pooling.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Suggestion:
No response