
Feature Request: Allow initializing HuggingFaceEmbeddings from the cached weight #3079

Closed
nicolefinnie opened this issue Apr 18, 2023 · 9 comments

Comments

@nicolefinnie

nicolefinnie commented Apr 18, 2023

Motivation

Right now, HuggingFaceEmbeddings doesn't support loading an embedding model's weights from a cache; it downloads the weights every time. Fixing this would be low-hanging fruit: just allow the user to pass their cache directory.

Suggestion

The change is only a few lines in __init__():

from typing import Any, List

from pydantic import BaseModel, Extra

from langchain.embeddings.base import Embeddings

# Same default model as the existing module-level constant.
DEFAULT_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"


class HuggingFaceEmbeddings(BaseModel, Embeddings):
    """Wrapper around sentence_transformers embedding models.

    To use, you should have the ``sentence_transformers`` python package installed.

    Example:
        .. code-block:: python

            from langchain.embeddings import HuggingFaceEmbeddings
            model_name = "sentence-transformers/all-mpnet-base-v2"
            hf = HuggingFaceEmbeddings(model_name=model_name)
    """

    client: Any  #: :meta private:
    model_name: str = DEFAULT_MODEL_NAME
    """Model name to use."""

    def __init__(self, cache_folder=None, **kwargs: Any):
        """Initialize the sentence_transformer."""
        super().__init__(**kwargs)
        try:
            import sentence_transformers
            self.client = sentence_transformers.SentenceTransformer(
                model_name_or_path=self.model_name, cache_folder=cache_folder
            )
        except ImportError:
            raise ValueError(
                "Could not import sentence_transformers python package. "
                "Please install it with `pip install sentence_transformers`."
            )

    class Config:
        """Configuration for this pydantic object."""

        extra = Extra.forbid

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Compute doc embeddings using a HuggingFace transformer model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        texts = list(map(lambda x: x.replace("\n", " "), texts))
        embeddings = self.client.encode(texts)
        return embeddings.tolist()

    def embed_query(self, text: str) -> List[float]:
        """Compute query embeddings using a HuggingFace transformer model.

        Args:
            text: The text to embed.

        Returns:
            Embeddings for the text.
        """
        text = text.replace("\n", " ")
        embedding = self.client.encode(text)
        return embedding.tolist()
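
With this change, usage could look like this (the cache path is illustrative):

    from langchain.embeddings import HuggingFaceEmbeddings

    hf = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2",
        cache_folder="/path/to/sentence_transformers_cache",
    )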
@kanukolluGVT

I am eager to learn how the solution to this problem is approached. Can you tell me where the weights are located and how they are downloaded? I am a beginner and am excited to see the solution, but I will only contribute if I have a better understanding of the process since I have limited experience in machine learning engineering.

@nicolefinnie
Author

I am eager to learn how the solution to this problem is approached. Can you tell me where the weights are located and how they are downloaded? I am a beginner and am excited to see the solution, but I will only contribute if I have a better understanding of the process since I have limited experience in machine learning engineering.

Thanks for your quick response. The weights get downloaded when the user doesn't specify cache_folder and HuggingFaceEmbeddings initializes SentenceTransformer() (from the python package sentence_transformers) directly.

By default, if no cache_folder is given, SentenceTransformer looks for the weights in the directory SENTENCE_TRANSFORMERS_HOME (see here); if the weights are not found there, it downloads them from the Hugging Face Hub.

So the alternative for users, without changing the LangChain code here, is to set an env variable SENTENCE_TRANSFORMERS_HOME that points to the real weight location. Not ideal, but acceptable. In that case, we could document the usage in the LangChain HuggingFaceEmbeddings docstring, but that shifts the complexity to the user, who has to add the env variable to their python script. To make it user-friendly, we could offer this cache_folder option.
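
A minimal sketch of that workaround, assuming the weights already sit under the given directory (the path is illustrative):

    import os

    # Must be set before the embeddings object is created, so that
    # sentence_transformers looks here instead of downloading.
    os.environ["SENTENCE_TRANSFORMERS_HOME"] = "/mnt/embedding_models"

    from langchain.embeddings import HuggingFaceEmbeddings

    hf = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")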

@azamiftikhar1000
Contributor

@nicolefinnie Yup, this makes sense. Thanks for the suggestion!

eavanvalkenburg pushed a commit to eavanvalkenburg/langchain that referenced this issue Apr 18, 2023

langchain-ai#3079: Allow initializing HuggingFaceEmbeddings from the cached weight
@Chetan-Yeola

Can we decode the embeddings?

@zlapp

zlapp commented Apr 29, 2023

Isn't the dependency on sentence_transformers limiting? E.g. if I wanted to test an OpenAssistant LLM initialized locally from weights, I couldn't use the HuggingFaceEmbeddings class because sentence_transformers doesn't support OpenAssistant. Am I missing something? Are all HF LLMs (i.e. OpenAssistant, LLaMA, Vicuna, etc.) compatible with sentence_transformers embeddings (both the library and the actual model embeddings)?

samching pushed a commit to samching/langchain that referenced this issue May 1, 2023

langchain-ai#3079: Allow initializing HuggingFaceEmbeddings from the cached weight
@mirodrr

mirodrr commented Aug 5, 2023

Does someone have a working example of initializing HuggingFaceEmbeddings without an internet connection? I have tried specifying the "cache_folder" parameter with the file path of a pre-downloaded embedding model from Hugging Face, but it seems to be ignored.

@mirodrr

mirodrr commented Sep 19, 2023

Hi, just asking again: Does anyone have a working example of initializing HuggingFaceEmbeddings without an internet connection?

I need to use this class with a pre-downloaded embedding model instead of downloading from Hugging Face every time.

@bolongliu

Hi, just asking again: Does anyone have a working example of initializing HuggingFaceEmbeddings without an internet connection?

I need to use this class with a pre-downloaded embedding model instead of downloading from Hugging Face every time.

I have made it work with this method:

import os

from langchain.embeddings import HuggingFaceEmbeddings

embedding_models_root = "/mnt/embedding_models"
model_ckpt_path = os.path.join(embedding_models_root, 'multi-qa-MiniLM-L6-cos-v1')
embeddings = HuggingFaceEmbeddings(model_name=model_ckpt_path)
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text, "This is not a test document."])

print("==="*20)
print("query_result: \n {}".format(query_result))
print("==="*20)
print("doc_result: \n {}".format(doc_result))
print("==="*20)


dosubot bot commented Feb 6, 2024

Hi, @nicolefinnie! I'm helping the LangChain team manage their backlog and am marking this issue as stale.

It looks like the issue you raised requests adding support for initializing HuggingFaceEmbeddings from cached weights instead of downloading them every time. There have been discussions about potential limitations, working examples, and clarifications on the weight location and download process. One user has even shared a working example of initializing HuggingFaceEmbeddings with pre-downloaded embeddings.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

dosubot bot added the stale label (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) on Feb 6, 2024
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 13, 2024
dosubot bot removed the stale label on Feb 13, 2024
7 participants