
unable to run get_len_safe_embeddings function which i wrote #17267

Closed

nithinreddyyyyyy opened this issue Feb 8, 2024 · 2 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: embeddings Related to text embedding models module 🔌: openai Primarily related to OpenAI integrations

Comments

@nithinreddyyyyyy

Issue with current documentation:

Below is the code:

def _get_len_safe_embeddings(
    self, texts: List[str], *, engine: str, chunk_size: Optional[int] = None
) -> List[List[float]]:
    """
    Generate length-safe embeddings for a list of texts.

    This method handles tokenization and embedding generation, respecting the
    set embedding context length and chunk size. It supports both tiktoken
    and HuggingFace tokenizer based on the tiktoken_enabled flag.
    Args:
        texts (List[str]): A list of texts to embed.
        engine (str): The engine or model to use for embeddings.
        chunk_size (Optional[int]): The size of chunks for processing embeddings.

    Returns:
        List[List[float]]: A list of embeddings for each input text.
    """

    tokens = []
    indices = []
    model_name = self.tiktoken_model_name or self.model
    _chunk_size = chunk_size or self.chunk_size

    # If tiktoken flag set to False
    if not self.tiktoken_enabled:
        try:
            from transformers import AutoTokenizer
        except ImportError:
            raise ValueError(
                "Could not import transformers python package. "
                "This is needed in order to use OpenAIEmbeddings without "
                "`tiktoken`. Please install it with `pip install transformers`."
            )

        tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path=model_name
        )
        for i, text in enumerate(texts):
            # Tokenize the text using HuggingFace transformers
            tokenized = tokenizer.encode(text, add_special_tokens=False)

            # Split tokens into chunks respecting the embedding_ctx_length
            for j in range(0, len(tokenized), self.embedding_ctx_length):
                token_chunk = tokenized[j : j + self.embedding_ctx_length]
                tokens.append(token_chunk)
                indices.append(i)

    # Embed each chunk separately
    batched_embeddings = []
    for i in range(0, len(tokens), _chunk_size):
        token_batch = tokens[i : i + _chunk_size]
        response = embed_with_retry(
            self,
            inputs=token_batch,
            **self._invocation_params,
        )
        if not isinstance(response, dict):
            response = response.dict()
        batched_embeddings.extend(r["embedding"] for r in response["data"])

    # Concatenate the embeddings for each text
    embeddings: List[List[float]] = [[] for _ in range(len(texts))]
    for i in range(len(indices)):
        embeddings[indices[i]].extend(batched_embeddings[i])

    return embeddings
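For reference, the chunk-and-reassemble bookkeeping that this method performs (split each text's tokens into context-sized chunks, embed each chunk, then concatenate the pieces per original text via the indices list) can be sketched standalone with a dummy embedder. All names below are illustrative, not LangChain's API:

```python
from typing import List

CTX_LEN = 4  # stand-in for embedding_ctx_length


def fake_embed(chunk: List[int]) -> List[float]:
    # dummy embedder: one float per token, just to make the flow visible
    return [float(t) for t in chunk]


def len_safe_embed(token_lists: List[List[int]]) -> List[List[float]]:
    tokens: List[List[int]] = []
    indices: List[int] = []
    for i, toks in enumerate(token_lists):
        # split each text's tokens into context-sized chunks,
        # remembering which text each chunk came from
        for j in range(0, len(toks), CTX_LEN):
            tokens.append(toks[j : j + CTX_LEN])
            indices.append(i)

    # embed each chunk, then concatenate the results per original text
    embeddings: List[List[float]] = [[] for _ in token_lists]
    for idx, chunk in zip(indices, tokens):
        embeddings[idx].extend(fake_embed(chunk))
    return embeddings


print(len_safe_embed([[1, 2, 3, 4, 5], [6]]))
# the first text spans two chunks, the second fits in one
```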

followed by the code below:

# Instantiate the OpenAIEmbeddings class
openai = OpenAIEmbeddings(openai_api_key="")

# Generate embeddings for your documents
embeddings = get_len_safe_embeddings([doc.page_content for doc in documents], engine="text-embedding-ada-002")

# Create tuples of text and corresponding embedding
text_embeddings = list(zip([doc.page_content for doc in documents], embeddings))

# Create a FAISS vector store from the embeddings
vectorStore = FAISS.from_embeddings(text_embeddings, openai)

It returned the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-14fb4a40f661> in <cell line: 5>()
      3 
      4 # Generate embeddings for your documents
----> 5 embeddings = _get_len_safe_embeddings([doc.page_content for doc in documents], engine="text-embedding-ada-002")
      6 
      7 # Create tuples of text and corresponding embedding

TypeError: _get_len_safe_embeddings() missing 1 required positional argument: 'texts'

Can you assist me with this code? It would be great to get this issue resolved. Can you write updated code?

Idea or request for content:

No response


dosubot bot commented Feb 8, 2024

🤖

Hey @nithinreddyyyyyy! Great to see you diving into another challenge. How's everything going on your end?

The error message you're seeing is due to the way you're calling the _get_len_safe_embeddings function. This function is a method of a class, and as such, it needs to be called on an instance of that class. The self parameter in the method definition refers to the instance of the class on which the method is being called.

From the code you've shared, it seems like _get_len_safe_embeddings is a method of the OpenAIEmbeddings class. Therefore, you need to call it on an instance of OpenAIEmbeddings, not as a standalone function.
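As a minimal, generic illustration (class and method names here are hypothetical, not LangChain code): calling a method through the class instead of an instance means the first argument fills the `self` slot, so a later parameter comes up missing and Python raises the same kind of TypeError:

```python
class Greeter:
    def greet(self, name: str) -> str:
        # 'self' is supplied automatically only when called on an instance
        return f"Hello, {name}!"


g = Greeter()
print(g.greet("world"))  # bound call: self is passed implicitly

try:
    # unbound call: "world" lands in the 'self' slot, so 'name' is missing
    Greeter.greet("world")
except TypeError as e:
    print(type(e).__name__)  # TypeError, just like in the traceback above
```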

Here's how you can modify your code:

# Instantiate the OpenAIEmbeddings class
openai = OpenAIEmbeddings(openai_api_key="")

# Generate embeddings for your documents
embeddings = openai._get_len_safe_embeddings([doc.page_content for doc in documents], engine="text-embedding-ada-002")

# Create tuples of text and corresponding embedding
text_embeddings = list(zip([doc.page_content for doc in documents], embeddings))

# Create a FAISS vector store from the embeddings
vectorStore = FAISS.from_embeddings(text_embeddings, openai)

Please note that methods starting with an underscore, like _get_len_safe_embeddings, are conventionally considered private in Python, meaning they're intended for internal use within the class and not meant to be called directly from outside the class. It's generally better to use public methods of the class, which are part of the class's public interface.
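A generic sketch of that convention (names hypothetical): the leading-underscore helper does the work internally, while callers go through the public method:

```python
class Embedder:
    def _compute(self, text: str) -> list:
        # "private" helper: the leading underscore signals internal use only
        return [float(len(text))]

    def embed(self, text: str) -> list:
        # public interface method that outside callers should use
        return self._compute(text)


print(Embedder().embed("abc"))
```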

In this case, you might want to use the embed_documents method instead, which internally calls _get_len_safe_embeddings. Here's how you can do that:

# Instantiate the OpenAIEmbeddings class
openai = OpenAIEmbeddings(openai_api_key="")

# Generate embeddings for your documents
embeddings = openai.embed_documents([doc.page_content for doc in documents])

# Create tuples of text and corresponding embedding
text_embeddings = list(zip([doc.page_content for doc in documents], embeddings))

# Create a FAISS vector store from the embeddings
vectorStore = FAISS.from_embeddings(text_embeddings, openai)

I hope this helps! If you have any more questions, feel free to ask.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@dosubot dosubot bot added Ɑ: embeddings Related to text embedding models module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: openai Primarily related to OpenAI integrations labels Feb 8, 2024
@eyurtsev
Collaborator

eyurtsev commented Feb 9, 2024

Closing issue since this is not a documentation question

@eyurtsev closed this as not planned Feb 9, 2024
2 participants