OpenAIEmbeddings model argument does not work #17409

Closed
4 tasks done
nyck33 opened this issue Feb 12, 2024 · 4 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: embeddings Related to text embedding models module 🔌: openai Primarily related to OpenAI integrations

Comments

nyck33 commented Feb 12, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.

Example Code

import os

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore
from langchain.retrievers.self_query.base import SelfQueryRetriever

# OPENAI_API_KEY, the supabase client, llm, document_content_description,
# and metadata_field_info are defined elsewhere in the notebook.

embedding_model = os.environ.get("EMBEDDING_MODEL")
print(embedding_model)
embedding_dimension = os.environ.get("EMBEDDING_DIMENSION")
print(embedding_dimension)

# the langchain way
embeddings_model_lg = OpenAIEmbeddings(
    api_key=OPENAI_API_KEY,
    model=embedding_model,
    deployment=embedding_model,
    dimensions=int(embedding_dimension),
)

vectorstore = SupabaseVectorStore(
    client=supabase,
    embedding=embeddings_model_lg,
    table_name="documents",
    query_name="match_documents",
)

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True
)

# %%
# specify a relevant query
query = "How does tree help the boy make the crown? return results with relevance scores"
embedded_query = embeddings_model_lg.embed_query(query)
response = retriever.get_relevant_documents(query)

And in my .env:

EMBEDDING_DIMENSION=256 
# edit this based on your model preference, e.g. text-embedding-3-small, text-embedding-ada-002
EMBEDDING_MODEL=text-embedding-3-large

Error Message and Stack Trace (if applicable)

2024-02-12 21:49:08,618:WARNING - Warning: model not found. Using cl100k_base encoding.
2024-02-12 21:49:09,055:INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-02-12 21:49:10,285:INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-02-12 21:49:10,295:INFO - Generated Query: query='tree help boy crown' filter=None limit=None
2024-02-12 21:49:10,296:WARNING - Warning: model not found. Using cl100k_base encoding.
2024-02-12 21:49:10,584:INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-02-12 21:49:11,104:INFO - HTTP Request: POST https://lhbeoisvtsilsquybifs.supabase.co/rest/v1/rpc/match_documents?limit=4 "HTTP/1.1 200 OK"

It's only a warning, not an error.

Description

I want it to use the model I designated. Can I change the default in base.py?

...
    client: Any = Field(default=None, exclude=True)  #: :meta private:
    async_client: Any = Field(default=None, exclude=True)  #: :meta private:
    model: str = "text-embedding-ada-002"
    dimensions: Optional[int] = None
    """The number of dimensions the resulting o...

I can't believe the results are actually correct, but this is a tiny, tiny children's book, so it could have been a fluke.

[Document(page_content='Once there was a tree.... and she loved a little boy. And everyday the boy would come and he would gather her leaves and make them into crowns and play king of the forest. He would climb up her trunk and swing from her branches and eat apples. And they would play hide-and-go-seek.'), Document(page_content='And the tree was happy. But time went by. And the boy grew older. And the tree was often alone. Then one day the boy came to the tree and the tree said, "Come, Boy, come and climb up my trunk and swing from my branches and eat apples and play in my shade and be happy.'), ...

System Info

(langchain) nyck33@nyck33-lenovo:/media/nyck33/65DA61B605B0A8C1/projects/langchain-deeplearning-ai-tutorial$ pip freeze | grep langchain
langchain==0.1.5
langchain-community==0.0.19
langchain-core==0.1.21
langchain-openai==0.0.5

dosubot bot commented Feb 12, 2024

🤖

Hey there, @nyck33! Great to see you diving into the depths of LangChain once again. 🚀

Based on the information you've provided, it seems like the OpenAIEmbeddings class is not recognizing the model you've specified in your environment variables. This is likely due to the way the tiktoken library is used in the OpenAIEmbeddings class. If the model name is not found in the tiktoken library, it defaults to using the "cl100k_base" encoding.

This is evident in the following code snippet from the OpenAIEmbeddings class:

try:
    encoding = tiktoken.encoding_for_model(model_name)
except KeyError:
    logger.warning("Warning: model not found. Using cl100k_base encoding.")
    model = "cl100k_base"
    encoding = tiktoken.get_encoding(model)

This means that if the model you specified in your environment variables is not recognized by tiktoken, it will default to "cl100k_base".

To resolve this issue, you can either use a model name that is recognized by tiktoken, or modify the code to handle unrecognized models in a way that suits your needs. For example, you could raise an error instead of defaulting to "cl100k_base", or you could add a mechanism to add new models to tiktoken.
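
For example, if you just want the token-counting step to use a known encoding while keeping the newer model for the API call, OpenAIEmbeddings exposes a tiktoken_model_name field for exactly that. A minimal sketch (assuming OPENAI_API_KEY is set):

from langchain_openai import OpenAIEmbeddings

# tiktoken_model_name only affects local tokenization and chunking;
# the embeddings request itself still uses model="text-embedding-3-large".
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=256,
    tiktoken_model_name="text-embedding-ada-002",  # a name tiktoken recognizes
)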

Regarding the tiktoken library, it is used in the OpenAIEmbeddings class in LangChain to handle tokenization of the input text when generating embeddings. This is particularly important when dealing with long inputs that exceed the maximum context length that the OpenAI API can handle.

In the _get_len_safe_embeddings method, if the tiktoken_enabled flag is set to True, the tiktoken library is used to tokenize the input text into chunks that respect the embedding_ctx_length (the maximum number of tokens to embed at once).

However, if the tiktoken_enabled flag is set to False, the HuggingFace AutoTokenizer is used for tokenization instead.
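
To illustrate the chunking that happens on the tiktoken path, here is a simplified sketch (not the actual LangChain implementation):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
embedding_ctx_length = 8191  # OpenAIEmbeddings default

tokens = encoding.encode("some very long input text ...")

# Split the token sequence into chunks that fit within the context
# length; each chunk is embedded separately and the results combined.
chunks = [
    tokens[i : i + embedding_ctx_length]
    for i in range(0, len(tokens), embedding_ctx_length)
]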

I hope this helps clarify the issue. If you have any further questions, feel free to ask.


@dosubot dosubot bot added Ɑ: embeddings Related to text embedding models module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: openai Primarily related to OpenAI integrations labels Feb 12, 2024

nyck33 commented Feb 13, 2024

So the model being mentioned in your response has to do with tokenization and not the embeddings? I was concerned that it was not able to create the embedding vectors using the model and dimensions I chose:

EMBEDDING_DIMENSION=256
# edit this based on your model preference, e.g. text-embedding-3-small, text-embedding-ada-002
EMBEDDING_MODEL=text-embedding-3-large

but I guess it is doing that part okay. What are the differences between the tokenization models? You mention tiktoken, but what is that?

@nyck33 nyck33 closed this as completed Feb 13, 2024
oliverwang15 commented

TL;DR:

Try:

pip install -U langchain tiktoken

Reasons

From my perspective, the problem may be that "text-embedding-3-small" and "text-embedding-3-large" were not registered in tiktoken, so the package couldn't find the encoding for these embedding models and used "cl100k_base" by default.

After updating, you should find the following in tiktoken's model.py:

MODEL_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-3.5": "cl100k_base",  # Common shorthand
    "gpt-35-turbo": "cl100k_base",  # Azure deployment name
    # base
    "davinci-002": "cl100k_base",
    "babbage-002": "cl100k_base",
    # embeddings
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    # DEPRECATED MODELS
    # text (DEPRECATED)
   ...
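
A quick way to check whether the upgrade took effect (a hypothetical verification snippet, not from the thread):

import tiktoken

# With a recent tiktoken this resolves instead of raising KeyError,
# so LangChain no longer falls back to cl100k_base with a warning.
enc = tiktoken.encoding_for_model("text-embedding-3-large")
print(enc.name)  # cl100k_base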


ARajgor commented Apr 6, 2024

The issue still persists @oliverwang15

Library versions:

Langchain: 0.1.14
langchain-openai: 0.1.1
tiktoken: 0.6.0
openai: 1.16.1
pinecone: 3.2.2

My code:

from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone

pc = Pinecone()  # reads PINECONE_API_KEY from the environment

old_index = pc.Index(INDEX_NAME)  # INDEX_NAME is defined elsewhere

embed = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = embed.embed_query("hello")
print(len(vectors))

old_index.upsert(vectors=[
    {"id": "A", "values": vectors},
])

It prints 1536, and that's why my old_index shows a mismatch in dimension:

{"code":3,"message":"Vector dimension 1536 does not match the dimension of the index 512",
