OpenAIEmbeddings model argument does not work #17409

Closed
4 tasks done
nyck33 opened this issue Feb 12, 2024 · 4 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: embeddings Related to text embedding models module 🔌: openai Primarily related to OpenAI integrations

Comments

nyck33 commented Feb 12, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.

Example Code

import os

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore
from langchain.retrievers.self_query.base import SelfQueryRetriever

# OPENAI_API_KEY, the supabase client, llm, document_content_description,
# and metadata_field_info are defined elsewhere in the notebook.

embedding_model = os.environ.get("EMBEDDING_MODEL")
print(embedding_model)
embedding_dimension = os.environ.get("EMBEDDING_DIMENSION")
print(embedding_dimension)

# the langchain way
embeddings_model_lg = OpenAIEmbeddings(
    api_key=OPENAI_API_KEY,
    model=embedding_model,
    deployment=embedding_model,
    dimensions=int(embedding_dimension),
)

vectorstore = SupabaseVectorStore(
    client=supabase,
    embedding=embeddings_model_lg,
    table_name="documents",
    query_name="match_documents",
)

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True
)

# %%
# specify a relevant query
query = "How does tree help the boy make the crown? return results with relevance scores"
embedded_query = embeddings_model_lg.embed_query(query)
response = retriever.get_relevant_documents(query)

And in my .env:

EMBEDDING_DIMENSION=256 
# edit this based on your model preference, e.g. text-embedding-3-small, text-embedding-ada-002
EMBEDDING_MODEL=text-embedding-3-large

Error Message and Stack Trace (if applicable)

2024-02-12 21:49:08,618:WARNING - Warning: model not found. Using cl100k_base encoding.
2024-02-12 21:49:09,055:INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-02-12 21:49:10,285:INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-02-12 21:49:10,295:INFO - Generated Query: query='tree help boy crown' filter=None limit=None
2024-02-12 21:49:10,296:WARNING - Warning: model not found. Using cl100k_base encoding.
2024-02-12 21:49:10,584:INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-02-12 21:49:11,104:INFO - HTTP Request: POST https://lhbeoisvtsilsquybifs.supabase.co/rest/v1/rpc/match_documents?limit=4 "HTTP/1.1 200 OK"

It's only a warning, not an error.

Description

I want it to use the model I designated. Can I change the default in base.py?

...
    client: Any = Field(default=None, exclude=True)  #: :meta private:
    async_client: Any = Field(default=None, exclude=True)  #: :meta private:
    model: str = "text-embedding-ada-002"
    dimensions: Optional[int] = None
    """The number of dimensions the resulting o...

I can't believe the results are actually correct, but this is a tiny, tiny children's book, so it could have been a fluke.

[Document(page_content='Once there was a tree.... and she loved a little boy. And everyday the boy would come and he would gather her leaves and make them into crowns and play king of the forest. He would climb up her trunk and swing from her branches and eat apples. And they would play hide-and-go-seek.'), Document(page_content='And the tree was happy. But time went by. And the boy grew older. And the tree was often alone. Then one day the boy came to the tree and the tree said, "Come, Boy, come and climb up my trunk and swing from my branches and eat apples and play in my shade and be happy.'), ...

System Info

(langchain) nyck33@nyck33-lenovo:/media/nyck33/65DA61B605B0A8C1/projects/langchain-deeplearning-ai-tutorial$ pip freeze | grep langchain
langchain==0.1.5
langchain-community==0.0.19
langchain-core==0.1.21
langchain-openai==0.0.5

dosubot bot commented Feb 12, 2024

🤖

Hey there, @nyck33! Great to see you diving into the depths of LangChain once again. 🚀

Based on the information you've provided, it seems like the OpenAIEmbeddings class is not recognizing the model you've specified in your environment variables. This is likely due to the way the tiktoken library is used in the OpenAIEmbeddings class. If the model name is not found in the tiktoken library, it defaults to using the "cl100k_base" encoding.

This is evident in the following code snippet from the OpenAIEmbeddings class:

try:
    encoding = tiktoken.encoding_for_model(model_name)
except KeyError:
    logger.warning("Warning: model not found. Using cl100k_base encoding.")
    model = "cl100k_base"
    encoding = tiktoken.get_encoding(model)

This means that if the model you specified in your environment variables is not recognized by tiktoken, it will default to "cl100k_base".

To resolve this issue, you can either use a model name that is recognized by tiktoken, or modify the code to handle unrecognized models in a way that suits your needs. For example, you could raise an error instead of defaulting to "cl100k_base", or you could add a mechanism to add new models to tiktoken.
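
For example, if you just want the token-counting step to use a known encoding while keeping the newer model for the API call, OpenAIEmbeddings exposes a tiktoken_model_name field for exactly that. A minimal sketch (assuming OPENAI_API_KEY is set):

from langchain_openai import OpenAIEmbeddings

# tiktoken_model_name only affects local tokenization and chunking;
# the embeddings request itself still uses model="text-embedding-3-large".
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=256,
    tiktoken_model_name="text-embedding-ada-002",  # a name tiktoken recognizes
)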

Regarding the tiktoken library, it is used in the OpenAIEmbeddings class in LangChain to handle tokenization of the input text when generating embeddings. This is particularly important when dealing with long inputs that exceed the maximum context length that the OpenAI API can handle.

In the _get_len_safe_embeddings method, if the tiktoken_enabled flag is set to True, the tiktoken library is used to tokenize the input text into chunks that respect the embedding_ctx_length (the maximum number of tokens to embed at once).

However, if the tiktoken_enabled flag is set to False, the HuggingFace AutoTokenizer is used for tokenization instead.
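
To illustrate the chunking that happens on the tiktoken path, here is a simplified sketch (not the actual LangChain implementation):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
embedding_ctx_length = 8191  # OpenAIEmbeddings default

tokens = encoding.encode("some very long input text ...")

# Split the token sequence into chunks that fit within the context
# length; each chunk is embedded separately and the results combined.
chunks = [
    tokens[i : i + embedding_ctx_length]
    for i in range(0, len(tokens), embedding_ctx_length)
]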

I hope this helps clarify the issue. If you have any further questions, feel free to ask.


@dosubot dosubot bot added Ɑ: embeddings Related to text embedding models module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: openai Primarily related to OpenAI integrations labels Feb 12, 2024

nyck33 commented Feb 13, 2024

So the model being mentioned in your response has to do with tokenization and not the embeddings? I was concerned that it was not able to create the embedding vectors using the model and dimensions I chose:

EMBEDDING_DIMENSION=256
# edit this based on your model preference, e.g. text-embedding-3-small, text-embedding-ada-002
EMBEDDING_MODEL=text-embedding-3-large

but I guess it is doing that part okay. What are the differences between the tokenization models? You mention tiktoken, but what is that?

@nyck33 nyck33 closed this as completed Feb 13, 2024
oliverwang15 commented

TL;DR:

Try:

pip install -U langchain tiktoken

Reasons

From my perspective, the problem may be that "text-embedding-3-small" and "text-embedding-3-large" were not registered in tiktoken, so the package couldn't find the encoding for these embedding models and used "cl100k_base" by default.

After updating, you should find the following in tiktoken's model.py:

MODEL_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-3.5": "cl100k_base",  # Common shorthand
    "gpt-35-turbo": "cl100k_base",  # Azure deployment name
    # base
    "davinci-002": "cl100k_base",
    "babbage-002": "cl100k_base",
    # embeddings
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    # DEPRECATED MODELS
    # text (DEPRECATED)
   ...
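
A quick way to check whether the upgrade took effect (a hypothetical verification snippet, not from the thread):

import tiktoken

# With a recent tiktoken this resolves instead of raising KeyError,
# so LangChain no longer falls back to cl100k_base with a warning.
enc = tiktoken.encoding_for_model("text-embedding-3-large")
print(enc.name)  # cl100k_base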


ARajgor commented Apr 6, 2024

The issue still persists @oliverwang15

Library versions:

Langchain: 0.1.14
langchain-openai: 0.1.1
tiktoken: 0.6.0
openai: 1.16.1
pinecone: 3.2.2

My code:

from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone

pc = Pinecone()  # reads PINECONE_API_KEY from the environment

old_index = pc.Index(INDEX_NAME)  # INDEX_NAME is defined elsewhere

embed = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = embed.embed_query("hello")
print(len(vectors))

old_index.upsert(vectors=[
    {"id": "A", "values": vectors},
])

It prints 1536, and that's why my old_index shows a mismatch in dimension:

{"code":3,"message":"Vector dimension 1536 does not match the dimension of the index 512",
