
OpenAI embeddings use invalid/constant weights #3777

Closed
ravwojdyla opened this issue Apr 29, 2023 · 1 comment

Comments

@ravwojdyla
Contributor

While working on #3722 I noticed that there might be a bug in the current implementation of the OpenAI length-safe embeddings in `_get_len_safe_embeddings`, which before #3722 was actually the default implementation (after #2330).

It appears the weights used are constant and equal to the length of the embedding vector (1536), NOT the number of tokens in each batch, as in the reference implementation at https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
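To make the difference concrete, here is a minimal, hypothetical illustration (the vectors and token counts below are made up, not taken from LangChain): with a constant weight per chunk, `np.average` degenerates to a plain mean, whereas the cookbook weights each chunk embedding by its token count:

```py
import numpy as np

# Two fake chunk embeddings for one long input: the first chunk has 8191 tokens,
# the second only 500 (values are made up purely for illustration).
chunk_embeddings = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

# Constant weights (e.g. 1536 for every chunk) reduce to an unweighted mean:
np.average(chunk_embeddings, axis=0, weights=[1536, 1536])  # -> [0.5, 0.5]

# Weighting by token counts, as in the cookbook, favors the larger chunk:
np.average(chunk_embeddings, axis=0, weights=[8191, 500])   # -> [0.942..., 0.057...]
```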

hwchase17 pushed a commit that referenced this issue May 1, 2023
Re: #3777

Copy pasting from the issue:

While working on #3722 I noticed that there might be a bug in the current implementation of the OpenAI length-safe embeddings in `_get_len_safe_embeddings`, which before #3722 was actually the **default implementation** regardless of the length of the context (via #2330).

It appears the weights used are constant and equal to the length of the embedding vector (1536), NOT the number of tokens in each batch, as in the reference implementation at https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

<hr>

Here's some debug info:

<img width="1094" alt="image"
src="https://user-images.githubusercontent.com/1419010/235286595-a8b55298-7830-45df-b9f7-d2a2ad0356e0.png">

<hr>

We can also validate this against the reference implementation:

<details>

<summary>Reference implementation (click to unroll)</summary>

This implementation is copy pasted from
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

```py
import openai
import tiktoken
from itertools import islice
import numpy as np
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type


EMBEDDING_MODEL = 'text-embedding-ada-002'
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING = 'cl100k_base'

# let's make sure to not retry on an invalid request, because that is what we want to demonstrate
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6), retry=retry_if_not_exception_type(openai.InvalidRequestError))
def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):
    return openai.Embedding.create(input=text_or_tokens, model=model)["data"][0]["embedding"]

def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch
        
def chunked_tokens(text, encoding_name, chunk_length):
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    chunks_iterator = batched(tokens, chunk_length)
    yield from chunks_iterator


def reference_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, average=True):
    chunk_embeddings = []
    chunk_lens = []
    for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):
        chunk_embeddings.append(get_embedding(chunk, model=model))
        chunk_lens.append(len(chunk))

    if average:
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)  # normalizes length to 1
        chunk_embeddings = chunk_embeddings.tolist()
    return chunk_embeddings
```

</details>

```py
long_text = 'foo bar' * 5000

reference_safe_get_embedding(long_text, average=True)[:10]

# Here's the first 10 floats from the reference embeddings:
[0.004407593824276758,
 0.0017611146161865465,
 -0.019824815970984996,
 -0.02177626039794025,
 -0.012060967454897886,
 0.0017955296329155309,
 -0.015609168983609643,
 -0.012059823076681351,
 -0.016990468527792825,
 -0.004970484452089445]


# and now langchain implementation
from langchain.embeddings.openai import OpenAIEmbeddings
OpenAIEmbeddings().embed_query(long_text)[:10]

[0.003791506184693747,
 0.0025310066579390025,
 -0.019282322699514628,
 -0.021492679249899803,
 -0.012598522213242891,
 0.0022181168611315662,
 -0.015858940621301307,
 -0.011754004130791204,
 -0.016402944319627515,
 -0.004125287485127554]

# clearly they are different ^
```
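For completeness, here is a minimal sketch of the averaging step the issue expects, mirroring the cookbook (`combine_chunks` is a hypothetical helper for illustration, not the actual `_get_len_safe_embeddings` code):

```py
import numpy as np

def combine_chunks(chunk_embeddings, chunk_lens):
    # Weight each chunk embedding by its token count (not by a constant such as 1536),
    # then normalize the result to unit length, as in the cookbook reference above.
    averaged = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
    return (averaged / np.linalg.norm(averaged)).tolist()
```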
@dosubot

dosubot bot commented Aug 23, 2023

Hi, @ravwojdyla! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you opened this issue to highlight a potential bug in the OpenAI embedding implementation in the langchain repository. The bug involves the use of constant weights (equal to the embedding vector length, 1536) instead of weighting each chunk by its number of tokens. However, there hasn't been any further activity or comments on the issue since you opened it.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution, and we appreciate your understanding as we work to manage our backlog effectively. Let us know if you have any further questions or concerns!

@dosubot dosubot bot added the `stale` label Aug 23, 2023
@dosubot dosubot bot removed the `stale` label Sep 15, 2023
@dosubot dosubot bot mentioned this issue Nov 8, 2023