
OpenAI embeddings use invalid/constant weights #3777

Closed
ravwojdyla opened this issue Apr 29, 2023 · 1 comment

Comments

@ravwojdyla
Contributor

While working on #3722 I noticed that there might be a bug in the current implementation of the OpenAI length-safe embeddings in `_get_len_safe_embeddings`, which before #3722 was actually the default implementation (after #2330).

It appears the weights used are constant and equal to the length of the embedding vector (1536), NOT the number of tokens in each batch, as in the reference implementation at https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
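To make the difference concrete, here is a minimal, hypothetical illustration (the vectors and token counts below are made up, not taken from LangChain): with a constant weight per chunk, `np.average` degenerates to a plain mean, whereas the cookbook weights each chunk embedding by its token count:

```py
import numpy as np

# Two fake chunk embeddings for one long input: the first chunk has 8191 tokens,
# the second only 500 (values are made up purely for illustration).
chunk_embeddings = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

# Constant weights (e.g. 1536 for every chunk) reduce to an unweighted mean:
np.average(chunk_embeddings, axis=0, weights=[1536, 1536])  # -> [0.5, 0.5]

# Weighting by token counts, as in the cookbook, favors the larger chunk:
np.average(chunk_embeddings, axis=0, weights=[8191, 500])   # -> [0.942..., 0.057...]
```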

hwchase17 pushed a commit that referenced this issue May 1, 2023
Re: #3777

Copy pasting from the issue:

While working on #3722 I noticed that there might be a bug in the current implementation of the OpenAI length-safe embeddings in `_get_len_safe_embeddings`, which before #3722 was actually the **default implementation** regardless of the length of the context (via #2330).

It appears the weights used are constant and equal to the length of the embedding vector (1536), NOT the number of tokens in each batch, as in the reference implementation at https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

<hr>

Here's some debug info:

<img width="1094" alt="image"
src="https://user-images.githubusercontent.com/1419010/235286595-a8b55298-7830-45df-b9f7-d2a2ad0356e0.png">

<hr>

We can also validate this against the reference implementation:

<details>

<summary>Reference implementation (click to unroll)</summary>

This implementation is copy pasted from
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

```py
import openai
import tiktoken
from itertools import islice
import numpy as np
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type


EMBEDDING_MODEL = 'text-embedding-ada-002'
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING = 'cl100k_base'

# let's make sure to not retry on an invalid request, because that is what we want to demonstrate
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6), retry=retry_if_not_exception_type(openai.InvalidRequestError))
def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):
    return openai.Embedding.create(input=text_or_tokens, model=model)["data"][0]["embedding"]

def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch
        
def chunked_tokens(text, encoding_name, chunk_length):
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    chunks_iterator = batched(tokens, chunk_length)
    yield from chunks_iterator


def reference_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, average=True):
    chunk_embeddings = []
    chunk_lens = []
    for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):
        chunk_embeddings.append(get_embedding(chunk, model=model))
        chunk_lens.append(len(chunk))

    if average:
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)  # normalizes length to 1
        chunk_embeddings = chunk_embeddings.tolist()
    return chunk_embeddings
```

</details>

```py
long_text = 'foo bar' * 5000

reference_safe_get_embedding(long_text, average=True)[:10]

# Here's the first 10 floats from the reference embeddings:
[0.004407593824276758,
 0.0017611146161865465,
 -0.019824815970984996,
 -0.02177626039794025,
 -0.012060967454897886,
 0.0017955296329155309,
 -0.015609168983609643,
 -0.012059823076681351,
 -0.016990468527792825,
 -0.004970484452089445]


# and now langchain implementation
from langchain.embeddings.openai import OpenAIEmbeddings
OpenAIEmbeddings().embed_query(long_text)[:10]

[0.003791506184693747,
 0.0025310066579390025,
 -0.019282322699514628,
 -0.021492679249899803,
 -0.012598522213242891,
 0.0022181168611315662,
 -0.015858940621301307,
 -0.011754004130791204,
 -0.016402944319627515,
 -0.004125287485127554]

# clearly they are different ^
```
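For completeness, here is a minimal sketch of the averaging step the issue expects, mirroring the cookbook (`combine_chunks` is a hypothetical helper for illustration, not the actual `_get_len_safe_embeddings` code):

```py
import numpy as np

def combine_chunks(chunk_embeddings, chunk_lens):
    # Weight each chunk embedding by its token count (not by a constant such as 1536),
    # then normalize the result to unit length, as in the cookbook reference above.
    averaged = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
    return (averaged / np.linalg.norm(averaged)).tolist()
```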
@dosubot

dosubot bot commented Aug 23, 2023

Hi, @ravwojdyla! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you opened this issue to highlight a potential bug in the OpenAI embedding implementation in the langchain repository. The bug involves the use of constant weights (equal to the embedding vector length, 1536) instead of weighting each chunk by its number of tokens. However, there hasn't been any further activity or comments on the issue since you opened it.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution, and we appreciate your understanding as we work to manage our backlog effectively. Let us know if you have any further questions or concerns!

@dosubot dosubot bot added the `stale` label Aug 23, 2023
@dosubot dosubot bot removed the `stale` label Sep 15, 2023
@dosubot dosubot bot mentioned this issue Nov 8, 2023