
set default embedding max token size #2330

Merged
merged 1 commit into langchain-ai:master on Apr 7, 2023

Conversation

AeroXi (Contributor) commented Apr 3, 2023

#991 already implemented this convenient feature to prevent exceeding the max token limit of the embedding model.

By default the feature is deactivated so as not to change the previous behavior; if you specify something like 8191 here, it works as desired. According to the author, it was intentionally left unset by default.
However, the default model in OpenAIEmbeddings has a max token size of 8191, and no other OpenAI embedding model has a larger token limit.
So I believe it is better to set this as the default value; otherwise users may encounter this error and find it hard to solve.
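For context, a minimal sketch of the before/after from a user's perspective (the `embedding_ctx_length` field comes from #991; treat the exact constructor usage as illustrative rather than a guaranteed API):

```py
from langchain.embeddings.openai import OpenAIEmbeddings

# Before this PR: length-safe chunking was opt-in, e.g.
embeddings = OpenAIEmbeddings(embedding_ctx_length=8191)

# After this PR: 8191 is the default, so the plain constructor is length-safe.
embeddings = OpenAIEmbeddings()

# Long inputs are chunked and averaged instead of raising a max-token error.
vector = embeddings.embed_query("some long document " * 2000)
```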

@hwchase17 hwchase17 merged commit e131156 into langchain-ai:master Apr 7, 2023
9 checks passed
hwchase17 pushed a commit that referenced this pull request Apr 29, 2023
Re: #3722

Copy pasting context from the issue:


https://github.com/hwchase17/langchain/blob/1bf1c37c0cccb7c8c73d87ace27cf742f814dbe5/langchain/embeddings/openai.py#L210-L211

This means the length-safe embedding method is "always" used. The initial
implementation #991 had
`embedding_ctx_length` set to -1 (meaning you had to opt in to the
length-safe method); #2330
changed that to the max length of OpenAI embeddings v2, so the length-safe
method is now used at all times.

How about changing that if branch to use the length-safe method only when
needed, i.e. when the text is longer than the max context length?
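A rough sketch of the proposed dispatch, assuming the token count is checked up front (`embed_short` and `embed_len_safe` are placeholders for the existing single-request and chunk-and-average code paths, not library functions):

```py
import tiktoken

EMBEDDING_CTX_LENGTH = 8191  # max input tokens for text-embedding-ada-002 (v2)

def embed_text(text, embed_short, embed_len_safe, encoding_name="cl100k_base"):
    """Use the length-safe path only when the text is too long for one request."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(text))
    if num_tokens > EMBEDDING_CTX_LENGTH:
        return embed_len_safe(text)   # chunk, embed each chunk, average
    return embed_short(text)          # short text: a single API call suffices
```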
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023
hwchase17 pushed a commit that referenced this pull request May 1, 2023
Re: #3777

Copy pasting from the issue:

While working on #3722 I
noticed that there might be a bug in the current implementation of
the OpenAI length-safe embeddings in `_get_len_safe_embeddings`, which
before #3722 was actually
the **default implementation** regardless of the length of the context
(via #2330).

It appears the weights used are constant and equal to the length of the embedding
vector (1536), NOT the number of tokens in each batch as in the
reference implementation at
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
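In other words, a toy illustration of the difference (not the library code): the chunk average should be weighted by tokens per chunk, not by a constant tied to the embedding dimension.

```py
import numpy as np

# Two toy 3-dim chunk embeddings; the second chunk is a short tail chunk.
chunk_embeddings = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0]])
chunk_lens = [8191, 100]         # tokens per chunk -- the correct weights
constant_weights = [1536, 1536]  # what a constant weight effectively does

correct = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
buggy = np.average(chunk_embeddings, axis=0, weights=constant_weights)

print(correct)  # ~[0.988, 0.012, 0.0] -- the tail chunk barely moves the average
print(buggy)    # [0.5, 0.5, 0.0]      -- the tail chunk counts like a full chunk
```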

<hr>

Here's some debug info:

<img width="1094" alt="image"
src="https://user-images.githubusercontent.com/1419010/235286595-a8b55298-7830-45df-b9f7-d2a2ad0356e0.png">

<hr>

We can also validate this against the reference implementation:

<details>

<summary>Reference implementation (click to unroll)</summary>

This implementation is copy pasted from
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

```py
import openai
import tiktoken
from itertools import islice
import numpy as np
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type


EMBEDDING_MODEL = 'text-embedding-ada-002'
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING = 'cl100k_base'

# let's make sure to not retry on an invalid request, because that is what we want to demonstrate
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6), retry=retry_if_not_exception_type(openai.InvalidRequestError))
def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):
    return openai.Embedding.create(input=text_or_tokens, model=model)["data"][0]["embedding"]

def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch
        
def chunked_tokens(text, encoding_name, chunk_length):
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    chunks_iterator = batched(tokens, chunk_length)
    yield from chunks_iterator


def reference_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, average=True):
    chunk_embeddings = []
    chunk_lens = []
    for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):
        chunk_embeddings.append(get_embedding(chunk, model=model))
        chunk_lens.append(len(chunk))

    if average:
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)  # normalizes length to 1
        chunk_embeddings = chunk_embeddings.tolist()
    return chunk_embeddings
```

</details>

```py
long_text = 'foo bar' * 5000

reference_safe_get_embedding(long_text, average=True)[:10]

# Here's the first 10 floats from the reference embeddings:
[0.004407593824276758,
 0.0017611146161865465,
 -0.019824815970984996,
 -0.02177626039794025,
 -0.012060967454897886,
 0.0017955296329155309,
 -0.015609168983609643,
 -0.012059823076681351,
 -0.016990468527792825,
 -0.004970484452089445]


# and now langchain implementation
from langchain.embeddings.openai import OpenAIEmbeddings
OpenAIEmbeddings().embed_query(long_text)[:10]

[0.003791506184693747,
 0.0025310066579390025,
 -0.019282322699514628,
 -0.021492679249899803,
 -0.012598522213242891,
 0.0022181168611315662,
 -0.015858940621301307,
 -0.011754004130791204,
 -0.016402944319627515,
 -0.004125287485127554]

# clearly they are different ^
```
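For what it's worth, the two vectors can also be compared numerically rather than by eye; this quick check (not part of the original report) continues from the snippets above:

```py
import numpy as np

ref = np.array(reference_safe_get_embedding(long_text, average=True))
lc = np.array(OpenAIEmbeddings().embed_query(long_text))

# Cosine similarity; two equivalent implementations should give ~1.0.
cosine = float(ref @ lc / (np.linalg.norm(ref) * np.linalg.norm(lc)))
print(cosine)
```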