Fixed openai embeddings to be safe by batching them based on token size calculation. #991
Conversation
The original method did not know the actual token counts, so depending on how the text was split, an error could occur at run time. However, because this fix requires the tiktoken package, it is disabled by default so as not to affect existing projects.
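For readers unfamiliar with the approach, here is a minimal sketch of the idea (the 8191 limit and the cl100k_base encoding follow the OpenAI cookbook linked below; the helper name is illustrative, not the PR's actual code):

import tiktoken

EMBEDDING_CTX_LENGTH = 8191  # max input tokens for text-embedding-ada-002
ENCODING_NAME = "cl100k_base"

def chunk_tokens(text, ctx_length=EMBEDDING_CTX_LENGTH):
    """Split a text into token chunks that each fit the model's context."""
    encoding = tiktoken.get_encoding(ENCODING_NAME)
    tokens = encoding.encode(text)
    # Slice the token list into windows of at most ctx_length tokens;
    # each window can be sent to the embeddings endpoint safely.
    return [tokens[i : i + ctx_length] for i in range(0, len(tokens), ctx_length)]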
fix batch processing of openai embeddings to avoid token-limit errors
Fixed an issue where the token count was too large and this kind of error sometimes occurred.
Weighted averaging is always performed when the token count exceeds the specified upper limit.
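To make "weighted averaging" concrete, this is roughly the cookbook recipe (a sketch with illustrative names, assuming NumPy; not the code merged in this PR):

import numpy as np

def combine_chunk_embeddings(chunk_embeddings, chunk_token_counts):
    # Average the per-chunk embeddings, weighting each chunk by how many
    # tokens it contained, then re-normalize to a unit vector so cosine
    # similarity still behaves as expected.
    avg = np.average(chunk_embeddings, axis=0, weights=chunk_token_counts)
    return (avg / np.linalg.norm(avg)).tolist()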
restore chunk_size to original value
…ze calculation. (langchain-ai#991) I modified the logic of the batch calculation for embedding according to this cookbook https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
I still got this error, is this related?
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader
from chromadb.config import Settings

person = "wangxing"

# Load and split the source documents.
loader = DirectoryLoader('source/' + person, loader_cls=TextLoader)
documents = loader.load()
print(f"Indexing {len(documents)} documents.")
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)
documents = text_splitter.split_documents(documents)

# Embed and index into a Chroma server running locally.
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents,
    embeddings,
    client_settings=Settings(
        chroma_api_impl="rest",
        chroma_server_host="localhost",
        chroma_server_http_port="8000",
    ),
    collection_name=person,
)
By default, this function is deactivated so as not to change the previous behavior. If you specify something like 8191 here, it will work as desired.
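For anyone finding this later, opting in looks like this (`embedding_ctx_length` is the parameter introduced by this PR, and 8191 is the text-embedding-ada-002 limit discussed in this thread):

from langchain.embeddings.openai import OpenAIEmbeddings

# Enable token-size-aware batching: texts longer than the limit are split
# into token chunks, embedded separately, and weighted-averaged back together.
embeddings = OpenAIEmbeddings(embedding_ctx_length=8191)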
This works perfectly! Thank you
#991 has already implemented this convenient feature to prevent exceeding the max token limit of the embedding model.

> By default, this function is deactivated so as not to change the previous behavior. If you specify something like 8191 here, it will work as desired.

According to the author, this is not enabled by default. Meanwhile, the default model for OpenAIEmbeddings has a max token limit of 8191, and no other OpenAI embedding model has a larger one. So I believe it would be better to make this the default value; otherwise users may encounter this error and find it hard to solve.
Re: #3722 Copy-pasting context from the issue: https://github.com/hwchase17/langchain/blob/1bf1c37c0cccb7c8c73d87ace27cf742f814dbe5/langchain/embeddings/openai.py#L210-L211 means that the length-safe embedding method is always used. The initial implementation in #991 had `embedding_ctx_length` set to -1 (meaning you had to opt in to the length-safe method); #2330 changed that to the max length of OpenAI embeddings v2, so the length-safe method is now used at all times. How about changing that if branch to use the length-safe method only when needed, i.e. when the text is longer than the max context length?
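Something along these lines (a sketch only; the method and attribute names are illustrative, assuming a tiktoken-style tokenizer is available on the class):

def embed_query(self, text):
    num_tokens = len(self._tokenizer.encode(text))
    if num_tokens > self.embedding_ctx_length:
        # Only pay for chunking plus weighted averaging when the text
        # actually exceeds the model's context length.
        return self._get_len_safe_embeddings([text], engine=self.deployment)[0]
    # Short texts keep the original single-call path.
    return self._embedding_func(text, engine=self.deployment)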
I modified the logic of the batch calculation for embedding according to this cookbook
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb