Fixed openai embeddings to be safe by batching them based on token size calculation. #991
Conversation
The original method did not know the actual token counts, so depending on how the text was split, an error could occur at run time. However, because this fix requires the tiktoken package, it is disabled by default so as not to affect existing projects.
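For readers unfamiliar with the approach, here is a minimal sketch of the idea (the 8191 limit and the cl100k_base encoding follow the OpenAI cookbook linked below; the helper name is illustrative, not the PR's actual code):

import tiktoken

EMBEDDING_CTX_LENGTH = 8191  # max input tokens for text-embedding-ada-002
ENCODING_NAME = "cl100k_base"

def chunk_tokens(text, ctx_length=EMBEDDING_CTX_LENGTH):
    """Split a text into token chunks that each fit the model's context."""
    encoding = tiktoken.get_encoding(ENCODING_NAME)
    tokens = encoding.encode(text)
    # Slice the token list into windows of at most ctx_length tokens;
    # each window can be sent to the embeddings endpoint safely.
    return [tokens[i : i + ctx_length] for i in range(0, len(tokens), ctx_length)]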
fix batch processing of openai embeddings to avoid token-limit errors
Fixed an issue where the token count was too large and this kind of error sometimes occurred.
Weighted averaging is always performed when the token count exceeds the specified upper limit.
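To make "weighted averaging" concrete, this is roughly the cookbook recipe (a sketch with illustrative names, assuming NumPy; not the code merged in this PR):

import numpy as np

def combine_chunk_embeddings(chunk_embeddings, chunk_token_counts):
    # Average the per-chunk embeddings, weighting each chunk by how many
    # tokens it contained, then re-normalize to a unit vector so cosine
    # similarity still behaves as expected.
    avg = np.average(chunk_embeddings, axis=0, weights=chunk_token_counts)
    return (avg / np.linalg.norm(avg)).tolist()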
restore chunk_size to original value
…ze calculation. (langchain-ai#991) I modified the logic of the batch calculation for embedding according to this cookbook https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
I still got this error, is this related?
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader
from chromadb.config import Settings

person = "wangxing"

# Load and split the source documents.
loader = DirectoryLoader('source/' + person, loader_cls=TextLoader)
documents = loader.load()
print(f"Indexing {len(documents)} documents.")
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)
documents = text_splitter.split_documents(documents)

# Embed and index into a Chroma server running locally.
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents,
    embeddings,
    client_settings=Settings(
        chroma_api_impl="rest",
        chroma_server_host="localhost",
        chroma_server_http_port="8000",
    ),
    collection_name=person,
)
By default, this function is deactivated so as not to change the previous behavior. If you specify something like 8191 here, it will work as desired.
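For anyone finding this later, opting in looks like this (`embedding_ctx_length` is the parameter introduced by this PR, and 8191 is the text-embedding-ada-002 limit discussed in this thread):

from langchain.embeddings.openai import OpenAIEmbeddings

# Enable token-size-aware batching: texts longer than the limit are split
# into token chunks, embedded separately, and weighted-averaged back together.
embeddings = OpenAIEmbeddings(embedding_ctx_length=8191)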
This works perfectly! Thank you
#991 has already implemented this convenient feature to prevent exceeding the max token limit of the embedding model.

> By default, this function is deactivated so as not to change the previous behavior. If you specify something like 8191 here, it will work as desired.

According to the author, this is not enabled by default. Meanwhile, the default model for OpenAIEmbeddings has a max token limit of 8191, and no other OpenAI embedding model has a larger one. So I believe it would be better to make this the default value; otherwise users may encounter this error and find it hard to solve.
Re: #3722 Copy-pasting context from the issue: https://github.com/hwchase17/langchain/blob/1bf1c37c0cccb7c8c73d87ace27cf742f814dbe5/langchain/embeddings/openai.py#L210-L211 means that the length-safe embedding method is always used. The initial implementation in #991 had `embedding_ctx_length` set to -1 (meaning you had to opt in to the length-safe method); #2330 changed that to the max length of OpenAI embeddings v2, so the length-safe method is now used at all times. How about changing that if branch to use the length-safe method only when needed, i.e. when the text is longer than the max context length?
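Something along these lines (a sketch only; the method and attribute names are illustrative, assuming a tiktoken-style tokenizer is available on the class):

def embed_query(self, text):
    num_tokens = len(self._tokenizer.encode(text))
    if num_tokens > self.embedding_ctx_length:
        # Only pay for chunking plus weighted averaging when the text
        # actually exceeds the model's context length.
        return self._get_len_safe_embeddings([text], engine=self.deployment)[0]
    # Short texts keep the original single-call path.
    return self._embedding_func(text, engine=self.deployment)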
I modified the logic of the batch calculation for embedding according to this cookbook
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb