
Fixed openai embeddings to be safe by batching them based on token size calculation. #991

Merged
merged 5 commits into langchain-ai:master Feb 16, 2023

Conversation

@Hase-U (Contributor) commented Feb 11, 2023

I modified the logic of the batch calculation for embedding according to this cookbook
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

@Hase-U (Contributor, Author) commented Feb 11, 2023

The original method did not know the exact token count of each text, so depending on how the text was split, an error could occur at run time. I changed the batching to be calculated from the token count instead.

However, since this requires the tiktoken package, it is not enabled by default, so existing projects are not affected.
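For reference, a minimal sketch of the cookbook's chunking approach, assuming tiktoken is installed (the encoding name and limit below are illustrative, not the exact values used in this PR):

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 8191, encoding_name: str = "cl100k_base"):
    """Split text into pieces of at most max_tokens tokens each."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    # Each decoded chunk is guaranteed to fit inside the model's context window.
    return [enc.decode(tokens[i : i + max_tokens]) for i in range(0, len(tokens), max_tokens)]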

@Hase-U (Contributor, Author) commented Feb 15, 2023

Fixed an issue where the token count was too large and this kind of error sometimes occurred:

This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

When the token count exceeds the specified upper limit, weighted averaging of the chunk embeddings is always performed. This protects the invariant that the indices of the texts and metadatas lists passed to langchain's VectorStore etc. correspond: the embedding class still returns exactly one embedding per input text.
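Roughly, the weighted averaging works like the cookbook's example: each chunk embedding is weighted by its chunk's token count and the combined vector is re-normalized, so every input text still maps to exactly one vector. A hedged sketch, not the exact code in this PR:

import numpy as np

def combine_chunk_embeddings(chunk_embeddings, chunk_token_counts):
    """Average the chunk embeddings weighted by their token counts, then normalize to unit length."""
    avg = np.average(chunk_embeddings, axis=0, weights=chunk_token_counts)
    return (avg / np.linalg.norm(avg)).tolist()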

@Hase-U changed the title from "update openai embeddings to calculate based on token size" to "Fixed openai embeddings to be safe by batching them based on token size calculation." Feb 15, 2023
@hwchase17 hwchase17 merged commit e08961a into langchain-ai:master Feb 16, 2023
dongreenberg pushed a commit to dongreenberg/langchain that referenced this pull request Feb 17, 2023
@blob42 blob42 mentioned this pull request Feb 21, 2023
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this pull request Mar 8, 2023
@AeroXi (Contributor) commented Apr 3, 2023

I still got this error, is this related?

openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 8576 tokens (8576 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import DirectoryLoader

import chromadb
from chromadb.config import Settings
from langchain.document_loaders import TextLoader
person = "wangxing"
loader = DirectoryLoader('source/' + person, loader_cls=TextLoader)
documents = loader.load()
print(f"Indexing {len(documents)} documents.")

text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)
documents = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings, client_settings=Settings(chroma_api_impl="rest",
                                                                                    chroma_server_host="localhost",
                                                                                    chroma_server_http_port="8000"
                                                                                    ), collection_name=person)

@Hase-U (Contributor, Author) commented Apr 3, 2023

By default, this function is deactivated so as not to change the previous behavior. If you specify something like 8191 here, it will work as desired.

https://github.com/hwchase17/langchain/blob/d85f57ef9cbbbd5e512e064fb81c531b28c6591c/langchain/embeddings/openai.py#L99
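For example, a minimal usage sketch (the keyword name comes from the linked source line; 8191 is the embedding model's max context length):

from langchain.embeddings.openai import OpenAIEmbeddings

# Opt in to length-safe batching by passing the model's max context length.
embeddings = OpenAIEmbeddings(embedding_ctx_length=8191)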

@AeroXi (Contributor) commented Apr 3, 2023

> By default, this function is deactivated so as not to change the previous behavior. If you specify something like 8191 here, it will work as desired.
>
> https://github.com/hwchase17/langchain/blob/d85f57ef9cbbbd5e512e064fb81c531b28c6591c/langchain/embeddings/openai.py#L99

This works perfectly! Thank you.
I suggest setting it as the default in #2330; what do you think?

hwchase17 pushed a commit that referenced this pull request Apr 7, 2023
#991 already implemented this convenient feature to prevent exceeding the max token limit of the embedding model.

> By default, this function is deactivated so as not to change the previous behavior. If you specify something like 8191 here, it will work as desired.

According to the author, this is not set by default. The default model in OpenAIEmbeddings has a max token size of 8191 tokens, and no other openai model has a larger token limit. So I believe it is better to set this as the default value; otherwise users may encounter this error and find it hard to solve.
hwchase17 pushed a commit that referenced this pull request Apr 29, 2023
Re: #3722

Copy pasting context from the issue:


https://github.com/hwchase17/langchain/blob/1bf1c37c0cccb7c8c73d87ace27cf742f814dbe5/langchain/embeddings/openai.py#L210-L211

This means that the length-safe embedding method is "always" used. The initial implementation in #991 had `embedding_ctx_length` set to -1 (meaning you had to opt in to the length-safe method); #2330 changed that to the max length of OpenAI embeddings v2, so the length-safe method is now used at all times.

How about changing that if branch to use the length-safe method only when needed, i.e. when the text is longer than the max context length?
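A rough sketch of that suggestion, assuming a tiktoken-based token count; embed_len_safe and embed_directly are hypothetical placeholders for the two existing code paths, not the real method names:

import tiktoken

def exceeds_ctx_length(texts, max_ctx_length=8191, encoding_name="cl100k_base"):
    """Return True only when some text would overflow the model's context window."""
    enc = tiktoken.get_encoding(encoding_name)
    return any(len(enc.encode(text)) > max_ctx_length for text in texts)

def embed_documents(texts, embed_len_safe, embed_directly, max_ctx_length=8191):
    # Take the chunk-and-average path only when it is actually needed.
    if exceeds_ctx_length(texts, max_ctx_length):
        return embed_len_safe(texts)
    return embed_directly(texts)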
samching pushed a commit to samching/langchain that referenced this pull request May 1, 2023