
Passing large datasets to Embedding causes an error #519

Closed
ibeckermayer opened this issue Jul 4, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@ibeckermayer

Describe the bug

See the bug report originally filed in chroma-core/chroma#709

To Reproduce

See the bug report originally filed in chroma-core/chroma#709

Code snippets

No response

OS

macOS

Python version

Python v3.11.4

Library version

openai==0.27.8

@Alisultani1

Alisultani1 commented Jul 4, 2023 via email

@ibeckermayer
Author

I also encountered this error when trying to embed a "chunk" that was too large:

Traceback (most recent call last):
  File "/Users/ibeckermayer/test/scripts/bug.py", line 321, in <module>
    main()
  File "/Users/ibeckermayer/test/scripts/bug.py", line 315, in main
    embed_all(texts, metadatas, ids)
  File "/Users/ibeckermayer/test/scripts/bug.py", line 41, in embed_all
    collection.add(documents=texts_chunk, metadatas=metadatas_chunk, ids=ids_chunk)
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 96, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 387, in _validate_embedding_set
    embeddings = self._embedding_function(documents)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/utils/embedding_functions.py", line 111, in __call__
    embeddings = self._client.create(input=texts, engine=self._model_name)["data"]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
                           ^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 298, in request
    resp, got_stream = self._interpret_response(result, stream)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 700, in _interpret_response
    self._interpret_response_line(
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 763, in _interpret_response_line
    raise self.handle_error_response(
openai.error.RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-j9E4an588TDys1U9hTjfa8py on tokens per min. Limit: 1000000 / min. Current: 0 / min. Contact us through our help center at help.openai.com if you continue to have issues.

@emcd

emcd commented Jul 16, 2023

@ibeckermayer: Thanks for reporting this. I ran into the openai.error.InvalidRequestError: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference. error today and found the issue you reported for ChromaDB. I have looked at the OpenAI Python client code, and it has no specific handling for the input issue you reported. The error actually comes from the OpenAI API and is not a problem with the Python client per se.

Some trial and error (a binary search on valid array/list sizes) and guesswork led me to discover that the maximum array size for input is 2048. So, assuming you are using text-embedding-ada-002, all of the following constraints hold:

  • The input parameter may not take a list longer than 2048 elements (chunks of text).
  • The total number of tokens across all list elements of the input parameter cannot exceed 1,000,000. (Because the rate limit is 1,000,000 tokens per minute.)
  • Each individual array element (chunk of text) cannot be more than 8191 tokens.
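Given those constraints, one common workaround is to split the document list into batches no longer than the array cap before calling the API. A minimal sketch (the helper name and constant are illustrative, not part of the openai library; the 2048 figure is the empirically discovered cap from this thread):

```python
# Batch a document list so each Embedding request stays within the
# 2048-element input-array cap described above.

MAX_ITEMS_PER_REQUEST = 2048  # empirically discovered cap on len(input)


def batched(texts, batch_size=MAX_ITEMS_PER_REQUEST):
    """Yield successive slices of `texts`, each at most `batch_size` long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Hypothetical usage with openai==0.27.8 (the version in this issue);
# left commented out since it requires an API key and network access:
#
# import openai
# embeddings = []
# for batch in batched(all_texts):
#     resp = openai.Embedding.create(input=batch, engine="text-embedding-ada-002")
#     embeddings.extend(d["embedding"] for d in resp["data"])
```

Note this only addresses the array-length limit; the per-element token limit and the tokens-per-minute rate limit still need separate handling (e.g. retry with backoff on RateLimitError).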

@skskcco2o17

Also, no element in the input list (the list of paragraphs) should be blank, empty, or null.
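That condition can be guarded against with a small pre-filter before building the request. A sketch (the helper name is illustrative):

```python
# Drop blank entries before calling the Embedding endpoint, since empty
# or null elements in `input` reportedly trigger the
# "'$.input' is invalid" error.

def drop_blank(texts):
    """Return only the entries that contain non-whitespace content."""
    return [t for t in texts if t is not None and t.strip()]
```

If you need to keep documents aligned with their metadata and IDs (as in the ChromaDB case), filter all three lists together rather than the documents alone.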

@LinqLover

Very nice findings. Could someone from the OpenAI team document the 2048 and 1,000,000 limits in the docs? Unfortunately, the feedback form there is not really usable (it is even single-line)...

@rattrayalex
Collaborator

Thanks for debugging & sharing your findings @emcd. I've put in a request to update the documentation accordingly.

I'm going to close this issue since it's not a bug in the Python library.

@maheshwaghmare

I was facing the same issue and found a workaround.

I am using the text-embedding-ada-002 model and was sending a large array as input:

'input' => $input_text,

Since I am using PHP, I converted $input_text to JSON format:

'input' => json_encode( $input_text ),

And it works for me.

Additionally, the docs describe the accepted input types:

1. string: the string will be turned into an embedding.
2. array: an array of strings that will be turned into embeddings.
3. array: an array of integers that will be turned into an embedding.
4. array: an array of arrays of integers that will be turned into embeddings.

But there is no mention of this.

SOLUTION: Convert the Array to JSON format.
