
Passing large datasets to Embedding causes an error #519

Closed
ibeckermayer opened this issue Jul 4, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@ibeckermayer

Describe the bug

See the bug report originally filed in chroma-core/chroma#709

To Reproduce

See the bug report originally filed in chroma-core/chroma#709

Code snippets

No response

OS

macOS

Python version

Python v3.11.4

Library version

openai==0.27.8

@Alisultani1

Alisultani1 commented Jul 4, 2023 via email

@ibeckermayer
Author

I also encountered this error when trying to embed a "chunk" that was too large:

Traceback (most recent call last):
  File "/Users/ibeckermayer/test/scripts/bug.py", line 321, in <module>
    main()
  File "/Users/ibeckermayer/test/scripts/bug.py", line 315, in main
    embed_all(texts, metadatas, ids)
  File "/Users/ibeckermayer/test/scripts/bug.py", line 41, in embed_all
    collection.add(documents=texts_chunk, metadatas=metadatas_chunk, ids=ids_chunk)
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 96, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 387, in _validate_embedding_set
    embeddings = self._embedding_function(documents)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/utils/embedding_functions.py", line 111, in __call__
    embeddings = self._client.create(input=texts, engine=self._model_name)["data"]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
                           ^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 298, in request
    resp, got_stream = self._interpret_response(result, stream)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 700, in _interpret_response
    self._interpret_response_line(
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 763, in _interpret_response_line
    raise self.handle_error_response(
openai.error.RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-j9E4an588TDys1U9hTjfa8py on tokens per min. Limit: 1000000 / min. Current: 0 / min. Contact us through our help center at help.openai.com if you continue to have issues.

@emcd

emcd commented Jul 16, 2023

@ibeckermayer: Thanks for reporting this. I ran into the openai.error.InvalidRequestError: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference. error today and found the issue you reported for ChromaDB. I have looked at the OpenAI Python client code, and it has no specific handling for the input issue you reported. The error actually comes from the OpenAI API and is not a problem with the Python client per se.

Some trial and error (a binary search on valid array/list sizes) and guesswork led me to discover that the maximum array size for input is 2048. So, assuming you are using text-embedding-ada-002, all of the following constraints hold:

  • The input parameter may not take a list longer than 2048 elements (chunks of text).
  • The total number of tokens across all list elements of the input parameter cannot exceed 1,000,000. (Because the rate limit is 1,000,000 tokens per minute.)
  • Each individual array element (chunk of text) cannot be more than 8191 tokens.
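Given those constraints, one common workaround is to split the document list into batches no longer than the array cap before calling the API. A minimal sketch (the helper name and constant are illustrative, not part of the openai library; the 2048 figure is the empirically discovered cap from this thread):

```python
# Batch a document list so each Embedding request stays within the
# 2048-element input-array cap described above.

MAX_ITEMS_PER_REQUEST = 2048  # empirically discovered cap on len(input)


def batched(texts, batch_size=MAX_ITEMS_PER_REQUEST):
    """Yield successive slices of `texts`, each at most `batch_size` long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Hypothetical usage with openai==0.27.8 (the version in this issue);
# left commented out since it requires an API key and network access:
#
# import openai
# embeddings = []
# for batch in batched(all_texts):
#     resp = openai.Embedding.create(input=batch, engine="text-embedding-ada-002")
#     embeddings.extend(d["embedding"] for d in resp["data"])
```

Note this only addresses the array-length limit; the per-element token limit and the tokens-per-minute rate limit still need separate handling (e.g. retry with backoff on RateLimitError).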

@skskcco2o17

Also, no element in the input list (the list of paragraphs) should be blank, empty, or null.
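That condition can be guarded against with a small pre-filter before building the request. A sketch (the helper name is illustrative):

```python
# Drop blank entries before calling the Embedding endpoint, since empty
# or null elements in `input` reportedly trigger the
# "'$.input' is invalid" error.

def drop_blank(texts):
    """Return only the entries that contain non-whitespace content."""
    return [t for t in texts if t is not None and t.strip()]
```

If you need to keep documents aligned with their metadata and IDs (as in the ChromaDB case), filter all three lists together rather than the documents alone.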

@LinqLover

Very nice findings. Could someone from the OpenAI team document the 2048 and 1,000,000 limits in the docs? Unfortunately, the feedback form there is not really usable (it is even single-line)...

@rattrayalex
Collaborator

Thanks for debugging & sharing your findings @emcd. I've put in a request to update the documentation accordingly.

I'm going to close this issue since it's not a bug in the Python library.

@maheshwaghmare

I was facing the same issue and found a workaround.

I am using the text-embedding-ada-002 model and was sending a large array as input:

'input' => $input_text,

Since I am using PHP, I converted $input_text to JSON format:

'input' => json_encode( $input_text ),

And it works for me.

Additionally, the docs describe the accepted input types:

1. string: the string will be turned into an embedding.
2. array: an array of strings that will be turned into embeddings.
3. array: an array of integers that will be turned into an embedding.
4. array: an array of arrays of integers that will be turned into embeddings.

But there is no mention of this.

SOLUTION: Convert the Array to JSON format.
