Support rate limiting in embeddings API #579

Closed
AyushExel opened this issue Oct 17, 2023 · 0 comments · Fixed by #614

Comments

@AyushExel
Contributor

Most LLM APIs and their derivatives have some form of rate limiting, and the trial versions for testing have smaller limits. When using the new embeddings API, the calls to model APIs are made implicitly when data is added to the tables. There are ways to manually rate limit by sleeping for a few seconds when adding data in smaller batches or individual rows. But when adding data in larger batches using the registered EmbeddingFunction instance, there is no way to prevent hitting the rate limit.

There are 2 types of rate limits that we could potentially support:

  • Request level - Allow the user to set the RPM (requests per minute) when initializing the EmbeddingFunction instance. It keeps a rolling count of requests made in the last 60 seconds and sleeps when the limit is reached. Each call to EmbeddingFunction.generate_embeddings can be assumed to be one batched request (a minimal sketch follows this list).
  • Token level - We should simply provide an interface for the user to handle this case themselves, as handling it at a lower level can be tricky because there are two cases. Case 1: the token rate limit is exceeded because the combined tokens of multiple texts exceed the limit; this can be handled the same way as the request limit, by waiting it out. Case 2: the token limit is exceeded by a single text, so the solution is to either chunk it or simply truncate everything after the token limit (see the truncation sketch after the API example below). The token limit can also be applied in combination with the request limit.
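
As a rough illustration of the request-level idea, here is a minimal sketch of a rolling-window limiter. The RequestRateLimiter class and its usage are hypothetical, not LanceDB's actual implementation; it assumes one generate_embeddings call equals one provider request.

import time
from collections import deque

class RequestRateLimiter:
    """Allow at most `rpm` calls in any rolling 60-second window."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.calls = deque()  # monotonic timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have fallen out of the 60-second window.
        while self.calls and now - self.calls[0] > 60:
            self.calls.popleft()
        if len(self.calls) >= self.rpm:
            # Sleep until the oldest call in the window expires.
            time.sleep(60 - (now - self.calls[0]))
        self.calls.append(time.monotonic())

limiter = RequestRateLimiter(rpm=10)
for batch in range(12):
    limiter.wait()  # each iteration stands in for one batched embedding request
    print(f"request {batch} sent")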

Something like this:

cohere = EmbeddingFunctionRegistry().get_instance("model_name", rate_limit=10, token_limit=1000000)
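
For the token-level case, here is a minimal sketch of what "chunk or truncate" could look like on the user's side. The whitespace tokenizer and the max_tokens value are illustrative assumptions; real providers count tokens with their own tokenizers.

def truncate_to_token_limit(text: str, max_tokens: int) -> str:
    # Crude whitespace "tokenization", for illustration only.
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

def chunk_by_token_limit(text: str, max_tokens: int) -> list[str]:
    # Split one oversized text into several requests instead of dropping content.
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

long_text = "word " * 5000
print(len(truncate_to_token_limit(long_text, 1000).split()))  # 1000
print(len(chunk_by_token_limit(long_text, 1000)))             # 5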
AyushExel added a commit that referenced this issue Oct 18, 2023
Sets things up for this -> #579
- Just separates out the registry/ingestion code from the function
implementation code
- adds a `get_registry` util
- package name "open-clip" -> "open-clip-torch"
AyushExel added a commit that referenced this issue Nov 2, 2023
…g functions (#614)

Users ingesting data using rate-limited APIs don't need to manually make
the process sleep to counter rate limits
resolves #579