Improve parallel requests handling #30
Was looking into this, and I don't think batch completion handles rate limit errors. I am going to try to complete the "No" solution, but I was wondering how I would be able to test it. Would I just make multiple requests in a row, or is there a better way to do it?
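One way to try to replicate the error is to fire many requests concurrently rather than sequentially, since sequential calls may never hit a provider's per-minute cap. This is only a sketch: `fire_concurrent` is a hypothetical helper, and in practice `call` would wrap a real litellm completion request against a low-rate-limit provider.

```python
# Sketch: try to trigger a provider's rate limit by firing many requests
# at once. `call` is any zero-argument callable; in practice it would wrap
# a litellm completion request. This helper itself makes no network calls.
from concurrent.futures import ThreadPoolExecutor


def fire_concurrent(call, n=50):
    """Run `call` n times in parallel, collecting results or exceptions."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(call) for _ in range(n)]
        results = []
        for f in futures:
            try:
                results.append(("ok", f.result()))
            except Exception as exc:  # e.g. litellm's RateLimitError
                results.append(("error", exc))
        return results
```

If any `("error", ...)` entries come back as rate limit errors, you have a reproduction to test the retry handling against.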
On Wed, Jul 10, 2024 at 12:07 AM Joshua Ashkinaze ***@***.***> wrote:
In the implementation I have, I am sending parallel LiteLLM completion requests through native Python libraries. I thought LiteLLM sleeps on rate limit errors, but I sometimes got an error with non-OpenAI providers.
So I see a few paths forward and I am wondering if you can explore this
for next week:
Q1: First, is it the case that LiteLLM's batch completion endpoint handles
rate limit errors?
If the answer to Q1 is Yes:
- Task: Try to switch our current implementation to simply use batch
completion.
If the answer to Q1 is No:
I don't think it's worth refactoring to that, since it would have no benefit. Instead, there are some low-lift options:
1. What if we just set a high num_retries as the default in the kwargs for Ensemble? num_retries is LiteLLM using Tenacity to wait on rate limit errors.
2. If you find it still has problems, then just make a proposal for our own waiting scheme using tenacity itself (the library litellm is using to wait on requests).
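For reference, switching to the "Yes" branch would mean replacing the manual thread fan-out with litellm's batch completion, which takes one chat-message list per request. The sketch below is illustrative only: the model name and tasks are placeholders, and whether `batch_completion` retries on rate limits internally is exactly the Q1 question.

```python
# Sketch of the "Yes" branch: hand all requests to litellm.batch_completion
# instead of a ThreadPoolExecutor. Model name and tasks are placeholders.
def build_messages(tasks):
    """Build one chat-message list per task, the shape batch_completion expects."""
    return [[{"role": "user", "content": task}] for task in tasks]


if __name__ == "__main__":
    import litellm  # imported lazily so the helper above works without litellm installed

    responses = litellm.batch_completion(
        model="gpt-4o-mini",  # placeholder model
        messages=build_messages(["task for agent 1", "task for agent 2"]),
    )
```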
|
Regarding Q1:
Regarding replicating an error:
Anyway, even if this is an edge case, it's good to handle.
Regarding the solution if the answer to Q1 is no: as you suspect.
Also, it seems they have an open bug about this, so I think there may be something interesting. Updated to P0; let's discuss tomorrow what the path forward is. So there are two options:
Okay, so I made a draft implementation (be sure to test this though!!) and I think it would look something like this. See how we catch the error first, tell the user to bump up num_retries, and then re-raise it.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor, as_completed

from litellm.exceptions import RateLimitError  # note: the package name is lowercase
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


class Ensemble(AbstractStructure):
    def process(self):
        """
        Process the task through an ensemble of agents, each handling the
        task independently with retries.
        """
        original_task = self.agents[0].original_task_description
        for _ in range(self.cycles):
            with ThreadPoolExecutor() as executor:
                futures = []
                for agent in self.agents:
                    previous_responses_str = ""
                    agent.combination_instructions = self.combination_instructions
                    futures.append(
                        executor.submit(self._process_with_retry, agent, previous_responses_str)
                    )
                for future in as_completed(futures):
                    response = future.result()
                    self.responses.append(response)
        if self.moderated and self.moderator:
            moderated_response = self.moderator._moderate_responses(
                self.responses, original_task)
            self.responses.append(moderated_response)
        self.final_response = self.responses[-1]

    @staticmethod
    @retry(stop=stop_after_attempt(3),
           wait=wait_exponential(multiplier=1, min=2, max=10),
           retry=retry_if_exception_type(RateLimitError))
    def _process_with_retry(agent, previous_responses_str):
        """
        Process an agent's task with retries in case of rate limit errors.

        Args:
            agent (Agent): The agent to process the task.
            previous_responses_str (str): Previous responses to incorporate into the task.

        Returns:
            str: The response from the agent.
        """
        try:
            return agent.process(previous_responses=previous_responses_str)
        except RateLimitError:
            warnings.warn("Rate limit hit; try increasing num_retries in the model kwargs.")
            raise
```
|
Unit tests:
To do this, you want to use mocking to simulate the behavior of an error. I wrote some other unit tests that do this kind of thing (use mocking to check that errors are handled correctly). |
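A minimal sketch of that kind of test. To keep the example self-contained, the `RateLimitError` class and the plain retry loop below are local stand-ins for litellm's exception and the tenacity-decorated method; a real test would import and exercise the actual `Ensemble._process_with_retry`.

```python
import unittest
from unittest.mock import MagicMock


# Local stand-in for litellm.exceptions.RateLimitError so the sketch runs
# without litellm installed; the real test would import the actual exception.
class RateLimitError(Exception):
    pass


def process_with_retry(agent, max_attempts=3):
    """Plain-loop stand-in for the tenacity-decorated _process_with_retry:
    retry on RateLimitError, re-raise once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return agent.process()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise


class TestRateLimitHandling(unittest.TestCase):
    def test_recovers_after_one_rate_limit(self):
        agent = MagicMock()
        # First call raises a simulated 429, second call succeeds.
        agent.process.side_effect = [RateLimitError("429"), "response text"]
        self.assertEqual(process_with_retry(agent), "response text")
        self.assertEqual(agent.process.call_count, 2)

    def test_raises_after_exhausting_retries(self):
        agent = MagicMock()
        agent.process.side_effect = RateLimitError("429")
        with self.assertRaises(RateLimitError):
            process_with_retry(agent)
        self.assertEqual(agent.process.call_count, 3)
```

The key trick is `side_effect` with a list: the mock raises on the first call and returns normally on the second, so the test verifies both the retry and the recovery.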
Okay based on our meeting:
|
In the implementation I have, I am sending parallel LiteLLM completion requests through native Python libraries. But some providers have very low rate limits.
So I see a few paths forward and I am wondering if you can explore this for next week:
Q1: First, is it the case that LiteLLM's batch completion endpoint handles rate limit errors?
If the answer to Q1 is Yes:
If the answer to Q1 is No:
I don't think it's worth refactoring to that, since it would have no benefit. Instead, there are some low-lift options:
1. Experiment: What if we just set a high `num_retries` as the default in the `kwargs` for Ensemble? `num_retries` is LiteLLM using Tenacity to wait on rate limit errors.
2. Propose/implement on `development`: If you find it still has problems, then just make a proposal for our own waiting scheme using tenacity itself (the library litellm is using to wait on requests).