Improve parallel requests handling #30

Closed · josh-ashkinaze opened this issue Jul 10, 2024 · 4 comments

josh-ashkinaze (Owner) commented Jul 10, 2024

In the implementation I have, I am sending parallel LiteLLM completion requests through native Python libraries. But some providers have super low rate limits.

So I see a few paths forward and I am wondering if you can explore this for next week:

Q1: First, is it the case that LiteLLM's batch completion endpoint handles rate limit errors?

If the answer to Q1 is Yes:

  • Task: Try to switch our current implementation to simply use batch completion (rough sketch below).
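
If useful, here's a minimal sketch of what that switch might look like (this assumes LiteLLM's documented batch_completion interface; the model name is just an example), which also gives us something concrete to hammer for Q1:

# Rough sketch only: assumes LiteLLM's documented batch_completion interface.
import litellm

# One message list per request; batch_completion fans these out for us.
messages = [[{"role": "user", "content": "say hello"}] for _ in range(200)]

responses = litellm.batch_completion(
    model="claude-3-haiku-20240307",  # example model; any provider works
    messages=messages,
    max_tokens=2,
)
print([r.choices[0].message.content for r in responses])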

If the answer to Q1 is No:
I don't think it's worth refactoring to that, since it would have no benefit. Instead, there are some lower-lift options:

Experiment: What if we just set a high num_retries as the default in Ensemble's kwargs? num_retries makes LiteLLM use Tenacity to wait and retry on rate limit errors (see the sketch below).

Propose/implement on development: If that still has problems, then make a proposal for our own waiting scheme using Tenacity directly (the library LiteLLM uses to wait between retries).
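
For the first experiment, the knob at the raw LiteLLM level looks like this (minimal sketch; how we default it inside Ensemble's kwargs is the open design question):

import litellm

# num_retries makes LiteLLM itself (via Tenacity) back off and retry on rate limit errors.
response = litellm.completion(
    model="claude-3-haiku-20240307",  # example model
    messages=[{"role": "user", "content": "say hello"}],
    max_tokens=2,
    num_retries=10,  # the high default we'd set for Ensemble
)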

josh-ashkinaze changed the title from "For ensembles we should switch to using LiteLLM's native batch completion" to "Improve parallel requests handling" on Jul 10, 2024
josh-ashkinaze added the "enhancement (New feature or request)" label on Jul 10, 2024
narenedara (Collaborator) commented Jul 10, 2024 via email

josh-ashkinaze (Owner, Author) commented Jul 10, 2024

Regarding Q1:
Are you looking at this code? I think you may be right. It seems like for default completions max_retries is equal to zero, and I don't see batch_completion changing that value (though it's a lot of code, so I'm not sure):

https://github.com/BerriAI/litellm/blob/e201b06d9c2ba4580ca9b1a51f35e12b08588ec6/litellm/main.py#L2747

Regarding replicating an error:
No, use the Ensemble method because that will launch them all at once. I weirdly was not able to replicate this error with OpenAI even with 200 agents, so I don't know if this is because (1) my account has some kind of high rate limit tier, or (2) I experienced a blip of some sort with Anthropic. Try with Anthropic though. Make the prompt "say hello" and be sure to pass kwargs = {'max_tokens': 2}.
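
Something like this is what I mean (rough sketch; the import paths and constructor arguments are guesses, so adjust to the actual Agent/Ensemble signatures in this repo):

# Hypothetical repro: import paths and constructor signatures are guesses.
from plurals.agent import Agent
from plurals.deliberation import Ensemble

agents = [
    Agent(model="claude-3-haiku-20240307", task="say hello", kwargs={"max_tokens": 2})
    for _ in range(200)
]
ensemble = Ensemble(agents=agents)
ensemble.process()  # 200 near-simultaneous Anthropic calls should trip the rate limit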

Anyway even if this is an edge case it's good to handle.

Regarding the solution if the answer to Q1 is no, as you suspect:
Can you look into this code from LiteLLM? You may have to mess around with the RetryPolicy to get a good configuration.
https://litellm.vercel.app/docs/routing
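
From skimming that page, tuning it would look roughly like this (field names are taken from LiteLLM's routing docs; treat this as unverified until you run it):

# Sketch based on LiteLLM's routing docs; verify the exact field names against the page above.
from litellm import Router
from litellm.router import RetryPolicy

retry_policy = RetryPolicy(
    RateLimitErrorRetries=5,       # retry aggressively on rate limits
    AuthenticationErrorRetries=0,  # fail fast on auth problems
)

router = Router(
    model_list=[
        {"model_name": "claude-haiku", "litellm_params": {"model": "claude-3-haiku-20240307"}},
    ],
    retry_policy=retry_policy,
)

response = router.completion(
    model="claude-haiku",
    messages=[{"role": "user", "content": "say hello"}],
    max_tokens=2,
)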

Also, it seems they have an open bug about this, so there may be something interesting there:
BerriAI/litellm#527

Updated to P0; let's discuss tomorrow what the path forward is
I am going to upgrade this to a p0 since I think this is pretty important and the path forward seems clear. Reading LiteLLM's thread, we may as well just handle it ourselves.

So there are two options:

  1. We can refactor agent.process so it can take in a kwargs argument. Up until now, we have assumed an Agent's kwargs wouldn't depend on the Structure (which is a logical assumption IMO), so I do not want to do this.

  2. Less disruptive: we just change ensemble.process so that instead of calling agent.process in parallel directly, we create a wrapper function process_with_retry that calls agent.process but uses Tenacity to wait on RateLimitError. We can even look at LiteLLM's defaults... when you set num_retries=4 in LiteLLM I know it uses Tenacity, but what are the exact parameters? Let's just stick with LiteLLM's defaults.

Okay, so I made a draft implementation (be sure to test this though!!) and I think it would look something like this. See how we catch the error first, tell the user to increase num_retries, and then re-raise it.

import warnings
from concurrent.futures import ThreadPoolExecutor, as_completed
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from litellm.exceptions import RateLimitError

class Ensemble(AbstractStructure):
    def process(self):
        """
        Process the task through an ensemble of agents, each handling the task independently with retries.
        """
        original_task = self.agents[0].original_task_description
        
        for _ in range(self.cycles):
            with ThreadPoolExecutor() as executor:
                futures = []
                for agent in self.agents:
                    previous_responses_str = ""
                    agent.combination_instructions = self.combination_instructions
                    futures.append(executor.submit(self._process_with_retry, agent, previous_responses_str))
                for future in as_completed(futures):
                    response = future.result()
                    self.responses.append(response)

        if self.moderated and self.moderator:
            moderated_response = self.moderator._moderate_responses(
                self.responses, original_task)
            self.responses.append(moderated_response)
        self.final_response = self.responses[-1]

    @staticmethod
    # Retry up to 3 times with exponential backoff on rate limit errors; reraise=True surfaces
    # the original RateLimitError (rather than a tenacity.RetryError) if all attempts fail.
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10),
           retry=retry_if_exception_type(RateLimitError), reraise=True)
    def _process_with_retry(agent, previous_responses_str):
        """
        Process an agent's task with retries in case of rate limit errors.

        Args:
            agent (Agent): The agent to process the task.
            previous_responses_str (str): Previous responses to incorporate into the task.

        Returns:
            str: The response from the agent.
        """
        try:
            return agent.process(previous_responses=previous_responses_str)
        except RateLimitError:
            # Nudge the user toward the simpler fix, then let Tenacity retry / re-raise.
            warnings.warn("Hit a rate limit; try increasing num_retries in the model's kwargs")
            raise

josh-ashkinaze added the "p0 (very important!)" label on Jul 10, 2024
josh-ashkinaze (Owner, Author) commented Jul 11, 2024

Unit tests:

  • Assume that max_retries is N=20

  • Just for testing make the retry time really small

  • Test 1: If LiteLLM throws a RateLimitError and try < 20, then (ASSERTION 1) we catch it and (ASSERTION 2) wait and (ASSERTION 3) try again

  • Test 2: If tries n=1...N do not work, then we (ASSERTION 1) raise an error to user that says to increase max_retries

To do this, you want to use mocking to simulate the error. I wrote some other unit tests that do this kind of thing (use mocking to check that errors are handled correctly). A rough sketch is below.
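
Rough sketch of the mocking approach (the Ensemble import path and the RateLimitError constructor arguments are assumptions; shrinking the wait via the tenacity retry controller is from memory, so double-check it):

import unittest
from unittest.mock import MagicMock

from litellm.exceptions import RateLimitError
from plurals.deliberation import Ensemble  # hypothetical import path


class TestProcessWithRetry(unittest.TestCase):
    def setUp(self):
        # Per the "make the retry time really small" note: tenacity exposes the retry
        # controller on the wrapped function, so we can skip the real sleeps in tests.
        Ensemble._process_with_retry.retry.sleep = lambda _seconds: None

    def test_rate_limit_is_caught_waited_and_retried(self):
        # Test 1: first attempt raises RateLimitError, second succeeds.
        agent = MagicMock()
        agent.process.side_effect = [
            RateLimitError("rate limited", llm_provider="anthropic", model="claude-3-haiku"),
            "hello",
        ]
        result = Ensemble._process_with_retry(agent, "")
        self.assertEqual(result, "hello")              # ASSERTION: we eventually get the response
        self.assertEqual(agent.process.call_count, 2)  # ASSERTION: it was retried after waiting

    def test_exhausted_retries_raise_to_user(self):
        # Test 2: every attempt raises, so the error (and the num_retries hint) reaches the user.
        agent = MagicMock()
        agent.process.side_effect = RateLimitError(
            "rate limited", llm_provider="anthropic", model="claude-3-haiku"
        )
        with self.assertRaises(RateLimitError):
            Ensemble._process_with_retry(agent, "")


if __name__ == "__main__":
    unittest.main()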

josh-ashkinaze (Owner, Author) commented Jul 11, 2024

Okay based on our meeting:

  • Do parallel requests outside of our system with num_retries set high (see the sketch below)
  • Post an issue in the LiteLLM repository, since it seems their num_retries is behaving unexpectedly
  • If LiteLLM has no fix for this, let's just write our own Tenacity retry code inspired by their defaults, to be discussed at the meeting
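
For the first bullet, doing the fan-out outside our system would look roughly like this (minimal sketch; the model name is just an example):

from concurrent.futures import ThreadPoolExecutor

import litellm


def ask(prompt: str) -> str:
    # num_retries tells LiteLLM itself to back off and retry on rate limit errors.
    response = litellm.completion(
        model="claude-3-haiku-20240307",  # example model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2,
        num_retries=10,
    )
    return response.choices[0].message.content


with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(ask, ["say hello"] * 200))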
