
Scaling/Concurrent Requests #1187

Closed · jjsarf opened this issue Nov 18, 2023 · 4 comments

Comments

@jjsarf

jjsarf commented Nov 18, 2023

Hello again. Great project. This may not be an issue, but I noticed that placing a second request while another one is still being processed makes the new request time out.
Is this by design? This is not the case when using the HuggingFace UI >0.4.
Thanks.

@SMenigat

Yes, that's the current design as far as I understand it. All requests are currently handled sequentially. That allows the API to switch out the LLM it is using per request, and allows for better planning of the resources needed to run the service. When implementing my app that uses Ollama, I implemented a worker queue that handles all requests in the background.
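
For illustration, a minimal sketch of such a background worker queue (Python with asyncio and httpx; /api/generate is Ollama's generate endpoint, while the queue and helper names are purely illustrative and not part of this project):

import asyncio
import httpx

# Single worker that serializes all calls to one Ollama instance.
OLLAMA_URL = "http://localhost:11434/api/generate"
request_queue: asyncio.Queue = asyncio.Queue()

async def worker():
    # Only this task ever talks to Ollama, so requests never overlap.
    async with httpx.AsyncClient(timeout=None) as client:
        while True:
            prompt, model, future = await request_queue.get()
            try:
                resp = await client.post(
                    OLLAMA_URL,
                    json={"model": model, "prompt": prompt, "stream": False},
                )
                future.set_result(resp.json())
            except Exception as exc:
                future.set_exception(exc)
            finally:
                request_queue.task_done()

async def submit(prompt: str, model: str = "llama2") -> dict:
    # Callers enqueue work and await the result instead of timing out.
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((prompt, model, future))
    return await future

async def main():
    asyncio.create_task(worker())
    answers = await asyncio.gather(
        submit("Why is the sky blue?"),
        submit("Write a haiku about queues."),
    )
    for a in answers:
        print(a.get("response"))

if __name__ == "__main__":
    asyncio.run(main())

Concurrent callers all get answers eventually; they just wait in the queue rather than failing with a timeout.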

@jjsarf
Author

jjsarf commented Nov 20, 2023

> Yes, that's the current design as far as I understand it. All requests are currently handled sequentially. That allows the API to switch out the LLM it is using per request, and allows for better planning of the resources needed to run the service. When implementing my app that uses Ollama, I implemented a worker queue that handles all requests in the background.

It would be great to have this mechanism exposed as a configuration parameter (as in on or off), since being able to handle only a single request at a time is a limitation.

@ishaan-jaff

Hi @SMenigat, I'm the maintainer of LiteLLM. We provide an OpenAI-compatible endpoint + request queueing with workers for Ollama, if you're interested in using it (would love your feedback on this).

Here's a quick start on using it. It's compatible with Ollama, GPT-4, and any LiteLLM-supported LLM.
docs: https://docs.litellm.ai/docs/routing#queuing-beta

Quick Start

  1. Add Redis credentials in a .env file
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
  2. Start litellm server with your model config
$ litellm --config /path/to/config.yaml --use_queue

Here's an example config for ollama/llama2

config.yaml

model_list: 
  - model_name: llama2
    litellm_params: 
      model: ollama/llama2
      api_key: 
  - model_name: code-llama
    litellm_params: 
      model: ollama/code-llama # actual model name
  3. Test (in another window) → sends 100 simultaneous requests to the queue
$ litellm --test_async --num_requests 100

Available Endpoints

  • /queue/request - Queues a /chat/completions request. Returns a job id.
  • /queue/response/{id} - Returns the status of a job. If completed, returns the response as well. Possible statuses are: queued and finished.
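
For reference, a rough sketch of calling these queue endpoints from Python. This assumes the LiteLLM server runs on http://localhost:8000 and that the response JSON uses fields named "id", "status", and "response"; those field names are assumptions, so check the linked docs for the authoritative schema.

import time
import requests

BASE_URL = "http://localhost:8000"  # assumed local LiteLLM server

# Queue a /chat/completions-style request; the server returns a job id.
job = requests.post(
    f"{BASE_URL}/queue/request",
    json={
        "model": "llama2",
        "messages": [{"role": "user", "content": "Hello from the queue"}],
    },
).json()
job_id = job["id"]  # assumed field name

# Poll the job until its status flips from "queued" to "finished".
while True:
    result = requests.get(f"{BASE_URL}/queue/response/{job_id}").json()
    if result.get("status") == "finished":  # assumed field name
        print(result.get("response"))
        break
    time.sleep(1)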

@jmorganca
Member

Merging with #358
