
Scaling/Concurrent Requests #1187

Closed · jjsarf opened this issue Nov 18, 2023 · 4 comments

Comments

@jjsarf

jjsarf commented Nov 18, 2023

Hello again. Great project. This may not be an issue, but I noticed that placing a second request while another one is still being processed makes the new request time out.
Is this by design? This is not the case when using the HuggingFace UI >0.4.
Thanks.

@SMenigat

Yes, that's the current design as far as I understand it. All requests are currently handled sequentially. That allows the API to switch out the LLM it is using per request, and allows for better planning of the resources needed to run the service. When implementing my app that uses Ollama, I implemented a worker queue that handles all requests in the background.
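
For illustration, a minimal sketch of such a background worker queue (Python with asyncio and httpx; /api/generate is Ollama's generate endpoint, while the queue and helper names are purely illustrative and not part of this project):

import asyncio
import httpx

# Single worker that serializes all calls to one Ollama instance.
OLLAMA_URL = "http://localhost:11434/api/generate"
request_queue: asyncio.Queue = asyncio.Queue()

async def worker():
    # Only this task ever talks to Ollama, so requests never overlap.
    async with httpx.AsyncClient(timeout=None) as client:
        while True:
            prompt, model, future = await request_queue.get()
            try:
                resp = await client.post(
                    OLLAMA_URL,
                    json={"model": model, "prompt": prompt, "stream": False},
                )
                future.set_result(resp.json())
            except Exception as exc:
                future.set_exception(exc)
            finally:
                request_queue.task_done()

async def submit(prompt: str, model: str = "llama2") -> dict:
    # Callers enqueue work and await the result instead of timing out.
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((prompt, model, future))
    return await future

async def main():
    asyncio.create_task(worker())
    answers = await asyncio.gather(
        submit("Why is the sky blue?"),
        submit("Write a haiku about queues."),
    )
    for a in answers:
        print(a.get("response"))

if __name__ == "__main__":
    asyncio.run(main())

Concurrent callers all get answers eventually; they just wait in the queue rather than failing with a timeout.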

@jjsarf
Author

jjsarf commented Nov 20, 2023

> Yes, that's the current design as far as I understand it. All requests are currently handled sequentially. That allows the API to switch out the LLM it is using per request, and allows for better planning of the resources needed to run the service. When implementing my app that uses Ollama, I implemented a worker queue that handles all requests in the background.

It would be great to have this mechanism exposed as a configuration parameter (as in on or off), since being able to handle only a single request at a time is a limitation.

@ishaan-jaff

Hi @SMenigat, I'm the maintainer of LiteLLM. We provide an OpenAI-compatible endpoint + request queueing with workers for Ollama, if you're interested in using it (would love your feedback on this).

Here's a quick start on using it. It's compatible with Ollama, GPT-4, and any LiteLLM-supported LLM.
docs: https://docs.litellm.ai/docs/routing#queuing-beta

Quick Start

  1. Add Redis credentials in a .env file
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
  2. Start litellm server with your model config
$ litellm --config /path/to/config.yaml --use_queue

Here's an example config for ollama/llama2

config.yaml

model_list: 
  - model_name: llama2
    litellm_params: 
      model: ollama/llama2
      api_key: 
  - model_name: code-llama
    litellm_params: 
      model: ollama/code-llama # actual model name
  3. Test (in another window) → sends 100 simultaneous requests to the queue
$ litellm --test_async --num_requests 100

Available Endpoints

  • /queue/request - Queues a /chat/completions request. Returns a job id.
  • /queue/response/{id} - Returns the status of a job. If completed, returns the response as well. Possible statuses are: queued and finished.
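
For reference, a rough sketch of calling these queue endpoints from Python. This assumes the LiteLLM server runs on http://localhost:8000 and that the response JSON uses fields named "id", "status", and "response"; those field names are assumptions, so check the linked docs for the authoritative schema.

import time
import requests

BASE_URL = "http://localhost:8000"  # assumed local LiteLLM server

# Queue a /chat/completions-style request; the server returns a job id.
job = requests.post(
    f"{BASE_URL}/queue/request",
    json={
        "model": "llama2",
        "messages": [{"role": "user", "content": "Hello from the queue"}],
    },
).json()
job_id = job["id"]  # assumed field name

# Poll the job until its status flips from "queued" to "finished".
while True:
    result = requests.get(f"{BASE_URL}/queue/response/{job_id}").json()
    if result.get("status") == "finished":  # assumed field name
        print(result.get("response"))
        break
    time.sleep(1)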

@jmorganca
Member

Merging with #358
