
Parallel requests #358

Closed
youssef02 opened this issue Aug 16, 2023 · 55 comments
Assignees
Labels
feature request New feature or request

Comments

@youssef02

The app is amazing, but if I want to build a multi-agent setup on top of one API, I need to create a queue system, since it can only reply to one request at a time. Is there a way to improve this, or do I have to implement a queue system?

I just started here, so sorry for any mistakes ;)

@jmorganca jmorganca added the feature request New feature or request label Aug 16, 2023
@jmorganca
Member

Not a mistake – Ollama will serve one generation at a time currently, but supporting 2+ concurrent requests is definitely on the roadmap

@LeHaroun

Is there a way to run multiple instances on the same machine?

> the app is amazing but the problem is If I want to create a multiagent from one api, I need to create a Queue system, as it can reply only to one request at a time, is there a way to improve this or do I have to implement a Queue system?
>
> I just started here so sorry for any mistake, ;)

I am working on a similar implementation using MAS. If you have beefy hardware, I recommend running several instances in closed environments. Save the completions or generations and process them later on.

Also check LiteLLM APIs on top of Ollama. Quite helpful.

@jmorganca jmorganca changed the title Ollama does not handle multiple request at the same time. Parallel/batch requests Oct 23, 2023
@jmorganca jmorganca changed the title Parallel/batch requests Parallel requests Dec 22, 2023
@skye0402

skye0402 commented Jan 9, 2024

Would be definitely a great addition to Ollama:

  • Concurrency of requests
  • Using GPU mem for several models

I'm running it in the cloud on a T4 with 16GB of GPU memory, and having phi-2 and codellama both in VRAM would be no issue at all. Ideally, other models would be kept in regular RAM instead of being loaded from disk.
Adding to it: Users might switch models, so the queuing approach would apply to model switches, too 😃

@ivanfioravanti

Any news on this one? Parallel requests can be a real game changer for Ollama

@ParisNeo
Contributor

If you have enough VRAM to run multiple models, you can create multiple instances of ollama with different port numbers, then use my proxy to manage access and route the requests to each:
ParisNeo's ollama_proxy_server

@ehartford

I would really like to have an M2 Ultra 192GB on my company's intranet that can service the whole R&D department (a dozen people).

As long as I have enough RAM, I wish to be able to run multiple inference requests at the same time. Thank you for considering my wish!

@ivanfioravanti

same here, multiple 7B models served by an M2 Ultra. My dream! 🙏

@Adphi

Adphi commented Jan 31, 2024

At first glance, when I started examining the source code, I thought that the problem with concurrent requests was due to the current implementation's use of global variables in its llama.cpp binding. After paying a little more attention to the original llama.cpp source code, I realized that the original implementation wasn't really geared towards multithreaded or server-side use, but rather towards a local development/experimentation use case.
So apart from setting up an API using queued workers (managed with gRPC, for example) based on a fork/exec model to work around the lack of batch processing on the llama.cpp side, which doesn't seem to get much attention from the developers (reference missing, but found in the GitHub project's conversations), and which LocalAI does in a way, I don't see exactly how it could be implemented here.

@ParisNeo
Contributor

That's why I had to build a proxy. You can install multiple servers on a single machine or across multiple machines, then use my proxy to service multiple users with multiple queues. For example, if you have 2 servers, you can service at most 2 clients simultaneously, and when both are busy, the current message is queued on the server with the least full queue. In practice, since the generation is very fast, you don't get too many collisions. We rarely get more than 1 person in the queue.

I also added security features as well as logging features. It works fine and serves all of us perfectly.

Also, someone built a Docker version and I accepted their PR:
https://github.com/ParisNeo/ollama_proxy_server

If you are interested, you can try it; it is open source, so you can also read the code, get inspired, help enhance it, etc. It is Apache 2.0, so you can do whatever you want with it, 100% free.

We get around 300 tokens/s for each user so it is not causing any significant delay.

@ehartford

llama.cpp can be run as multiple processes, with multiple threads, or as a server.
But it's totally fair for the feature request to be deprioritized, of course.

@farhanhubble

farhanhubble commented Feb 1, 2024

The easiest way to multiplex Ollama, at least on Linux systems, should be with a reverse-proxy load balancer like HAProxy. Launch multiple instances of ollama serve on different ports and map them to a single port using HAProxy.

Note that this approach can sometimes degrade performance due to CPU contention. I have a decent system with 64 cores and 24GB of GPU RAM. When I run 3 instances of Ollama behind HAProxy to generate embeddings, it does speed up the process; however, if I try to generate text, the processing time is much worse than with a single instance.
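For illustration, here is a minimal sketch of that setup, assuming three local instances on ports 11435-11437 and HAProxy exposing them on the default port 11434 (the ports and the leastconn policy are assumptions, not from the comment above):

 # Start several independent Ollama instances, each bound to its own port
 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
 OLLAMA_HOST=127.0.0.1:11436 ollama serve &
 OLLAMA_HOST=127.0.0.1:11437 ollama serve &

 # haproxy.cfg: expose the instances as a single endpoint on port 11434
 defaults
     mode http
     timeout connect 5s
     timeout client  300s
     timeout server  300s

 frontend ollama_frontend
     bind *:11434
     default_backend ollama_pool

 backend ollama_pool
     balance leastconn
     server ollama1 127.0.0.1:11435 check
     server ollama2 127.0.0.1:11436 check
     server ollama3 127.0.0.1:11437 check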

@trymeouteh

Would like to see the ability to use the same LLM in two or more apps at the same time, and also the ability to use multiple LLMs in two or more apps at the same time.

@ParisNeo
Contributor

ParisNeo commented Mar 1, 2024

Well, technically you can run multiple instances of the same model by running multiple instances of ollama with different port numbers; configure them in the proxy config file and they can then be accessed by multiple clients at once.

@ehartford

That's just wrong 😂

Adphi added a commit to Adphi/ollama that referenced this issue Mar 30, 2024
fix ollama#358

Signed-off-by: Adphi <philippe.adrien.nousse@gmail.com>
@easp
Contributor

easp commented Mar 30, 2024

Looks relevant: #3418

@darwinvelez58

any update?

@0x77dev

0x77dev commented Apr 9, 2024

@darwinvelez58 there is some work being done in #3418 and Adphi@a9195b3

I was testing both options; at the moment both seem very unstable.

@ParisNeo
Contributor

ParisNeo commented Apr 9, 2024

In the meantime you can try my ollama proxy :) It can be configured to fire up multiple ollama services; you can also define users with keys to access the service, and it balances the load between the ollama instances with multiple queues.

> @ParisNeo Do you have a link for me please? Also, how does that work from an architecture point of view? I'm guessing sticky sessions, so people using this won't get mixed up and end up with an English cooking recipe containing C++ snippets and German dictionaries, right?

Hi, sorry, I have been very busy lately and had no time to look at the GitHub messages.
The proxy server can be found here:
https://github.com/ParisNeo/ollama_proxy_server

It also has optional security that lets you maintain a user base with a personal key for each user.
The architecture is simple:
You specify multiple ollama instances, visible only on the server side, each with a different port number; the proxy then manages the authentication of users, the logging of their access, and the distribution of users over multiple queues.
Each queue sits in front of one instance of ollama, and the proxy will always put you in the least filled queue.
So if you run 4 ollama instances, you can simultaneously serve 4 users, and the others will be queued.

I agree that you could do more if ollama managed all of this internally and shared the weights and so on. I built my proxy because I needed the security and the management, and back then ollama had no plans to do this, so I had to move on; but I'll be happy if they integrate multi-user support directly.

@TheMasterFX

vLLM uses PagedAttention. Is this something that would have to be integrated in ollama, or on the llama.cpp side?

@0x77dev

0x77dev commented Apr 10, 2024

@TheMasterFX this is on the llama.cpp side of things

@TheMasterFX

Seems like there are improvements coming:
https://twitter.com/AlexReibman/status/1778695203957977148

@guitmonk-1290

guitmonk-1290 commented Apr 21, 2024

I was reading Ray Serve's docs and it seems they support Dynamic Request Batching, which creates a queue system.
https://docs.ray.io/en/latest/serve/advanced-guides/dyn-req-batch.html

I am new to this, so can we use this with Ollama for batch inference?

@jpmcb

jpmcb commented Apr 24, 2024

I've had some success load balancing multiple Ollama instances on Kubernetes with a pool of GPU nodes:

❯ kubectl get nodes -A
NAME                     STATUS   ROLES    AGE   VERSION
defaultpool-88943984-0   Ready    <none>   5d    v1.29.2
defaultpool-88943984-1   Ready    <none>   5d    v1.29.2
gpupool-42074538-0       Ready    <none>   41h   v1.29.2
gpupool-42074538-1       Ready    <none>   41h   v1.29.2
gpupool-42074538-2       Ready    <none>   41h   v1.29.2
gpupool-42074538-3       Ready    <none>   41h   v1.29.2
gpupool-42074538-4       Ready    <none>   41h   v1.29.2

Notice the 5 "gpupool" nodes, each with an NVIDIA T4 GPU. Ollama is then deployed as a DaemonSet on each of the GPU nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ollama-daemonset
  namespace: ollama-ns
spec:
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - image: ollama/ollama:latest
        imagePullPolicy: Always
        lifecycle:
          postStart:
            exec:
              command:
              - ollama
              - pull
              - llama3
        name: ollama
        ports:
        - containerPort: 11434
          protocol: TCP

      # A few special selectors / tolerations to schedule on the GPU nodes
      nodeSelector:
        accelerator: nvidia
      tolerations:
      - effect: NoSchedule
        key: sku
        operator: Equal
        value: gpu

These then can be load balanced behind a Kubernetes service with the ClusterIP being exposed internally "as a service" to the rest of the cluster:

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama-ns
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 11434
  selector:
    app: ollama
  sessionAffinity: None
  type: ClusterIP

Inside the cluster, I can hit the load balancing service resolved by CoreDNS:

$ curl ollama-service.ollama-ns.svc.cluster.local
Ollama is running

This works pretty well for slower requests but has 2 problems:

  • The scaling of this is directly tied to the number of nodes (and in this case, expensive GPU nodes) that have daemonsets on them
  • Kubernetes load balancing isn't perfect and could very easily send a request to a "busy" daemonset pod that already is servicing a request.
    • I believe when this happens, the whole pod seems to stall and doesn't print any additional logs; it just seems to hang. Might be worth some more experimentation to see if there's a smoking gun when multiple requests come through

It'd be absolutely incredible if multiple requests could be serviced by a single ollama serve: It would take this example daemonset deployment with a pool of GPUs and make it much more scalable.

Thanks for all the amazing work on ollama - I'm happy to test anything out and report back if there's an ollama/ollama:concurrency test image or if code reviews are needed ❤️

@skye0402

I'm with vLLM now for that use case; it's very fast, handles concurrency, and has a big model pool plus an OpenAI-compatible API.
I'm still with Ollama for trying out new models; that's where it really shines and saves time.

@shing100

I really wish this feature was added.

@iakashpaul

iakashpaul commented Apr 29, 2024

For anyone else stumbling onto this thread-

To use the web UI with parallel chat generation for multiple clients, use TGI or server.cpp (or duplicate my HF Space onto a GPU instance directly) with the parallel flag set to 2 or more depending on your VRAM, then run the Open WebUI container with server.cpp acting as an OpenAI replacement, as shown below.

Server.cpp

 ./server -m ./models/mistral-7b-instruct-v0.2.Q8_0.gguf -ngl 30 --host 127.0.0.1 --port 8080 --parallel 2

Open WebUI container

 docker run  -p 3000:8080   -v open-webui:/app/backend/data   -e OPENAI_API_BASE_URLS="http://127.0.0.1:8080/v1"   -e OPENAI_API_KEYS="randomplaceholder"   --restart always   ghcr.io/open-webui/open-webui:main

cc: @ehartford

The latest release has experimental flags:

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve

@mili-tan

https://github.com/ollama/ollama/releases/tag/v0.1.33-rc5

@hanzlaramey

game changer!!! Thank you @mili-tan

@ehartford

Thank you!

@ehartford

If I don't want to limit the number of loaded models, I just don't set that variable?

@BBjie

BBjie commented Apr 30, 2024

My question is about the concurrency feature: can I edit my compose file and add them there?

ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"

    
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    command: serve
volumes:
  ollama:

Or is there another way to pass the values in for
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve?

@jammsen

jammsen commented Apr 30, 2024

> My question is for the concurrency feature, can I edit and add them into my compose-up file. […] or is there other way to pass the value in for OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve

I think those are 2 different things; the env vars should be passed to ollama serve when you run the native app on your system. Not sure though how you can use them in a Docker image of ollama. @jmorganca could you clear this up for us please?
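For reference, one way this can be done with plain docker run is to pass the variables with -e flags (a sketch only; the image tag, volume name, and GPU flag here are assumptions):

 docker run -d --gpus=all \
   -e OLLAMA_NUM_PARALLEL=4 \
   -e OLLAMA_MAX_LOADED_MODELS=4 \
   -v ollama:/root/.ollama \
   -p 11434:11434 \
   --name ollama \
   ollama/ollama:0.1.33-rc5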

@BBjie

BBjie commented Apr 30, 2024

> Have not tested but should work:
>
> ollama:
>     image: ollama/ollama:latest
>     container_name: ollama
>     environment:
>       - OLLAMA_NUM_PARALLEL=4
>       - OLLAMA_MAX_LOADED_MODELS=4
> ...

I tested it one hour ago, it was not working...

@laktosterror

laktosterror commented Apr 30, 2024

Sorry, you might need to get the pre-release image. I corrected my previous answer.

@laktosterror

laktosterror commented Apr 30, 2024

On mobile. I seem to have messed up my original post.

Try this:

ollama:
    image: ollama/ollama:0.1.33-rc5
    container_name: ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=4
...

@BBjie

BBjie commented Apr 30, 2024

> On mobile. I seem to have messed up my original post.
>
> Try this:
>
> ollama:
>     image: ollama/ollama:0.1.33-rc5
>     container_name: ollama
>     environment:
>       - OLLAMA_NUM_PARALLEL=4
>       - OLLAMA_MAX_LOADED_MODELS=4
> ...
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":46394,"status":200,"tid":"140623411642368","timestamp":1714493452}
ollama  | {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":4,"n_processing_slots":0,"task_id":399,"tid":"140624811528192","timestamp":1714493452}
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":46410,"status":200,"tid":"140623403249664","timestamp":1714493452}
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":46410,"status":200,"tid":"140623403249664","timestamp":1714493452}
ollama  | {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":4,"n_processing_slots":0,"task_id":400,"tid":"140624811528192","timestamp":1714493452}
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":46410,"status":200,"tid":"140623403249664","timestamp":1ollama  | {"function":"print_timings","level":"INFO","line":269,"msg":"prompt eval time     =     991.40 ms /   713 tokens (    1.39 ms per token,   719.19 tokens per second)","n_prompt_tokens_processed":713,"n_tokens_second":719.1864417721239,"slot_id":1,"t_prompt_processing":991.398,"t_token":1.3904600280504908,"task_id":402,"tid":"140624811528192","timestamp":1714493454}
ollama  | {"function":"print_timings","level":"INFO","line":283,"msg":"generation eval time =    1251.42 ms /    56 runs   (   22.35 ms per token,    44.75 tokens per second)","n_decoded":56,"n_tokens_second":44.7492364661528,"slot_id":1,"t_token":22.346749999999997,"t_token_generation":1251.418,"task_id":402,"tid":"140624811528192","timestamp":1714493454}
ollama  | {"function":"print_timings","level":"INFO","line":293,"msg":"          total time =    2242.82 ms","slot_id":1,"t_prompt_processing":991.398,"t_token_generation":1251.418,"t_total":2242.816,"task_id":402,"tid":"140624811528192","timestamp":1714493454}
ollama  | {"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":832,"n_ctx":3904,"n_past":831,"n_system_tokens":0,"slot_id":1,"task_id":402,"tid":"140624811528192","timestamp":1714493454,"truncated":false}
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":46416,"status":200,"tid":"140623966629888","timestamp":1714493454}
ollama  | [GIN] 2024/04/30 - 16:10:54 | 200 |  2.546651478s |      172.19.0.4 | POST     "/api/chat"

Thanks, but I believe it's still not working as expected...

@dhiltgen dhiltgen self-assigned this May 2, 2024
@dhiltgen
Collaborator

dhiltgen commented May 2, 2024

I'm going to close this as fixed now in 0.1.33. As commenters above have pointed out, it's opt-in for now, but we do intend to eventually do concurrency automatically without requiring the env vars.

To clarify some questions above: There are 2 layers of concurrency. There's OLLAMA_NUM_PARALLEL which controls how many requests can be answered against a single loaded model, and there's OLLAMA_MAX_LOADED_MODELS which controls how many models can be loaded at the same time, up to the limits of VRAM. (note: we do not support loading two or more copies of the same model)

https://github.com/ollama/ollama/releases
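
As an illustration of the first layer, once OLLAMA_NUM_PARALLEL is set, multiple requests can be issued against the same loaded model at once (a minimal sketch; the model name and prompts are just placeholders):

 curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?"}' &
 curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Write a haiku about GPUs."}' &
 wait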

@dhiltgen dhiltgen closed this as completed May 2, 2024
@taozhiyuai

taozhiyuai commented May 3, 2024

> I'm going to close this as fixed now in 0.1.33. As commenters above have pointed out, it's opt-in for now, but we do intend to eventually do concurrency automatically without requiring the env vars.
>
> To clarify some questions above: There are 2 layers of concurrency. There's OLLAMA_NUM_PARALLEL which controls how many requests can be answered against a single loaded model, and there's OLLAMA_MAX_LOADED_MODELS which controls how many models can be loaded at the same time, up to the limits of VRAM. (note: we do not support loading two or more copies of the same model)
>
> https://github.com/ollama/ollama/releases

@dhiltgen
I installed ollama on a Mac.
Are these the default values for these two env vars: OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4?

The following should work, right? :)
t@603e5f4a42f1 ~ % launchctl setenv OLLAMA_NUM_PARALLEL 5
t@603e5f4a42f1 ~ % launchctl setenv OLLAMA_MAX_LOADED_MODELS 5

To check the values of the env vars :)
launchctl getenv OLLAMA_NUM_PARALLEL
launchctl getenv OLLAMA_MAX_LOADED_MODELS

Restart ollama to activate the new values of the env vars. :) Very nice!

@dhiltgen
Collaborator

dhiltgen commented May 4, 2024

The default values for these settings in 0.1.33 retain the existing behavior of only 1 request at a time and only 1 model loaded at a time. In a future version we plan to adjust the defaults to enable this automatically, but until then, yes, if you want to use concurrency you'll have to set these environment variables on the server.
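
On Linux installs that run Ollama as a systemd service, one common way to set these variables on the server (a sketch, assuming the service is named ollama.service) is via a systemd override:

 sudo systemctl edit ollama.service
 # add the following under the [Service] section:
 #   Environment="OLLAMA_NUM_PARALLEL=4"
 #   Environment="OLLAMA_MAX_LOADED_MODELS=4"
 sudo systemctl daemon-reload
 sudo systemctl restart ollama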
