
Parallel requests #358

Closed
youssef02 opened this issue Aug 16, 2023 · 55 comments
Assignees
Labels
feature request New feature or request

Comments

@youssef02

The app is amazing, but if I want to build a multi-agent setup on top of one API, I need to create a queue system, since it can only reply to one request at a time. Is there a way to improve this, or do I have to implement a queue system?

I just started here, so sorry for any mistakes ;)

@jmorganca jmorganca added the feature request New feature or request label Aug 16, 2023
@jmorganca
Member

Not a mistake – Ollama will serve one generation at a time currently, but supporting 2+ concurrent requests is definitely on the roadmap

@LeHaroun

Is there a way to run multiple instances on the same machine?

> the app is amazing but the problem is If I want to create a multiagent from one api, I need to create a Queue system, as it can reply only to one request at a time, is there a way to improve this or do I have to implement a Queue system?
>
> I just started here so sorry for any mistake, ;)

I am working on a similar implementation using MAS. If you have beefy hardware, I recommend running several instances in closed environments. Save the completions or generations and process them later on.

Also check LiteLLM APIs on top of Ollama. Quite helpful.

@jmorganca jmorganca changed the title Ollama does not handle multiple request at the same time. Parallel/batch requests Oct 23, 2023
@jmorganca jmorganca changed the title Parallel/batch requests Parallel requests Dec 22, 2023
@skye0402

skye0402 commented Jan 9, 2024

Would be definitely a great addition to Ollama:

  • Concurrency of requests
  • Using GPU mem for several models

I'm running it in the cloud on a T4 with 16GB of GPU memory, and having phi-2 and codellama both in VRAM would be no issue at all. Ideally, other models would be kept in regular RAM instead of being loaded from disk.
Adding to it: Users might switch models, so the queuing approach would apply to model switches, too 😃

@ivanfioravanti

Any news on this one? Parallel requests can be a real game changer for Ollama

@ParisNeo
Contributor

If you have enough VRAM to run multiple models, you can create multiple instances of ollama with different port numbers, then use my proxy to manage access and route the requests to each:
ParisNeo's ollama_proxy_server

@ehartford

I would really like to have an M2 Ultra 192GB on my company's intranet that can service the whole R&D department (a dozen people).

As long as I have enough RAM, I wish to be able to run multiple inference requests at the same time. Thank you for considering my wish!

@ivanfioravanti

same here, multiple 7B models served by an M2 Ultra. My dream! 🙏

@Adphi

Adphi commented Jan 31, 2024

At first glance, when I started examining the source code, I thought that the problem with concurrent requests was due to the current implementation's use of global variables in its llama.cpp binding. After paying a little more attention to the original llama.cpp source code, I realized that the original implementation wasn't really geared towards multithreaded or server-side use, but rather towards a local development/experimentation use case.
So apart from setting up an API using queued workers (managed with gRPC, for example) based on a fork/exec model to work around the lack of batch processing on the llama.cpp side, which doesn't seem to get much attention from the developers (reference missing, but found in the GitHub project's conversations), and which LocalAI does in a way, I don't see exactly how it could be implemented here.

@ParisNeo
Contributor

That's why I had to build a proxy. You can install multiple servers on a single machine or across multiple machines, then use my proxy to service multiple users with multiple queues. For example, if you have 2 servers, you can service at most 2 clients simultaneously, and when both are busy, the current message is queued on the server with the least full queue. In practice, since the generation is very fast, you don't get too many collisions. We rarely get more than 1 person in the queue.

I also added security features as well as logging features. It works fine and serves all of us perfectly.

Also, someone built a Docker version and I accepted their PR:
https://github.com/ParisNeo/ollama_proxy_server

If you are interested, you can try it; it is open source, so you can also read the code, get inspired, help enhance it, etc. It is Apache 2.0, so you can do whatever you want with it, 100% free.

We get around 300 tokens/s for each user so it is not causing any significant delay.

@ehartford

llama.cpp can be run as multiple processes, with multiple threads, or as a server.
But it's totally fair for the feature request to be deprioritized, of course.

@farhanhubble

farhanhubble commented Feb 1, 2024

The easiest way to multiplex Ollama, at least on Linux systems, should be with a reverse-proxy load balancer like HAProxy. Launch multiple instances of ollama serve on different ports and map them to a single port using HAProxy.

Note that this approach can sometimes degrade performance due to CPU contention. I have a decent system with 64 cores and 24GB of GPU RAM. When I run 3 instances of Ollama behind HAProxy to generate embeddings, it does speed up the process; however, if I try to generate text, the processing time is much worse than with a single instance.
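For illustration, here is a minimal sketch of that setup, assuming three local instances on ports 11435-11437 and HAProxy exposing them on the default port 11434 (the ports and the leastconn policy are assumptions, not from the comment above):

 # Start several independent Ollama instances, each bound to its own port
 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
 OLLAMA_HOST=127.0.0.1:11436 ollama serve &
 OLLAMA_HOST=127.0.0.1:11437 ollama serve &

 # haproxy.cfg: expose the instances as a single endpoint on port 11434
 defaults
     mode http
     timeout connect 5s
     timeout client  300s
     timeout server  300s

 frontend ollama_frontend
     bind *:11434
     default_backend ollama_pool

 backend ollama_pool
     balance leastconn
     server ollama1 127.0.0.1:11435 check
     server ollama2 127.0.0.1:11436 check
     server ollama3 127.0.0.1:11437 check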

@trymeouteh

Would like to see the ability to use the same LLM in two or more apps at the same time, and also the ability to use multiple LLMs in two or more apps at the same time.

@ParisNeo
Contributor

ParisNeo commented Mar 1, 2024

Well, technically you can run multiple instances of the same model by running multiple instances of ollama with different port numbers; configure them in the proxy config file and they can then be accessed by multiple clients at once.

@ehartford

That's just wrong 😂

Adphi added a commit to Adphi/ollama that referenced this issue Mar 30, 2024
fix ollama#358

Signed-off-by: Adphi <philippe.adrien.nousse@gmail.com>
@easp
Contributor

easp commented Mar 30, 2024

Looks relevant: #3418

@darwinvelez58

any update?

@0x77dev

0x77dev commented Apr 9, 2024

@darwinvelez58 there is some work being done in #3418 and Adphi@a9195b3

I was testing both options; at the moment both seem very unstable.

@ParisNeo
Contributor

ParisNeo commented Apr 9, 2024

In the meantime you can try my ollama proxy :) It can be configured to fire up multiple ollama services; you can also define users with keys to access the service, and it balances the load between the ollama instances with multiple queues.

> @ParisNeo Do you have a link for me please? Also, how does that work from an architecture point of view? I'm guessing sticky sessions, so people using this won't get mixed up and end up with an English cooking recipe containing C++ snippets and German dictionaries, right?

Hi, sorry, I have been very busy lately and had no time to look at the GitHub messages.
The proxy server can be found here:
https://github.com/ParisNeo/ollama_proxy_server

It also has optional security that lets you maintain a user base with a personal key for each user.
The architecture is simple:
You specify multiple ollama instances, visible only on the server side, each with a different port number; the proxy then manages the authentication of users, the logging of their access, and the distribution of users over multiple queues.
Each queue sits in front of one instance of ollama, and the proxy will always put you in the least filled queue.
So if you run 4 ollama instances, you can simultaneously serve 4 users, and the others will be queued.

I agree that you could do more if ollama managed all of this internally and shared the weights and so on. I built my proxy because I needed the security and the management, and back then ollama had no plans to do this, so I had to move on; but I'll be happy if they integrate multi-user support directly.

@TheMasterFX

vLLM uses PagedAttention. Is this something that would have to be integrated in ollama, or on the llama.cpp side?

@0x77dev

0x77dev commented Apr 10, 2024

@TheMasterFX this is on the llama.cpp side of things

@TheMasterFX

Seems like there are improvements coming:
https://twitter.com/AlexReibman/status/1778695203957977148

@guitmonk-1290

guitmonk-1290 commented Apr 21, 2024

I was reading Ray Serve's docs and it seems they support Dynamic Request Batching, which creates a queue system.
https://docs.ray.io/en/latest/serve/advanced-guides/dyn-req-batch.html

I am new to this, so can we use this with Ollama for batch inference?

@jpmcb

jpmcb commented Apr 24, 2024

I've had some success load balancing multiple Ollama instances on Kubernetes with a pool of GPU nodes:

❯ kubectl get nodes -A
NAME                     STATUS   ROLES    AGE   VERSION
defaultpool-88943984-0   Ready    <none>   5d    v1.29.2
defaultpool-88943984-1   Ready    <none>   5d    v1.29.2
gpupool-42074538-0       Ready    <none>   41h   v1.29.2
gpupool-42074538-1       Ready    <none>   41h   v1.29.2
gpupool-42074538-2       Ready    <none>   41h   v1.29.2
gpupool-42074538-3       Ready    <none>   41h   v1.29.2
gpupool-42074538-4       Ready    <none>   41h   v1.29.2

Notice the 5 "gpupool" nodes, each with an NVIDIA T4 GPU. Ollama is then deployed as a DaemonSet on each of the GPU nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ollama-daemonset
  namespace: ollama-ns
spec:
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - image: ollama/ollama:latest
        imagePullPolicy: Always
        lifecycle:
          postStart:
            exec:
              command:
              - ollama
              - pull
              - llama3
        name: ollama
        ports:
        - containerPort: 11434
          protocol: TCP

      # A few special selectors / tolerations to schedule on the GPU nodes
      nodeSelector:
        accelerator: nvidia
      tolerations:
      - effect: NoSchedule
        key: sku
        operator: Equal
        value: gpu

These then can be load balanced behind a Kubernetes service with the ClusterIP being exposed internally "as a service" to the rest of the cluster:

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama-ns
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 11434
  selector:
    app: ollama
  sessionAffinity: None
  type: ClusterIP

Inside the cluster, I can hit the load balancing service resolved by CoreDNS:

$ curl ollama-service.ollama-ns.svc.cluster.local
Ollama is running

This works pretty well for slower requests but has 2 problems:

  • The scaling of this is directly tied to the number of nodes (and in this case, expensive GPU nodes) that have daemonsets on them
  • Kubernetes load balancing isn't perfect and could very easily send a request to a "busy" daemonset pod that already is servicing a request.
    • I believe when this happens, the whole pod seems to stall and doesn't print any additional logs; it just seems to hang. Might be worth some more experimentation to see if there's a smoking gun when multiple requests come through

It'd be absolutely incredible if multiple requests could be serviced by a single ollama serve: It would take this example daemonset deployment with a pool of GPUs and make it much more scalable.

Thanks for all the amazing work on ollama - I'm happy to test anything out and report back if there's an ollama/ollama:concurrency test image or if code reviews are needed ❤️

@skye0402

I'm with vLLM now for that use case; it's very fast, handles concurrency, and has a big model pool plus an OpenAI-compatible API.
I'm still with Ollama for trying out new models; that's where it really shines and saves time.

@shing100

I really wish this feature was added.

@iakashpaul

iakashpaul commented Apr 29, 2024

For anyone else stumbling onto this thread-

To use the web UI with parallel chat generation for multiple clients, use TGI or server.cpp (or duplicate my HF Space onto a GPU instance directly) with the parallel flag set to 2 or more depending on your VRAM, then run the Open WebUI container with server.cpp acting as an OpenAI replacement, as shown below.

Server.cpp

 ./server -m ./models/mistral-7b-instruct-v0.2.Q8_0.gguf -ngl 30 --host 127.0.0.1 --port 8080 --parallel 2

Open WebUI container

 docker run  -p 3000:8080   -v open-webui:/app/backend/data   -e OPENAI_API_BASE_URLS="http://127.0.0.1:8080/v1"   -e OPENAI_API_KEYS="randomplaceholder"   --restart always   ghcr.io/open-webui/open-webui:main

cc: @ehartford

The latest release has experimental flags:

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve

@mili-tan

https://github.com/ollama/ollama/releases/tag/v0.1.33-rc5

@hanzlaramey

game changer!!! Thank you @mili-tan

@ehartford

Thank you!

@ehartford

If I don't want to limit the number of loaded models, I just don't set that variable?

@BBjie

BBjie commented Apr 30, 2024

My question is about the concurrency feature: can I edit my compose file and add them there?

ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"

    
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    command: serve
volumes:
  ollama:

Or is there another way to pass the values in for
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve?

@jammsen

jammsen commented Apr 30, 2024

> My question is for the concurrency feature, can I edit and add them into my compose-up file. […] or is there other way to pass the value in for OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve

I think those are 2 different things; the env vars should be passed to ollama serve when you run the native app on your system. Not sure though how you can use them in a Docker image of ollama. @jmorganca could you clear this up for us please?
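For reference, one way this can be done with plain docker run is to pass the variables with -e flags (a sketch only; the image tag, volume name, and GPU flag here are assumptions):

 docker run -d --gpus=all \
   -e OLLAMA_NUM_PARALLEL=4 \
   -e OLLAMA_MAX_LOADED_MODELS=4 \
   -v ollama:/root/.ollama \
   -p 11434:11434 \
   --name ollama \
   ollama/ollama:0.1.33-rc5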

@BBjie

BBjie commented Apr 30, 2024

> Have not tested but should work:
>
> ollama:
>     image: ollama/ollama:latest
>     container_name: ollama
>     environment:
>       - OLLAMA_NUM_PARALLEL=4
>       - OLLAMA_MAX_LOADED_MODELS=4
> ...

I tested it one hour ago, it was not working...

@laktosterror

laktosterror commented Apr 30, 2024

Sorry, you might need to get the pre-release image. I corrected my previous answer.

@laktosterror

laktosterror commented Apr 30, 2024

On mobile. I seem to have messed up my original post.

Try this:

ollama:
    image: ollama/ollama:0.1.33-rc5
    container_name: ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=4
...

@BBjie

BBjie commented Apr 30, 2024

> On mobile. I seem to have messed up my original post.
>
> Try this:
>
> ollama:
>     image: ollama/ollama:0.1.33-rc5
>     container_name: ollama
>     environment:
>       - OLLAMA_NUM_PARALLEL=4
>       - OLLAMA_MAX_LOADED_MODELS=4
> ...
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":46394,"status":200,"tid":"140623411642368","timestamp":1714493452}
ollama  | {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":4,"n_processing_slots":0,"task_id":399,"tid":"140624811528192","timestamp":1714493452}
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":46410,"status":200,"tid":"140623403249664","timestamp":1714493452}
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":46410,"status":200,"tid":"140623403249664","timestamp":1714493452}
ollama  | {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":4,"n_processing_slots":0,"task_id":400,"tid":"140624811528192","timestamp":1714493452}
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":46410,"status":200,"tid":"140623403249664","timestamp":1ollama  | {"function":"print_timings","level":"INFO","line":269,"msg":"prompt eval time     =     991.40 ms /   713 tokens (    1.39 ms per token,   719.19 tokens per second)","n_prompt_tokens_processed":713,"n_tokens_second":719.1864417721239,"slot_id":1,"t_prompt_processing":991.398,"t_token":1.3904600280504908,"task_id":402,"tid":"140624811528192","timestamp":1714493454}
ollama  | {"function":"print_timings","level":"INFO","line":283,"msg":"generation eval time =    1251.42 ms /    56 runs   (   22.35 ms per token,    44.75 tokens per second)","n_decoded":56,"n_tokens_second":44.7492364661528,"slot_id":1,"t_token":22.346749999999997,"t_token_generation":1251.418,"task_id":402,"tid":"140624811528192","timestamp":1714493454}
ollama  | {"function":"print_timings","level":"INFO","line":293,"msg":"          total time =    2242.82 ms","slot_id":1,"t_prompt_processing":991.398,"t_token_generation":1251.418,"t_total":2242.816,"task_id":402,"tid":"140624811528192","timestamp":1714493454}
ollama  | {"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":832,"n_ctx":3904,"n_past":831,"n_system_tokens":0,"slot_id":1,"task_id":402,"tid":"140624811528192","timestamp":1714493454,"truncated":false}
ollama  | {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":46416,"status":200,"tid":"140623966629888","timestamp":1714493454}
ollama  | [GIN] 2024/04/30 - 16:10:54 | 200 |  2.546651478s |      172.19.0.4 | POST     "/api/chat"

Thanks, but I believe it's still not working as expected...

@dhiltgen dhiltgen self-assigned this May 2, 2024
@dhiltgen
Collaborator

dhiltgen commented May 2, 2024

I'm going to close this as fixed now in 0.1.33. As commenters above have pointed out, it's opt-in for now, but we do intend to eventually do concurrency automatically without requiring the env vars.

To clarify some questions above: There are 2 layers of concurrency. There's OLLAMA_NUM_PARALLEL which controls how many requests can be answered against a single loaded model, and there's OLLAMA_MAX_LOADED_MODELS which controls how many models can be loaded at the same time, up to the limits of VRAM. (note: we do not support loading two or more copies of the same model)

https://github.com/ollama/ollama/releases
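
As an illustration of the first layer, once OLLAMA_NUM_PARALLEL is set, multiple requests can be issued against the same loaded model at once (a minimal sketch; the model name and prompts are just placeholders):

 curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?"}' &
 curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Write a haiku about GPUs."}' &
 wait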

@dhiltgen dhiltgen closed this as completed May 2, 2024
@taozhiyuai

taozhiyuai commented May 3, 2024

> I'm going to close this as fixed now in 0.1.33. As commenters above have pointed out, it's opt-in for now, but we do intend to eventually do concurrency automatically without requiring the env vars.
>
> To clarify some questions above: There are 2 layers of concurrency. There's OLLAMA_NUM_PARALLEL which controls how many requests can be answered against a single loaded model, and there's OLLAMA_MAX_LOADED_MODELS which controls how many models can be loaded at the same time, up to the limits of VRAM. (note: we do not support loading two or more copies of the same model)
>
> https://github.com/ollama/ollama/releases

@dhiltgen
I installed ollama on a Mac.
Are these the default values for these two env vars: OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4?

The following should work, right? :)
t@603e5f4a42f1 ~ % launchctl setenv OLLAMA_NUM_PARALLEL 5
t@603e5f4a42f1 ~ % launchctl setenv OLLAMA_MAX_LOADED_MODELS 5

To check the values of the env vars :)
launchctl getenv OLLAMA_NUM_PARALLEL
launchctl getenv OLLAMA_MAX_LOADED_MODELS

Restart ollama to activate the new values of the env vars. :) Very nice!

@dhiltgen
Collaborator

dhiltgen commented May 4, 2024

The default values for these settings in 0.1.33 retain the existing behavior of only 1 request at a time and only 1 model loaded at a time. In a future version we plan to adjust the defaults to enable this automatically, but until then, yes, if you want to use concurrency you'll have to set these environment variables on the server.
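
On Linux installs that run Ollama as a systemd service, one common way to set these variables on the server (a sketch, assuming the service is named ollama.service) is via a systemd override:

 sudo systemctl edit ollama.service
 # add the following under the [Service] section:
 #   Environment="OLLAMA_NUM_PARALLEL=4"
 #   Environment="OLLAMA_MAX_LOADED_MODELS=4"
 sudo systemctl daemon-reload
 sudo systemctl restart ollama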
