Parallel requests #358
Comments
Not a mistake – Ollama will serve one generation at a time currently, but supporting 2+ concurrent requests is definitely on the roadmap.
Is there a way to run multiple instances on the same machine?
I am working on a similar implementation using MAS. If you have beefy hardware, I recommend you run several instances in closed environments, save the completions or generations, and process them later on. Also check out the LiteLLM APIs on top of Ollama. Quite helpful.
Would definitely be a great addition to Ollama:
I'm running it in the cloud on a T4 with 16GB of GPU memory, and having both phi-2 and codellama in VRAM would be no issue at all. Ideally, other models would be kept in regular RAM instead of being loaded from disk.
Any news on this one? Parallel requests can be a real game changer for Ollama.
If you have enough VRAM to run multiple models, you can create multiple instances of ollama with different port numbers, then use my proxy to manage access and route the requests to each:
I would really like to have an M2 Ultra with 192GB on my company's intranet that can service the whole R&D department (a dozen people). As long as I have enough RAM, I wish to be able to run multiple inference requests at the same time. Thank you for considering my wish!
Same here, multiple 7B models served by an M2 Ultra. My dream! 🙏
At first glance, when I started examining the source code, I thought that the problem with concurrent requests was due to the current implementation's use of global variables in its llama.cpp binding. After paying a little more attention to the original llama.cpp source code, I realized that the original implementation wasn't exactly geared towards multithreaded or server-side use, but rather towards a local development/experimentation use case.
That's why I had to build a proxy. You can install multiple servers on a single machine or on multiple machines, then use my proxy to service multiple users with a multi-queue. For example, if you have 2 servers, you can service at most 2 clients simultaneously; when both are busy, the current message is queued on the server with the least full queue. In practice, since generation is very fast, you don't get too many collisions. We rarely get more than 1 person in the queue. I also added security features as well as logging features. It works fine and serves all of us perfectly. Also, a user built a Docker version and I accepted their PR: If you are interested, you can try it; it is open source, so you can also read the code, get inspired, help enhance it, etc. It is Apache 2.0, so you can do whatever you want with it, 100% free. We get around 300 tokens/s for each user, so it is not causing any significant delay.
llama.cpp can be run as multiple processes, with multiple threads, or as a server.
The easiest way to multiplex Ollama, at least on a Linux system, should be with a reverse-proxy load balancer like HAProxy. Launch multiple instances of ollama serve on different ports and put the load balancer in front of them. Note that this approach can sometimes deteriorate performance due to CPU contention. I have a decent system with 64 cores and 24GB of GPU RAM. When I run 3 instances of Ollama with HAProxy to generate embeddings it does speed up the process; however, if I try to generate text, the processing time is much worse than with a single instance.
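A minimal sketch of that multi-instance setup, assuming OLLAMA_HOST is used to bind each instance to its own port (the ports, model name, and prompt below are illustrative):

```sh
# Two independent Ollama instances, each bound to its own port
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# A reverse proxy / load balancer (e.g. HAProxy) can then round-robin requests
# across both backends; hitting one backend directly looks like this:
curl http://127.0.0.1:11434/api/generate -d '{"model": "llama3", "prompt": "hello"}'
```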
Would like to see the ability to use the same LLM in two or more apps at the same time, and also the ability to use multiple LLMs in two or more apps at the same time.
Well, technically you can run multiple instances of the same model by running multiple instances of ollama with different port numbers. Configure them in the proxy config file and they can then be accessed from multiple clients at once.
That's just wrong 😂
fix ollama#358 Signed-off-by: Adphi <philippe.adrien.nousse@gmail.com>
Looks relevant: #3418
Any update?
@darwinvelez58 there is some work being done in #3418. I was testing both options; at the moment it seems very unstable.
Hi, sorry, I have been very busy lately and had no time to look at the GitHub messages. It also has optional security, allowing you to have a user base with a personal key for each user. I agree that you can do more if ollama manages all of this internally to share the weights and so on. I built my proxy because I needed the security and the management features, and back then ollama had no plans to do this, so I had to move; but I'll be happy if they integrate multi-user support directly.
vLLM uses PagedAttention. Is this something that must be integrated in ollama, or in the llama.cpp part?
@TheMasterFX this is on the llama.cpp side of things.
Seems like there are improvements coming:
I was reading Ray Serve's docs and it seems they support Dynamic Request Batching, which creates a queue system. I am new to this, so can we use this with Ollama for batch inference?
I've had some success load balancing multiple Ollama instances on Kubernetes with a pool of GPU nodes:
Notice the 5 "gpupool" nodes, each with a T4 nvidia GPU. Ollama is then deployed as a daemonset on each of the GPU nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ollama-daemonset
  namespace: ollama-ns
spec:
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - image: ollama/ollama:latest
          imagePullPolicy: Always
          lifecycle:
            postStart:
              exec:
                command:
                  - ollama
                  - pull
                  - llama3
          name: ollama
          ports:
            - containerPort: 11434
              protocol: TCP
      # A few special selectors / tolerations to schedule on the GPU nodes
      nodeSelector:
        accelerator: nvidia
      tolerations:
        - effect: NoSchedule
          key: sku
          operator: Equal
          value: gpu
```

These can then be load balanced behind a Kubernetes Service, with the ClusterIP being exposed internally "as a service" to the rest of the cluster:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama-ns
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 11434
  selector:
    app: ollama
  sessionAffinity: None
  type: ClusterIP
```

Inside the cluster, I can hit the load balancing service resolved by CoreDNS:
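For example, a request from another pod might look like this (a sketch; the model name and prompt are placeholders, and the DNS name follows from the Service manifest above):

```sh
# The ClusterIP service load-balances across the Ollama pods on the GPU nodes
curl http://ollama-service.ollama-ns.svc.cluster.local/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'
```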
This works pretty well for slower requests but has 2 problems:
It'd be absolutely incredible if multiple requests could be serviced by a single Ollama instance. Thanks for all the amazing work on ollama - I'm happy to test anything out and report back.
I'm with vLLM now for that use case; it's very fast, manages concurrency, and has a big model pool and an OpenAI-compatible API.
I really wish this feature was added.
For anyone else stumbling onto this thread: to use the webUI with parallel chat generation on multiple clients, use TGI or server.cpp (or duplicate my HF Space onto a GPU instance directly) with the parallel flag set to 2 or more depending on your VRAM, then run the open-webui container with server.cpp acting as an OpenAI replacement, as shown below.
```sh
./server -m ./models/mistral-7b-instruct-v0.2.Q8_0.gguf -ngl 30 --host 127.0.0.1 --port 8080 --parallel 2

docker run -p 3000:8080 -v open-webui:/app/backend/data -e OPENAI_API_BASE_URLS="http://127.0.0.1:8080/v1" -e OPENAI_API_KEYS="randomplaceholder" --restart always ghcr.io/open-webui/open-webui:main
```

cc: @ehartford

Latest release has experimental flags for concurrency.
Game changer!!! Thank you @mili-tan
Thank you!
If I don't want to limit the number of loaded models, do I just not set that variable?
My question is about the concurrency feature: can I edit my compose file and add them there, or is there another way to pass the values in?
I think those are 2 different things; the ENVs should be passed to ollama serve when you run the native app on your system. Not sure though how you can use this in a Docker image of ollama. @jmorganca could you clear this up for us please?
I tested it one hour ago, it was not working...
Sorry, you might need to get the pre-release image. Corrected previous answer.
On mobile. I seem to have messed up my original post. Try this:
Thanks, but I believe it is still not working as expected...
I'm going to close this as fixed now in 0.1.33. As commenters above have pointed out, it's opt-in for now, but we do intend to eventually do concurrency automatically without requiring the env vars. To clarify some questions above: there are 2 layers of concurrency. There's the number of parallel requests a single loaded model will serve, and the number of models that can be loaded at the same time.
@dhiltgen the following should work, right? Check the value of the ENVs, then restart ollama to activate the new values. Very nice!
The default values for these settings in 0.1.33 retain the existing behavior of only 1 request at a time, and only 1 model at a time. In a future version we plan to adjust the defaults to enable this automatically, but until then, yes, if you want to use concurrency, you'll have to set these environment variables on the server.
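For reference, a minimal sketch of opting in, using OLLAMA_NUM_PARALLEL (parallel requests per loaded model) and OLLAMA_MAX_LOADED_MODELS (concurrently loaded models), the variables introduced in 0.1.33; the values below are illustrative:

```sh
# Native install: set the variables in the environment of the server process
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# Docker: pass the same variables with -e
docker run -d -p 11434:11434 \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -v ollama:/root/.ollama \
  ollama/ollama
```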
The app is amazing, but the problem is: if I want to create a multi-agent setup from one API, I need to create a queue system, as it can reply to only one request at a time. Is there a way to improve this, or do I have to implement a queue system?
I just started here, so sorry for any mistakes ;)