Feat: Ability to work with multiple Ollama servers #278
Hi, thanks for the suggestion. I'll take a look in the near future and assess its usability/feasibility. Thanks!
+1, but I would like to use it slightly differently. I have 2 Ollama instances on 2 different machines: one that can only do CPU inference (but with a lot of RAM, so it can run larger models) and one that is not always available (due to power cost concerns) but uses a fast GPU. Ideally I would be able to define multiple Ollama servers in the web UI and an order in which to use them. For example, if the machine with the GPU is available (and the selected model is on that machine), route requests to that instance; otherwise use the slower (but always-on) machine. Manual selection would also be fine, or listing them all in the same way the UI currently separates Ollama and external (OpenAI API) models. But being able to define a primary and a backup connection (or more than 2, with priorities) would be great, as it would mean I can use a single interface to interact with Ollama and speed things up by just switching a faster machine on when it's deemed necessary for the task.
There would be a "context" problem if you switch from one instance to another in the middle of a conversation. A really cool feature that Ollama itself could implement would be to save the "chat context" data to a shared database (like MongoDB) and reuse that as context for the next prompt response. Just speculating, I don't know if that's possible.
My use case only really needs to be able to spin up a machine and connect when I know I'm going to need it; it wouldn't really happen mid-conversation. Related to saving the chat context though, llama.cpp has a
It would be possible in Ollama then: the backend only needs to save the prompt cache using the chat ID and always reuse it when a new question is asked in that same chat, of course using a shared data volume.
Just for anyone looking over this thread before the feature gets introduced: if you want to run the same models on multiple Ollama instances (basically just to shorten queue times if there are a decent number of people using your WebUI instance), you can do load balancing with nginx. This solution isn't completely useful for some of the people in this thread, although it came in handy for me as I have two GPUs and wanted to make sure that multiple people could generate at once without queueing. All you need is an nginx instance running (there are heaps of tutorials for that) and then to put a config along the lines of the sketch below in your nginx config, pointing the WebUI's Ollama API URL at the nginx port.
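A minimal sketch of that kind of nginx load-balancing config; the upstream hostnames (ollama-gpu-1, ollama-gpu-2) and the listen port are placeholders for illustration, not values taken from this comment:

    worker_processes auto;
    events { worker_connections 1024; }

    http {
        upstream ollama {
            least_conn;                    # send each request to the instance with the fewest active connections
            server ollama-gpu-1:11434;     # first Ollama instance (placeholder hostname)
            server ollama-gpu-2:11434;     # second Ollama instance (placeholder hostname)
        }

        server {
            listen 9090;                   # point the WebUI's Ollama API URL at http://localhost:9090/api
            location / {
                proxy_pass http://ollama;  # forward requests to the upstream group above
            }
        }
    }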
@Loki321 if you want to prioritise one server over the other instead of whichever has the smallest queue (i.e. prioritise your GPU server when it's available, otherwise fall back to CPU), you can drop the least_conn directive and mark the fallback server with nginx's backup parameter, as sketched below.
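A hedged sketch of that upstream block (hostnames are again placeholders); with plain round-robin plus backup, nginx only sends traffic to the backup server when the primary is unreachable:

    upstream ollama {
        server ollama-gpu:11434;           # primary: the GPU machine, used whenever it is reachable
        server ollama-cpu:11434 backup;    # fallback: only used when the primary is down
    }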
@SethBurkart123 fascinating, I didn't think this would actually work, but it seems to so far... I can increase my Mac Studio's throughput by running 3x Ollama "workers" on the same machine; it's got enough CPU and RAM. It's not quite triple the throughput, but it's definitely an improvement for users.
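The comment doesn't say how the workers were started; one common way to get several Ollama instances on one host (an assumption for illustration, not necessarily what was done here) is to bind each ollama serve process to its own port via the OLLAMA_HOST environment variable and then load-balance across the ports as above:

    # three Ollama workers on one machine, each listening on its own port
    OLLAMA_HOST=127.0.0.1:11434 ollama serve &
    OLLAMA_HOST=127.0.0.1:11435 ollama serve &
    OLLAMA_HOST=127.0.0.1:11436 ollama serve &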
@SethBurkart123 I followed your instructions, but I am having a bit of trouble getting nginx to forward to the different Ollama servers. Below are some snippets from my nginx config file and the Docker Compose setup I use to deploy across 4 separate GPUs.
With the setup below, requests only go to the container on port 11434. If I change 11434 to a different Ollama container, the model loads on that respective GPU. The load balancer will not send requests to the other Ollama containers. I tested with two browsers on the same machine and from multiple different machines. It may be something simple I missed; I'm open to suggestions. Any assistance is greatly appreciated.
worker_processes auto;
events { worker_connections 1024; }
http {
upstream ollama {
least_conn; # Enable least connections load balancing
server ollama-00:11434; # Ollama server 0
server ollama-01:11435; # Ollama server 1
server ollama-02:11436; # Ollama server 2
server ollama-03:11437; # Ollama server 3
}
server {
listen 9090; # This is the port you would use for the Ollama API URL in the WebUI eg. http://localhost:9090/api
location / {
proxy_pass http://ollama; # Forward requests to the upstream block
}
}
}
@davidamacey here are two possible solutions to address the issue in your Nginx configuration file:
Either of these changes should resolve the issue with your Nginx configuration file.
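For illustration only (this is an assumption about the setup, not a restatement of the suggestions above): when several Ollama containers share one Docker network, each of them typically listens on Ollama's default port 11434 inside that network, and ports like 11435-11437 exist only as host-side mappings, so the upstream block would reference the container names with 11434 throughout:

    upstream ollama {
        least_conn;
        server ollama-00:11434;   # inside the Docker network every container listens on 11434;
        server ollama-01:11434;   # the 11435-11437 ports are only host-side port mappings
        server ollama-02:11434;
        server ollama-03:11434;
    }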
@justinh-rahb and @SethBurkart123 Thank you for the prompt response! I tried a few more tests.
I greatly appreciate your quick feedback. I will continue to troubleshoot.
@justinh-rahb and @SethBurkart123 I appreciate the guidance. Unfortunately, I was not able to get multiple Ollama containers deployed behind an nginx load balancer. It seems the FastAPI StreamRequest receives a stream of data followed by the final POST once the chat is completed, so my conclusion (after consulting with friends) is that nginx is working, but it doesn't handle the API stream the way we expect.

This drove me to learn about vLLM. vLLM is directly compliant with OpenAI's API format, so I can deploy a local vLLM container with a selected LLM. In the UI I enter an EMPTY key and the URL of the vLLM instance, select the model in chat, and I'm off to the races! This gives faster response times and async requests. The con of vLLM is that it requires an NVIDIA GPU, which not all users will have given the popularity of Apple Silicon M-series chips, etc.

I am happy to report your application works with a vLLM backend without much effort. There is one potential bug with the System Prompt, but I will open a separate issue for it. Below is my Docker Compose setup for those interested in giving it a try. Note: an Ollama container is still required, otherwise the UI will throw errors that there isn't an Ollama connection.
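The compose file itself isn't reproduced above, so here is a minimal sketch of what such a vLLM setup could look like; the image tag, model name, and port are assumptions for illustration and should be adapted to your hardware:

    services:
      vllm:
        image: vllm/vllm-openai:latest                        # vLLM's OpenAI-compatible API server
        command: --model mistralai/Mistral-7B-Instruct-v0.2   # placeholder model name
        ports:
          - "8000:8000"                                       # use http://<host>:8000/v1 as the OpenAI API base URL in the WebUI
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]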
Happy coding!
I recommend that the WebUI become a frontend to the litellm proxy. There are a lot of things litellm can do that we could be the pretty face for; managing multiple endpoints is one of them. https://docs.litellm.ai/docs/proxy/configs
I've been beating this drum for a while now; it does seem to me to be the quickest way to bootstrap support for other backends.
Speaking of litellm, I got it to work with my open-webui and it handles load balancing very well (tested with 2 GPUs and 4 Ollama instances). It would also be nice to set a different bearer token per user, so I can track usage of the litellm API on a per-user basis. I can create a new issue for this, if wanted :) Edit: the corresponding header is e.g.
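For context, a minimal sketch of the kind of litellm proxy config that load-balances a single model name across several Ollama instances; the hostnames and model name are placeholders rather than values from this thread:

    model_list:
      - model_name: llama3                    # the name exposed to the WebUI
        litellm_params:
          model: ollama/llama3                # litellm's Ollama provider prefix
          api_base: http://ollama-gpu-1:11434
      - model_name: llama3                    # same model_name, so litellm spreads requests across both entries
        litellm_params:
          model: ollama/llama3
          api_base: http://ollama-gpu-2:11434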
Works in the newest version, thanks; you can set the token in the API key field :}
I know this isn't exactly what you guys are discussing, but I'm also not sure if it's sufficiently different to open a new feature request; if it is, I can do that. It would be awesome if users could add their own OpenAI endpoints/API keys, or at least allow multiple OpenAI endpoints from the admin's perspective (I think this option would require database changes, and I'm not familiar enough with peewee at the moment to implement it). One way this could be done specifically for individual users is how TypingMind does it: storing the API key and endpoint in localStorage and making a direct request from the browser to the OpenAI endpoint (although this may introduce CORS errors, I think only when working with custom APIs that follow the OpenAI API specs). However, I think the ideal solution would allow users to store their custom keys/endpoints in the database so the request can be made on the backend to avoid CORS errors. Any thoughts on this?
@christopher-kapic, the WebUI initially processed OpenAI requests solely on the browser side, with settings stored in local storage exactly as you say. However, we received several requests about it and decided to change the implementation to proxy through the backend like the Ollama API requests. It appears that supporting both methods might be necessary to cater to all users, but this could become quite intricate.
Is your feature request related to a problem? Please describe.
On my system I have a capable CPU with a large amount of RAM that is able to run quite large models, albeit slowly. I also have an RTX 3060 which is able to run smaller models quite quickly. I can easily have two Docker instances of Ollama running, one in GPU mode and one in CPU-only mode. It would be great to have a single instance of Ollama-Webui with the ability to switch between the two Ollama instances.
Describe the solution you'd like
Have a settings screen where I can add one or more additional Ollama servers. Along with the entry screen, have a toggle or drop-down list that would ideally let me set the default Ollama server at a per-model level.
Describe alternatives you've considered
An alternative solution, which would be less ideal, would be to simply run two instances of Ollama-Webui, each pointing to a different Ollama container.