
Feat: Ability to work with multiple Ollama servers #278

Closed
jeremiahsb opened this issue Dec 26, 2023 · 17 comments · Fixed by #1033

@jeremiahsb

Is your feature request related to a problem? Please describe.
On my system I have a capable CPU with a large amount of RAM that can run quite large models, albeit slowly. I also have an RTX 3060 that can run smaller models quite quickly. I can easily run two Docker instances of Ollama, one in GPU mode and one in CPU-only mode. It would be great to have a single instance of Ollama-Webui that can switch between the two Ollama instances.

Describe the solution you'd like
Have a settings screen where I can add one or more additional Ollama servers. Along with that entry screen, have a toggle or drop-down list that would ideally let me set the default Ollama server on a per-model level.

Describe alternatives you've considered
An alternative solution, which would be less ideal, would be to simply run two instances of Ollama-Webui, each pointing to a different Ollama container.
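
For context, a compose sketch of the two-instance setup described above (one GPU instance, one CPU-only instance sharing the same model store) might look roughly like this; service names and ports are illustrative:

services:

  ollama-gpu:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - 11434:11434
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0' ]
              capabilities:
                - gpu

  ollama-cpu:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - 11435:11434   # same container port, different host port

volumes:
  ollama: {}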

@tjbck
Contributor

tjbck commented Dec 27, 2023

Hi, thanks for the suggestion. I'll take a look in the near future and assess its usability/feasibility. Thanks!

@Loki321

Loki321 commented Dec 30, 2023

+1 but I would like to use it slightly differently.

I have 2 Ollama instances on 2 different machines: one that can only do CPU inference (but has a lot of RAM, so it can run larger models) and one that is not always available (due to power cost concerns) but has a fast GPU.

Ideally I would be able to define multiple ollama servers in the web-ui and an order in which to use them.

For example, if the machine with the GPU is available (and the selected model is on that machine) then route requests to that instance otherwise use the slower (but always on) machine.

Manual selection would also be fine, or listing them all in the same way the UI currently separates Ollama and external (OpenAI API) models. But being able to define a primary and a backup connection (or more than 2, with priorities) would be great, as it would mean I could use a single interface to interact with Ollama and speed things up by switching on a faster machine whenever the task calls for it.

@tjbck tjbck added this to the v1.0 milestone Dec 30, 2023
@dnviti
Contributor

dnviti commented Jan 3, 2024

There would be a "context" problem if you switch from one instance to another in the middle of a conversation.
I tried to assess this during the Kubernetes support development.
As you say, the only way I can think of is to switch instances manually before starting the conversation; in Kubernetes this would mean selecting the exact StatefulSet by its id.

A really cool feature that Ollama itself could implement would be to save the "chat context" data in a shared database (like MongoDB) and reuse it as context for the next prompt response. Just speculating; I don't know if that's possible.

@Loki321

Loki321 commented Jan 3, 2024

My use case only really requires being able to spin up a machine and connect to it when I know I'm going to need it. It wouldn't really happen mid-conversation.

Related to saving the chat context though, llama.cpp has a --prompt-cache flag that lets you save the prompt cache to a file and load it back in later. I would think it could be leveraged to achieve what you're talking about. Like you say though, it would need to be done in Ollama itself.
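
For reference, the flag in llama.cpp's main example is used roughly like this (model path and prompts are just placeholders):

# First run: evaluate the prompt and save its cache to a file
./main -m models/7B/model.gguf --prompt-cache session.bin -p "You are a helpful assistant." -n 64

# Later run: reuse the cached prefix so it doesn't have to be re-evaluated
./main -m models/7B/model.gguf --prompt-cache session.bin -p "You are a helpful assistant. Summarise the chat so far." -n 64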

@dnviti
Contributor

dnviti commented Jan 4, 2024

It would be possible in Ollama then; the backend only needs to save the prompt cache keyed by the chat id and reuse it whenever a new question is asked in that same chat, of course using a shared data volume.

@SethBurkart123

SethBurkart123 commented Jan 16, 2024

Just for anyone looking over this thread before the feature gets introduced: if you want to run the same models on multiple Ollama instances (basically to shorten queue times if a decent number of people are using your webui instance), you can do load balancing with nginx.

This solution isn't completely useful for some of the people in this thread, but it came in handy for me as I have two GPUs and wanted to make sure that multiple people could generate at once without queueing. If you are using the least_conn; method (not the one described later for @Loki321), both Ollama instances must have the same models; otherwise you can get errors because the webui thinks certain models are available when they aren't.

All you need is an nginx instance running (there are heaps of tutorials for that) and then to put this in your nginx.conf:

http {
    upstream backend_servers {
        least_conn;                     # Enable least connections load balancing

        # Add your Ollama servers here eg. if one of them was http://localhost:11434/api then you would add `server localhost:11434`
        server localhost:11434;          # Ollama server 1
        server localhost:11435;          # Ollama server 2
    }

    server {
        listen 9090;                     # This is the port you would use for the Ollama API URL in the WebUI eg. http://localhost:9090/api

        location / {
            proxy_pass http://backend_servers; # Forward requests to the upstream block
        }
    }
}

@Loki321 if you want to prioritise one server over the other instead of whichever has the smallest queue (i.e. prioritise your GPU server when it's available, otherwise fall back to CPU), you can remove the least_conn; line and modify the servers to:

        server localhost:11434 max_fails=3 fail_timeout=30s;  # Primary GPU server
        server localhost:11435 backup;                      # Backup CPU only server

@justinh-rahb
Collaborator

justinh-rahb commented Jan 17, 2024

@SethBurkart123 fascinating, I didn't think this would actually work, but it seems to so far... I can increase my Mac Studio's throughput by running 3x Ollama "workers" on the same machine; it's got enough CPU and RAM. It's not quite triple the throughput, but it's definitely an improvement for users.
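
In case anyone wants to replicate this, each worker can be bound to its own port with the OLLAMA_HOST environment variable and then fronted by an nginx upstream like the one above; a rough sketch (ports are illustrative):

# Start three Ollama workers on separate ports on the same machine
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
OLLAMA_HOST=127.0.0.1:11436 ollama serve &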

@davidamacey

@SethBurkart123 I followed your instructions, but I am having a bit of trouble getting nginx to forward to the different Ollama servers. Below are my docker compose file and nginx config for deploying across 4 separate GPUs. So far I have:

  1. Changed the port numbers of the Ollama containers in the compose and conf files, in various combinations
  2. Restarted compose after each change

With the setup below, requests only run on the container on port 11434. If I change 11434 to a different Ollama container, the model loads on that respective GPU. I tested with least_conn and with no method specified, which defaults to round_robin.

The load balancer will not send requests to the other Ollama containers. I tested with two browsers on the same machine and with multiple different machines.

It may be something simple I missed; open to suggestions.

Any assistance is greatly appreciated.

# docker-compose.yaml
version: '3.8'

services:

  nginx:
    image: nginx:latest
    container_name: nginx
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - 9090:80
    networks:
      - ollama_net

  ollama-00:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-00
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-01:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-01
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11435:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '1' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-02:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-02
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11436:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '2' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-03:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-03
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11437:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '3' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-webui:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ollama-webui
    volumes:
      - /mnt/md0/ollama_webui/:/app/backend/data
    depends_on:
      - nginx
      - ollama-00
      - ollama-01
      - ollama-02
      - ollama-03
    ports:
      - 3000:8080
    environment:
      - OLLAMA_API_BASE_URL=http://nginx:9090/api
    # extra_hosts:
    #   - host.docker.internal:host-gateway
    restart: unless-stopped
    networks:
      - ollama_net

networks:
  ollama_net:
    driver: bridge

# volumes:
#   ollama: {}
#   ollama-webui: {}

# nginx.conf
worker_processes auto;

events { worker_connections 1024; }

http {
    upstream ollama {
        least_conn;                     # Enable least connections load balancing

        server ollama-00:11434;          # Ollama server 0
        server ollama-01:11435;          # Ollama server 1
        server ollama-02:11436;          # Ollama server 2
        server ollama-03:11437;          # Ollama server 3
    }

    server {
        listen 9090;                     # This is the port you would use for the Ollama API URL in the WebUI eg. http://localhost:9090/api

        location / {
            proxy_pass http://ollama; # Forward requests to the upstream block
        }
    }
}

@justinh-rahb
Collaborator

justinh-rahb commented Jan 24, 2024

@davidamacey here are two possible solutions to address the issue in your Nginx configuration file:

  1. Change all the ports in your nginx.conf file to 11434, while keeping the current hostnames as they are (see the sketch below). This will ensure that Nginx communicates with the containers over their internal Docker network on port 11434. In this case you don't actually need to publish external ports at all, unless something else accesses your Ollama instances directly from outside of Docker.
  2. Change the ollama-##:1143# hostnames in your nginx.conf file to host.docker.internal:1143# instead. This will ensure that Nginx communicates with the containers using their published external ports. This is one way to do it, but I'd recommend the first.

Either of these changes should resolve the issue with your Nginx configuration file.
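
For option 1, the upstream block in the nginx.conf above would look roughly like this (a sketch, not a tested config):

    upstream ollama {
        least_conn;

        # On the shared Docker network every container listens on its internal port 11434
        server ollama-00:11434;
        server ollama-01:11434;
        server ollama-02:11434;
        server ollama-03:11434;
    }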

@davidamacey

@justinh-rahb and @SethBurkart123 Thank you for the prompt response! I tried a few more tests.

  1. I changed all the ports in the config to 11434, but had the same result: sending two simultaneous requests from two browsers resulted in a queue on GPU 0
  2. host.docker.internal was not resolvable in nginx
  3. I tried making separate volumes for each Ollama container, in case there was per-container config. Currently I am using the same volume for all Ollama model files.

I greatly appreciate your quick feedback. I will continue to troubleshoot.

@davidamacey

@justinh-rahb and @SethBurkart123 I appreciate the guidance. Unfortunately, I was not able to get multiple Ollama containers working behind an nginx load balancer. It seems the FastAPI stream request receives a stream of data followed by the final POST once the chat is completed. My conclusion (after consulting with friends) is that nginx is working, but it doesn't handle the API stream the way we expect.
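
If the streamed response is the culprit, one adjustment worth trying (untested here) is to turn off proxy buffering for the Ollama location so nginx passes chunks through as they arrive:

        location / {
            proxy_pass http://ollama;       # the upstream block from the config above
            proxy_http_version 1.1;
            proxy_buffering off;            # stream response chunks instead of buffering them
            proxy_cache off;
            proxy_read_timeout 300s;        # allow long generations
        }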

This drove me to learn about vLLM. vLLM directly supports OpenAI's API format, so I can deploy a local vLLM container with a selected LLM. In the UI I enter an EMPTY key and the URL of the vLLM instance, select the model in chat, and off to the races! This provides faster response times and async requests.

The con of vLLM is that it requires an NVIDIA GPU, so not all users will have this, given the popularity of Apple Silicon M chips, etc.

I am happy to report your application works with a vLLM backend without much effort. There is one potential bug with the System Prompt, but I will open an issue for it.

Below is my Docker compose setup for anyone interested in giving it a try. Note that an Ollama container is still required; otherwise the UI will throw errors about a missing Ollama connection.

version: '3.8'

services:

  ollama-00:
    volumes:
      - /mnt/nas/ollama_webui/ollama:/root/.ollama
    container_name: ollama-00
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  vllm:
    container_name: vllm
    image: vllm/vllm-openai:latest
    pull_policy: always
    volumes:
      - /mnt/nas/hf_vllm_models/:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=<token>
    ports:
      - "8000:8000"
    ipc: host
    # command: ["--model", "mistralai/Mixtral-8x7B-Instruct-v0.1"]
    command: ["--model", "mistralai/Mistral-7B-Instruct-v0.2", "--tensor-parallel-size", "2"]
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '2' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-webui:
    image: ghcr.io/ollama-webui/ollama-webui:main
    container_name: ollama-webui
    pull_policy: always
    volumes:
      - /mnt/nas/ollama_webui/webui:/app/backend/data
    depends_on:
      - vllm
      - ollama-00
    ports:
      - 3000:8080
    environment:
      - OLLAMA_API_BASE_URL=http://ollama-00:11434/api
      - OPENAI_API_BASE_URL=http://vllm:8000/v1
      - OPENAI_API_KEY=EMPTY
    restart: unless-stopped
    networks:
      - ollama_net

networks:
  ollama_net:
    driver: bridge

Happy coding!

@explorigin
Contributor

I recommend that the webui become a frontend to the litellm proxy. There are a lot of things litellm can do that we could be the pretty face for; managing multiple endpoints is one of them. https://docs.litellm.ai/docs/proxy/configs
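
From the linked docs, a proxy config that points the same model name at several Ollama endpoints looks roughly like this (hosts and model names are illustrative); entries sharing a model_name are load-balanced by the proxy:

model_list:
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama-00:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama-01:11434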

@justinh-rahb
Collaborator

I've been beating this drum for a while now; it does seem to me to be the quickest way to bootstrap support for other backends.

@VfBfoerst

VfBfoerst commented Feb 20, 2024

Speaking of litellm, I got it to work with my open-webui and it handles load balancing very well (tested with 2 GPUs and 4 Ollama instances).
The only "problem" appeared after adding authentication to the litellm proxy server. Then the webui couldn't talk to the API anymore, I guess because there is no way to supply the bearer token.
Can you maybe add a bearer token field when adding the external litellm-api URL, e.g. here?:
[screenshot]

It would also be nice to set different bearer tokens per user, so I can track the usage of the litellm API on a per-user basis.

I can also create a new issue for this, if wanted :)

Edit: the corresponding header is e.g. curl http://123.123.123.123/v1/chat/completions -H 'Authorization: Bearer sk-1234'
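
For completeness, the proxy-side setting that enables this auth is roughly the following in the litellm config (a sketch; check the litellm docs for the current syntax):

general_settings:
  master_key: sk-1234   # clients must then send "Authorization: Bearer sk-1234"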

@VfBfoerst


This works in the newest version, thanks; you can set the token in the API key field :}

@christopher-kapic

I know this isn't exactly what you guys are discussing, but I'm also not sure if it's sufficiently different to open a new feature request. If it is, I can do that.

It would be awesome if users could add their own OpenAI endpoints/API keys, or at least if the admin could configure multiple OpenAI endpoints (I think this option would require database changes, and I'm not familiar enough with peewee at the moment to implement it).

One way this could be done for individual users is how TypingMind does it: store the API key and endpoint in localStorage and make a direct request from the browser to the OpenAI endpoint (although this may introduce CORS errors, I think only when working with custom APIs that follow the OpenAI API specs). However, I think the ideal solution would let users store their custom keys/endpoints in the database so the request can be made from the backend, avoiding CORS errors.

Any thoughts on this?

@justinh-rahb
Collaborator

@christopher-kapic, the WebUI initially processed OpenAI requests solely on the browser side, with settings stored in local storage exactly as you say. However, we received several requests about it and decided to change the implementation to proxy through the backend like the Ollama API requests. It appears that supporting both methods might be necessary to cater to all users, but this could become quite intricate.

@tjbck tjbck self-assigned this Mar 2, 2024
This was referenced Mar 5, 2024
@tjbck tjbck linked a pull request Mar 5, 2024 that will close this issue