
Feat: Ability to work with multiple Ollama servers #278

Closed
jeremiahsb opened this issue Dec 26, 2023 · 17 comments · Fixed by #1033

@jeremiahsb

Is your feature request related to a problem? Please describe.
On my system I have a capable CPU with a large amount of RAM that can run quite large models, albeit slowly. I also have an RTX 3060 that can run smaller models quite quickly. I can easily run two Docker instances of Ollama, one in GPU mode and one in CPU-only mode. It would be great to have a single instance of Ollama-Webui that can switch between the two Ollama instances.

Describe the solution you'd like
Have a settings screen where I can add one or more additional Ollama servers. Along with that entry screen, have a toggle or drop-down list that would ideally let me set the default Ollama server on a per-model level.

Describe alternatives you've considered
An alternative solution, which would be less ideal, would be to simply run two instances of Ollama-Webui, each pointing to a different Ollama container.
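
For context, a compose sketch of the two-instance setup described above (one GPU instance, one CPU-only instance sharing the same model store) might look roughly like this; service names and ports are illustrative:

services:

  ollama-gpu:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - 11434:11434
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0' ]
              capabilities:
                - gpu

  ollama-cpu:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - 11435:11434   # same container port, different host port

volumes:
  ollama: {}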

@tjbck
Contributor

tjbck commented Dec 27, 2023

Hi, thanks for the suggestion. I'll take a look in the near future and assess its usability/feasibility. Thanks!

@Loki321

Loki321 commented Dec 30, 2023

+1 but I would like to use it slightly differently.

I have 2 Ollama instances on 2 different machines: one that can only do CPU inference (but has a lot of RAM, so it can run larger models) and one that is not always available (due to power cost concerns) but has a fast GPU.

Ideally I would be able to define multiple ollama servers in the web-ui and an order in which to use them.

For example, if the machine with the GPU is available (and the selected model is on that machine) then route requests to that instance otherwise use the slower (but always on) machine.

Manual selection would also be fine, or listing them all in the same way the UI currently separates Ollama and external (OpenAI API) models. But being able to define a primary and a backup connection (or more than 2, with priorities) would be great, as it would mean I could use a single interface to interact with Ollama and speed things up by switching on a faster machine whenever the task calls for it.

@tjbck tjbck added this to the v1.0 milestone Dec 30, 2023
@dnviti
Contributor

dnviti commented Jan 3, 2024

There would be a "context" problem if you switch from one instance to another in the middle of a conversation.
I tried to assess this during the Kubernetes support development.
As you say, the only way I can think of is to switch instances manually before starting the conversation; in Kubernetes this would mean selecting the exact StatefulSet by its id.

A really cool feature that Ollama itself could implement would be to save the "chat context" data in a shared database (like MongoDB) and reuse it as context for the next prompt response. Just speculating; I don't know if that's possible.

@Loki321

Loki321 commented Jan 3, 2024

My use case only really requires being able to spin up a machine and connect to it when I know I'm going to need it. It wouldn't really happen mid-conversation.

Related to saving the chat context though, llama.cpp has a --prompt-cache flag that lets you save the prompt cache to a file and load it back in later. I would think it could be leveraged to achieve what you're talking about. Like you say though, it would need to be done in Ollama itself.
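
For reference, the flag in llama.cpp's main example is used roughly like this (model path and prompts are just placeholders):

# First run: evaluate the prompt and save its cache to a file
./main -m models/7B/model.gguf --prompt-cache session.bin -p "You are a helpful assistant." -n 64

# Later run: reuse the cached prefix so it doesn't have to be re-evaluated
./main -m models/7B/model.gguf --prompt-cache session.bin -p "You are a helpful assistant. Summarise the chat so far." -n 64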

@dnviti
Contributor

dnviti commented Jan 4, 2024

It would be possible in Ollama then; the backend only needs to save the prompt cache keyed by the chat id and reuse it whenever a new question is asked in that same chat, of course using a shared data volume.

@SethBurkart123

SethBurkart123 commented Jan 16, 2024

Just for anyone looking over this thread before the feature gets introduced: if you want to run the same models on multiple Ollama instances (basically to shorten queue times if a decent number of people are using your webui instance), you can do load balancing with nginx.

This solution isn't completely useful for some of the people in this thread, but it came in handy for me as I have two GPUs and wanted to make sure that multiple people could generate at once without queueing. If you are using the least_conn; method (not the one described later for @Loki321), both Ollama instances must have the same models; otherwise you can get errors because the webui thinks certain models are available when they aren't.

All you need is an nginx instance running (there are heaps of tutorials for that) and then to put this in your nginx.conf:

http {
    upstream backend_servers {
        least_conn;                     # Enable least connections load balancing

        # Add your Ollama servers here eg. if one of them was http://localhost:11434/api then you would add `server localhost:11434`
        server localhost:11434;          # Ollama server 1
        server localhost:11435;          # Ollama server 2
    }

    server {
        listen 9090;                     # This is the port you would use for the Ollama API URL in the WebUI eg. http://localhost:9090/api

        location / {
            proxy_pass http://backend_servers; # Forward requests to the upstream block
        }
    }
}

@Loki321 if you want to prioritise one server over the other instead of whichever has the smallest queue (i.e. prioritise your GPU server when it's available, otherwise fall back to CPU), you can remove the least_conn; line and modify the servers to:

        server localhost:11434 max_fails=3 fail_timeout=30s;  # Primary GPU server
        server localhost:11435 backup;                      # Backup CPU only server

@justinh-rahb
Collaborator

justinh-rahb commented Jan 17, 2024

@SethBurkart123 fascinating, I didn't think this would actually work, but it seems to so far... I can increase my Mac Studio's throughput by running 3x Ollama "workers" on the same machine; it's got enough CPU and RAM. It's not quite triple the throughput, but it's definitely an improvement for users.
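
In case anyone wants to replicate this, each worker can be bound to its own port with the OLLAMA_HOST environment variable and then fronted by an nginx upstream like the one above; a rough sketch (ports are illustrative):

# Start three Ollama workers on separate ports on the same machine
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
OLLAMA_HOST=127.0.0.1:11436 ollama serve &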

@davidamacey

@SethBurkart123 I followed your instructions, but I am having a bit of trouble getting nginx to forward to the different Ollama servers. Below are my docker compose file and nginx config for deploying across 4 separate GPUs. So far I have:

  1. Changed the port numbers of the Ollama containers in the compose and conf files, in various combinations
  2. Restarted compose after each change

With the setup below, requests only run on the container on port 11434. If I change 11434 to a different Ollama container, the model loads on that respective GPU. I tested with least_conn and with no method specified, which defaults to round_robin.

The load balancer will not send requests to the other Ollama containers. I tested with two browsers on the same machine and with multiple different machines.

It may be something simple I missed; open to suggestions.

Any assistance is greatly appreciated.

# docker-compose.yaml
version: '3.8'

services:

  nginx:
    image: nginx:latest
    container_name: nginx
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - 9090:80
    networks:
      - ollama_net

  ollama-00:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-00
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-01:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-01
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11435:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '1' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-02:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-02
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11436:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '2' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-03:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-03
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11437:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '3' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-webui:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ollama-webui
    volumes:
      - /mnt/md0/ollama_webui/:/app/backend/data
    depends_on:
      - nginx
      - ollama-00
      - ollama-01
      - ollama-02
      - ollama-03
    ports:
      - 3000:8080
    environment:
      - OLLAMA_API_BASE_URL=http://nginx:9090/api
    # extra_hosts:
    #   - host.docker.internal:host-gateway
    restart: unless-stopped
    networks:
      - ollama_net

networks:
  ollama_net:
    driver: bridge

# volumes:
#   ollama: {}
#   ollama-webui: {}

# nginx.conf
worker_processes auto;

events { worker_connections 1024; }

http {
    upstream ollama {
        least_conn;                     # Enable least connections load balancing

        server ollama-00:11434;          # Ollama server 0
        server ollama-01:11435;          # Ollama server 1
        server ollama-02:11436;          # Ollama server 2
        server ollama-03:11437;          # Ollama server 3
    }

    server {
        listen 9090;                     # This is the port you would use for the Ollama API URL in the WebUI eg. http://localhost:9090/api

        location / {
            proxy_pass http://ollama; # Forward requests to the upstream block
        }
    }
}

@justinh-rahb
Collaborator

justinh-rahb commented Jan 24, 2024

@davidamacey here are two possible solutions to address the issue in your Nginx configuration file:

  1. Change all the ports in your nginx.conf file to 11434, while keeping the current hostnames as they are (see the sketch below). This will ensure that Nginx communicates with the containers over their internal Docker network on port 11434. In this case you don't actually need to publish external ports at all, unless something else accesses your Ollama instances directly from outside of Docker.
  2. Change the ollama-##:1143# hostnames in your nginx.conf file to host.docker.internal:1143# instead. This will ensure that Nginx communicates with the containers using their published external ports. This is one way to do it, but I'd recommend the first.

Either of these changes should resolve the issue with your Nginx configuration file.
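
For option 1, the upstream block in the nginx.conf above would look roughly like this (a sketch, not a tested config):

    upstream ollama {
        least_conn;

        # On the shared Docker network every container listens on its internal port 11434
        server ollama-00:11434;
        server ollama-01:11434;
        server ollama-02:11434;
        server ollama-03:11434;
    }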

@davidamacey

@justinh-rahb and @SethBurkart123 Thank you for the prompt response! I tried a few more tests.

  1. I changed all the ports in the config to 11434, but had the same result: sending two simultaneous requests from two browsers resulted in a queue on GPU 0
  2. host.docker.internal was not resolvable in nginx
  3. I tried making separate volumes for each Ollama container, in case there was per-container config. Currently I am using the same volume for all Ollama model files.

I greatly appreciate your quick feedback. I will continue to troubleshoot.

@davidamacey

@justinh-rahb and @SethBurkart123 I appreciate the guidance. Unfortunately, I was not able to get multiple Ollama containers working behind an nginx load balancer. It seems the FastAPI stream request receives a stream of data followed by the final POST once the chat is completed. My conclusion (after consulting with friends) is that nginx is working, but it doesn't handle the API stream the way we expect.
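
If the streamed response is the culprit, one adjustment worth trying (untested here) is to turn off proxy buffering for the Ollama location so nginx passes chunks through as they arrive:

        location / {
            proxy_pass http://ollama;       # the upstream block from the config above
            proxy_http_version 1.1;
            proxy_buffering off;            # stream response chunks instead of buffering them
            proxy_cache off;
            proxy_read_timeout 300s;        # allow long generations
        }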

This drove me to learn about vLLM. vLLM directly supports OpenAI's API format, so I can deploy a local vLLM container with a selected LLM. In the UI I enter an EMPTY key and the URL of the vLLM instance, select the model in chat, and off to the races! This provides faster response times and async requests.

The con of vLLM is that it requires an NVIDIA GPU, so not all users will have this, given the popularity of Apple Silicon M chips, etc.

I am happy to report your application works with a vLLM backend without much effort. There is one potential bug with the System Prompt, but I will open an issue for it.

Below is my Docker compose setup for anyone interested in giving it a try. Note that an Ollama container is still required; otherwise the UI will throw errors about a missing Ollama connection.

version: '3.8'

services:

  ollama-00:
    volumes:
      - /mnt/nas/ollama_webui/ollama:/root/.ollama
    container_name: ollama-00
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  vllm:
    container_name: vllm
    image: vllm/vllm-openai:latest
    pull_policy: always
    volumes:
      - /mnt/nas/hf_vllm_models/:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=<token>
    ports:
      - "8000:8000"
    ipc: host
    # command: ["--model", "mistralai/Mixtral-8x7B-Instruct-v0.1"]
    command: ["--model", "mistralai/Mistral-7B-Instruct-v0.2", "--tensor-parallel-size", "2"]
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '2' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-webui:
    image: ghcr.io/ollama-webui/ollama-webui:main
    container_name: ollama-webui
    pull_policy: always
    volumes:
      - /mnt/nas/ollama_webui/webui:/app/backend/data
    depends_on:
      - vllm
      - ollama-00
    ports:
      - 3000:8080
    environment:
      - OLLAMA_API_BASE_URL=http://ollama-00:11434/api
      - OPENAI_API_BASE_URL=http://vllm:8000/v1
      - OPENAI_API_KEY=EMPTY
    restart: unless-stopped
    networks:
      - ollama_net

networks:
  ollama_net:
    driver: bridge

Happy coding!

@explorigin
Contributor

I recommend that the webui become a frontend to the litellm proxy. There are a lot of things litellm can do that we could be the pretty face for; managing multiple endpoints is one of them. https://docs.litellm.ai/docs/proxy/configs
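
From the linked docs, a proxy config that points the same model name at several Ollama endpoints looks roughly like this (hosts and model names are illustrative); entries sharing a model_name are load-balanced by the proxy:

model_list:
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama-00:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama-01:11434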

@justinh-rahb
Collaborator

I've been beating this drum for a while now; it does seem to me to be the quickest way to bootstrap support for other backends.

@VfBfoerst

VfBfoerst commented Feb 20, 2024

Speaking of litellm, I got it to work with my open-webui and it handles load balancing very well (tested with 2 GPUs and 4 Ollama instances).
The only "problem" appeared after adding authentication to the litellm proxy server. Then the webui couldn't talk to the API anymore, I guess because there is no way to supply the bearer token.
Can you maybe add a bearer token field when adding the external litellm-api URL, e.g. here?:
[screenshot]

It would also be nice to set different bearer tokens per user, so I can track the usage of the litellm API on a per-user basis.

I can also create a new issue for this, if wanted :)

Edit: the corresponding header is e.g. curl http://123.123.123.123/v1/chat/completions -H 'Authorization: Bearer sk-1234'
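
For completeness, the proxy-side setting that enables this auth is roughly the following in the litellm config (a sketch; check the litellm docs for the current syntax):

general_settings:
  master_key: sk-1234   # clients must then send "Authorization: Bearer sk-1234"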

@VfBfoerst


This works in the newest version, thanks; you can set the token in the API key field :}

@christopher-kapic

I know this isn't exactly what you guys are discussing, but I'm also not sure if it's sufficiently different to open a new feature request. If it is, I can do that.

It would be awesome if users could add their own OpenAI endpoints/API keys, or at least if the admin could configure multiple OpenAI endpoints (I think this option would require database changes, and I'm not familiar enough with peewee at the moment to implement it).

One way this could be done for individual users is how TypingMind does it: store the API key and endpoint in localStorage and make a direct request from the browser to the OpenAI endpoint (although this may introduce CORS errors, I think only when working with custom APIs that follow the OpenAI API specs). However, I think the ideal solution would let users store their custom keys/endpoints in the database so the request can be made from the backend, avoiding CORS errors.

Any thoughts on this?

@justinh-rahb
Collaborator

@christopher-kapic, the WebUI initially processed OpenAI requests solely on the browser side, with settings stored in local storage exactly as you say. However, we received several requests about it and decided to change the implementation to proxy through the backend like the Ollama API requests. It appears that supporting both methods might be necessary to cater to all users, but this could become quite intricate.

@tjbck tjbck self-assigned this Mar 2, 2024
This was referenced Mar 5, 2024
@tjbck tjbck linked a pull request Mar 5, 2024 that will close this issue