Support loading multiple models at the same time #2109

Closed
Picaso2 opened this issue Jan 20, 2024 · 18 comments · Fixed by #3418

@Picaso2

Picaso2 commented Jan 20, 2024

Is it possible to create one model from multiple models, or even load multiple models at the same time?

@Dbone29

Dbone29 commented Jan 22, 2024

You can merge two models with another tool. https://huggingface.co/Undi95 does this with some of his models. After that you can create a GGUF file of the merged model and use it in Ollama whenever you want. Ollama on its own isn't able to combine two models.

@cmndcntrlcyber

You can merge two models with another tool. https://huggingface.co/Undi95 does this with some of his models. After that you can create a GGUF file of the merged model and use it in Ollama whenever you want. Ollama on its own isn't able to combine two models.

Do you happen to have a link or name of the tool?

@Dbone29

Dbone29 commented Mar 2, 2024

There are many tools for this task, but unfortunately, I am not familiar enough to say which one is the best or what the differences between them are. However, here's an example of a tool that I came across last year:

https://github.com/arcee-ai/mergekit
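
For context, a typical workflow with a tool like mergekit looks roughly like the sketch below. This is only a sketch: the exact CLI entry point, the llama.cpp conversion script name, and the file paths are assumptions that vary by version.

# merge two models according to a YAML config (config file and output path are placeholders)
pip install mergekit
mergekit-yaml merge-config.yml ./merged-model

# convert the merged Hugging Face model to GGUF with llama.cpp (script name depends on the llama.cpp version)
python convert_hf_to_gguf.py ./merged-model --outfile merged.gguf

# import the GGUF into Ollama via a minimal Modelfile whose FROM line points at merged.gguf
ollama create my-merged-model -f Modelfile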

@pdevine
Contributor

pdevine commented Mar 11, 2024

@Picaso2 Other than the multimodal models, we don't yet support loading multiple models into memory simultaneously. What use case are you trying to address?

@mofanke
Contributor

mofanke commented Mar 12, 2024

@Picaso2 Other than the multimodal models, we don't yet support loading multiple models into memory simultaneously. What use case are you trying to address?

I encountered a similar requirement: I want to implement a RAG (Retrieval-Augmented Generation) system, which requires using both an embedding model and a chat model. Currently, the implementation with Ollama requires constantly switching between the two models, which slows down the process. It would be much more efficient if there were a way to use them simultaneously.
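
To illustrate the switching cost: a minimal RAG loop against a single Ollama instance alternates between the embeddings and generate endpoints, so the server has to swap models on every round trip (the model names below are just examples):

# embed the user query (loads the embedding model)
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "How do I reset my password?"}'

# ...look up the matching documents in the vector store, then generate an answer
# (this call evicts the embedding model and loads the chat model)
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Answer using the retrieved context: ..."}'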

@Picaso2
Author

Picaso2 commented Mar 12, 2024 via email

@Fruetel

Fruetel commented Mar 16, 2024

@Picaso2 Other than the multimodal models, we don't yet support loading multiple models into memory simultaneously. What use case are you trying to address?

I also have a use case for this.

I'm using Crew.ai with Ollama. I have agents which need to use tools, such as search or document retrieval, and then there are agents that work on the data provided by the tool-using agents. For the tool-using agents I use Hermes-2-Pro-Mistral, which is optimized for tool usage but, at 7 billion parameters, not that smart. It would be awesome to be able to load a smarter Mixtral model for the thinking agents in parallel with the Hermes model for the tool-using ones.

@dizzyriver

dizzyriver commented Mar 16, 2024

Same. I'd like LLaVA for image-to-text and Mixtral for language reasoning.

@alfi4000

Same. I'd like LLaVA for image-to-text and Mixtral for language reasoning.

same

@alfi4000

Would it be possible to run several models at once, one on the GPU and the others on the CPU and RAM? I want to be able to run several models at the same time, so that if one of my family members is using Ollama through Open WebUI while I am too, one model runs on the CPU and the other on the GPU!

@oldmanjk

I have a rig with three graphics cards that I would like to run three separate models on simultaneously and have them group chat

@alfi4000

alfi4000 commented Mar 23, 2024

I have a rig with three graphics cards that I would like to run three separate models on simultaneously and have them group chat

Try running this, and either change the IP address 127.0.0.1 to your rig's IP address or leave it as it is. Then add each IP address and port as a separate connection in your frontend. For example, I added all three to Open WebUI, and it handles one connection per chat window, so it should be able to drive all three graphics cards, but only if you run it the way shown here:

Linux (tested on Ubuntu):
OLLAMA_HOST=127.0.0.1:11435 ollama serve & OLLAMA_HOST=127.0.0.1:11436 ollama serve & OLLAMA_HOST=127.0.0.1:11437 ollama serve

This is just an example; you can use different ports, but each connection should serve only one GPU and one LLM, not several, otherwise the requests will be processed one after another (first, then second, then third)!

To stop the instances
Linux (tested on Ubuntu)
Command: pgrep ollama
Output:
1828
2883
1284
Command: kill 1284 & kill 1828 & kill 2883
If that doesn't work, try killing each process manually:

command: kill 1828
command: kill 2883
command: kill 1284

@oldmanjk
Copy link

oldmanjk commented Mar 30, 2024

That's what I'm currently doing (loosely), but you also have to map each instance to a specific GPU. It works, but it's very clunky to set up. A GUI would be nice.
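
For reference, one way to do that mapping on NVIDIA hardware is to set CUDA_VISIBLE_DEVICES per instance. A rough sketch, assuming three GPUs and the ports from the earlier example:

# pin each Ollama instance to one GPU and one port
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11436 ollama serve &
CUDA_VISIBLE_DEVICES=2 OLLAMA_HOST=127.0.0.1:11437 ollama serve &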

@leporel

leporel commented Apr 6, 2024

Run it in Docker and pin separate containers to GPU 1, GPU 2, or CPU only; Open WebUI can work with multiple Ollama instances:

version: '3.8'

services:
  ollama:
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\data
        target: /root/.ollama
      - type: bind
        source: C:\MyPrograms\ollama\models
        target: /models
    container_name: ollama
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      # host port for the GPU instance (kept separate from the Open WebUI port variable)
      - ${OLLAMA_PORT-11434}:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities:
                - gpu

  ollama-cpu:
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\data
        target: /root/.ollama
      - type: bind
        source: C:\MyPrograms\ollama\models
        target: /models
    container_name: ollama-cpu
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - ${OPEN_WEBUI_PORT2-11435}:11434
                
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\web
        target: /app/backend/data
    depends_on:
      - ollama
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      # inside the compose network the ollama container always listens on 11434
      - 'OLLAMA_BASE_URL=http://ollama:11434'
      - 'WEBUI_SECRET_KEY='
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped
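
Once both containers are up, each instance can be checked on its own host port (assuming the default port mappings above):

# GPU instance
curl http://localhost:11434/api/tags
# CPU-only instance
curl http://localhost:11435/api/tags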

@oldmanjk

oldmanjk commented Apr 6, 2024

Run it in Docker and pin separate containers to GPU 1, GPU 2, or CPU only; Open WebUI can work with multiple Ollama instances: [docker-compose example above]
No offense, but that's even clunkier. You don't need to use Docker in the first place.

@oldmanjk

Can we have control over which model is run on which GPU?

@dhiltgen
Collaborator

Can we have control over which model is run on which GPU?

This is something we can look at adding incrementally as this feature matures. Feel free to file a new issue and capture how you'd like it to work.

@dougy83

dougy83 commented Jun 9, 2024

Can we have control over which model is run on which GPU?

You can create a new CPU-only model (e.g. for the phi3 model) with ollama show phi3 --modelfile > phi3-cpuonly.modelfile, editing that file to add PARAMETER num_gpu 0 and updating the FROM line as the file describes, and then running ollama create phi3-cpuonly -f phi3-cpuonly.modelfile.
You then just reference phi3-cpuonly, and it loads into system RAM. You can call the file and model whatever you want.
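
Spelled out step by step, the sequence looks roughly like this (num_gpu 0 disables GPU offload; the model and file names are just examples):

# dump the existing Modelfile
ollama show phi3 --modelfile > phi3-cpuonly.modelfile

# edit phi3-cpuonly.modelfile: fix the FROM line as the generated comments suggest,
# and add the line:  PARAMETER num_gpu 0

# build the CPU-only variant; it can then run alongside a GPU-resident model
ollama create phi3-cpuonly -f phi3-cpuonly.modelfile
ollama run phi3-cpuonly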
