Support loading multiple models at the same time #2109

Closed
Picaso2 opened this issue Jan 20, 2024 · 18 comments · Fixed by #3418

@Picaso2

Picaso2 commented Jan 20, 2024

Is it possible to create one model from multiple models, or even load multiple models at the same time?

@Dbone29

Dbone29 commented Jan 22, 2024

You can merge two models with another tool. https://huggingface.co/Undi95 does this with some of his models. After that you can create a GGUF file of the merged model and use it in Ollama whenever you want. Ollama on its own isn't able to combine two models.

@cmndcntrlcyber

You can merge two models with another tool. https://huggingface.co/Undi95 does this with some of his models. After that you can create a GGUF file of the merged model and use it in Ollama whenever you want. Ollama on its own isn't able to combine two models.

Do you happen to have a link or name of the tool?

@Dbone29

Dbone29 commented Mar 2, 2024

There are many tools for this task, but unfortunately, I am not familiar enough to say which one is the best or what the differences between them are. However, here's an example of a tool that I came across last year:

https://github.com/arcee-ai/mergekit
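
For context, a typical workflow with a tool like mergekit looks roughly like the sketch below. This is only a sketch: the exact CLI entry point, the llama.cpp conversion script name, and the file paths are assumptions that vary by version.

# merge two models according to a YAML config (config file and output path are placeholders)
pip install mergekit
mergekit-yaml merge-config.yml ./merged-model

# convert the merged Hugging Face model to GGUF with llama.cpp (script name depends on the llama.cpp version)
python convert_hf_to_gguf.py ./merged-model --outfile merged.gguf

# import the GGUF into Ollama via a minimal Modelfile whose FROM line points at merged.gguf
ollama create my-merged-model -f Modelfile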

@pdevine
Contributor

pdevine commented Mar 11, 2024

@Picaso2 Other than the multimodal models, we don't yet support loading multiple models into memory simultaneously. What use case are you trying to address?

@mofanke
Contributor

mofanke commented Mar 12, 2024

@Picaso2 Other than the multimodal models, we don't yet support loading multiple models into memory simultaneously. What use case are you trying to address?

I encountered a similar requirement: I want to implement a RAG (Retrieval-Augmented Generation) system, which requires using both an embedding model and a chat model. Currently, the implementation with Ollama requires constantly switching between the two models, which slows down the process. It would be much more efficient if there were a way to use them simultaneously.
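
To illustrate the switching cost: a minimal RAG loop against a single Ollama instance alternates between the embeddings and generate endpoints, so the server has to swap models on every round trip (the model names below are just examples):

# embed the user query (loads the embedding model)
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "How do I reset my password?"}'

# ...look up the matching documents in the vector store, then generate an answer
# (this call evicts the embedding model and loads the chat model)
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Answer using the retrieved context: ..."}'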

@Picaso2
Author

Picaso2 commented Mar 12, 2024 via email

@Fruetel

Fruetel commented Mar 16, 2024

@Picaso2 Other than the multimodal models, we don't yet support loading multiple models into memory simultaneously. What use case are you trying to address?

I also have a use case for this.

I'm using Crew.ai with Ollama. I have agents which need to use tools, such as search or document retrieval, and then there are agents that work on the data provided by the tool-using agents. For the tool-using agents I use Hermes-2-Pro-Mistral, which is optimized for tool usage but, at 7 billion parameters, not that smart. It would be awesome to be able to load a smarter Mixtral model for the thinking agents in parallel with the Hermes model for the tool-using ones.

@dizzyriver

dizzyriver commented Mar 16, 2024

Same. I'd like LLaVA for image-to-text and Mixtral for language reasoning.

@alfi4000

Same. I'd like LLaVA for image-to-text and Mixtral for language reasoning.

same

@alfi4000

Would it be possible to run several models at once, one on the GPU and the others on the CPU and RAM? I want to be able to run several models at the same time, so that if one of my family members is using Ollama through Open WebUI while I am too, one model runs on the CPU and the other on the GPU!

@oldmanjk

I have a rig with three graphics cards that I would like to run three separate models on simultaneously and have them group chat

@alfi4000

alfi4000 commented Mar 23, 2024

I have a rig with three graphics cards that I would like to run three separate models on simultaneously and have them group chat

Try running this, and either change the IP address 127.0.0.1 to your rig's IP address or leave it as it is. Then add each IP address and port as a separate connection in your frontend. For example, I added all three to Open WebUI, and it handles one connection per chat window, so it should be able to drive all three graphics cards, but only if you run it the way shown here:

Linux (tested on Ubuntu):
OLLAMA_HOST=127.0.0.1:11435 ollama serve & OLLAMA_HOST=127.0.0.1:11436 ollama serve & OLLAMA_HOST=127.0.0.1:11437 ollama serve

This is just an example; you can use different ports, but each connection should serve only one GPU and one LLM, not several, otherwise the requests will be processed one after another (first, then second, then third)!

To stop the instances
Linux (tested on Ubuntu)
Command: pgrep ollama
Output:
1828
2883
1284
Command: kill 1284 & kill 1828 & kill 2883
If that doesn't work, try killing each process manually:

command: kill 1828
command: kill 2883
command: kill 1284

@oldmanjk
Copy link

oldmanjk commented Mar 30, 2024

That's what I'm currently doing (loosely), but you also have to map each instance to a specific GPU. It works, but it's very clunky to set up. A GUI would be nice.
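
For reference, one way to do that mapping on NVIDIA hardware is to set CUDA_VISIBLE_DEVICES per instance. A rough sketch, assuming three GPUs and the ports from the earlier example:

# pin each Ollama instance to one GPU and one port
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11436 ollama serve &
CUDA_VISIBLE_DEVICES=2 OLLAMA_HOST=127.0.0.1:11437 ollama serve &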

@leporel

leporel commented Apr 6, 2024

Run it in Docker and pin separate containers to GPU 1, GPU 2, or CPU only; Open WebUI can work with multiple Ollama instances:

version: '3.8'

services:
  ollama:
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\data
        target: /root/.ollama
      - type: bind
        source: C:\MyPrograms\ollama\models
        target: /models
    container_name: ollama
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      # host port for the GPU instance (kept separate from the Open WebUI port variable)
      - ${OLLAMA_PORT-11434}:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities:
                - gpu

  ollama-cpu:
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\data
        target: /root/.ollama
      - type: bind
        source: C:\MyPrograms\ollama\models
        target: /models
    container_name: ollama-cpu
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - ${OPEN_WEBUI_PORT2-11435}:11434
                
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\web
        target: /app/backend/data
    depends_on:
      - ollama
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      # inside the compose network the ollama container always listens on 11434
      - 'OLLAMA_BASE_URL=http://ollama:11434'
      - 'WEBUI_SECRET_KEY='
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped
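
Once both containers are up, each instance can be checked on its own host port (assuming the default port mappings above):

# GPU instance
curl http://localhost:11434/api/tags
# CPU-only instance
curl http://localhost:11435/api/tags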

@oldmanjk

oldmanjk commented Apr 6, 2024

Run it in Docker and pin separate containers to GPU 1, GPU 2, or CPU only; Open WebUI can work with multiple Ollama instances: [docker-compose example above]
No offense, but that's even clunkier. You don't need to use Docker in the first place.

@oldmanjk

Can we have control over which model is run on which GPU?

@dhiltgen
Collaborator

Can we have control over which model is run on which GPU?

This is something we can look at adding incrementally as this feature matures. Feel free to file a new issue and capture how you'd like it to work.

@dougy83

dougy83 commented Jun 9, 2024

Can we have control over which model is run on which GPU?

You can create a new CPU-only model (e.g. for the phi3 model) with ollama show phi3 --modelfile > phi3-cpuonly.modelfile, editing that file to add PARAMETER num_gpu 0 and updating the FROM line as the file describes, and then running ollama create phi3-cpuonly -f phi3-cpuonly.modelfile.
You then just reference phi3-cpuonly, and it loads into system RAM. You can call the file and model whatever you want.
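
Spelled out step by step, the sequence looks roughly like this (num_gpu 0 disables GPU offload; the model and file names are just examples):

# dump the existing Modelfile
ollama show phi3 --modelfile > phi3-cpuonly.modelfile

# edit phi3-cpuonly.modelfile: fix the FROM line as the generated comments suggest,
# and add the line:  PARAMETER num_gpu 0

# build the CPU-only variant; it can then run alongside a GPU-resident model
ollama create phi3-cpuonly -f phi3-cpuonly.modelfile
ollama run phi3-cpuonly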
