Support loading multiple models at the same time #2109
Comments
You can merge two models with another tool. https://huggingface.co/Undi95 does it with some models. After that you can make a GGUF file of the merged model and use it in Ollama whenever you want. Ollama on its own isn't able to combine two models. |
Do you happen to have a link or name of the tool? |
There are many tools for this task, but unfortunately, I am not familiar enough to say which one is the best or what the differences between them are. However, here's an example of a tool that I came across last year: |
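As one illustration (not necessarily the tool linked above), mergekit (https://github.com/arcee-ai/mergekit) is a commonly used merging tool, and llama.cpp can convert the result to GGUF for Ollama. The sketch below assumes both are installed; the model names, weights, and conversion script name are placeholders and vary by setup:

```bash
# Rough sketch, assuming mergekit and llama.cpp are installed.
# Model names and weights are placeholders, not a recommendation.
cat > merge-config.yml <<'EOF'
merge_method: linear
dtype: float16
models:
  - model: mistralai/Mistral-7B-Instruct-v0.2
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.5
EOF

# Produce the merged Hugging Face checkpoint
mergekit-yaml merge-config.yml ./merged-model

# Convert to GGUF (the script name varies between llama.cpp versions)
python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile merged.gguf

# Import the GGUF into Ollama via a Modelfile
printf 'FROM ./merged.gguf\n' > Modelfile
ollama create my-merged-model -f Modelfile
ollama run my-merged-model
```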
@Picaso2 other than the multimodal models we don't yet support loading multiple models into memory simultaneously. What is the use case you're trying to do? |
I encountered a similar requirement, and I want to implement a RAG (Retrieval-Augmented Generation) system. It requires using both an embedding model and a chat model separately. Currently, the implementation with Ollama requires constantly switching between models, which slows down the process. It would be much more efficient if there was a way to use them simultaneously. |
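To illustrate the switching cost, here is a minimal sketch of one RAG turn against a single Ollama instance using its REST API; the model names are placeholders, and because only one model can be resident, each request may force the other model to be unloaded and reloaded:

```bash
# Minimal sketch of one RAG turn against a single Ollama instance.
# 'nomic-embed-text' and 'llama2' are placeholder model names.

# 1) Embed the user query (loads the embedding model)
curl -s http://127.0.0.1:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "How do I rotate my API keys?"
}'

# ... look up the nearest documents in your vector store here ...

# 2) Answer with the chat model (may evict the embedding model first)
curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    {"role": "system", "content": "Answer using the retrieved context."},
    {"role": "user",   "content": "How do I rotate my API keys?"}
  ],
  "stream": false
}'
```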
Ultimately I would like to have a system that I can have a conversation with on various topics, from science to politics to math. |
I also have a use case for this. I'm using Crew.ai with Ollama. I have agents that need to use tools, such as search or document retrieval, and then there are agents that work on data provided by the tool-using agents. For the tool-using agents I use Hermes-2-Pro-Mistral, which is optimized for tool usage but, at 7 billion parameters, not that smart. It would be awesome to be able to load a smarter Mixtral model for the thinking agents in parallel with Hermes for the tool-using ones. |
Same. I'd like llava for image to text and mixtral for language reasoning |
same |
Would it be possible to run several models at once, one on the GPU and the others on CPU and RAM? If one of my family members is using Ollama through Open WebUI and I use it at the same time, it would be good if one model ran on the CPU and the other on the GPU! |
I have a rig with three graphics cards that I would like to run three separate models on simultaneously and have them group chat |
Try running this, editing the IP address 127.0.0.1 to your rig's IP address (or just leave it as it is), and then add each IP address with its port to your instances. For example, I used Open WebUI, added those three, and it managed one IP-address connection per chat window, so it should handle all three graphics cards, but only if you run it like I have. Linux (I tested on Ubuntu); this is just an example, and you can use different ports, but each connection gets only one GPU and one LLM, not several, otherwise it will finish the first, then the second, and then the third. To stop it: kill 1828 |
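A rough sketch of that kind of setup, assuming NVIDIA GPUs: one `ollama serve` per card, each bound to its own port with OLLAMA_HOST and pinned to a device with CUDA_VISIBLE_DEVICES. The ports and device indices below are illustrative:

```bash
# One Ollama instance per GPU, each on its own port.
# Assumes NVIDIA GPUs; ports and device indices are illustrative.
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
CUDA_VISIBLE_DEVICES=2 OLLAMA_HOST=127.0.0.1:11436 ollama serve &

# Point each Open WebUI connection (or other client) at a different port,
# e.g. http://<rig-ip>:11434, :11435, :11436.

# Stop them again by PID, e.g.:
# kill <pid-of-each-ollama-serve>
```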
That's what I'm currently doing (loosely), but you also have to map each instance to a specific GPU. It works, but it's very clunky to set up. A GUI would be nice |
Run it in Docker, pin the containers separately to gpu1, gpu2, or CPU only; Open WebUI can work with multiple Ollama instances. |
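A hedged sketch of that Docker variant, using the official ollama/ollama image and Docker's `--gpus` device selection; the container names, host ports, and device indices are only examples:

```bash
# One container per GPU, plus a CPU-only one; names and ports are examples.
docker run -d --gpus '"device=0"' -p 11434:11434 -v ollama0:/root/.ollama --name ollama-gpu0 ollama/ollama
docker run -d --gpus '"device=1"' -p 11435:11434 -v ollama1:/root/.ollama --name ollama-gpu1 ollama/ollama
docker run -d                     -p 11436:11434 -v ollama2:/root/.ollama --name ollama-cpu  ollama/ollama

# Then register http://<host>:11434, :11435, and :11436 as separate
# Ollama connections in Open WebUI.
```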
No offense, but that's even clunkier. You shouldn't need Docker for this anyway |
Can we have control over which model is run on which GPU? |
This is something we can look at adding incrementally as this feature matures. Feel free to file a new issue and capture how you'd like it to work. |
You can create a new CPU-only model name using the following (e.g. for the phi3 model):
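A sketch of one way this can look, assuming the `num_gpu` parameter (the number of layers offloaded to the GPU; 0 keeps the model entirely on the CPU). The `phi3-cpu` tag is just an example name:

```bash
# Sketch: derive a CPU-only variant of phi3 by disabling GPU offload.
cat > Modelfile.cpu <<'EOF'
FROM phi3
PARAMETER num_gpu 0
EOF

ollama create phi3-cpu -f Modelfile.cpu
ollama run phi3-cpu    # runs entirely on CPU/RAM
```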
Is it possible to create one model from multiple models? Or even load multiple models?