How to run Ollama only on a dedicated GPU? (Instead of all GPUs) #1813
Comments
You could give me the other two :-)
Could it be that the number of GPUs used by Ollama is related to the model?
That's just the number of layers. I don't think there's a way to control GPU affinity, but I would also like to do this. Another issue for me is that it automatically splits a model between 2 GPUs even though it would fit on a single GPU (which would be faster), so I would like to just make it use the one with the bigger VRAM.
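For reference, the layer count mentioned here is exposed through the request options; a minimal sketch, with a purely illustrative model name and layer value (not taken from this thread):

```bash
# Ask Ollama to offload a specific number of layers to the GPU.
# "num_gpu" is the layer-offload count in the request options.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 32 }
}'
```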
I tried a bit of research - it seems the relevant llama.cpp options are `main_gpu` and `tensor_split`.
Checking the https://github.com/jmorganca/ollama/blob/main/docs/api.md docs, we should be able to pass main_gpu to the API, so I tried setting main_gpu to 1.
This didn't seem to work: the same memory split took place rather than it using only the second GPU. Maybe the option is not yet passed on to llama from ollama. I had a look at the ollama code, but I'm not familiar with Go, so I'm not sure.
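For anyone wanting to reproduce that attempt, the request would look roughly like this (the model name and prompt are illustrative; as noted above, the option may simply be ignored by the server):

```bash
# Try to pin inference to the second GPU via the "main_gpu" option.
# As described above, this did not actually change the device split.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Hello",
  "options": { "main_gpu": 1 }
}'
```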
Thx tarbard... I will check it.
If you're running in three separate containers via Docker, you can start up each container to only be "aware" of one GPU: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html

```bash
docker run --gpus '"device=1,2"' \
    nvidia/cuda nvidia-smi --query-gpu=uuid --format=csv
```
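Applied to Ollama's own image, that idea would look something like this (container names, volumes, ports, and device indices are assumptions for illustration):

```bash
# Container 1: only sees GPU 0, serves on the default port
docker run -d --gpus '"device=0"' -v ollama0:/root/.ollama \
    -p 11434:11434 --name ollama-gpu0 ollama/ollama

# Container 2: only sees GPU 1, serves on a different host port
docker run -d --gpus '"device=1"' -v ollama1:/root/.ollama \
    -p 11435:11434 --name ollama-gpu1 ollama/ollama
```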
@houstonhaynes... I had the same idea, but it doesn't work for me. Ollama, running inside Docker, takes all GPUs no matter how I use the Docker parameter "--gpus" (I also tried the ID of a GPU). My solution now is to split/distribute the 3090s across different PCs. To my surprise, even with very old PC hardware, Ollama runs fast!
That is wild - I guess I "trust the manual" too much! I have two machines with an RTX 3050 in each and haven't moved one over to have two on one machine. I was just doing some spelunking for GPU-driven inference with postgresml and spotted that "deep" info from NVIDIA along the way. I thought it would be useful when I upgrade. I'm sorry it's not more helpful, but maybe the controls "under the hood" suggested above will give you the right lever(s). I'd love to know how that turns out in case it comes calling after I put a bunch of cards in a GPU chassis! 😸
BTW you can use CUDA_VISIBLE_DEVICES. Unfortunately, the name of the environment variable is kind of a lie. It appears the other GPUs are still visible, just not accessible, so when ...
Same challenge here. Would really be awesome if either ...

Will check out if ... Damn! What I do get since activating {'main_gpu': 1} though ... is a log output when a model is loaded saying ...

With my current solution I spin up another instance of ollama serve on a separate port, and whenever I know a model fits on one GPU I connect to this port on my local machine. Thx for the ...
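A minimal sketch of that kind of workaround, assuming a bare-metal install and purely illustrative port/GPU numbers:

```bash
# Second Ollama instance pinned to GPU 1, listening on its own port
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve

# Point the client at that instance when the model fits on a single GPU
OLLAMA_HOST=127.0.0.1:11435 ollama run mistral
```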
~~Damn, I was not hoping for this outcome. Has anyone figured out how to restrict it to just one?~~ nvm, using CUDA_VISIBLE_DEVICES seems to have done the trick
Why is this still unsupported? I'm running LM Studio to dedicate a GPU using a tensor split of 0,35 so I can fully offload Mistral with 32k context to a 3060. I hope a tensor split option comes to the Ollama Modelfile.
CUDA_VISIBLE_DEVICES should work. We do have a defect related to memory prediction calculations in this case, tracked via #1514. If you're seeing it load onto unexpected GPUs when this variable is set, please share the server log and some more details about the setup and I'll re-open.
As you can see in the above image, I have 3 GPUs: 2x RTX A6000 and 1x 3070. I use the A6000s for bigger models through Ollama, and I want to reserve the smaller GPU for embedding models. However, this is what happens when I start the server using the systemd config below:
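(The exact unit file isn't reproduced in this thread; a typical drop-in of this kind, with illustrative device indices, would look roughly like this:)

```bash
# Create a drop-in that limits the Ollama service to specific GPUs
# (device indices here are illustrative, not the original poster's values)
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,2"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```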
Restart Ollama, and use, say, dolphin-mixtral:8x7b-v2.7-q8_0 (a model that will occupy more GPU memory than I have on any one GPU), and it distributes it over devices 0 and 1 instead of 0 and 2. I can wholly confirm I did a systemctl daemon-reload. So it doesn't seem as though CUDA_VISIBLE_DEVICES is working as intended. For completeness, here's the output of nvidia-smi:
Any help would be appreciated. @dhiltgen
@jeremytregunna it sounds like there might be an ordering/enumeration bug where we're not consistent with other tools. If I had to guess, I'd speculate some tools/libraries sort by PCI bus/slot, and others by capability/performance. Can you enable OLLAMA_DEBUG=1 and start up the server? Also try CUDA_VISIBLE_DEVICES=0,1 and 1,2 to see which devices actually get used.
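For reference, that would be something along these lines (device indices purely illustrative):

```bash
# Start the server with debug logging and an explicit device list
OLLAMA_DEBUG=1 CUDA_VISIBLE_DEVICES=0,1 ollama serve
```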
Hrmm... I've run it with debug logs on a few times, and the ordering never seems to change; it always reports the output below:
I verified they're the same devices by looking at the serial numbers. I also tried what you said, using CUDA_VISIBLE_DEVICES=0,1 and 1,2, with no luck. The whole log is preserved below; note this is with ...
@jeremytregunna looking back at that screenshot you posted above, I think the problem may be a result of how you have your cards plugged into your PCI slots. I believe you have one of the A6000s and the 3070 in the PCIe gen 4 x16 slots, but the other A6000 is in an older/slower PCIe gen 1 x16 slot. If you put both of the A6000s into the gen 4 slots and the 3070 into the gen 1 slot, perhaps things will be selected properly.
Nope, that's not it, but you are correct in one respect. The second A6000, since it's not being used, is currently at PCIe gen 1 speeds, but if I select it specifically in some other torch code, it bumps up to PCIe 4 x16 speeds. nvtop right now reports all 3 cards at PCIe gen 1 speeds because nothing is loaded. I can assure you, they're all plugged into gen 4 x16 slots.
Can you try setting CUDA_VISIBLE_DEVICES to the GPU UUIDs instead of numeric indexes? Use nvidia-smi -L to list the UUIDs. Hopefully some combination of these will get things aligned.
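Concretely, that would look something like this (the UUID shown is a placeholder, not a real device):

```bash
# List GPUs with their UUIDs
nvidia-smi -L
# e.g. GPU 0: NVIDIA RTX A6000 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

# Pin Ollama to a specific card by UUID rather than by index
CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
```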
Ok, this had an interesting effect. Loading dolphin-mixtral:8x7b-v2.7-q8_0 again, it splits 50%/50% on the A6000s now with ...
@dhiltgen So I tried with the explicit UUIDs with ...
@dhiltgen Thank you, CUDA_VISIBLE_DEVICES works. Finally.
mark
It can also be specified like this: ...
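(The commenter's original snippet isn't preserved here; one common alternative form, shown purely as an assumption, is exporting the variable for the shell session rather than setting it inline:)

```bash
# Exported for the whole shell session (illustrative only;
# not the snippet the commenter originally posted)
export CUDA_VISIBLE_DEVICES=1
ollama serve
```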
Damn. CUDA_VISIBLE_DEVICES is fine for me. Thank you.
@jeremytregunna
Automate/Easy GPU Selection for Ollama

Hi everyone, I wanted to share a handy script I created for automating GPU selection when running Ollama. You can find the script here. This script allows you to specify which GPU(s) Ollama should utilize, making it easier to manage resources and optimize performance. How to use: ...
Additionally, I've included aliases in the gist for easier switching between GPU selections. Feel free to customize these aliases to suit your preferences. If you encounter any issues or have suggestions for improvement, please let me know! I hope this script helps streamline your Ollama workflow. Happy coding!
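(The linked gist isn't reproduced in this thread; a hypothetical minimal version of the same idea, with an assumed name and behavior rather than the author's actual script, could look like this:)

```bash
#!/usr/bin/env bash
# ollama-gpu: run an Ollama command restricted to the GPUs given as the
# first argument, e.g. `ollama-gpu 0 run mistral` or `ollama-gpu 1,2 serve`.
# (Hypothetical sketch -- not the gist linked above.)
set -euo pipefail

gpus="$1"; shift
CUDA_VISIBLE_DEVICES="$gpus" exec ollama "$@"
```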
Thank you, I can run this successfully.
Hi,
I have 3x 3090s and I want to run each Ollama instance only on a dedicated GPU. The reason for this: to have 3x Ollama instances (with different ports) for use with Autogen.
I also tried "Docker Ollama" without luck.
Or is there another solution?
Let me know...
Thanks in advance
Steve