MultiGPU: not splitting model to multiple GPUs - CUDA out of memory #1341
We have the Docker version installed on a 3-GPU system with nvidia-container-toolkit, and while the CLI doesn't work, we can use the API interface. I see ollama-runner spawned on all three GPUs.
And does it split models across different GPUs depending on VRAM? That doesn't work for me.
It seems to me that the tasks are divided among the GPUs, though I could not find any documentation to support this. This is for model:
Division of memory seems to be asymmetric, probably for good reasons.
No docker-compose:
$ docker run --gpus=all -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
$ curl http://localhost:11434/api/generate -d '{"model": "llama2","prompt": "Who was the most famous person of all time?","stream":false}'
(nvidia-smi output from Mon Dec 4 11:53:13 2023 was attached here; the table is truncated.)
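As a side note, one way to confirm that all GPUs are actually visible to the container before suspecting the splitting logic is to run nvidia-smi inside it; a minimal sketch, assuming the container is named ollama as in the run command above:

```shell
# List the GPUs the ollama container can see (should show all three).
docker exec -it ollama nvidia-smi

# Optionally watch per-GPU VRAM usage while a request is being processed.
docker exec -it ollama nvidia-smi \
  --query-gpu=index,name,memory.used,memory.total --format=csv -l 2
```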
For what it's worth, I have similar system specs to yours, and I am getting the same error log messages: out-of-memory errors and also:
I am not running the Docker version of ollama; I downloaded the binary to my ~/bin directory and run it from there (ollama version 0.1.13). ollama was working fine until I put in the 2nd GPU.
We've been making improvements to our memory prediction algorithm, but it still isn't perfect yet. In general, there's a chunk of memory that gets allocated on the first GPU, then the remainder is spread evenly across the GPUs. In the next release (0.1.29) we'll be adding a new setting that lets you specify a lower VRAM limit to work around this type of crash until we get the prediction logic fixed.
Can you make the parameter parse units? For example:
@insunaa this variable is only meant to be a temporary workaround until we get the memory prediction fixed so OOM crashes no longer happen.
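For later readers, a hedged sketch of how such a VRAM cap could be passed to the Docker container once the setting ships; the variable name OLLAMA_MAX_VRAM and the byte-denominated value are assumptions based on this thread, not confirmed documentation, so check the release notes for your version:

```shell
# Hypothetical workaround: tell ollama to assume less VRAM per GPU than it detects.
# The variable name and the unit (bytes, roughly 7 GiB here) are assumptions for illustration.
docker run --gpus=all -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  -e OLLAMA_MAX_VRAM=7516192768 \
  --name ollama ollama/ollama
```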
The latest release 0.1.33 further refines our handling of multi-GPU setups, and our memory prediction algorithms. Please give it a try and let us know if you're still seeing problems.
This should be fixed now; however, we are still working on multi-GPU memory allocation, so please do share any issues you're hitting!
Trying to load a model (deepseek-coder) onto 2 GPUs fails with an OOM error.
The setup:
Linux: Ubuntu 22.04
HW: i5-7400 (AVX, AVX2), 32 GB
GPU: 4 x 3070 8 GB
ollama: 0.1.12, running in Docker
nvidia-smi from within the container shows 2 x 3070.
Because of the big context size, I want to load the model onto 2 GPUs, but it never uses the second one and fails after reaching OOM on the first GPU.
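For context, the request looks roughly like the sketch below; the prompt and the option values are placeholders rather than the reporter's actual settings, and in ollama the num_gpu option counts layers to offload rather than the number of GPUs (see the HINT further down):

```shell
# Hypothetical request with a large context window and explicit layer offload.
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder",
  "prompt": "Write a quicksort in Python.",
  "stream": false,
  "options": { "num_ctx": 16384, "num_gpu": 33 }
}'
```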
modelfile:
AVX:
It does not recognize/report AVX2, as you can see in the log.
HINT:
Your parameter num_gpus, describing "layers to offload", is most misleading: at all other loaders (fastchat, ooba's, vllm, etc.) a parameter named like that describes the number of GPUs to use. IMHO, parameter names like the following would be more telling: CUDA_VISIBLE_DEVICES.
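To make the distinction concrete, a hedged sketch of the two knobs; the device IDs and the Modelfile fragment are illustrative only, not the reporter's actual configuration:

```shell
# WHICH GPUs the container may use: standard NVIDIA/CUDA mechanism, not an ollama option.
docker run --gpus=all -e CUDA_VISIBLE_DEVICES=0,1 -d \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# HOW MANY LAYERS get offloaded: ollama's num_gpu, e.g. as a Modelfile parameter
# (illustrative fragment only):
#   FROM deepseek-coder
#   PARAMETER num_gpu 33
```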
Here is the log of the failure.
That's the part where it OOM-errors on GPU0 and starts loading to the CPU.