MultiGPU: not splitting model to multiple GPUs - CUDA out of memory #1341

Closed
chymian opened this issue Dec 1, 2023 · 10 comments
Labels: bug (Something isn't working) · nvidia (Issues relating to Nvidia GPUs and CUDA)

Comments

@chymian

chymian commented Dec 1, 2023

Trying to load a model (deepseek-coder) across 2 GPUs fails with an OOM error.

The setup:
Linux: Ubuntu 22.04
HW: i5-7400 (AVX, AVX2), 32 GB RAM
GPU: 4 x RTX 3070 8 GB
ollama: 0.1.12, running in Docker
nvidia-smi from within the container shows 2 x 3070.

Because of the large context size, I want to load the model across 2 GPUs, but it never uses the second one and fails after hitting OOM on the first GPU.

modelfile:

ollama show --modelfile coder-16k
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM coder-16k:latest

FROM deepseek-coder:6.7b-base-q5_0
TEMPLATE """{{ .Prompt }}"""
PARAMETER num_ctx 16384
PARAMETER num_gpu 128
PARAMETER num_predict 756
PARAMETER seed 42
PARAMETER temperature 0.1
PARAMETER top_k 22
PARAMETER top_p 0.5
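
For reference, here is my understanding (a rough sketch, not ollama's actual invocation) of how these Modelfile parameters line up with llama.cpp's flags; the GGUF path below is only a placeholder:

# num_ctx 16384  roughly corresponds to llama.cpp's --ctx-size 16384
# num_gpu 128    roughly corresponds to --n-gpu-layers 128 (clamped to this model's 35 layers)
# Splitting across GPUs is llama.cpp's --tensor-split; as far as I can tell
# there is no Modelfile parameter for it here.
$ ./server -m /path/to/deepseek-coder-6.7b-base-q5_0.gguf \
    --ctx-size 16384 --n-gpu-layers 128 --tensor-split 1,1 --main-gpu 0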

AVX:
It does not recognize/report AVX2, as you can see in the log.

HINT:
Using num_gpu to mean "layers to offload" is misleading: in most other loaders (FastChat, oobabooga's, vLLM, etc.) a parameter named num_gpus describes the number of GPUs to use.

IMHO, parameter names like these would be more telling (see the sketch after this list):

  • tensor_split: number of GPUs to use
  • offload_layers: number of layers to offload
  • gpus: which GPUs to use, like CUDA_VISIBLE_DEVICES
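
To make the third suggestion concrete, a minimal sketch (not from this thread) of how GPU selection can already be done at the container/driver level; the container name is just an example:

# Expose only two of the four 3070s to the container
# (device selection, analogous to CUDA_VISIBLE_DEVICES):
$ docker run -d --gpus '"device=0,1"' \
    -v ollama:/root/.ollama -p 11434:11434 \
    --name ollama-2gpu ollama/ollama

# Bare-metal equivalent: restrict the server process to GPUs 0 and 1.
$ CUDA_VISIBLE_DEVICES=0,1 ollama serve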

Here is the log of the failure.
This is the part where it OOM-errors on GPU 0 and starts loading on the CPU:

...
ollama-GPU23  | llm_load_print_meta: LF token  = 126 'Ä'                                                                            
ollama-GPU23  | llm_load_tensors: ggml ctx size =    0.11 MiB                                                                       
ollama-GPU23  | llm_load_tensors: using CUDA for GPU acceleration                                                                   
ollama-GPU23  | ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3070) as main device                                  
ollama-GPU23  | llm_load_tensors: mem required  =   86.73 MiB                                                                       
ollama-GPU23  | llm_load_tensors: offloading 32 repeating layers to GPU                                                    
ollama-GPU23  | llm_load_tensors: offloading non-repeating layers to GPU                                                   
ollama-GPU23  | llm_load_tensors: offloaded 35/35 layers to GPU                                                                     
ollama-GPU23  | llm_load_tensors: VRAM used: 4350.38 MiB                                                                            
ollama-GPU23  | ..................................................................................................                  
ollama-GPU23  | llama_new_context_with_model: n_ctx      = 16384                                                                    
ollama-GPU23  | llama_new_context_with_model: freq_base  = 100000.0                                                                 
ollama-GPU23  | llama_new_context_with_model: freq_scale = 0.25                                                                     
ollama-GPU23  | llama_kv_cache_init: offloading v cache to GPU                                                                      
ollama-GPU23  |                                                                                                                     
ollama-GPU23  | CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7957: out of memory
ollama-GPU23  | current device: 0
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:436: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7957: out of memory
ollama-GPU23  | current device: 0
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:444: error starting llama runner: llama runner process has terminated
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:510: llama runner stopped successfully
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:421: starting llama runner
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:479: waiting for llama runner to start responding
ollama-GPU23  | {"timestamp":1701415674,"level":"WARNING","function":"server_params_parse","line":2035,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
ollama-GPU23  | {"timestamp":1701415674,"level":"INFO","function":"main","line":2534,"message":"build info","build":375,"commit":"9656026"}
ollama-GPU23  | {"timestamp":1701415674,"level":"INFO","function":"main","line":2537,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":4,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
ollama-GPU23  | llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:5d80d0c539a5c90b360fbb2bc49261f3e28fae0e937452aea3948788c40cbba7 (version GGUF V2)
ollama-GPU23  | 
...
@mlewis1973

We have the Docker version installed on a 3-GPU system with nvidia-container-toolkit, and while the CLI doesn't work, we can use the API interface. I see ollama-runner spawned on all three GPUs.

@chymian
Author

chymian commented Dec 2, 2023

We have the Docker version installed on a 3-GPU system with nvidia-container-toolkit, and while the CLI doesn't work, we can use the API interface. I see ollama-runner spawned on all three GPUs.

And does it split models across different GPUs depending on VRAM? That doesn't work for me.
If you have that running, can you please post your Modelfile, docker-compose, version, etc.?

@anuradhawick

anuradhawick commented Dec 3, 2023

It seems to me like the tasks are divided among the GPUs. I could not find any documentation to support this though.

This is for model: llama2:13b

ollama-runner has two processes on each GPU. In the logs I read the following:

2023/12/03 18:30:11 llama.go:292: 46054 MB VRAM available, loading up to 196 GPU layers
2023/12/03 18:30:11 llama.go:421: starting llama runner
2023/12/03 18:30:11 llama.go:479: waiting for llama runner to start responding
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
...........................LAYER INFO.............................
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =   88.02 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 6936.01 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 3200.00 MiB
llama_new_context_with_model: kv self size  = 3200.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 361.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 358.00 MiB
llama_new_context_with_model: total VRAM used: 10494.01 MiB (model: 6936.01 MiB, context: 3558.00 MiB)

Division of memory seems to be asymmetric, probably for good reasons.
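
Not part of the original comment, but the asymmetric split is easy to watch while a model is loaded; a minimal sketch using standard nvidia-smi query flags:

# Print per-GPU memory usage every 2 seconds while the model is loaded.
$ nvidia-smi --query-gpu=index,name,memory.used,memory.total \
    --format=csv,noheader -l 2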

@mlewis1973

We have the Docker version installed on a 3-GPU system with nvidia-container-toolkit, and while the CLI doesn't work, we can use the API interface. I see ollama-runner spawned on all three GPUs.

And does it split models across different GPUs depending on VRAM? That doesn't work for me. If you have that running, can you please post your Modelfile, docker-compose, version, etc.?

No docker-compose...
Ubuntu 20
$ docker info
Client: Docker Engine - Community
Version: 24.0.7
....
Runtimes: io.containerd.runc.v2 nvidia runc
.....

$ docker run --gpus=all -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

$ curl http://localhost:11434/api/generate -d '{"model": "llama2","prompt": "Who was the most famous person of all time?","stream":false}'
{"model":"llama2","created_at":"2023-12-04T17:52:38.826266993Z","response":" Determining the most famous person of all time is a difficult task, as it depends on various factors such as cultural context, historical period, and personal opinions. However, here are some of the most renowned individuals throughout history who have had a significant impact on human civilization:\n\n1. Jesus Christ: Known as the central figure of Christianity, Jesus is considered by many to be the most famous person in history. His teachings, life, death, and resurrection have had a profound impact on billions of people around the world.\n2. Muhammad: As the prophet of Islam, Muhammad is revered by over 1.8 billion Muslims globally. His teachings and example have shaped the lives of millions of people for centuries,.....

Mon Dec 4 11:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:05:00.0 Off |                  N/A |
| 27%   46C    P8    10W / 250W |   5997MiB / 12192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN Xp     Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   42C    P8    10W / 250W |   1925MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN Xp     Off  | 00000000:0B:00.0 Off |                  N/A |
| 23%   39C    P8    10W / 250W |   1991MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2316      G   /usr/lib/xorg/Xorg                 46MiB |
|    0   N/A  N/A      2382      G   /usr/bin/gnome-shell               13MiB |
|    0   N/A  N/A    162891      C   ...ffice/program/soffice.bin      145MiB |
|    0   N/A  N/A    182102      C   python                           2461MiB |
|    0   N/A  N/A    477037      C   ...ld/cuda/bin/ollama-runner     3325MiB |
|    1   N/A  N/A      2316      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    477037      C   ...ld/cuda/bin/ollama-runner     1915MiB |
|    2   N/A  N/A      2316      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    477037      C   ...ld/cuda/bin/ollama-runner     1981MiB |
+-----------------------------------------------------------------------------+

@Stampede

Trying to load a model (deepseek-coder) across 2 GPUs fails with an OOM error.

The setup: Linux: Ubuntu 22.04, HW: i5-7400 (AVX, AVX2), 32 GB RAM, GPU: 4 x RTX 3070 8 GB, ollama: 0.1.12, running in Docker. nvidia-smi from within the container shows 2 x 3070.

For what it's worth, I have similar system specs to yours, and I am getting the same error log messages.

Out of memory errors and also:

"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support"

I am not running the docker version of ollama; I downloaded the binary to my ~/bin directory and run it from there.

ollama version 0.1.13

ollama was working fine until I put in the 2nd GPU.
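
Not from the original comment, but a quick sanity check for this situation, using standard nvidia-smi and ollama commands (the grep pattern is just an example):

# Confirm the driver sees both cards after adding the 2nd GPU.
$ nvidia-smi -L

# Restart the server and look for CUDA detection; the "Not compiled with
# GPU offload support" warning presumably means the CPU-only runner was
# started instead of the CUDA one, as in the log above.
$ ollama serve 2>&1 | grep -i -E "cuda|offload"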

@dhiltgen
Collaborator

We've been making improvements to our memory prediction algorithm, but it still isn't perfect yet. In general, there's a chunk of memory that gets allocated on the first GPU, then the remainder is spread evenly across the GPUs.

In the next release (0.1.29) we'll be adding a new setting, OLLAMA_MAX_VRAM=<bytes>, that lets you set a lower VRAM limit to work around this type of crash until we get the prediction logic fixed. For example, you could start with 30G and experiment until you find a setting that loads as many layers as possible without hitting the OOM crash: OLLAMA_MAX_VRAM=32212254720
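
A sketch of how that workaround could be applied once 0.1.29 is out; the container name and the exact byte value are just the examples from above:

# Docker: pass the cap as an environment variable when creating the container.
$ docker run -d --gpus=all -e OLLAMA_MAX_VRAM=32212254720 \
    -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Native install: set it in the environment of the server process.
$ OLLAMA_MAX_VRAM=32212254720 ollama serve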

@dhiltgen dhiltgen self-assigned this Mar 12, 2024
@insunaa

insunaa commented Mar 14, 2024

Can you make the parameter parse units? For example OLLAMA_MAX_VRAM=30G or OLLAMA_MAX_VRAM=1T etc.
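
Until/unless units are supported, the byte value can be computed on the shell side; a small sketch using GNU coreutils' numfmt, assuming 30G is meant as 30 GiB:

# 30 GiB expressed in bytes, usable as OLLAMA_MAX_VRAM today.
$ numfmt --from=iec 30G
32212254720

$ OLLAMA_MAX_VRAM=$(numfmt --from=iec 30G) ollama serve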

@dhiltgen dhiltgen changed the title MultiGPU: not splitting model to multiple GPUs MultiGPU: not splitting model to multiple GPUs - CUDA out of memory Mar 21, 2024
@dhiltgen dhiltgen added bug Something isn't working nvidia Issues relating to Nvidia GPUs and CUDA labels Mar 21, 2024
@dhiltgen dhiltgen assigned mxyng and unassigned dhiltgen Mar 21, 2024
@dhiltgen
Collaborator

@insunaa this variable is only meant to be a temporary workaround until we get the memory prediction fixed so OOM crashes no longer happen.

@dhiltgen
Collaborator

dhiltgen commented May 2, 2024

The latest release (0.1.33) further refines our handling of multi-GPU setups and our memory prediction algorithms. Please give it a try and let us know if you're still seeing problems.

https://github.com/ollama/ollama/releases
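
For anyone following along, upgrading uses the usual install paths; a sketch assuming the standard Linux install script and Docker image from the ollama README:

# Native Linux: re-run the official install script to upgrade in place.
$ curl -fsSL https://ollama.com/install.sh | sh

# Docker: pull the updated image and recreate the container.
$ docker pull ollama/ollama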

@jmorganca
Member

This should be fixed now. However, we are still working on multi-GPU memory allocation, so please do share any issues you're hitting!
