MultiGPU: not splitting model to multiple GPUs - CUDA out of memory #1341

Closed
chymian opened this issue Dec 1, 2023 · 10 comments
Labels: bug (Something isn't working) · nvidia (Issues relating to Nvidia GPUs and CUDA)

Comments

@chymian

chymian commented Dec 1, 2023

Trying to load a model (deepseek-coder) across 2 GPUs fails with an OOM error.

The setup:
Linux: Ubuntu 22.04
HW: i5-7400 (AVX, AVX2), 32 GB RAM
GPU: 4 x RTX 3070 8 GB
ollama: 0.1.12, running in Docker
nvidia-smi from within the container shows 2 x 3070.

Because of the large context size, I want to load the model across 2 GPUs, but it never uses the second one and fails after hitting OOM on the first GPU.

modelfile:

ollama show --modelfile coder-16k
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM coder-16k:latest

FROM deepseek-coder:6.7b-base-q5_0
TEMPLATE """{{ .Prompt }}"""
PARAMETER num_ctx 16384
PARAMETER num_gpu 128
PARAMETER num_predict 756
PARAMETER seed 42
PARAMETER temperature 0.1
PARAMETER top_k 22
PARAMETER top_p 0.5
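
For reference, here is my understanding (a rough sketch, not ollama's actual invocation) of how these Modelfile parameters line up with llama.cpp's flags; the GGUF path below is only a placeholder:

# num_ctx 16384  roughly corresponds to llama.cpp's --ctx-size 16384
# num_gpu 128    roughly corresponds to --n-gpu-layers 128 (clamped to this model's 35 layers)
# Splitting across GPUs is llama.cpp's --tensor-split; as far as I can tell
# there is no Modelfile parameter for it here.
$ ./server -m /path/to/deepseek-coder-6.7b-base-q5_0.gguf \
    --ctx-size 16384 --n-gpu-layers 128 --tensor-split 1,1 --main-gpu 0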

AVX:
It does not recognize/report AVX2, as you can see in the log.

HINT:
Using num_gpu to mean "layers to offload" is misleading: in most other loaders (FastChat, oobabooga's, vLLM, etc.) a parameter named num_gpus describes the number of GPUs to use.

IMHO, parameter names like these would be more telling (see the sketch after this list):

  • tensor_split: number of GPUs to use
  • offload_layers: number of layers to offload
  • gpus: which GPUs to use, like CUDA_VISIBLE_DEVICES
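
To make the third suggestion concrete, a minimal sketch (not from this thread) of how GPU selection can already be done at the container/driver level; the container name is just an example:

# Expose only two of the four 3070s to the container
# (device selection, analogous to CUDA_VISIBLE_DEVICES):
$ docker run -d --gpus '"device=0,1"' \
    -v ollama:/root/.ollama -p 11434:11434 \
    --name ollama-2gpu ollama/ollama

# Bare-metal equivalent: restrict the server process to GPUs 0 and 1.
$ CUDA_VISIBLE_DEVICES=0,1 ollama serve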

Here is the log of the failure.
This is the part where it OOM-errors on GPU 0 and starts loading on the CPU:

...
ollama-GPU23  | llm_load_print_meta: LF token  = 126 'Ä'                                                                            
ollama-GPU23  | llm_load_tensors: ggml ctx size =    0.11 MiB                                                                       
ollama-GPU23  | llm_load_tensors: using CUDA for GPU acceleration                                                                   
ollama-GPU23  | ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3070) as main device                                  
ollama-GPU23  | llm_load_tensors: mem required  =   86.73 MiB                                                                       
ollama-GPU23  | llm_load_tensors: offloading 32 repeating layers to GPU                                                    
ollama-GPU23  | llm_load_tensors: offloading non-repeating layers to GPU                                                   
ollama-GPU23  | llm_load_tensors: offloaded 35/35 layers to GPU                                                                     
ollama-GPU23  | llm_load_tensors: VRAM used: 4350.38 MiB                                                                            
ollama-GPU23  | ..................................................................................................                  
ollama-GPU23  | llama_new_context_with_model: n_ctx      = 16384                                                                    
ollama-GPU23  | llama_new_context_with_model: freq_base  = 100000.0                                                                 
ollama-GPU23  | llama_new_context_with_model: freq_scale = 0.25                                                                     
ollama-GPU23  | llama_kv_cache_init: offloading v cache to GPU                                                                      
ollama-GPU23  |                                                                                                                     
ollama-GPU23  | CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7957: out of memory
ollama-GPU23  | current device: 0
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:436: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7957: out of memory
ollama-GPU23  | current device: 0
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:444: error starting llama runner: llama runner process has terminated
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:510: llama runner stopped successfully
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:421: starting llama runner
ollama-GPU23  | 2023/12/01 07:27:54 llama.go:479: waiting for llama runner to start responding
ollama-GPU23  | {"timestamp":1701415674,"level":"WARNING","function":"server_params_parse","line":2035,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
ollama-GPU23  | {"timestamp":1701415674,"level":"INFO","function":"main","line":2534,"message":"build info","build":375,"commit":"9656026"}
ollama-GPU23  | {"timestamp":1701415674,"level":"INFO","function":"main","line":2537,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":4,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
ollama-GPU23  | llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:5d80d0c539a5c90b360fbb2bc49261f3e28fae0e937452aea3948788c40cbba7 (version GGUF V2)
ollama-GPU23  | 
...
@mlewis1973

We have the Docker version installed on a 3-GPU system with nvidia-container-toolkit, and while the CLI doesn't work, we can use the API interface. I see ollama-runner spawned on all three GPUs.

@chymian
Author

chymian commented Dec 2, 2023

We have the Docker version installed on a 3-GPU system with nvidia-container-toolkit, and while the CLI doesn't work, we can use the API interface. I see ollama-runner spawned on all three GPUs.

And does it split models across different GPUs depending on VRAM? That doesn't work for me.
If you have that running, can you please post your Modelfile, docker-compose, version, etc.?

@anuradhawick

anuradhawick commented Dec 3, 2023

It seems to me like the tasks are divided among the GPUs. I could not find any documentation to support this though.

This is for model: llama2:13b

ollama-runner has two processes on each GPU. In the logs I read the following:

2023/12/03 18:30:11 llama.go:292: 46054 MB VRAM available, loading up to 196 GPU layers
2023/12/03 18:30:11 llama.go:421: starting llama runner
2023/12/03 18:30:11 llama.go:479: waiting for llama runner to start responding
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
...........................LAYER INFO.............................
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =   88.02 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 6936.01 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 3200.00 MiB
llama_new_context_with_model: kv self size  = 3200.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 361.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 358.00 MiB
llama_new_context_with_model: total VRAM used: 10494.01 MiB (model: 6936.01 MiB, context: 3558.00 MiB)

Division of memory seems to be asymmetric, probably for good reasons.
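
Not part of the original comment, but the asymmetric split is easy to watch while a model is loaded; a minimal sketch using standard nvidia-smi query flags:

# Print per-GPU memory usage every 2 seconds while the model is loaded.
$ nvidia-smi --query-gpu=index,name,memory.used,memory.total \
    --format=csv,noheader -l 2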

@mlewis1973

We have the Docker version installed on a 3-GPU system with nvidia-container-toolkit, and while the CLI doesn't work, we can use the API interface. I see ollama-runner spawned on all three GPUs.

And does it split models across different GPUs depending on VRAM? That doesn't work for me. If you have that running, can you please post your Modelfile, docker-compose, version, etc.?

No docker-compose...
Ubuntu 20
$ docker info
Client: Docker Engine - Community
Version: 24.0.7
....
Runtimes: io.containerd.runc.v2 nvidia runc
.....

$ docker run --gpus=all -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

$ curl http://localhost:11434/api/generate -d '{"model": "llama2","prompt": "Who was the most famous person of all time?","stream":false}'
{"model":"llama2","created_at":"2023-12-04T17:52:38.826266993Z","response":" Determining the most famous person of all time is a difficult task, as it depends on various factors such as cultural context, historical period, and personal opinions. However, here are some of the most renowned individuals throughout history who have had a significant impact on human civilization:\n\n1. Jesus Christ: Known as the central figure of Christianity, Jesus is considered by many to be the most famous person in history. His teachings, life, death, and resurrection have had a profound impact on billions of people around the world.\n2. Muhammad: As the prophet of Islam, Muhammad is revered by over 1.8 billion Muslims globally. His teachings and example have shaped the lives of millions of people for centuries,.....

Mon Dec 4 11:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:05:00.0 Off |                  N/A |
| 27%   46C    P8    10W / 250W |   5997MiB / 12192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN Xp     Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   42C    P8    10W / 250W |   1925MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN Xp     Off  | 00000000:0B:00.0 Off |                  N/A |
| 23%   39C    P8    10W / 250W |   1991MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2316      G   /usr/lib/xorg/Xorg                 46MiB |
|    0   N/A  N/A      2382      G   /usr/bin/gnome-shell               13MiB |
|    0   N/A  N/A    162891      C   ...ffice/program/soffice.bin      145MiB |
|    0   N/A  N/A    182102      C   python                           2461MiB |
|    0   N/A  N/A    477037      C   ...ld/cuda/bin/ollama-runner     3325MiB |
|    1   N/A  N/A      2316      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    477037      C   ...ld/cuda/bin/ollama-runner     1915MiB |
|    2   N/A  N/A      2316      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    477037      C   ...ld/cuda/bin/ollama-runner     1981MiB |
+-----------------------------------------------------------------------------+

@Stampede

Trying to load a model (deepseek-coder) across 2 GPUs fails with an OOM error.

The setup: Linux: Ubuntu 22.04, HW: i5-7400 (AVX, AVX2), 32 GB RAM, GPU: 4 x RTX 3070 8 GB, ollama: 0.1.12, running in Docker. nvidia-smi from within the container shows 2 x 3070.

For what it's worth, I have similar system specs to yours, and I am getting the same error log messages.

Out of memory errors and also:

"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support"

I am not running the docker version of ollama; I downloaded the binary to my ~/bin directory and run it from there.

ollama version 0.1.13

ollama was working fine until I put in the 2nd GPU.
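
Not from the original comment, but a quick sanity check for this situation, using standard nvidia-smi and ollama commands (the grep pattern is just an example):

# Confirm the driver sees both cards after adding the 2nd GPU.
$ nvidia-smi -L

# Restart the server and look for CUDA detection; the "Not compiled with
# GPU offload support" warning presumably means the CPU-only runner was
# started instead of the CUDA one, as in the log above.
$ ollama serve 2>&1 | grep -i -E "cuda|offload"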

@dhiltgen
Collaborator

We've been making improvements to our memory prediction algorithm, but it still isn't perfect yet. In general, there's a chunk of memory that gets allocated on the first GPU, then the remainder is spread evenly across the GPUs.

In the next release (0.1.29) we'll be adding a new setting, OLLAMA_MAX_VRAM=<bytes>, that lets you set a lower VRAM limit to work around this type of crash until we get the prediction logic fixed. For example, you could start with 30G and experiment until you find a setting that loads as many layers as possible without hitting the OOM crash: OLLAMA_MAX_VRAM=32212254720
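
A sketch of how that workaround could be applied once 0.1.29 is out; the container name and the exact byte value are just the examples from above:

# Docker: pass the cap as an environment variable when creating the container.
$ docker run -d --gpus=all -e OLLAMA_MAX_VRAM=32212254720 \
    -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Native install: set it in the environment of the server process.
$ OLLAMA_MAX_VRAM=32212254720 ollama serve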

@dhiltgen dhiltgen self-assigned this Mar 12, 2024
@insunaa

insunaa commented Mar 14, 2024

Can you make the parameter parse units? For example OLLAMA_MAX_VRAM=30G or OLLAMA_MAX_VRAM=1T etc.
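
Until/unless units are supported, the byte value can be computed on the shell side; a small sketch using GNU coreutils' numfmt, assuming 30G is meant as 30 GiB:

# 30 GiB expressed in bytes, usable as OLLAMA_MAX_VRAM today.
$ numfmt --from=iec 30G
32212254720

$ OLLAMA_MAX_VRAM=$(numfmt --from=iec 30G) ollama serve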

@dhiltgen dhiltgen changed the title MultiGPU: not splitting model to multiple GPUs MultiGPU: not splitting model to multiple GPUs - CUDA out of memory Mar 21, 2024
@dhiltgen dhiltgen added bug Something isn't working nvidia Issues relating to Nvidia GPUs and CUDA labels Mar 21, 2024
@dhiltgen dhiltgen assigned mxyng and unassigned dhiltgen Mar 21, 2024
@dhiltgen
Collaborator

@insunaa this variable is only meant to be a temporary workaround until we get the memory prediction fixed so OOM crashes no longer happen.

@dhiltgen
Collaborator

dhiltgen commented May 2, 2024

The latest release (0.1.33) further refines our handling of multi-GPU setups and our memory prediction algorithms. Please give it a try and let us know if you're still seeing problems.

https://github.com/ollama/ollama/releases
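
For anyone following along, upgrading uses the usual install paths; a sketch assuming the standard Linux install script and Docker image from the ollama README:

# Native Linux: re-run the official install script to upgrade in place.
$ curl -fsSL https://ollama.com/install.sh | sh

# Docker: pull the updated image and recreate the container.
$ docker pull ollama/ollama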

@jmorganca
Member

This should be fixed now. However, we are still working on multi-GPU memory allocation, so please do share any issues you're hitting!
