
How to run Ollama only on a dedicated GPU? (Instead of all GPUs) #1813

Closed · sthufnagl opened this issue Jan 5, 2024 · 29 comments · Fixed by #3282

@sthufnagl

Hi,

I have 3x RTX 3090 and I want to run each Ollama instance on a dedicated GPU. The reason: I want three Ollama instances (on different ports) to use with Autogen.
I also tried the "Docker Ollama" setup without luck.
Or is there another solution?

Let me know...

Thanks in advance

Steve

@Tomatcree01

You could give me the other two

@sthufnagl
Author

:-)

@sthufnagl
Author

Could it be that the number of GPUs used by Ollama is related to the model?
On the page https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md they mention a "num_gpu" parameter.
==> Do I have to create a new Modelfile from an existing model and include this parameter?
Still searching....

@tarbard

tarbard commented Jan 6, 2024

Could it be that the number of GPUs used by Ollama is related to the model? On the page https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md they mention a "num_gpu" parameter.

That's just the number of layers to offload. I don't think there's a way to control GPU affinity, but I would also like to do this. Another issue for me is that it automatically splits a model between 2 GPUs even though it would fit on a single GPU (which would be faster), so I would like to make it use the one with more VRAM.

@tarbard

tarbard commented Jan 6, 2024

I did a bit of research - it seems the relevant llama.cpp options are:

-mg i, --main-gpu i: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.

-ts SPLIT, --tensor-split SPLIT: When using multiple GPUs this option controls how large tensors should be split across all GPUs. SPLIT is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
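For reference, a minimal sketch of how those flags look when invoking a locally built llama.cpp binary directly (the model path and prompt are placeholders, not taken from this thread):

# keep GPU 1 as the main device and put 100% of the split tensors on GPU 1
./main -m ./models/llama-2-7b.Q4_K_M.gguf \
  -ngl 99 \
  --main-gpu 1 \
  --tensor-split 0,1 \
  -p "Why is the sky blue?"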

Checking the https://github.com/jmorganca/ollama/blob/main/docs/api.md docs, we should be able to pass main_gpu to the API, so I tried setting main_gpu to 1:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {
    "num_keep": 5,
    "seed": 42,
    "num_predict": 100,
    "top_k": 20,
    "top_p": 0.9,
    "tfs_z": 0.5,
    "typical_p": 0.7,
    "repeat_last_n": 33,
    "temperature": 0.8,
    "repeat_penalty": 1.2,
    "presence_penalty": 1.5,
    "frequency_penalty": 1.0,
    "mirostat": 1,
    "mirostat_tau": 0.8,
    "mirostat_eta": 0.6,
    "penalize_newline": true,
    "stop": ["\n", "user:"],
    "numa": false,
    "num_ctx": 1024,
    "num_batch": 2,
    "num_gqa": 1,
    "main_gpu": 1,
    "low_vram": false,
    "f16_kv": true,
    "vocab_only": false,
    "use_mmap": true,
    "use_mlock": false,
    "embedding_only": false,
    "rope_frequency_base": 1.1,
    "rope_frequency_scale": 0.8,
    "num_thread": 8
  }
}'

This didn't seem to work: the same memory split took place rather than it using only the second GPU. Maybe the option is not yet passed on to llama.cpp from Ollama. I had a look at the Ollama code, but I'm not familiar with Go, so I'm not sure.

@sthufnagl
Author

Thx tarbard...I will check it.

@houstonhaynes

houstonhaynes commented Jan 7, 2024

If you're running in three separate containers via Docker, you can start each container so that it's only "aware" of one GPU.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html

docker run --gpus '"device=1,2"' \
    nvidia/cuda nvidia-smi --query-gpu=uuid --format=csv
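Following that pattern, a rough sketch of three per-GPU Ollama containers (the container names, volume names, and host ports here are arbitrary; this assumes the NVIDIA Container Toolkit is set up):

docker run -d --gpus '"device=0"' -v ollama0:/root/.ollama -p 11434:11434 --name ollama-gpu0 ollama/ollama
docker run -d --gpus '"device=1"' -v ollama1:/root/.ollama -p 11435:11434 --name ollama-gpu1 ollama/ollama
docker run -d --gpus '"device=2"' -v ollama2:/root/.ollama -p 11436:11434 --name ollama-gpu2 ollama/ollama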

@sthufnagl
Author

@houstonhaynes...I had the same idea, but it doesn't work for me. Ollama, running inside Docker, takes all GPUs no matter how I use the Docker parameter "--gpus" (I also tried the ID of a GPU).
:-(
Does it work for you?

My solution now is to split/distribute the 3090s across different PCs. To my surprise, even with very old PC hardware, Ollama runs fast!
Loading a model into VRAM also takes nearly the same time.

@houstonhaynes

houstonhaynes commented Jan 8, 2024

That is wild - I guess I "trust the manual" too much! I have two machines with an RTX 3050 in each and haven't moved one over to have two in one machine. I was just doing some spelunking on GPU-driven inference with postgresml and spotted that "deep" info from NVIDIA along the way; I thought it would be useful when I upgrade. I'm sorry it's not more helpful, but maybe the controls "under the hood" suggested above will give you the right lever(s). I'd love to know how that turns out in case it comes calling after I put a bunch of cards in a GPU chassis! 😸

@null-dev

null-dev commented Jan 11, 2024

BTW you can use CUDA_VISIBLE_DEVICES for this, see: https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on

Unfortunately, the name of the environment variable is kind of a lie. It appears the other GPUs are still visible, just not accessible, so when Ollama calculates the compute capability level of the GPUs, it takes the other GPUs into account. This is bad, because if you have GPU 0 with compute capability X and GPU 1 with compute capability Y and you set CUDA_VISIBLE_DEVICES=0, Ollama will detect the compute capability as min(X, Y) when X is the correct value.

EDIT: Never mind, this isn't a problem, because it looks like Ollama doesn't actually do anything with the detected compute capability information; it's just used to decide whether or not to use GPUs at all.
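A minimal sketch of that approach for a bare-metal install (the GPU index is just an example; a systemd-managed install needs the variable set in the service unit instead):

# expose only GPU 1 to the server process
CUDA_VISIBLE_DEVICES=1 ollama serve

# in another terminal: load a model and check where the memory lands
ollama run llama2 "Why is the sky blue?"
nvidia-smi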

@cgint

cgint commented Jan 21, 2024

Same challenge here.

CUDA_VISIBLE_DEVICES on its own does not work for me as a switch between models that fit onto one GPU and others that need 2. I could, though, spin up two instances of Ollama on two ports, where one has CUDA_VISIBLE_DEVICES set to only 'see' one device and the second instance has access to both. Then I would have to decide myself, depending on the model, which instance to connect to.

It would really be awesome if either ...

  • there was a config option for Ollama that changes behaviour so that it does not try to balance the used VRAM over all available GPUs but, e.g., only uses one GPU if that GPU already has enough VRAM to hold the model + context, or
  • there was an option to specify this on inference calls. main_gpu mentioned by @tarbard sounds like that.

Will check out if main_gpu works on my system.

Damn!
Not working with Ollama from Python - although the option is handed over in the HTTP request to the Ollama endpoint. 🤷

What I do get since setting {'main_gpu': 1}, though, is a log line when a model is loaded saying
ollama[1733]: ggml_cuda_set_main_device: using device 1 (NVIDIA GeForce RTX 4060 Ti) as main device.
But the model is still distributed across my 2 GPUs, although it would fit onto one.

With my current workaround I spin up another instance of Ollama with the following command ...

CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=0.0.0.0:22222 ollama serve

... and whenever I know a model fits on one GPU, I connect to this port on my local machine.

Thx for the CUDA_VISIBLE_DEVICES @null-dev

@matbee-eth

matbee-eth commented Jan 27, 2024

Damn, I was not hoping for this outcome. Has anyone figured out how to restrict it to just one? Edit: never mind, using CUDA_VISIBLE_DEVICES seems to have done the trick.

@Koesn

Koesn commented Feb 25, 2024

Why is this still unsupported? In LM Studio I dedicate a GPU using a tensor split of 0,35, so I can fully offload Mistral with 32k context to a 3060. I hope a tensor split option comes to the Ollama Modelfile.

@dhiltgen
Collaborator

CUDA_VISIBLE_DEVICES should work. We do have a defect related to memory prediction calculations in this case tracked via #1514

If you're seeing it load onto unexpected GPUs when this variable is set, please share the server log and some more details about the setup and I'll re-open.
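(On a standard Linux install where Ollama runs under systemd, the server log can be captured like this; just a convenience sketch, not an official procedure:)

# follow the Ollama service log while reproducing the problem
journalctl -u ollama -f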

@jeremytregunna

CUDA_VISIBLE_DEVICES should work. We do have a defect related to memory prediction calculations in this case tracked via #1514

If you're seeing it load onto unexpected GPUs when this variable is set, please share the server log and some more details about the setup and I'll re-open.

[screenshot: nvtop showing the three GPUs]

As you can see in the screenshot above, I have 3 GPUs: 2x RTX A6000 and 1x RTX 3070. I use the A6000s for bigger models through Ollama, and I want to reserve the smaller GPU for embedding models. However, when I start the server using the systemd config below:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/ubuntu/.local/bin:/home/ubuntu/miniconda3/bin:/home/ubuntu/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="CUDA_VISIBLE_DEVICES=0,2"
Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=default.target

When I restart Ollama and use, say, dolphin-mixtral:8x7b-v2.7-q8_0 (a model that occupies more GPU memory than I have on any one GPU), it distributes it over devices 0 and 1 instead of 0 and 2. I can wholly confirm I did a systemctl daemon-reload, then a systemctl restart ollama, before sending a message to the dolphin-mixtral model and watching nvtop.

So it doesn't seem as though CUDA_VISIBLE_DEVICES is working as intended. For completeness, here's the output of nvidia-smi:

Thu Mar 14 22:51:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:81:00.0 Off |                    0 |
| 30%   57C    P8              22W / 300W |  43657MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3070        Off | 00000000:C1:00.0 Off |                  N/A |
|  0%   47C    P8              22W / 270W |   5246MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               Off | 00000000:C2:00.0 Off |                  Off |
| 31%   60C    P8              28W / 300W |      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2873      C   /usr/local/bin/ollama                     43650MiB |
|    1   N/A  N/A      2873      C   /usr/local/bin/ollama                      5240MiB |
+---------------------------------------------------------------------------------------+

Any help would be appreciated. @dhiltgen

@dhiltgen dhiltgen reopened this Mar 15, 2024
@dhiltgen
Collaborator

@jeremytregunna it sounds like there might be an ordering/enumeration bug where we're not consistent with other tools. If I had to guess, I'd speculate this is some tools/libraries using PCI bus/slot, and others sorting by capability/performance.

Can you enable OLLAMA_DEBUG=1 and start up the server?

Also try CUDA_VISIBLE_DEVICES=0,1 and from what you describe, that sounds like it might get the GPU assignment right.
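For reference, one way to do that debug run by hand, assuming the standard Linux install with a systemd service named ollama:

# stop the service, then run the server in the foreground with debug logging
sudo systemctl stop ollama
OLLAMA_DEBUG=1 CUDA_VISIBLE_DEVICES=0,1 ollama serve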

@jeremytregunna

@jeremytregunna it sounds like there might be an ordering/enumeration bug where we're not consistent with other tools. If I had to guess, I'd speculate this is some tools/libraries using PCI bus/slot, and others sorting by capability/performance.

Can you enable OLLAMA_DEBUG=1 and start up the server?

Also try CUDA_VISIBLE_DEVICES=0,1 and from what you describe, that sounds like it might get the GPU assignment right.

Hrmm... I've run it with debug logs on a few times, and the ordering never seems to change; it always reports the output below:

CUDA driver version: 535.161.07
time=2024-03-15T23:25:09.751Z level=INFO source=gpu.go:82 msg="Nvidia GPU detected"
time=2024-03-15T23:25:09.751Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[0] CUDA device name: NVIDIA RTX A6000
[0] CUDA part number: 900-5G133-0300-000
[0] CUDA S/N: 1651922013945
[0] CUDA vbios version: 94.02.5C.00.06
[0] CUDA brand: 13
[0] CUDA totalMem 48305799168
[0] CUDA usedMem 467599360
[1] CUDA device name: NVIDIA GeForce RTX 3070
[1] CUDA part number: 
nvmlDeviceGetSerial failed: 3
[1] CUDA vbios version: 94.04.67.00.3E
[1] CUDA brand: 5
[1] CUDA totalMem 8589934592
[1] CUDA usedMem 230031360
[2] CUDA device name: NVIDIA RTX A6000
[2] CUDA part number: 900-5G133-1700-000
[2] CUDA S/N: 1320722000285
[2] CUDA vbios version: 94.02.5C.00.02
[2] CUDA brand: 13
[2] CUDA totalMem 51527024640
[2] CUDA usedMem 486866944
time=2024-03-15T23:25:09.769Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-15T23:25:09.769Z level=DEBUG source=gpu.go:180 msg="cuda detected 3 devices with 92043M available memory"

I verified they're the same devices by looking at the serial numbers. I also tried what you suggested, using CUDA_VISIBLE_DEVICES=0,1 and 1,2, with no luck.

The whole log is preserved below; note this is with 0,2, but as I previously mentioned, that made no difference:

Mar 15 23:35:20 calgary systemd[1]: Stopping Ollama Service...
Mar 15 23:35:20 calgary systemd[1]: ollama.service: Deactivated successfully.
Mar 15 23:35:20 calgary systemd[1]: Stopped Ollama Service.
Mar 15 23:35:20 calgary systemd[1]: ollama.service: Consumed 5.777s CPU time.
Mar 15 23:35:20 calgary systemd[1]: Started Ollama Service.
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.657Z level=INFO source=images.go:806 msg="total blobs: 48"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.657Z level=INFO source=images.go:813 msg="total unused blobs removed: 0"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.658Z level=INFO source=routes.go:1110 msg="Listening on [::]:11434 (version 0.1.29)"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.658Z level=INFO source=payload_common.go:112 msg="Extracting dynamic libraries to /tmp/ollama4171821284/runners ..."
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=payload_common.go:139 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60000 cpu cpu_avx]"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=gpu.go:77 msg="Detecting GPU type"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=gpu.go:191 msg="Searching for GPU management library libnvidia-ml.so"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.317Z level=INFO source=gpu.go:237 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.161.07]"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.334Z level=INFO source=gpu.go:82 msg="Nvidia GPU detected"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.334Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.352Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.959Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama4171821284/runners/cuda_v11/libext_server.so"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.959Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: found 2 CUDA devices:
Mar 15 23:36:36 calgary ollama[5122]:   Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Mar 15 23:36:36 calgary ollama[5122]:   Device 1: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from /usr/share/ollama/.ollama/models/blobs/sha256:a03abff90c35c22bb4e10be3fcb0b974525e50c5e65ce1b4db59781fc413dc2e (version GGUF V3 (latest))
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   1:                               general.name str              = cognitivecomputations
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  13:                          general.file_type u32              = 7
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type  f32:   65 tensors
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type  f16:   32 tensors
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type q8_0:  898 tensors
Mar 15 23:36:37 calgary ollama[5122]: llm_load_vocab: special tokens definition check successful ( 261/32002 ).
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: format           = GGUF V3 (latest)
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: arch             = llama
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: vocab type       = SPM
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_vocab          = 32002
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_merges         = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_ctx_train      = 32768
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd           = 4096
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_head           = 32
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_head_kv        = 8
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_layer          = 32
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_rot            = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_head_k    = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_head_v    = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_gqa            = 4
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_k_gqa     = 1024
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_v_gqa     = 1024
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_ff             = 14336
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_expert         = 8
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_expert_used    = 2
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: pooling type     = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope type        = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope scaling     = linear
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: freq_base_train  = 1000000.0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: freq_scale_train = 1
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope_finetuned   = unknown
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model type       = 7B
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model ftype      = Q8_0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model params     = 46.70 B
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model size       = 46.22 GiB (8.50 BPW)
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: general.name     = cognitivecomputations
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: BOS token        = 1 '<s>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: UNK token        = 0 '<unk>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_tensors: ggml ctx size =    1.14 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloading 32 repeating layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloading non-repeating layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloaded 33/33 layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:        CPU buffer size =   132.82 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:      CUDA0 buffer size = 42647.22 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:      CUDA1 buffer size =  4544.62 MiB
Mar 15 23:36:48 calgary ollama[5122]: ....................................................................................................
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: n_ctx      = 2048
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: freq_base  = 1000000.0
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: freq_scale = 1
Mar 15 23:36:48 calgary ollama[5122]: llama_kv_cache_init:      CUDA0 KV buffer size =   232.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_kv_cache_init:      CUDA1 KV buffer size =    24.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:  CUDA_Host input buffer size   =    13.02 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:      CUDA0 compute buffer size =   184.03 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:      CUDA1 compute buffer size =   192.01 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: graph splits (measure): 3
Mar 15 23:36:48 calgary ollama[5122]: loading library /tmp/ollama4171821284/runners/cuda_v11/libext_server.so
Mar 15 23:36:48 calgary ollama[5122]: {"function":"initialize","level":"INFO","line":440,"msg":"initializing slots","n_slots":1,"tid":"137734259725888","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"initialize","level":"INFO","line":449,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"137734259725888","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: time=2024-03-15T23:36:48.328Z level=INFO source=dyn_ext_server.go:162 msg="Starting llama main loop"
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1590,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","ga_i":0,"level":"INFO","line":1821,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":111,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1848,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =     781.86 ms /   111 tokens (    7.04 ms per token,   141.97 tokens per second)","n_prompt_tokens_processed":111,"n_tokens_second":141.96842417607155,"slot_id":0,"t_prompt_processing":781.864,"t_token":7.04381981981982,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =   10352.39 ms /   327 runs   (   31.66 ms per token,    31.59 tokens per second)","n_decoded":327,"n_tokens_second":31.586915019027494,"slot_id":0,"t_token":31.65867889908257,"t_token_generation":10352.388,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":281,"msg":"          total time =   11134.25 ms","slot_id":0,"t_prompt_processing":781.864,"t_token_generation":10352.388,"t_total":11134.252,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1652,"msg":"slot released","n_cache_tokens":438,"n_ctx":2048,"n_past":437,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545819,"truncated":false}
Mar 15 23:36:59 calgary ollama[5122]: [GIN] 2024/03/15 - 23:36:59 | 200 | 23.883120028s |      10.7.14.22 | POST     "/api/chat"
Mar 15 23:36:59 calgary ollama[5122]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","ga_i":0,"level":"INFO","line":1821,"msg":"slot progression","n_past":21,"n_past_se":0,"n_prompt_tokens_processed":131,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1848,"msg":"kv cache rm [p0, end)","p0":21,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =     836.71 ms /   131 tokens (    6.39 ms per token,   156.57 tokens per second)","n_prompt_tokens_processed":131,"n_tokens_second":156.56578332490747,"slot_id":0,"t_prompt_processing":836.709,"t_token":6.387091603053435,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =     190.45 ms /     7 runs   (   27.21 ms per token,    36.75 tokens per second)","n_decoded":7,"n_tokens_second":36.75486083034482,"slot_id":0,"t_token":27.207285714285714,"t_token_generation":190.451,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":281,"msg":"          total time =    1027.16 ms","slot_id":0,"t_prompt_processing":836.709,"t_token_generation":190.451,"t_total":1027.1599999999999,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1652,"msg":"slot released","n_cache_tokens":159,"n_ctx":2048,"n_past":158,"n_system_tokens":0,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545820,"truncated":false}
Mar 15 23:37:00 calgary ollama[5122]: [GIN] 2024/03/15 - 23:37:00 | 200 |   1.02968349s |      10.7.14.22 | POST     "/api/generate"

@dhiltgen
Collaborator

@jeremytregunna looking back on that screenshot you posted above, I think the problem may be a result of how your cards are plugged into your PCI slots. I believe you have one of the A6000s and the 3070 in the PCIe Gen 4 x16 slots, but the other A6000 is in an older/slower Gen 1 x16 slot. If you put both of the A6000s into the Gen 4 slots and the 3070 into the Gen 1 slot, perhaps things will be selected properly.

@jeremytregunna

@jeremytregunna looking back on that screenshot you posted above, I think the problem may be a result of how your cards are plugged into your PCI slots. I believe you have one of the A6000s and the 3070 in the PCIe Gen 4 x16 slots, but the other A6000 is in an older/slower Gen 1 x16 slot. If you put both of the A6000s into the Gen 4 slots and the 3070 into the Gen 1 slot, perhaps things will be selected properly.

Nope, that's not it, but you are correct in one respect. The second A6000, since it's not being used, is currently at PCIe Gen 1 speeds, but if I select it specifically in some other torch code, it bumps up to PCIe Gen 4 x16 speeds. nvtop right now reports all 3 cards at Gen 1 speeds because nothing is loaded. I can assure you they're all plugged into Gen 4 x16 slots.

@dhiltgen
Collaborator

Can you try setting CUDA_DEVICE_ORDER as well? Options are FASTEST_FIRST or PCI_BUS_ID.
It looks like you can also specify device UUIDs for the visible devices setting, which might help: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#id205

Use nvidia-smi -L to get the UUIDs of your GPUs.

Hopefully some combination of these will get things aligned.
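A quick sketch of the UUID route (the UUIDs below are placeholders; use the ones nvidia-smi prints):

# list GPUs with their UUIDs
nvidia-smi -L
# e.g. "GPU 0: NVIDIA RTX A6000 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)"

# pin the server to specific cards by UUID instead of by index
CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy ollama serve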

@jeremytregunna

Can you try setting CUDA_DEVICE_ORDER as well? Options are FASTEST_FIRST or PCI_BUS_ID. It looks like you can also specify device UUIDs for the visible devices setting, which might help: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#id205

Use nvidia-smi -L to get the UUIDs of your GPUs.

Hopefully some combination of these will get things aligned.

OK, this had an interesting effect. Loading dolphin-mixtral:8x7b-v2.7-q8_0 again, it now splits 50%/50% across the A6000s with FASTEST_FIRST, but it also uses about 1/4 of the memory on the 3070. I can confirm memory usage on all the GPUs is nominal before dolphin-mixtral is loaded. I essentially need to keep the 3070 out of consideration for Ollama entirely, so this won't exactly work, since it'll always be in the mix.

[screenshot attached: 2024-03-20T20:33:46]

@jeremytregunna

@dhiltgen So I tried the explicit UUIDs with CUDA_VISIBLE_DEVICES and that works, but their GPU instance IDs do not. For now, this is resolved, but I am left wondering if Ollama can do better?
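(For anyone replicating this with the systemd unit shown earlier, a sketch of the override, with placeholder UUIDs:)

# create a drop-in override for the service (opens an editor)
sudo systemctl edit ollama
# add, using the real UUIDs from `nvidia-smi -L`:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"
sudo systemctl daemon-reload
sudo systemctl restart ollama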

@Koesn

Koesn commented Mar 25, 2024

@dhiltgen Thank you, CUDA_VISIBLE_DEVICES works. Finally.

@datalee

datalee commented Apr 12, 2024

mark

@datalee

datalee commented Apr 12, 2024

It can also be specified like this:
CUDA_VISIBLE_DEVICES=xx OLLAMA_HOST=0.0.0.0:xxx OLLAMA_MODELS=xxx/ollama_cache ollama serve

@papandadj

Damn. CUDA_VISIBLE_DEVICES works fine for me. Thank you.

@charles-cai

charles-cai commented Apr 30, 2024

@jeremytregunna gpustat --watch looks very cool :)
Ah, it's actually nvtop!

@pykeras

pykeras commented May 8, 2024

Automate/Easy GPU Selection for Ollama

Hi everyone,

I wanted to share a handy script I created for automating GPU selection when running Ollama. You can find the script here. This script allows you to specify which GPU(s) Ollama should utilize, making it easier to manage resources and optimize performance.

How to Use:

  • Download the ollama_gpu_selector.sh script from the gist.
  • Make it executable: chmod +x ollama_gpu_selector.sh.
  • Run the script with administrative privileges: sudo ./ollama_gpu_selector.sh.
  • Follow the prompts to select the GPU(s) for Ollama.

Additionally, I've included aliases in the gist for easier switching between GPU selections. Feel free to customize these aliases to suit your preferences.

If you encounter any issues or have suggestions for improvement, please let me know! I hope this script helps streamline your Ollama workflow.

Happy coding!
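(Not reproduced from the gist, but for illustration, shell aliases along these lines make switching quick:)

alias ollama-gpu0='CUDA_VISIBLE_DEVICES=0 ollama serve'
alias ollama-gpu1='CUDA_VISIBLE_DEVICES=1 ollama serve'
alias ollama-all='ollama serve'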

@emourdavid

[quotes @pykeras's "Automate/Easy GPU Selection for Ollama" comment above]

Thank you, I can run this successfully.
