
Integrated GPU support #2637

Open
DocMAX opened this issue Feb 21, 2024 · 104 comments
Labels: amd (Issues relating to AMD GPUs and ROCm), feature request (New feature or request)



DocMAX commented Feb 21, 2024

Opening a new issue (see #2195) to track support for integrated GPUs. I have an AMD 5800U CPU with integrated graphics. As far as I can tell from my research, recent ROCR releases do support integrated graphics too.

Currently Ollama seems to ignore iGPUs in general.

@GZGavinZhao

ROCm's support for integrated GPUs is not that good. This issue may largely depend on AMD's progress in improving ROCm.


DocMAX commented Feb 22, 2024

OK, but I would like an option to enable it, just to check whether it works.


DocMAX commented Feb 22, 2024

This is what I get with the new Docker image (ROCm support). It detects a Radeon and then says no GPU detected?!?

[two screenshots of the log output omitted]

@GZGavinZhao

Their AMDDetected() function is a bit broken and I haven't figured out a fix for it.

@sid-cypher

I've seen this behavior in #2411, but only with the version from ollama.com.
Try it with the latest released binary?
https://github.com/ollama/ollama/releases/tag/v0.1.27

@GZGavinZhao

Yes, latest release fixed this behavior.


DocMAX commented Feb 23, 2024

I had a permission issue with LXC/Docker. Now:

time=2024-02-23T19:27:29.715Z level=INFO source=images.go:710 msg="total blobs: 31"
time=2024-02-23T19:27:29.716Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-23T19:27:29.717Z level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.1.27)"
time=2024-02-23T19:27:29.717Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-23T19:27:33.385Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx rocm_v6 rocm_v5 cuda_v11 cpu_avx2]"
time=2024-02-23T19:27:33.385Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-23T19:27:33.385Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-23T19:27:33.387Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-23T19:27:33.387Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-23T19:27:33.388Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.5.0.50701 /opt/rocm-5.7.1/lib/librocm_smi64.so.5.0.50701]"
time=2024-02-23T19:27:33.391Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-23T19:27:33.391Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-23T19:27:33.391Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-23T19:27:33.392Z level=INFO source=routes.go:1042 msg="no GPU detected"

So, as the title says, please add integrated GPU support (AMD 5800U here).


robertvazan commented Feb 24, 2024

The latest (0.1.27) Docker image with ROCm works for me on a Ryzen 5600G with an 8GB VRAM allocation. Prompt processing is 2x faster than on the CPU, and generation runs at max speed even if the CPU is busy running other processes. I am on Fedora 39.

Container setup:

  • HSA_OVERRIDE_GFX_VERSION=9.0.0
  • HCC_AMDGPU_TARGETS=gfx900 (unnecessary)
  • share devices: /dev/dri/card1, /dev/dri/renderD128, /dev/dri, /dev/kfd
  • additional options: --group-add video --security-opt seccomp:unconfined (unnecessary)

It's still shaky, however:

  • With top_k=1, output should be fully reproducible, but the first iGPU generation differs from the following ones for the same prompt. Both the first and following iGPU generations differ from what the CPU produces. The differences are minor, though.
  • Output is sometimes garbage on iGPU as if the prompt is ignored. Restarting ollama fixes the problem.
  • Ollama often fails to offload all layers to the iGPU when switching models, reporting low VRAM as if parts of the previous model are still in VRAM. Restarting ollama fixes the problem for a while.
  • Partial offload with a 13B model works, but Mixtral is broken. It just hangs.

@robertvazan

See also discussion in the #738 epic.


DocMAX commented Feb 24, 2024

Why does it work for you??
Still not working here.

services:
  ollama:
    #image: ollama/ollama:latest
    image: ollama/ollama:0.1.27-rocm
    container_name: ollama
    volumes:
      - data:/root/.ollama
    restart: unless-stopped
    devices:
      - /dev/dri
      - /dev/kfd
    security_opt:
      - "seccomp:unconfined"
    group_add:
      - video
    environment:
      - 'HSA_OVERRIDE_GFX_VERSION=9.0.0'
      - 'HCC_AMDGPU_TARGETS=gfx900'
time=2024-02-24T10:16:09.280Z level=INFO source=images.go:710 msg="total blobs: 31"
time=2024-02-24T10:16:09.284Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-24T10:16:09.285Z level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.1.27)"
time=2024-02-24T10:16:09.285Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-24T10:16:12.184Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v5 rocm_v6 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-02-24T10:16:12.184Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-24T10:16:12.184Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-24T10:16:12.188Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-24T10:16:12.188Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-24T10:16:12.189Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.5.0.50701 /opt/rocm-5.7.1/lib/librocm_smi64.so.5.0.50701]"
time=2024-02-24T10:16:12.191Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-24T10:16:12.191Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-24T10:16:12.192Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-24T10:16:12.192Z level=INFO source=routes.go:1042 msg="no GPU detected"

The non-Docker version doesn't work either...

root@ollama:~# HCC_AMDGPU_TARGETS=gfx900 HSA_OVERRIDE_GFX_VERSION=9.0.0 LD_LIBRARY_PATH=/usr/lib ollama serve
time=2024-02-24T10:40:14.582Z level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-02-24T10:40:14.582Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-24T10:40:14.583Z level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-02-24T10:40:14.583Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-24T10:40:17.691Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 rocm_v6 cuda_v11 rocm_v5 cpu]"
time=2024-02-24T10:40:17.691Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-24T10:40:17.691Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-24T10:40:17.692Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-24T10:40:17.692Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-24T10:40:17.693Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/librocm_smi64.so.1.0]"
time=2024-02-24T10:40:17.696Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-24T10:40:17.696Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-24T10:40:17.696Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-24T10:40:17.696Z level=INFO source=routes.go:1042 msg="no GPU detected"
root@ollama:~# rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 7 5800H with Radeon Graphics
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 7 5800H with Radeon Graphics
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4463
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            16
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx90c
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      1024(0x400) KB
  Chip ID:                 5688(0x1638)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   1536
  Internal Node ID:        1
  Compute Unit:            8
  SIMDs per CU:            4
  Shader Engines:          1
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

@dhiltgen please have a look


DocMAX commented Feb 24, 2024

And by the way, there is no /sys/module/amdgpu/version on my system. You have to correct the code.
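
For illustration, a more defensive version of that check might look like this Go sketch (purely hypothetical; the real detection code in ollama may be structured differently):

package main

import (
	"fmt"
	"os"
	"strings"
)

// amdgpuVersion reads /sys/module/amdgpu/version if it exists. On some
// kernels the file is absent even though the amdgpu module is loaded, so a
// missing file should trigger a fallback rather than a hard "no GPU" result.
func amdgpuVersion() (string, bool) {
	b, err := os.ReadFile("/sys/module/amdgpu/version")
	if err != nil {
		return "", false // fall back to other detection paths
	}
	return strings.TrimSpace(string(b)), true
}

func main() {
	if v, ok := amdgpuVersion(); ok {
		fmt.Println("amdgpu version:", v)
	} else {
		fmt.Println("version file missing; falling back to other checks")
	}
}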

@robertvazan

ROCm unsupported integrated GPU detected

Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.
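
For reference, the gate that produces this behaviour boils down to a simple threshold check. A minimal Go sketch, based on the patch quoted near the end of this thread (the exact value of IGPUMemLimit is an assumption, inferred from the "less than 1GB" cutoff):

package main

import "fmt"

// IGPUMemLimit is assumed to be 1 GiB here, matching the observed cutoff.
const IGPUMemLimit = 1 << 30 // bytes

// igpuUsable mirrors the skip logic: iGPUs reporting less VRAM than the
// limit are ignored entirely.
func igpuUsable(totalMemory uint64) bool {
	return totalMemory >= IGPUMemLimit
}

func main() {
	fmt.Println(igpuUsable(512 << 20)) // false: the 512MB default allocation is skipped
	fmt.Println(igpuUsable(8 << 30))   // true: an 8GB BIOS allocation passes
}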


DocMAX commented Feb 24, 2024

Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.

Thanks, I will check whether I can do that.
But the normal behaviour for an iGPU should be to request more VRAM as needed.


robertvazan commented Feb 24, 2024

But the normal behaviour for an iGPU should be to request more VRAM as needed.

Why do you think so? Where is it documented? Mine maxes at 512MB unless I explicitly configure it in BIOS.

@sid-cypher

Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.

Detecting and using this VRAM information without sharing with the user the reason for the iGPU rejection leads to "missing support" issues being opened, rather than "increase my VRAM allocation" steps taken. I think the log output should be improved in this case. This task would probably qualify for a "good first issue" tag, too.
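
For illustration, a more actionable message could look something like this Go sketch (the wording and fields are hypothetical; the real code logs through slog, as the patch near the end of this thread shows):

package main

import "log/slog"

// logSkippedIGPU is a hypothetical replacement for the bare "ROCm
// unsupported integrated GPU detected" line: it states why the iGPU was
// rejected and what the user can do about it.
func logSkippedIGPU(gpuID string, totalMemory, limit uint64) {
	slog.Info("iGPU skipped: reported VRAM below minimum",
		"id", gpuID,
		"total_bytes", totalMemory,
		"required_bytes", limit,
		"hint", "raise the iGPU VRAM (UMA frame buffer) allocation in the BIOS")
}

func main() {
	logSkippedIGPU("0", 512<<20, 1<<30)
}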


DocMAX commented Feb 24, 2024

Totally agree!

@chiragkrishna

I have two systems.
The Ryzen 5500U system always gets stuck here. I've allotted 4GB of VRAM for it in the BIOS; that's the max.

export HSA_OVERRIDE_GFX_VERSION=9.0.0
export HCC_AMDGPU_TARGETS=gfx900

llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   703.44 MiB
llm_load_tensors:        CPU buffer size =    35.44 MiB

Building with:

export CGO_CFLAGS="-g"
export AMDGPU_TARGETS="gfx1030;gfx900"
go generate ./...
go build .

My 6750 XT system works perfectly.


DocMAX commented Feb 24, 2024

But the normal behaviour for an iGPU should be to request more VRAM as needed.

Why do you think so? Where is it documented? Mine maxes at 512MB unless I explicitly configure it in BIOS.

OK, I was wrong. It works now with 8GB VRAM, thank you!

discovered 1 ROCm GPU Devices
[0] ROCm device name: Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
[0] ROCm brand: Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
[0] ROCm vendor: Advanced Micro Devices, Inc. [AMD/ATI]
[0] ROCm VRAM vendor: unknown
[0] ROCm S/N: 
[0] ROCm subsystem name: 0x123
[0] ROCm vbios version: 113-CEZANNE-018
[0] ROCm totalMem 8589934592
[0] ROCm usedMem 25907200
time=2024-02-24T18:27:14.013Z level=DEBUG source=gpu.go:254 msg="rocm detected 1 devices with 7143M available memory"


DocMAX commented Feb 24, 2024

Hmm, I see the model loaded into VRAM, but nothing happens...

llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB


DocMAX commented Feb 24, 2024

Do I need a different amdgpu module on the host than the one that ships with the kernel (6.7.6)?

@sid-cypher

Do I need a different amdgpu module on the host than the one that ships with the kernel (6.7.6)?

Maybe, ROCm/ROCm#816 seems relevant. I'm just using AMD-provided DKMS modules from https://repo.radeon.com/amdgpu/6.0.2/ubuntu to be sure.


DocMAX commented Feb 24, 2024

Hmm, the tinyllama model does work with the 5800U. The bigger ones get stuck, as I mentioned before.
Edit: Codellama works too.

@chiragkrishna

I added "-DLLAMA_HIP_UMA=ON" to "ollama/llm/generate/gen_linux.sh":

CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"

Now it's stuck here:

llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   809.59 MiB
llm_load_tensors:        CPU buffer size =    51.27 MiB
...............................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =    44.00 MiB
llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =     9.02 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   148.01 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     4.00 MiB
llama_new_context_with_model: graph splits (measure): 3
[1708857011] warming up the model with an empty run

@robertvazan

iGPUs indeed do allocate system RAM on demand. It's called GTT/GART. Here's what I get when I run sudo dmesg | grep "M of" on my system with 32GB RAM:

If I set VRAM to Auto in BIOS:

[    4.654736] [drm] amdgpu: 512M of VRAM memory ready
[    4.654737] [drm] amdgpu: 15688M of GTT memory ready.

If I set VRAM to 8GB in BIOS:

[    4.670921] [drm] amdgpu: 8192M of VRAM memory ready
[    4.670923] [drm] amdgpu: 11908M of GTT memory ready.

If I set VRAM to 16GB in BIOS:

[    4.600060] [drm] amdgpu: 16384M of VRAM memory ready
[    4.600062] [drm] amdgpu: 7888M of GTT memory ready.

It looks like GTT size is 0.5*(RAM-VRAM). I wonder how far this can go if you have 64GB or 96GB of RAM. Can you have an iGPU with 32GB or 48GB of GTT memory? That would make a $200 APU with $200 of DDR5 RAM superior to a $2,000 dGPU for running Mixtral and future sparse models. I also wonder whether any BIOS offers a 32GB VRAM setting if you have 64GB of RAM.
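
As a quick sanity check, here is that formula evaluated against the dmesg numbers above (a small Go snippet; the consistent shortfall of a few hundred MB is presumably kernel/firmware reservations):

package main

import "fmt"

func main() {
	const ramMiB = 32 * 1024 // 32GB system
	cases := []struct {
		setting string
		vramMiB int // VRAM set in BIOS
		gttMiB  int // GTT reported by dmesg
	}{
		{"Auto", 512, 15688},
		{"8GB", 8192, 11908},
		{"16GB", 16384, 7888},
	}
	for _, c := range cases {
		expected := (ramMiB - c.vramMiB) / 2 // 0.5*(RAM-VRAM)
		fmt.Printf("%-5s expected %5dM, dmesg reports %5dM\n", c.setting, expected, c.gttMiB)
	}
}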

Unfortunately, ROCm does not use GTT. That thread mentions several workarounds (torch-apu-helper, force-host-alloction-APU, Rusticl, unlock VRAM allocation), but I am not sure whether Ollama would be able to use any of them. Chances are highest in a Docker container, where Ollama has the greatest control over dependencies.


DocMAX commented Feb 25, 2024

Very cool findings. Interesting that you mention 96GB. I did some research, and it seems that's the max we can buy right now for SO-DIMMs. I wasn't aware it's called GTT. Let's hope we get support for this someday.
If the host can't handle GTT for ROCm, then I doubt Docker can do anything about it.

https://github.com/segurac/force-host-alloction-APU looks like the best solution to me, if it works. I will try it in my Docker containers...

[So Feb 25 21:31:38 2024] [drm] amdgpu: 512M of VRAM memory ready
[So Feb 25 21:31:38 2024] [drm] amdgpu: 31844M of GTT memory ready.

This is how much I would get :-) (64GB system)


DocMAX commented Feb 25, 2024

OK, it doesn't work with Ollama. Ollama doesn't use PyTorch, right? I wasn't aware of that.

@chiragkrishna

llama.cpp supports it; that's what I was trying to do in my previous post: Support AMD Ryzen Unified Memory Architecture (UMA)


robertvazan commented Feb 26, 2024

@chiragkrishna Do you mean this? ggerganov/llama.cpp#4449

Since llama.cpp already supports UMA (GTT/GART), Ollama could perhaps include a llama.cpp build with UMA enabled and use it when the conditions are right (AMD iGPU with VRAM smaller than the model).

PS: UMA support seems a bit unstable, so perhaps enable it with an environment variable at first.
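
A sketch of how that gating could look on Ollama's side (entirely hypothetical; the OLLAMA_HIP_UMA variable and the runner names are invented for illustration):

package main

import (
	"fmt"
	"os"
)

// pickROCmRunner prefers a llama.cpp build compiled with -DLLAMA_HIP_UMA=ON
// only when the user opts in and an AMD iGPU cannot hold the model in VRAM.
func pickROCmRunner(isIGPU bool, vramBytes, modelBytes uint64) string {
	if os.Getenv("OLLAMA_HIP_UMA") == "1" && isIGPU && vramBytes < modelBytes {
		return "rocm_uma"
	}
	return "rocm"
}

func main() {
	fmt.Println(pickROCmRunner(true, 4<<30, 40<<30))
}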


DocMAX commented Feb 26, 2024

How does the env thing work? Like this? (It doesn't do anything, by the way.)
LLAMA_HIP_UMA=1 HSA_OVERRIDE_GFX_VERSION=9.0.0 HCC_AMDGPU_TARGETS==gfx900 ollama start

@robertvazan

@DocMAX I don't think there's UMA support in ollama yet. It's a compile-time option in llama.cpp. The other env variables (HSA_OVERRIDE_GFX_VERSION was sufficient in my experiments) are correctly passed down to ROCm.


DocMAX commented May 20, 2024

I'm using the Docker image ollama/ollama:rocm.


arilou commented May 20, 2024

What I suggested is a change to Ollama, OLLAMA_VRAM_OVERRIDE; it is not part of Ollama today...


DocMAX commented May 20, 2024

OK, then I have to wait for the Docker version, because I want to stay on Docker.


qkiel commented May 20, 2024

Not sure I'm following; if you use the VRAM override env and also libforcegttalloc.so to expose your entire system memory, it won't show this print. The code verifies you have at least 1GB of RAM... pretty sure your system memory has more than 1GB

Curious question: why do you use libforcegttalloc.so with ollama? Isn't it only intended for applications that require PyTorch? Without LD_PRELOAD, everything should work exactly the same.


arilou commented May 20, 2024

Well, the reason is that if you look at what happens when you compile ollama/llama.cpp (even with LLAMA_HIP_UMA=on), it charges allocations against VRAM (you can watch this with radeontop/amdgpu_top). That limits you to the amount of VRAM you can assign in the BIOS (the max is 16GB). But say my system has 64GB of memory; there is no reason I shouldn't be able to load much larger models, like 50GB. After all, there is no "real" meaning to those 16GB being VRAM, they sit on the same DIMMs... so with the trick done in libforcegttalloc.so you basically charge the memory to the OS instead. Since we are all here on APUs, it's the same physical memory; we just need to go through HIP so the iGPU understands it can access those pointers directly.

So with this trick you can load much bigger models and "steal" less memory from your system for your GPU.

For example, I loaded llama3:70b-instruct-q4_K_M, which is about 40GB, and I still get 0.8 tps, which is fairly OK for the power of our iGPU...


qkiel commented May 20, 2024

Well, the reason is that if you look at what happens when you compile ollama/llama.cpp (even with LLAMA_HIP_UMA=on), it charges allocations against VRAM [...]

Interesting. I have an AMD 5600G APU with UMA_AUTO set in the UEFI/BIOS (which means 512 MB is taken from my RAM for VRAM). On my Ubuntu 22.04 system, libforcegttalloc.so is required only for Stable Diffusion apps like Fooocus.

Running ollama with or without LD_PRELOAD makes no difference in my case. VRAM is kept at 512 MB, models are loaded into RAM, and the compute is done on the GPU.

Have you tried running ollama without LD_PRELOAD?


arilou commented May 20, 2024

Interesting. For me it crashes if I try to load a model bigger than the allocated VRAM. I wonder if it's an issue because the default ROCm in Fedora 40 is 6.0.

@Jonnybravo

Hey, I also have a 5600G and I wanted to make use of it. I read the whole thread, but I'm confused about which steps I should follow to change the build in order to make this work with this iGPU.

Is there a pre-compiled version that everyone can use? I tried to follow @qkiel's steps, but it fails miserably when I try to compile and build using Go...


qkiel commented May 20, 2024

Hey, I also have a 5600G and I wanted to make use of it. I read the whole thread, but I'm confused about which steps I should follow to change the build in order to make this work with this iGPU.

Is there a pre-compiled version that everyone can use? I tried to follow @qkiel's steps, but it fails miserably when I try to compile and build using Go...

When you download the source code and compile it with the commands below, do you still get an error?

git clone --depth 1 --branch v0.1.38 https://github.com/ollama/ollama
cd ollama
go generate ./...
go build .

@Jonnybravo

Last time I ran the generate command I got this:

[screenshot of the error omitted]

I didn't try anything else after this. Before I got here, the generate command would give me an error related to the wrong version of Go being installed.


qkiel commented May 21, 2024

I updated my instructions a bit; see if it works this time. If not, I can send you my binary.


Jonnybravo commented May 21, 2024

I followed everything again and made sure about the versions of the requirements. This time I managed to pass the generate step, but it seems I have a problem with the GOROOT path when I try to run the build command:

[screenshot of the error omitted]

Could it be because Go is installed in a custom folder?

EDIT: Meanwhile, I tried to point both variables to different paths, but now I have an error on GOPROXY. @qkiel, did you also experience this? What are your paths for the variables GOROOT, GOPATH, and GOPROXY?


qkiel commented May 21, 2024

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help

@Jonnybravo

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help

I did that and I believe I'm still running it using the CPU. Is there a way to confirm that I'm running this using the GPU?


qkiel commented May 21, 2024

Look at GPU utilization. I use nvtop for that (also available as a snap):

sudo apt install nvtop


Jonnybravo commented May 21, 2024

[screenshot omitted]

Installed it, but I think I messed up the ROCm installation. Do I need to do some extra step besides this?

sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb
sudo apt install ./amdgpu-install_6.1.60101-1_all.deb

If it helps, I'm using Windows with WSL.


qkiel commented May 21, 2024

Unfortunately, there is no equivalent of the HSA_OVERRIDE_GFX_VERSION environment variable on Windows, so you cannot present your iGPU to ROCm as supported.

Secondly, you install ROCm differently on Windows.

I don't think it can be done on Windows the same way as on Linux.

Edit:

sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb
sudo apt install ./amdgpu-install_6.1.60101-1_all.deb

Besides that, you can use this to install ROCm:

sudo amdgpu-install --usecase=rocm --no-dkms

But chances of success are very slim.

@Jonnybravo

Yeah, I tried it and got the same problem mentioned on this thread: ROCm/ROCm#3051

And what about a virtual machine running Linux, @qkiel? Do you think that could work, or am I stretching too much here?


qkiel commented May 22, 2024

I have no idea. If you have a regular GPU, then you can pass the iGPU to the VM and that could work. I don't think the 5600G supports SR-IOV, so you can't partition the iGPU and pass only part of it.


xwry commented May 26, 2024

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help

I did that and I believe I'm still running it using the CPU. Is there a way to confirm that I'm running this using the GPU?

You can try radeontop; it works fine on AMD iGPUs, and the -c flag adds colorized output.
sudo apt install radeontop

@smellouk

@qkiel thanks for this tip 🙏
I followed the steps as you described, but I'm facing this error:

Error: llama runner process has terminated: signal: aborted error:Could not initialize Tensile host: No devices found

My current setup:

  • Proxmox running on host machine
  • LXC ubuntu22.04
  • Shared GPU

What is crazy now is that if I install Docker in this LXC and run docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama --device=/dev/kfd --device=/dev/dri/renderD128 --env HSA_OVERRIDE_GFX_VERSION=9.0.0 --env HSA_ENABLE_SDMA=0 ollama/ollama:rocm, everything works great. More details here: #5143 (comment)


qkiel commented Jun 20, 2024

@smellouk I have tutorials on how I install ROCm and Ollama in Incus containers (a fork of LXD).

Do you do this similarly or somehow differently?


smellouk commented Jun 21, 2024

@qkiel I used that article, and I just noticed you are the author 😆; that AI tutorial, ROCm and PyTorch on AMD APU or GPU, is what led me here. I followed everything and have the same issue 😢


qkiel commented Jun 21, 2024

When you run this command, what do you see?

ls -alF /dev/dri

Do card0 and renderD128 belong to the video or render group?

crw-rw----  1 root video 226,   0 cze 21 22:07 card0
crw-rw----  1 root video 226, 128 cze 21 22:07 renderD128

If they belong to root:root, that means you didn't set a proper gid when adding the GPU device to the container. For Ubuntu containers, that would be:

incus config device add <container_name> gpu gpu gid=44

Or your user inside the container doesn't belong to the video and render groups. For Ubuntu containers, that would be (this requires a container restart to take effect):

sudo usermod -a -G render,video ubuntu

@smellouk

@qkiel the permissions are correct, as expected.


arilou commented Jun 25, 2024

I added the following patch to ollama:

diff --git a/gpu/amd_linux.go b/gpu/amd_linux.go
index 6b08ac2..579186b 100644
--- a/gpu/amd_linux.go
+++ b/gpu/amd_linux.go
@@ -229,6 +229,15 @@ func AMDGetGPUInfo() []GpuInfo {
 		}

 		// iGPU detection, remove this check once we can support an iGPU variant of the rocm library
+		if override, exists := os.LookupEnv("OLLAMA_VRAM_OVERRIDE"); exists {
+			// Convert the environment variable to an integer
+			if value, err := strconv.ParseUint(override, 10, 64); err == nil {
+				totalMemory = value
+			} else {
+				fmt.Println("Error parsing OLLAMA_VRAM_OVERRIDE:", err)
+			}
+		}
+
 		if totalMemory < IGPUMemLimit {
 			slog.Info("unsupported Radeon iGPU detected skipping", "id", gpuID, "total", format.HumanBytes2(totalMemory))
 			continue

@dhiltgen perhaps you want to consider adding this patch to Ollama? I don't have an NVIDIA machine to test with and do the same for CUDA (or whatever Intel has/will have), but I know it works well for AMD.
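
With the patch applied, the override would presumably be set like OLLAMA_VRAM_OVERRIDE=8589934592 ollama serve (assuming the value is in bytes, since it is parsed straight into totalMemory; 8589934592 is 8 GiB and purely illustrative).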
