
Integrated GPU support #2637

Open
DocMAX opened this issue Feb 21, 2024 · 104 comments
Labels: amd (Issues relating to AMD GPUs and ROCm), feature request (New feature or request)



DocMAX commented Feb 21, 2024

Opening a new issue (see #2195) to track support for integrated GPUs. I have an AMD 5800U CPU with integrated graphics. As far as I can tell from my research, recent ROCR releases do support integrated graphics too.

Currently Ollama seems to ignore iGPUs in general.

@GZGavinZhao

ROCm's support for integrated GPUs is not that good. This issue may largely depend on AMD's progress in improving ROCm.


DocMAX commented Feb 22, 2024

OK, but I would like an option to enable it, just to check whether it works.


DocMAX commented Feb 22, 2024

This is what I get with the new Docker image (ROCm support). It detects a Radeon and then says no GPU detected?!?

[two screenshots of the log output omitted]

@GZGavinZhao

Their AMDDetected() function is a bit broken and I haven't figured out a fix for it.

@sid-cypher

I've seen this behavior in #2411, but only with the version from ollama.com.
Try it with the latest released binary?
https://github.com/ollama/ollama/releases/tag/v0.1.27

@GZGavinZhao

Yes, latest release fixed this behavior.


DocMAX commented Feb 23, 2024

I had a permission issue with LXC/Docker. Now:

time=2024-02-23T19:27:29.715Z level=INFO source=images.go:710 msg="total blobs: 31"
time=2024-02-23T19:27:29.716Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-23T19:27:29.717Z level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.1.27)"
time=2024-02-23T19:27:29.717Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-23T19:27:33.385Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx rocm_v6 rocm_v5 cuda_v11 cpu_avx2]"
time=2024-02-23T19:27:33.385Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-23T19:27:33.385Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-23T19:27:33.387Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-23T19:27:33.387Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-23T19:27:33.388Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.5.0.50701 /opt/rocm-5.7.1/lib/librocm_smi64.so.5.0.50701]"
time=2024-02-23T19:27:33.391Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-23T19:27:33.391Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-23T19:27:33.391Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-23T19:27:33.392Z level=INFO source=routes.go:1042 msg="no GPU detected"

So, as the title says, please add integrated GPU support (AMD 5800U here).


robertvazan commented Feb 24, 2024

The latest (0.1.27) Docker image with ROCm works for me on a Ryzen 5600G with an 8GB VRAM allocation. Prompt processing is 2x faster than on the CPU, and generation runs at max speed even if the CPU is busy running other processes. I am on Fedora 39.

Container setup:

  • HSA_OVERRIDE_GFX_VERSION=9.0.0
  • HCC_AMDGPU_TARGETS=gfx900 (unnecessary)
  • share devices: /dev/dri/card1, /dev/dri/renderD128, /dev/dri, /dev/kfd
  • additional options: --group-add video --security-opt seccomp:unconfined (unnecessary)

It's still shaky, however:

  • With top_k=1, output should be fully reproducible, but the first iGPU generation differs from the following ones for the same prompt. Both the first and following iGPU generations differ from what the CPU produces. The differences are minor, though.
  • Output is sometimes garbage on iGPU as if the prompt is ignored. Restarting ollama fixes the problem.
  • Ollama often fails to offload all layers to the iGPU when switching models, reporting low VRAM as if parts of the previous model are still in VRAM. Restarting ollama fixes the problem for a while.
  • Partial offload with a 13B model works, but Mixtral is broken. It just hangs.

@robertvazan

See also discussion in the #738 epic.


DocMAX commented Feb 24, 2024

Why does it work for you??
Still not working here.

services:
  ollama:
    #image: ollama/ollama:latest
    image: ollama/ollama:0.1.27-rocm
    container_name: ollama
    volumes:
      - data:/root/.ollama
    restart: unless-stopped
    devices:
      - /dev/dri
      - /dev/kfd
    security_opt:
      - "seccomp:unconfined"
    group_add:
      - video
    environment:
      - 'HSA_OVERRIDE_GFX_VERSION=9.0.0'
      - 'HCC_AMDGPU_TARGETS=gfx900'
time=2024-02-24T10:16:09.280Z level=INFO source=images.go:710 msg="total blobs: 31"
time=2024-02-24T10:16:09.284Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-24T10:16:09.285Z level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.1.27)"
time=2024-02-24T10:16:09.285Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-24T10:16:12.184Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v5 rocm_v6 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-02-24T10:16:12.184Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-24T10:16:12.184Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-24T10:16:12.188Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-24T10:16:12.188Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-24T10:16:12.189Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.5.0.50701 /opt/rocm-5.7.1/lib/librocm_smi64.so.5.0.50701]"
time=2024-02-24T10:16:12.191Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-24T10:16:12.191Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-24T10:16:12.192Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-24T10:16:12.192Z level=INFO source=routes.go:1042 msg="no GPU detected"

The non-Docker version doesn't work either...

root@ollama:~# HCC_AMDGPU_TARGETS=gfx900 HSA_OVERRIDE_GFX_VERSION=9.0.0 LD_LIBRARY_PATH=/usr/lib ollama serve
time=2024-02-24T10:40:14.582Z level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-02-24T10:40:14.582Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-24T10:40:14.583Z level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-02-24T10:40:14.583Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-24T10:40:17.691Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 rocm_v6 cuda_v11 rocm_v5 cpu]"
time=2024-02-24T10:40:17.691Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-24T10:40:17.691Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-24T10:40:17.692Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-24T10:40:17.692Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-24T10:40:17.693Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/librocm_smi64.so.1.0]"
time=2024-02-24T10:40:17.696Z level=INFO source=gpu.go:109 msg="Radeon GPU detected"
time=2024-02-24T10:40:17.696Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-24T10:40:17.696Z level=INFO source=gpu.go:181 msg="ROCm unsupported integrated GPU detected"
time=2024-02-24T10:40:17.696Z level=INFO source=routes.go:1042 msg="no GPU detected"
root@ollama:~# rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 7 5800H with Radeon Graphics
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 7 5800H with Radeon Graphics
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4463
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            16
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65216764(0x3e320fc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx90c
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      1024(0x400) KB
  Chip ID:                 5688(0x1638)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   1536
  Internal Node ID:        1
  Compute Unit:            8
  SIMDs per CU:            4
  Shader Engines:          1
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    524288(0x80000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

@dhiltgen please have a look


DocMAX commented Feb 24, 2024

And by the way, there is no /sys/module/amdgpu/version on my system. You have to correct the code.
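
For illustration, a more defensive version of that check might look like this Go sketch (purely hypothetical; the real detection code in ollama may be structured differently):

package main

import (
	"fmt"
	"os"
	"strings"
)

// amdgpuVersion reads /sys/module/amdgpu/version if it exists. On some
// kernels the file is absent even though the amdgpu module is loaded, so a
// missing file should trigger a fallback rather than a hard "no GPU" result.
func amdgpuVersion() (string, bool) {
	b, err := os.ReadFile("/sys/module/amdgpu/version")
	if err != nil {
		return "", false // fall back to other detection paths
	}
	return strings.TrimSpace(string(b)), true
}

func main() {
	if v, ok := amdgpuVersion(); ok {
		fmt.Println("amdgpu version:", v)
	} else {
		fmt.Println("version file missing; falling back to other checks")
	}
}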

@robertvazan

ROCm unsupported integrated GPU detected

Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.
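
For reference, the gate that produces this behaviour boils down to a simple threshold check. A minimal Go sketch, based on the patch quoted near the end of this thread (the exact value of IGPUMemLimit is an assumption, inferred from the "less than 1GB" cutoff):

package main

import "fmt"

// IGPUMemLimit is assumed to be 1 GiB here, matching the observed cutoff.
const IGPUMemLimit = 1 << 30 // bytes

// igpuUsable mirrors the skip logic: iGPUs reporting less VRAM than the
// limit are ignored entirely.
func igpuUsable(totalMemory uint64) bool {
	return totalMemory >= IGPUMemLimit
}

func main() {
	fmt.Println(igpuUsable(512 << 20)) // false: the 512MB default allocation is skipped
	fmt.Println(igpuUsable(8 << 30))   // true: an 8GB BIOS allocation passes
}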


DocMAX commented Feb 24, 2024

Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.

Thanks, I will check whether I can do that.
But the normal behaviour for an iGPU should be to request more VRAM as needed.


robertvazan commented Feb 24, 2024

But the normal behaviour for an iGPU should be to request more VRAM as needed.

Why do you think so? Where is it documented? Mine maxes at 512MB unless I explicitly configure it in BIOS.

@sid-cypher

Ollama skipped the iGPU, because it has less than 1GB of VRAM. You have to configure VRAM allocation for the iGPU in BIOS to something like 8GB.

Detecting and using this VRAM information without sharing with the user the reason for the iGPU rejection leads to "missing support" issues being opened, rather than "increase my VRAM allocation" steps taken. I think the log output should be improved in this case. This task would probably qualify for a "good first issue" tag, too.
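
For illustration, a more actionable message could look something like this Go sketch (the wording and fields are hypothetical; the real code logs through slog, as the patch near the end of this thread shows):

package main

import "log/slog"

// logSkippedIGPU is a hypothetical replacement for the bare "ROCm
// unsupported integrated GPU detected" line: it states why the iGPU was
// rejected and what the user can do about it.
func logSkippedIGPU(gpuID string, totalMemory, limit uint64) {
	slog.Info("iGPU skipped: reported VRAM below minimum",
		"id", gpuID,
		"total_bytes", totalMemory,
		"required_bytes", limit,
		"hint", "raise the iGPU VRAM (UMA frame buffer) allocation in the BIOS")
}

func main() {
	logSkippedIGPU("0", 512<<20, 1<<30)
}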


DocMAX commented Feb 24, 2024

Totally agree!

@chiragkrishna

I have two systems.
The Ryzen 5500U system always gets stuck here. I've allotted 4GB of VRAM for it in the BIOS; that's the max.

export HSA_OVERRIDE_GFX_VERSION=9.0.0
export HCC_AMDGPU_TARGETS=gfx900

llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   703.44 MiB
llm_load_tensors:        CPU buffer size =    35.44 MiB

Building with:

export CGO_CFLAGS="-g"
export AMDGPU_TARGETS="gfx1030;gfx900"
go generate ./...
go build .

My 6750 XT system works perfectly.


DocMAX commented Feb 24, 2024

But the normal behaviour for an iGPU should be to request more VRAM as needed.

Why do you think so? Where is it documented? Mine maxes at 512MB unless I explicitly configure it in BIOS.

OK, I was wrong. It works now with 8GB VRAM, thank you!

discovered 1 ROCm GPU Devices
[0] ROCm device name: Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
[0] ROCm brand: Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
[0] ROCm vendor: Advanced Micro Devices, Inc. [AMD/ATI]
[0] ROCm VRAM vendor: unknown
[0] ROCm S/N: 
[0] ROCm subsystem name: 0x123
[0] ROCm vbios version: 113-CEZANNE-018
[0] ROCm totalMem 8589934592
[0] ROCm usedMem 25907200
time=2024-02-24T18:27:14.013Z level=DEBUG source=gpu.go:254 msg="rocm detected 1 devices with 7143M available memory"


DocMAX commented Feb 24, 2024

Hmm, I see the model loaded into VRAM, but nothing happens...

llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB


DocMAX commented Feb 24, 2024

Do I need a different amdgpu module on the host than the one that ships with the kernel (6.7.6)?

@sid-cypher

Do I need a different amdgpu module on the host than the one that ships with the kernel (6.7.6)?

Maybe, ROCm/ROCm#816 seems relevant. I'm just using AMD-provided DKMS modules from https://repo.radeon.com/amdgpu/6.0.2/ubuntu to be sure.


DocMAX commented Feb 24, 2024

Hmm, the tinyllama model does work with the 5800U. The bigger ones get stuck, as I mentioned before.
Edit: Codellama works too.

@chiragkrishna

I added "-DLLAMA_HIP_UMA=ON" to "ollama/llm/generate/gen_linux.sh":

CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS} -DLLAMA_HIPBLAS=on -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DCMAKE_CXX_COMPILER=$ROCM_PATH/llvm/bin/clang++ -DAMDGPU_TARGETS=$(amdGPUs) -DGPU_TARGETS=$(amdGPUs)"

Now it's stuck here:

llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   809.59 MiB
llm_load_tensors:        CPU buffer size =    51.27 MiB
...............................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =    44.00 MiB
llama_new_context_with_model: KV self size  =   44.00 MiB, K (f16):   22.00 MiB, V (f16):   22.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =     9.02 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   148.01 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     4.00 MiB
llama_new_context_with_model: graph splits (measure): 3
[1708857011] warming up the model with an empty run

@robertvazan

iGPUs indeed do allocate system RAM on demand. It's called GTT/GART. Here's what I get when I run sudo dmesg | grep "M of" on my system with 32GB RAM:

If I set VRAM to Auto in BIOS:

[    4.654736] [drm] amdgpu: 512M of VRAM memory ready
[    4.654737] [drm] amdgpu: 15688M of GTT memory ready.

If I set VRAM to 8GB in BIOS:

[    4.670921] [drm] amdgpu: 8192M of VRAM memory ready
[    4.670923] [drm] amdgpu: 11908M of GTT memory ready.

If I set VRAM to 16GB in BIOS:

[    4.600060] [drm] amdgpu: 16384M of VRAM memory ready
[    4.600062] [drm] amdgpu: 7888M of GTT memory ready.

It looks like GTT size is 0.5*(RAM-VRAM). I wonder how far this can go if you have 64GB or 96GB of RAM. Can you have an iGPU with 32GB or 48GB of GTT memory? That would make a $200 APU with $200 of DDR5 RAM superior to a $2,000 dGPU for running Mixtral and future sparse models. I also wonder whether any BIOS offers a 32GB VRAM setting if you have 64GB of RAM.
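
As a quick sanity check, here is that formula evaluated against the dmesg numbers above (a small Go snippet; the consistent shortfall of a few hundred MB is presumably kernel/firmware reservations):

package main

import "fmt"

func main() {
	const ramMiB = 32 * 1024 // 32GB system
	cases := []struct {
		setting string
		vramMiB int // VRAM set in BIOS
		gttMiB  int // GTT reported by dmesg
	}{
		{"Auto", 512, 15688},
		{"8GB", 8192, 11908},
		{"16GB", 16384, 7888},
	}
	for _, c := range cases {
		expected := (ramMiB - c.vramMiB) / 2 // 0.5*(RAM-VRAM)
		fmt.Printf("%-5s expected %5dM, dmesg reports %5dM\n", c.setting, expected, c.gttMiB)
	}
}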

Unfortunately, ROCm does not use GTT. That thread mentions several workarounds (torch-apu-helper, force-host-alloction-APU, Rusticl, unlock VRAM allocation), but I am not sure whether Ollama would be able to use any of them. Chances are highest in a Docker container, where Ollama has the greatest control over dependencies.


DocMAX commented Feb 25, 2024

Very cool findings. Interesting that you mention 96GB. I did some research, and it seems that's the max we can buy right now for SO-DIMMs. I wasn't aware it's called GTT. Let's hope we get support for this someday.
If the host can't handle GTT for ROCm, then I doubt Docker can do anything about it.

https://github.com/segurac/force-host-alloction-APU looks like the best solution to me, if it works. I will try it in my Docker containers...

[So Feb 25 21:31:38 2024] [drm] amdgpu: 512M of VRAM memory ready
[So Feb 25 21:31:38 2024] [drm] amdgpu: 31844M of GTT memory ready.

This is how much I would get :-) (64GB system)


DocMAX commented Feb 25, 2024

OK, it doesn't work with Ollama. Ollama doesn't use PyTorch, right? I wasn't aware of that.

@chiragkrishna

llama.cpp supports it; that's what I was trying to do in my previous post: Support AMD Ryzen Unified Memory Architecture (UMA)


robertvazan commented Feb 26, 2024

@chiragkrishna Do you mean this? ggerganov/llama.cpp#4449

Since llama.cpp already supports UMA (GTT/GART), Ollama could perhaps include a llama.cpp build with UMA enabled and use it when the conditions are right (AMD iGPU with VRAM smaller than the model).

PS: UMA support seems a bit unstable, so perhaps enable it with an environment variable at first.
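
A sketch of how that gating could look on Ollama's side (entirely hypothetical; the OLLAMA_HIP_UMA variable and the runner names are invented for illustration):

package main

import (
	"fmt"
	"os"
)

// pickROCmRunner prefers a llama.cpp build compiled with -DLLAMA_HIP_UMA=ON
// only when the user opts in and an AMD iGPU cannot hold the model in VRAM.
func pickROCmRunner(isIGPU bool, vramBytes, modelBytes uint64) string {
	if os.Getenv("OLLAMA_HIP_UMA") == "1" && isIGPU && vramBytes < modelBytes {
		return "rocm_uma"
	}
	return "rocm"
}

func main() {
	fmt.Println(pickROCmRunner(true, 4<<30, 40<<30))
}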


DocMAX commented Feb 26, 2024

How does the env thing work? Like this? (It doesn't do anything, by the way.)
LLAMA_HIP_UMA=1 HSA_OVERRIDE_GFX_VERSION=9.0.0 HCC_AMDGPU_TARGETS==gfx900 ollama start

@robertvazan

@DocMAX I don't think there's UMA support in ollama yet. It's a compile-time option in llama.cpp. The other env variables (HSA_OVERRIDE_GFX_VERSION was sufficient in my experiments) are correctly passed down to ROCm.


DocMAX commented May 20, 2024

I'm using the Docker image ollama/ollama:rocm.


arilou commented May 20, 2024

What I suggested is a change to Ollama, OLLAMA_VRAM_OVERRIDE; it is not part of Ollama today...


DocMAX commented May 20, 2024

OK, then I have to wait for the Docker version, because I want to stay on Docker.


qkiel commented May 20, 2024

Not sure I'm following; if you use the VRAM override env and also libforcegttalloc.so to expose your entire system memory, it won't show this print. The code verifies you have at least 1GB of RAM... pretty sure your system memory has more than 1GB

Curious question: why do you use libforcegttalloc.so with ollama? Isn't it only intended for applications that require PyTorch? Without LD_PRELOAD, everything should work exactly the same.


arilou commented May 20, 2024

Well, the reason is that if you look at what happens when you compile ollama/llama.cpp (even with LLAMA_HIP_UMA=on), it charges allocations against VRAM (you can watch this with radeontop/amdgpu_top). That limits you to the amount of VRAM you can assign in the BIOS (the max is 16GB). But say my system has 64GB of memory; there is no reason I shouldn't be able to load much larger models, like 50GB. After all, there is no "real" meaning to those 16GB being VRAM, they sit on the same DIMMs... so with the trick done in libforcegttalloc.so you basically charge the memory to the OS instead. Since we are all here on APUs, it's the same physical memory; we just need to go through HIP so the iGPU understands it can access those pointers directly.

So with this trick you can load much bigger models and "steal" less memory from your system for your GPU.

For example, I loaded llama3:70b-instruct-q4_K_M, which is about 40GB, and I still get 0.8 tps, which is fairly OK for the power of our iGPU...


qkiel commented May 20, 2024

Well, the reason is that if you look at what happens when you compile ollama/llama.cpp (even with LLAMA_HIP_UMA=on), it charges allocations against VRAM [...]

Interesting. I have an AMD 5600G APU with UMA_AUTO set in the UEFI/BIOS (which means 512 MB is taken from my RAM for VRAM). On my Ubuntu 22.04 system, libforcegttalloc.so is required only for Stable Diffusion apps like Fooocus.

Running ollama with or without LD_PRELOAD makes no difference in my case. VRAM is kept at 512 MB, models are loaded into RAM, and the compute is done on the GPU.

Have you tried running ollama without LD_PRELOAD?


arilou commented May 20, 2024

Interesting. For me it crashes if I try to load a model bigger than the allocated VRAM. I wonder if it's an issue because the default ROCm in Fedora 40 is 6.0.

@Jonnybravo

Hey, I also have a 5600G and I wanted to make use of it. I read the whole thread, but I'm confused about which steps I should follow to change the build in order to make this work with this iGPU.

Is there a pre-compiled version that everyone can use? I tried to follow @qkiel's steps, but it fails miserably when I try to compile and build using Go...


qkiel commented May 20, 2024

Hey, I also have a 5600G and I wanted to make use of it. I read the whole thread, but I'm confused about which steps I should follow to change the build in order to make this work with this iGPU.

Is there a pre-compiled version that everyone can use? I tried to follow @qkiel's steps, but it fails miserably when I try to compile and build using Go...

When you download the source code and compile it with the commands below, do you still get an error?

git clone --depth 1 --branch v0.1.38 https://github.com/ollama/ollama
cd ollama
go generate ./...
go build .

@Jonnybravo

Last time I ran the generate command I got this:

[screenshot of the error omitted]

I didn't try anything else after this. Before I got here, the generate command would give me an error related to the wrong version of Go being installed.


qkiel commented May 21, 2024

I updated my instructions a bit; see if it works this time. If not, I can send you my binary.


Jonnybravo commented May 21, 2024

I followed everything again and made sure about the versions of the requirements. This time I managed to pass the generate step, but it seems I have a problem with the GOROOT path when I try to run the build command:

[screenshot of the error omitted]

Could it be because Go is installed in a custom folder?

EDIT: Meanwhile, I tried to point both variables to different paths, but now I have an error on GOPROXY. @qkiel, did you also experience this? What are your paths for the variables GOROOT, GOPATH, and GOPROXY?


qkiel commented May 21, 2024

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help

@Jonnybravo

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help

I did that and I believe I'm still running it using the CPU. Is there a way to confirm that I'm running this using the GPU?


qkiel commented May 21, 2024

Look at GPU utilization. I use nvtop for that (also available as a snap):

sudo apt install nvtop


Jonnybravo commented May 21, 2024

[screenshot omitted]

Installed it, but I think I messed up the ROCm installation. Do I need to do some extra step besides this?

sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb
sudo apt install ./amdgpu-install_6.1.60101-1_all.deb

If it helps, I'm using Windows with WSL.


qkiel commented May 21, 2024

Unfortunately, there is no equivalent of the HSA_OVERRIDE_GFX_VERSION environment variable on Windows, so you cannot present your iGPU to ROCm as supported.

Secondly, you install ROCm differently on Windows.

I don't think it can be done on Windows the same way as on Linux.

Edit:

sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb
sudo apt install ./amdgpu-install_6.1.60101-1_all.deb

Besides that, you can use this to install ROCm:

sudo amdgpu-install --usecase=rocm --no-dkms

But chances of success are very slim.

@Jonnybravo

Yeah, I tried it and got the same problem mentioned on this thread: ROCm/ROCm#3051

And what about a virtual machine running Linux, @qkiel? Do you think that could work, or am I stretching too much here?


qkiel commented May 22, 2024

I have no idea. If you have a regular GPU, then you can pass the iGPU to the VM and that could work. I don't think the 5600G supports SR-IOV, so you can't partition the iGPU and pass only part of it.


xwry commented May 26, 2024

This warning doesn't matter, just run ollama:

/<path>/./ollama serve

Then in a second terminal window:

/<path>/./ollama --help

I did that and I believe I'm still running it using the CPU. Is there a way to confirm that I'm running this using the GPU?

You can try radeontop; it works fine on AMD iGPUs, and the -c flag adds colorized output.
sudo apt install radeontop

@smellouk

@qkiel thanks for this tip 🙏
I followed the steps as you described, but I'm facing this error:

Error: llama runner process has terminated: signal: aborted error:Could not initialize Tensile host: No devices found

My current setup:

  • Proxmox running on host machine
  • LXC ubuntu22.04
  • Shared GPU

What is crazy now is that if I install Docker in this LXC and run docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama --device=/dev/kfd --device=/dev/dri/renderD128 --env HSA_OVERRIDE_GFX_VERSION=9.0.0 --env HSA_ENABLE_SDMA=0 ollama/ollama:rocm, everything works great. More details here: #5143 (comment)


qkiel commented Jun 20, 2024

@smellouk I have tutorials on how I install ROCm and Ollama in Incus containers (a fork of LXD).

Do you do this similarly or somehow differently?


smellouk commented Jun 21, 2024

@qkiel I used that article, and I just noticed you are the author 😆; that AI tutorial, ROCm and PyTorch on AMD APU or GPU, is what led me here. I followed everything and have the same issue 😢


qkiel commented Jun 21, 2024

When you run this command, what do you see?

ls -alF /dev/dri

Do card0 and renderD128 belong to the video or render group?

crw-rw----  1 root video 226,   0 cze 21 22:07 card0
crw-rw----  1 root video 226, 128 cze 21 22:07 renderD128

If they belong to root:root, that means you didn't set a proper gid when adding the GPU device to the container. For Ubuntu containers, that would be:

incus config device add <container_name> gpu gpu gid=44

Or your user inside the container doesn't belong to the video and render groups. For Ubuntu containers, that would be (this requires a container restart to take effect):

sudo usermod -a -G render,video ubuntu

@smellouk

@qkiel the permissions are correct, as expected.


arilou commented Jun 25, 2024

I added the following patch to ollama:

diff --git a/gpu/amd_linux.go b/gpu/amd_linux.go
index 6b08ac2..579186b 100644
--- a/gpu/amd_linux.go
+++ b/gpu/amd_linux.go
@@ -229,6 +229,15 @@ func AMDGetGPUInfo() []GpuInfo {
 		}

 		// iGPU detection, remove this check once we can support an iGPU variant of the rocm library
+		if override, exists := os.LookupEnv("OLLAMA_VRAM_OVERRIDE"); exists {
+			// Convert the environment variable to an integer
+			if value, err := strconv.ParseUint(override, 10, 64); err == nil {
+				totalMemory = value
+			} else {
+				fmt.Println("Error parsing OLLAMA_VRAM_OVERRIDE:", err)
+			}
+		}
+
 		if totalMemory < IGPUMemLimit {
 			slog.Info("unsupported Radeon iGPU detected skipping", "id", gpuID, "total", format.HumanBytes2(totalMemory))
 			continue

@dhiltgen perhaps you want to consider adding this patch to Ollama? I don't have an NVIDIA machine to test with and do the same for CUDA (or whatever Intel has/will have), but I know it works well for AMD.
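
With the patch applied, the override would presumably be set like OLLAMA_VRAM_OVERRIDE=8589934592 ollama serve (assuming the value is in bytes, since it is parsed straight into totalMemory; 8589934592 is 8 GiB and purely illustrative).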
