
ROCm support #814

Closed
wants to merge 1 commit into from

Conversation


@65a 65a commented Oct 17, 2023

#667 got closed during a bad rebase attempt. This should be close to the minimum change needed to use build tags to switch between ROCm and CUDA, plus docs for how to build it. The existing Dockerfiles are updated so they do not break.

Please let me know @jmorganca @mxyng @BruceMacD if you'd like this done with a different approach, or if you don't want to do this at all. Closes #738. I'll post test results for GGML and GGUF files.

@65a
Author

65a commented Oct 17, 2023

Example build instructions (for Arch; other distributions may have different paths for the CLBlast CMake includes and the ROCm install directory):

ROCM_PATH=/opt/rocm CLBlast_DIR=/usr/lib/cmake/CLBlast go generate -tags rocm ./...

then

go build -tags rocm
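
Once built, the server is started the same way as a stock build; for example, keeping the same ROCm path in the environment (the runtime warns if ROCM_PATH is unset):

ROCM_PATH=/opt/rocm ./ollama serve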

GGUF (uses ROCm for acceleration, RX6950XT) mistral-7b-q8:

llama_print_timings:      sample time =   171.71 ms /   343 runs   (    0.50 ms per token,  1997.54 tokens per second)
llama_print_timings: prompt eval time =    67.25 ms /     2 tokens (   33.63 ms per token,    29.74 tokens per second)
llama_print_timings:        eval time =  6391.72 ms /   342 runs   (   18.69 ms per token,    53.51 tokens per second)

GGML (legacy, uses CLBlast for acceleration, RX6950XT) llama-7b-q2k:

llama_print_timings:      sample time =    48.79 ms /    87 runs   (    0.56 ms per token,  1783.19 tokens per second)
llama_print_timings: prompt eval time =  1712.33 ms /     2 tokens (  856.17 ms per token,     1.17 tokens per second)
llama_print_timings:        eval time =  2790.58 ms /    86 runs   (   32.45 ms per token,    30.82 tokens per second)

@jmorganca
Member

Hi @65a, sorry for not jumping in on the other PR sooner. Will take a look at this, and thank you so much for taking all of the time to give adding ROCm a go!

@65a
Author

65a commented Oct 17, 2023

@jmorganca no worries, I'm using it locally so I have to keep going regardless :)

@nanowinner

nanowinner commented Oct 17, 2023

Hey @65a, I had been monitoring #667 and will continue to do so with this PR. Thank you for your time, regardless of whether you're using it locally yourself or not - it's much appreciated!

@TheScreechingBagel

TheScreechingBagel commented Oct 21, 2023

just wanted to say, your repo built flawlessly and is working great on my 6700XT, thank you!

@TheScreechingBagel

TheScreechingBagel commented Oct 21, 2023

Actually, it doesn't seem to work with Mistral 7B; guessing that's because it's using a different backend or something in ollama?

(As in: it's slow, there's no GPU activity, and the card isn't making any of the usual noises.)

@65a
Author

65a commented Oct 23, 2023

@TheScreechingBagel you can see above that I tested with Mistral-7b. You are most likely falling back to CPU; there will be an error in your logs. Perhaps we can continue the conversation in #738.

@65a
Author

65a commented Oct 28, 2023

Rebased on HEAD and incorporated changes, testing again. W7900 is still out of commission and going around in RMA world, but I have a 7900XTX to test with now.

@65a
Author

65a commented Oct 28, 2023

ROCm: 7900XTX GGUF (Mistral 7b q8):

llama_print_timings:      sample time =      58.70 ms /   422 runs   (    0.14 ms per token,  7189.34 tokens per second)
llama_print_timings: prompt eval time =      64.29 ms /     3 tokens (   21.43 ms per token,    46.66 tokens per second)
llama_print_timings:        eval time =    5419.35 ms /   421 runs   (   12.87 ms per token,    77.68 tokens per second)

OpenCL: 7900XTX GGML (Llama-7b q2k):

llama_print_timings:      sample time =    58.91 ms /   102 runs   (    0.58 ms per token,  1731.37 tokens per second)
llama_print_timings: prompt eval time =   983.92 ms /     3 tokens (  327.97 ms per token,     3.05 tokens per second)
llama_print_timings:        eval time =  3137.99 ms /   101 runs   (   31.07 ms per token,    32.19 tokens per second)

Seems to work, and ready for review @jmorganca

@pdevine
Contributor

pdevine commented Nov 9, 2023

We should be able to test this now. I ordered a Radeon 7900 XTX and it just came in, but I still have to pull a machine apart and get it working. Thanks for your patience!

@65a
Author

65a commented Nov 9, 2023

@pdevine sounds good! I can try syncing to head and rebuilding to make sure things are still in a good state.

@65a
Author

65a commented Nov 9, 2023

Seems like it's working still (7900XTX, Mistral-7b quantized to q8):

llama_print_timings:        load time =   60239.98 ms
llama_print_timings:      sample time =       1.74 ms /    13 runs   (    0.13 ms per token,  7475.56 tokens per second)
llama_print_timings: prompt eval time =      65.56 ms /     7 tokens (    9.37 ms per token,   106.77 tokens per second)
llama_print_timings:        eval time =     153.20 ms /    12 runs   (   12.77 ms per token,    78.33 tokens per second)
llama_print_timings:       total time =     221.35 ms

@lu4p

lu4p commented Nov 9, 2023

Tested this on my Vega 56 on Linux, with llama2 (7b, 13b) and mistral, works! Thanks a lot.

How do I run the benchmark you did? I'm curious about how my old card stacks up.

@K1ngjulien

K1ngjulien commented Nov 9, 2023

Hi, I'm currently trying to run this with my 6700XT.

Is there a way to specify make -j24 when running go generate? I have a bunch of extra cores, but it doesn't look like they're being used, lol.

setting the parallel level from the environment doesn't seem to help:
CMAKE_BUILD_PARALLEL_LEVEL=24 ROCM_PATH=/opt/rocm CLBlast_DIR=/usr/lib/cmake/CLBlast go generate -tags rocm ./...

Also, it looks like it's building a lot of ggml CUDA code. Can we turn that off somehow?

-- Generating done (0.1s)
-- Build files have been written to: /home/julian/opt/ollama/llm/llama.cpp/gguf/build/rocm
[  6%] Building CXX object CMakeFiles/ggml-rocm.dir/ggml-cuda.cu.o
/home/julian/opt/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:4001:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
   ...

@65a
Author

65a commented Nov 10, 2023

@K1ngjulien The "cuda" code is actually hipified for ROCm, and it's compiled for several targets (hence slow). I'll leave more parallelism for the next PR, if it's possible. If the bottleneck is compiling the "cuda" (actually ROCm) kernels, it might help to override AMDGPU_TARGETS and friends to include only your card, trading portability for compile time locally.

@65a
Author

65a commented Nov 10, 2023

Rebased on HEAD and made sure the NVIDIA behavior matches by copying the new changes in CheckVRAM and the generators.

@65a
Author

65a commented Nov 10, 2023

@lu4p the benchmark is in the logs for wherever you are running ollama serve (maybe a terminal, or maybe a systemd log or something). I also have a similar gfx906 card (Mi60-like ebay card), should be much faster than a CPU or iGPU.

@lu4p

lu4p commented Nov 10, 2023

My card is a lot slower than yours, by a factor of 10 or so.

llama_print_timings:        load time =    1052.60 ms
llama_print_timings:      sample time =       7.35 ms /    26 runs   (    0.28 ms per token,  3535.97 tokens per second)
llama_print_timings: prompt eval time =    2107.46 ms /    25 tokens (   84.30 ms per token,    11.86 tokens per second)
llama_print_timings:        eval time =    2796.51 ms /    25 runs   (  111.86 ms per token,     8.94 tokens per second)
llama_print_timings:       total time =    4917.49 ms

My CPU (Ryzen 5 3600) for reference; it's only around 30% slower than the card.

llama_print_timings:        load time =    3325.86 ms
llama_print_timings:      sample time =       4.73 ms /    16 runs   (    0.30 ms per token,  3382.66 tokens per second)
llama_print_timings: prompt eval time =    2452.26 ms /    21 tokens (  116.77 ms per token,     8.56 tokens per second)
llama_print_timings:        eval time =    2442.76 ms /    15 runs   (  162.85 ms per token,     6.14 tokens per second)
llama_print_timings:       total time =    4903.49 ms

Is this a problem? (output from ollama serve)

2023/11/10 04:10:18 accelerator_rocm.go:71: ROCm presenting 0 bytes of available VRAM on device ""

@65a
Author

65a commented Nov 10, 2023

I'd prefer to provide support in #738, if that's all right. I would need the full log from ollama serve, but I suspect there's an error and ollama is falling back to CPU.

@lu4p lu4p mentioned this pull request Nov 10, 2023
@65a
Author

65a commented Nov 10, 2023

@lu4p I put some more validation around the error case; hopefully we can figure out what your error is over in #738. It will now return "no GPU" if the total parsed VRAM is 0, and it logs the cards it finds and the amount of free VRAM in MiB.
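
For illustration only, a minimal sketch of that validation; the package and identifier names below are assumptions, not the PR's actual code:

package llm

import (
	"errors"
	"log"
)

// errNoAccel signals that no usable GPU was found, so the CPU runner is used instead.
var errNoAccel = errors.New("no GPU detected")

// checkROCmVRAM sums the free VRAM parsed per card (in MiB), logging each card,
// and reports "no GPU" when the total comes out to zero.
func checkROCmVRAM(freeMiBByCard map[string]uint64) (uint64, error) {
	var total uint64
	for card, freeMiB := range freeMiBByCard {
		log.Printf("ROCm found %d MiB of available VRAM on device %q", freeMiB, card)
		total += freeMiB
	}
	if total == 0 {
		return 0, errNoAccel
	}
	return total, nil
}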

@65a
Author

65a commented Nov 10, 2023

Fixed typo in generate_linux_rocm.go

@65a
Author

65a commented Nov 10, 2023

Finally managed a successful test on 6700S again:

llama_print_timings:      sample time =      45.18 ms /   422 runs   (    0.11 ms per token,  9339.80 tokens per second)
llama_print_timings: prompt eval time =     164.41 ms /     3 tokens (   54.80 ms per token,    18.25 tokens per second)
llama_print_timings:        eval time =   13110.55 ms /   421 runs   (   31.14 ms per token,    32.11 tokens per second)

It appears HIP_VISIBLE_DEVICES=0 needs to be set in the environment on systems with an AMD iGPU + dGPU.
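
On such a machine the serve command becomes something like:

HIP_VISIBLE_DEVICES=0 ROCM_PATH=/opt/rocm ./ollama serve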

@K1ngjulien

Hmm, looks like it's detecting the GPU correctly, but then something goes wrong and it falls back to the CPU:

2023/11/10 10:42:55 routes.go:696: Listening on 127.0.0.1:11434 (version 0.0.0)
2023/11/10 10:42:55 accelerator_rocm.go:66: ROCm found 11462 MiB of available VRAM on device "card0"
2023/11/10 10:42:55 accelerator_rocm.go:76: ROCm selecting device "card0"
[GIN] 2023/11/10 - 10:43:15 | 200 |        25.9µs |       127.0.0.1 | HEAD     "/"
[GIN] 2023/11/10 - 10:43:15 | 200 |     362.294µs |       127.0.0.1 | POST     "/api/show"
2023/11/10 10:43:15 accelerator_rocm.go:66: ROCm found 11462 MiB of available VRAM on device "card0"
2023/11/10 10:43:15 accelerator_rocm.go:76: ROCm selecting device "card0"
2023/11/10 10:43:15 llama.go:254: 11462 MB VRAM available, loading up to 75 GPU layers
2023/11/10 10:43:15 llama.go:379: starting llama runner
2023/11/10 10:43:15 llama.go:437: waiting for llama runner to start responding

rocBLAS error: Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek
2023/11/10 10:43:15 llama.go:394: Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek
2023/11/10 10:43:15 llama.go:402: error starting llama runner: llama runner process has terminated
2023/11/10 10:43:15 llama.go:468: llama runner stopped successfully
2023/11/10 10:43:15 llama.go:379: starting llama runner
2023/11/10 10:43:15 llama.go:437: waiting for llama runner to start responding
{"timestamp":1699609395,"level":"WARNING","function":"server_params_parse","line":871,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support",
"n_gpu_layers":-1}
{"timestamp":1699609395,"level":"INFO","function":"main","line":1323,"message":"build info","build":1412,"commit":"9e70cc0"}
{"timestamp":1699609395,"level":"INFO","function":"main","line":1325,"message":"system info","n_threads":12,"n_threads_batch":-1,"total_threads":24,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0
 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/julian/.ollama/models/blobs/sha256:22f7f8ef5f4c791c1b03d7eb414399294764d7cc82c7e94aa81a1feb80a983a2 (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]

any ideas?

@K1ngjulien

K1ngjulien commented Nov 10, 2023

GOT IT!

from this comment.
My 6700 XT doesn't seem to be officially supported yet, but overriding the version makes it work anyway, lol.

Before, on the Ryzen 9 5900X CPU:

llama_print_timings:        load time =     751.10 ms
llama_print_timings:      sample time =      34.37 ms /   242 runs   (    0.14 ms per token,  7041.43 tokens per second)
llama_print_timings: prompt eval time =    1007.78 ms /    26 tokens (   38.76 ms per token,    25.80 tokens per second)
llama_print_timings:        eval time =   24948.13 ms /   241 runs   (  103.52 ms per token,     9.66 tokens per second)
llama_print_timings:       total time =   26043.89 ms

Now, with HSA_OVERRIDE_GFX_VERSION=10.3.0 ROCM_PATH=/opt/rocm ./ollama serve running on the Radeon 6700 XT GPU:


llama_print_timings:        load time =    1556.60 ms
llama_print_timings:      sample time =      14.41 ms /   121 runs   (    0.12 ms per token,  8396.36 tokens per second)
llama_print_timings: prompt eval time =     104.54 ms /    25 tokens (    4.18 ms per token,   239.13 tokens per second)
llama_print_timings:        eval time =    2132.34 ms /   120 runs   (   17.77 ms per token,    56.28 tokens per second)
llama_print_timings:       total time =    2259.01 ms

So prompt eval went from about 25 t/s to about 240 t/s running codellama. I'd call that a win 🎉

@65a
Author

65a commented Nov 11, 2023

Minor code cleanup to the way runners are accumulated (though I'm not sure CPU fallback is ever good UX... out of scope for this change). It would be great to have someone test the CUDA side of this change as well; I only made sure it compiles, as I don't have any NVIDIA cards.

@65a
Author

65a commented Nov 11, 2023

Synced to HEAD, tested the 6700S (dGPU) again on a clean checkout with a mistral-7b q5k quant:

llama_print_timings:      sample time =      20.29 ms /   191 runs   (    0.11 ms per token,  9413.50 tokens per second)
llama_print_timings: prompt eval time =     163.72 ms /     2 tokens (   81.86 ms per token,    12.22 tokens per second)
llama_print_timings:        eval time =    5803.15 ms /   190 runs   (   30.54 ms per token,    32.74 tokens per second)

@65a
Author

65a commented Nov 11, 2023

Tested again with 7900XTX (a mistral-7b, not quantized/f16):

llama_print_timings:      sample time =      36.39 ms /   258 runs   (    0.14 ms per token,  7089.08 tokens per second)
llama_print_timings: prompt eval time =      58.05 ms /     3 tokens (   19.35 ms per token,    51.68 tokens per second)
llama_print_timings:        eval time =    6778.75 ms /   257 runs   (   26.38 ms per token,    37.91 tokens per second)

Retested OpenCL on the same card (GGML acceleration for older models, llama-7b q2k):

llama_print_timings:      sample time =   537.35 ms /   946 runs   (    0.57 ms per token,  1760.50 tokens per second)
llama_print_timings: prompt eval time =  1150.85 ms /     3 tokens (  383.62 ms per token,     2.61 tokens per second)
llama_print_timings:        eval time = 30633.54 ms /   945 runs   (   32.42 ms per token,    30.85 tokens per second)

@TeddyDD

TeddyDD commented Nov 11, 2023

Finally managed to compile it. The compiler looked for the OpenCL lib at the fixed path /usr/lib/x86_64-linux-gnu/libOpenCL.so, so I had to create a symlink. Seems to work fine. Is there any way to set the number of layers offloaded to the GPU (like llama.cpp's -ngl argument)?

@65a
Author

65a commented Nov 18, 2023

@shamb0 Your graphics architecture seems to be unsupported; you probably need HSA_OVERRIDE_GFX_VERSION set in the environment to something ROCm does support. Please use #738 for questions/support; your issue is not a problem with this PR. Try HSA_OVERRIDE_GFX_VERSION=1030 ollama serve

@shamb0

shamb0 commented Nov 18, 2023

Thanks @65a .

Tried the commands below:

  • HSA_OVERRIDE_GFX_VERSION=10.3.0 ./ollama serve, no luck. Hit with "Memory access fault"

  • HSA_OVERRIDE_GFX_VERSION=1030 ./ollama serve, no luck. Hit with "Could not initialize Tensile host"

With command HSA_OVERRIDE_GFX_VERSION=10.3.0 ./ollama serve:

>$ HSA_OVERRIDE_GFX_VERSION=10.3.0 ./ollama serve
2023/11/18 23:12:04 images.go:799: total blobs: 6
2023/11/18 23:12:04 images.go:806: total unused blobs removed: 0
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.Serve.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.Serve.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
2023/11/18 23:12:04 routes.go:777: Listening on 127.0.0.1:11434 (version 0.0.0)
2023/11/18 23:12:04 accelerator_rocm.go:39: warning: ROCM_PATH is not set. Trying a likely fallback path, but it is recommended to set this variable in the environment.
2023/11/18 23:12:04 accelerator_rocm.go:73: ROCm found 7435 MiB of available VRAM on device "card0"
2023/11/18 23:12:04 accelerator_rocm.go:83: ROCm selecting device "card0"
2023/11/18 23:12:58 accelerator_rocm.go:39: warning: ROCM_PATH is not set. Trying a likely fallback path, but it is recommended to set this variable in the environment.
2023/11/18 23:12:58 accelerator_rocm.go:73: ROCm found 7450 MiB of available VRAM on device "card0"
2023/11/18 23:12:58 accelerator_rocm.go:83: ROCm selecting device "card0"
2023/11/18 23:12:58 llama.go:247: 7450 MB VRAM available, loading up to 48 GPU layers
2023/11/18 23:12:58 llama.go:372: starting llama runner
2023/11/18 23:12:58 llama.go:430: waiting for llama runner to start responding
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 5700, compute capability 10.3
{"timestamp":1700329379,"level":"INFO","function":"main","line":1324,"message":"build info","build":1412,"commit":"9e70cc0"}
{"timestamp":1700329379,"level":"INFO","function":"main","line":1330,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":16,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/popoyi/.ollama/models/blobs/sha256:22f7f8ef5f4c791c1b03d7eb414399294764d7cc82c7e94aa81a1feb80a983a2 (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]

With command HSA_OVERRIDE_GFX_VERSION=1030 ./ollama serve:

>$ HSA_OVERRIDE_GFX_VERSION=1030 ./ollama serve
2023/11/18 23:05:39 images.go:799: total blobs: 6
2023/11/18 23:05:39 images.go:806: total unused blobs removed: 0
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.Serve.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.Serve.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
2023/11/18 23:05:39 routes.go:777: Listening on 127.0.0.1:11434 (version 0.0.0)
2023/11/18 23:05:39 accelerator_rocm.go:39: warning: ROCM_PATH is not set. Trying a likely fallback path, but it is recommended to set this variable in the environment.
2023/11/18 23:05:39 accelerator_rocm.go:73: ROCm found 7431 MiB of available VRAM on device "card0"
2023/11/18 23:05:39 accelerator_rocm.go:83: ROCm selecting device "card0"
[GIN] 2023/11/18 - 23:05:58 | 200 |     243.164µs |       127.0.0.1 | GET      "/api/tags"
[GIN] 2023/11/18 - 23:05:58 | 200 |     678.047µs |       127.0.0.1 | POST     "/api/show"
2023/11/18 23:08:12 accelerator_rocm.go:39: warning: ROCM_PATH is not set. Trying a likely fallback path, but it is recommended to set this variable in the environment.
2023/11/18 23:08:12 accelerator_rocm.go:73: ROCm found 7449 MiB of available VRAM on device "card0"
2023/11/18 23:08:12 accelerator_rocm.go:83: ROCm selecting device "card0"
2023/11/18 23:08:12 llama.go:247: 7449 MB VRAM available, loading up to 48 GPU layers
2023/11/18 23:08:12 llama.go:372: starting llama runner
2023/11/18 23:08:12 llama.go:430: waiting for llama runner to start responding

rocBLAS error: Could not initialize Tensile host: No devices found
2023/11/18 23:08:13 llama.go:387: Could not initialize Tensile host: No devices found
2023/11/18 23:08:13 llama.go:395: error starting llama runner: llama runner process has terminated
2023/11/18 23:08:13 llama.go:461: llama runner stopped successfully
2023/11/18 23:08:13 llama.go:372: starting llama runner
2023/11/18 23:08:13 llama.go:430: waiting for llama runner to start responding
{"timestamp":1700329092,"level":"WARNING","function":"server_params_parse","line":871,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
{"timestamp":1700329093,"level":"INFO","function":"main","line":1323,"message":"build info","build":1412,"commit":"9e70cc0"}
{"timestamp":1700329093,"level":"INFO","function":"main","line":1325,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":16,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}

@GZGavinZhao

@shamb0 The issue you reported is unrelated to this PR. The rocBLAS library you installed was not built with support for your GPU's architecture, so we can't do anything here. Please use #738 for ROCm questions and support.

@purinda
Contributor

purinda commented Nov 18, 2023

@purinda I think ollama will pick the 6900xt, but llama won't, which is annoying (I have a similar issue with the iGPU + dGPU on a laptop). Does it work if you run HIP_VISIBLE_DEVICES=1 ollama serve? This is not necessarily ideal, but it might get you out of the edge case. I need to look at how ollama handles this for cuda and see if that can be used here as well...

@65a I have produced a fix to switch the GPU to be used as primary in a multi-GPU environment. PR #1192

…Linux.

    The build tags rocm or cuda must be specified to both go generate and go build.
    ROCm builds should have ROCM_PATH set (and the ROCm SDK present) as well as CLBlast installed (for GGML) and CLBlast_DIR set in the environment to the CLBlast CMake directory (likely /usr/lib/cmake/CLBlast).
    Build tags are also used to switch VRAM detection between the cuda and rocm implementations, using added "accelerator_foo.go" files which contain architecture-specific functions and variables (see the sketch after this list).
    accelerator_none is used when no tags are set, and a helper function addRunner ignores it if it is the chosen accelerator.
    Fix go generate commands, thanks @deadmeu for testing.
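
As a rough sketch of that build-tag layout (the identifiers below are illustrative, not the exact code in this PR), the per-accelerator files look something like this:

// accelerator_rocm.go: compiled only when the rocm tag is passed to go generate / go build
//go:build rocm

package llm

// acceleratorName selects the ROCm-specific VRAM detection path.
const acceleratorName = "rocm"

// accelerator_none.go: compiled when neither the rocm nor the cuda tag is set,
// so the addRunner helper skips GPU setup and falls back to the CPU runner.
//go:build !rocm && !cuda

package llm

const acceleratorName = "none"

With go generate -tags rocm and go build -tags rocm the ROCm file is selected; a plain go build compiles the stub instead.
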
@tuhochi

tuhochi commented Dec 1, 2023

We should be able to test this now. I ordered a Radeon 7900 XTX and it just came in, but I still have to pull a machine apart and get it working. Thanks for your patience!

Hi there! It's been a couple of weeks since you mentioned ordering your Radeon 7900 XTX. I'm curious, have you had the chance to test it out yet? I'm particularly interested in knowing how it's performing in terms of stability and speed, especially compared to an NVIDIA card. Looking forward to hearing about your experience!

@pdevine
Contributor

pdevine commented Dec 1, 2023

@tuhochi Still hoping to get this sorted soon, although it's still probably a few weeks out unfortunately. Feel free to ping me on the discord if you want more details.

@ml2s

ml2s commented Dec 3, 2023

Thank you all for the efforts; I seem to have built it successfully and ollama is running against ROCm 5.6. My system is Ubuntu 22.04.3 on a Ryzen 5800H / Vega 8 (with 16 GB out of 64 GB of RAM assigned to the iGPU), with HSA_OVERRIDE_GFX_VERSION=9.0.0. ROCm 5.6 works fine with PyTorch 2.1, verified by running the stable diffusion webui.

Now, the output from ollama runner:

❯ ROCM_PATH=/opt/rocm ./ollama serve
2023/12/03 15:29:47 images.go:799: total blobs: 14
2023/12/03 15:29:47 images.go:806: total unused blobs removed: 0
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/jmorganca/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/jmorganca/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/jmorganca/ollama/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/jmorganca/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/jmorganca/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/jmorganca/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/jmorganca/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/jmorganca/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/jmorganca/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] GET    /                         --> github.com/jmorganca/ollama/server.Serve.func2 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/jmorganca/ollama/server.Serve.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/jmorganca/ollama/server.ListModelsHandler (5 handlers)
2023/12/03 15:29:47 routes.go:777: Listening on 127.0.0.1:11434 (version 0.0.0)
2023/12/03 15:29:47 accelerator_rocm.go:73: ROCm found 15681 MiB of available VRAM on device "card0"
2023/12/03 15:29:47 accelerator_rocm.go:83: ROCm selecting device "card0"
[GIN] 2023/12/03 - 15:29:58 | 200 |      52.161µs |       127.0.0.1 | HEAD     "/"
[GIN] 2023/12/03 - 15:29:58 | 200 |      289.85µs |       127.0.0.1 | POST     "/api/show"
2023/12/03 15:29:59 accelerator_rocm.go:73: ROCm found 15681 MiB of available VRAM on device "card0"
2023/12/03 15:29:59 accelerator_rocm.go:83: ROCm selecting device "card0"
2023/12/03 15:29:59 llama.go:248: 15681 MB VRAM available, loading up to 96 GPU layers
2023/12/03 15:29:59 llama.go:373: starting llama runner
2023/12/03 15:29:59 llama.go:431: waiting for llama runner to start responding
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 9.0
{"timestamp":1701646199,"level":"INFO","function":"main","line":1324,"message":"build info","build":1,"commit":"9e70cc0"}
{"timestamp":1701646199,"level":"INFO","function":"main","line":1330,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":16,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/verbie/.ollama/models/blobs/sha256:6ae28029995007a3ee8d0b8556d50f3b59b831074cf19c84de87acf51fb54054 (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight q4_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q4_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q4_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       llama.rope.freq_base f32     
llama_model_loader: - kv  11:                          general.file_type u32     
llama_model_loader: - kv  12:                       tokenizer.ggml.model str     
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  19:               general.quantization_version u32     
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name   = mistralai
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.10 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =   70.41 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 3847.55 MB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size = 162.13 MB
llama_new_context_with_model: VRAM scratch buffer: 156.00 MB
llama_new_context_with_model: total VRAM used: 4259.56 MB (model: 3847.55 MB, context: 412.00 MB)

llama server listening at http://127.0.0.1:54849

{"timestamp":1701646205,"level":"INFO","function":"main","line":1749,"message":"HTTP server listening","hostname":"127.0.0.1","port":54849}
{"timestamp":1701646205,"level":"INFO","function":"log_server_request","line":1240,"message":"request","remote_addr":"127.0.0.1","remote_port":53270,"status":200,"method":"HEAD","path":"/","params":{}}
2023/12/03 15:30:05 llama.go:445: llama runner started in 6.801142 seconds
[GIN] 2023/12/03 - 15:30:05 | 200 |  6.986254418s |       127.0.0.1 | POST     "/api/generate"

When I run mistral, the model gets loaded into VRAM and the GPU is working on inference:

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
ERROR: GPU[0]	: sclk clock is unsupported
====================================================================================
GPU[0]		: get_power_cap, Not supported on the given system
GPU  Temp (DieEdge)  AvgPwr  SCLK  MCLK     Fan  Perf  PwrCap       VRAM%  GPU%  
0    69.0c           29.0W   None  1600Mhz  0%   auto  Unsupported   32%   82%   
====================================================================================
=============================== End of ROCm SMI Log ================================

However, the output of the model is garbage:

❯ ./ollama run mistral
>>> why is the sky blue

The##
```diff
blue color of the##
 È dici Arbitro togetteraamo disorder. È

>>> continue
``lersACCES tiempo! Here'eso tiempo, locally!

I have tried running on CPU and everything works fine. I just wonder what is going wrong. Thanks!

@65a
Copy link
Author

65a commented Dec 14, 2023

Note, I will close and delete my branch when #1146 merges.

@65a
Copy link
Author

65a commented Dec 18, 2023

@ml2s if you haven't sorted it out by now, that is usually a prompting/sampling problem, but there is currently an upstream issue (in llama.cpp) that looks similar. I haven't encountered it on ROCm, but it may be hardware- or environment-specific.
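
One quick way to rule out client-side prompting/sampling is to hit the API directly with greedy sampling. This is only a sketch; the prompt is the one from the transcript above and the option values are placeholders, but /api/generate is the same endpoint shown in the server log:

curl http://127.0.0.1:11434/api/generate -d '{"model": "mistral", "prompt": "why is the sky blue", "options": {"temperature": 0, "top_k": 1}}'

If the output is still garbage with greedy sampling, the problem is in the backend rather than in the prompt template or sampler settings.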

@Redhawk18
Copy link

It would be awesome to have official support. ROCm is definitely harder to set up than CUDA, but I have the GPU I have.

@Chief-Detektor
Copy link

Hi!
This is really an awesome project, and I just rebased this pull request onto main. It works with my RX 6800 XT.
Is there any specific reason why it hasn't been merged yet?
And if there are reasons, how could I help?
Thanks in advance!

@ThatOneCalculator
Copy link

Fixed merge conflicts on https://github.com/65a/ollama/pull/1

@pdevine
Copy link
Contributor

pdevine commented Dec 20, 2023

This change is being carried in #1146, which is just about to go in.

@Chief-Detektor
Copy link

Nice! One tiny thing I had to add was an additional Environment entry in the service file:

Environment="ROCM_PATH=/opt/rocm"

Other than that it worked out of the box.
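
For anyone else hitting this, here is a minimal sketch of doing that as a systemd override; the unit name ollama.service is an assumption and may differ depending on how you installed it:

sudo systemctl edit ollama.service   # opens an override (drop-in) file for the unit
# add these two lines to the override, then save and exit:
#   [Service]
#   Environment="ROCM_PATH=/opt/rocm"
sudo systemctl restart ollama

systemctl edit writes the override under /etc/systemd/system/ollama.service.d/, so it survives upgrades that replace the main unit file.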

@65a 65a closed this by deleting the head repository Dec 24, 2023
@misaligar
Copy link

For the uninitiated like myself, does this mean ollama will support AMD graphics cards like the 7900 XTX going forward?

@pdevine
Copy link
Contributor

pdevine commented Dec 24, 2023

Yes, it should be working in main now. We're seeing > 100 toks/sec on a 7900 xtx w/ the 7b models.

@ThatOneCalculator
Copy link

ThatOneCalculator commented Dec 24, 2023

I tried building from source, but got this error (Arch Linux, RX 6700XT):

❯ ROCM_PATH=/opt/rocm ollama serve
2023/12/24 00:23:11 images.go:828: total blobs: 23
2023/12/24 00:23:11 images.go:835: total unused blobs removed: 0
2023/12/24 00:23:11 routes.go:887: Listening on 127.0.0.1:11434 (version 0.0.0)
2023/12/24 00:23:11 gpu.go:33: Detecting GPU type
2023/12/24 00:23:11 gpu.go:38: CUDA not detected: Unable to load libnvidia-ml.so library to query for Nvidia GPUs: /usr/lib/wsl/lib/libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/12/24 00:23:11 gpu.go:47: Radeon GPU detected
[GIN] 2023/12/24 - 00:23:13 | 200 |       31.91µs |       127.0.0.1 | HEAD     "/"
[GIN] 2023/12/24 - 00:23:13 | 200 |     381.366µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2023/12/24 - 00:23:13 | 200 |     279.804µs |       127.0.0.1 | POST     "/api/show"
Lazy loading /tmp/ollama698218904/librocm_server.so library
2023/12/24 00:23:13 shim_ext_server.go:94: Loading Dynamic Shim llm server: /tmp/ollama698218904/librocm_server.so
2023/12/24 00:23:13 gpu.go:131: 11562 MB VRAM available, loading up to 75 ROCM GPU layers out of 32
2023/12/24 00:23:13 ext_server.go:189: Initializing internal llama server
disabling verbose llm logging

# At the time of running the second command below
rocBLAS error: Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1031
 List of available TensileLibrary Files : 
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx940.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx941.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx803.dat"
[1]    289847 IOT instruction (core dumped)  ROCM_PATH=/opt/rocm ollama serve
❯ ROCM_PATH=/opt/rocm ollama run codellama
Error: Post "http://127.0.0.1:11434/api/generate": EOF

@Chief-Detektor
Copy link

I do not have TensileLibrary.dat in /opt/rocm/lib/rocblas either.
It's not needed.

My output of ollama serve:
out.txt

There is no mention of TensileLibrary.dat there.

But again, I use this branch here rebased onto main (23dc179). I don't know what has happened on main since then, as I haven't checked it out yet.

@ThatOneCalculator
Copy link

ThatOneCalculator commented Dec 24, 2023

After some trial and error (and much appreciated help from the Discord), I got it working on my 6700XT! https://discord.com/channels/1128867683291627614/1188401254284669008/1188411154725351485

TL;DR

  • yay -S git python3 python-virtualenv wget make python-pip rocm-hip-sdk rocm-opencl-sdk gperftools
  • Add your user to the render and video groups
  • Reboot!!
  • Set the environment variables HSA_OVERRIDE_GFX_VERSION=10.3.0, HCC_AMDGPU_TARGET=gfx1030, and ROCM_PATH=/opt/rocm (see the sketch after this list)
  • ollama serve and enjoy!
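
Putting the environment-variable step together, a rough sketch of a shell session (the three values come straight from the list above; the model name is just an example). HSA_OVERRIDE_GFX_VERSION=10.3.0 makes ROCm treat the gfx1031 RX 6700 XT as gfx1030, which is one of the targets rocBLAS actually ships a TensileLibrary for:

# sketch only: values taken from the TL;DR above
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export HCC_AMDGPU_TARGET=gfx1030
export ROCM_PATH=/opt/rocm
ollama serve
# in a second terminal:
ollama run codellama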

@jmorganca
Copy link
Member

@65a excited to see this get in through #1146. Thanks so much for all of the hard work on ROCm (and the many rebases along the way)

@misaligar
Copy link

Can someone please tell me how to enable AMD support? I've installed ollama and it still shows "WARNING: No NVIDIA GPU detected. Ollama will run in CPU-only mode". The ollama version is 0.1.19.

@dhiltgen
Copy link
Collaborator

The pre-release for 0.1.21 is up now, and we've made various improvements to support ROCm cards, covering both v5 and v6 of the ROCm libraries. You'll have to install ROCm, but then the Ollama binary should work.

Please let us know if you run into any problems by filing new tickets.
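
If it still falls back to CPU, a quick sanity check is to confirm the ROCm runtime itself can see the card before starting Ollama. This assumes the standard ROCm userspace packages, which ship rocminfo and rocm-smi:

rocminfo | grep -i gfx   # should list your card's gfx target, e.g. gfx1100
rocm-smi                 # should report temperature, VRAM and GPU utilization
ollama serve             # then watch the startup log for a line like "Radeon GPU detected"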

@misaligar
Copy link

Thanks @dhiltgen. Looks like the problem is that ROCm does not support Debian 12.

@misaligar
Copy link

I have switched to Arch hoping that I could use ollama with AMD support. I posted my issue here:

#2285

Can anyone help me sort this out?

