GPU support crashing on Linux in 0.2 releases #39

Closed
GarciaLnk opened this issue Dec 2, 2023 · 11 comments

@GarciaLnk

When I run llamafile on my system, the model loads fine into my GPU VRAM; however, whenever I try to send a prompt, llamafile crashes with the following error:

slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 53 tokens
slot 0 : kv cache rm - [0, end)

CUDA error 304 at /home/garcia/.llamafile/ggml-cuda.cu:6006: OS call failed or operation not supported on this OS
current device: 0

This error happens even when -ngl is not set.

Here is some info about my system:

$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)
$ uname -r
6.6.2-101.fc38.x86_64
$ cat /sys/module/nvidia/version
545.29.06
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0
@jart
Collaborator

jart commented Dec 2, 2023

There's no CUDA_CHECK() anywhere near line 6006 in ggml-cuda.cu. Could you upload /home/garcia/.llamafile/ggml-cuda.cu to this issue tracker so I can see what's failing?

@GarciaLnk
Author

Sure, here it is attached; however, after a quick diff it seems to be identical to this repo's llama.cpp/ggml-cuda.cu file (and in fact, there is a CUDA_CHECK() on line 6006).

ggml-cuda.cu

@jart
Collaborator

jart commented Dec 2, 2023

You are correct about line 6006. Apologies for any confusion. So here's what's failing:

    CUDA_CHECK(cudaMalloc((void **) &ptr, look_ahead_size));

Looks like you're running out of GPU memory. But you said you're not passing the -ngl flag, which defaults to zero. I want to understand why it's possible to run out of GPU memory when the GPU isn't being used. Help wanted.
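
For reference, CUDA error 304 is cudaErrorOperatingSystem, whose description is exactly the string in the report ("OS call failed or operation not supported on this OS"). Below is a minimal standalone sketch, not the exact macro from ggml-cuda.cu, of how a CUDA_CHECK-style wrapper around cudaMalloc() produces a message in that format:

    // Sketch only: illustrates how a CUDA_CHECK-style macro reports a failed
    // cudaMalloc() as "CUDA error <code> at <file>:<line>: <description>".
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                                      \
        do {                                                                      \
            cudaError_t err_ = (call);                                            \
            if (err_ != cudaSuccess) {                                            \
                int dev_ = -1;                                                    \
                cudaGetDevice(&dev_);                                             \
                fprintf(stderr, "CUDA error %d at %s:%d: %s\n",                   \
                        (int) err_, __FILE__, __LINE__, cudaGetErrorString(err_));\
                fprintf(stderr, "current device: %d\n", dev_);                    \
                exit(1);                                                          \
            }                                                                     \
        } while (0)

    int main() {
        void * ptr = nullptr;
        // If the driver's OS-level call is blocked, this returns error 304
        // (cudaErrorOperatingSystem) rather than cudaErrorMemoryAllocation.
        CUDA_CHECK(cudaMalloc(&ptr, 64 * 1024 * 1024));
        CUDA_CHECK(cudaFree(ptr));
        return 0;
    }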

@jart
Collaborator

jart commented Dec 2, 2023

Also, do you know if this happens if you use llama.cpp upstream?

@r3drock

r3drock commented Dec 2, 2023

I am getting the same error on my system, which is fairly similar; I also have a mobile NVIDIA GPU.
Error message from the Mistral server llamafile:

llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   70.42 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MB
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 4232.06 MB (model: 4095.05 MB, context: 137.00 MB)
Available slots:
 -> Slot 0 - max context: 512

llama server listening at http://127.0.0.1:8080

loading weights...
{"timestamp":1701544135,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37044,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37056,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 5 tokens
slot 0 : kv cache rm - [0, end)

CUDA error 304 at /home/r3d/.llamafile/ggml-cuda.cu:6006: OS call failed or operation not supported on this OS
current device: 0

some sysinfo:

$ lspci | grep -i nvidia
0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev a1)
0000:01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
$ uname -r
6.6.3-arch1-1
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
$ cat /sys/module/nvidia/version
545.29.06

For me, it does not happen when I use Mistral with llama.cpp.

@GarciaLnk
Author

GarciaLnk commented Dec 2, 2023

> Also, do you know if this happens if you use llama.cpp upstream?

It does not happen with llama.cpp, both -ngl 35 and -ngl 0 work fine there using the same model.

@GarciaLnk
Author

I just tried the earlier releases: everything works fine in v0.1, but it breaks in v0.2, so a breaking change must have been introduced there.

@jart
Collaborator

jart commented Dec 2, 2023

I'm reasonably certain if you pass the --unsecure flag, things will work. Could you confirm this?
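
To test it, append the flag to however you normally launch the llamafile; for example (the filename here is illustrative, not the exact file from this thread):

    ./mistral-7b-instruct-server.llamafile --unsecure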

@jart jart changed the title CUDA error 304 on Fedora GPU support crashing on Linux in 0.2 releases Dec 2, 2023
@jart jart added bug and removed help wanted labels Dec 2, 2023
@jart jart closed this as completed in 61944b5 Dec 2, 2023
@r3drock

r3drock commented Dec 2, 2023

works for me

@jart
Collaborator

jart commented Dec 2, 2023

Great! I'll update all the llamafiles on Hugging Face so their .args files pass the --unsecure flag. That will roll back the new security until the next release can do better.
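
For anyone who wants to patch a local llamafile in the meantime: the .args file inside the llamafile's zip archive lists the default command-line arguments, one per line. A sketch of what an updated file might look like (the model filename and --host lines are illustrative, and the trailing ... line, if I recall the format correctly, is where extra user-supplied arguments get inserted):

    -m
    mistral-7b-instruct-v0.1-Q4_K_M.gguf
    --host
    0.0.0.0
    --unsecure
    ...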

@GarciaLnk
Author

Yup, that works, thank you!
