GPU support crashing on Linux in 0.2 releases #39

Closed
GarciaLnk opened this issue Dec 2, 2023 · 11 comments

@GarciaLnk

When I run llamafile on my system, the model loads fine into my GPU VRAM; however, whenever I try to send a prompt, llamafile crashes with the following error:

slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 53 tokens
slot 0 : kv cache rm - [0, end)

CUDA error 304 at /home/garcia/.llamafile/ggml-cuda.cu:6006: OS call failed or operation not supported on this OS
current device: 0

This error happens even when -ngl is not set.

Here is some info about my system:

$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)
$ uname -r
6.6.2-101.fc38.x86_64
$ cat /sys/module/nvidia/version
545.29.06
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0
@jart
Collaborator

jart commented Dec 2, 2023

There's no CUDA_CHECK() anywhere near line 6006 in ggml-cuda.cu. Could you upload /home/garcia/.llamafile/ggml-cuda.cu to this issue tracker so I can see what's failing?

@GarciaLnk
Author

Sure, here it is attached; however, after a quick diff it seems to be identical to this repo's llama.cpp/ggml-cuda.cu file (and in fact, there is a CUDA_CHECK() on line 6006).

ggml-cuda.cu

@jart
Collaborator

jart commented Dec 2, 2023

You are correct about line 6006. Apologies for any confusion. So here's what's failing:

    CUDA_CHECK(cudaMalloc((void **) &ptr, look_ahead_size));

Looks like you're running out of GPU memory. But you said you're not passing the -ngl flag, which defaults to zero. I want to understand why it's possible to run out of GPU memory when the GPU isn't being used. Help wanted.
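
For reference, CUDA error 304 is cudaErrorOperatingSystem, whose description is exactly the string in the report ("OS call failed or operation not supported on this OS"). Below is a minimal standalone sketch, not the exact macro from ggml-cuda.cu, of how a CUDA_CHECK-style wrapper around cudaMalloc() produces a message in that format:

    // Sketch only: illustrates how a CUDA_CHECK-style macro reports a failed
    // cudaMalloc() as "CUDA error <code> at <file>:<line>: <description>".
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                                      \
        do {                                                                      \
            cudaError_t err_ = (call);                                            \
            if (err_ != cudaSuccess) {                                            \
                int dev_ = -1;                                                    \
                cudaGetDevice(&dev_);                                             \
                fprintf(stderr, "CUDA error %d at %s:%d: %s\n",                   \
                        (int) err_, __FILE__, __LINE__, cudaGetErrorString(err_));\
                fprintf(stderr, "current device: %d\n", dev_);                    \
                exit(1);                                                          \
            }                                                                     \
        } while (0)

    int main() {
        void * ptr = nullptr;
        // If the driver's OS-level call is blocked, this returns error 304
        // (cudaErrorOperatingSystem) rather than cudaErrorMemoryAllocation.
        CUDA_CHECK(cudaMalloc(&ptr, 64 * 1024 * 1024));
        CUDA_CHECK(cudaFree(ptr));
        return 0;
    }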

@jart
Collaborator

jart commented Dec 2, 2023

Also, do you know if this happens if you use llama.cpp upstream?

@r3drock

r3drock commented Dec 2, 2023

I am getting the same error on my system, which is fairly similar; I also have a mobile NVIDIA GPU.
Error message from the Mistral server llamafile:

llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   70.42 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MB
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 4232.06 MB (model: 4095.05 MB, context: 137.00 MB)
Available slots:
 -> Slot 0 - max context: 512

llama server listening at http://127.0.0.1:8080

loading weights...
{"timestamp":1701544135,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37040,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37044,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701544135,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":37056,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 5 tokens
slot 0 : kv cache rm - [0, end)

CUDA error 304 at /home/r3d/.llamafile/ggml-cuda.cu:6006: OS call failed or operation not supported on this OS
current device: 0

some sysinfo:

$ lspci | grep -i nvidia
0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev a1)
0000:01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
$ uname -r
6.6.3-arch1-1
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
$ cat /sys/module/nvidia/version
545.29.06

For me, it does not happen when I use Mistral with llama.cpp.

@GarciaLnk
Author

GarciaLnk commented Dec 2, 2023

> Also, do you know if this happens if you use llama.cpp upstream?

It does not happen with llama.cpp, both -ngl 35 and -ngl 0 work fine there using the same model.

@GarciaLnk
Author

I just tried the earlier releases: everything works fine in v0.1, but it breaks in v0.2, so a breaking change must have been introduced there.

@jart
Collaborator

jart commented Dec 2, 2023

I'm reasonably certain if you pass the --unsecure flag, things will work. Could you confirm this?
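
To test it, append the flag to however you normally launch the llamafile; for example (the filename here is illustrative, not the exact file from this thread):

    ./mistral-7b-instruct-server.llamafile --unsecure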

@jart jart changed the title CUDA error 304 on Fedora GPU support crashing on Linux in 0.2 releases Dec 2, 2023
@jart jart added bug and removed help wanted labels Dec 2, 2023
@jart jart closed this as completed in 61944b5 Dec 2, 2023
@r3drock

r3drock commented Dec 2, 2023

works for me

@jart
Collaborator

jart commented Dec 2, 2023

Great! I'll update all the llamafiles on Hugging Face so their .args files pass the --unsecure flag. That will roll back the new security until the next release can do better.
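
For anyone who wants to patch a local llamafile in the meantime: the .args file inside the llamafile's zip archive lists the default command-line arguments, one per line. A sketch of what an updated file might look like (the model filename and --host lines are illustrative, and the trailing ... line, if I recall the format correctly, is where extra user-supplied arguments get inserted):

    -m
    mistral-7b-instruct-v0.1-Q4_K_M.gguf
    --host
    0.0.0.0
    --unsecure
    ...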

@GarciaLnk
Author

Yup, that works, thank you!
