LocalAI version:
- LocalAI v4.1.3 (fdc9f7b) running via Docker Compose
Environment, CPU architecture, OS, and Version:
- AMD Ryzen 9 5950X 16-core CPU with 128 GB RAM
- NVIDIA Blackwell GPU running CUDA 12.8
- Containers running on TrueNAS SCALE (25.10.3 - Goldeye)
- Backend: llama-cpp-development
- uname -a: Linux truenas 6.12.33-production+truenas #1 SMP PREEMPT_DYNAMIC Mon Apr 13 19:09:57 UTC 2026 x86_64 GNU/Linux
Describe the bug
- When a model config includes a draft_model for speculative decoding, the llama-cpp backend fails to load the draft model with "No such file or directory", even though the same GGUF file loads fine under the identical path when used as a standalone model.
To Reproduce
- Start any chat, for example:
Hello
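- Equivalently, via the OpenAI-compatible chat endpoint (assuming LocalAI's default port 8080; "draft-model" is the model name from the config below):
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "draft-model", "messages": [{"role": "user", "content": "Hello"}]}'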
Expected behavior
- A response from LocalAI, for example:
Hello! How can I assist you today?
Logs
11:36:04 AM.781 stderr gguf_init_from_file: failed to open GGUF file 'llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf' (No such file or directory)
11:36:04 AM.815 stderr llama_model_load: error loading model: llama_model_loader: failed to load model from llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
11:36:04 AM.815 stderr llama_model_load_from_file_impl: failed to load model
11:36:04 AM.816 stderr srv load_model: failed to load draft model, 'llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf'
Additional context
- This doesn't make sense: I copy-pasted the draft_model path from the standalone model's config and double-checked that the file is there.
- Both models run as expected when loaded individually.
- The "file not found" error occurs even though the path is identical whether the model is loaded on its own or as a draft (see the check after the config below).
- Draft model config:
  name: draft-model
  backend: llama-cpp-development
  context_size: 24576
  description: Nemotron 70B Instruct with Speculative Decoding using Llama 3.1 8B draft
  function:
    automatic_tool_parsing_fallback: true
    grammar:
      disable: true
  known_usecases:
    - chat
  parameters:
    model: llama-cpp/models/Llama-3.1-70B-Instruct-Nemotron-HF-Q5/Llama-3.1-Nemotron-70B-Instruct-HF-Q5_K_S.gguf
    repeat_penalty: 1
    temperature: 0.85
    min_p: 0.1
    top_k: -1
    top_p: 0.9
  template:
    use_tokenizer_template: true
  draft_model: llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
  n_draft: 6
  options:
    - spec_type:draft
    - spec_temp:1.15
    - spec_p_min:0.2
    - draft_ctx_size:24576
    - draft_gpu_layers:999
    - flash_attention:true
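- For reference, this is how I confirm the draft GGUF is visible inside the backend container (the service name localai is an assumption from my compose setup; adjust to yours). Searching by file name avoids guessing the mount point, since the path in the error is relative:
  # Search the running container for the draft model file by name
  docker compose exec localai find / -name 'Meta-Llama-3.1-8B-Instruct-Q6_K.gguf' 2>/dev/null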