
Draft Model Issue #9675

@roobyz

Description

LocalAI version:

  • LocalAI v4.1.3 (fdc9f7b) running via Docker Compose

Environment, CPU architecture, OS, and Version:

  1. AMD Ryzen 9 5950X 16-Core Processor with 128GB Ram
  2. NVIDIA Blackwell GPU running CUDA 12.8
  3. Containers running on TrueNAS SCALE (25.10.3 - Goldeye)
  4. Backend: llama-cpp-development
  5. uname -a: Linux truenas 6.12.33-production+truenas #1 SMP PREEMPT_DYNAMIC Mon Apr 13 19:09:57 UTC 2026 x86_64 GNU/Linux

Describe the bug

  • File Not Found error

To Reproduce

  • Start any chat, for example: Hello

Expected behavior

  • A response from LocalAI, for example: Hello! How can I assist you today?

Logs

11:36:04.781 AM  stderr  gguf_init_from_file: failed to open GGUF file 'llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf' (No such file or directory)
11:36:04.815 AM  stderr  llama_model_load: error loading model: llama_model_loader: failed to load model from llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
11:36:04.815 AM  stderr  llama_model_load_from_file_impl: failed to load model
11:36:04.816 AM  stderr  srv    load_model: failed to load draft model, 'llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf'
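Note that the path in the log is relative. If the backend resolves `draft_model` against a different working directory than `parameters.model`, the same string can load fine in one case and fail with "No such file or directory" in the other. A minimal sketch of that behavior (the directory layout here is hypothetical, not the actual container layout):

```python
import os
import tempfile

# Hypothetical models root; the relative path below is illustrative only.
root = tempfile.mkdtemp()
model_rel = "llama-cpp/models/draft/model.gguf"
abs_path = os.path.join(root, model_rel)
os.makedirs(os.path.dirname(abs_path))
open(abs_path, "wb").close()

# Resolved against the models root: the file is found.
print(os.path.exists(os.path.join(root, model_rel)))  # True

# Resolved against the process's current working directory: not found,
# unless the CWD happens to be the models root.
print(os.path.exists(model_rel))
```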

Additional context

  1. This doesn't make sense: I copy-pasted the draft model path from the original model's config and double-checked that the file exists.
  2. Both models run as expected when loaded individually.
  3. The "file not found" error occurs even though the path is identical whether the model is loaded on its own or as a draft.
  4. Draft model config:
```yaml
name: draft-model
backend: llama-cpp-development
context_size: 24576
description: Nemotron 70B Instruct with Speculative Decoding using Llama 3.1 8B draft
function:
    automatic_tool_parsing_fallback: true
    grammar:
        disable: true
known_usecases:
    - chat
parameters:
    model: llama-cpp/models/Llama-3.1-70B-Instruct-Nemotron-HF-Q5/Llama-3.1-Nemotron-70B-Instruct-HF-Q5_K_S.gguf
    repeat_penalty: 1
    temperature: 0.85
    min_p: 0.1
    top_k: -1
    top_p: 0.9
template:
    use_tokenizer_template: true
draft_model: llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
n_draft: 6
options:
    - spec_type:draft
    - spec_temp:1.15
    - spec_p_min:0.2
    - draft_ctx_size:24576
    - draft_gpu_layers:999
    - flash_attention:true
```
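One workaround that may be worth testing (an assumption, not a confirmed fix) is giving `draft_model` an absolute path inside the container, so its resolution no longer depends on the backend's working directory:

```yaml
# Hypothetical: "/build/models" is a guess at the container's models mount
# point and must be adjusted to the actual volume mapping in the Compose file.
draft_model: /build/models/llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
```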
