
Draft Model Issue #9675

@roobyz

Description

LocalAI version:

  • LocalAI v4.1.3 (fdc9f7b) running via Docker Compose

Environment, CPU architecture, OS, and Version:

  1. AMD Ryzen 9 5950X 16-Core Processor with 128GB Ram
  2. NVIDIA Blackwell GPU running CUDA 12.8
  3. Containers running on TrueNAS SCALE (25.10.3 - Goldeye)
  4. Backend: llama-cpp-development
  5. uname -a: Linux truenas 6.12.33-production+truenas #1 SMP PREEMPT_DYNAMIC Mon Apr 13 19:09:57 UTC 2026 x86_64 GNU/Linux

Describe the bug

  • File Not Found error

To Reproduce

  • Start any chat, for example: Hello

Expected behavior

  • A response from LocalAI, for example: Hello! How can I assist you today?

Logs

11:36:04.781 AM  stderr  gguf_init_from_file: failed to open GGUF file 'llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf' (No such file or directory)
11:36:04.815 AM  stderr  llama_model_load: error loading model: llama_model_loader: failed to load model from llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
11:36:04.815 AM  stderr  llama_model_load_from_file_impl: failed to load model
11:36:04.816 AM  stderr  srv    load_model: failed to load draft model, 'llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf'
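Note that the path in the log is relative. If the backend resolves `draft_model` against a different working directory than `parameters.model`, the same string can load fine in one case and fail with "No such file or directory" in the other. A minimal sketch of that behavior (the directory layout here is hypothetical, not the actual container layout):

```python
import os
import tempfile

# Hypothetical models root; the relative path below is illustrative only.
root = tempfile.mkdtemp()
model_rel = "llama-cpp/models/draft/model.gguf"
abs_path = os.path.join(root, model_rel)
os.makedirs(os.path.dirname(abs_path))
open(abs_path, "wb").close()

# Resolved against the models root: the file is found.
print(os.path.exists(os.path.join(root, model_rel)))  # True

# Resolved against the process's current working directory: not found,
# unless the CWD happens to be the models root.
print(os.path.exists(model_rel))
```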

Additional context

  1. This doesn't make sense: I copy-pasted the draft model path from the original model's config and double-checked that the file exists.
  2. Both models run as expected when loaded individually.
  3. The "file not found" error occurs even though the path is identical whether the model is loaded on its own or as a draft.
  4. Draft model config:
```yaml
name: draft-model
backend: llama-cpp-development
context_size: 24576
description: Nemotron 70B Instruct with Speculative Decoding using Llama 3.1 8B draft
function:
    automatic_tool_parsing_fallback: true
    grammar:
        disable: true
known_usecases:
    - chat
parameters:
    model: llama-cpp/models/Llama-3.1-70B-Instruct-Nemotron-HF-Q5/Llama-3.1-Nemotron-70B-Instruct-HF-Q5_K_S.gguf
    repeat_penalty: 1
    temperature: 0.85
    min_p: 0.1
    top_k: -1
    top_p: 0.9
template:
    use_tokenizer_template: true
draft_model: llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
n_draft: 6
options:
    - spec_type:draft
    - spec_temp:1.15
    - spec_p_min:0.2
    - draft_ctx_size:24576
    - draft_gpu_layers:999
    - flash_attention:true
```
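One workaround that may be worth testing (an assumption, not a confirmed fix) is giving `draft_model` an absolute path inside the container, so its resolution no longer depends on the backend's working directory:

```yaml
# Hypothetical: "/build/models" is a guess at the container's models mount
# point and must be adjusted to the actual volume mapping in the Compose file.
draft_model: /build/models/llama-cpp/models/Llama-3.1-8B-Instruct-Q6/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
```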
