Small VRAM System: Trimming VRAM allocation LocalAI + llama.cpp #9936

MattMalone · 2026-05-21T20:56:31Z


Ok, so I have been running models on my 6 GB 1060 video card for a while using Hugging Face, Python, PyTorch, and CUDA. I decided to move to LocalAI in hopes of getting an OpenAI-style web interface that I could call for further development. I have been stymied for days. Regardless of which recommended settings I try, for the particular very small model I am attempting to load, the process always reports (with DEBUG=true):

May 21 16:52:46 DEBUG GRPC stderr id="Qwen3.5-4B-IQ4_NL.gguf-127.0.0.1:39133" line="0.02.735.271 W common_fit_params: failed to fit params to free device memory: **n_gpu_layers already set by user to 99999999**, abort" caller={caller.file="/home/runner/work/LocalAI/LocalAI/pkg/model/process.go"  caller.L=187 } 
...
May 21 16:37:42 DEBUG GRPC stderr id="Qwen3.5-4B-IQ4_NL.gguf-127.0.0.1:43283" line="0.03.781.690 E ggml_backend_cuda_buffer_type_alloc_buffer: **allocating 8192.00 MiB** on device 0: cudaMalloc failed: out of memory" caller={caller.file="/home/runner/work/LocalAI/LocalAI/pkg/model/process.go"  caller.L=187 }
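
To make the mismatch concrete, the requested size can be pulled out of the second log line and compared with the card's capacity. This is a minimal Python sketch, not anything LocalAI ships: the regular expression and the helper name `requested_mib` are illustrative and only match the message format quoted above, and the 6144 MiB figure assumes the card's nominal 6 GB rather than the free memory the driver actually reports.

```python
import re

# Pattern for the ggml CUDA allocation message quoted above (illustrative only).
# The optional asterisks allow for the bold markers added in the post.
ALLOC_RE = re.compile(
    r"allocating\s+([0-9]+(?:\.[0-9]+)?)\s+MiB\*{0,2}\s+on device\s+(\d+)"
)

def requested_mib(log_line):
    """Return (size_in_mib, device_index), or None if no allocation message is found."""
    m = ALLOC_RE.search(log_line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2))

line = ("E ggml_backend_cuda_buffer_type_alloc_buffer: **allocating 8192.00 MiB** "
        "on device 0: cudaMalloc failed: out of memory")

CARD_MIB = 6 * 1024  # nominal capacity of a 6 GB card; free memory will be lower

size, device = requested_mib(line)
print(f"device {device}: requested {size:.0f} MiB, card has {CARD_MIB} MiB")
print("fits" if size <= CARD_MIB else "cannot fit")
```

A single 8192 MiB buffer is larger than the whole card, so the request fails regardless of what else is resident in VRAM.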

The model file is only 2.58 GB; it is tiny. This is my model YAML:

backend: llama-cpp
args:
  - "-c 1024"
  - "--ctx-size 1024"
  - "--no-mmap"          # Alternative: Turn off if your system RAM is low
description: Imported from huggingface.co:///unsloth/Qwen3.5-4B-IQ4_NL.gguf
function:
    automatic_tool_parsing_fallback: true
    grammar:
        disable: true
known_usecases:
    - chat
min_p: 0
name: Qwen3.5-4B-IQ4_NL.gguf
options:
    - use_jinja:true
parameters:
    min_p: 0
    model: Qwen3.5-4B-IQ4_NL.gguf
    context_size: 1024
    type: q4_0
    no_kv_offload: true   # Forces the KV scratchpad buffer into system RAM instead of CUDA
    mmap: false           # Stops aggressive pre-allocation pinning
    gpu_layers: 0    
    presence_penalty: 1.5
    repeat_penalty: 1
    temperature: 0.7
    top_k: 20
    top_p: 0.8
presence_penalty: 1.5
repeat_penalty: 1
temperature: 0.7
template:
    use_tokenizer_template: true
top_k: 20
top_p: 0.8
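
The configuration above repeats several settings: the context size appears as `-c`, `--ctx-size`, and `context_size`, and the sampling values appear both at the top level and under `parameters`. A small sketch like the following lists such repeats so they can be pruned while debugging. The dict is a hand-copied subset of the YAML above, the helper names are hypothetical, and the script makes no claim about which location LocalAI actually reads.

```python
# Hand-copied subset of the YAML above, as a plain dict (no YAML parser needed).
config = {
    "args": ["-c 1024", "--ctx-size 1024", "--no-mmap"],
    "min_p": 0,
    "presence_penalty": 1.5,
    "repeat_penalty": 1,
    "temperature": 0.7,
    "top_k": 20,
    "top_p": 0.8,
    "parameters": {
        "min_p": 0,
        "context_size": 1024,
        "no_kv_offload": True,
        "mmap": False,
        "gpu_layers": 0,
        "presence_penalty": 1.5,
        "repeat_penalty": 1,
        "temperature": 0.7,
        "top_k": 20,
        "top_p": 0.8,
    },
}

def repeated_keys(cfg, section="parameters"):
    """Return keys that appear both at the top level and inside `section`, sorted."""
    nested = cfg.get(section, {})
    return sorted(k for k in nested if k in cfg)

def flag_names(args):
    """Return the flag name of each args entry, e.g. '-c 1024' -> '-c'."""
    return [a.split()[0] for a in args]

print("set twice:", repeated_keys(config))
print("CLI-style flags:", flag_names(config["args"]))
```

On llama.cpp's own command line, `-c` and `--ctx-size` are two spellings of the same option, so the first two `args` entries duplicate each other.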

Nowhere in my YAML file have I ever set n_gpu_layers to be 99999999.

I have gradually added context_size, gpu_layers, --no-mmap, and mmap=false: everything Google suggests for getting rid of the memory allocation error. I even have gpu_layers=0 now, so it should run entirely in CPU memory, without any CUDA allocation at all.

From the beginning, and unchanged across every attempt, the backend tries to allocate 8 GiB (8192 MiB) on a 6 GB card, as if that were a fixed default minimum. The allocation fails and the LLM does not load. To be clear, the LLM does not require 8 GB to run. It should run passably in 4 GB, and should at least load in 3 GB. But LocalAI + llama.cpp refuse to allocate less than 8 GB.
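
The claim that the model should fit in far less than 8 GB can be checked with back-of-the-envelope arithmetic: weights plus KV cache. The sketch below uses the usual formula for a plain attention KV cache with 16-bit entries. The layer count, KV head count, and head dimension are placeholder values, not taken from the Qwen3.5-4B model card, and compute buffers and CUDA context overhead are ignored, so treat the result as a rough lower bound under stated assumptions.

```python
def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """KV cache size in MiB: keys and values for every layer and every position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / (1024 ** 2)

# Placeholder architecture values, NOT the real Qwen3.5-4B numbers.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

WEIGHTS_MIB = 2.58 * 1024  # the 2.58G file size quoted above, read as GiB
kv = kv_cache_mib(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx=1024)

print(f"KV cache at 1024 tokens: {kv:.0f} MiB")
print(f"weights + KV cache:      {WEIGHTS_MIB + kv:.0f} MiB")
```

Under these assumptions the total stays well below the 8192 MiB named in the error message. The cache term grows linearly with `n_ctx`, so the context size the backend actually uses matters far more than the weights do.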

I have exhausted the YAML settings that Google knows about for LocalAI. Is there anything else that would make LocalAI attempt to allocate less than 8 GB?

I struggle to understand how no one else has encountered this problem. There must be a few people working with cards smaller than 10-12 GB (where 8 GB could be allocated alongside the OS's own VRAM usage), but nothing comes up when I search this discussion group.

Replies: 0 comments

Select a reply

Uh oh!

Uh oh!

Small VRAM System: Trimming VRAM allocation LocalAI + llama.cpp #9936

Uh oh!

MattMalone May 21, 2026

Replies: 0 comments

MattMalone
May 21, 2026