fix(llama-cpp): populate tensor_buft_override buffer so llama-cpp properly performs fit calculations #8560
Conversation
As an extra note, I noticed that this auto-fit logic from llama-cpp is bypassed by default because we pass a large value for gpu_layers (not sure if extra logic is involved, but that's what I saw for my model config). I was able to enable auto-fit by changing that setting. Curious if I should update the markdowns to advertise enabling auto-fit that way.
I'm still weighing the best approach here due to a subtle trade-off: LocalAI already unloads models based on a VRAM usage threshold, and enabling auto-fitting by default in llama.cpp would conflict with that mechanism. Fixing this properly requires either custom logic or a runtime toggle. In the short term, documenting the limitation is probably the best path forward.
Opened #8562 for discussion.
Description
While testing a large model (Qwen3-Coder-Next) on my PC, I noticed that llama-cpp's CLI would load it and offload what it needed to system RAM, while LocalAI errored out saying I didn't have enough VRAM.
Looking at the logs, I saw that llama-cpp was trying to perform fit calculations to load some tensors into VRAM and the rest into system RAM, but was erroring out:

Digging through llama-cpp's codebase, it appears that the tensor_buft_overrides parameter is expected to be populated with a buffer even when it isn't used.
This PR adds logic that mimics what llama-cpp does during arg parsing, filling the parameter with an arbitrary buffer so that this bailout is avoided later; a rough sketch of the idea is below.
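For context, a minimal sketch of the kind of change described above, assuming the backend builds llama_model_params directly. The helper name make_model_params and the choice of a single null-terminator entry are illustrative only, not the actual diff; the real change reportedly fills the array with an arbitrary buffer, but the general shape is the same: hand llama.cpp a non-null, terminated overrides array instead of leaving the pointer unset.

```cpp
// Sketch (not the actual PR diff): keep the overrides storage alive for the
// lifetime of the model and hand llama.cpp a terminated tensor_buft_overrides
// array, mirroring what its CLI arg parsing does, so the fit calculation has
// a valid buffer to work with.
#include "llama.h"

#include <vector>

// Storage must outlive the model, since llama_model_params only holds a pointer.
static std::vector<llama_model_tensor_buft_override> g_tensor_overrides;

static llama_model_params make_model_params() {  // hypothetical helper name
    llama_model_params mparams = llama_model_default_params();

    // Even with no user-specified overrides, provide a terminated array
    // instead of leaving the pointer unset; the terminator's null pattern
    // means "no override" for any tensor.
    g_tensor_overrides.clear();
    g_tensor_overrides.push_back({ /*pattern*/ nullptr, /*buft*/ nullptr });

    mparams.tensor_buft_overrides = g_tensor_overrides.data();
    return mparams;
}
```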
Notes for Reviewers
Tested by building locally for cuda13 and validating that the 54GB model could load across a combination of VRAM and system RAM, despite only 24GB of VRAM being available:

Signed commits